Pages

Saturday, November 29, 2014

Machine Learning vs Statistics (Statistical Learning)

In a sense, the difference between Machine Learning and Statistics (Statistical Learning) is what is the difference between a computer scientist and a statistician. In my humble opinion, they are becoming more similar as time goes by. On the other hand, Machine Learning is probably more empirical and works on problems by utilizing various optimization algorithms while Statistics (Statistical Learning) is more emphasizing on the assumptions and model validation -- more rigorous in mathematics and more often to talk about VC dimensions, KL divergence, conjugate prior, and etc. In practice, ML uses Matlab or Python while SL favors R. 

Rob Tibshirani, the author of "The Elements of Statistical Learning" gave a 'comparison of machine learning and statistics, here is a screenshot for your convenience:
In useR! 2004, Brian D. Ripley, another statistician, said: 'machine learning is statistics minus any checking of models and assumptions'.
A lecturer at Udacity gives a good comparison in one sentence, "Statistics focuses on analyzing the existing data and drawing valid conclusions, while Machine Learning focuses on making prediction and less worrying about the assumption as long as it makes good predictions".
Since 2001, the comparison become less  member the things have changed when William Cleveland introduced "data Science" as an independent discipline. As  said, "Machine learning and statistics may be the stars, but data science orchestrates the whole show."

In David Smith's blog, you can see CMU machine learning students "protest" at the G20 summit in Pittsburg, September 25 2009.

Monday, November 17, 2014

Network Analysis with NetworkX

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

NetworkX has been integrated with PyGraphviz, which is a Python interface to Graphviz -- the excellent open-source drawing tools initiated by AT&T Labs Research).

Here is the notes to build and integrate the Graphviz with NetworkX.

1, Install NetworkX

    I have WinPython 2.7, so installing NetworkX with WinPython's control panel is simple.

2, Install Graphviz

    Install Graphviz in "C:\graphviz"
    Add "C:\graphviz\bin" in PATH environment variable.
    Copy "C:\graphviz\include\graphviz" to "\python-2.7.x\include\"
    Copy cdt.lib and cgraph.lib in "C:\graphviz\lib\release\lib" to "\python-2.7.x\libs\"

3. Configure Visual C++ Compiler

    set VS90COMNTOOLS=C:\Program Files (x86)\Microsoft Visual Studio 11.0\Common7\Tools\

4. Build and Install PyGraphviz

    Open command prompt in WinPython and run
    pip install pygraphviz-1.3rc2.zip

Now import pygraphviz to test and start to use it with NetworkX.

Sunday, October 12, 2014

Java Style Guide

Checkstyle is a tool to facilitate the use of coding standard. It allows us to define a coding style, and it can then check if our code follows the style automatically.

Not surprisingly, there exists an Eclipse plugin for Checkstyle. The URL of Checkstyle update site is:

     http://eclipse-cs.sf.net/update/

This is a sample checkstyle.xml of Google's Java Style Guidegoogle_checks.xml

Note that google_checks.xml is created for CheckStyle(CS) 5.8, not for CS 5.7. Since the Eclipse Plugin for CheckStyle is integrated with CS 5.7 (as of 10/12/2014), we may need to wait until the plugin is migrated to CS 5.8.

Thursday, October 2, 2014

HTTP Proxy settings for R

1) As a non-root user, I want to install my own version of R and its packages. One major problem is how to make it work with a proxy server. Here is what I did on Linux:

- Build and Install R (v3.1.1) on /my_dir/
- $export PATH=/my_dir/bin:$PATH
- $export http_proxy=http://usr:pwd@my.proxy.org:8080
- $R
- >options(download.file.method="wget")
- >install.packages("PerformanceAnalytics")
- >install.packages("QRM")
- >install.packages("RQuantLib")
- >install.packages("quantmod")
- >require(quantmod)
- >getSymbols("SPY", src="google")
- >getQuote("GOOG")
- >getSymbols("CPIAUCNS", src="FRED") #load CPI from Fed Reserve
- >getSymbols("GS10", src="FRED")        #load 10y Treasury from Fed
- >getSymbols("SP500", src="FRED")       #load S&P500 from Fed

2) The folowing blog helps me to solve my problem ('407 Proxy Authentication Required') when I called getQuote("SPY") of quantmod.
For the details of downloading file in R, please refer to:

Thursday, September 25, 2014

Which sorting algorithm is used when the DMBS executes an 'ORDER BY" SQL statement?

The answer to my question may depend on which database we are using. My bet is that we can not find the answer from the official documentation of RDBMS, such as Oracle and SQL Server. This is somewhat consistent with SQL as a 4th generation language.

Another similar question is "how can I sort a 10G of data while I have only 4G of memory"?

The answer should be "correlated" to "External Sorting" and "Internal Sorting", and usually beyond the in-memory sorting taught in general algorithm textbook.

Here are the links to two articles:

How things works : SQL Order By Clause

"Implementing Sorting in Database Systems" by GOETZ GRAEFE from Microsoft provides more detailed discussion on the sorting algorithm in DBMS.


Sunday, September 21, 2014

Fun with LeetCode?

I got the link to Leetcode from my colleagues this Thursday and planned to give it a try this weekend. The listed problems are pretty fun to solve. Here are two examples(with the highest AC rate and the lowest AC rate) I tried today.

Max Points on a Line:

package my_java;

public class MaxNumPoint {
    public int maxPoints(Point[] points) {
    int maxnum = 0;
        int n = points.length;
        int n2 = n*(n-1)/2;
        double[] theta = new double[n2];
        double[] alpha = new double[n2];
        int k = 0;

        for (int i=0; i<n; i++) {
            for(int j=i+1; j<n; j++) {
                Point pt1 = points[i];
                Point pt2 = points[j];
                if (pt1.x != pt2.x) {
                    theta[k] = 1.0*(pt2.y - pt1.y)/(pt2.x - pt1.x);
                    alpha[k] = pt1.y - pt1.x*theta[k];
                } else {
                    theta[k] = Double.MAX_VALUE;
                    alpha[k] = pt1.x;
                }
                k++;
            }
        }

        for (int i = 0; i < n2; i++) {
        k = 0;
    if (theta[i] == Double.MAX_VALUE) {
            for (int j = 0 ; j < n; j++) {
            if (Math.abs(points[j].x - alpha[i]) < 0.01) {
                k++;
                if (k > maxnum) {
                maxnum = k;
                }            
            }
            }
    } else {
            for (int j = 0 ; j < n; j++) {
            double y = alpha[i] + theta[i]*points[j].x;
                if (Math.abs(y - points[j].y) < 0.01) {
                k++;
                if (k > maxnum) {
                maxnum = k;
                }
                }
            }    
    } // end if
        }
        if (maxnum == 0 && points != null && points.length == 1) {
        maxnum = 1;
        }
        return maxnum;
    }

    public static void main(String[] args) {
    Point[] points = new Point[4];
    points[0] = new Point(0, 2);
    points[1] = new Point(0, 2);
    points[2] = new Point(3, 10);
    points[3] = new Point(3, 10);
    int k = new MaxNumPoint().maxPoints(points);
    System.out.println(k);
    }
}

/**
 Definition for a point.
 */
class Point {
    int x;
    int y;
    Point() { x = 0; y = 0; }
    Point(int a, int b) { x = a; y = b; }
}

The following is what I submitted for Single Number (This is my first submission @LeetCode. It had the highest AC rate, then I thought it's somewhat harder at least, but it's just opposite!
So I find a lowest AC rate, which is Max Points on a Line:).

public class Solution {
    public int singleNumber(int[] A) {
        int r = 0;
        for (int i = 0; i < A.length; i++) {
            r ^= A[i];
        }
        return r;
    }
}


Wednesday, September 10, 2014

A bug in Java's Binary Search method

Lucky #6:

Binary Search may look simple, but a bug on it could live for a decade.


In 2006, Joshua Bloch posted "Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken".

The following is the code he posted:

1:     public static int binarySearch(int[] a, int key) {
2:         int low = 0;
3:         int high = a.length - 1;
4:
5:         while (low <= high) {
6:             int mid = (low + high) / 2;
7:             int midVal = a[mid];
8:
9:             if (midVal < key)
10:                 low = mid + 1
11:             else if (midVal > key)
12:                 high = mid - 1;
13:             else
14:                 return mid; // key found
15:         }
16:         return -(low + 1);  // key not found.
17:     }


The bug is in line 6 and it's fixed in Java 6.

Joshua Bloch proposed two approaches,

               int mid = low + ((high - low) / 2);
or
               int mid = (low + high) >>> 1;

In Java 6's src.zip, I can see the second approach is chosen.

Thursday, September 4, 2014

Top 5 Theorems/Inequalities in Functional Analysis

Top 5 Theorems in Functional Analysis

1. Bounded inverse theorem (a bijective bounded linear operator T from one Banach space to another has bounded inverse )

2. Hahn-Banach theorem (to extend bounded linear functions on a subspace/v.s. to the whole space)

3. Hellinger-Toeplitz theorem (...symmetric operator on a Hilbert space is bounded)

4. Riesz representation theorem (to map Hilbert space H towards its dual space H*)
        Fréchet-Riesz theorem: Φ: HH* defined by Φ(x) = φx is an isometric (anti-) isomorphism.        Quantum Mechanics: every ket |\psi\rangle has a corresponding bra \langle\psi|,

5. Banach-Steinhaus theorem (Uniform boundedness principle)
         For continuous linear operators whose domain is a Banach space, pointwise boundedness is
         equivalent to uniform boundedness in operator norm.
=================================================================

Top 5 Theorems in Real Analysis and Measure Theory

1. Bolzano–Weierstrass theorem (each bounded sequence in R^n has a convergent subsequence)

2. Fatou–Lebesgue theorem (a chain of inequalities relating  Lebesgue integrals)
         Dominated convergence theorem (a special case of Fatou–Lebesgue theorem.)

3. Fundamental theorem of calculus, Mean value theorem, Taylor's theorem (for Engineers like me):

           \int_a^b f(x)\,dx = F(b) - F(a).   
                  
             f ' (c) = \frac{f(b) - f(a)}{b - a}.

              f(x) = f(a) + f'(a)(x-a) + h_1(x)(x-a), \qquad \lim_{x\to a}h_1(x)=0.

4. Heine–Borel theorem (A subset of a metric space is compact iff it is complete and totally bounded)

5. Uniform limit theorem (the uniform limit of any sequence of continuous functions is continuous.)
        
==================================================================

Top 5 Inequalities in Real Analysis (with minimum words:)

1. Hölder's inequality: \|fg\|_1 \le \|f\|_p \|g\|_q.

           (A special case) Cauchy–Schwarz inequality:   |\langle x,y\rangle| \leq \|x\| \cdot \|y\|.\,


2. Minkowski inequality: \|f+g\|_p \le \|f\|_p + \|g\|_p
    
           (A special case) Triangle inequality:  \displaystyle \|x + y\| \leq \|x\| + \|y\| \quad \forall \, x, y \in V


3. Chebyshev's inequality: 
    \Pr(|X-\mu|\geq k\sigma) \leq \frac{1}{k^2}.

4. Inequality of arithmetic and geometric means: \frac{x_1+x_2+\cdots+x_n}{n} \ge \sqrt[n]{x_1x_2 \cdots x_n}
           
 5. Jensen's inequality: \varphi\left(\int_\Omega g\, d\mu\right) \le \int_\Omega \varphi \circ g\, d\mu.


Friday, April 18, 2014

JSTL/DataSources in Tomcat 6.0

1, Install the JDBC drivers in TC's lib/ directory.

    Note: JDBC drivers are usually packaged in jar files. For example, derby*.jar for
    Java DB, ojdbc6.jar for Oracle 11g, and sqljdbc4.jar for SQL Server 2008.
  

2, Add <Resource> in /META-INF/context.xml    Follow the instructions from TC website:
    http://tomcat.apache.org/tomcat-6.0-doc/jndi-resources-howto.html#JDBC_Data_Sources

    <Resource name="jdbc/jvdb" auth="Container" type="javax.sql.DataSource"
                    maxActive="100" maxIdle="3" maxWait="10000" username="user1" password=""
                    driverClassName="org.apache.derby.jdbc.EmbeddedDriver"
                    url="jdbc:derby:/www/db/jvdb"/>
   </Context>


3, Add resource-ref in /WEB-INF/web.xml
    For example,

   <resource-ref >
       <description >DB Connection Pool</description >
       <res-ref-name >jdbc/jvdb</res-ref-name >
       <res-type >javax.sql.DataSource </res-type >
       <res-auth >Container</res-auth >
   </resource-ref >

4, Use the datasource in JSTL(v1.2)
    For example, if you are using Derby, you can query Derby's SYSTABLES in a jsp as follows. Of course, we can always change the SQL to query the user-defined tables.

<%@ taglib uri="http://java.sun.com/jsp/jstl/sql" prefix="sql" %>   
<%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c" %>   
  
  <html> 
  <head>
    <SQL:QUERY var="systables" datasource="jdbc/jvdb">   
       select * from sys.SYSTABLES
    </SQL:QUERY>
...

Reference:


Wednesday, March 26, 2014

Install Windows 8.1 on VMWare Workstation 8

1) VMWare Workstation 8 doesn't support Win 8 directly, so you may need some small tricks:

To keep it possible, I will note only two key steps:

a) Choose "I will install the operating system later."
b) Choose "Microsoft Windows" > "Windows 7".

Recomendations for better performance: Set the memory = about 1/2 of physical memory. Consider  enabling Virtulaization in host BIOS.

2) If you didn't allocate enough space for the VM, use Windows's DiskPart to extend the partition.

a) VM Settings > Hard Disk > Utilities > Extend...
b) In CMD prompt, C:\diskpart
    DISKPART> list volume
    DISKPART> select volume 2
    DISKPART> extend
    Check the space avaliable in C:\ drive.

Refer to "Extending partitions in Windows using DiskPart (1007266)" for details.

Saturday, January 18, 2014

GNU Octave -- MATLAB-like Language

Compiling & Installing Octave on GNU/Linux:

GNU Octave is a high-level programming language, primarily for Numerical Computations. It is an active GNU project, and this article will discuss the major steps to install Octave 3.8.0 on GNU/Linux (x86-64).

At first, I have downloaded the source code of the following software projects(all are free):

    Octave-3.8.0, ATLAS-3.10.1, LPACK-3.4.1, PCRE-8.34, Ncurses-5.9 and FFTW-3.3.3.

Secondly, follow the next 5 steps to compile and install Octave.

1, Build PCRE(Perl Compatible Regular Expressions):
      $cd pcre-8.34

      $configure --prefix=/my/usr/local
      $make
      $make install
2, Build Ncurse(New curses):
      $export CFLAGS='-fPIC'
      $configure --prefix=/my/usr/local
      $make
      $make install
3, Build FFTW(for discrete FFT):
      $configure --prefix=/my/usr/local
      $make
      $make install
4, Build ATLAS(Automatically Tuned Linear Algebra Software) and LAPACK(Linear Algebra Package) :
      $mkdir atlas-build
      $cd altas-build;
      $../ATLAS/configure -Fa acg '-fPIC' -shared --prefix=/my/usr/local --with-netlib-lapack-tarfile=/path/to/lapack-43.4.2.gz

       $make
       $make install

5, Build Octave:
       $cd octave-3.8.0
       $configure --prefix=/my/usr/local
       $make
       $make install
6, Let's go!
        $octave
         octave:1>magic(3)
         ans = 
              8  1  6
              3  5  7
              4  9  2

Note: Octave-Forge  project is a community-maintained set of packages for GNU Octave. It's worth to give it try, although it's optional.

============================================================

If the GCC on your Unix Server is outdated, you may need to build and install GCC v4.8.2:

1, Download GCC source tar file from http://gcc.gnu.org/ and extract it into gcc-4.8.2/.
2, $cd gcc-4.8.2/contrib
    $download_prerequisites; [set http_proxy if necessary]
3, $mkdir gcc-build
    $cd gcc-build;
4, $../gcc-4.8.2/configure --prefix=/my/usr/local --disable-multilib
5, $make
6, $make install

Note: make sure that the Fortran Csompiler is also installed.