Saturday, July 18, 2015

Machine Learning and Statistics: a unified perspective | 0. Introduction

Today I am starting a series of articles with the theme of “Machine Learning and Statistics: a Unified Perspective”. The goal of this series is to demonstrate the concepts of machine learning and their fundamental connections with statistics.

Machine learning has become increasingly important with the explosion of data in this modern age. There are many video tutorials, books, and free source code online for learning machine learning, but in almost all of the resources I have seen, there is a problem of separation between theory, code, and data. You may find an excellent YouTube video about machine learning, but if you want to play with some of the algorithms in the tutorial, you may not know where to start. You may find free source code elsewhere online, but the data set used by the code is likely not the one the tutorial discusses, and you may have a hard time reproducing the results in the video. This series of articles will be written with theory and code in the same place, using data sets available online, so that you are guaranteed to be able to reproduce every result in the articles.

Another goal I hope to achieve with this series is to make it easier for readers to understand the concepts of machine learning. Machine learning seems complicated and appears to require advanced math skills, but the ideas behind it are really not that difficult to comprehend. By combining theory, code, and data in one place, I hope to help readers grasp the concepts of machine learning more easily.

With all that said, let’s first define the problem of machine learning. There are many machine learning algorithms, and there doesn’t seem to be one simple mathematical definition of machine learning. For example, Wikipedia explains that machine learning “explores the construction and study of algorithms that can learn from and make predictions on data”. But is there a unified definition of how this prediction is made? Here I will attempt my own unified definition of the goal of machine learning, which is also the same as what statisticians are trying to achieve:

Given a data set (\(\mathbf{y}, \mathbf{X}\)), learning in general is about defining a function \(\hat{y} = f(\theta, \mathbf{x})\), such that by choosing an appropriate value of \(\theta\), some loss function \(L(\theta) = L(\mathbf{y}, f(\theta, \mathbf{X}))\) is minimized, subject to certain constraints on the parameters, \(\mathrm{Constraint}(\theta)\).

Thus there are two parts to the learning process. The first is choosing the function \(\hat{y} = f(\theta, \mathbf{x})\), which is usually referred to as model selection. The second is the minimization of the loss function by choosing an appropriate value of \(\theta\), which is usually referred to as parameter estimation. Usually the loss function is defined based on some distance measure between the estimated values and the target values, i.e., \(L(\mathbf{y}, f(\theta, \mathbf{X})) = \sum_i \mathrm{Dist}(y^{(i)}, f(\theta, \mathbf{x}^{(i)}))\). The \(\mathrm{Constraint}(\theta)\) term is usually used to limit the complexity of the model: with enough parameters, the loss function can always be made smaller. In the extreme case where the number of parameters K equals the number of data points N, an exact solution can be found and the loss function becomes 0. So some constraint on the parameters \(\theta\) is usually useful.
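To make this concrete, here is a minimal sketch in Python (with NumPy) of the loss function above for a linear model \(\hat{y} = \theta^T \mathbf{x}\). The toy data and the choice of squared distance are my own illustration, not taken from any particular data set:

```python
import numpy as np

# Toy data set: N = 4 data points, K = 1 feature.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

def f(theta, X):
    """Model: y_hat = theta^T x for each data point."""
    return X @ theta

def loss(theta, X, y):
    """Sum of squared distances between estimates and targets."""
    return np.sum((y - f(theta, X)) ** 2)

# A parameter value near the data's true slope gives a much
# smaller loss than one far from it.
print(loss(np.array([2.0]), X, y))
print(loss(np.array([0.0]), X, y))
```

The whole learning problem is then just: pick the form of `f`, then search for the `theta` that makes `loss` as small as possible.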

This definition may sound abstract, but it should look obvious once we examine some examples. In the case of multiple least-squares linear regression, we usually have a set of data points \(\mathbf{y} = (y^{(1)} ... y^{(N)})\) and \(\mathbf{X} = [(x^{(1)}_1 ... x^{(N)}_1), ..., (x^{(1)}_K ... x^{(N)}_K)]\), for N data points \(i = 1...N\) and K features \(j = 1...K\). \(\mathbf{X}\) is an N x K matrix with each feature as a column vector of N data points. We model the relationship between y and x as \(\hat{y} = f(\theta, \mathbf{x}) = \theta^T \mathbf{x}\), where \(\theta = (\theta_1, ..., \theta_K)\). Note my notation: normal-font lower case like y represents scalars, bold lower case like \(\mathbf{y}\) represents vectors, and bold upper case like \(\mathbf{X}\) represents matrices. The superscript \(^{(i)}\) denotes the i-th data point, and the subscript \(_j\) denotes the j-th feature. In plain vanilla least-squares linear regression, the loss function is defined as the sum of squared Euclidean distances between the estimated and target values: \(L(\mathbf{y}, f(\theta, \mathbf{X})) = \sum_{i=1}^N (y^{(i)} - \sum_{j=1}^K \theta_j x^{(i)}_j)^2\), and the goal of linear regression is to find the \(\theta\) that minimizes this loss function.
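For least-squares linear regression, the minimizing \(\theta\) actually has a closed-form solution, \(\hat{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) (the normal equations). A quick Python sketch on simulated data (the true \(\theta\) and noise level here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N = 100 points, K = 2 features, true theta = (1.5, -0.5).
N, K = 100, 2
X = rng.normal(size=(N, K))                    # N x K design matrix
theta_true = np.array([1.5, -0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)  # targets with small noise

# Solve the normal equations (X^T X) theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # should land close to (1.5, -0.5)
```

With only mild noise, the recovered `theta_hat` sits very close to the parameters that generated the data.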

Recognizing hand-written digits with advanced machine learning algorithms may seem much more complicated than linear regression, but in terms of the definition above the learning process is really the same. An example of the hand-written digits is shown below. In this case, the data set \(\mathbf{y}\) consists of digits in the range 0~9, and \(\mathbf{X}\) consists of digital images, each with \(K * K\) pixels. Similarly, asking a machine to interpret these images means defining a function \(\hat{y} = f(\theta, \mathbf{x})\) that maps the pixels of an image to a digit in the range 0~9, and then finding the optimum parameters \(\theta\) so that some definition of the loss function \(L(\mathbf{y}, f(\theta, \mathbf{X}))\) is minimized, in the hope that the computer can correctly interpret a new digital image using \(\hat{y} = f(\theta, \mathbf{x})\) with the optimized \(\theta\). Of course there are many ways of defining the function \(f(\theta, \mathbf{x})\) and the loss function \(L(\mathbf{y}, f(\theta, \mathbf{X}))\), and that is where many different machine learning algorithms come from, such as Support Vector Machines, Random Forests, etc.
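One of the simplest possible choices of \(f(\theta, \mathbf{x})\) for images is a nearest-centroid classifier, where the parameters \(\theta\) are just the per-class mean images. The sketch below uses synthetic 8x8 "images" of my own invention (two classes built from random templates), not a real digit data set, purely to show that image classification fits the same \(f(\theta, \mathbf{x})\) template:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for digit images: two classes of 8x8 = 64-pixel vectors,
# each clustered around its own random template (hypothetical data).
K = 8
templates = rng.uniform(size=(2, K * K))
X_train = np.vstack([t + 0.05 * rng.normal(size=(20, K * K)) for t in templates])
y_train = np.repeat([0, 1], 20)

# "Model selection": f(theta, x) = class of the nearest centroid.
# "Parameter estimation": theta = the per-class mean images.
theta = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def f(theta, x):
    return int(np.argmin(np.linalg.norm(theta - x, axis=1)))

# Classify a fresh noisy sample drawn from class 1.
x_new = templates[1] + 0.05 * rng.normal(size=K * K)
print(f(theta, x_new))
```

Real algorithms like Support Vector Machines differ only in how they define \(f\) and the loss, not in the overall structure of the problem.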

From the definition above, we can see that the learning process is all about 1) modeling, which is to define \(\hat{y} = f(\theta, \mathbf{x})\) and \(L(\mathbf{y}, f(\theta, \mathbf{X}))\), and 2) optimization, which is to find the \(\theta\) that minimizes \(L(\mathbf{y}, f(\theta, \mathbf{X}))\) for the given data set, and potentially analyzing how trustworthy the optimum parameter \(\theta\) is. In terms of technical background, you mainly need some knowledge of linear algebra and optimization.
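When no closed-form solution exists, the optimization step is typically done iteratively. As one concrete illustration (again on simulated data of my own making), plain gradient descent on the squared loss recovers the same answer as the normal equations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Same linear model as before; minimize the loss iteratively
# with gradient descent instead of the closed-form solution.
X = rng.normal(size=(50, 2))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true + 0.1 * rng.normal(size=50)

theta = np.zeros(2)
lr = 0.005  # step size, chosen small enough to converge here
for _ in range(1000):
    grad = -2 * X.T @ (y - X @ theta)  # gradient of the sum of squared errors
    theta -= lr * grad
print(theta)  # approaches (2.0, -1.0)
```

Most machine learning algorithms are, at heart, this same loop with a fancier model and loss in place of the linear ones.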

Please note that this definition of learning covers only supervised learning, in which a given training data set contains both target values \(\mathbf{y}\) and feature values \(\mathbf{X}\). In contrast, unsupervised learning tries to extract certain characteristics of a given feature data set \(\mathbf{X}\) without any information about target values \(\mathbf{y}\). We will deal with unsupervised learning later in this article series.

For the rest of the articles, I will mainly use the programming language R to demonstrate the concepts of machine learning and related statistics, and sometimes Python as well. Please note the version of R that I use:

##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          2.1                         
## year           2015                        
## month          06                          
## day            18                          
## svn rev        68531                       
## language       R                           
## version.string R version 3.2.1 (2015-06-18)
## nickname       World-Famous Astronaut
