Thursday, March 2, 2017

Turn photo into painting: style interpolation

Google recently demonstrated that we can do style interpolation by squeezing multiple styles into one deep neural net model. However, I personally found this approach to be not flexible because you have to squeeze a pre-defined amount of styles into one model, and you can only do style interpolation within those pre-defined styles. However, style interpolation using deep neural net is quite easy. I was able to simply use linearly weight models that is used in to achieve style interpolation. Here is an example. The style Van Gogh Night Market and Carribean are linearly interpolated to stylize the image Chicago, and the interpolation seems quite smooth.

Style1: Van Gogh Night Market
 Style2: Carribean
Content image:

Carribean: 0, Night Market: 1
 Carribean: 0.2, Night Market: 0.8
 Carribean: 0.4, Night Market: 0.6

Carribean: 0.6, Night Market: 0.4
Carribean: 0.8, Night Market: 0.2
 Carribean: 1, Night Market: 0

Monday, February 27, 2017

Turn photo into painting: Using image blending for foreground and background

In the previous post we discussed how to use different mixing level for foreground and background, and the example worked really nicely. However, when the foreground and background has a huge contrast, mixing foreground and background can be challenging. Let's take a look at an example. This is a wedding photo taken at sunset.

We want to make it more artistic, and we chose this style from

With low mixing level, the result looks like this. It looks quite nice at first sight, but if you look more carefully, you will see that there are some noisy patterns in the people and in the dark ground. Those noisy patterns don't look nice.

We actually want to use the people and the dark ground in the original photo, so we created the following mask:
By simply overlaying the original photo and the stylized photo using the mask above, we get the following image:

Unfortunately because the contrast and color in the original image and the stylized image look too different, you can see that there is a clear "cut line" near the dark ground, which looks awkward. Fortunately, we can use the image pyramid technique to create a smooth transition between the original photo and the stylized image. We are using this script to make the image-pyramid based blending. The result looks like this. This actually looks quite nice and smooth!

Sunday, February 26, 2017

Turn photo into painting: Using different mixing levels for foreground and background

Most of the pictures are taken with people in some kind of background scene. We usually want to make the background scene artistic, but we may not always want to stylize the people that much because we still want the people in the picture to look like the original people. In this post we demonstrate the technique of using two different mixing levels in to create two different stylized images, and use grabCut to separate the foreground and background and then combine the low mixing foreground and the high mixing background. Here we start from the raw picture:
To stylize this wedding photo, we used the following style:
We created two stylized images with a high mixing level and a low mixing level:
high mixing
low mixing

We can see that the image with high mixing level looks quite artistic, but the people in the picture is more distorted than we want them to be. The people in the low mixing level image look nice and real, but the background is not too different from the raw picture. To combine the two images, we use grabCut algorithm to create a mask like below. The exact code which created this mask can be found here.

 Then combing the high mixing and low mixing image using the mask, we get the following image. See how the background is very artistic while the foreground (the people) are still nice and real.
Below is another example using similar technique. The details of the process is omitted due to similarity.
Original photo


High Mixing

Low Mixing



Thursday, February 23, 2017

Turn photo into painting: How to choose style

There are more than 90 styles in How should we choose the most appropriate style for a photo. Honestly, I think the answer is just try as many different ones as possible. It is hard to guess what a stylized image would look like for a photo with a given style. Here is an example. The original photo is just a portrait of a girl. There are a few portrait styles in, but there is no such a rule that you are supposed to use a portrait style for a portrait photo. Below shows 5 different stylized images using 5 different styles that has completely nothing to do with portrait, but yet the results looks very good. So the conclusion is: just try it out!

Wednesday, February 22, 2017

Turn photo into painting: create a Panorama with

We have previously introduced One advantage of this site is that it allows users to create high resolution stylized images, and this is particularly good for creating panorama images. Panorama photos are very popular nowadays due to the popularity of smart phones, and here is an example of stylized panorama image created with Below is the original photo that I took in Hawaii:
After stylizing the photo with this style image:
The result look like this. How pretty! prisma has a similar style, but it cuts the photo into a square shape and it is difficult to make panorama photo with it.

Below is another example of Panorama photo. The original picture looks like this:

Using the following style:

The result image look like this:

Turn photo into painting: introduction to

Ever since the publication of neural style, there are a few different websites and smart phone apps leveraging this technology to stylize photo into artistic paints. is based on the optimization approach, which is takes tens of minutes or even hours to render high resolution pictures. The advantage is that user can upload any style image they want and create a painting with the style they upload. The popular app prisma, on the other hand, is based on the fast neural style approach. It can render stylized images in seconds, but the style model can take 6 hours or more to train on a high performance GPU computer., however, is based on a technology that combines the advantage of both approaches. It can render high resolution stylized images in seconds, but it can also create new style models in a reasonably short time, around 1 hour. This technology allows to provide a large number of style models (around 90 style models for now), and it also allow users to submit their new style images to, and the site administrator will train and add the new style in a day or so.

Another feature of is that it allows users to render high resolution images with the largest side of the image up to 2500px. High resolution is particularly important when there are people in the photo. Consider the following example. The original photo look like this:

When using the non-premium service of, which result in a image with 768px on the largest side, the stylized images look like this:

We can clearly see that the face of the mother and the child is blurred due to the low resolution. When using premium service, which allows image with largest side up to 2500px, the photo rendered with the same style look like this. We can see how clearly the details of the faces, and the people in the image look much nicer comparing with the low resolution image.

In the following few blog post, I am going to demonstrate the capabilities of, and some useful tips on how to choose appropriate styles and resolutions when stylizing your photos into paintings.

Tuesday, August 4, 2015

Machine Learning and Statistics: a unified perspective | 4 Decision Trees

Previously we have discussed using Generalized Linear Model(GLM) to estimate continuous variables (linear regressions) and categorical variables (logistical regression). GLM minimizes Loss function with respect to parameters \(\theta\), and such \(\theta\) is global in the sense that \(\theta\) doesn’t change with the value of variable \(\mathbf{x}\). GLM models are less prone to overfitting when effective degrees of freedom is not large, but it may not deal with local nonlinearities properly. Consider a k-nearest-neighbor (KNN) algorithm: Data point \((y, \mathbf{x})\)’s target value y will be assigned to the mean of the nearest k data point to \(\mathbf{x}\). This algorithm can adapt to high local nonlinearities, but it can also have high variance and significantly overfit the data.

For models using adaptive local functions, we can write the model in the following form: \[f(\mathbf{x}) = E[y|\mathbf{x}] = \sum_{m=1}^M w_mI(\mathbf{x} \in R_m) = \sum_m w_m\phi(\mathbf{x; v}_m)\] where \(R_m\) is the m-th region, \(w_m\) is the mean response weight, and more generally, \(\phi(\mathbf{x; v_m})\) is a local basis function which is not necessarily \(I(\mathbf{x} \in R_m)\). For decision trees, \(\mathbf{v}_m\) encodes the choice of variable to split on, and the threshold value. This makes it clear that a decision tree based model is just a an adaptive basis-function model, where the basis functions define the regions, and the weights specify the response value in each region.

Build a Tree

Finding the optimal partitioning of the data is NP-complete (Hyafil and Rivest 1976), so it is common to use the greedy procedure locally optimal MLE. The split procedure is the following: \[ (j^*, t^*) = arg min_{j \in 1...K} min_{t \in T_j} [cost({\mathbf{x}, y : x_{i,j} \leq t}) + cost({\mathbf{x}, y : x_{i,j} \geq t})]\]

For cost functions, we have the following considerations:

  • Regression: \(cost(\mathcal{D}) = \sum_{i \in \mathcal{D}}(y^{(i)} - \bar{y})^2\)
  • Classification: if dependent variable y has C classes, we define the class-conditional probabilities as \(\pi_c = \frac{1}{|\mathcal{D}|}\sum_{i \in \mathcal{D}}I(y^{(i)} = c)\), where \(|\mathcal{D}|\) is the number of data points in data set \(\mathcal{D}\). There are a few similar measures of cost:
    • Misclassification rate: define most probable class label as \(\hat{y}_c = argmax_c \hat{\pi}_c\), then the corresponding error rate is \(1 - \pi_{\hat{y}}\)
    • Entropy: \(H(\pi) = -\sum_{c=1}^C \pi_clog(\pi_c)\). Note that minimizing entropy is the same as maximizing information gain.
    • Gini index, or the expected error rate: \(\sum_{c=1}^C \pi_c(1-\pi_c)\)

The 3 measures for classification error are similar. Let’s use two class case as an example. Assume the probability of class 1 is p, then:

  • misclassifcation rate = \(1-max(p, 1-p)\)
  • Entropy = \(-plog(p) - (1-p)log(1-p)\)
  • Gini index = \(2p(1-p)\)

All 3 measures is 0 when p = 0 or p = 1, and maximizes when p = 1/2.

Let’s take an example of how to build a tree. Here we use the R package rpart and its example data Kyphosis. First we plot the data of kyphosis below. Data set Kyphosis has a dependent variable kyphosis of two categories: present and absent, and 3 varaibles Age, Number and Start. For more details, please refer to the (rpart documentation)[].

# Classification Tree with rpart
library(rpart); library(ggplot2)
plt <- ggplot(kyphosis, aes(x=Age, y=Start)) + 
     ggtitle("kyphosis data") +
     geom_point(aes(fill=factor(Kyphosis)), size=3, pch=21)

We build a tree use all 3 parameters. To avoid overfitting, we need to set some limits on the complexity of the tree. In the rpart.control function below, minsplit is the minimum number of observations that must exist in a node in order for a split to be attempted. cp is the complexity parameter which means that any split that does not decrease the overall lack of fit by a factor of cp is not attempted. We can see a decision tree is built using the code below.

fit <- rpart(Kyphosis ~ Age + Number + Start,
    method="class", data=kyphosis, control = rpart.control(minsplit = 20, cp = 0.01))
plot(fit, uniform=TRUE, 
    main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

To illustrate how cp limits the depth of the tree, we print how the cp parameter changes as the split number goes up. As shown below, the last split, which corresponds to a tree of 4 splits, has a cp of 0.01, which is the limit we set in rpart.control. Thus the split stops at 4. Optimum value of cp can be determined by cross-validation.

printcp(fit) # display the results 
## Classification tree:
## rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, 
##     method = "class", control = rpart.control(minsplit = 20, 
##         cp = 0.01))
## Variables actually used in tree construction:
## [1] Age   Start
## Root node error: 17/81 = 0.20988
## n= 81 
##         CP nsplit rel error  xerror    xstd
## 1 0.176471      0   1.00000 1.00000 0.21559
## 2 0.019608      1   0.82353 0.94118 0.21078
## 3 0.010000      4   0.76471 1.00000 0.21559

Another criteria that can be used to prune is tree is to define a cost complexity criterion like the following. Let T be the number of nodes in a tree: \[N_t = number\{x_i \in R_t\}\] \[\hat{c}_t = 1/N_t\sum_{x_i \in R_t} y_i\] \[Q_t = \sum_{x_i \in R_t} (y_i - \hat{c}_t)^2\] Then the cost complexity is: \[ C_{\alpha} = \sum_{t=1}^T Q_t + \alpha T\] where \(\alpha\) is a coefficient which can be determined by cross-validation

The advantanges of decisions trees include:

  • they are easy to interpret based on the split results
  • they can easily handle mixed discrete and continuous inputs
  • they are insensitive to monotone transformations of the inputs because the split points are based on ranking the data points
  • they perform automatic variable selection
  • they are relatively robust to outliers
  • they scale well to large data sets

However, decision tree have an important disadvantage: overfitting due to the greedy nature of the tree construction algorithm, even with careful pruning. A related problem is that trees areunstable: small changes to the input data can have large effects on the structure of the tree. Solution to this problem will be discussed in the following section.

Bagging and Random Forest

An important solution to the problem of simple decision tree is to using boosting approach to create a large amount of sample data, create a decision tree for each of them, and average among all the trees. This technique is called bagging, which stands for “bootstrap aggregating”.

Unfortunately, simply re-running the same learning algorithm on different subsets of the data can result in highly correlated predictors, which limits the amount of variance reduction that is possible. To decorrelate the base learners, a randomly chosen subset of input data sets as well as a randomly chosen subset of the features are combined to create decision trees. This technique is called Random Forest. Let’s take a look at an example of Random forest as shown below using the same data set of Kyphosis:

## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
fit <- randomForest(Kyphosis ~ Age + Number + Start,   data=kyphosis)
print(fit) # view results 
## Call:
##  randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
##         OOB estimate of  error rate: 20.99%
## Confusion matrix:
##         absent present class.error
## absent      60       4   0.0625000
## present     13       4   0.7647059

Unfortunately, methods that use multiple trees lose their nice interpretability properties. Fortunately, various post-processing measures can be applied, and we can still see the relative importance of features from Random Forest.

importance(fit) # importance of each predictor
##        MeanDecreaseGini
## Age            8.837115
## Number         5.305533
## Start          9.797556


Boosting is a greedy algorithm for fitting adaptive basis-function by applying the weak learner sequentially to weighted versions of the data, where more weight is given to examples that were misclassified by earlier rounds. In math terms, boosting is a forward stepwise procedure: \[ \hat{\theta}_m = argmin_{\theta_m} \sum_{i=1}^N L(y^{(i)}, f_{m-1}(x^{(i)}) + T(x^{(i)}, \theta_m))\]

If the Loss function is an exponential function \(L(y, f(x)) = exp(-yf(x))\), the solution is particularly simple, and this stepwise method using exponential loss function is referred to as Adaboost. It can be relatively easily prooven that the solution for Adaboost is: \(T(x^{(i)}, \theta_m) = \alpha_m k_m(x^{(i)})\) where \(k_m(x) = argmin_{k_m}\sum_i w_m^{(i)}exp(-y^{(i)}k_m(x^{(i)}))\) is the decision tree that minimize the loss of data set \((\mathbf{y, X})\) with weight \(w_m^{(i)} = exp(-y^{(i)}f_{m-1}(x^{(i)}))\). A realization of Adaboost algorithm is shown below. The in-sample error for data set kyphosis reduce to 0 when iteration is 10.

adaboost <- function(y, X, ntree = 100) {
  y = sapply(y, function(x){if (x == levels(x)[1]) return(-1) else return(1)})
  N = length(y)
  f = rep(0, N)
  in_sample_error = c()
  for (ii in 1:ntree) {
    w = exp(-y * f); w = w / sum(w)
    fit = rpart(y ~ ., X, w, method = "class")
    g = -1 + 2*(predict(fit, X)[,2] > 0.5)
    e = sum(w * ((y * g) < 0))
    alpha = 0.5*log((1-e)/e)
    f = f + alpha*g
    in_sample_error = c(in_sample_error, sum((y*f) < 0)/N)

y = kyphosis$Kyphosis; X = kyphosis[, 2:4]
in_sample_error = adaboost(y, X, 20)
plot(1:20, in_sample_error, xlab = "Iteration", ylab = "In sample error")

Exponential loss function is easy to solve because it is analogous to the square loss in least square regression. However, it is prone to overfitting because it puts too much weight out the outlier. More robust loss functions such as Huber loss or Multinomial Deviance can be used, but numerical solutions are needed for more general loss functions. Gradient boost algrithm is developed to solve the problem: induce a tree \(T(\mathbf{x}, \theta)\) at the m-th iteration whose predictions \(\mathbf{t}_m\) are as close as possible to the negative gradient of each data point of the loss function \(g_{im} = [\frac{\partial L(y^{(i)}, f(x^{(i)}))}{\partial f(x^{(i)})}]\): \[ \hat{\theta_m} = argmin_{\theta}\sum{i=1}^N(-g_{im}-T(x^{(i)}, \theta))^2\]

Since the loss function in Gradient Boost can be of many different kinds, Gradient Boost can be used for either classification or regression. R package gbm implements Gradient Boost.

Comparison with Linear Regression

Tree based methods have the following Pros and Cons:


  • handling of mixed data types
  • hanlding of nonlinearity in response
  • Insensitive to monotone transformations of inputs
  • automatic feature selection / ranking


  • Can not generate linear combinations of features, and can not interpret the sign of correlation between variable and response
  • Prone to overfitting and needs proper choice of tree parameters

Linear Regression has the following Pros and Cons:


  • Less prone to overfitting due to its simple linearity nature
  • Easy interpretation on the sign of correlation between variable and response.


  • Relative importance of variables in multiple linear regression can only be achieved with normalized input data and same expected sign of correlation with response.
  • Can not capture nonlinearity