Friday, April 24, 2015

Combine the Power of Python and R: a Cointegration Test Example

Combine the Power of Python and R

Motivation

Both Python an R are very powerful data analysis tools, and each has its own strength. The best approach would be to combine the power of both tools and use whichever one when it is appropriate. In this article we will use an example of Cointegrating test to demonstrate how to seamlessly combine Python and R in the IPython Notebook environment. Cointegration test is an important method to determine if two stocks are good for pair trading

Below we import the necessary Python packages for data processing

In [109]:
%matplotlib inline
from IPython.core.pylabtools import figsize
import pandas as pd
import numpy as np
import datetime

The key package to enble the communication below Python and R is rpy2. The subpackage rpy2.ipython will allow user to easily run R scripts in IPython Notebook and send data back and forth between Python and R easily

In [18]:
%load_ext rpy2.ipython

Cointegration test using both Python and R

In this example we download data from Yahoo Finance using Python

In [14]:
import pandas.io.data as web
class DataObj(object):
    def __init__(self, symbols = ['SPY', 'VGK', 'FXI', 'TLT'], 
                 start_time = datetime.datetime(2010, 1, 1),
                 end_time = datetime.datetime.now().date() - \
                 datetime.timedelta(days = 1)
                ):
        self.data = {}; self.returns = pd.DataFrame()
        self.adjClose = pd.DataFrame()
        for s in symbols:
            try:
                self.data[s] = web.DataReader(s, 'yahoo', 
                                              start_time, end_time)
                self.adjClose[s] = self.data[s]['Adj Close']
                self.returns[s] = self.data[s].diff()['Adj Close'].dropna() \
                    / self.data[s]['Adj Close'][:-1].values
            except:
                pass

We choose GLD and IAU as the example in this article because they are two different ETFs tracking gold price. They are closely related to each and is expected to be cointegrated

In [110]:
figsize(10, 8)
time_span_days  = 365*4
today = datetime.date.today()
start_date  = today + datetime.timedelta(-time_span_days)
StockData = DataObj(symbols = ['GLD', 'IAU'], start_time = start_date, 
                  end_time = today)
print "Got %s Days of data" % str(len(StockData.adjClose))
StockData.adjClose.plot()
Got 1007 Days of data
Out[110]:
<matplotlib.axes.AxesSubplot at 0xad0fd78c>

Using the command below, we can send the obtained Pandas DataFrame to R and do the Johansen test in R

In [111]:
X = StockData.adjClose.values
%Rpush X
In [112]:
%%R
library('urca')
coRes = ca.jo(data.frame(X),type = "trace", K = 2, 
              ecdet = "none", spec = "longrun")
summary(coRes)
###################### 
# Johansen-Procedure # 
###################### 

Test type: trace statistic , with linear trend 

Eigenvalues (lambda):
[1] 0.0505859866 0.0007503123

Values of teststatistic and critical values of test:

          test 10pct  5pct  1pct
r <= 1 |  0.75  6.50  8.18 11.65
r = 0  | 52.92 15.66 17.95 23.52

Eigenvectors, normalised to first column:
(These are the cointegration relations)

          X1.l2     X2.l2
X1.l2   1.00000  1.000000
X2.l2 -10.05733 -8.885641

Weights W:
(This is the loading matrix)

         X1.l2        X2.l2
X1.d 1.1828547 -0.018068989
X2.d 0.1310969 -0.001774551

Moreover, we want to let R do the Johansen test and get the results back to Python, and then we can continue our work with Python. So instead of using the %%R to run an R Script, we define the following Python function to call the Johansen test function in R, and send the results back to Python in the format of Numpy arrays

In [113]:
def isCointegrated(X, verbose = False):
    %R -i X coRes = ca.jo(data.frame(X),type = "trace", K = 2, ecdet = "none", spec = "longrun")
    testStat = %R slot(coRes, "teststat")
    cVal = %R slot(coRes, "cval")
    if verbose:
        print "Test Statics: ", testStat
        print "Critical Values: "
        print cVal
    testStat = np.array(testStat)
    cVal = np.matrix(cVal)
    coIntFlag = testStat[-1] > cVal[-1, 1]
    if verbose:
        print "Cointegrate?: ", coIntFlag
    eigenVec = %R slot(coRes, "V")
    if verbose:
        print "Eigen vectors: "
        print eigenVec
    eigenVec = np.array(eigenVec)
    return coIntFlag, eigenVec[:, 0]
In [114]:
coIntFlag, eigenVec = isCointegrated(StockData.adjClose.values, 
                                     verbose = True)
Test Statics:  [1]  0.7543469 52.9242112

Critical Values: 
         10pct  5pct  1pct
r <= 1 |  6.50  8.18 11.65
r = 0  | 15.66 17.95 23.52

Cointegrate?:  True
Eigen vectors: 
          X1.l2     X2.l2
X1.l2   1.00000  1.000000
X2.l2 -10.05733 -8.885641

Johansen test estimates the rank (r) of given matrix of time series with confidence level. In our example we have two time series, therefore Johansen tests null hypothesis of r=0 < (no cointegration at all), r<1 (till n-1, where n=2 in our example). In this case, when r = 0, the test Statistics is larger than the 5% (even 1%) of the critical value. So the pair is cointegrated. Once a cointegration is established, eigenvector (normalized first column) would be used as weight for a portfolio. Below we use the obtained weight to check if the residual of the pair is really stationary

In [115]:
resid = pd.Series(index = StockData.adjClose.index); resid = 0
for ii, s in enumerate(StockData.adjClose.columns):
    resid = resid + eigenVec[ii] * StockData.adjClose[s]
resid.plot()
Out[115]:
<matplotlib.axes.AxesSubplot at 0xad09c86c>

Clearly, using the obtained weight, the residual of the pair formed a quite stationary time series

In [ ]: