Data Science for Hackers: March 2015

Outline

R data types, functions and classes
basic R packages and functions
R quantitative trading packages:

time series: xts, zoo, TSA(arima), forecast (auto.arima)
quantitative trading: quantmod, quantstrat, PerformanceAnalytics, factorAnalytics (these depend on xts and zoo).
machine learning: glm, nnet, kmeans, princomp

1. R data types, functions, classes/objects and packages

1.1 data types

vector (atomic): 1d sequential data container in which all elements have the same type (numeric, logical, character, etc.). Vectors are created with c(...); if the arguments have different types, they are coerced to a common type. Vector slice: vec[c(1, 5)].

Note: vec[-k] excludes the kth element and returns the rest. Unlike Python, a negative index does not count from the end.

Note: R data containers are always 1 indexed, not 0 indexed.
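
A minimal sketch of the vector rules above. The vector and variable names are made up for illustration.

```r
# Hypothetical example vector.
vec <- c(10, 20, 30, 40, 50)

first_and_last <- vec[c(1, 5)]  # elements 1 and 5 (1-indexed)
without_second <- vec[-2]       # everything except the 2nd element
mixed <- c(1, "a", TRUE)        # mixed inputs are coerced to one common type

print(first_and_last)
print(without_second)
print(class(mixed))
```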

list (recursive): 1d sequential data container whose elements can be of different types. Each element can have a name. Create a list with list(...); c(...) also returns a list when at least one of its arguments is a list.

Note: to access a list member, use list[[index]], list[["key"]] or list$key. list[index] and list["key"] return a sub-list of length one, not the element itself.
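
A short sketch of the difference between [[ ]] and [ ] on a list. The list contents are invented for illustration.

```r
# Hypothetical list with named elements of different types.
person <- list(name = "Ada", age = 36, scores = c(90, 85))

by_position <- person[[1]]      # the element itself: "Ada"
by_key      <- person[["age"]]  # the element itself: 36
by_dollar   <- person$scores    # same as person[["scores"]]
sub_list    <- person[1]        # a list of length 1, not the element

print(class(by_position))
print(class(sub_list))
```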

matrix(data = NA, nrow = 1, ncol = 1, dimnames = NULL): 2d data container designed for matrix operations. Operators: %*% is matrix multiplication (dimensions must be (n, k) %*% (k, m)); * is element-by-element multiplication. Matrix inverse: solve(A). Indexing: matrix[ii, jj]
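
The matrix operators can be sketched as follows. The matrices are arbitrary small examples.

```r
# matrix() fills column by column by default.
A <- matrix(c(2, 0, 0, 4), nrow = 2)   # 2 x 2 diagonal matrix
B <- matrix(1:6, nrow = 2, ncol = 3)   # 2 x 3

prod_AB     <- A %*% B   # (2, 2) %*% (2, 3) gives a (2, 3) matrix
elementwise <- A * A     # element-by-element product
A_inv       <- solve(A)  # matrix inverse

print(prod_AB)
print(A %*% A_inv)       # identity matrix, up to rounding
```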

array(data = NA, dim = c(...), dimnames = NULL): data container with an arbitrary number of dimensions.
Indexing: array[ii, jj, kk, ...]
subsetting: if dimnames(array) is not NULL, the array can be subset by its dimnames: array[c(...), c(...), ...]. Note that if the selection has length 1 along some dimension, that dimension is dropped from the result. To keep all dimensions, pass drop = FALSE, e.g. array[ii, , , drop = FALSE], or assign the result into a new array that has the desired dimensions.
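
A small sketch of how subsetting can reduce the dimensions of an array, and how drop = FALSE avoids it. The array and its dimnames are made up.

```r
# Hypothetical 2 x 3 x 4 array with names on the first two dimensions.
arr <- array(1:24, dim = c(2, 3, 4),
             dimnames = list(c("a", "b"), c("x", "y", "z"), NULL))

dropped <- arr["a", , ]                     # first dimension is dropped
kept    <- arr["a", , , drop = FALSE]       # all three dimensions are kept
by_name <- arr[c("a", "b"), c("x", "z"), 1] # subset through dimnames

print(dim(dropped))
print(dim(kept))
```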

data.frame(key1 = c(...), key2 = c(...), ...): similar to a pandas DataFrame. Indexing: df$col, df[["col"]]

related: the dplyr package provides data processing functions for data frames, such as group_by()
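
A minimal data frame sketch. To stay self-contained it uses base R's aggregate() for the grouped mean rather than dplyr, which is an assumption of this example and not something the notes prescribe. Column names and values are invented.

```r
# Hypothetical returns for two tickers.
df <- data.frame(ticker = c("AAA", "BBB", "AAA", "BBB"),
                 ret    = c(0.01, -0.02, 0.03, 0.00))

col_dollar  <- df$ret       # column access by $
col_bracket <- df[["ret"]]  # column access by [[ ]]

# Grouped mean in base R; dplyr's group_by() + summarise() plays a similar role.
mean_by_ticker <- aggregate(ret ~ ticker, data = df, FUN = mean)
print(mean_by_ticker)
```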

formula: my_formula = as.formula("y ~ x + z + w + I(z*w) - 1")
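
A formula object can be inspected before it is passed to a model function. The sketch below only uses base R's terms() and all.vars().

```r
my_formula <- as.formula("y ~ x + z + w + I(z*w) - 1")

term_labels   <- attr(terms(my_formula), "term.labels")  # right-hand-side terms
has_intercept <- attr(terms(my_formula), "intercept")    # 0 because of "- 1"
variables     <- all.vars(my_formula)                     # variables the formula uses

print(term_labels)
print(has_intercept)
print(variables)
```

I(z*w) protects the product so it is treated as arithmetic, and "- 1" removes the intercept.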

1.2 functions

for loops:

for (i in 1:10) {
  cat(i, '\n')
}

define functions:

var <- function(x, y, ...) {
  ...
  return(a)
}

introspect:

class(var);

mode(var);

names(var);

is.numeric(var), is.vector(var), ...

attributes(obj): show all attributes of an object (class instance)
attr(x, "attribute"): get a single attribute, similar to getattr() in Python
search(): show the search path
browser(): pause execution and inspect the current environment, like a breakpoint
.libPaths(): show / set library paths
find("f"): find which environment or package the function named f comes from
get("f"): get the object (e.g. a function) named f
For S4 class instance:
slotNames(var)
slot(var, "attr")
do.call(funcName, funcParamList): call a function with its arguments supplied as a list. Note that this kind of dynamic function calling can lower performance noticeably.
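
A short sketch that ties together function definition, introspection and do.call. The function scale_by and the attribute "unit" are hypothetical names used only here.

```r
# Hypothetical helper: multiply x by a factor.
scale_by <- function(x, factor = 2) {
  return(x * factor)
}

obj <- c(a = 1, b = 2)
attr(obj, "unit") <- "USD"   # attach a custom attribute

print(class(obj))            # class of the object
print(mode(scale_by))        # mode of a function
print(names(obj))
print(attributes(obj))       # names plus the custom attribute

# Dynamic call: arguments are supplied as a named list.
result <- do.call(scale_by, list(x = obj, factor = 10))
print(result)
```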

random variables and distributions:

rnorm: generate random draws from the normal distribution (rbeta, rpois, etc. for other distributions)

qnorm: quantile function (inverse cumulative distribution function)

pnorm: cumulative distribution function

dnorm: probability density function
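
The four prefixes (r, q, p, d) can be sketched with the standard normal distribution. The seed and sample size are arbitrary.

```r
set.seed(42)                  # arbitrary seed, for reproducibility only
draws <- rnorm(1000, mean = 0, sd = 1)  # random draws

p <- pnorm(1.96)              # P(Z <= 1.96), roughly 0.975
q <- qnorm(0.975)             # quantile, roughly 1.96
d <- dnorm(0)                 # density at 0, equal to 1 / sqrt(2 * pi)

print(c(p = p, q = q, d = d))
print(qnorm(pnorm(1.3)))      # q and p are inverses of each other
```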

correlations:
cor: correlation coefficients
var: variance; when the input is a matrix, it returns the covariance matrix of its columns
cov: covariance
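
A quick sketch of cor, cov and var on made-up data, showing that var() on a matrix gives the covariance matrix.

```r
# Hypothetical paired observations.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 11)

r       <- cor(x, y)          # correlation coefficient
c_xy    <- cov(x, y)          # covariance
cov_mat <- var(cbind(x, y))   # matrix input: covariance matrix of the columns

print(r)
print(cov_mat)
```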

hypothesis test:
parametric:
t.test(formula, data, ...): e.g., t.test(extra ~ group, data = sleep)
var.test(formula, data, ...): compare the variances of two samples from normal populations.
non-parametric tests, including goodness of fit:
chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), ...): performs chi-squared contingency table tests and goodness-of-fit tests. Tests whether the counts in x are consistent with the category probabilities p, or whether x and y are independent
ks.test(x, y, ..., alternative = c("two.sided", "less", "greater"), exact = NULL): y is either a numeric vector of data values, or a character string naming a cumulative distribution function or an actual cumulative distribution function such as pnorm. Only continuous CDFs are valid.
fisher.test(x, y = NULL,...): Performs Fisher's exact test for testing the null of independence of rows and columns in a contingency table with fixed marginals.
mcnemar.test(x, y = NULL, ...): Performs McNemar's chi-squared test for symmetry of rows and columns in a two-dimensional contingency table. The null is that the probabilities of being classified into cells [i,j] and [j,i] are the same.
binom.test(x, n, p = 0.5,...): Performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment. Can be used as a sign test, e.g. to test the sign of a random variable or whether two paired samples differ significantly
rank statistics: independence test
cor.test(x, y, alternative = c("two.sided", "less", "greater"), method = c("pearson", "kendall", "spearman"), conf.level = 0.95, ...): Test for association between paired samples, using one of Pearson's product moment correlation coefficient, Kendall's tau or Spearman's rho.
wilcox.test(x, ...): a non-parametric hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ
normality test:
shapiro.test()
jarque.bera.test() (requires library "tseries")
stationarity test:
adf.test() (requires library "tseries")
ca.jo(): Johansen cointegration test (requires library "urca")
autocorrelation test:
Box.test()
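
A few of the tests above, sketched with base R only (the tseries and urca functions are left out so the example has no package dependencies). The simulated data and seed are arbitrary; each call returns an object of class "htest" whose p.value field holds the p-value.

```r
# Two-sample t-test on the built-in sleep data set.
tt <- t.test(extra ~ group, data = sleep)

set.seed(1)                   # arbitrary seed
z <- rnorm(200)               # simulated sample
w <- z + rnorm(200)           # a second sample, correlated with z

sw <- shapiro.test(z)                              # normality
ks <- ks.test(z, "pnorm")                          # goodness of fit to N(0, 1)
ct <- cor.test(z, w, method = "spearman")          # rank-based association
lb <- Box.test(z, lag = 10, type = "Ljung-Box")    # autocorrelation

print(tt$p.value)
print(c(shapiro = sw$p.value, ks = ks$p.value,
        spearman = ct$p.value, ljung_box = lb$p.value))
```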

bootstrap:
sample(x, size, replace = FALSE, prob = NULL): takes a sample of the specified size from the elements of x, either with or without replacement.
boot(data, statistic, R, ...): requires library "boot". statistic is a function which, when applied to data, returns a vector containing the statistic(s) of interest; its second argument receives the indices that define each bootstrap sample. R is the number of bootstrap replicates.
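
A hand-rolled bootstrap of the sample mean using only sample(), as a sketch of what boot() automates. The data, seed and number of replicates are assumptions for illustration.

```r
set.seed(123)                 # arbitrary seed
x <- rexp(100)                # hypothetical skewed sample
n_rep <- 1000                 # number of bootstrap replicates

# Resample with replacement and recompute the statistic each time.
boot_means <- replicate(n_rep, mean(sample(x, size = length(x), replace = TRUE)))

se_hat <- sd(boot_means)                          # bootstrap standard error
ci     <- quantile(boot_means, c(0.025, 0.975))   # percentile interval

print(se_hat)
print(ci)
```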

regression:
lm(formula, data, subset, ...): returns the fit results. Use summary(x) for a summary and coef(x) to retrieve the fitted coefficients; use anova(x) for analysis of variance and predict(x)/fitted(x) to get the fitted values; use kappa(x) to get the condition number, which indicates whether X has collinearity; use residuals(x) to get the residuals
related functions: add1 coef effects kappa predict residuals alias deviance family labels print step anova drop1 formula plot proj summary
regression diagnostics: influence.measures rstandard rstudent dffits cooks.distance dfbeta dfbetas covratio hatvalues hat
aov(formula, data, ...): similar to lm + anova, but the focus is on the effects of factors.
test normality of residuals: shapiro.test(); homogeneity: bartlett.test()
QQ plot: plot(model, 2)
polynomial regression: lm.pol<-lm(y~1+poly(x,2),data=alloy)
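
A minimal linear regression sketch on simulated data. The true coefficients (1 and 2), the noise level and the seed are assumptions chosen for the example.

```r
set.seed(7)                               # arbitrary seed
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.1)       # simulated: intercept 1, slope 2
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)
print(coef(fit))                          # fitted coefficients
print(summary(fit)$r.squared)

res  <- residuals(fit)                    # residuals
pred <- predict(fit, newdata = data.frame(x = 0.5))  # prediction at x = 0.5

# Polynomial regression of degree 2, same pattern as in the notes.
fit_poly <- lm(y ~ 1 + poly(x, 2), data = dat)
print(anova(fit, fit_poly))               # compare the two nested fits
```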

glm(formula, family, data, ...): generalized linear models, e.g. logistic regression with family = binomial.
nls(formula, data, ...): Determine the nonlinear (weighted) least-squares estimates of the parameters of a nonlinear model.
nlm(f, p, ...): This function carries out a minimization of the function f using a Newton-type algorithm.
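
A sketch of logistic regression with glm() and a direct minimization with nlm(). The simulated coefficients (-0.5 and 1.5), the seed and the toy objective are assumptions for illustration.

```r
set.seed(11)                              # arbitrary seed
n <- 500
x <- rnorm(n)
prob <- plogis(-0.5 + 1.5 * x)            # true success probabilities
y <- rbinom(n, size = 1, prob = prob)     # simulated 0/1 outcomes

logit_fit <- glm(y ~ x, family = binomial)
print(coef(logit_fit))
fitted_prob <- predict(logit_fit, type = "response")  # fitted probabilities

# nlm() minimizes a function of a parameter vector; the minimum here is at (1, 2).
objective <- function(p) sum((p - c(1, 2))^2)
nlm_res <- nlm(objective, p = c(0, 0))
print(nlm_res$estimate)
```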

model selection:
step(object, ...): Select a formula-based model by AIC.
add1(object, scope, ...) / drop1(object, ...): evaluate all single-term additions to / deletions from the model
models with built-in feature selection: ada, bagEarth, bagFDA, bstLs, bstSm, C5.0, C5.0Cost, C5.0Rules, C5.0Tree, cforest, ctree, ctree2, cubist, earth, enet, evtree, extraTrees, fda, gamboost, gbm, gcvEarth, glmnet, glmStepAIC, J48, JRip, lars, lars2, lasso, LMT, LogitBoost, M5, M5Rules, nodeHarvest, oblique.tree, OneR, ORFlog, ORFpls, ORFridge, ORFsvm, pam, parRF, PART, penalized, PenalizedLDA, qrf, relaxo, rf, rFerns, rpart, rpart2, rpartCost, RRF, RRFglobal, smda, sparseLDA
Apart from models with built-in feature selection, most approaches for reducing the number of predictors can be placed into two main categories.

Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance. caret has wrapper methods based on recursive feature elimination, genetic algorithms, and simulated annealing. E.g., caret's rfe function (recursive feature elimination). There is a risk of overfitting.
Filter methods evaluate the relevance of the predictors outside of the predictive models and subsequently model only the predictors that pass some criterion. caret has a general framework for using univariate filters. This method may choose redundant features and eliminate features that seem irrelevant to the target but can be important when combined with other variables.
Bayesian: give each variable an inclusion indicator. Note that this snippet is PyMC (Python), not R: to_include = pm.Bernoulli("to_include", 0.5, size=n_variable).
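
The AIC-based step() selection listed above can be sketched in base R. The data are simulated with two real predictors and one pure-noise column; all names, coefficients and the seed are assumptions.

```r
set.seed(3)                               # arbitrary seed
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)   # noise column is unrelated to y

full <- lm(y ~ x1 + x2 + noise, data = dat)

print(drop1(full))                        # AIC after dropping each single term
selected <- step(full, trace = 0)         # backward selection by AIC, silent
print(formula(selected))
```

Whether the noise column survives depends on the sample, so the sketch does not assume a particular selected formula.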

time series analysis:
arima(x, order = c(0L, 0L, 0L), seasonal = list(order = c(0L, 0L, 0L), period = NA), ...): in order, the three integer components (p, d, q) are the AR order, the degree of differencing, and the MA order.
determine the order: AIC/BIC, acf and pacf plots
auto.arima(x, d=NA, D=NA, max.p=5, max.q=5, ...): choose ARIMA model automatically
garch(x, order = c(1, 1), ...): requires library "tseries". garch uses a quasi-Newton optimizer to find the maximum likelihood estimates of the conditionally normal model.
if negative returns have more impact on volatility: use a nonlinear (asymmetric) GARCH model, e.g. eGARCH or gjrGARCH
tar(y, p1, p2, d, ...): two-regime Threshold Autoregressive (TAR) model (requires library "TSA"). The model is given by:
Y_t = \phi_{1,0} + \phi_{1,1} Y_{t-1} + \dots + \phi_{1,p_1} Y_{t-p_1} + \sigma_1 e_t, \text{ if } Y_{t-d} \le r
Y_t = \phi_{2,0} + \phi_{2,1} Y_{t-1} + \dots + \phi_{2,p_2} Y_{t-p_2} + \sigma_2 e_t, \text{ if } Y_{t-d} > r
p1: AR order for the lower regime; p2: AR order for the upper regime; d: delay; r: threshold.
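
A base R sketch of fitting an ARIMA model with arima(). The series is simulated from an AR(1) process with coefficient 0.6, which is an assumption for illustration; auto.arima, garch and tar are left out because they need extra packages.

```r
set.seed(5)                                   # arbitrary seed
y <- arima.sim(model = list(ar = 0.6), n = 500)   # simulated AR(1) series

# Sample partial autocorrelations help choose the AR order.
pacf_vals <- pacf(y, plot = FALSE)
print(round(pacf_vals$acf[1:3], 2))

fit <- arima(y, order = c(1, 0, 0))           # (p, d, q) = (1, 0, 0)
print(coef(fit))
print(AIC(fit))

fc <- predict(fit, n.ahead = 5)               # 5-step-ahead forecast
print(fc$pred)
```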

optimization:
optimize(f, interval, ...): one-dimensional optimization over an interval
optim(par, fn, gr = NULL, ..., method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent"), lower = -Inf, upper = Inf, control = list(), hessian = FALSE): general-purpose optimization
solve.QP(Dmat, dvec, Amat, bvec, meq=0, factorized=FALSE): requires library "quadprog". Solves quadratic programming problems of the form min(-d^T b + 1/2 b^T D b) subject to the constraints A^T b >= b_0.
DEoptim(fn, lower, upper, control = DEoptim.control(), ...): requires library "DEoptim". Differential evolution algorithm for optimizing complicated functions (minimizing maximum drawdown, etc.)
portfolio.optim(x, shorts=TRUE, ...): requires library "tseries". Computes an efficient portfolio from the given return series x in the mean-variance sense.
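
A sketch of the two base R optimizers. The Rosenbrock function is a standard test problem with its minimum at (1, 1); the one-dimensional objective is a made-up parabola.

```r
# Rosenbrock function: minimum value 0 at p = (1, 1).
rosenbrock <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2

res <- optim(par = c(-1.2, 1), fn = rosenbrock, method = "BFGS")
print(res$par)            # estimated minimizer
print(res$convergence)    # 0 means the optimizer reports success

# One-dimensional minimization over an interval.
one_d <- optimize(function(x) (x - 2)^2, interval = c(0, 5))
print(one_d$minimum)
```

Both functions minimize by default; to maximize, negate the objective (or pass maximum = TRUE to optimize).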

convert numeric to factor:
factor(x = character(), levels, labels = levels, ...): encode a vector as a factor
tapply(X, INDEX, FUN = NULL, ...): Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors. E.g.: tapply(height, sex, mean)
gl(n, k, length = n*k, labels = 1:n, ordered = FALSE): generate levels 1:n with k repetitions for each level
table(x): count occurrences, similar to Python's collections.Counter
cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, ...): convert numeric to factor by binning; breaks must be specified
quantile(x, ...): a way of choosing breaks with roughly equal content (rather than equal length).
split(x, f, drop = FALSE, ...): divide x into the groups defined by the factor f
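
The factor helpers above can be sketched on a small made-up data set of heights.

```r
# Hypothetical heights (cm) and a grouping factor.
height <- c(170, 182, 165, 190, 158, 175)
sex    <- factor(c("F", "M", "F", "M", "F", "M"))

mean_by_sex <- tapply(height, sex, mean)      # mean height per level
groups      <- split(height, sex)             # list with one vector per level

# Bin a numeric vector into a factor; intervals are right-closed by default.
bins   <- cut(height, breaks = c(150, 170, 180, 200))
counts <- table(bins)                         # occurrences per bin

levels_2x3 <- gl(2, 3)                        # levels 1:2, each repeated 3 times

print(mean_by_sex)
print(counts)
```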

read and write data:

from csv: read.table(file, header = FALSE, sep = "", ...), or read.csv(file, ...), which defaults to header = TRUE and sep = ","

to csv: write.csv(x, file), write.table(x, file = "", ...)

from excel: readWorksheet(...), requires the XLConnect package

from database: require(RODBC) or require(RMySQL)
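
A CSV round trip, sketched with a temporary file so nothing is left on disk. The data frame is invented for illustration.

```r
# Hypothetical data to write out.
df <- data.frame(id = 1:3, price = c(10.5, 11, 9.75))

path <- tempfile(fileext = ".csv")            # temporary file path
write.csv(df, file = path, row.names = FALSE) # without row names
back <- read.csv(path)                        # header = TRUE, sep = "," by default
unlink(path)                                  # clean up

print(back)
```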

plotting:

plot(x, y, type = "l", sub, xlab, ylab)

par(mfrow = c(2,2))

other common functions:

stats calculations: mean, sd, median, range, cumsum, sum, diff, length

seq, table, apply

1.3 classes/objects

inherent attributes: mode (numeric, logical, character, etc) and length
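use names(x) or attributes(x) to retrieve the attributes of object x
S4 class: define with setClass(), create instances with new(), access slots with slot(obj, "name") or obj@name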
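
A minimal S4 sketch. The class "Trade" and its slots are hypothetical names chosen for the example.

```r
library(methods)   # S4 machinery; loaded by default in interactive R

# Hypothetical S4 class with two typed slots.
setClass("Trade", slots = c(symbol = "character", qty = "numeric"))

tr <- new("Trade", symbol = "AAA", qty = 100)

print(isVirtualClass("Trade"))   # FALSE: the class can be instantiated
print(slotNames(tr))             # names of the slots
print(slot(tr, "qty"))           # slot access by name
print(tr@symbol)                 # same thing with the @ operator
```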

1.4 packages
Machine learning packages: http://cran.r-project.org/web/views/MachineLearning.html
neuralnet: Neural network. E.g., net.sqrt = neuralnet(Output~Input,trainingdata, hidden=10, threshold=0.01)
clustering:

mahalanobis(x, center, cov, inverted=FALSE, ...): calculate Mahalanobis distance
hierarchical clustering method: hclust(d, method = "complete", members=NULL)
principal component analysis: princomp(formula, data = NULL, subset, na.action, ...)
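
A sketch of the three functions above on simulated two-dimensional data. The two point clouds, their centers and the seed are assumptions for illustration.

```r
set.seed(9)                                   # arbitrary seed
# Two hypothetical clouds of 20 points each, centered near (0, 0) and (5, 5).
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

hc     <- hclust(dist(X), method = "complete")  # hierarchical clustering
labels <- cutree(hc, k = 2)                     # cut the tree into 2 clusters
print(table(labels))

pca <- princomp(X)                              # principal component analysis
print(pca$sdev)                                 # std. dev. of each component

md <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared distances
print(summary(md))
```

Note that mahalanobis() returns squared distances.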

rugarch: includes various GARCH models. For the variance model, valid models (currently implemented) are “sGARCH”, “fGARCH”, “eGARCH”, “gjrGARCH”, “apARCH”, “iGARCH” and “csGARCH”. submodel: if the model is “fGARCH”, valid submodels are “GARCH”, “TGARCH”, “AVGARCH”, “NGARCH”, “NAGARCH”, “APARCH”, “GJRGARCH” and “ALLGARCH”. mean model: list containing the mean model specification. armaOrder: the autoregressive (ar) and moving average (ma) orders (if any).

1.5 Systems

.libPaths(): path for libraries
system.file(package=...): check package file location
R CMD build/INSTALL: build/install packages

Sunday, March 8, 2015

A cheatsheet for R