Speed comparison of python-pandas and R for some Spearman correlations
Today I had to compute a number of Spearman correlations for work.
Yesterday I watched Wes McKinney's talk on pandas at PyCon 2012 (great talk if you don't know it!), so I thought it would be cool to have a look at python-pandas, as a) I like Python, b) I don't know pandas, and c) pandas can do Spearman correlations.
So I started to write some code in Python using pandas:
    #!/usr/bin/python
    #-*- coding: UTF-8 -*-

    import pandas


    def main(inputfile='binary_matrix.csv'):
        # Read the matrix and transpose it so each original row becomes a column.
        data = pandas.read_csv(inputfile, index_col=0)
        data = data.transpose()
        cnt_core = 0
        # Start at 1 so each column is compared with the one before it
        # (starting at 0 would pair the first column with the last one).
        for cnt in range(1, len(data.columns)):
            correlation = data[data.columns[cnt - 1]].corr(
                data[data.columns[cnt]], method='spearman')
            cnt_core = cnt_core + 1
            # Only report pairs for which a coefficient could be computed.
            if not pandas.isnull(correlation):
                print("%s correlates with %s with coefficient %f" % (
                    data.columns[cnt - 1], data.columns[cnt], correlation))
        print("%i correlations performed" % cnt_core)


    if __name__ == '__main__':
        main('binary_matrix.csv')
Nothing fancy, just a simple main function looping over the columns to correlate each with the previous one.
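For readers who have not met the coefficient before: Spearman's correlation is the Pearson correlation computed on the ranks of the values, with tied values sharing the average of their ranks. The sketch below is a pure-Python illustration of that definition applied to adjacent columns. It is not how pandas or R implement it, and the helper names (`average_ranks`, `pearson`, `spearman`, `adjacent_spearman`) and the toy data are made up for the example.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def adjacent_spearman(columns):
    """Correlate each column with the previous one, in insertion order."""
    names = list(columns)
    return [
        (names[i - 1], names[i], spearman(columns[names[i - 1]], columns[names[i]]))
        for i in range(1, len(names))
    ]


# Hypothetical binary toy data, four columns of four observations each.
toy = {
    "a": [0, 1, 0, 1],
    "b": [0, 1, 0, 1],
    "c": [1, 0, 1, 0],
    "d": [1, 1, 1, 1],
}
results = adjacent_spearman(toy)
for previous, current, rho in results:
    print("%s vs %s: %f" % (previous, current, rho))
```

In this toy data the constant column `d` has no variance, so its coefficient is undefined (NaN). That is the kind of pair the `pandas.isnull` check in the script above skips.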
Then I wanted to check the results and, since I know R, I wrote the equivalent code in R:
    data <- read.table('binary_matrix.csv', row.names=1, header=TRUE,
                       sep=",", quote = "\"'")
    data <- as.data.frame(t(data))
    cnt_core <- 0
    for (cnt in 2:ncol(data)) {
        correlation <- cor(data[cnt - 1], data[cnt], method='spearman',
                           use='pairw')
        cnt_core <- cnt_core + 1
        if (! is.na(correlation)) {
            print(sprintf("%s correlates with %s with coefficient %f",
                          colnames(data)[cnt - 1], colnames(data)[cnt],
                          correlation))
        }
    }
    print(sprintf("%i correlations performed", cnt_core))
For the record, the input is a matrix of 36 columns by 35483 rows (which is transposed just after reading).
And of course I timed both scripts:
    $ time python spearman_correlation.py
    [...]
    35482 correlations performed

    real    0m20.379s
    user    0m20.112s
    sys     0m0.178s
and
    $ time Rscript spearman_correlation.R
    [...]
    [1] "35482 correlations performed"

    real    0m32.907s
    user    0m32.549s
    sys     0m0.182s
Note: although I do not show the results here, trust me, they were equal.
So, over 35,482 correlations, the Python version took about 38% less wall-clock time than the R one (20.4 s versus 32.9 s), which I find quite impressive.
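The percentage can be rechecked from the `real` times reported by `time` above. Here is a minimal sketch of that arithmetic, which also expresses the gap as a speed-up ratio. The variable names are mine and are not taken from either script.

```python
# Wall-clock ("real") times in seconds, copied from the two `time` runs.
python_seconds = 20.379
r_seconds = 32.907

# Fraction of the R run time that the Python run saved.
time_saved = 1 - python_seconds / r_seconds

# How many times faster the Python run was than the R run.
speedup = r_seconds / python_seconds

print("Python took %.1f%% less time than R" % (time_saved * 100))
print("Python was %.2f times as fast as R" % speedup)
```

With these timings it comes out at roughly 38% less time, or about 1.6 times as fast. This is a single run on one machine, so it gives a rough indication rather than a benchmark.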