Using R on Hadoop with Rhipe
I spent a while this week getting Rhipe, a Java package that integrates the R environment with Hadoop, to work. Forward are pretty heavy users of Hadoop and its supporting ecosystem, so R will be another way for the devs to interrogate the huge (and rapidly growing!) datasets we have.
Installing R
Adding the repository
Create a new file at /etc/apt/sources.list.d/R.list
# R repository
deb http://rh-mirror.linux.iastate.edu/CRAN/bin/linux/ubuntu hardy/
(we are still using hardy, with the Cloudera packages)
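If you would rather script this step, here is a minimal sketch. write_r_source is a hypothetical helper name, and the default path assumes apt's usual /etc/apt/sources.list.d directory; pass a different path to try it out without root.

```shell
# Hypothetical helper: write the CRAN apt source entry to a file.
# The default target assumes apt reads extra sources from /etc/apt/sources.list.d.
write_r_source() {
    target="${1:-/etc/apt/sources.list.d/R.list}"
    {
        echo '# R repository'
        echo 'deb http://rh-mirror.linux.iastate.edu/CRAN/bin/linux/ubuntu hardy/'
    } > "$target"
}
```

Run write_r_source as root, then run sudo apt-get update once the repository key has been added so that apt picks up the new package lists.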
Add the gpg keys for the repository
gpg --keyserver pgp.mit.edu --recv-key E2A11821
gpg -a --export E2A11821 | sudo apt-key add -
Install and update R
Easy:
$ sudo apt-get install r-base r-base-dev pkg-config littler
$ sudo R
> update.packages()
Set environment variables for Rhipe
Add to the bottom of /etc/environment
HADOOP=/usr
Create it for the current session
$ export HADOOP=/usr
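A quick way to sanity-check the value, as a sketch only: this assumes Rhipe treats HADOOP as the prefix under which it expects bin/hadoop, which would explain /usr here if the packaged launcher lives at /usr/bin/hadoop. check_hadoop_home is a hypothetical helper name, not part of Rhipe or Hadoop.

```shell
# Hypothetical helper: report whether a Hadoop launcher exists under a prefix.
# With no argument it falls back to the HADOOP environment variable.
check_hadoop_home() {
    prefix="${1:-$HADOOP}"
    if [ -z "$prefix" ]; then
        echo "HADOOP is not set"
        return 1
    fi
    if [ -x "$prefix/bin/hadoop" ]; then
        echo "found $prefix/bin/hadoop"
    else
        echo "no executable at $prefix/bin/hadoop"
        return 1
    fi
}
```

Calling check_hadoop_home with no arguments checks the current session's HADOOP value; a non-zero exit status means the variable is unset or points somewhere without a launcher.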
Install protobuf
# wget http://protobuf.googlecode.com/files/protobuf-2.3.0.tar.bz2
# tar jxf protobuf-2.3.0.tar.bz2
# cd protobuf-2.3.0
# ./configure
# make
# make install
# ldconfig
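Before moving on, it is worth confirming that the protoc on your PATH is the one just built. A minimal sketch, where check_protobuf is a hypothetical helper name; it relies on protoc --version printing a line of the form "libprotoc 2.3.0".

```shell
# Hypothetical helper: compare the installed protoc version with the expected one.
check_protobuf() {
    want="$1"
    if ! command -v protoc >/dev/null 2>&1; then
        echo "protoc not found on PATH"
        return 1
    fi
    # "protoc --version" prints e.g. "libprotoc 2.3.0"; keep the second field.
    have=$(protoc --version | awk '{print $2}')
    if [ "$have" = "$want" ]; then
        echo "protoc $have OK"
    else
        echo "protoc is $have, expected $want"
        return 1
    fi
}
```

For the build above, call it as check_protobuf 2.3.0. If an older packaged protoc shadows the freshly built one, the mismatch shows up here rather than halfway through the Rhipe install.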
Install Rhipe
# wget http://www.stat.purdue.edu/~sguha/rhipe/dn/Rhipe_0.64.tar.gz
# R CMD INSTALL Rhipe_0.64.tar.gz
So all is well except that the test code here is a bit off.
For me, today,
> library(Rhipe)
only works as root. It seems that
> rhwrite(list(1,2,3),"/tmp/x")
should be:
> rhwrite(list(1,2,3),"/tmp/x",1)
then
> rhread("/tmp/x")
works properly.
Also in the longer example
map <- expression({
  lapply(seq_along(map.values), function(r) {
    x <- runif(map.values[[r]])
    rhcollect(map.keys[[r]], c(n=map.values[[r]], mean=mean(x), sd=sd(x)))
  })
})
## Create a job object
z <- rhmr(map, ofolder="/tmp/test", inout=c('lapply','sequence'),
          N=10, mapred=list(mapred.reduce.tasks=0), jobname='test')
## Submit the job
rhex(z)
## Read the results
res <- rhread('/tmp/test/p*')
colres <- do.call('rbind', lapply(res, "[[", 2))
colres
       n      mean        sd
 [1,]  1 0.4983786        NA
 [2,]  2 0.7683017 0.2937688
 [3,]  3 0.5936899 0.3425441
 [4,]  4 0.3699087 0.2666379
 [5,]  5 0.5179839 0.4060244
 [6,]  6 0.6278925 0.2952608
 [7,]  7 0.4920088 0.2785893
 [8,]  8 0.4592598 0.2674592
 [9,]  9 0.5734197 0.1928496
[10,] 10 0.4942676 0.2989538
Here the rhread call has been changed to use a p* glob, which matches Hadoop's part-* output files. The original was:
res <- rhread('/tmp/test')
Thanks to Saptarshi Guha, the author of Rhipe, for responding so quickly to my query in the group, and also to the authors of this discussion on setting up R in Ubuntu.