Statistical compute capability in HP Vertica community edition leveraging R language integration with UDTF(User defined transform function)

HP_Vertica_Logo_0Rlanguage

We already saw the vertica database setup in a virtual box locally. Now we look at setup of Vertica-R UDTF integration in RHEL6. R is a popular statistical computing language and vertica is a good columnar store database. The combination of both can be a great advantage for data analyst who want to bring their analytical algorithms to an enterprise platform as a solution.

1. Lets get to Installation of Vertica-R

Dependencies:
sudo rpm -ivh libgomp-4.4.7-3.el6.x86_64.rpm
sudo rpm -ivh compat-libgfortran-41-4.1.2-39.el6.x86_64.rpm
sudo rpm -ivh libtool-ltdl-2.2.6-15.5.el6.x86_64.rpm
sudo rpm -ivh libgfortran-4.4.7-3.el6.x86_64.rpm
sudo rpm -ivh blas-3.2.1-4.el6.x86_64.rpm
sudo rpm -ivh lapack-3.2.1-4.el6.x86_64.rpm
sudo rpm -ivh vertica-R-lang-7.0.1-0.x86_64.RHEL5.rpm

2. Setup JAGS library

tar xvf JAGS-3.3.0.tar.gz
sudo ln -s /usr/lib64/libblas.so.3.2.1 /usr/lib64/libblas.so
sudo ln -s /usr/lib64/liblapack.so.3.2.1 /usr/lib64/liblapack.so
LDFLAGS=-L/usr/lib64/ ./configure --with-blas=lblas --with-lapack=llapack 
sudo make install

Jags libraries are located in /usr/local/lib/libjags*
sudo ln -s /usr/local/lib/libjags.so.3 /usr/lib64/libjags.so.3

3. Environment Variables

export PATH=$PATH:/opt/vertica/bin:/opt/vertica/R/bin
export R_HOME=/opt/vertica/R
export JAVA_HOME=/usr/lib/jvm/java-1.7.0/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64:/usr/local/lib/JAGS/modules-3:/usr/lib/jvm/java-1.7.0/jre/lib/amd64/server/

4. Test using the R prompt

 /opt/vertica/R/bin/R

5. Install R packages in R prompt from CRAN R repository

 sudo /opt/vertica/R/bin/R
 Set proxy -> Sys.setenv(http_proxy="http://garlic.xxxxxx.com:8080")
install.packages("coda_0.16-1.tar.gz",repos = NULL, type = "source")
install.packages("rjags_3-10.tar.gz", repos = NULL, configure.args="--with-jags-include=/usr/local/include/JAGS --with-jags-lib=/usr/local/lib --with-jags-modules=/usr/local/lib/JAGS/modules-3", type = "source")
install.packages("abind_1.4-0.tar.gz",repos = NULL, type = "source") 
install.packages("R2WinBUGS_2.1-19.tar.gz",repos = NULL, type = "source") 
install.packages("R2jags_0.03-09.tar.gz",repos = NULL, type = "source")
install.packages("plyr_1.7.tar.gz",repos = NULL, type = "source")
install.packages("digest_0.6.2.tar.gz",repos = NULL, type = "source")
install.packages("memoise_0.1.tar.gz",repos = NULL, type = "source")
install.packages("stringr_0.6.tar.gz",repos = NULL, type = "source")
install.packages("lubridate_1.2.0.tar.gz",repos = NULL, type = "source")

6. RJDBC, Configure R with JAVA

sudo /opt/vertica/R/bin/R CMD javareconf
 
Install R-JAVA from R prompt
Make sure that the Java home is pointing to a JDK
sudo /opt/vertica/R/bin/R
install.packages("rJava_0.9-4.tar.gz",repos = NULL, type = "source")
install.packages("DBI_0.2-7.tar.gz",repos = NULL, type = "source")
install.packages("RJDBC_0.2-1.tar.gz",repos = NULL, type = "source")

——————————————————————————————————
————————-Execution and Testing of the setup—————————
——————————————————————————————————

A. R LANGUAGE: Test RJDBC from R prompt

> library(RJDBC)
   Loading required package: DBI
   Loading required package: rJava
> vDriver = JDBC(driverClass="com.vertica.jdbc.Driver",classPath="/opt/vertica/java/lib/vertica-jdbc.jar")
> vertica = dbConnect(vDriver, "jdbc:vertica://localhost:5433/rhldb", "dbadmin", "rhldb")
> mytestresults = dbGetQuery(vertica, "select * from user_info")
> print(mytestresults)
   user_id user_name first_name last_name email password
   1 1 guest guest guest raul_36@yahoo.com guest

B. Test JAGS compute in R

PRE-REQUISITE: Create a file testjagsmodel.bug and dump the following content in order to create a model file
model {
 for (i in 1:N) {
 x[i] ~ dnorm(mu, tau)
 }
 mu ~ dnorm(0, .0001)
 tau

JAGS Test at R prompt

> library('rjags')
Loading required package: coda
Loading required package: lattice
Linked to JAGS 3.3.0
Loaded modules: basemod,bugs
> N x jags
Initializing model
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
> update(jags, 1000)
 |**************************************************| 100%
> jags.samples(jags, c('mu', 'tau'), 1000)
 |**************************************************| 100%
$mu
mcarray:
[1] 0.05442777
Marginalizing over: iteration(1000),chain(4)
$tau
mcarray:
[1] 0.0433561
Marginalizing over: iteration(1000),chain(4)

————————————————–RESULT———————————————

When you call jags.sample(), you’ll see the output provides proposed values for mu and tau. These should be close to 0 and 0.04 if JAGS is working properly.

Since those were the mean and precision values we used to create our simulation data. (At the risk of being pedantic: we used a standard deviation of 5, which gives a variance of 25 and a precision of 1 / 25 = 0.04.) Of course, they’ll be even closer to the sample mean mean(x) and the sample precision 1 / var(x), so you should not forget to compare the inferred values to these values. The sample size, 1,000, isn’t large enough to guarantee that the mean will be all that close to 0.

———————————————————————————————————-

I would continue to post more on the UDTF and statistical R compute related programming for generating analytics based code.

Advertisements

One thought on “Statistical compute capability in HP Vertica community edition leveraging R language integration with UDTF(User defined transform function)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s