Adding and implementing R statistical analytics capability in Vertica using User Defined Transform Function (UDTF) as Vertica-R-UDX

Illustrated in the vertica forums. I am adding my own examples that got me kickstarted in the R-UDX area.

We shall start simple with a UDF that does multiplication operation on the data provided in the query. The rhlTestAlgebra.R is the UDF that will take two inputs and return their product.

1. Write the R-UDF code for the multiplication operation

# Write a UDTF named rhlTestAlgebra.R #
# A test UDTF for testing
rhlTestAlgebraFactory <- function()
# Takes i
name = multiplyValues,
udxtype = c("transform"),
intype = c("int","int"),
outtype = c("int"),
outnames = c("product")

# INPUT num1, num2
multiplyValues <- function(input)
# Initialize libraries
# initialize the inputs
num1 <- input[,1]
num2 <- input[,2]
num3 <- (num1 * num2)
dataResult <- data.frame(num3)
}, error = function(errStr)
# Format and log error in DB
}, finally =
# Close DB connection

# Function to load necessary packages
loadPackages <- function()
# Use RJDBC for connecting to database
# For logging time
# Use R2JAGS for calling JAGS from R
# binomial library
# plot library
# Set the random number generator seed
}, error = function(e)

2. Compile and install this library within Vertica

CREATE LIBRARY algebralib AS '/home/dbadmin/share/workspace/rhlTestAlgebra.R' LANGUAGE 'R';
CREATE TRANSFORM FUNCTION multiplyValues AS LANGUAGE 'R' NAME 'rhlTestAlgebraFactory' LIBRARY algebralib;

3. Execute the newly written UDTF for the results

SELECT multiplyValues(9,74) OVER();




HP Vertica community edition Setup for the dummies, Not really !!

——————————————————————————————————————————– HP_Vertica_Logo_0 ——————————————————————————————————————————–   PHEW !!!! Here I have noted the steps involved in bringing a single node of HP Vertica alive on a local Virtual box VM. I hope it covers all the pre-requisites that may ruin the fun.

A. Register and download the Vertica 7 rpms from here
B. Configure the virtual box so that we dont lose the IP. 
Memory provisioned is 4GB RAM, 2 processors and 50GB of disk storage 
I have two network adapters enabled in the virtual box VM settings.
The first network adapter configured as Bridged-Adapter and can connect to internet
The second network adapter is attached to Host-only-Adapter which generates a local IP like
C. Installation of a CentOS 6.5 x64 machine to serve as the host to Vertica 7. Download CentOS
LVM is not supported by Vertica 7, need to have EXT3 or EXT4 partitions 
Create a non-LVM partition with ext4 mounted at root / and a swap of 2GB
D. Add relevant proxy and variables if needed in your development environment
 export http_proxy=
 export HTTP_PROXY=
 export proxy=
 export TZ="Asia/Kolkata"
In case bridged and local IP is not reachable then use dhclient for the ethernet address discoveryby configuring the DHCP
E. Setup a SAMBA share for file transfer to and from this virtual machine 
I had to setup the EPEL repository so that I have the a test updated for download (
I had to configure proxy in ~/.bashrc and yum.conf for making sure my VM is connecting to the internet
$ sudo yum install samba
$ mkdir /home/dbadmin/share
Configure as in
$ vim /etc/samba/smb.conf
Add the following in the end of conf where Share Definitions are mentioned
        path = /home/dbadmin/share
        writeable = yes
        browseable = yes
        valid users = dbadmin
Any user that requires to connect to this share needs to be given a samba password
$ sudo smbpasswd -a dbadmin
Start the samba server
$ sudo service smb restart
$ sudo service nmb restart
 Make sure you have a user created for samba for this share 
F. Install Vertica (vertica-7.0.1-0.x86_64.RHEL5.rpm) Database, The installation is being done as a super user here.

PREFACE: Just to scare the administrator I am starting with the horror I witnessed,when I just went ahead with the usual rpm and install_vertica route

Pre-Requisites and if not adhered to is served by these beautiful errors
In case pre-requisite are not adhered to then following error occur

Failures during local (OS) configuration for verify-
 HINT (S0151): 
 These disks do not have known IO schedulers: '/dev/mapper/VolGroup-
 lv_root' ('') = ''
 HINT (S0231):
 dbadmin user ('dbadmin') primary group is not verticadba ('verticadba')
 HINT (S0305):
 TZ is unset for dbadmin. Consider updating .profile or .bashrc
 WARN (S0170):
 lvscan (LVM utility) indicates some active volumes.
 FAIL (S0150):
 These disks do not have 'deadline' or 'noop' IO scheduling: '/dev/sda1'
 ('sda') = 'cfq'
 FAIL (S0020):
 Readahead size of sda (/dev/sda1) is too low for typical systems: 256 <
 2048 Readahead size of (/dev/mapper/VolGroup-lv_root) is too low for
 typical systems: 256 < 2048
 FAIL (S0030):
 ntpd process is not running: ['ntpd', 'ntp']
 FAIL (S0310):
 Transparent hugepages is set to 'always'. Must be 'never' or 'madvise'.
1. Create a 2GB swap space in case a swap partition was not created initially during installation of OS
$ grep SwapTotal /proc/meminfo
$ dd if=/dev/zero of=/mnt/swapfile bs=1024 count=2097152 
$ mkswap /mnt/swapfile 
$ swapon /mnt/swapfile
2. Network Time Protocol (NTP) 
$ runlevel 
$ chkconfig --list ntpd
$ /sbin/service ntpd restart 
$ /sbin/chkconfig ntpd on 
$ /usr/sbin/ntpq -c rv | grep stratum
3. Disable SELinux at vim /etc/selinux/config by editing SELINUX=disabled
$ setenforce 0

4. CPU Scaling set to performance.
Make sure you configure the performance parameter in cpu scaling info at /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f $CPUFREQ ] || continue; echo -n performance > $CPUFREQ; done

5. Vertica doesn’t support transparent hugepages. 
$ cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
$ sysctl -a | grep hugepage
$ vim /etc/grub.conf and add transparent_hugepage=never to the kernel line between root and initrd
Also, if the above solution is a thread to your kernel then add the following tp your /etc/rc.local file
  if test -f /sys/kernel/mm/redhat_transparent_hugepage/enabled; then
     echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
6. Disk Readahead, Vertica requires that Disk readahead by set to at least 2048 
• Do it for sda (which is the block storage): 
$ /sbin/blockdev --setra 2048 /dev/sda
$ echo '/sbin/blockdev --setra 2048 /dev/sda' >> /etc/rc.local

• Do it for sda1 (which is the block storage): 
$ /sbin/blockdev --setra 2048 /dev/sda1
$ echo '/sbin/blockdev --setra 2048 /dev/sda1' >> /etc/rc.local 
Verify at cat /etc/rc.local
7. Vertica requires either deadline or noop I/0 scheduler. And CFQ creates performance bottlenecks.

Permanent setting can be done by adding the following in kernel cmd line in vim /etc/grub.conf 
Configure the I/O Scheduler - Changing the Scheduler Through the /sys Directory
• Check the current status: 
$ cat /sys/block/sda/queue/scheduler 
  noop anticipatory deadline [cfq] 
• Change for the current run: 
$ echo deadline > /sys/block/sda/queue/scheduler 
• Add the change to rc.local so it survives reboot: 
$ echo 'echo deadline > /sys/block/sda/queue/scheduler' >> /etc/rc.local 
• Check the status: 
$ cat /sys/block/sda/queue/scheduler 
  noop anticipatory [deadline] cfq 
$ find / -iname "scheduler"
Ok, we have sda device. You can confirm it by: blkid command or fdisk -l
NOTE: There should not be any LVM partitions associated with vertica 
Make the change to /dev/mapper/VolGroup-lv_root as well
so Check the dev device name for partition like /dev/mapper/VolGroup-lv_root, lvdisplay -v or ll /dev/VolGroup/
In my case it is -> Block device 253:0s
Unable to resolve this one as LVM was used and had to reinstall my OS with EXT4
8. TZ Environment Variable
$ yum update tzdata
$ cat /etc/sysconfig/clock
$ export TZ="Asia/Kolkata"
9. Add dbadmin to the verticadba group as primary user
Setup vertica with dbadmin@
$ adduser dbadmin $ passwd dbadmin Add this to the /etc/sudoers list - create a verticadba group and add dbadmin to it $ id dbadmin $ groupadd -f verticadba //usermod -a -G verticadba dbadmin $ usermod -g verticadba dbadmin $ usermod -m -d /home/dbadmin/ dbadmin $ chown -R dbadmin:verticadba /home/dbadmin
10. Disable firewall for now, later add specific port forwarding
$ service iptables save
$ service iptables stop
11. Append the host name to /etc/hosts verticalocalhost
12. Finally we arrive at the actual installation of Vertica7, I was so happy with vertica 6 where my installation notes amounted to about 10 lines only.
Install vertica RPM
$ rpm -ivh vertica-7.0.1-0.x86_64.RHEL5.rpm
Check the installation by: ls /opt/vertica/ and initialize the vertica nodes using the install
$ /opt/vertica/sbin/install_vertica --hosts verticalocalhost --rpm vertica-7.0.1-0.x86_64.RHEL5.rpm -u dbadmin
13. Initialize the node and create the database
Use the configuration tool and go to the configuration menu and create a new database
You will have to accept the license for community edition as well.
Here you can create a database and associate a password through the configuration menu.
$ /opt/vertica/bin/adminTools

14. Connect to the database through the VSQL Vertica client
$ /opt/vertica/bin/vsql -U dbadmin

Here is a snapshot of the vertica adminTools interface. Its great for me as i prefer putty and this interface is smooth there.


Now, that is some nasty list of command for the setup. I just poured it all here. I hope it could help somebody. I will be laying my hands on the UDTF functions which is the core of analytics there. I hope i would be able to cover how beautifully it integrates with R-language in order to provide for extreme big data statistical analytics within the heart of Vertica.

There is a great database client called dbeaver  which is also called a free universal database manager as it can connect to most known sql databases easily over JDBC. Make the connections as per the settings below and kickstart your Vertica experience


Here is a bash script to initialize and start the database when we start the linux instance. This will take the burden off the simple db initialization, start, stop operations.

# description: Vertica service
export VERTICA_HOME=/opt/vertica
export VERTICA_USER=dbadmin
export VERTICA_DBPASS=xxxdb
export VERTICA_PORT=5433
export VERTICA_HOST=verticalocalhost
echo "------------------------------------STATE---------------------------------------"
# Pre check of the running process
vertica_ps=`bash -c "ps -eaf | grep vertica | grep $VERTICA_DB | grep $VERTICA_PORT"`
echo $vertica_ps
vertica_pid=`bash -c "ps -eaf | grep vertica | grep $VERTICA_DB | grep $VERTICA_PORT | awk '{print $2}'"`
#echo "PID: $vertica_pid"
echo "--------------------------------------------------------------------------------"
case $1 in

 sudo bash -c "service iptables stop"
 if sudo bash -c "/opt/vertica/sbin/install_vertica --hosts $VERTICA_HOST --rpm /home/dbadmin/share/vertica-7.0.1-0.x86_64.RHEL5.rpm -u $VERTICA_USER"; then
 echo -n "Vertica $1 successful"
 $0 start
 echo -n "Error in Vertica $1: $?"

 if bash -c "/opt/vertica/bin/admintools --tool stop_db -d $VERTICA_DB -p $VERTICA_DBPASS"; then
 echo -n "Vertica $1 $VERTICA_DB successful"
 echo -n "Error in Vertica $1: $?"

 $0 stop
 $0 start

 if bash -c "/opt/vertica/bin/admintools --tool start_db -d $VERTICA_DB -p $VERTICA_DBPASS"; then
 echo -n "Vertica $1 $VERTICA_DB successful"
 echo -n "Error in Vertica $1: $?"

 echo "To be done"
 echo $"Usage: $0 {start|stop|restart|initdb|status}"
 exit 1


15. Change in IP address of the server

If you are a developer like me who works on machines with dynamically assigned IPs. It is a pain to start the server with the new IP. here I have tried to provide a fix by going into the internal configurations in order to get this right.

  1. Make sure the /etc/hosts file is updated with the new IP
  2. Edit the /opt/vertica/config/admintools.conf and replace all the older IPs with the new ones.
  3. Edit the DB spread configuration file with the new IP.  /home/dbadmin/XXXDB/v_teamdb_node0001_catalog/spread.conf
  4. Edit the DB catalog configuration file with the new IP.  /home/dbadmin/XXXDB/v_teamdb_node0001_catalog/Catalog/
  5. Go to the admintools and start the database with your fingers crossed.

In case your database started with this configuration then its your lucky day today.


16. Backup the server data

This is the most important section.Now that you have a database in production its prudent to have a mechanism for backup and restore. Here I am adding a mechanism to backup the data.

1. Create your configuration file with the needed parameters. 
   $  --setupconfig
     Snapshot name (backup_snapshot): XXXDB_2016_08_10
     Number of restore points (1): 1
     Specify objects (no default):
     Object restore mode (coexist, createOrReplace or create) (createOrReplace):
     Vertica user name (dbadmin):
     Save password to avoid runtime prompt? (n) [y/n]: y
     Database user password to save in vbr password config file (no default):
     Node v_teamdb_node0001
     Backup host name (no default): verticalocalhost
     Backup directory (no default): /home/dbadmin/backup
     Change advanced settings? (n) [y/n]: n
     Password file name (no default): 123
     Saved vbr password to 123.
     Config file name (XXXDB_2016_08_10.ini):
     Saved vbr config to XXXDB_2016_08_10.ini.

2. Initialize a backup folder --task init --config-file XXXDB_2016_08_10.ini

3. Backup your data --task backup --config-file XXXDB_2016_08_10.ini

You will find manifest and a snapshot folder, apart from the backed up node in the /opt/dbadmin/backup directory.

17. Restore the servers data

This can be achieved using the same vbr config file, but will require the target database to be down.