A Guide to Setting up Tableau with Apache Spark

Version 1

    Apache Spark is the hottest thing to happen to big data analytics yet, and Tableau is one of the hottest data visualization and discovery tools out there. Pair them together and you have a potential game changer in the field of big data analytics and visualization.


    Tableau 9 supports Spark, but the setup isn't exactly straightforward until you get the backend components configured correctly.


    The Technology Stack


    With a rapidly evolving open source ecosystem, Apache Hadoop and Spark each have multiple stable and active versions. The technology stack below is what we've tested to be working with the Tableau integration.


    • Hadoop 2.6 or later
    • Hive
    • MySQL - for housing the Hive metastore_db (more on this later)
    • Spark 1.4 or later
    • Tableau Desktop 9 or later
    • Spark ODBC driver


    Hadoop 2.6


    In order to get all the Hive functionality working, you will need to be sure you are on Hadoop 2.6 or higher. There are workarounds to make this work on earlier Hadoop distributions, but it's not worth the effort; a Hadoop upgrade is far simpler. Since the Hadoop upgrade is in itself an extensive process, I am leaving it out of this post.


    Hive Setup


    Once you have installed or upgraded to Hadoop 2.6, you will need to install Hive. Start by downloading the latest stable version, extracting the distribution and creating a symlink.



    root@XXXXX:/d01/swDump# wget http://mirror.cc.columbia.edu/pub/software/apache/hive/stable/apache-hive-1.1.0-bin.tar.gz

    root@XXXXX:/d01/swDump# md5sum apache-hive-1.1.0-bin.tar.gz

    root@XXXXX:/d01/swDump# tar -xzf apache-hive-1.1.0-bin.tar.gz

    root@XXXXX:/d01/swDump# mv apache-hive-1.1.0-bin /srv

    root@XXXXX:/d01/swDump# chown -R hadoop:hadoop /srv/apache-hive-1.1.0-bin

    root@XXXXX:/d01/swDump# ln -s /srv/apache-hive-1.1.0-bin /srv/hive


    Set up the Hive-specific environment variables


    vim ~/.bashrc

    # Configure Hive environment
    export HIVE_HOME=/srv/hive
    export PATH=$PATH:$HIVE_HOME/bin

    # Resolve the Hive startup error caused by Hadoop shipping jline 0.9x while Hive upgraded to jline2
    export HADOOP_USER_CLASSPATH_FIRST=true



    There are a couple of other configurations, but we will deal with that later.


    MySQL Setup


    The default configuration of Hive comes with an embedded Derby datastore, but it has a few limitations. Firstly, it does not scale well for any kind of production deployment. It also does not support concurrent connections very well, and we will need those if we are going to connect with Tableau. Assuming you already have MySQL installed on your system, you can configure it with Hive as below.

    Log in to MySQL as the root user, then create the metastore_db, create the hive user account and grant the required privileges.


    mysql -u root -p


    mysql> CREATE DATABASE metastore_db;

    Query OK, 1 row affected (0.00 sec)


    mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hive123';

    Query OK, 0 rows affected (0.00 sec)


    mysql> GRANT ALL ON metastore_db.* TO 'hiveuser'@'%' IDENTIFIED BY 'hive123';

    Query OK, 0 rows affected (0.00 sec)


    mysql> flush privileges;

    Query OK, 0 rows affected (0.00 sec)
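    As a quick sanity check, you can read the grants back before moving on. A minimal sketch, assuming the hiveuser account created above and a MySQL client on the same host:

```shell
# List the privileges granted to the Hive metastore account (hiveuser is the account created above)
mysql -u root -p -e "SHOW GRANTS FOR 'hiveuser'@'%';"
```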


    You will also need to download the MySQL JDBC connector from here (http://dev.mysql.com/downloads/connector/j), extract it and copy the jar to your $HIVE_HOME/lib


    root@XXXXX:/d01/swDump/mysql-connector-java-5.1.36# cp mysql-connector-java-5.1.36-bin.jar /srv/hive/lib/

    root@XXXXX:/d01/swDump/mysql-connector-java-5.1.36# chown hadoop:hadoop /srv/hive/lib/mysql-connector-java-5.1.36-bin.jar

    Lastly, create a hive-site.xml file in your $HIVE_HOME/conf with the following entries. Replace the Hive user and password with the appropriate values.






    <configuration>

    <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
    </property>

    <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
    </property>

    <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
    <description>user name for connecting to mysql server</description>
    </property>

    <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive123</value>
    <description>password for connecting to mysql server</description>
    </property>

    </configuration>




    That's it, you are all set (well, almost). The next time you log in to the Hive shell, it should connect to the MySQL metastore_db and all metadata will be stored there. You can start loading some test data.


    hive> create table testaadh(state String, district String, gen int,reject int) row format delimited fields terminated by ',' stored as textfile;


    Time taken: 1.526 seconds


    hive> LOAD DATA INPATH '/user/samx18/tempdata.csv' INTO TABLE testaadh;

    Loading data to table default.testaadh

    Table default.testaadh stats: [numFiles=1, numRows=0, totalSize=552089, rawDataSize=0]



    You can query the MySQL metastore_db directly to verify that your Hive tables were created successfully.



    mysql -u hiveuser -p metastore_db

    mysql> select * from TBLS;




    +--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
    | TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
    +--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
    |      1 |  1441400778 |     1 |                0 | hadoop |         0 |     1 | hivetest | MANAGED_TABLE | NULL               | NULL               |
    |      6 |  1441410858 |     1 |                0 | hadoop |         0 |     6 | testaadh | MANAGED_TABLE | NULL               | NULL               |
    +--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+


    2 rows in set (0.00 sec)



    Spark Setup


    Make sure you have Spark 1.4 or later set up on your Hadoop cluster; if not, you can use the instructions here to get it set up.

    To complete the Spark setup with Hive, you will need to copy your hive-site.xml to your Spark configuration directory.



    $ cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/


    Connect to your favorite Spark shell (pyspark in our case) and test the connection to the Hive table using the Spark Hive context. Depending on your Spark build, the Hive context may or may not have been created for you.



    >>> from pyspark.sql import HiveContext
    >>> sqlContext = HiveContext(sc)
    >>> results = sqlContext.sql("SELECT * FROM testaadh")
    >>> results.show()



    Once you have confirmed that you can access the Hive tables via Spark SQL, start the Spark Thrift server. This will enable Tableau to make a connection and run Spark SQL queries.


    hadoop@XXXXX:~$ cd $SPARK_HOME/sbin

    hadoop@XXXXX:/srv/spark/sbin$ ./start-thriftserver.sh

    starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /srv/spark-1.4.1-bin-hadoop2.6/sbin/../logs/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-XXXXX.out
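    By default the Thrift server listens on port 10000 on the local host. If Tableau will be connecting from a different machine, you may need to pass an explicit bind host and port via --hiveconf; a sketch, assuming a YARN cluster (the host and port values here are illustrative, not required):

```shell
# Start the Thrift server bound to all interfaces on port 10000 (illustrative values)
$SPARK_HOME/sbin/start-thriftserver.sh \
    --master yarn \
    --hiveconf hive.server2.thrift.bind.host=0.0.0.0 \
    --hiveconf hive.server2.thrift.port=10000
```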




    Test connectivity to the Spark Thrift server using beeline.


    hadoop@XXXXX:~$ cd $SPARK_HOME

    hadoop@XXXXX:/srv/spark$ ./bin/beeline

    Beeline version 1.4.1 by Apache Hive

    beeline> !connect jdbc:hive2://localhost:10000

    scan complete in 2ms

    Connecting to jdbc:hive2://localhost:10000

    Enter username for jdbc:hive2://localhost:10000:

    Enter password for jdbc:hive2://localhost:10000:

    Connected to: Spark SQL (version 1.4.1)

    Driver: Spark Project Core (version 1.4.1)

    Transaction isolation: TRANSACTION_REPEATABLE_READ

    0: jdbc:hive2://localhost:10000>
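    You can also run a quick query non-interactively to confirm the table is visible through the Thrift server. A sketch, assuming the testaadh table created earlier and an unauthenticated local connection:

```shell
# Run a single query through the Thrift server and exit
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -e "SELECT COUNT(*) FROM testaadh;"
```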




    Desktop Setup


    Your server components are now set up and you can move on to the desktop setup.

    Download and install the Spark ODBC driver (listed in the technology stack above).


    Open Tableau and create a new data source. Choose Spark SQL as the server type and fill in the details for your Spark Thrift server.






    Browse to the default schema and you should be able to see your Hive tables from the Hadoop cluster.