Tableau and Python Integration

Version 2

    This Wiki is written both for Tableau users who are new to Python and for Python users who are new to Tableau. It follows the very same structure Jonathan Drummey used for his community post on R integration, as he seems to have found the winning formula ;).

     

    Sections

    • What you have to know about Tableau
    • Getting started with Python and TabPy
    • How Tableau sends data to Python
      • Disaggregated data
      • Counting connections & Python script operations
      • Exactly what is sent to Python
      • Densification
      • Order of operations
    • How Tableau receives data from Python
    • Getting into arguments
    • Logging and Debugging
    • Running TabPy in the background
    • Making Python code reusable
    • Known issues

     

     

    What You Have To Know About Tableau

    Tableau talks to R, Python, and MATLAB using a set of SCRIPT_*() functions. The SCRIPT_*() functions - SCRIPT_BOOL(), SCRIPT_INT(), SCRIPT_REAL(), and SCRIPT_STR() - work as table calculations, and therefore you must have an understanding of table calculations, including how dimensions and measures affect the view. Here’s a link to help you get going:

    http://onlinehelp.tableau.com/current/pro/desktop/en-us/calculations_tablecalculations_understanding_addressing.html

     

     

    Getting Started with TabPy

    These links are helpful for getting an overview of Python and Tableau, installing TabPy and connecting it to Tableau, and seeing some example use cases with downloadable sample workbooks. The TabPy repository at https://github.com/tableau/TabPy covers installation and configuration.

     

     

     

    How Tableau Sends Data to Python

    Given that the SCRIPT_*() functions work as table calculations, they require aggregate measures or Tableau parameters to work properly. Aggregate measures include MIN(), MAX(), ATTR(), SUM(), MEDIAN(), and any table calculations or Python measures. If you want to use a specific non-aggregated dimension, it needs to be wrapped in an aggregate function, such as:

     

    SCRIPT_STR('return _arg1',ATTR([someTableauDimension])) 

     

    ATTR() is a special Tableau aggregate that does the following: IF MIN([Dimension]) = MAX([Dimension]) THEN [Dimension] ELSE * (a special version of Null) END. It’s really useful when you’re building visualizations and aren’t sure of the level of detail of the data and what’s being sent. It can be significantly slower than MIN() or MAX() on large data sets, so once you’re confident your results are accurate you can switch to one of those functions for performance.

     

    For other aggregates, you need to be aware of the dimensions in the view that affect the level of aggregation. For example, if you are passing SUM([Sales]) to Python, the value(s) of SUM([Sales]) will likely differ depending on whether Year or Month is in the view. In addition, if you are passing the results of separate table calculation measures to Python, then these are nested table calculations, and like other nested table calculations you need to be mindful of their addressing and partitioning.

     

    There’s a longer section about arguments further down; this was just an overview.

     

     

    Using Every Row of Data - Disaggregated Data

    Sometimes, to produce accurate results, a Python function needs to be called with every row in the underlying data. There are two solutions to this:

    1. Disaggregate the measures using Analysis->Aggregate Measures->Off. This doesn’t actually cause the measures to stop their aggregations; instead it tells Tableau to return every row in the data without aggregating by the dimensions on the view (which gives the wanted effect). Using this with Python scripts can get the desired results, but it causes problems for views where we want Python to work on the non-aggregated data and then display the data with some level of aggregation. The second solution deals with this situation:
    2. Add a dimension such as a unique Row ID to the view, and set the Compute Using (addressing) of the Python script to be along that dimension. If we’re doing some sort of aggregation with Python, then we might need to reduce the number of values returned by filtering them out with something like:
       IF FIRST()==0 THEN SCRIPT_REAL('insert Python script here') END  

    If we need to then perform additional aggregations on that data, we can do so with table calculations with the appropriate Compute Usings that take into account the increased level of detail in the view.
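
    To make the second option concrete, here’s a minimal sketch that assumes a unique [Row ID] dimension on the Level of Detail Shelf and a Compute Using set to [Row ID]. Because [Row ID] is unique, SUM([Sales]) is effectively the row-level Sales value, so Python receives every underlying row, and the FIRST()==0 test returns the result only once per partition:

       IF FIRST() == 0 THEN
           SCRIPT_REAL("return float(sum(_arg1)) / len(_arg1)", SUM([Sales]))
       END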

     

     

    Counting Connections & Python Script Operations

    For a given Python script, there will be 1+X connections to Python to execute the script: there is always one main connection for the detail rows, and X more based on the number of grand totals & subtotals in the view, since grand totals & subtotals are separate computations at a higher level (more notes on that further down). For example, if there is a Row Grand Total, then there will be 2 connections to Python: 1 for the detail and 1 for the Row Grand Total. If there are Row and Column Grand Totals, then there will be 4 connections to Python: 1 for the detail, 1 for the Row Grand Total, 1 for the Column Grand Total, and 1 more for the combined Row & Column Grand Total.

     

    Within the main connection for the detail rows, the Python script will be executed N times, where N is the number of partitions based on the addressing and partitioning of the Python script. Within the connections for subtotals and grand totals, the Python script is executed once for each connection. 

     

    Using Python with Tableau Grand Totals & Subtotals really highlights how grand totals and subtotals are performed as separate computations at a higher level of aggregation. To put it another way: if the Python script is “expecting” a certain level of detail in the arguments passed to it, and that level of detail is not present in the Grand Total computation, then the Python script will generate different results.

     

    So when we want to do custom Grand Totals, we have to be aware of the Grand Total customization tricks for table calcs and use them with our Python scripts. For more information, start with http://drawingwithnumbers.artisart.org/customizing-grand-totals-part-1/ and read more posts in http://drawingwithnumbers.artisart.org/category/grand-totals-2/. Also note that two-pass custom Grand Totals are not currently available for Python scripts (just as they are not available for other table calculations).

     

     

    Exactly What is Sent to Python

    The set of addresses in each Tableau table calc partition (aka the rows in the partition) creates the vector for each argument passed to Python. For example, _arg1 in SCRIPT_REAL('return sum(_arg1)',SUM([Sales])) effectively does a WINDOW_SUM(SUM([Sales])). The arguments passed to Python are always vectors. The values of the arguments are the values of each aggregate measure or Tableau parameter used for the arguments, and the length of the vector is equal to the number of rows in the partition. 'return len(_arg1)' in Python is the equivalent of the Tableau SIZE() function for a given Compute Using.
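
    For example, this minimal sketch of the SIZE() equivalence returns the number of rows in the partition for every row, given the same Compute Using:

    SCRIPT_INT("return len(_arg1)", SUM([Sales]))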

     

    The sorting of the vectors of arguments sent to Python is based on the sort used for the addressing & partitioning. If an Advanced Compute Using with a custom sort is used, that sort also orders what is sent to Python and what comes back. In other words, the Python code works like we’d expect it to.

     

    Here’s an example putting the above several paragraphs together: if you are using disaggregated measures, have 1,000 rows of data in the view, and use a Python script with a Compute Using of Cell, the Python script will be executed 1,000 times within 1 connection, each time given vectors with a length of 1. Depending on your calculation, this can be significantly slower than using a different addressing with, say, 1 partition, and then calling Python once with a vector of 1,000 elements. If your calculation runs considerably slower in Tableau than at the Python command line, the most likely culprit is the addressing settings.

     

    Null values and * (from ATTR()) in Tableau are passed to Python as None. If the Python script is not prepared for None values, it could return undesired results.

     

    Conversely, None values in Python that are returned to Tableau become Null values in Tableau.
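
    As a minimal sketch of guarding against None, this list comprehension substitutes 0.0 before doing any math (choose whatever default makes sense for your calculation):

    SCRIPT_REAL("return [x if x is not None else 0.0 for x in _arg1]", SUM([Sales]))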

     

    In terms of data structure, Python receives each argument as a Python list. 

     

     

    Densification

    We also need to be aware of densification (domain completion, domain padding, mark type filling) effects. When Tableau is densifying the data in the view, that can increase the number of partitions and/or the number of rows within each partition. There are a couple of straightforward ways to identify densification:

    • Pay attention to the Status bar at the bottom of the Tableau window, in particular the number of marks, rows, and columns. If one or more of those values suddenly grows when you add a table calculation or Python script to a view, you’ve got densification.
    • Less commonly, densification will generate so many extra marks that the refresh time for a view grows a lot (such as from sub-second response times to 10 or 15 seconds), so if you add a table calculation or Python script to the view and it gets really slow to update, the issue could be densification (or a slow connection to TabPy, or complicated Python code). When you see this, check the Status bar and the number of marks, rows, and columns to verify.

     

    When unplanned, unexpected, or unwanted densification is occurring, it might cause the Python script to return undesirable results. For example, a densified address that would ordinarily be treated as a Null in Tableau becomes None in Python, and the Python script must be ready for that.

     

    Going into all the other details of densification is beyond the scope of this wiki. Turning off unwanted densification is a matter of moving most or all dimensions off of Rows and/or Columns to the Level of Detail Shelf, and where necessary for layout using aggregate measures on Rows and Columns. Here’s an example from the Tableau Forums: http://community.tableau.com/message/246394

     

     

    Order of Operations

    At a very high level, Tableau does the following:

    1. Runs queries on the data source(s) based on the pills in the view. Filters on dimensions and regular aggregate measures are applied.
    2. Densification and data blending happen, and filters that depend on blended calculations are applied.
    3. Table calculations and Python scripts are computed.
    4. Table calculation filters and filters on Python calculations are applied.
    5. Results are displayed.

     

    This has a few notable effects:

    • Python scripts are not run in the data source, but inside Tableau on aggregated data that has been filtered by most filters.
    • Tableau is smart enough to nest table calculations and Python scripts in any order and call them appropriately, so if you want to run something like table calc -> Python -> table calc -> Python, you can do it.
    • Since Python scripts are called before table calculation filters are applied, table calculation filters will not affect the results of Python scripts. If you need to filter for table calculation results, you'll need to implement that filtering inside a calculated field.

     

     

     

    How Tableau Receives Data from Python

    Though we can use a whole bunch of arguments (measures & Tableau parameters) in a given call to a SCRIPT_*() function, what gets returned from Python is a single Tableau measure that is the result of the Python script, with one result per address (aka row in the partition). This leads to a couple of points:

    1. You can’t return multiple Tableau measures from a single Python script (see below for a workaround***).
    2. Like other aggregate measures and table calcs, we can’t use the SCRIPT_*() functions to increase the level of detail of the view (aka make more dimensions).

     

    *** The workaround is to concatenate the output of the Python script into a single value, then use multiple measures in Tableau to parse that out.
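
    Here’s a hedged sketch of that workaround: Python concatenates two results into one delimited string (a single value, so it repeats for every row in the partition), and Tableau calculated fields split it back apart. The field name [Python Combined] is hypothetical:

    // Calculated field [Python Combined]
    SCRIPT_STR("return '%s|%s' % (min(_arg1), max(_arg1))", SUM([Sales]))

    // Calculated field [Python Min]
    FLOAT(SPLIT([Python Combined], '|', 1))

    // Calculated field [Python Max]
    FLOAT(SPLIT([Python Combined], '|', 2))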

     

    If the Python code returns a single value (rather than a vector), that value will be returned for every row in the partition. If the Python code returns a matrix of values, Tableau won’t know what to do; we have to return a single value or a vector, so the workaround is to parse the matrix appropriately to end up with one or the other.
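
    As a sketch of that parsing, assuming NumPy is installed in TabPy’s environment: np.corrcoef() returns a 2x2 matrix, so we pick out the single cell we care about and let Tableau repeat it across the partition:

    SCRIPT_REAL("
    import numpy as np
    m = np.corrcoef(_arg1, _arg2)   # 2x2 correlation matrix
    return float(m[0][1])           # one cell -> repeated for every row
    ", SUM([Sales]), SUM([Profit]))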

     

    In terms of data structure, Tableau expects vectors to be returned as Python lists. 

     

     

    Getting into Arguments

    The Python code needs to handle the vector(s) of the argument(s) that we’re sending to it. For example, if a Python function only operates on a single value at a time, you can wrap it like this:

     

     

    SCRIPT_REAL("import math

    return map(math.log,_arg1)",SUM([Sales]))

     

    or use a list comprehension

     

    SCRIPT_REAL("return [i * 2 for i in _arg1]",SUM([Sales]))

     

    The SCRIPT_*() functions require at least one _arg# argument, and that argument must be a regular aggregate, a table calc, or a parameter. If you don’t provide any arguments, Tableau will show an error in the Edit Calculation dialog.

     

     

    Logging and Debugging

    As a preliminary troubleshooting step, run the code in a platform that provides a Python interpreter, such as PyCharm, Jupyter, Zeppelin, or the command line, and verify that it returns the expected value.

     

    Another step is to create a cross-tab with all of the dimension(s) used in the view - all the dimensions from Rows, Columns, Pages, and the Marks Card - on Rows, and all of the arguments being sent to Python either as discrete values on Rows and/or in a Measure Names/Measure Values cross-tab. This will give you a perspective on what aggregations Tableau is performing. For example, if you are using ATTR([myDimension]) to pass a constant string value, you might see that ATTR() is actually passing * (effectively Null) because there is more than one value of myDimension at the given level of detail of your view.

     

    There are several possible ways to view and/or log what is being sent from Tableau to Python and back again for debugging purposes:

    • Use print statements in your Python code to write to the terminal window (command line) from which you started TabPy, for example:

         SCRIPT_REAL("print(_arg1)

      return sum(_arg1)",SUM([Sales]))

    • Use Python functions to write data out to a CSV, as in the sketch below
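
    Here’s a minimal sketch of the CSV approach using Python’s csv module; the file path is hypothetical and refers to the machine running TabPy, not the machine running Tableau:

    SCRIPT_REAL("
    import csv
    with open('/tmp/tabpy_debug.csv', 'w') as f:    # path on the TabPy host
        csv.writer(f).writerows([v] for v in _arg1)
    return _arg1
    ", SUM([Sales]))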

     

     

    Running TabPy in the Background

    When we start TabPy by running the startup script, it runs in the console window and the process is terminated when the console window is closed. However, it is possible to run TabPy as a background process or daemon.

     

    In Linux/Mac you can do this by starting it with:

     

    nohup ./startup.sh &

     

    On Windows you can create a shortcut in your Startup folder like this:

     

    C:\Windows\System32\wscript.exe C:\Users\Administrator\invis.vbs "C:\Users\Administrator\Anaconda\envs\Tableau-Python-Server\Lib\site-packages\tabpy_server\startup.bat" %*

     

    In this shortcut:

    1. wscript.exe is the Windows script host
    2. invis.vbs is a file you’ll need to create containing the following Windows Shell script that will start TabPy with a hidden window:

        Set WshShell = WScript.CreateObject("WScript.Shell")
        WshShell.Run """" & WScript.Arguments(0) & """", 0, False

     

     

    Making Python Code Reusable

    We can publish Python functions to the server and call out to them by name from Tableau instead of embedding Python code into Tableau. This would make code management much easier since we won’t have to worry about what code was embedded in what workbook and instead can rely on TabPy as the one version of the truth. 
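
    Here’s a minimal sketch of the publish-and-call pattern. It assumes TabPy is running locally on the default port 9004; depending on your TabPy version, the client import may be tabpy_client instead:

    # Run once, outside Tableau, to publish the function to TabPy
    from tabpy.tabpy_tools.client import Client

    def add(x, y):
        # Tableau sends arguments as lists; return a list of the same length
        return [a + b for a, b in zip(x, y)]

    client = Client('http://localhost:9004/')
    client.deploy('add', add, 'Adds two vectors element-wise', override=True)

    Then in Tableau, call the deployed function by name instead of embedding the code:

    SCRIPT_REAL("return tabpy.query('add', _arg1, _arg2)['response']", SUM([Sales]), SUM([Profit]))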

     

    This also allows us to train supervised machine learning models on historical data (which could be done completely outside Tableau using the Python editor of your choice) and apply them to new data, e.g. to predict churn in Tableau.

     

    The second example provided in the linked post below uses a gradient boosting model trained on historical breast cancer cases to make predictions on new test results entered by the user. The example also includes all the Python code necessary to train, test, and deploy the model to TabPy.

     

    https://www.tableau.com/about/blog/2017/1/building-advanced-analytics-applications-tabpy-64916