Tableau and R Integration

    Integrating R and Tableau has some complexities to it. This Wiki is written for both Tableau users who are new to R, as well as R users who are new to Tableau. Thanks to Bora Beran, Andrew Ball, Mary from Tableau Support, and the folks on the Tableau 8.1 Beta Forums for their input and questions.

     

     

    Sections

     

    • What you have to know about Tableau
    • Getting started with R
      • R Libraries
      • R on the Mac (and Tableau on a Windows Virtual Machine)
    • How Tableau sends data to R
      • Disaggregated data
      • Counting connections & R script operations
      • Exactly what is sent to R
      • Densification
      • Order of operations
    • Getting into arguments
      • Nesting arguments
    • Logging and Debugging
    • What the Tableau/R integration doesn’t do

     

     

    What You Have To Know About Tableau

     

    In Tableau, the R SCRIPT_*() functions - SCRIPT_BOOL(), SCRIPT_INT(), SCRIPT_REAL(), and SCRIPT_STR() - work as table calculations and therefore you must have an understanding of table calculations, including an understanding of how dimensions and measures affect the view. Here’s a set of links to help you get going: http://drawingwithnumbers.artisart.org/want-to-learn-table-calculations/

     

     

    Getting Started with R

     

    These links are helpful for getting an overview of R and Tableau, and for installing R and connecting to Tableau, especially the first two.

     

     

    R Libraries

     

    While the R script can include a library() call, that is inefficient because the library can end up being every time the view is refreshed. Libraries should be pre-loaded before Rserve starts, for example by setting up a .Rprofile on Linux or Mac for loading them into R before Rserve starts, or an Rserv configuration file, see Rserve - Binary R server - RForge.net.

     

     

    R on the Mac (and Tableau on a Windows Virtual Machine)

     

    Running R on Mac OS and Tableau inside a Windows virtual machine is doable, here are some notes:

    • You’ll need to set up an Rserv.conf file with at least one if not two options:

      remote enable workdir /some/accessible/working/directory

      remote enable is necessary for the Windows VM to access R. The default working directory for R is in /private/tmp/Rserv, you can use the workdir to put those files in another place.

    • When writing out files to the working directory without any other directory specification, R will generate a subdirectory for each connection to R with the format conn#####/. In cases where there are multiple connections to R for each refresh of the view in Tableau(see below for details), there will be multiple subdirectories created.
    • If you are starting Rserve in daemon mode, the only way to kill it is via a terminal window using ps -ef|grep Rserve, finding the process ID and then using a kill [process ID], or by using the Activity Monitor and quitting the process there.
    • The Tableau connection to R can use your Mac OS machine name for the Server and the default port 6311. If you can use a fixed IP address, that could be faster.
    • The Rserve daemon is really sensitive to the network connection. If there is no running network and the Mac is just running the default IP address, calls to R can get really slow (at least for Tableau running on Windows 7 under Parallels connecting to R on Mac OS Mountain Lion). If the network is unreliable and dropping & reconnecting, you may need to stop & restart Rserve or close & re-open the Tableau workbook to reconnect to R.

     

     

    How Tableau Sends Data to R

     

    If you're the type of person who wants to see what happens first before trying to understand the how/why, check out this post: Re: Replicate VAR functionality through Script_real

    Given that the SCRIPT_*() functions work as table calculations, they require aggregate measures or Tableau parameters to work properly. Aggregate measures include MIN(), MAX(), ATTR(), SUM(), MEDIAN(), and any table calculations or R measures. If you want to use a specific non-aggregated dimension, it needs to be wrapped in an aggregate function, such as:

    SCRIPT_STR('x<-.arg1',ATTR([someTableauDimension]))

    ATTR() is a special Tableau aggregate that does the following: IF MIN([Dimension]) = MAX([Dimension]) THEN [Dimension] ELSE * (a special version of Null) END, and it’s really useful when building visualizations and you’re not sure of the level of detail of data and what’s being sent. It can be significantly slower than MIN() or MAX() in large data sets, so once you get confident your results are accurate then you can switch to one of the other functions for performance.

     

    For other aggregates, you need to be aware of the dimensions in the view that will affect the level of aggregation. For example, if you are passing SUM([Sales]) to R, the value(s) of SUM([Sales]) will likely be different depending on whether you have the Year or Month in the view. In addition, if you are passing the results of separate table calculation measures to R, then these are nested table calculations and like other nested table calculations you need to be mindful of the addressing and partitioning of the nested table calcs.

     

    There’s a longer section about arguments further down, this was an overview.

     

     

    Using Every Row of Data - Disaggregated Data

     

    For accurate results for the R functions, sometimes those R functions need to be called with every row in the underlying data. There are two solutions to this:

    1. Disaggregate the measures using Analysis->Aggregate Measures->Off. This doesn’t actually cause the measures to stop their aggregations, instead it tells Tableau to return every row in the data without aggregating by the dimensions on the view (which gives the wanted effect). Using this with R scripts can get the desired results, but can cause problems for views that we want to have R work on the non-aggregated data and then display the data with some level of aggregation. The second solution deals with this situation:
    2. Add a dimension such as a unique Row ID to the view, and set the Compute Using (addressing) of the R script to be along that dimension. If we’re doing some sort of aggregation with R, then we might need to reduce the number of values returned by filtering them out with something like:
    IF FIRST()==0 THEN SCRIPT_REAL('insert R script here') END

    If we need to then perform additional aggregations on that data, we can do so with table calculations with the appropriate Compute Usings that take into account the increased level of detail in the view.

     

     

    Counting Connections & R Script Operations

     

    For a given R script, there will be 1+X connections to R to execute the script, where there is always the main connection for the detail rows and X is a factor based on the # of grand totals & subtotals in the view, since grand totals & subtotals are separate computations at a higher level (more notes on that further down). For example, if there is a Row Grand total, then there will be 2 connections to R, 1 for the detail and 1 for the Row Grand total. If there are Row and Column Grand Totals, then there will be 4 connections to R: 1 for the detail, 1 for the Row Grand Total, 1 for the Column Grand Total, and 1 more for the combined Row & Column Grand Total.

     

    Within the main connection for the detail rows, the R script will be executed N times, where N is the number of partitions based on the addressing and partitioning of the R script. Within the connections for subtotals and grand totals, the R script is executed once for each connection.

     

    NOTE: The details of the above two paragraphs will change in a future release of Tableau, due to some R functions that error out when only given a single value - this can certainly happen in the separate computations performed for Tableau subtotals and grand totals.

     

    Using R with Tableau Grand Totals & Subtotals really highlights how grand totals and subtotals are performed as separate computations at a higher level of detail. To explain another way: if the R script is “expecting” a certain level of detail in the arguments passed to it, and that level of detail is not in the Grand Total computation, then the R script will generate different results. So when we want to do custom Grand Totals, we have to be aware of the customizing Grand Total tricks for table calcs to use with our R scripts. For more information, start with http://drawingwithnumbers.artisart.org/customizing-grand-totals-part-1/ and read more posts in http://drawingwithnumbers.artisart.org/category/grand-totals-2/. Also note that the two-pass custom Grand Totals introduced in Tableau 8.1 are not currently available for R scripts (just like they are not available for other table calculations).

     

     

    Exactly What is Sent to R

     

    The set of addresses in each Tableau table calc partition (aka rows in the partition) creates the vector for each argument passed to R, for example .arg1 in SCRIPT_REAL('sum(.arg1)',SUM([Sales])) that would effectively do a WINDOW_SUM(SUM([Sales])). The arguments passed to R are always vectors. The values of the arguments are the values of each aggregate measure or Tableau parameter used for the arguments, and the length of the vector is equal to the number of rows in the partition. length(.arg1) in R is the equivalent of the Tableau SIZE() function for a given Compute Using.

     

    The sorting of the vectors of arguments sent to R is based on the sort used for the addressing & partitioning. If an Advanced Compute Using with a custom sort is used, that sort also orders what is sent to R and what comes back back. In other words, the R code works like we’d expect it to.

     

    Here’s an example of putting the about several paragraphs together: if you are using disaggregated measures, have 1,000 rows of data in the view and use an R script with a Compute Using of Cell, the R script will be executed 1,000 times within 1 connection, each time given vectors with a length of 1. Depending on your calculation, this can be significantly slower than using a different addressing with say, 1 partition, and then calling R once with a vector of 1000 elements.

     

    Null values in Tableau are passed to R with a value of NA. If the R script is not ready for NA values then it could return undesired results, for example sum(c([NA,1])) returns NA in R whereas sum(c[NA,1], na.rm=TRUE) returns 1 in R.  Also, there’s a bug in Tableau 8.1 through 8.1.2 where instead of NA the value of 9.654043e-321 is used, it was fixed in 8.1.3.

     

    Conversely, NA values in R that are returned to Tableau become Null values in Tableau.

     

     

    Densification

     

    We also need to be aware of densification (domain completion, domain padding, mark type filling) effects. When Tableau is densifying the data in the view, that can increase the number of partitions and/or the number of rows within each partition. There are a couple of straightforward ways to identify densification:

    • Pay attention to the Status bar at the bottom of the Tableau window, in particular the number of marks, rows, and columns. If one or more of those values suddenly grows when you add a table calculation or R script to a view, you’ve got densification.
    • Less commonly, densification will generate so many extra marks that the refresh time for a view grows a lot (such as from sub-second response times to 10 or 15 seconds), so if you add a table calculation or R script to the view and it gets really slow to update, the issue could be densification (or a slow connection to Rserve, or complicated R code). When you see this, check the Status bar and the number of marks, rows, and columns to verify.

     

    When unplanned, unexpected or unwanted densification is occurring, that might cause the R script to return undesirable results. For example, a densified address that would ordinarily be treated as a Null in Tableau is becomes NA in R, and the R script must be ready for that.

     

    Going into all the other details of densification is beyond the scope of this wiki. Turning off unwanted densification is a matter of moving most or all dimensions off of Rows and/or Columns to the Level of Detail Shelf, and where necessary for layout using aggregate measures on Rows and Columns. Here’s an example from the Tableau Forums: http://community.tableau.com/message/246394

     

     

    Order of Operations

     

    At a very high level, Tableau does the following:

    1. Runs queries on the data source(s) based on the pills in the view. Filters on dimensions and regular aggregate measures are applied.
    2. Densification and data blending happen, and filters that depend on blended calculations are applied.
    3. Table calculations and R scripts are computed.
    4. Table calculation filters and filters on R calculations are applied.
    5. Results are displayed.

     

    This has a few notable effects:

    • R scripts are not run in the data source, but inside Tableau on aggregated data that has been filtered by most filters.
    • Tableau is smart enough to nest table calculations and R scripts in any order and call them appropriately, so if you want to run something like table calc -> R -> table calc -> R, you can do it.
    • Since R scripts are called before table calculation filters are applied, table calculation filters will not affect the results of R scripts. If you need to filter for table calculation results, you'll need to implement that filtering inside a calculated field.

     

     

    How Tableau Receives Data from R

     

    Though we can use a whole bunch of arguments (measures & Tableau parameters) for a given call to a SCRIPT_*() function, what gets returned from R is a Tableau measure that is the result of the R script, with one result per address (aka row in the partition). This leads to a couple of points:

    1. You can’t return multiple Tableau measures from a single R script (see below for a workaround***).
    2. Like other aggregate measures and table calcs, we can’t use the SCRIPT_*() functions to increase the level of detail of the view (aka make more dimensions).

     

    *** The workaround is to concatenate the output of the R script into a single value, then use multiple measures in Tableau to parse that out. Depending on the R script and number of calls & time to process, that might be faster. For example, the Poisson distribution functions from the epicalc library return 4 values at once (low & high tails for .0005 and .025) as an array, it might be faster/easier to make a single call to R return something like “.01,.02,.98,.99″ for a given row in the partition and then use 4 separate calcs to parse that out, rather than have 4 separate SCRIPT_*() functions.

     

    If the R code returns a single value (rather than a vector), then that value will be returned for every row in the partition. If the R code returns a matrix of values, Tableau won’t know what to do, we have to return a single value or a vector, so the workaround is to parse the matrix appropriately to end up with either a single value or a vector of single values.

     

     

    Getting into Arguments

     

    The R code needs to handle the vector(s) of the argument(s) that we’re sending to it, so, for example, if an R function only looks at a single value of 1 argument (like the Poisson distribution code in the epicalc library I was referencing above), then even though we might hand it a vector with a ton of values it’s only going to take the first value of the vector. A workaround is to wrap the R function in something like this:

    SCRIPT_REAL(' output<-(1:length(.arg1)) for(i in 1:length(.arg1)
         {     output(i) <- anRFunctionThatDoesNotHandleVectors(.arg1[i])
         }
         output ', [someTableauAggregateMeasure])

     

    The SCRIPT_* functions require at least one .arg# argument and that argument must be a regular aggregate or a table calc. If you don’t give any argument, Tableau will give the “SCRIPT_* is being called with ‘string’, did you mean (string, …)?” error in the Edit Calculation dialog. If you give an argument that is not an aggregate, Tableau will let you save the measure but then you won’t be able to drag the measure’s pill into a view. If you just want to call R with static values or parameters, you can use a dummy argument like MIN(0):

    SCRIPT_REAL('1',MIN(0))   SCRIPT_STR('x<-"Hello world"',MIN(0))

     

     

    Nesting Arguments

     

    The [#] in .arg#[#] lets us pick arbitrary rows in the partition to use and is treated as a scalar when it is a single number. However, if the [#] is itself another.arg# then that is treated as a vector, which could lead to unanticipated results. Some examples:

     

    Partition of 3 rows, .arg1 = 2, .arg2 = 1, .arg3 = MIN(1)

    SCRIPT_REAL('sum(.arg1)',2,1,MIN(1)) returns 6,6,6 - calculation is sum(c(2,2,2)) for each row in the partition  SCRIPT_REAL('sum(.arg1[1])',2,1,MIN(1)) returns 2,2,2 - calculation is sum(c(2,2,2)[1]) for each row in the partition  SCRIPT_REAL('sum(.arg1[.arg2])',2,1,MIN(1))) returns 6,6,6 - calculation is sum(c(2,2,2)[c(1,1,1)]) for each row in the partition+++  SCRIPT_REAL('sum(.arg1[.arg3])',2,1,MIN(1))) returns 6,6,6 - calculation is sum(c(2,2,2)[c(1,1,1)]) for each row in the partition

    +++ We get the same results if .arg2 =1 or .arg2 = [some Tableau parameter] that has a value of 1.

     

    In this example, .arg3 = MIN(1), a regular aggregate. If that aggregate returns Null in Tableau for a given address/row in the partition—for example because it comes from a densified address/row in the partition— then it is passed to R as NA (as of Tableau 8.1.3, see notes above). So using the prior example, if the 1st row in the partition returned Null for MIN(1), then the call to R would look like the following:

    SCRIPT_REAL('sum(.arg1[.arg3])',2,1,MIN(1))) returns NA, NA, NA - calculation is sum(c(2,2,2)[c(NA,1,1)]) for each row in the partition

     

     

    Logging and Debugging

     

    As a preliminary trouble shooting step, attempt to run the code in R or Rstudio and verify that it is returning the expected value. The following code can be used to test the code in the SCRIPT function (also saved in attached Rscript file):

    #Runs the block of code using vectors (1,2,3,...,15) as input
    #Saves the output of the code block as returnValue
    #
    #If you have more than 5 arguments, create more lines
    #  of the form .arg* <- rep(1,15)
    #If arguments are strings, use .arg* <- rep("hi",15) instead
    #For information on rep, type ?rep into console

    Code <- function(){
      .arg1 <- rep(1,15)
      .arg2 <- rep(1,15)
      .arg3 <- rep(1,15)
      .arg4 <- rep(1,15)
      .arg5 <- rep(1,15)
     
      #Paste Code Here
    }

    returnValue <- Code()

     

    Another step is to create a crosstab with all of the dimension(s) used in the view - all the dimensions from Rows, Columns, Pages, and the Marks Card - on Rows, and all of the arguments being sent to R either as discrete values on Rows and/or in a Measure Names/Measure Values crosstab. This will give you a perspective on what aggregations Tableau is performing. For example, if you are using ATTR([myDimension]) to pass a constant string value, you might see that ATTR() is actually passing * (effectively Null) because there is more than one value of myDimension for the given level of detail of your view.

     

    There are several possible ways to view and/or log what is being sent from Tableau to R and back again for debugging purposes:

    • When starting Rserve, use Rserve(TRUE) instead of Rserve(). This turns on a whole lot of debugging code in the R console. Rserve is not started in daemon mode so the console won’t be available for running your own code at the R command line.
    • On Windows, start Rserve using Rserve_d.exe in the Rserve install directory from the Windows command line. That will output all the communication between Tableau and Rserve into the Windows console window.
    • Here’s a simple R logging function created by Rollie Parrish: https://gist.github.com/rparrish/8086147 that will write logfiles into your R working directory. Using this function there will be 1 logfile per connection to R, so based on the notes above for grand totals and subtotals there could be multiple log files for a single refresh of the view.
    • If Rserve has been started from the R console, any print() statements in the R scripts will be printed to the R console.
    • The R function write.csv can be useful for writing out what is passed to R, for example:
    SCRIPT_REAL(' x<-data.frame(.arg1,.arg2,.arg3) ## create human-readable column names instead of .arg1, .arg2, .arg3 names(x) <- c("Tableau Dimension 1","Tableau Dimension 2", "Tableau Measure") ## add row.names = FALSE to turn off the row index that R adds write.csv(x, file = "file name", row.names = FALSE) ## all R scripts must have a return value 0 ', MIN([Tableau Dimension 1]), MIN([Tableau Dimension 2]), SUM([Tableau Measure]) )

     

     

    What the Tableau/R Integration Doesn’t Do

     

    This first implementation of R and Tableau integration in Tableau 8.1 doesn’t do everything people might like, here are some notable bits:

     

     

    Known Issues

     

    • As of 10.8 through 10.8.3, R scripts and table calculations that are the results of IF/THEN, IIF, and CASE statements are always evaluated. For example, IF 1==0 THEN [[R script code]] ELSE "1 does not equal 0" END returns "1 does not equal 0" to every row in the partition *and* evaluates the R script. This can cause problems if the R script is doing something like writing data to a file. The workaround is to do the IF statement inside the R script.

     

     

    The Future

     

    With this first release, there's still a lot to learn about what works in what situations for Tableau and R. Please add comments and/or edit this document with your findings.

     

    [Edited 20140110 to include fix in 8.1.3 and new known issue for IF statements.]

    [Edited 20140131 to include additional note about testing in Tableau.]

    [Edited 20161130 to include post from Re: Replicate VAR functionality through Script_real]