Aggregating Data: it is relatively easy to collapse data in R using one or more BY variables and a defined function.
There are a couple of questions you have that aren't directly addressed in the TabWiki that @Yuriy linked to (I'm the author of that TabWiki):
1) The reason why we have to use aggregates is a design constraint. Tableau sends queries to a data source that return data so there are really only two places where Tableau could run R:
a) before the query is run (impossible because there is no data to process)
b) after the query is run (which is what Tableau does)
Tableau's queries are by default aggregations, so the R code is run on aggregated data. We often want R code to compute over multiple values, and Tableau had already built a way to compute over the multiple values of aggregated data, namely table calculations, so the R code works like a table calc with addressing & partitioning options. What you and others have asked for is for Tableau to run the R code over the record-level results and then aggregate the data. I'm guessing that Tableau chose not to do that for two reasons: 1) performance, because running a query that returns 100M records, then running some R code over them, and then aggregating would be really slow; and 2) ease of implementation.
That said, there are two ways to get R code to run over record-level results. One is to turn Analysis->Aggregate Measures->Off, the other is to use one or more dimension(s) in the view that will cause the aggregation to return every unique record -- such as by putting a Record ID on the Level of Detail Shelf.
This gets at your other question about needing all the data points to run the clustering algorithm. What the article does not specify is the level of detail of the view, aka the vizLOD or grain of the view, which is created by the dimensions in the view. As I noted in the prior paragraph, if we bring a unique record ID into the vizLOD then we get every record and can run calcs over it. In this case the creator used Species and combinations of Petal/Sepal Length/Width as dimensions to create the grain of the view, which in the case of the linked workbook gets most but not all of the records (there are 150 in the data set, so the Overview Iris dataset and Clustering the iris dataset sheets should be showing 600 marks instead of the 468 that they do). Whoever created this particular viz made a mistake.
The dimensions in the view then explain why there are only 4 values being passed. Each vector includes the marks/addresses/rows in the partition as defined by the R pill's addressing & partitioning settings.
Thank you Jonathan,
the other is to use one or more dimension(s) in the view that will cause the aggregation to return every unique record -- such as by putting a Record ID on the Level of Detail Shelf.
When I add data at the record level (e.g. adding IDs to the view), then I'll also "split" up the aggregations in the view. However, it is supposed to be an alternative to Analysis->Aggregate Measures->Off, meaning that it allows an aggregated view but at the same time provides a way to send the data to R in a more disaggregated form. Could you elaborate on how to do that, or even better demonstrate it with an example?
Further, what are the addressing and partitioning when I don't have a table structure but something like a scatterplot? More specifically, how does Tableau determine the "order" of the marks/data points in the vector that is used in R? This is quite important. The k-means clustering R function, for instance, returns a vector with the resulting cluster IDs. Returning those back to Tableau requires that the cluster IDs get matched correctly to their marks/data points.
The example below shows a somewhat random vector order: it doesn't add [Profit] to the vector in ascending/descending order, nor go from left to right across the chart, or anything else. The second example shows an issue with partitioning. Although Table is selected, the vector only has a length of one, which means each data point/mark creates its own partition.
R scripts work as table calculations, so having an understanding of table calculation partitioning and addressing is critical. See http://drawingwithnumbers.artisart.org/want-to-learn-table-calculations/ for more info.
In the above views there is only one dimension, Region, on the Level of Detail Shelf. When the Compute Using (addressing) is set to Table (Across) that tells Tableau to address on the dimension(s) on Columns and partition on all other dimensions. You can see this by hovering over the table calc pill or editing the table calc:
So in that case what is sent to R is four vectors (one for each Region), and there's a length of 1 because there's only one value of Region in each partition. This is known as the address, aka the row in the partition (I wrote about this more in Tableau and R Integration in case you haven't seen that yet).
Here's a screenshot:
Note that this is a different result than you got in your second screenshot; I'm guessing that you had changed the addressing in that second screenshot. If I change the addressing to Region, I get a value of 4 for each Region:
Note that the hover tooltip now says "Results are computed along Region". This is Tableau telling us that it is addressing on Region, and partitioning on nothing. So R is getting a single vector with 4 values, and returning the length of the vector back to Tableau to each Region.
Here's another way to demo this, using cumsum() as a function with MIN(1) as the given measure. When I set the calc to compute using Table (Across) so it partitions on Region, the cumsum() returns 1 for each Region:
And if I set the Compute Using (addressing) to Region then I get a range of values from 1-4:
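The effect of those two Compute Using settings can be sketched outside Tableau. Here's a minimal Python illustration (just a sketch of the partitioning logic, not Tableau's actual mechanics) of why cumsum() over MIN(1) returns 1 per Region in one case and 1-4 in the other:

```python
from itertools import accumulate

# Four Region marks, each carrying MIN(1) = 1, as in the view above.
regions = ["Central", "East", "South", "West"]
values = {r: 1 for r in regions}

# Compute Using: Table (Across) -> partition on Region, so each
# partition holds a single value and cumsum() returns 1 per Region.
per_region = {r: list(accumulate([values[r]])) for r in regions}
# -> {'Central': [1], 'East': [1], 'South': [1], 'West': [1]}

# Compute Using: Region -> one partition with all four values, so
# cumsum() runs over the whole vector and returns 1 through 4.
one_partition = list(accumulate(values[r] for r in regions))
# -> [1, 2, 3, 4]
```

Each dictionary entry in the first case stands in for a separate call with a length-1 vector; the second case is a single call with a length-4 vector.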
Hopefully that makes sense, and the above view should help with the next part...
Now to answer your question about ordering. When the partition (vector) only contains one value then the sort effectively doesn't matter. When the vector contains multiple values the sort is based on the dimension sort of the pills used for addressing. This is Tableau's behavior for other table calculations as well.
In this case there is no user-defined sort so Tableau defaults to the alphanumeric sort of the Region dimension. Here's a screenshot where I'm using the cumsum() with the addressing on Region and have added Region to the Label Shelf, and we can see that Central=1, East=2, and so on:
If I change the sort so West is first in the order:
The order of results of the cumsum() changes, so West is now first:
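The ordering behavior can be made concrete with a small Python sketch (hypothetical values, not Tableau internals): the sort of the addressing dimension is the order of the values in the vector, so changing the sort changes which mark gets which result back.

```python
from itertools import accumulate

# One mark per Region, each with MIN(1) = 1, as in the cumsum() demo.
marks = {"Central": 1, "East": 1, "South": 1, "West": 1}

# Default: the alphanumeric sort of Region defines the vector order,
# so Central gets 1, East gets 2, and so on.
default_order = sorted(marks)
cumsum_default = dict(zip(default_order,
                          accumulate(marks[r] for r in default_order)))

# Manual sort with West first: the vector order changes, so the
# results mapped back to each mark change too -- West now gets 1.
manual_order = ["West", "Central", "East", "South"]
cumsum_manual = dict(zip(manual_order,
                         accumulate(marks[r] for r in manual_order)))
```

This is exactly why the k-means question matters: the cluster IDs coming back from R are matched to marks purely by position in the vector, so the sort had better be deterministic.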
Does that make sense now?
I've attached the Tableau workbook I used to build this example.
Thanks a lot, Jonathan!
One last question. Let's say for a specific R calculation the data is sent in five partitions/vectors (but R needs all vectors to make a correct calculation). How can R return the results for some cells/marks when it hasn't even received the entire input data yet? Isn't it necessary for R to cache the vectors until it has received all of them, then make the calculation, and only then return the vectors/partitions with the results?
I'm glad this is helping you out!
Each partition created by table calculation addressing generates a separate call to R. So if you create a calculation whose addressing is such that there are 5 partitions then you'll have 5 separate calls to R and none would have access to all the data in the view. You'd have to change the addressing so that there is 1 partition that includes all the data in the view.
Here's an example: let's say there are 5 Order Priorities and 4 Regions in the view and the domain is complete (i.e. every Order Priority exists for every Region), for a total of 20 marks. Which vectors are sent and how many calls to R are made depends on the addressing (compute using) of the R calculation:
- If the compute using of the R calc is on Region that will make 5 partitions and therefore 5 vectors (one for each Order Priority) of 4 values each (one for each Region), and R will be called 5 times, once for each vector.
- If the compute using of the R calc is on Order Priority that will make 4 partitions and therefore 4 vectors (one for each Region) of 5 values each (one for each Order Priority), and R will be called 4 times, once for each vector.
- If the compute using of the R calc is an advanced compute using on Region, Order Priority (or Order Priority, Region) that will make 1 partition and therefore 1 vector with 20 values sorted based on either the pill sort or the custom advanced sort that you can create in the advanced configuration dialog. R will be called once. This is the addressing that you'd need in order to have the R calculation compute over all 20 marks, then if you needed other calcs to be limited to sub-vectors you could do that inside your R script or use separate R calcs in Tableau.
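The three cases above can be sanity-checked with a little Python simulation. This is only a sketch of the partition/call counting (the dimension member names are made up, and Tableau's real pipeline is of course more involved):

```python
from itertools import product

# Hypothetical domain: 5 Order Priorities x 4 Regions -> 20 marks.
priorities = ["Critical", "High", "Medium", "Low", "Not Specified"]
regions = ["Central", "East", "South", "West"]
marks = list(product(priorities, regions))

def partitions(addressing):
    """Group marks into partitions; each partition is one call to R.
    The calc addresses (computes along) the dimensions in `addressing`;
    all remaining dimensions partition the data."""
    parts = {}
    for prio, region in marks:
        key = tuple(value for dim, value in
                    [("Order Priority", prio), ("Region", region)]
                    if dim not in addressing)
        parts.setdefault(key, []).append((prio, region))
    return list(parts.values())

# Compute using Region -> partition on Order Priority:
# 5 calls to R, each with a vector of 4 values.
by_region = partitions({"Region"})

# Compute using Order Priority -> partition on Region:
# 4 calls to R, each with a vector of 5 values.
by_priority = partitions({"Order Priority"})

# Advanced compute using on both dimensions -> a single call
# with one vector of all 20 values.
by_both = partitions({"Region", "Order Priority"})
```

The last case is the one to use when the R script needs to see every mark at once, e.g. for clustering.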
The beauty of the current R integration is that it takes advantage of the addressing and partitioning logic that has already been built for table calculations. The challenge is that Tableau's addressing and partitioning logic is harder to understand than it could be.
I see, so R calculations over all data points/marks in the view need the data sent in one partition -- which on the other hand means that partitions live in their own world and don't even know that more data exists. This also applies to the separate R calls, which repeat the same calculation for each partition. It seems that many people didn't really catch this, looking at all those k-means calculations on the web.
Thanks very much for all your effort, Jonathan!
And I agree re: people not catching that on some of the R demo workbooks out there, I've seen three types of problems:
1) The calculations are not computing over the entire data set due to the addressing and partitioning settings (the problem we talked about here).
2) The goal is to compute over every record but the dimensions in the view aren't sufficient to generate a mark for every record. For example here's a table of male heights (in inches):
If Height is used as the only dimension in the view then I see 4 marks for the 4 distinct values of height:
Any R calc that is trying to compute over all the values of height using this view will be wrong because it's missing the 2nd value of 71.
If I want to compute over all the data then I need to bring the ID dimension in to the view to generate 5 marks (I've also changed the mark type and turned on transparency to make the 2nd value of 71 more obvious):
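The distinct-values trap can be shown in a few lines of Python. The actual numbers in the screenshot aren't reproduced in the text, so these are made-up heights with one duplicated value, matching the shape of the example (5 records, 4 distinct heights):

```python
# Hypothetical data: five people (IDs A-E), heights in inches,
# with 71 appearing twice.
heights = {"A": 69, "B": 71, "C": 71, "D": 72, "E": 74}

# Height as the only dimension: one mark per distinct value -> 4 marks,
# so any calc over these values silently drops the second 71.
marks_by_height = sorted(set(heights.values()))      # 4 values

# Adding the ID dimension restores one mark per record -> 5 marks.
marks_by_id = [heights[k] for k in sorted(heights)]  # 5 values

mean_wrong = sum(marks_by_height) / len(marks_by_height)
mean_right = sum(marks_by_id) / len(marks_by_id)
```

With these numbers the mean computed over the 4 distinct values differs from the mean over all 5 records, which is exactly the kind of silent error the missing mark causes.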
I've seen at least a couple of workbooks that had this error. The solution to this is to have awareness of the grain of the data vis-a-vis the grain of the view aka viz level of detail (vizLOD).
3) An alternative solution for the issue in #2 is to turn Analysis->Aggregate Measures->Off:
That gets us a view with 5 marks. However, turning off aggregation can a) lead to user confusion, because it's not an obvious setting, and b) lead to erroneous calculation results when the view uses combinations of regular aggregate calculations and table calculations/R scripts. I've seen cases where a demo R view was built using Analysis->Aggregate Measures->Off to get accurate results and then someone else tried to rebuild the view and left aggregation on (getting wrong results), or some aggregate calcs were built using Tableau's default aggregation setting and then the user decided they needed to turn Analysis->Aggregate Measures->Off to get the R calcs right, which broke the regular aggregate calcs.
My preference to avoid the problems in #3 is to make sure that my data has sufficient dimensionality that there's a unique key (which might be a composite of multiple fields) so I can put those dimension(s) onto the Level of Detail and end up with a view that clearly specifies the desired result.
I was looking through the community for the same question on this thread. As users we believe that R scripts should be sent as record-level calculations and not Table calculations. If this was a design/technical constraint, does it still exist?
It would be really great to send the arguments of an R-script as record level fields instead of table calc aggregations.
I understand table calculations well and I see what happens with the R script. My issue, as with others, is that integrating R with Tableau through table calculations severely limits the range of functionality and visualizations possible. We cannot further aggregate the table calculations and hence cannot generate visualizations at a level higher than the grain/dimensionality of the SCRIPT_ (in my case, the 'predicted' outcome, a 1/0 response, is returned by the script and is generated for every Row_ID).
You provided two options which, for the reasons discussed above and below, would not be appropriate alternatives:
1) Using the level of dimensionality on the Detail Shelf: As Jacob mentioned, this will not provide the visualization that we are looking for. In my scenario, my SCRIPT_ generates a 1/0 predicted response variable for each Record_ID. I want to visualize a simple bar chart of the percentage of 1s out of the total (1+0) against a categorical variable with two levels, A & B. Adding Record_ID to the Detail Shelf changes the visualization and doesn't make this bar chart possible.
2) Turning off aggregates in Analysis: As you mentioned, this doesn't help either. I am not able to create a calculated field for % of total from the table calculation SCRIPT_ that was returned.
For the above reasons, I believe SCRIPT_ should be a record-level function. Using it as a table calculation means that we have limited ourselves to viewing only detailed cross-tabs built from the output of the SCRIPT_ function.