2 Replies Latest reply on Nov 29, 2016 6:19 PM by Jonathan Drummey

    Replicate VAR functionality through Script_real

    Danyang Su

      Hi all,

       

      I am trying to write a calculated field such that it will behave just like the default var in tableau. So ideally, I can pass the whole data to r, and calculate var for each category (without having to hard-code them), and return a smaller vector which will be displayed. But I don't know how to make that happen, I find a source but can't figure out what it says (see link). So I would appreciate if anyone can lend a helping hand.

       

      Tableau and R Integration

       

      PS: I am not bored. I just think VAR is a good example of what I would like to achieve eventually.

        • 2. Re: Replicate VAR functionality through Script_real
          Jonathan Drummey

          Hi,

           

          I'm the author of the Tableau & R integration document you linked to and yes, that part of the document is shorter than it could have been.

           

          Short answer to the question of "how can I get a smaller result vector": In a view that has every record you want to calculate the variance across a calculation like IF FIRST()=0 THEN SCRIPT_REAL(insert R script to calculate variance here, [your agg measure]) END with that pill as an output *and* on the Filters Shelf filtering for Special->Non-null values will get you the result you want. You could also get the same results using IF FIRST()=0 THEN WINDOW_VAR([your agg measure]) END.

           

          Long answer to the question of "how can I get a smaller result vector":

           

          I'm going to go high-level here to explain why this is somewhat complicated, then get back to the IF FIRST()=0 part, then go into some complications. There are four levels of calculation and a fifth grain that we work at in Tableau:

           

          * record level or grain of the data - this is the raw record level data and the results of any row-level functions.

          * regular aggregate - SUM(), MIN(), AVG(), VAR(), etc. are all regular aggregate measures and are calculated at the viz level of detail (vizLOD) which is based on the dimensions in the view. Level of Detail expressions are a special kind of regular aggregate that returns the results as record-level values.

          * table calculations - This includes FIRST(), INDEX(), TOTAL(), the WINDOW_*, RANK_*, and SCRIPT_* calcs as well as Python & R scripts. These are mostly aggregations of aggregations (the ones that take fields as arguments require those fields to be aggregates) and are computed along the dimension(s) used for addressing and partitioned based on all the other dimensions in the view. The vector of values sent to Python/R is based on the addressing dimensions.

          * post-aggregate calculations - This includes reference lines, forecasting, grand totals, the summary card, etc. and are built-in Tableau computations that can generate results we can see but are not usable in other calculations. For example if we want to use a reference line result as a threshold for other calcs then we need to build out our equivalent of a reference line using some other means (table calculations, level of detail expressions, and/or a Tableau data blend).

           

          The fifth "grain" that we work at is what I call the grain of the display. Here's a quick example using a 10 row data set in Tableau v10.0:

           

          Screen Shot 2016-11-29 at 8.44.44 PM.png

           

          In a view where the viz level of detail is this record-level then VAR(Value) is going to return Null as expected because it's only aggregating across one record.

           

           

          Screen Shot 2016-11-29 at 8.45.44 PM.png

          If I take the Row dimension out of the view then VAR(Value) is going to return the variance across all the records for each Source:

           

          Screen Shot 2016-11-29 at 8.48.16 PM.png

           

          This works great...now here's where things get more complicated. I'm going to use Tableau's WINDOW_VAR() function as a stand-in for an equivalent R or Python function. If I try to set up a WINDOW_VAR([Value]) (or a SCRIPT_REAL("R/Python function here",[Value]) Tableau is going to throw up some sort of "All fields must be aggregate or constant when using table calculation functions..." error.

           

          Screen Shot 2016-11-29 at 8.52.33 PM.png

           

          The reason why is that WINDOW_ and SCRIPT_ functions are table calculation functions and the way Tableau is designed they require aggregates as arguments. However, if I just use WINDOW_VAR(SUM([Value])) (or a SCRIPT_REAL("R/Python function here", SUM([Value])) in the view that I used VAR(Value) in then I'm going to get nonsensical resultsm, this is the WVar Sum Value calc below:

           

          Screen Shot 2016-11-29 at 8.53.56 PM.png

           

          This is due to the fact that the inner SUM(Value) is computed at the vizLOD which is just Source, so what's being fed to the WVar Sum Value calc is the two values of 120 and 330, not all the individual record-level values. Therefore to have the WINDOW_VAR() (or R/Python calc) return the same results as the regular aggregate VAR() calc we need to *increase* the vizLOD so the table calculation by adding the Row dimension back into the view so the calculation has all the values it needs. In addition we need to change the Compute Using to be on the Row so it partitions on Source. Here's the result, we see accurate variances again:

           

          Screen Shot 2016-11-29 at 8.58.22 PM.png

           

          Now we're getting back to the original question where given that the table calculation (or R/Python script) needs to be fed the vectors of values for each source yet ultimately we want to collapse that down to a single value for the grain of the display (or potentially further calculations). In that case we can use the IF FIRST()==0 THEN WINDOW_VAR(SUM([Value])) END formula to do that. FIRST() is a special table calculation function that identifies the first address in the partition so we can use that to return a single result:

           

          Screen Shot 2016-11-29 at 9.01.16 PM.png

           

          This is great, now usually what people do their first time around is go through a thought process like "Yeay, I got my result, now let me get rid of that Row dimension." And when they remove that from the view they get this with red pills that don't work because they require the Row dimension that is gone:

           

          Screen Shot 2016-11-29 at 9.02.54 PM.png

           

          Or they only have a single partition (I did this by filtering for just Source a) and were able to use the default Table (Down) addressing instead of a fixed addressing and the view looks like this with Null values for the WVar Sum Value calcs because the vizLOD isn't enough for them to be accurate, there's only a single SUM(Value) that is being passed to the WINDOW_VAR()  table calculation:

           

          Screen Shot 2016-11-29 at 9.04.18 PM.png

           

          Or (third mistake) they will just right-click on the Null values from the original and choose Exclude and end up with another broken view because of Tableau's order of operations. When we right-click+exclude we are excluding dimension values, this removes the needed marks from the viz because dimension filters are applied before table calculations (and R and Python scripts) are computed:

           

          Screen Shot 2016-11-29 at 9.09.57 PM.png

           

           

           

          So the table calculation (or R or Python script) *requires* that the Row dimension be in the vizLOD. We can get rid of the extraneous marks by Ctrl+dragging (Cmd+dragging on Mac) one of the table calc pills to the Marks card and filtering for Special tab->non-Null values and then we can get only a single mark for each source:

           

          Screen Shot 2016-11-29 at 9.07.30 PM.png

           

          At this point if we're done we can turn off the Row header or move it to the Detail shelf, then turn off the tooltip for it, and be all set.

           

          However let's say we wanted to show the Avg of all values at the level of Source. Because the vizLOD is Source, Row, AVG(Value) is going to be computed at that (record) level and not at the level of Source, so it's incorrect as in this view:

           

          Screen Shot 2016-11-29 at 9.12.52 PM.png

           

          We have to take into account the vizLOD vs. what we want for the calculation, in this case I dealt with it by using WINDOW_AVG(SUM([Value])) with a Compute Using on the Row so it partitions on each source, and now we have an accurate result:

           

          Screen Shot 2016-11-29 at 9.12.57 PM.png

           

          So we have to keep in mind the level of the calculation we are working with *and* the particular level of detail grain it is being computed at for *every* calculation in the view. What I generally try to do is use record level, regular aggregate, and LOD expressions as much as possible and only resort to table calcs & R scripts (Python just came out) when I absolutely have to because even though they add a tremendous amount of capability they also add quite a bit of complexity.

           

          I've attached the v10.0 workbook I used to build the screenshots above.

           

          Jonathan

          2 of 2 people found this helpful