4 Replies Latest reply on Jan 8, 2019 8:38 AM by Jonathan Drummey

    R Linear Model - Unexpected Prediction Behavior

    Adam Yansick

      Workbook with dummy data attached. Using Tableau 2018

       

      In the sheet, there is a calculated field called Predicted Impressions that has the following code

       

      SCRIPT_REAL("

      dat <- data.frame(cbind(cost = .arg1, impressions = .arg2))

      model <- lm(impressions ~ cost, data = dat)

      model$fit

      ",

      SUM([Cost]),

      SUM([Impressions])

      )

       

      When run in R, I produce expected results (this is just randomly generated data, so it won't make actually make sense). When I run this code in Tableau, I run into an issue I can't puzzle out:

       

      The aggregated view shows the same number for Impressions and Predicted Impressions. This makes sense (and the number matches the sum found in R via sum(model$fit). However, when I dis-aggregate the data, I am left with a very curious output (I expanded the columns to exaggerate the issue):

      Why does the view show up like this? I expected an output that aligns one predicted value per one actual value and one explanatory variable. Even if I shrink the columns to be only width 1 each:

      The predicted values do not match up to the values in the R output (which generates what I want to see). For instance, in R the impressions of 7314 has a predicted value of 1528.449 and a cost of 7.8871631. 

       

      Can anyone explain this behavior to me and if there is something wrong with my code that does not produce the expected output?

        • 1. Re: R Linear Model - Unexpected Prediction Behavior
          Patrick Van Der Hyde

          Hello Adam,

           

          The .twbx file looks to be dependent on your R server so it will not meaningfully open.  My guess here is that the disagregation is causing every instance of a value in the dataset to be returned once - like a Dimension.  is there a reason for using that option?

           

          Patrick

          • 2. Re: R Linear Model - Unexpected Prediction Behavior
            Jonathan Drummey

            Hi Adam,

             

            There are a couple of things going on here. I can say that Tableau is actually computing accurate results so there's no particular issue with your R code, here's a view with Analysis->Aggregate Measures->Off and a calculation to show what goes with what for each record and we can see the correct value of Predicted Impressions:

             

            Screen Shot 2019-01-04 at 5.16.57 PM.png

             

             

            This view that you had questions about is this one:

            pastedImage_9.png

             

            This view is *not* showing what goes with what, instead it's showing the first N values of Cost sorted from highest to lowest and then the rest are indicated by the ellipses in the bottom right, then the first N values of Impressions from highest to lowest and then the ellipses and so on.

             

            This is due to Tableau's layout engine. Fundamentally Tableau lays out views based on the pills in the view, their values, and some internal logic (more on the logic below). What's happening in this view is that Measure Names on Columns and Measure Values are on text, and Analysis->Aggregate Measures is Off. So with this view there is no pill on Rows to tell Tableau how to vertically lay on the marks and it's using it's internal logic which says "fit as many marks as can fit in a vertical column taking up the whole view and as many columns as will fit given the measure column width" with the sort that I described in the prior paragraph. This logic also explains what's happening in the view in your original post where you'd shrunk the measure column width to a single column, Tableau is still sorting each column from highest to lowest values in that column.

             

            In order to see the "what goes with what" in the first screenshot above I'd added the Order calculated field with the formula RANK_UNIQUE(SUM([Impressions])). With that as a discrete pill on Rows there's now a set of values used to position each mark (i.e.the value of each measure from each row in the data. If you'd like to understand a bit more about how that calculation works, see Show Me *All* The Numbers: Displaying Every Record for Too Tall Data | Drawing with Numbers for more details.

             

            Now a bit more about Tableau's internal logic: It's generally based on the principles/assumptions that a) we've got a lot of records ("a lot" meaning "more than we can or want to manually analyze & review", it could be 5 records or 5 billion), b) want to aggregate across records, and c) that the data has sufficient dimensions/categorical variable(s) (record IDs and/or combinations of fields that generate unique keys) to control the grouping and display of the data at whatever level of detail we need. This comes from Tableau's background as a tool for interactive visual analytics of large data sets, where the data sets are typically databases that meet all those criteria. c) in particular is something that tools like R, Python, Stata, etc. don't really care about because of their iterative process of 1) compute record by record over a data set or result set, 2) generate a result set, 3) repeat. They are starting out as a record-level result and then aggregating to other levels of detail as desired. Whereas Tableau's more database-oriented model is doing set-based operations so if we want to do record-based work then we need key field(s) (i.e. principle c) to tell Tableau to compute over sets of 1 record each.

             

            Coming back to this particular issue: It provided dataset works well with principle a) and b) but does not have any unique key or combination of keys c). So it's a little more challenging to work with in Tableau at a record-level because we don't have any field(s) we can use as dimension(s) in the view to tell Tableau to compute results at a record level. Instead we have to use Analysis->Aggregate Measures->Off in order to work with record-level values and then (for the layout) add the calculated field to get the view to render in a way that makes sense.

             

            Unfortunately Tableau does not have a function to automatically add row/record IDs to the data, here's a feature request for that in Tableau Desktop https://community.tableau.com/ideas/3345 and another request for Tableau Prep https://community.tableau.com/ideas/8714 .

             

            I've attached a workbook with the Order calculated field. Hope this helps!

             

            Jonathan

            • 3. Re: R Linear Model - Unexpected Prediction Behavior
              Adam Yansick

              Thanks Jonathan! This is a very useful explanation. It also helps that in the balance of time between the original post and your response, I worked my way through your Table Calculation wiki which gave me a much broader base of understanding how to integrate R with Tableau. Great work on that one!

              • 4. Re: R Linear Model - Unexpected Prediction Behavior
                Jonathan Drummey

                I'm glad it was useful for you!