12 Replies Latest reply on Nov 20, 2011 11:03 PM by Joe Mako

    Rolling Median across Months (with variable nr of items)

      Hi. This is a tough nut, but I hope it is solvable with Tableau (even if it requires some data transformation magic beforehand).

       

      My data looks like the first two columns of this sheet: http://bit.ly/v7SFyL I.e. I have got 0-n scores per month, and for each month I would like to calculate the rolling median across the last 3 months.

       

      I know that WINDOW_MEDIAN gives me the rolling median, but I am struggling with advising the formula on how many previous rows to consider (as each month might have a different nr of rows). Does someone have a creative idea of how to solve this?

       

      many thanks! michael

        • 1. Re: Rolling Median across Months (with variable nr of items)

          in case this is of help, here is a packaged workbook with some dummy data attached.

          • 2. Re: Rolling Median across Months (with variable nr of items)
            Andy Cotgreave

            Hi Michael

            I've got part of the way, but don't yet have a solution. I can work out how many rows back I need to count, but the correct rolling median calc is still eluding me. Attached is what I have so far - anyone else got any ideas?

            Andy

            • 3. Re: Rolling Median across Months (with variable nr of items)
              Richard Leeke

              Take Andy's approach, sprinkle lightly with data blending, tear your hair out for half an hour and hey presto...

              • 4. Re: Rolling Median across Months (with variable nr of items)
                Andy Cotgreave

                Richard, that's amazing. I can't imagine how you imagined doing it that way. It got me thinking, could we do something similar using Custom SQL rather than multiple blends?

                [that's a rhetorical question; i don't expect you to work out an answer!]

                • 5. Re: Rolling Median across Months (with variable nr of items)
                  James Baker

                  I'm definitely going to remember this one for my bag of tricks.

                  • 6. Re: Rolling Median across Months (with variable nr of items)
                    Richard Leeke

                    Andy: I got that trick of offsetting the data blend date from a posting from Joe some while back (who else?).

                     

                    I'm fairly sure you're right that you could use either Custom SQL or a RAWSQL expression to derive the required row count.  I'd have done it now if the data source had the original month as a date, but as it's a string it would mean embedding the expressions necessary to do date arithmetic on a string in the Custom SQL or RAWSQL, which would be a lot of effort.

                    • 7. Re: Rolling Median across Months (with variable nr of items)

                      Guys, you are amazing. Thanks for all the torn out hair! So, it is possible!

                       

                      The problem is: I don't fully understand the solution. And I'm afraid even if i do now, the next time that me or someone else has to change sthg, he/she will face similar difficulties. What I am trying to say is, that complex solutions come at their price over the long run. And secretly I was hoping that there is a way within Tableau to simply tell it: Calculate across dimension X, but consider dimension Y for counting the window.

                       

                      Would the solution become easier if the data source would already contain certain attributes, e.g. records per month, or a running index number for each record? I can pose these kind of requirements to the data preparation folks.

                      • 8. Re: Rolling Median across Months (with variable nr of items)
                        Richard Leeke

                        Your point about the complexity and therefore the difficulty of maintenance is quite true.  I often go back to workbooks I've done in the past and find it takes me a while to figure out what is going on - and it's a lot harder when it is something which someone else has done.

                         

                        Certainly preparing the data to make the calculations easier is an option worth thinking about - particularly if this is a workbook which you are going to need to maintain long term.  I don't often take that approach, but most of my analysis tends to be short term, throw-away workbooks (i.e. answer a specific question and then move on), but you will very often see Joe Mako advocating data transformation with an ETL tool prior to Tableau, for this very reason.

                         

                        In this case there are a number of ways that you could build derived data into the data source to help with this specific calculation.  But there is a trade-off.  The more you pre-calculate to support a particular piece of analysis, the less flexible your solution would become.  Off the top of my head, some options I can think of (but I haven't gone back and looked at the workbook, so don't take these as gospel) are:

                         

                        1) Add a count of records per month to each row.  That wouldn't actually help much (or even at all) - you would still need to lookup the numbers for prior months, and I don't think it would even get you past the problem Andy was having with NULL values for prior months in the first two months of data, so I think you would still need to use the data blending approach I used to get past that.

                         

                        2) Add a total count of records in the 3 months of interest to each row.  That would avoid the data blending, but the analysis would now be locked into 3 months and you wouldn't be able to apply any filters across the rows to be analysed.  There would still be a couple of simple table calculations involving INDEX() in order to work out the exact offsets for the WINDOW_MEDIAN() function.

                         

                        3) You could pre-calculate the offsets needed for the WINDOW_MEDIAN() for each row.  That would mean you also had to ensure that the sort order used for the pre-calculation matched the sort order in your workbook.

                        • 9. Re: Rolling Median across Months (with variable nr of items)
                          Andy Cotgreave

                          Here's a great solution from Marc Rueter - genius Director of Sales Consulting. And it's elegant too. And once you wrap your head round it, it's simple in one of those "Well, it's obvious now I see it" kinds of ways. (it's v7, I'm afraid, so those not on the beta programme will have to wait until we release v7) - sorry.

                          Andy

                          • 10. Re: Rolling Median across Months (with variable nr of items)
                            Richard Leeke

                            Yes, that's much better than my one.  I never think of using "At the level"or "Restarting every".  I was trying to think how to parameterise the number of months, like that - much better.

                             

                            I can't see the reason for [start index] to be defined as "IIF([index]=1,index(),NULL)" - just "index()" seems to work fine.

                             

                             

                            It's not using any version 7 features that I can see - I presume Marc just happened to use version 7.

                            • 11. Re: Rolling Median across Months (with variable nr of items)
                              Andy Cotgreave

                              Re: the index formula check for NULL - I expect that's just part of Marc's work-in-progress and yes, you could change it.

                               

                              There's nothing specific to v7 in there, but it's what's being used internally most of the time now!

                              • 12. Re: Rolling Median across Months (with variable nr of items)
                                Joe Mako

                                Here would be my route:

                                 

                                1. add another sheet with a list of months of interest

                                2. use custom SQL like (creating an extract):

                                 

                                 

                                SELECT [Data$].[Month] AS [Data_Month],
                                
                                   [Data$].[Score] AS [Score],
                                  [Months$].[Month] AS [Month]
                                FROM [Data$], [Months$]
                                


                                 

                                3. turn the Month fields into actual dates with a formula like:

                                 

                                 

                                DATE(RIGHT([Month],LEN([Month])-7)+"/1/"+LEFT([Month],4))


                                 

                                4. create an Integer parameter named "Moving Months"

                                5. create a calculated field named "Moving Median" with the formula:

                                 

                                 

                                IF FIRST()==0 THEN
                                
                                   TOTAL(MEDIAN(
                                    IF DATEDIFF('month',[Data_Month (Date)],[Month (Date)]) < [Moving Months]
                                    AND [Data_Month (Date)] <= [Month (Date)] THEN
                                      [Score]
                                    END
                                  ))
                                END
                                


                                 

                                6. place "Month" on the Rows shelf, "Data_Month" on the Level of Detail shelf, and "Moving Median" on the Text shelf

                                7. set the Compute using for "Moving Median" to "Data_Month"

                                 

                                IMHO I believe this is a more optimal approach.

                                 

                                Here is a breakdown of what I am doing in each step:

                                 

                                1. This allows you do display for months that don't have data records

                                2. This is a Cartesian product of rows from tables in the join, it will produce rows which combine each row from the first table with each row from the second table. This makes every data record available to every date of interest. If you knew it was always going to be three months, you could filter at this point instead of filtering in the formula.

                                3. I convert your month text to dates so I can later perform data arithmetic

                                4. The parameter allows you to change your moving range

                                5. The innermost is the field "Score", and it is only provided to the next level up when the difference between the dates is within range. The next level is the TOTAL(MEDIAN()) and this will provide a median of all Scores provided to it. The final outer layer is a visual cleanup that will only return one mark per Month.

                                6. The layout is very flexible with this solution, and you could arrange the pills any way you want. If you add additional dimension pills they will add partitioning.

                                7. This will give you a median across all Data_Months within each Month making it basically a subtotal of the formula:

                                 

                                 

                                MEDIAN(
                                
                                     IF DATEDIFF('month',[Data_Month (Date)],[Month (Date)]) < [Moving Months]
                                    AND [Data_Month (Date)] <= [Month (Date)] THEN
                                      [Score]
                                    END)
                                


                                 

                                See http://www.thedatastudio.co.uk/blog/the-data-studio-blog/groundwork-for-custom-table-calculations for some more details on table calculations.

                                 

                                I believe this is a better solution because it would scale better with larger data, taking advantage of the aggregation benefits of the Tableau Data Engine, and therefore be more responsive on an interactive dashboard. Plus I think it is simpler, easier to follow what is happening in the formulas, allows for the pills to be rearranged or other dimension pills added to partition.