1 2 3 Previous Next 92 Replies Latest reply on Dec 18, 2017 5:42 PM by Yuriy Fal

    How to count distinct users on a running period

    Alexandre DAVID

      Hi,

       

      I'm currently trying to count how many distinct users are involved in my data on the last 7 running days (for each day).

      In other words, for today, I need distinct users from today until day-6.

      For yesterday, I need distinct users from yesterday until day-7 and so on.

       

      My data is structured as 1 line per unique combination of 1 user and 1 date (you can see the structure in the workbook attached).

       

      Basically, I am able to get a daily COUNTD([UserId]) which gives me this information on a daily basis (as my data is also structured daily).

       

      Here is an example of data from 09/01 to 15/01 :

      Day 0 : 22 distinct users

      Day -1 : 7 distinct users

      Day -2 : 1 distinct users

      Day -3 : 4 distinct users

      Day -4 : 6 distinct users

      Day -5 : 23 distinct users

      Day -6 : 14 distinct users

      Sum of day -6 to 0 is 22+7+1+4+6+23+14 = 77.

       

      My next step was to make this calculation on the running last 7 days.

      For this I used WINDOW_SUM(COUNTD([UserId]),-6,0).

      It seems that if a same user is present on several distinct days, it will be counted for each day of presence instead of only once.

      So using my example above, I will always count 77 users on day 0 (representing the sum of distinct users from day -6 to day 0) while I should never have more than 69 which is my total users LTD.

       

      I am not sure if I am using the right function improperly of if this should be done totally differently.

       

      Can you help me with this ?

        • 1. Re: How to count distinct users on a running period
          Joshua Milligan

          Alexandre,

           

          This is a very interesting issue.  Up until now, I would have thought that WINDOW_SUM(COUNTD()) would return the distinct count as a sum.  But with this example, it's obvious that it doesn't and I now ralize that it's obvious why.

           

          The reason is that the COUNTD() is executed first at whatever level of detail is present in the view (in your case, day).  Then the WINDOW_SUM is then simply summing the values for the window.  But it is not recalculating the COUNTD for the entire window, just adding together all the different values.  So if you had 5 distinct users on day 1 and 4 distinct users on day 2, then WINDOW_SUM is going to give you 9.  But in reality, you could have anywhere from 5 to 9 distinct users.

           

          So, although I now understand the issue, I don't have a perfect solution.  (I'm interested in what others such as  Jonathan Drummey  or Shawn Wallwork would add.)

           

          One thing you could potentially do is

          TOTAL(COUNTD([Login]))

          partitioned on the week of each day.  But that will give you a static window and I realize you were looking for a moving window instead.

           

          Joshua

          • 2. Re: How to count distinct users on a running period
            Alex Kerin

            We need a window_countd but I cannot get the solution to that (http://community.tableau.com/thread/109935) to work here.

            The rolling part is my challenge - without it something like countd(str(datetrunc('week',[date]))+[Login]) would work

             

            Edit. No that doesn't even work because you want to show the dates as well... I pinged Richard on this as well.

            • 3. Re: How to count distinct users on a running period
              Jonathan Drummey

              Hi,

               

              I started to take a swing at this, but didn't have the time to do more. The challenging bit here is that there are really 7 different countds that need to be done, so we can't use a regular countd (unless the data is reshaped), we have to build one ourselves. Here are a couple of related posts from my notes that I started working with:

               

              A running countd (see the Ross Bunker):

              http://community.tableau.com/thread/119179

               

              running sum of different count distinct

              http://community.tableau.com/thread/119993

               

              Jonathan

              • 4. Re: How to count distinct users on a running period
                Alex Kerin

                It's funny - I think I have a calc that provides whether the ID is new to the last 7 days:

                 

                if datediff('day',lookup(attr([Connection Date]),-1),attr([Connection Date]))<7 and

                lookup(attr([Login]),-1)=attr([Login]) then "Logged in within week" else "new" end

                 

                Note the compute using order which is critical

                countd.png

                 

                but expanding this to a chart and getting a count I cannot achieve:

                • 5. Re: How to count distinct users on a running period
                  Alexandre DAVID

                  Thank you all for your replies

                  I'm still trying on this one but so far I'm stuck with either the "running" or the "distinct".

                  Now trying to understand and reuse/adapt the formula Alex mentioned above : not easy either

                  • 6. Re: How to count distinct users on a running period
                    Richard Leeke

                    Here's a link to an old thread on the same sort of question.

                     

                    http://community.tableau.com/message/139101#139101

                     

                    I'm reading this from my phone,  so haven't had a chance to look carefully,  but the question of running distinct count has come up several times before.  So with the help of Dustin's new search toy I've hunted down one example of a previous thread. That has links to a few others. I think I came up with a custom SQL approach and Joe did a table calculation version if I remember rightly.

                     

                    Anyway,  that link at the top is a thread with some more links (I couldn't get my phone to paste where I wanted,  so gave up on trying to paste any more links).

                     

                    It's also worth just trying a search for "running distinct count" using the new search tool.  There is a link to it in Shawn's latest digest - I couldn't paste that,  either!

                    • 7. Re: How to count distinct users on a running period
                      Shawn Wallwork

                      Richard sometimes I wonder why we ever buy phones at all!

                       

                      --Shawn

                       

                      PS: Hopefully your holiday season will end soon and you can set us straight from some sort of regular keyboard.

                      • 8. Re: How to count distinct users on a running period
                        Richard Leeke

                        I looked back through all of those old threads and had a bit more of a think about this one.

                         

                        I think this is a really good example of a type of problem which is unreasonably difficult to do with table calculations. The requirement is very easy to define, and the problem can be solved very easily in SQL. Depending on the data source (i.e. if it's a non-JET SQL data source and not a data extract) I would definitely do this with either a custom SQL connection or a RAWSQL expression. Those approaches are much simpler than any of the convoluted table calculation approaches discussed in the various threads and would also scale much further than an approach that relies on calculating the distinct count in Tableau after returning all rows. I talked about trialling the SQL approach successfully on 20 million rows of data in one of those old threads, which is way more than the table calculations would manage.

                         

                        In one of the threads referenced above Jonathan and Ross were having lots of trouble getting this to work using table calculations to do the distinct count. I actually wonder if this may have got harder than it was in those old threads I pointed to - I suspect that some of the issues Jonathan and Ross were hitting to do with inconsistent padding behaviour depending which shelves are used may have been introduced more recently.

                         

                        So just for the **** of it, I had a go at a different approach using the sample workbook in this thread, which uses a data extract and therefore doesn't have the option of SQL. So I decided to see if I could get it to work using the data engine's COUNTD() function, rather than the roll your own approach to count distinct. I managed to do that, which should mean it scales well, but it's a bit clunky.

                         

                        What I wanted to achieve was some way of using the TOTAL() function to evaluate the COUNTD() partitioned over the last 7 days. But because of the rolling 7 days, it's not just a matter of dropping a weekly grid over the data - you need a 7 day long grid that "slides".

                         

                        The way I did that was with a whole load of calculated fields, with a set for each day of the week, so that I can set the partitioning on the TOTAL(COUNTD()) differently for each day of the week. I then have a conditional statement which picks the appropriate one of the 7 instances. If you look at the sheet individual logins by day and play with parameter, the colour coding of the logins shows the grouping.

                         

                        There are also two sheets showing the results - one as a bar chart and one as an attempt at a line chart - but for some reason I just can't get the line to draw, even with the help of the Path shelf. Almost everything I put on the path shelf gives the same set of partial lines (even putting [Measure Values] on Path, which makes no sense to me). Anyone got any idea what is happening here?

                         

                        Obviously this approach only works for the particular case of 7 days. It kind-of feels to me as if the approach could be made more general if table calculations allowed cross referencing of rows. I've lobbied for another addressing function ("THIS()") to complement FIRST(), LAST() and INDEX() in the past. Essentially this would be the equivalent of the cross-referencing you can do in sub-queries in SQL. This is why the problem is so much easier in SQL.

                        • 9. Re: How to count distinct users on a running period
                          Alex Kerin

                          Nice job Richard. I got closer today with at least a table showing new users this rolling week, but could not aggregate it as a running sum.

                          • 10. Re: How to count distinct users on a running period
                            Joe Mako

                            Attached would be my initial route, added notes to captions in attached.

                             

                            Be aware, this would not be ideal for a large data source, but it does have some good advantages over some of the other routes shown here.

                             

                            - uses domain completion, and can be combined with domain padding (show missing dates), if not all dates are in the data source.

                            - easy to change the mark type

                            - uses techniques from Richard in window_sum for speed in v7

                            - added a parameter to adjust length

                             

                            Calculation for counting, window based on parameter:

                            In Window:

                            WINDOW_MAX(SUM(1),1-[Window Length],0)

                             

                            Calculation for summing the counter:

                            Moving CountD:

                            IF FIRST()==0 THEN

                            WINDOW_SUM([In Window],0,IF FIRST()==0 THEN LAST() ELSE 0 END)

                            END

                             

                            Calculation for filtering:

                            Is First:

                            FIRST()==0

                             

                            I added in a simpler version of the Moving CountD calc, if the other was too complex.

                             

                            Let me know if you have any questions, or would like additional details.

                             

                            Message was edited by: Joe Mako, removed comments on changes in v8, as I do not know for sure what they are.

                            • 11. Re: How to count distinct users on a running period
                              Shawn Wallwork

                              Nice work Joe!

                               

                              The masterful strokes in Joe's solution are:

                              1. Using FIRST()==0 on the filter shelf to limit the number of login records returned to a single one. (To see the effect, take this off the filter shelf.)
                              2. Using SUM(1) as the expression in the WINDOW_MAX calc.

                               

                              I've attached a trimmed down version of Joe's work with just the fields and data necessary to draw the line chart. It helped me better figure out what was going on, it might help others.

                               

                              --Shawn

                              • 12. Re: How to count distinct users on a running period
                                Joe Mako

                                Thanks for trimming the worksheet down. The primary key that makes this work is domain completion, activated because of the table calc use with the compute using setting and dimension pill arrangement. The filter is just to clean up the view. The SUM(1) would likely be better as an AVG(1) though.

                                 

                                If anyone would like to discuss domain completion or domain padding, you are welcome to contact me.

                                • 13. Re: How to count distinct users on a running period
                                  Alex Kerin

                                  The domain completion was what was missing for me. Because that hadn't occurred to me I was stuck using much more convoluted calculations - I had a window sum that went back 6 rows, but couldn't get it to work fully.

                                   

                                  Great work as ever.

                                  • 14. Re: How to count distinct users on a running period
                                    Shawn Wallwork

                                    and dimension pill arrangement.

                                    Joe, I'm not sure what you mean by that. In fact, the pill arrangement was a bit confusing to me until I moved the Login pill to the LOD shelf, then things started to make more sense to me -- because that's actually the role that pill seems to be playing, at least in my mind.

                                     

                                    In the past, when trying to use WINDOW_XXX functions, I had continually run up against the "what field can I use to generate an expression to satisfy the syntax requirement?" It had never occurred to me to slug in a SUM(1). So I'm thankful for your workbook, because now I have a work-around for this. Not earth-shattering for you, but a liberating bit of info for me.  [BTW, for those playing along at home, MIN() MAX() AVG() SUM() COUNT() COUNTD() ATTR() of the number 1, will ALL return the exact same result, so they are all completely interchangeable in this case -- just for the record.]

                                     

                                    The other 'trick' of your's that was a Homer 'duh moment for me was the FIRST()==0 on the filter shelf. I realize this probably has limited use, but I'm sticking it in my pocket anyway. Sorry if I missed the KEY to your solution, (and as you know we'll be talking on Sunday about this). But for the benefit of others, let me continue to play the dunce, or foil...

                                     

                                    Is this what you consider to be the "domain completion":

                                     

                                    Domain Completion.PNG

                                     

                                    Or is it that you're using the WINDOW_MAX function in the other calc field? Domain completion is a short-hand for something I don't completely understand. Could you take one more shot at explaining it? And the diffidence between it and "domain padding"? As well as basic ways of accomplishing both? [In your spare time of course. ]

                                     

                                    --Shawn


                                     


                                     

                                    Message was edited by: Shawn Wallwork

                                    1 2 3 Previous Next