1 2 3 Previous Next 96 Replies Latest reply on Dec 13, 2019 7:02 PM by Yuriy Fal

# How to count distinct users on a running period

Hi,

I'm currently trying to count how many distinct users are involved in my data on the last 7 running days (for each day).

In other words, for today, I need distinct users from today until day-6.

For yesterday, I need distinct users from yesterday until day-7 and so on.

My data is structured as 1 line per unique combination of 1 user and 1 date (you can see the structure in the workbook attached).

Basically, I am able to get a daily COUNTD([UserId]) which gives me this information on a daily basis (as my data is also structured daily).

Here is an example of data from 09/01 to 15/01 :

Day 0 : 22 distinct users

Day -1 : 7 distinct users

Day -2 : 1 distinct users

Day -3 : 4 distinct users

Day -4 : 6 distinct users

Day -5 : 23 distinct users

Day -6 : 14 distinct users

Sum of day -6 to 0 is 22+7+1+4+6+23+14 = 77.

My next step was to make this calculation on the running last 7 days.

For this I used WINDOW_SUM(COUNTD([UserId]),-6,0).

It seems that if a same user is present on several distinct days, it will be counted for each day of presence instead of only once.

So using my example above, I will always count 77 users on day 0 (representing the sum of distinct users from day -6 to day 0) while I should never have more than 69 which is my total users LTD.

I am not sure if I am using the right function improperly of if this should be done totally differently.

Can you help me with this ?

• ###### 1. Re: How to count distinct users on a running period

Alexandre,

This is a very interesting issue.  Up until now, I would have thought that WINDOW_SUM(COUNTD()) would return the distinct count as a sum.  But with this example, it's obvious that it doesn't and I now ralize that it's obvious why.

The reason is that the COUNTD() is executed first at whatever level of detail is present in the view (in your case, day).  Then the WINDOW_SUM is then simply summing the values for the window.  But it is not recalculating the COUNTD for the entire window, just adding together all the different values.  So if you had 5 distinct users on day 1 and 4 distinct users on day 2, then WINDOW_SUM is going to give you 9.  But in reality, you could have anywhere from 5 to 9 distinct users.

So, although I now understand the issue, I don't have a perfect solution.  (I'm interested in what others such as  Jonathan Drummey  or Shawn Wallwork would add.)

One thing you could potentially do is

partitioned on the week of each day.  But that will give you a static window and I realize you were looking for a moving window instead.

Joshua

• ###### 2. Re: How to count distinct users on a running period

We need a window_countd but I cannot get the solution to that (http://community.tableau.com/thread/109935) to work here.

The rolling part is my challenge - without it something like countd(str(datetrunc('week',[date]))+[Login]) would work

Edit. No that doesn't even work because you want to show the dates as well... I pinged Richard on this as well.

• ###### 3. Re: How to count distinct users on a running period

Hi,

I started to take a swing at this, but didn't have the time to do more. The challenging bit here is that there are really 7 different countds that need to be done, so we can't use a regular countd (unless the data is reshaped), we have to build one ourselves. Here are a couple of related posts from my notes that I started working with:

A running countd (see the Ross Bunker):

running sum of different count distinct

Jonathan

• ###### 4. Re: How to count distinct users on a running period

It's funny - I think I have a calc that provides whether the ID is new to the last 7 days:

if datediff('day',lookup(attr([Connection Date]),-1),attr([Connection Date]))<7 and

Note the compute using order which is critical

but expanding this to a chart and getting a count I cannot achieve:

• ###### 5. Re: How to count distinct users on a running period

Thank you all for your replies

I'm still trying on this one but so far I'm stuck with either the "running" or the "distinct".

Now trying to understand and reuse/adapt the formula Alex mentioned above : not easy either

• ###### 6. Re: How to count distinct users on a running period

Here's a link to an old thread on the same sort of question.

http://community.tableau.com/message/139101#139101

I'm reading this from my phone,  so haven't had a chance to look carefully,  but the question of running distinct count has come up several times before.  So with the help of Dustin's new search toy I've hunted down one example of a previous thread. That has links to a few others. I think I came up with a custom SQL approach and Joe did a table calculation version if I remember rightly.

Anyway,  that link at the top is a thread with some more links (I couldn't get my phone to paste where I wanted,  so gave up on trying to paste any more links).

It's also worth just trying a search for "running distinct count" using the new search tool.  There is a link to it in Shawn's latest digest - I couldn't paste that,  either!

• ###### 7. Re: How to count distinct users on a running period

Richard sometimes I wonder why we ever buy phones at all!

--Shawn

PS: Hopefully your holiday season will end soon and you can set us straight from some sort of regular keyboard.

• ###### 8. Re: How to count distinct users on a running period

I think this is a really good example of a type of problem which is unreasonably difficult to do with table calculations. The requirement is very easy to define, and the problem can be solved very easily in SQL. Depending on the data source (i.e. if it's a non-JET SQL data source and not a data extract) I would definitely do this with either a custom SQL connection or a RAWSQL expression. Those approaches are much simpler than any of the convoluted table calculation approaches discussed in the various threads and would also scale much further than an approach that relies on calculating the distinct count in Tableau after returning all rows. I talked about trialling the SQL approach successfully on 20 million rows of data in one of those old threads, which is way more than the table calculations would manage.

In one of the threads referenced above Jonathan and Ross were having lots of trouble getting this to work using table calculations to do the distinct count. I actually wonder if this may have got harder than it was in those old threads I pointed to - I suspect that some of the issues Jonathan and Ross were hitting to do with inconsistent padding behaviour depending which shelves are used may have been introduced more recently.

So just for the **** of it, I had a go at a different approach using the sample workbook in this thread, which uses a data extract and therefore doesn't have the option of SQL. So I decided to see if I could get it to work using the data engine's COUNTD() function, rather than the roll your own approach to count distinct. I managed to do that, which should mean it scales well, but it's a bit clunky.

What I wanted to achieve was some way of using the TOTAL() function to evaluate the COUNTD() partitioned over the last 7 days. But because of the rolling 7 days, it's not just a matter of dropping a weekly grid over the data - you need a 7 day long grid that "slides".

The way I did that was with a whole load of calculated fields, with a set for each day of the week, so that I can set the partitioning on the TOTAL(COUNTD()) differently for each day of the week. I then have a conditional statement which picks the appropriate one of the 7 instances. If you look at the sheet individual logins by day and play with parameter, the colour coding of the logins shows the grouping.

There are also two sheets showing the results - one as a bar chart and one as an attempt at a line chart - but for some reason I just can't get the line to draw, even with the help of the Path shelf. Almost everything I put on the path shelf gives the same set of partial lines (even putting [Measure Values] on Path, which makes no sense to me). Anyone got any idea what is happening here?

Obviously this approach only works for the particular case of 7 days. It kind-of feels to me as if the approach could be made more general if table calculations allowed cross referencing of rows. I've lobbied for another addressing function ("THIS()") to complement FIRST(), LAST() and INDEX() in the past. Essentially this would be the equivalent of the cross-referencing you can do in sub-queries in SQL. This is why the problem is so much easier in SQL.

• ###### 9. Re: How to count distinct users on a running period

Nice job Richard. I got closer today with at least a table showing new users this rolling week, but could not aggregate it as a running sum.

• ###### 10. Re: How to count distinct users on a running period

Attached would be my initial route, added notes to captions in attached.

Be aware, this would not be ideal for a large data source, but it does have some good advantages over some of the other routes shown here.

- uses domain completion, and can be combined with domain padding (show missing dates), if not all dates are in the data source.

- easy to change the mark type

- uses techniques from Richard in window_sum for speed in v7

Calculation for counting, window based on parameter:

In Window:

WINDOW_MAX(SUM(1),1-[Window Length],0)

Calculation for summing the counter:

Moving CountD:

IF FIRST()==0 THEN

WINDOW_SUM([In Window],0,IF FIRST()==0 THEN LAST() ELSE 0 END)

END

Calculation for filtering:

Is First:

FIRST()==0

I added in a simpler version of the Moving CountD calc, if the other was too complex.

Let me know if you have any questions, or would like additional details.

Message was edited by: Joe Mako, removed comments on changes in v8, as I do not know for sure what they are.

1 of 1 people found this helpful
• ###### 11. Re: How to count distinct users on a running period

Nice work Joe!

The masterful strokes in Joe's solution are:

1. Using FIRST()==0 on the filter shelf to limit the number of login records returned to a single one. (To see the effect, take this off the filter shelf.)
2. Using SUM(1) as the expression in the WINDOW_MAX calc.

I've attached a trimmed down version of Joe's work with just the fields and data necessary to draw the line chart. It helped me better figure out what was going on, it might help others.

--Shawn

• ###### 12. Re: How to count distinct users on a running period

Thanks for trimming the worksheet down. The primary key that makes this work is domain completion, activated because of the table calc use with the compute using setting and dimension pill arrangement. The filter is just to clean up the view. The SUM(1) would likely be better as an AVG(1) though.

If anyone would like to discuss domain completion or domain padding, you are welcome to contact me.

1 of 1 people found this helpful
• ###### 13. Re: How to count distinct users on a running period

The domain completion was what was missing for me. Because that hadn't occurred to me I was stuck using much more convoluted calculations - I had a window sum that went back 6 rows, but couldn't get it to work fully.

Great work as ever.

• ###### 14. Re: How to count distinct users on a running period

and dimension pill arrangement.

Joe, I'm not sure what you mean by that. In fact, the pill arrangement was a bit confusing to me until I moved the Login pill to the LOD shelf, then things started to make more sense to me -- because that's actually the role that pill seems to be playing, at least in my mind.

In the past, when trying to use WINDOW_XXX functions, I had continually run up against the "what field can I use to generate an expression to satisfy the syntax requirement?" It had never occurred to me to slug in a SUM(1). So I'm thankful for your workbook, because now I have a work-around for this. Not earth-shattering for you, but a liberating bit of info for me.  [BTW, for those playing along at home, MIN() MAX() AVG() SUM() COUNT() COUNTD() ATTR() of the number 1, will ALL return the exact same result, so they are all completely interchangeable in this case -- just for the record.]

The other 'trick' of your's that was a Homer 'duh moment for me was the FIRST()==0 on the filter shelf. I realize this probably has limited use, but I'm sticking it in my pocket anyway. Sorry if I missed the KEY to your solution, (and as you know we'll be talking on Sunday about this). But for the benefit of others, let me continue to play the dunce, or foil...

Is this what you consider to be the "domain completion":

Or is it that you're using the WINDOW_MAX function in the other calc field? Domain completion is a short-hand for something I don't completely understand. Could you take one more shot at explaining it? And the diffidence between it and "domain padding"? As well as basic ways of accomplishing both? [In your spare time of course. ]

--Shawn

Message was edited by: Shawn Wallwork

1 2 3 Previous Next