10 Replies Latest reply on Sep 30, 2011 12:21 PM by Richard Leeke

    Running Distinct Count

    Katy Laurance

      I currently have a bar chart that counts distinct Ids by week. I added a table calculation which is a running total of this distinct count. This allows me to show how weekly distinct ids add to the overall total ids.

       

      However, this running total is simply summing the weekly counts, not re-running a distinct count from that week through all the weeks prior. This means that my running total is inflated if an id is recorded in more than one week. If I take a count distinct of ids without the weekly binning, the count is lower than what the running total by week was showing.

       

      Is there a way to calculate a running distinct count of these ids, rather than just summing up bins of distinct ids?

       

      Thanks!

        • 1. Re: Running Distinct Count
          Tom Bash

          I've been having the same problem and haven't found a way to do running distinct counts either. I can do weekly distinct counts using predetermined week numbers, but not for the trailing 6 days, etc.

          • 2. Re: Running Distinct Count
            Richard Leeke

            2 potential ways that I can think of: one using table calculations, one using RAWSQL.

             

            The table calculation approach wouldn't scale as far as the RAWSQL way - but should be good up to tens of thousands of rows, at least.  The RAWSQL way won't work if you are using a data extract (no RAWSQL) or a datasource that uses JET (like a text file or an excel spreadsheet) (JET doesn't support COUNT DISTINCT).

             

            The table calculation approach would be based on the approach to doing count distinct which I explained in the TCRL forum, here:

             

            http://www.tableausoftware.com/tcrl/table-calculation%3A-distinct-count

             

            That would need to be extended to cope with the running sum and bins.  I can elaborate if you get stuck.

             

            The RAWSQL approach would use a subquery which evaluates the count over the entire date range up to the current bin, so would need an expression something like the following (note that the exact syntax varies depending on the back-end database, and I've just typed this off the top of my head without even checking it in any database, so this will doubtless need some tweaking):

             

            RAWSQLAGG_INT("

            SELECT COUNT (DISTINCT [X])

            FROM [T]

            WHERE [DATE] <= %1

            ",

            [DATE_BIN]

            )

             

            This assumes that you want to count distinct values for column [X] from table [T] and you have a column (probably a calculated field using DATETRUNC()) called [DATE_BIN] which is gives the dates of the bins that you want to aggregate over.

            • 3. Re: Running Distinct Count
              Tom Bash

              Thanks, Richard.

               

              I'll have millions of rows so have to go with the second approach.

               

              However, I don't have date bins. I essentially want to create a chart with dates on the x axis (by day) and a count of distinct values for variable x for the trailing 6 and 29 days, including the current day. In this case I think I need [date], %1, and [date_bin] from your example to all be the same field? How would the syntax for that statement look?

               

              Thanks a lot.

              • 4. Re: Running Distinct Count
                Richard Leeke

                Something like:

                 

                RAWSQLAGG_INT(

                "

                SELECT COUNT ([T2].[X])

                FROM [T] AS [T2]

                WHERE [T2].[DATE] BETWEEN (%1 - 28) AND %1

                ",

                [DATE]

                )

                • 5. Re: Running Distinct Count
                  Richard Leeke

                  I just remembered that a very similar question came up a few days ago in this thread.

                   

                  I came up with the same answer that time, too - but also had some comments on how long it might take and a suggestion on an index that would be needed to get even vaguely tolerable performance with large data volumes - so it's worth a read of that thread.

                  • 6. Re: Running Distinct Count
                    Tom Bash

                    Thanks. I think with the volume of data I'm going to have to calculate the field in Vertica before connecting to the database.

                    • 7. Re: Running Distinct Count
                      Lucas Hedgren

                      Richard,

                       

                      I have much less data, and I am attempting to use the table calculation method, to no avail. I need some help with the particulars.

                      I have some support case data, showing it weekly. I'd like to add a running total of distinct client names.

                       

                      I created the calculation as above:

                       

                      WINDOW_SUM(IIF(FIRST()==0, 1, IIF(ATTR(Client)==LOOKUP(ATTR(Client), -1), 0, 1)))

                       

                      For "compute using" I chose the Week field, then Order Along Client, Count, Ascending.

                       

                      Do I need to add the Running Total function to the calculation?

                      Do I need to choose Count Distinct for the Order Along?

                      Thanks for any help.

                      • 8. Re: Running Distinct Count
                        Richard Leeke

                        Hmmmm.

                         

                        Well I had another think about this on the plane this morning and I'm fairly sure I was wrong about being able to do this with table calculations.  This looks to me to be a classic example of the missing addressing function in table calculations that I go on about from time to time.

                         

                        There is a whole class of problems which would be solvable if within the window functions you could reference the values on the row that you are calculating the result for.  As it stands, you can only reference the row in the partition that the window function is currently operating on.  You need the extra capability to be able to do the equivalent of the SQL "correlated sub-query" that I used in the raw SQL approach.

                         

                        There needs to be a 4th addressing function to go with FIRST(), LAST and INDEX(), which lets you reference the target row of the calculation.  I mentioned this to one of the Tableau folk a while back and suggested it should be called THIS() and he confirmed that they had actively considered exacltly that - even to the point of calling it THIS() - but it hadn't made the final cut.  I still think it's needed.

                         

                        Sorry if my earlier comments have caused you to waste lots of time trying to figure out how to do it.

                         

                        There is one last hope.  Usually when someone dares to suggest that something isn't possible on the forums Joe Mako treats this as a personal challenge and responds within hours with an answer.  Consider the gauntlet thrown, Joe.

                        • 9. Re: Running Distinct Count
                          Joe Mako

                          Attached is my method for getting a running count distinct.

                           

                          Have had the chance to look at this method yet Richard? Notice that I do not use the COUNTD() aggregation to compute the final value, so this can be even done with a MS Jet data source (but I do not condone using MS jet as a data source, just saying it is possible).

                           

                          Lucas, if you would like detailed assistance in employing the method I demonstrate, please provide a sample packaged workbook that represents your situation.

                          • 10. Re: Running Distinct Count
                            Richard Leeke

                            Great stuff Joe.  Not one but two better ways than my custom SQL approach - that's really telling me!

                             

                            I think this thread actually demonstrates the single most powerful problem solving technique with Tableau that I know: tell Joe I can't do it.