1 2 3 Previous Next 32 Replies Latest reply on Oct 13, 2016 5:11 AM by Anshul Gupta

    Table Calculation: Distinct Count

    Richard Leeke

      Name: Distinct Count

      Alternate names:

       

      Summary:

      Calculates the number of distinct values of a field within a partition.  This is a table calculation equivalent of the COUNTD() database function, which may be required either because the result needs to be used in a calculation at a different level of detail to the level at which it is calculated, or because the datasouce does not support the COUNTD() function (i.e. if the datasource is the Microsoft JET engine).

       

      Definition:

       

      
       WINDOW_SUM(IIF(ATTR(A)==ZN(LOOKUP(ATTR(A), -1)), 0, 1))
      
      


       

      Compute Using: Ordered by A ascending.

       

      Inputs and Setup:

      A: The field for which to count distinct values.

       

      Example:

       

      
       WINDOW_SUM(IIF(ATTR([CustomerID])==ZN(LOOKUP(ATTR([CustomerID]), -1)), 0, 1))
      
      


       

      Commentary:

      Works by adding 1 for each value of A for which the value of A in the previous row (sorted by A) was different (or undefined, in the case of the first row).

       

      There are a couple of examples of this calculation in this forum thread.

       

      Note that currently (as at 6.0.4) the performance of the window functions such as WINDOW_SUM() degrades badly for large partitions, due to an unnecessary sort of the window for each row of the partition.

       

      In cases such as this where the window function is evaluated over the entire partition, returning the same value for every row in the partition, there is a workaround to avoid the performance issue with large partitions.

       

      The workaround is to evaluate the result for the first row of the partition and then copy the result for every other row in the partition.  There is no way to prevent the window functions from performing the sort for the other rows, but as the result will be ignored for all other rows, the window size can be forced to a single row, to minimise the overhead.  Here is the optimised version:

       

      
       PREVIOUS_VALUE(WINDOW_SUM(IIF(ATTR(A)==ZN(LOOKUP(ATTR(A), -1)), 0, 1), 0, IIF(FIRST()==0, LAST(), 0)))
      
      


       

       

      Note that if the database engine supports the COUNTD() function, an alternative approach to evaluating a distinct count, avoiding the use of window functions but still allowing the result to be used in a calculation at a different level of detail, is to use data blending to connect to a copy of the datasource, linked at the relevant level of aggregation and using the COUNTD() function.

        • 1. Re: Table Calculation: Distinct Count
          Michel Baldellon

          Great idea !

          I tried to make a count distinct of Customer Id (string) and I got the following error message :

          "ZN should be only called with Integer of Float".

          Any idea how to solve that ?

           

          Regards. Michel

          • 2. Re: Table Calculation: Distinct Count
            Richard Leeke

            Whoops, sorry, I should have thought of that.

             

            You just need to replace the ZN() with IFNULL(), specifying a string value which will never occur in this field in your data, to substitute for the null that will be returned by the LOOKUP() for the first row of the partition (because there is no previous row.  So, for example, if you know that know rows can ever contain the empty string you can use ("") for the null case, like this:

             

            WINDOW_SUM(IIF(ATTR(A)==IFNULL(LOOKUP(ATTR(A), -1),""), 0, 1))

             

             

            Actually this highlights a corner case I should have discussed.  In the example I quoted with a numeric field, I implicitly assumed that that field could never be zero.  If it was, and zero would be the first row returned by the sort, the result would be wrong (it wouldn't count the value 0 because it would appear to be the same as the (non-existent) previous row.  Ditto in your string case if you substitute the null with a value that appears in your data.

             

            So on reflection, I think it's better to express the whole thing like this, to avoid making any assumptions about possible values:

             

            WINDOW_SUM(IIF(FIRST()==0, 1, IIF(ATTR(A)==LOOKUP(ATTR(A), -1), 0, 1)))

            • 3. Re: Table Calculation: Distinct Count

              So Richard I have a challenge for you: Can you think of a way to implement a SUM_DISTINCT calculation?

               

              So given SUM_DISTINCT(10, 10, 10, 15, 15, 66, 66, 66, 66, 100, 100) the result would be 191

               

              Is there a way to do this?

              • 4. Re: Table Calculation: Distinct Count
                Richard Leeke

                Sure.  Just replace the 1 (used for counting) with the actual value in the above table calculation.  Note that the first row is special-cased, so there are actually 2 occurrences of 1 to replace.Like so:

                 

                WINDOW_SUM(IIF(FIRST()==0, ATTR(A), IIF(ATTR(A)==LOOKUP(ATTR(A), -1), 0, ATTR(A))))

                 

                And then that whole expression needs to be wrapped up in the PREVIOUS_VALUE() nonsense if you want it to perform OK on large partitions.

                 

                Caveat: I've looked at this for 30 seconds and not checked what I've just typed.  I'll try to find time to check it later - and also expand out the optimised version.

                • 5. Re: Table Calculation: Distinct Count

                  Hi Richard,

                   

                  Actually, I tried that before posting here... what I'm seeing is that the when you replace the 1 with ATTR(A), if A has more than one distinct value it returns *

                   

                  So plugging 10, 10, 10, 15, 15, 66, 66, 66, 66, 100, 100 into the formula above gets you a *

                   

                  The attached workbook illustrates what I mean.

                   

                  It's entirely possible (and quite likely) that I'm mis-using the ATTR() function... what am I missing?

                   

                  Cheers,

                  Jamie

                  • 6. Re: Table Calculation: Distinct Count
                    Richard Leeke

                    Sounds as if you haven't got the partitioning right.  I'll look later and confirm - let me know if you get it sorted in the meantime.

                    • 7. Re: Table Calculation: Distinct Count

                      Hi Jamie,

                       

                      Table calcs are really dependent on table order / layout. For your sample data, I'd have two columns:

                       

                      Column 1 would be an arbitrary list (used to identify your numbers): a, b,c,d,e,f, etc.

                      Column 2 would be the numbers you'd want to sum: 10, 10, 10, 15, 15, 66, 66, 66, 66, 100, 100

                       

                      In the table layout, drop column 1 on the row shelf and drop column 2 on the text shelf

                       

                      Then, just use your table calc (window_sum) and drop it either on the level of detail shelf and/or on the text shelf (compute using Table Down).

                       

                      Hope that helps,

                      Mike

                      • 8. Re: Table Calculation: Distinct Count
                        Joe Mako

                        here is my response to the much repeated question of getting the sum of just the distinct values in a list of numbers:

                         

                        The specified item was not found. (with the attachment sum_distinct.twbx)

                        • 9. Re: Table Calculation: Distinct Count
                          Richard Leeke

                          Just expanding on Mike's explanation of why the table calculation was returning a '*' and what to do about it.  I'll comment separately on whether or not this is a good way of doing what you want to do it the other thread Joe pointed to.

                           

                          The key thing to understand is that table calculations operate on the results returned to Tableau from the database - which means that you need to ensure that you haven't "aggregated out" the detail needed by the table calculation before it gets its hands on the data.

                           

                          When you place dimensions and measures on the worksheet, Tableau constructs a query which returns a single aggregated value for each measure, for each distinct combination of dimension values which exist in the database (and meet any filter conditions).  Table calculations then operate over the dimensions and aggregate measures returned by this query.

                           

                          Typically the aggregate functions we use are things like MIN(), MAX(), AVG() and SUM().

                           

                          For certain table calculations, such as this one, we actually want all of the detail returned, so that the table calculation can examine all the underlying values.  To do that we just need to include enough dimensions to uniquely identify each row.  That's where Mike's point about really needing a unique identifier in the underlying data comes in - there is actually no way to get all the rows back if we can't distinguish between them.  (In fact, in the example data you provided, we can still get the answer you want without that unique ID, but I'll come back to that.)

                           

                          Assuming that we do have a combination of dimensions which uniquely identifies each row, it actually wouldn't matter which of those aggregate functions we used - MIN(), MAX(), AVG() and SUM() of a single value all give the same answer.  But if we accidentally fail to supply enough dimensions, the answers will be different depending what we chose (and probably wrong in all cases).  But that won't be immediately obvious, because the result returned will be a plausible looking number.

                           

                          That's where the special ATTR() function comes in.  If a field only has a single value for all rows being aggregated, the ATTR() function returns that value.  If there are multiple values it returns a '*'.  So in cases like this I always use ATTR() rather than one of the other aggregate functions because if you see a '*'; it immediately tells you that you must have the partitioning wrong.  That's how come I was so confident to tell you the partitioning was wrong without even looking at your example workbook yesterday.

                           

                          I've attached a modified version of that example with [Sheet 2] set up to try to demonstrate everything I've been talking about.  By adding [Test Data] to the level of detail shelf I've returned one row for each distinct value of [Test Data] for each [Group].  Note that without a unique identifier for every row there is actually no way of returning every underlying row.  In this case that doesn't actually matter, we can still get the sum of the distinct values that you are after, but a unique identifier would give the option of getting at all rows if you need to.

                           

                          In particular, look at (and experiment with changing) the partitioning definitions for the two table calculations (click the drop-down on the green pills and select "Edit Table Calculation...".  I have them set up to operate over all values of [Test Data] for [Group].

                           

                          Hopefully this makes things a bit clearer?

                          • 11. Re: Table Calculation: Distinct Count
                            Richard Leeke

                            Yes, COUNTD() is an aggregate function that is available in the database for most datasources (i.e everything except text files, Excel spreadsheets and Access databases, which use the MS JET engine, which doesn't support COUNT DISTINCT).

                             

                            If you look back at the "Summary" section for my original post at the top of this thread, the reason for suggesting this table calculation was for two cases that I can think of:

                             

                            a) the datasource does not support COUNT DISTINCT

                             

                            b) the view results in a query at a different level of aggregation from the level at which you want to calculate the distinct count.

                             

                            This leads to a couple of other observations.

                             

                            If you actually want the distinct count at a higher level of aggregation than the underlying query, it may be valid just to use COUNTD() to evaluate that at the fine-grained level of aggregation needed for the view for some other reason and then add up the total of those over a partition defined at the higher level of aggregation, using TOTAL().  That's what Joe has done in the latest version of the calculation he posted in that other thread.  I suspect that will probably be quicker than using the approach I'm describing here, simply because the first level of aggregation is done in the database and there are therefore less rows for the table calculation to have to worry about.  (Joe has actually emailed me to propose a speed test, so we'll report back.)

                             

                            But that approach only applies if it makes sense for the data levels of aggregations involved.  So from the example in your other thread, if you count distinct stores by region, you can add those numbers up to get the countrywide distinct count (one store can only be in one region).  But if you count distinct product types sold by region, you can't add those up, because the same product types can be sold in multiple regions.

                             

                             

                            As for a SUMD() function.  Databases don't support that, so it couldn't be supported in the same way as COUNTD().  It's also actually not a need that comes up much in my experience, so I think Tableau are likely to feel that having the ability to do it with table calculations is probably good enough.

                             

                            Edit:

                             

                            I've realised I had misunderstood the TOTAL() function when I wrote the comments about TOTAL() above.  Actually TOTAL(COUNTD()) works fine - irrespective of whether there are overlaps between the dimensions covered - TOTAL(COUNTD()) evaluates the distinct count over the partition defined for the table calculation - rather than adding up the results of COUNTD() for each row in the partition as I had assumed - based on the rather misleading function description text in the calculation edit window.  (TOTAL() also isn't mentioned anywhere in the online help, by the way.)

                            • 12. Re: Table Calculation: Distinct Count
                              Shawn Wallwork

                              Richard et. al. I know this was finished up months ago, but I just ran into a similar problem with COUNTD() not being available because of no Jet support. I mentioned this to Francois and he sent this: "It's based on the capabilities of the data source. Create an extract from your xls and voila! COUNTD() appears."

                               

                              So if you can work with extracts instead of live connections, COUNTD() will always be available.

                              • 13. Re: Table Calculation: Distinct Count
                                Richard Leeke

                                My original description of the reasons you might need this says: "may be required either because the result needs to be used in a calculation at a different level of detail to the level at which it is calculated, or because the datasouce does not support the COUNTD() function (i.e. if the datasource is the Microsoft JET engine)" - but I subsequently realised that the TOTAL() table calculation function solves the level of detail issue.  You're quite right about an extract, too, so the only time I can now see when you might need this is if you need a live connection to a JET data source.

                                • 14. Re: Table Calculation: Distinct Count
                                  Alex Kerin

                                  Richard, pinging you as you are not on my contacts - here's an interesting one - countd over a rolling window...:

                                   

                                  http://community.tableau.com/message/197426#197426

                                  1 2 3 Previous Next