1 2 3 4 5 6 7 Previous Next 94 Replies Latest reply on Jan 14, 2019 3:50 PM by Yuriy Fal Go to original post
      • 45. Re: How to count distinct users on a running period
        Jonathan Drummey

        Hi Joshua,

         

        Joe Mako and I have been putting something together on Tableau's pipeline, I'm still not confident about parts so I don't feel ready to put the whole thing out to the world just yet. Here's my understanding of the basics as of version 7, I haven't seen whether there are changes in v8:

         

        • Tableau generates the SQL query/ies (typically one per datasource) and packs as much of the calculation, filtering, & computation as it can into them.
        • Inside Tableau, filters on aggregate calcs are applied.
        • Densification happens. Densification depends on pill type, pill arrangement, whether domain padding is being requested via Show Missing Values, compute using settings of table calcs, mark types, data structure/density, etc.
        • Data blending occurs.
        • Compute blended aggregates - These are aggregate calcs in one data source that use fields from the other data source.
        • Filters on blended aggregates are applied.
        • Compute table calcs.
        • Compute grand totals & subtotals.
        • Compute total-based reference lines on regular aggregates, and percentile distribution reference lines for regular aggs and table calcs.
        • Apply filters on table calcs.
        • Filter latitude (generated) and longitude (generated).
        • Manually hide marks.
        • Compute all other reference lines.
        • Pages Shelf.
        • Compute Tooltips, Title, Annotations.

         

        This really hasn't been obtained through talking with developers, mostly through setting up test conditions to see what happens. i.e. to test reference lines, put a table calc and a regular aggregate in the view, set up reference lines, and apply different kinds of filters. It's an ongoing process of discovery, for example just a couple of minutes ago in putting together this post I learned that percentile-based reference lines are not changed by hiding marks or table calc filters.

         

        Jonathan

        • 46. Re: How to count distinct users on a running period
          Richard Leeke

          I promised earlier in this thread to write up some notes on the pros and cons of the various approaches explored here. But then I got distracted into exploring the data densification issues I ran into with one of the approaches, so I haven't got back to it.

           

          Well I've raised a couple of bugs on the data densification with Tableau support (the issue with seeing a different number of marks added by densification when connected to a SQL Server and an extract and also the issue with the number of extra marks changing simply by refreshing the view). On the third issue I didn't understand, which was a whole load of extra densification marks when making one of the dimensions discrete, I've been having a long off-line conversation with Ross Bunker (plus Joe and Jonathan). I put together a workbook which demonstrated that behaviour which Ross has now looked at and explained. He has promised to post his explanation on this thread, so I'll just post the workbook I sent him so that anyone interested can follow along.

           

          As to the pros and cons of the different approaches, I'll just post a very brief summary for now. I can elaborate if anyone is interested.

           

          There are basically three different approaches discussed in the thread, with some minor variations possible on each.

           

          Method 1: Joe's approach with table calculations

           

          This method uses a pair of nested table calculations to evaluate the distinct count in Tableau (rather than in the database). It relies on Domain Completion to allow the table calculations to work correctly and can handle missing dates in the underlying data using Domain Padding ("Show Missing Values").

           

          Pros:

          • Can handle missing data
          • Supports multiple mark types
          • The size of the window can be parameterised
          • Relatively straightforward calculations

          Cons:

          • Considerably slower than the other methods on large data sources (but considerably faster in version 8)
          • Does not scale as far as the other methods (limited by table calculations)
          • Does not allow a continuous date dimension (or at least I can't get that to work)

           

          Method 2: Using RAWSQL to evaluate the distinct count

           

          This method pushes all the work to the database, by using a RAWSQL expression to evaluate the distinct count over the required window for each date.

           

          Pros:

          • Supports multiple mark types
          • Works with discrete or continuous date dimensions
          • The size of the window can be parameterised
          • Simple to define calculations (assuming SQL knowledge)
          • Missing data can be handled in the data source (by joining to a table containing all dates, for example)
          • Fast

          Cons:

          • Does not work for JET datasources (text files, Excel or Access) as JET does not support COUNTD()
          • Does not work with a data extract (no RAWSQL)
          • Requires SQL knowledge

           

          Method 3: Using multiple overlapping partitions to evaluate the distinct count in the database

           

          This method also pushes all the work to the database to evaluate the distinct count, but in a way that allows it to work with the data engine. This picture aims to show how the staggered partitions work:

           

          Staggered Partitions.PNG

          This method uses a rather convoluted set of table calculations, which limits its flexibility considerably. This was the method that caused all those issues with densification - so it obviously takes Tableau outside its comfort zone.

           

          Pros:

          • Fast and scalable
          • Works with discrete or continuous date dimensions
          • Works with a data extract

          Cons:

          • Does not work for JET datasources (text files, Excel or Access) as JET does not support COUNTD()
          • Does not support line mark types (or at least I can't make it join the dots)
          • Does not handle missing data (except by providing the missing dates in the original dataource)
          • Does not allow interval to be parameterised - each extra day in the interval requires several new or changed calculated fields
          • Convoluted, difficult to set up and clunky

           

           

          I think I would normally go for method 2, unless I needed it to work on a JET data source or a data extract. Next choice would be method 1 and I'd only resort to method 3 if the size of the data source made method 1 too slow to be viable.

           

          The attached workbook demonstrates the issue I was having with unexplained extra padded marks. Ross has promised to jump in and explain what that is about.

          • 47. Re: How to count distinct users on a running period
            Shawn Wallwork

            Nice explanation Richard. Question about cons of Method #2, if you did the RAWSQL work before doing the extract wouldn't that get materialized in the extract? Couldn't this make both RAWSQL and COUNTD() work in the same workbook? This would seem to cut the cons down to just knowing SQL.

             

            --Shawn

            • 48. Re: How to count distinct users on a running period
              Richard Leeke

              No,  unfortunately RAWSQL aggregates don't get materialised, only dimensions.

              • 49. Re: How to count distinct users on a running period
                Shawn Wallwork

                Thanks Richard, good to know.

                 

                --Shawn

                • 50. Re: How to count distinct users on a running period
                  Michael Mixon

                  Part of my brain just leaked onto my keyboard...but not before passing on an "a-ha" moment to the remaining brain cells, namely that the table calc you use for In Window works even if there aren't enough rows to make up the designated partition.  In other words, even though you tell it to look at 7 colums (current plus the last 6), it will still return a value even if there are fewer than 7 columns to evaluate (e.g. on 1/9/2013, there are only three columns to evaluate, not 7).  This didn't occur to me before.  As Shawn mentioned earlier, this is another tool that will come in handy.  Thanks.

                   

                  WINDOW_MAX(SUM(1),1-[Window Length],0)

                   

                  mako magic.png

                  -Mike

                  • 51. Re: How to count distinct users on a running period
                    Ross Bunker

                    Richard's workbook shows a couple of issues related to how padding works in Tableau.

                     

                    First, a couple quick notes.  There are three sources of added values:

                    ‘Show Missing Values’ for ‘line and area charts’, ‘Show Missing Values’ for table calcs, and table calc densification (think of this as filling in missing values in a text table).  These three can interact in different ways.

                    Let’s call these axis domain padding, table calc domain padding and table calc domain completion respectively.  Note that i’m just using those terms in this response, we haven’t formally designated any terms, but these relate closely to what's been used recently on the forum.

                     

                    The first thing I’ll note is that of the first four examples, only the ‘Padding (discrete)’ sheet is doing axis domain padding.  ‘Padding (continuous)’ uses a continuous dimension on the axis, so we don’t do axis domain padding.  ‘Padding (continuous)’ does get marks added, but it’s because of table calc domain padding.  Basically, when we do table calcs, we fill in time and numeric bin dimensions if they are in the ordering fields and have ‘Show Missing’ values on.  ‘Padding (discrete)’ probably has both happening, but they are in sync, so the same marks would be added by either one.

                     

                    It seems like the first four examples are basically behaving as expected.

                     

                    The last two sheets have two dimensions, while the previous four only have a single dimension.  This is why things change drastically.

                     

                    In ‘Attempt at padding and and completion (continuous)’, there's a dimension on LOD.  This means we now potentially have a partitioning field.  On this sheet, it is using ‘Cell’ ordering, so there are no ordering fields, which will always prevent any table calc domain padding or table calc domain completion.

                     

                    I thought the ‘Cell’ ordering may have been unintentional so i explored alternatives.  Padding doesn’t occur when switching to ‘Rows’ or either dimension alone.  Table calc domain padding only happens within the partition (i.e. fill in the gaps between dates within the partition). Since each partition in all these cases has only a single row (each dimension is unique for each row in the underlying data) there are never any gaps to fill in.  Table calc domain completion won’t happen either since Table Calcs only do that when partitioning and ordering columns are on opposite shelves (or the page shelf) and here, your other dimension is on LOD (*side note).

                     

                    Side Note (not relevant to this discussion, but wanted to throw it out there):  There is a special case for a time based ordering to be padded against LOD, but only if all ordering fields are time based and that LOD _not_ time (think Date ordering with Market on LOD, all Markets should probably appear on each Date).  This is a special case for dates because we know that dates tend to be orthogonal to all other dimensions (they are self-contained).

                     

                    Interestingly, if you put both Date and Other Date in the ordering fields, you do get the missing values filled in.  Table calc domain padding is responsible in this case.  With both Date and Other Date in the ordering fields, then all marks are in the same partition, so we can fill in the gap on Date (which is the one with ShowMissingValues on).

                     

                    Finally, in ‘Attempt at padding and and completion (discrete)’, we see some very interesting behavior.  In this sheet, Date is discrete, which triggers axis domain padding.  Axis domain padding sees there are no values for 2/1 through 2/7 (remember, 'Show Missing Values' is on for Date).  It wants to fill these in.  When it does so, it needs a value for Other Date (measures are filled in as NULL, but not dimensions).  Currently, we just pick the value from the 'first' row (1/2/2013) as you noticed.  This adds the 2/1 – 2/7 values to the table.  Now, the table calc is going across rows which means we are ordering on Date and partitioning on Other Date.  It has ‘Show Missing Values’ on Date which would trigger table calc domain padding, but as I explained for the (continuous) case, this only happens within a partition.  Since we are partitioning on Other Date, which is uniquely valued for each Date, each partition should have only a single row and no table calc domain padding to do.  But, recall that axis domain padding added values for 2/1 - 2/7 with OtherDate = 1/2/2013 to the table.  This now means that the [Other Date]=1/2/2013 partition for the table calc has 8 rows with values of Date=1/1, 2/1, 2/2, 2/3, 2/4, 2/5, 2/6, 2/7.  So, the table calc domain padding kicks in because Date is an ordering field and we fill in the gap between 1/1 and 2/1 for the partition (1/2).

                     

                    Now, just because I understand it does not mean I condone the mystery or complexity of the behavior.  The thing we've heard loud and clear is that a) this is mysterious, and b) there is a need for more control over the behavior.  Please keep the questions, comments and scenarios coming and we'll keep trying to make the product better.

                     

                    :)ross

                    • 52. Re: How to count distinct users on a running period
                      Matt Lutton

                      This thread is intense. 

                      • 53. Re: How to count distinct users on a running period
                        Noah Salvaterra

                        You've got that right! It is also pretty old, forum wise. I wonder if rank or some other new feature since then has the special sauce for cracking this one. I'd have thrown up my hands and said reshape your data 2 pages ago, actually I did, but I'll probably finish reading through the rest of this later... It is hard to catch up on one of these in one sitting. I'm sure I've seen an idea for window_countd, ah... there it is:

                        Window_CountD

                         

                        N.

                        • 54. Re: How to count distinct users on a running period
                          Matt Lutton

                          Good to see I had voted already. I just started looking through Ross Bunker's posts and was hitting like with a fierceness. I know these guys are seriously busy, but when they contribute, it can be invaluable. I wish some of the folks in the posts would come back and share some more but I know they're all willing to help and I understand "passing the torch" as well.

                           

                          It would be really interesting if someone could go through old posts and revive them with new alternate solutions. Perhaps an admin, if they could find the time? Or maybe a Think Data _____(enter day of week) revival? There's just so much good stuff in the "archives" if this forum...

                           

                          Happy Sunday NS

                          • 55. Re: How to count distinct users on a running period
                            Noah Salvaterra

                            I suppose that could be interesting to see, however, I'm not sure the CBA would be looked upon favorably by stockholders. Maybe limit it to the threads with a minimum number of contributors (or pages). Assuming someone has the ability to search in that way (I'm pretty sure I don't).

                             

                            Another option might be to take a more forward looking approach, keeping track of problems that seem common or reasonable, but whose problem is out of reach, or maybe just inelegant or inefficient even when the heavy hitters jump in. Linking to these threads in ideas that might help address the problem would demonstrate that it is something people are struggling with. It could make for an interesting blog as well.

                             

                            N.

                            • 56. Re: How to count distinct users on a running period
                              Matt Lutton

                              I'm for anything that encourages Forum users to collaborate and explore different approaches to the same problem.  I really liked the Think Data Thursday sessions that I was able to catch.

                              • 57. Re: How to count distinct users on a running period
                                Noah Salvaterra

                                Here is a link to the TDT archives:

                                The specified item was not found.

                                There are a few I haven't gotten to yet and a few I watched so long ago that I'd probably get something new if I had a refresher. If you want to get a study group together to chat about these after watching, I'd be game join in on that.

                                 

                                N.

                                • 58. Re: How to count distinct users on a running period
                                  Matt Lutton

                                  Yeah, I am the same way--and I have no doubt learned more since viewing them; I like that idea a lot as there is plenty to explore on the existing concepts covered In the videos. If we can find some others and and a time that works for everyone, I am all for this. Anyone seeing this feel free to reach out here or via email (see my profile Matt Lutton)

                                  • 59. Re: How to count distinct users on a running period
                                    Nicholas Blankevoort

                                    I have had similar problems, where trying to count unique variables per day ( further summed by week/month/year) and solved it like this

                                    • create a new variable in a calculated field with (unique day values + the variable) to be counted (see below)
                                    • put the new variable in worksheet and used countd(new_variable) in the worksheet

                                     

                                    The new variable in the calculated field would be something like this.

                                     

                                    NEW VARIABLE (as calculated field)

                                     

                                    str(

                                    year([YOUR DATE VARIABLE]) + month([YOUR DATE VARIABLE]) + day([YOUR DATE VARIABLE]))

                                    + YOUR_VARIABLE_TO_BE_COUNTED_PER_DAY]

                                     

                                    this create unique records for what I want to count distinct over a given period.

                                     

                                    By putting this variable in the worksheet and countd(NEW VARIABLE) it solved my problem.