Jonathan's Teachings re: Table Calculations & Granularity

Version 1

    This came from a thread I didn't want to lose:

     

    Taking a step back here, the question is about comparing across records in the underlying data in order to filter out Email IDs that belong to list 123, then count the remaining Email IDs per List ID. This kind of problem is a great practice for me to explain, so I'll take a little extra time to walk through the steps involved to reach a couple of solutions.

     

    A distinct record (row) in the data is made of the Email Address and List ID. So if we want to exclude certain Email IDs based on the presence of a List ID for that record, we have to look across records. In order to make Tableau do that (there's no single magic function to do that kind of filtering, like the other BI tool you mentioned, Roopa) we need to work with the three calculation levels in Tableau:

     

    - Record (or row) level. The calculation is computed for each record, using only values from that record. This can return a dimension or a measure.

    - Aggregate. An aggregate calculation like SUM() is computed across the set of records that are returned based on the distinct values of the dimension(s) in the view. These are always measures.

    - Table calc. Table calculations are aggregates of aggregates, and can be made to compute over the values of the dimensions in the view in most any way we want. In addition, table calculations can be nested. These are always measures.

     

    If we have a view with both the Email Address and List ID dimensions in it, then we end up with a view where we have to set up a calculation that can look across those discrete combinations of dimensions, and the only calculation that can do that in Tableau is a table calculation. Roopa, this is why Shawn was asking you about the final view, because that can change the kind of calculation needed to get results.

     

    With this model, what Shawn had set up is a table calculation in a view with both List ID and Email Address as dimensions that does the following:

    1) Does a row-level calculation to turn the List ID into a string. This actually isn't necessary. Shawn, I'm curious as to where you got the understanding that you had to convert numbers into strings to evaluate them?

    2) Wraps that in the MIN() aggregation. Because the dimensionality of the view is such that it's effectively record-level, the MIN() returns the List ID.

    3) Does a comparison to identify (I think of this kind of calc as a flag) whether the particular List ID is 123, if so then return 1 else 0. Because this is an aggregate calculation, this is computed for every Email Address/List ID combination in the view. I've separated out the calc and put it into the "Without Filter" view in the attached.

    4) Then the WINDOW_SUM() is a table calculation that will sum across all List IDs for each Email Address. This returns the same result for each List ID within each Email Address, I set that calc up as well.

    5) Finally, the result of the WINDOW_SUM is evaluated against 0, to return a boolean result. This can then be used as a filter, as was set up in the With Filter worksheet.

    6) Now comes the sticky part, how to use that to count up all of the remaining Email IDs for each List ID?

     

    I took Shawn's idea with a slightly different approach. First of all, though, I changed the data source to include two additional rows to cover two additional conditions. I added List ID 456 for Email ID z@gmail.com so that Email ID has both Lists that would pass the filter, and added List ID 456 for Email ID x@gmail.com so that Email ID has all three Lists, and should not pass the filter. Note that this changes the desired results to 8 for List ID 456, and still 2 for List ID 789.

     

    Here's the inner part of the revised TC flag calc:

     

    TOTAL(SUM(IF [List ID]=123 THEN 1 ELSE 0 END))

     

    This also has a Compute Using of the List ID, so it restarts for each Email Address. The calc doesn't do the string conversion, and I've put the inner IF statement as a row-level calc. I tend to do that to try to keep calculations at the lowest level possible, among other reasons one is that if the grain of the view changes due to other dimensions coming into it I still have the most predictable results. And I've used the SUM() aggregation in this case because since the inner calc can return 1 or 0, a MIN() would return 0. MAX() would also work. I'm using TOTAL() here because it's almost always faster than WINDOW_SUM(). You can see that in TC Workout #1.

     

    Then for the full TC Flag calc, I used this formula:

     

    IF TOTAL(SUM(IF [List ID]=123 THEN 1 ELSE 0 END)) == 0 THEN 1 END

     

    The IF wrapper then causes this to flip the 1s and 0s, so that only the Email Addresses we want to count are 1s, and you can see that in the TC Workout #2 worksheet. I'm not putting an ELSE 0 on this calc because at this point I'm not needing those other email addresses at all, so why assign values to something that is supposed to be filtered out?

     

    Now we have a flag calculation that only returns 1 for the right values, we can sum that up. Here's the TC Total calc:

     

    IF FIRST()== 0 THEN WINDOW_SUM([TC Flag]) END

     

    This outer TC Total calc has a *nested* compute using that you set using the Edit Table Calculation dialog, where it's set up to compute using Email Address, partitioning on List ID and the inner TC Flag calc retains its compute using of List ID, partitioning on Email Address. So the WINDOW_SUM([TC Flag]) of the TC Total calc sums up all the TC Flags for each Email Address within each list, and since TC Flag is 1 or Null that's the same as a COUNT(). I'm using WINDOW_SUM() here rather than TOTAL() because TOTAL() can't be nested, while WINDOW_SUM() can. Also, WINDOW_COUNT() would've worked just as well as WINDOW_SUM() in this case. The IF FIRST() part causes this calculation to only return one result for the entire partition (List ID). And again, there's no ELSE for this IF statement because I don't want to make Tableau do any more work than it has to. You can see this in TC Workout #3. Note that I also rearranged the pills on Rows to better show the addressing (compute using) for the TC total calc.

     

    To get to the final view that you can see in TC Workout #4, I did the following:

    1) Duplicated the TC Workout #3 worksheet.

    2) Got rid of the TC Flag calcs.

    3) Moved Email Address to the Level of Detail Shelf and turned off Tooltips for that pill. The Email Address at this point is just there to get the calcs right, we don't need to see details.

    4) Ctrl+Dragged a copy of the TC Total pill onto the Filters Shelf, and set it to filter for Special->non-Null values.

     

    That makes a nice demonstration of the power of Tableau to sum across the dimensions in different ways. However, that's a lot of work to set up. Fortunately, there's an easier way.

     

    The original goal is to filter out Email IDs that belong to List ID 123, then count the remaining Email IDs per List ID. So another way to put this is that we're trying to a calculation with one level of granularity (to get the Email IDs that pass the filter), then to do a calculation on that (to count the Email IDs) with a different level of granularity.

     

    The table calc solution does this all at once, however since both dimensions are in the view the calculations and their addressing have to work at all three calculation levels and be aware of the differing levels of granularity, so no wonder it's complicated.

     

    Tableau provides a couple of alternative routes for having computations at different levels of granularity in the same view. One is via a Tableau data blend, the other is using Top & Conditional Filters. When we create a Top or Conditional Filter on a dimension, Tableau is performing a computation at the level of granularity of that dimension that is *separate* from the calculations for the view that use the dimension(s) in the view. In this case, we want to filter out email addresses that belong to List ID 123.

     

    I set up an Agg Flag # calc with the following formula:

     

    IF [List ID]=123 THEN 1 ELSE 0 END

     

    This is a row-level calculation. In the CF #1 view that has *both* Email Address and List ID in the view, the calc has the default SUM() aggregation but that doesn't matter because the granularity of the view is at the record level and we see lots of Trues and Falses.

     

    However, as I'd said Conditional Filters are computed at the level of granularity of the dimension being filtered, so we need to get rid of the List ID dimension to see what would actually be happning. In the CF #2 view that has *only* the Email Address, the calc is doing computation for each record and then being summed up for each email Address, so we see the 1s for each email address that has a List ID of 123.

     

    We can now turn that calc into something that returns a boolean True/False for each Email Address:

     

    SUM(IF [List ID]=123 THEN 1 ELSE 0 END) == 0

     

    This is in the CF #3 view. Often when working with calcs piecemeal them together with a series of calculations and views like this so I can validate each step.

     

    Next, I can set up the Conditional Filter on Email Address using a Formula of the Agg Flag boolean calc. (There's more that could be written here about how Conditional Filters require aggregates, I'm skipping that for now), that's the CF #4 worksheet. Note that we're seeing 9 marks here because there are 9 email addresses that pass the filter.

     

    In CF #5 I've added List ID to the view and we can see 10 marks, because z@gmail.com belongs to both lists.

     

    In CF #6 I've removed the measures and turned Email Address into COUNTD(Email Address). Note that I had to extract the data to get access to COUNTD() for the text data source, in Tableau 8.2 that requirement will go away as Tableau introduces a new connector for text & Excel files.

     

    That's a lot easier to put together and (for me at least) conceptually makes more sense than the table calcs, there are only 3 pills in the CF #6 view. There's a filter that only returns certain Email Addresses, we've got List ID as a dimension, and are counting the distinct Email Addresses for each value of List ID. In fact, we don't even need that Agg Flag boolean calc as a separate calculation in the Measures window, that could be embedded in the Conditional Filter.

     

    Or, to make that filter re-usable, we can create a Set on Email Address that uses the same conditional filter, I put that into the Set Filter worksheet. Then you can use the IN/OUT of the set to start doing things with your data like look at variances between Email Addresses that don't belong to 123 vs. those that do. Or, if you *really* don't like List ID 123, you could make the Conditional Filter a data source filter or an extract filter and keep those email addresses completely out of the data source for other users.

     

    So there's a table calc solution and a conditional filter solution. Another option would be to pre-filter in the data source via Custom SQL or a custom view. Personally, I'd tend to go with the conditional filter as it's computed in the data source and in most cases likely to be faster than the table calc solution. Also, further calculations on the data are a lot easier when you don't have to deal with the finer granularity required by the table calc solution.

     

    If you have any questions, please let me know!

     

    Jonathan