1 2 Previous Next 18 Replies Latest reply on Jun 27, 2016 4:15 PM by Jonathan Drummey

# How to calculate sum of distinct values

Is there a way to calculate a sum of distinct values without a) requiring the dimension of values being summed to be in the view or b) using a RAWSQLAGG on a data source that supports SUM DISTINCT? (I'm avoiding RAWSQL because extracts don't support it).

In the attached I set up an example of how to get a sum distinct with table calcs on Superstore Sales. For the top 100 customers, there are 1775 values of Sales, but 1779 records because some values of Sales are in the data multiple times. (that's the first worksheet).

In the second worksheet I show that when there aren't any other dimensions confounding the calcs possible to get the distinct sum of sales via a Grand Total or in the status bar, for this data set its \$8,721,008.75.

In the third worksheet (which more mimics the actual situation I'm dealing with) the Customer ID is added to the view. In this case, one way to get the sum of distinct sales is to use a nested table calc where the inner calc uses something like IF FIRST()==0 THEN MIN([Sales]) END with a Compute Using on the Customer (so it partitions for each value of Sale, and returns only a single value for each Sale no matter how many Customers there are), then an outer calc that does a WINDOW_SUM([Sales at level of Sale]).

I'm trying to avoid requiring the Sales dimension to be in the view because in this particular case the granularity I want to work with is only 100 marks for 100 customers, not 1779 marks from the 1795 distinct values of sales for 100 customers. That factor of 17.79. (in this case) is a performance drag and adds complication to the view.

Anyone have any ideas on how to do this in a regular aggregate, data blend, etc.?

Jonathan

• ###### 1. Re: How to calculate sum of distinct values

(I'm not sure I'm going to get a response on this, but figured I'd *ping* anyways).

• ###### 2. Re: How to calculate sum of distinct values

I cannot seem to find a way to make this happen with blending as of yet. This is just screaming of a classic LOD issue that is tough to handle without blowing out the view.

Dan

• ###### 3. Re: How to calculate sum of distinct values

Using a copy of the data source, you could hide nearly every field.  Keep Sales as a Dimension and a copy of Sales as a Measure with the default aggregation set to MIN (or MAX or AVG).  Then create an aggregate extract rolled up to visible dimensions.  This will give you 1 record per unique sales amount and a sales amount you can aggregate.

Then blend away!

Edit: Of course there are complications if you want to blend on anything other than the entire data set...

(Maybe use Sales as a dimension in the primary to link to the Sales dimension in the secondary)

Regards,

Joshua

3 of 3 people found this helpful
• ###### 4. Re: How to calculate sum of distinct values

This guy......all saving the day n stuff.

• ###### 5. Re: How to calculate sum of distinct values

Thanks Dan!  But let's wait to see... Jonathan has a way of showing me why my complex "solutions" don't always work (or could be done a lot more simply)

• ###### 6. Re: How to calculate sum of distinct values

This sounds very cool, Joshua! I tried it out on a data set and it definitely works. And I can poke a hole in it as well...

One way to look at this problem is that to get the table calc solution to work right to work right, we have to increase the granularity of the view to include the dimension being summed. In the above example, the view has 1779 marks, which corresponds to the 1779 rows returned from the query.

In the data blending solution you proposed, the granularity of the view isn't increased by taking advantage of data blending and a pre-aggregated data source, so using the above example the view would only need 100 marks, so it theoretically should be fast. However, the query to the primary source would still be returning 1779 rows because it would be on Customer & Sales. When using DB2 data blending (which this is using), the main query on the primary source to get data is the combination of all view dimensions & all linking dimensions.

That's not a big deal for the example I'd originally posted, but in one of the use cases I'm thinking of there could be 80+ distinct values to be summed and 10K+ customer-type dimensional values, so that's a query returning 800K+ records. Now Tableau would also be returning 800K+ records in the table calc solution, but DB2 blending (the kind used in this case) will generate at least 4 queries at a minimum (2 on primary, 2 on secondary), which could further slow things down because those 800K + records would be returned twice, once for the initial query to look up the domains and then again for the query to pull all dimension & regular aggregate measures for the view. This might be faster because of caching, maybe not. And that's not even all Tableau does, it also has to actually blend the 800K values from the primary with the 80 values from the secondary, then it would be invisibly re-aggregating each of the summed values (Sales) at the granularity of the view (Customer in this case). All of that would likely lead to a view that is too slow.

FYI, there's an idea to have a native SUMD function: http://community.tableau.com/ideas/2177

Jonathan

3 of 3 people found this helpful
• ###### 7. Re: How to calculate sum of distinct values

Thanks to the Level of Detail calcs in version 9, here's a solution:

{FIXED [Sales] : MIN([Sales])}

When SUM() is wrapped around that, we get the sum of the MIN() Sales for each distinct Sale, which is the sum of distinct sales.

v9 workbook is attached. Answering my own questions is fun!)

12 of 12 people found this helpful
• ###### 8. Re: How to calculate sum of distinct values

HA!

• ###### 9. Re: How to calculate sum of distinct values

Nice!

I'm starting to get a lot of questions about performance around LOD calcs.  I haven't had any issues, but since performance was one concern in this case, did you see any issues?  Or have you ever, so far, experienced any degraded performance using LODs?

Regards,

Joshua

• ###### 10. Re: How to calculate sum of distinct values

Hi Joshua,

I was definitely in the front row during a very recent presentation on LOD calcs by the Tableau product consultants.  They gave us an overview of the SQL that's generated and, to paraphrase, said that "it isn't perfect yet and on very large data sets you're going to find performance problems ... please let us know when you do".

Again to paraphrase, their suggestions re: getting the LOD calc SQL to perform well on those large data sets was to make that you A) understand the SQL, and B) build the necessary indexes & statistics in your database, and C) etc...

Not exactly ideal.  But, on the other hand, I can't blame them for following the correct sequence of events with this awesome new functionality:

1. first, get it working
2. then, make it perform well

Also, Jonathan, I don't recall exactly (and I admit that I haven't yet looked inside of your workbook attached here), but was it possibly the case that this question first came about as a result of our trying to build the "AND" button ?

Thanks!

2 of 2 people found this helpful
• ###### 11. Re: How to calculate sum of distinct values

I had a worksheet with ~30 LOD calcs --- each calculating a fixed average for a different measure. These where then used inside another regular calculated field to determine a ratio of filtered to fixed/non-filtered values.

LOD calcs were about 3-4 x slower than a blend, which was 2-3 x slower than row level calcs, where I added the avg to every row in the data source.

I didn't really look into it, but I suspect that each LOD was a separate query.

My impression is that for a lot of cases, table calcs will be faster with smallish data sets, since you're not going back to the the database.

I also don't necessarily think LOD calcs are easier than table calcs (although I've been using TCs for a long time). The main advantage of LOD calcs for me is that I can set the aggregation externally from the view, which means I don't need to constantly check the addressing/partitioning in a worksheet.

Jim

• ###### 12. Re: How to calculate sum of distinct values

Jim Wahl wrote:

I also don't necessarily think LOD calcs are easier than table calcs (although I've been using TCs for a long time). The main advantage of LOD calcs for me is that I can set the aggregation externally from the view, which means I don't need to constantly check the addressing/partitioning in a worksheet.

Very glad to see someone else utter these words -- the concepts are definitely there, but without loads of practice and examples, I'm bound to get confused -- the choices are great to have, but you really have to know your stuff to choose the best path and it seems to be getting more difficult to choose the right path without help/direction from others. I could be all alone in thinking this and I certainly don't have an adequate solution, other than lots of time and practice.

• ###### 13. Re: How to calculate sum of distinct values

@Keith - Thanks for those notes! Yes, in fact it was in regards to the "AND" button. As I was writing this up yesterday I was trying to remember what the original source of the problem was, I did such a good job masking it even from myself that I couldn't remember!

@Joshua - There's a thread on the beta forum from a month or two ago where Shawn was doing a set of nested LOD calcs and running into performance issues, there was some good info from Tableau devs.

@Jim - Yes, each LOD calc is computed as a separate subquery at this time (with potential for nesting), they are using a similar process to top & conditional filters and computed Sets. One thing I've been thinking about with LOD calcs is trading off the granularity of the LOD calc vis-a-vis the level of detail of the visualization. My thinking is that If the viz is at a coarse granularity, then a table calc acting at that coarse granularity is likely going to be faster than an LOD calc that may be computed and/or joined in at a fine granularity.

@Matthew, et al - Something that I keep in mind is that LOD calcs are v1.0. In this release there's definitely room to improve performance, there's no GUI for LOD calcs, and LOD calcs don't have any sorting or ranking built-in. I think of table calcs as somewhere in the 2.x release, where 1.0 would be the initial table calcs, 2.0 would be custom table calcs, and the .x includes the changes for densification, adding the RANK functions, etc. Beyond the GUI for table calcs (as much as it needs improvement), a big factor that helps understanding table calcs is that we can always see the marks (aka addresses/rows in the partition) that are being computed over. Since LOD calcs are separate subqueries that themselves can be aggregated based on the viz level of detail we need to have (or develop) the skill and experience to mentally project what the LOD calc will be doing. Folks like me who have a background of writing SQL aggregate calcs at the command line have that skill to whatever degree. For those who don't have that skill yet, I've suggested to Tableau that one way to teach LOD calcs is to show people how to build separate views that express what the LOD is up to so they see what the LOD calc is doing in isolation before the results get re-computed in whatever way inside the viz.

Jonathan

• ###### 14. Re: How to calculate sum of distinct values

Trust me, we aren't done here with either table calcs or LOD calcs . We know that there are still improvements to be made.

Dan Huff

3 of 3 people found this helpful
1 2 Previous Next