I had a go with your workbook.
The issue is indeed due to the poor performance of the WINDOW_xxx() functions.
I wrote up an explanation of why they are slow and a workaround you can use in this post on Robert Mundigl's Clearly and Simply blog site.
I've applied that technique to your workbook and that has brought the refresh time down from around a minute to 3 or 4 seconds on my laptop. See sheet 2. But that is with only a tiny amount of data (5,000 rows in your sample workbook). The time to calculate window functions goes up with the square of the number of rows, so I think the saving will be even bigger with your 45K lines.
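The quadratic growth can be sketched like this (illustrative Python, not Tableau's implementation or my actual workaround): a naive repeated window calculation redoes the whole window's work for every row, whereas the optimised version computes the window result once and reuses it.

```python
def naive_window_sums(values):
    """For each row, sum over the whole partition again - the way repeated
    WINDOW_xxx() calls behave when the result isn't cached: n rows, each
    doing an n-row pass, so O(n^2) work overall."""
    return [sum(values) for _ in values]  # n sums of n values each

def optimised_window_sums(values):
    """Compute the window total once and reuse it for every row: O(n)."""
    total = sum(values)
    return [total for _ in values]

vals = [1, 2, 3, 4]
print(naive_window_sums(vals))      # [10, 10, 10, 10]
print(optimised_window_sums(vals))  # same result, linear work
```

Both return identical results; only the amount of work differs, which is why the gap widens so dramatically as the row count grows.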
I had to break out a couple of the nested window calculations into separate variables in order to use PREVIOUS_VALUE() in the way that you need.
I'm pretty sure that using the technique I describe in this post will make it a lot quicker still, especially as you get to more rows (I've used it successfully on 100 million rows). This works by getting the database to do as much of the work as possible by counting each distinct value in the database. The more rows you have, the more duplicate values you will have, so the more of a saving you will get.
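The idea of pushing the counting into the database can be sketched like this (a rough Python sketch of the logic, not Tableau syntax): the database returns each distinct value with its row count, and the quantile is then computed from that much smaller list of pairs.

```python
def quantile_from_counts(value_counts, q):
    """Compute the q-quantile (0 < q <= 1) from (value, count) pairs:
    the smallest value v such that at least a fraction q of all rows
    have a value <= v."""
    total = sum(count for _, count in value_counts)
    threshold = q * total
    running = 0
    for value, count in sorted(value_counts):
        running += count
        if running >= threshold:
            return value
    return None

# 100 rows collapse to 3 distinct values: the database does the counting,
# so the client only sorts and scans 3 pairs instead of 100 rows.
pairs = [(10, 50), (20, 30), (30, 20)]
print(quantile_from_counts(pairs, 0.8))  # 20 - 80% of rows are <= 20
```

The more duplicates in the data, the fewer pairs come back, which is where the saving comes from.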
Note that I also fixed an issue in the [Variable] calculation - it didn't match one of the parameter values.
Slow Quantiles RL.twbx.zip 1.9 MB
Thanks Richard for posting this.
I need something like this but don't know how to adjust your formula.
Here's my situation, I have attached a simplified workbook to show you what the data looks like:
I have a conversion rate (sum(contracts)/sum(inquiries)), which needs to be calculated in Tableau and not in my database since it's a ratio based on a changing number of lines of data, depending on filters.
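The reason this ratio has to be computed after filtering, rather than pre-aggregated in the database, can be sketched with toy numbers (hypothetical Python, made-up data):

```python
rows = [
    # (store, channel, contracts, inquiries)
    ("Store 1", "Web",   5, 100),
    ("Store 1", "Phone", 2,  10),
    ("Store 2", "Web",   8,  40),
]

def conversion_rate(rows):
    """Ratio of sums over whatever rows the current filters keep -
    sum(contracts)/sum(inquiries), not an average of per-row rates."""
    contracts = sum(r[2] for r in rows)
    inquiries = sum(r[3] for r in rows)
    return contracts / inquiries

# The rate changes with the rows included, so it can't be
# pre-computed for every possible filter combination.
print(conversion_rate(rows))      # 15/150 = 0.1
print(conversion_rate(rows[:1]))  # 5/100 = 0.05
```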
I would like to show the conversion rate over time, for a store or all stores in a market, and be able to filter on a specific channel (which is already done in the workbook).
But I also want to have a similar line which shows for each day, the 80th percentile conversion rate on the MARKET and channels included in the chart.
So for example if I only select Store 1 and Web channel, I would want the line to display for each date the 80th percentile conversion rate of all contracts and inquiries made in the USA (which is the market of store 1) and over the web channel.
Do you have any idea if this can be done?
percentile.twbx.zip 24.3 KB
I guess the summarized and more general question is:
How do you apply a quantile calculation like this one to a set of data smaller than the full dataset, but larger than what is displayed in the chart (e.g. across all stores in the store's market when displaying only one store)?
Interesting question. This is actually a class of problem I've been thinking about a bit recently, so it made an interesting exercise to see if I could do what you're after.
So far I haven't managed to get exactly what I think you're looking for, but I've attached where I've got to and will explain what I've done. The issue preventing me from getting exactly what you want is that the filter shelf applies to both axes in a dual-axis chart. What I think is needed is to have the ability to specify different filter criteria for each mark, in the same way that you can have different versions of the other shelves by choosing multiple mark types.
I don't imagine it's as simple as that though - I'm sure there could be all sorts of issues with that which I haven't thought of. But I think this makes a good example of why you might want to, so I'd be interested if Ross Bunker has any thoughts on whether that might be a possibility one day (you always say you want to see real world examples, Ross).
I know it's a difficult area because a subtle change in the way that table calculation partitioning works with the filter shelf which came out in 7.0.4 has just broken several of my workbooks. In fact that's the reason I've been thinking about this - multiple filter shelves would let me get past that problem.
Anyway, back to your question and where I managed to get to with it. I've attached an updated workbook with several sheets, I'll talk through them one by one.
Original is how you posted it. The chart shows the overall conversion rate for all channels and stores selected, by day. The blue line is trying to show the 80th percentile conversion rate, but isn't correct, for a couple of reasons. Firstly, with the view structured like that, there is only one value per day available and the calculation is trying to get the 80th percentile of the daily values. So the calculation needs to sort the daily values into order and find the lowest value for which 80% of the days have the same value or lower. The way my algorithm does that is by making the value in question (in this case conversion rate) a dimension and setting the Compute Using on the calculated field to conversion rate. But conversion rate is an aggregate calculation, so can't be a dimension, so can't be used for Compute Using.
As you have it, the table calculation is just set to Compute Using table across. So it looks at all the values in table order (which means in date order) and finds the lowest conversion rate value for a date that is 80% or more of the way through the dates. So it finds the minimum conversion rate on or after 18th of May. Not what you wanted at all.
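The difference between the two computations can be sketched in a few lines of Python (toy data, purely to illustrate the logic - this isn't Tableau syntax):

```python
daily_rates = [  # (date, conversion rate), already in date order
    ("May 16", 0.30), ("May 17", 0.10), ("May 18", 0.25),
    ("May 19", 0.05), ("May 20", 0.20),
]

def wrong_table_across(rows, q_num=4, q_den=5):
    """Mimics 'Compute Using table across': walks rows in DATE order and
    returns the minimum value among the rows from the 80% position onward."""
    start = -(-q_num * len(rows) // q_den) - 1  # ceil(q*n) - 1, exact integers
    return min(rate for _, rate in rows[start:])

def right_percentile(rows, q_num=4, q_den=5):
    """Sorts by VALUE and returns the lowest value that is greater than or
    equal to 80% of all the values - the intended percentile."""
    rates = sorted(rate for _, rate in rows)
    return rates[-(-q_num * len(rates) // q_den) - 1]

print(wrong_table_across(daily_rates))  # 0.05 - min of the last dates
print(right_percentile(daily_rates))    # 0.25 - 4 of 5 values are <= 0.25
```

Same rows, very different answers: sorting by date instead of by value is what makes the first version wrong.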
In 80th Percentile Overall Conversion Rate by Date I've adjusted the table calculation partitioning so that it sorts the rows by conversion rate, so it now displays the 80th percentile of the overall daily conversion rates.
But I still don't think that's what you are after. I think from reading the question that what you really want is the 80th percentile of all the individual channel and store conversion rates on each day. I've illustrated what I mean by that in the next few sheets.
I always find with this sort of problem that looking at the detailed data helps me to get my head around where I'm trying to get to. So in the next sheet Scatter I've shown all of the daily conversion rates. In your sample data there are 4 values per day (2 channels for each of 2 stores).
On Scatter Showing 80th Percentile I've added in the 80th percentile for each day. You'll notice that the 80th percentile is just the maximum for the day. That's just because you only have 4 values, so only the maximum value is greater than or equal to 80% of values.
So to make it easier to see what's happening I added a parametrised version of the calculation in sheet Scatter Showing (param) Percentile, which I've currently set to the 75th percentile.
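A quick sketch (plain Python, toy values) of why the 80th percentile collapses to the maximum with only 4 values per day, and why a lower parameter such as 75 shows something different:

```python
def percentile(values, pct):
    """Lowest value that is >= pct% of the values (pct an integer percent)."""
    s = sorted(values)
    idx = -(-pct * len(s) // 100) - 1  # ceil(pct*n/100) - 1, exact integers
    return s[idx]

rates = [0.05, 0.10, 0.20, 0.30]  # 4 daily values (2 channels x 2 stores)

# Only the maximum covers 4/4 = 100% of the values; the next value down
# covers only 3/4 = 75%, which misses the 80% bar.
print(percentile(rates, 80))  # 0.30 - just the maximum
print(percentile(rates, 75))  # 0.20 - 3 of 4 values are <= 0.20
```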
Finally in sheet Filter to Individual Channel and Store I've switched back to a line chart view and illustrated the best I can achieve in terms of your original filtering question.
In order to calculate the percentiles across the whole Market, you need to bring the rows for all channels and all stores back to Tableau, so you can't just filter on those in the normal way. So I've used a technique I described in this thread to apply a filter to the results of the table calculation, to limit what gets displayed, whilst still keeping all rows available for the calculation.
That's all very well, but it only works to filter down to an individual channel and an individual store. I think you would need separate filter shelves for each axis to get to what you want.
The only other way I can think of is to have a dashboard with the two calculations on different sheets.
percentile RL.twbx.zip 56.8 KB
Thanks for this clever and helpful response. It's not 100% of what I wanted to display, but I think that by using a trick I saw on this forum a while ago by which one sheet "disappears" and another "appears" on a dashboard depending on the value of some filters/actions, I can get pretty close.
Thanks a lot!
Another view which might be helpful (though again isn't what you were thinking of) is a scatter of individual channels and stores, with the percentile line overlaid. That way you can use filters to show and hide different channels and stores, without affecting the percentile value.
I've attached a modified version with that sheet added.
Note that the reason this works is that the percentiles are actually drawn repeatedly - once for each channel/store combination - so as long as you still have one of those showing the percentile line will still display.
percentile RL2.twbx.zip 64.6 KB
I do love the real world examples. Thanks! :-)
Two things. First, apologies on the 7.0.4 breakage. I've seen the bug and we've got a fix checked in. It will be out in 7.0.6. The problem was a side-effect of a fix we did to make table calcs on the filter shelf work better in the case where they _didn't_ appear anywhere else in the viz. We had an issue where we didn't show any dimensions for ordering that appeared on the LOD or other shelves (besides rows/columns). The fix caused us to look at all the panes in such a case and only show dimensions that appeared in all panes. In your case, the table calc was used for filtering _and_ in one of the panes. The fix caused us to treat the table calc on the filter shelf as invalid because it was using a dimension that only appeared in one of the panes. 7.0.6 will have an update that fixes this problem.
The second thought is that there is a trick you can use to sort of get per pane filtering. If you create a table calc of the form LOOKUP(<field>, 0), and set the ordering to Cell, then it essentially forces <field> to be computed locally, though it really isn't doing any work. If you then filter on this, it will always be a local filter.
Now, if <field> is something like ATTR(StoreId), and StoreId is only in one of the panes (because you are using dual axes with different dimensions in play), then ATTR(StoreId) will be '*' in the other pane. You can therefore filter to allow '*' and one other store id. This will leave one pane alone, and filter out all the data in the other pane except the desired StoreId. I haven't looked enough at the use case here to know if that solves the problem or not. If StoreId is in both panes, this won't work.
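The '*' behaviour can be sketched like this (plain Python mimicking the ATTR() semantics, not Tableau internals):

```python
def attr(values):
    """Mimics Tableau's ATTR(): returns the single value if every row in
    the partition agrees, otherwise '*'."""
    distinct = set(values)
    return distinct.pop() if len(distinct) == 1 else "*"

# Pane where StoreId is in play: each partition holds one store.
print(attr(["Store 1"]))             # "Store 1"
# Pane without StoreId: the partition mixes stores, so ATTR gives '*'.
print(attr(["Store 1", "Store 2"]))  # "*"

# A filter allowing {"*", "Store 1"} leaves the mixed pane untouched
# while restricting the detailed pane to Store 1 only.
allowed = {"*", "Store 1"}
print(attr(["Store 2"]) in allowed)  # False - filtered out
```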
Another way you can sometimes achieve a 'filtered' computation is to use NULLs. A calculation like:
IF ATTR(StoreId) == [Parameter] THEN SUM(Sales) ELSE NULL END
will give you only the SUM(Sales) for a particular store. This can be used in a table calc to order and compute a quantile for a single store which can then be compared against all other stores.
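That NULL trick can be sketched in Python, using None for NULL (hypothetical store/sales data, just to show the shape of the result):

```python
rows = [
    # (store, sales)
    ("Store 1", 100), ("Store 1", 50),
    ("Store 2", 200), ("Store 2", 75),
]

def sum_sales_for(rows, store):
    """Mimics IF ATTR(StoreId) == [Parameter] THEN SUM(Sales) ELSE NULL END:
    other stores contribute NULL (None) rather than being filtered out, so
    their rows stay available to table calculations in the same view."""
    per_store = {}
    for s, sales in rows:
        per_store[s] = per_store.get(s, 0) + sales
    return {s: (total if s == store else None) for s, total in per_store.items()}

print(sum_sales_for(rows, "Store 1"))  # {'Store 1': 150, 'Store 2': None}
```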
As for a formal 'non-hacky' way of doing per pane filtering, we are definitely thinking about this, but I don't know when it will be implemented or released. The real-world example will definitely help.
Thanks for the feedback Ross.
I figured that must have been what happened on the breakage. Support had told me there was a fix under development, so it's good to know it's checked in.
I must be missing something on the LOOKUP(ATTR(),0) trick - or one of us must be. ;-)
I use that trick a lot (but have never come across the dual axis twist). In fact that is what I am doing in the workbook I posted back for Alexis above, in order to pull all values back to compute the quantiles over the whole set but then filter on channel or store for the other axis. So I have channel and store on LoD for one pane, and a LOOKUP() based filter for each of them - but the filter isn't offering '*' as an option.
If I could get that to work I think it would definitely help for some classes of problem. More often I want the alternative filtering to result in different result sets or different underlying rows for the aggregations that are returned to Tableau - so need different server side filters to apply per pane. I'm sure that will have to wait for if and when the non-hacky version eventuates. I can see that raising all sorts of interesting challenges, though, so I'm not holding my breath.
Okay, I took a closer look at the workbook you posted.
The reason you aren't seeing any '*' values is that 'Conversion rate (detail)' is being used as a dimension (All values). Since this conversion rate is different for every Day/Store combination, every point is already separate for each store. If you remove [STORE filter] and put ATTR(Store) on the label shelf, you'll see that every point on the blue line is actually 2 points, one for each store.
Not sure if this is always true or an artifact of the data. It definitely highlights a problem: filtering on LOOKUP(ATTR(),0) and relying on '*' breaks down if one of your points _happens_ to be unique for the dimension, because that point will get filtered. Hmmm... this per pane filtering hack is even weaker than I thought.
More incentive for us to figure out the right way to do it!
Right, I should have spotted that.
I'm not quite sure whether the data would always be like this - Alexis just provided some sample data - but from my understanding of the question I think even if there were lots of duplicated conversion rates there would also inevitably be lots of cases where only one row had a given rate. The Quantile algorithm requires the distinct values to be returned as a dimension, so I really don't think this trick will work in this case. But I can see that there will be other circumstances where it will provide a workable approach.
You helped me with a speed issue using quartile metrics a couple of weeks back. In that example, with 10K lines of data, you were able to help me reduce the update time drastically. I have since pulled 45K lines in and, instead of selecting a variable to choose, I made a measure - with the help of Joe Mako - that shows: cycle time, paid expense, paid loss and total claim outcome (TCO).
I was hoping to get you to look at this workbook and see if you have any suggestions for speeding up the calculations again. Also, I can't get the paid expense calculation to work correctly.
I also attached the sample workbook you sent, for reference.
Any help you could give me would be great.
You will need to attach a packaged workbook (*.twbx) rather than just a workbook (*.twb).
It sounds from your question as if this should probably be a new thread, rather than hijacking the quantile one - which is already getting very long. Or are the calculations you are talking about still the quantile calculation?
Either way - post a packaged workbook and I'll take a look. If you do think it's better as a new thread just post a link to it on here so that I get notified - I don't even try to keep up with all the traffic on the forums these days, so could easily miss it.
Attached is a workbook.
The question is still a quantile question, and I agree this thread is getting long. I can't get a couple of the calculations to work in the paid loss sections. In addition, when you first looked at the example with 5K lines you reduced the refresh time, and you thought the saving would be even bigger as the lines of data increased. It has not been.
Thanks for your help on this.
Just to be clear (and I haven't looked at the latest workbook you sent yet) - what I was saying on the previous version I posted back was that I had applied an optimisation to the way that you were calling the WINDOW_XXX() functions, which sped it up at the size that you had at the time and would prevent it from getting MUCH, MUCH slower as your data got bigger.
But I was also suggesting that using the approach I use in this thread, which shifts as much as possible of the work to the back-end database, would make it faster still, and would scale much better.
I'll take a look at what you have now and see if that's still what I think. ;-)
How are you finding the speed at the moment? How long does it take to refresh the view with how many rows of data - and how does that compare to your ideal?
I'm a bit confused. I was expecting that you would have posted the packaged workbook version of "MCD GL 06-11 working.twb", but you've posted the unpackaged version of the workbook I had posted for you earlier on this thread (Slow Quantiles RL.twb) and an unpackaged version of a new file called "JLM example workbook.twb".
You need to post packaged workbooks so that they include the data, otherwise there is nothing anyone can do with them without your data source. I'm also not sure what the file "JLM example workbook.twb" is for.