1 Reply Latest reply on May 14, 2017 5:04 PM by Peter Quackenbush

    Table calculation performance when calculating an average of a rank

    Peter Quackenbush

      Hello,

       

      I'm trying to build a relatively complex table calculation. I can get the right numbers, but the performance is very poor.

       

      Here's the actual challenge (to have a shareable analogue, I'll also provide an example with Superstore data below).

       

      I have a database with a bunch of auto insurance quotes. There's one record per quote_ID, vehicle_number, and quoting company. What I'd like is to first get the SUM(Premium) by quote_id and quoting company. Then, within each quote_id, rank the quoting companies from cheapest to most expensive. Then, take the average of that rank by dimensions like postal code, number of drivers, and so on. I'd like a plot of the average rank for each of the quoting companies, but I'd settle for just one.
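      To make the shape of the calculation concrete, here's a minimal plain-Python sketch of the three steps. All records, company names, and postal codes here are made up for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical quote records: (quote_id, vehicle_number, company, postal_code, premium).
rows = [
    ("Q1", 1, "AcmeIns",   "12345", 500.0),
    ("Q1", 2, "AcmeIns",   "12345", 450.0),
    ("Q1", 1, "ZenithIns", "12345", 400.0),
    ("Q1", 2, "ZenithIns", "12345", 480.0),
    ("Q2", 1, "AcmeIns",   "67890", 300.0),
    ("Q2", 1, "ZenithIns", "67890", 350.0),
]

# Step 1: SUM(Premium) per (quote_id, company); remember each quote's postal code.
totals = defaultdict(float)
postal = {}
for quote_id, _veh, company, pc, premium in rows:
    totals[(quote_id, company)] += premium
    postal[quote_id] = pc

# Step 2: rank the companies within each quote_id, cheapest = 1.
by_quote = defaultdict(list)
for (quote_id, company), total in totals.items():
    by_quote[quote_id].append((total, company))
rank = {}
for quote_id, entries in by_quote.items():
    for r, (_total, company) in enumerate(sorted(entries), start=1):
        rank[(quote_id, company)] = r

# Step 3: average of the rank per (postal_code, company).
ranks_by_dim = defaultdict(list)
for (quote_id, company), r in rank.items():
    ranks_by_dim[(postal[quote_id], company)].append(r)
avg_rank = {k: mean(v) for k, v in ranks_by_dim.items()}
```

      The key point is that steps 1 and 2 operate at (quote_id, company) grain while step 3 regroups by a completely different dimension, which is exactly what forces nested table calculations in Tableau.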

       

      Overall the database has around 250,000 records with around 20 columns and around 10 different quoting companies. Policies can have 1, 2, or 3 vehicles on them. For example, a 2-vehicle policy that 5 companies provide a quote for will have 10 records.

       

      I've accomplished this using table calculations, but the performance is not ideal: it can take 1-2 minutes to process the table, which is far too long. I think the main problem is that all these nested table calculations result in a Cartesian product of the table calculation's dimensions.

       

      I've built a very similar analogue to what I want using the Superstore sample data. Essentially, for each order I want to calculate the RANK_PERCENTILE of the SUM of Sales for every product Sub-Category. In other words, group the data by Order ID and Sub-Category and find the SUM of Sales; then use a WINDOW_AVG to find the average rank by state for one particular Sub-Category.
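      Here's a small Python sketch of that Superstore analogue. The rows are made up, and the percentile here is a simple (rank - 1) / (n - 1) over ascending totals, which only approximates Tableau's actual RANK_PERCENTILE semantics:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical Superstore-like rows: (order_id, state, sub_category, sales).
rows = [
    ("O1", "Ohio", "Chairs", 200.0),
    ("O1", "Ohio", "Tables", 500.0),
    ("O1", "Ohio", "Phones", 100.0),
    ("O2", "Ohio", "Chairs",  50.0),
    ("O2", "Ohio", "Tables",  75.0),
]

# SUM(Sales) per (order_id, sub_category); remember each order's state.
totals = defaultdict(float)
state_of = {}
for order_id, state, sub_cat, sales in rows:
    totals[(order_id, sub_cat)] += sales
    state_of[order_id] = state

# Percentile rank of each sub-category within its order (0 = smallest, 1 = largest).
by_order = defaultdict(list)
for (order_id, sub_cat), total in totals.items():
    by_order[order_id].append((total, sub_cat))
pct = {}
for order_id, entries in by_order.items():
    n = len(entries)
    for i, (_total, sub_cat) in enumerate(sorted(entries)):
        pct[(order_id, sub_cat)] = i / (n - 1) if n > 1 else 1.0

# WINDOW_AVG analogue: average percentile rank by state for one sub-category.
target = "Tables"
per_state = defaultdict(list)
for (order_id, sub_cat), p in pct.items():
    if sub_cat == target:
        per_state[state_of[order_id]].append(p)
avg_pct = {s: mean(v) for s, v in per_state.items()}
```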

       

      Phew.

       

      Anyways, attached is the workbook that I've put together as an example. The Demo tab shows the whole aggregation that I'm doing, and the Demo2 tab shows a line graph; the values of that graph are what I want. You'll notice that there are a bunch of marks, though as I've said they are all basically the same value, because running this calculation on top of another table calculation seems to create an unnecessary, and slow, blowup of marks.

       

      I'd appreciate any ideas on how to make this more performant.

       

      Thanks!

        • 1. Re: Table calculation performance when calculating an average of a rank
          Peter Quackenbush

          Update:

           

          So I played with this and found a way to make it much faster. It's now fast enough from a processing standpoint, so I'm pretty happy now, but it raises a new technical question.

           

          I added a third-level calculation called "ShowAverageOfRank", which only shows the value if it's the first value in the partition, partitioned by the dimension on the chart.
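          In Tableau terms this is the usual IF FIRST() == 0 THEN WINDOW_AVG(...) END pattern. Here's a rough Python sketch of the idea, with hypothetical partitions and rank values:

```python
from statistics import mean

# Hypothetical rows already sorted by partition: (partition_key, rank_value).
rows = [("A", 2), ("A", 4), ("A", 3), ("B", 1), ("B", 5)]

# Group the rank values by partition.
partitions = {}
for key, r in rows:
    partitions.setdefault(key, []).append(r)

# Emit the window average only on the first row of each partition
# (the FIRST() == 0 idea); every other row gets None, so only one
# mark per partition survives instead of one per underlying row.
marks = []
for key, values in partitions.items():
    avg = mean(values)
    for i, _r in enumerate(values):
        marks.append((key, avg if i == 0 else None))
```

          The win is that Tableau only has to draw one real mark per partition; the remaining rows become nulls that can be filtered or ignored.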

           

          The new question is about chart type. I changed the chart to a bar chart, and rendering and updating the plot became much, much faster. I also noticed that the null-values warning in the bottom right shows far fewer nulls for a bar chart than for a line chart. In the attached workbook, I made a tab Demo3 showing the final bar chart that I want; it's sufficiently fast. It shows >9k null values, but when I switch to a line chart, I get >448k null values.

           

          I don't get why I see so many more null values just by changing the chart type from bar to line. I think this is one of the big reasons why line charts are so slow in this use case.

           

          Thanks!