4 Replies Latest reply on May 5, 2010 10:45 PM by Richard Leeke

    Improving performance by avoiding drawing multiple marks at the same location

    Richard Leeke

      I mentioned a technique I use to speed up the rendering of views with large numbers of marks in this thread: http://www.tableausoftware.com/forum/map-issue-lat/lon and promised to explain.  Here's a bit more detail about how that works - posted as a new thread to avoid hijacking the original any further.

       

      The basic principle is really simple, though the detail can be quite tricky.

       

      If you have a viz that has very large numbers of marks (by which I mean hundreds of thousands or millions), it is very likely that lots of those marks are going to be drawn on top of each other.  So if you can avoid drawing marks that won't be visible anyway, you can speed things up.

       

      You do this by redefining the view such that the database returns less marks at each position.  The attached images show the original and optimised versions of one of my views.  The original has 1.5 million marks and takes over 2 minutes to refresh; the optimised version has 50,000 marks and takes around 7 seconds.  Comparing the images with an image comparison tool there are lots of differences - but by eye they're good enough for me.

       

      There are some limitations and restrictions with doing this - most notably that zooming the display doesn't work very  well - and the workarounds I've managed to come up with for that are a bit clunky.  But if rendering speed is the main constraint on your analysis, the limitations may be worth it.

       

       

      So how do you decide which marks are going to be "on top of one another", which ones can you safely suppress without affecting the viz, and how do you do it?  For now I'll just outline the answers to each of these - I can fill in more detail if anyone is still interested (Michael!).

       

      Which marks are on top of one another?

       

      If the x and y coordinates for 2 marks are within the bounds of the same pixel, I treat them as being at the same place and so as candidates for suppression.  Now of course I don't know the inner workings of the Tableau rendering engine, so I may be truncating when Tableau is rounding, or whatever, but this approach seems to give a good enough result.

       

      Which marks can you safely suppress?

       

      Just because 2 marks are at the same pixel location doesn't mean that they are not both visible.  A big mark can peek out from underneath a smaller mark at the same position.  A solid circle is visible through the hole in the middle of a hollow circle.  etc.

       

      So in general, it's only safe to suppress multiple marks of the same size and shape at the same position.

       

      Even this approach doesn't properly account for transparency, I don't think.  I haven't really looked at this in any detail, but I'm pretty sure that multiple identical semi-transparent marks at the same place give a darker result - I think that's the whole point.  So if you're using transparency like that you're probably out of luck with this whole approach.

       

      Depending on the structure of the view, just returning a single mark of any given shape/size/colour at any one pixel location may be enough to make a big difference to speed, or may make very little difference.  If there are only a few distinct permutations of shape/size/colour it may be enough just to return one of each and leave it to the rendering engine to sort out what's visible.  But with lots of mark types, that may not make much difference.

       

      For example, the attached images have about 150 different values on the colour dimension, so just reducing down to one of each at any position only reduced it from 1.5 million to 500,000 marks.  So in that case I chose to use just a single size and shape, and take control for myself of the order of display of the different colour series, by putting the colour display order in a table in the database and just returning the top mark of any type at each location.  That brought it down from 500,000 to 50,000 marks.

       

      How do you do it?

       

      Basically you need to truncate or round the x and y coordinate values to the nearest pixel.  In some cases even just doing one dimension may be enough.

       

      The steps to do this are are as follows.

       

      1) Measure/estimate/guess the pixel dimensions of the view.  The most accurate way I've found of doing this is to take a screen print of the viz, copy it in to Paintbrush, crop it to the dimensions of the drawing surface and then look at the image properties in Paintbrush.  But actually you don't need to be very precise - just knowing your display resolution and guessing the proportions of the drawing surface to the nearest hundred pixels works just fine.  I measured 1250 x 820 for the attached images, but 1200 x 800 looks just about as good.  Really it would be nice if Tableau exposed the dimensions in some way - but I don't think that's as easy as it sounds.

       

      2) Round or truncate the continuous x and y dimensions to the nearest pixel.

       

      There are various ways of doing this, depending on your data source.  I sometimes do it with calculated fields, sometimes RAWSQL, sometime in a custom SQL connection and I've also done it in a stored procedure - though there are some gotchas with that approach.

       

      Here's the formula for a calculated field for a continuous floating point dimension on the x-axis:

       

      [xPixels] = width of the view in pixels

      [xMin] = minimum value of the x-dimension (columns shelf)

      [xMax] = maximum value of the x-dimension (columns shelf)

      [PixelRatio] = number of x units per pixel

      [xRaw] = original x-dimension

      [xTrunc] = truncated x-dimension

       

      [PixelRatio] = ([xMax]-[xMin]) / [xPixels]

       

      [xTrunc] = INT( [xRaw] / [PixelRatio] ) * [PixelRatio]

       

      Strictly this should use floor() rather than int(), since int() truncates towards zero, but Tableau doesn't have a floor() function.  You can do it by taking account of +ve and -ve values of [xRaw] in the calculated field, but I don't think it makes much difference - just a 1 pixel wide distortion at x = 0.

       

      Depending on the viz, there may be simpler approximations.  I quite often have views with a continuous datetime axis with millisecond accuracy, showing data for a 24 hour period.  There are (24 * 60) = 1440 minutes in 24 hours, which is more than the pixel width of the view on my display.  So replacing my original millisecond resolution datetime with a continuous calculated field defined as DATETRUNC('minute', [datetime]) is all that's needed to speed up a scatter chart by an order of magnitude.  I often don't even bother to do anything with the y-axis.

       

      Depending on the mark size, it can even work fine to normalise it down to a much coarser-grained grid - 10 x 10 pixels say.  The bigger the marks the less point there is in retaining so many marks.

       

      3) Use the truncated x and y dimensions on the columns and rows shelves, instead of the original dimensions.

       

      That's all that's needed to return one mark of each type at each location.  Returning just the "top" mark type needs a bit of tricky SQL, which I'll leave as an exercise for the reader.

       

      And that's all there is to it.  ;-)

       

      Actually there's some more complexity around determining your minimum and maximum dimensions, and as I mentioned, zooming the view gets very clunky (and really needs a few of the extra Tableau features I've posted about on the "Wouldn't it be nice if" forum over the years).  But I think that should be enough to get you going or more likely put you off.

       

      This originally came up in the context of a posting about maps, and I get the impression that was also what your follow-up question was about Michael.  Of course there's an extra complication with mapping which is that the map projection means longitude doesn't map uniformly to pixels.  The nearer you get to the poles, the more pixels per degree of longitude - I think.  I haven't really thought very hard about this, and I certainly haven't tried it, but I think just using a value for the pixel ratio worked out at about the Arctic Circle, or maybe a bit further South, will probably work just fine - it will just mean that there's a bit more overlap than necessary down near the Equator.  I doubt that you have to go to the trouble of calculating the ratio accurately as a function of longitude.

       

       

      Good luck Michael.  You did ask!

        • 1. Re: Improving performance by avoiding drawing multiple marks at the same location
          James Baker

          Impressive.

           

          "Returning just the "top" mark type needs a bit of tricky SQL, which I'll leave as an exercise for the reader."  -- Aw, this was the part I was most interested in. :)

          • 2. Re: Improving performance by avoiding drawing multiple marks at the same location
            Richard Leeke

            What I really meant was I can't remember how I did that.    ;-)

             

            And anyway that post took way longer than I'd intended - but just for you, James, I'll take a look to remind myself...

            • 3. Re: Improving performance by avoiding drawing multiple marks at the same location
              Richard Leeke

              I reminded myself about "Returning just the "top" mark type".

               

              First thing to say is that this really is a compromise between speed and functionality.  Essentially this approach means that you can't use shape, size or transparency, and you even lose a lot of the smarts of the colour shelf.  Highlighting no longer lets you see what's hiding underneath the top layer of marks (because there isn't anything).  But with very large numbers of marks you probably wouldn't be getting much value out of shape and size anyway - so it's not as big a loss as it sounds.  I generally use this approach to take a quick look for interesting patterns before zooming in much closer, at which point I can afford to turn back on all the good shape, size and colour stuff.  Sort of like quick aerial surveillance from 30,000 feet through a bit of cloud, before flying past at tree-top hight with full visibility.

               

               

              There are actually a couple of ways to limit the view to the "top" mark.  If you don't care which one goes on top, the quick and easy way is just to use a MIN() or MAX() of your colour dimension.  No tricky SQL at all for that approach.  I was using this today for a scatter chart from a data set with 4 million points (in a fast data source) and the viz was refreshing in under 5 seconds.

               

              You can even arrange to sort your colour dimension as you want it with the MIN() or MAX() approach by prefixing all the values with a sort order: "01_Electrical", "02_Furniture", etc.

               

              The other way I've done it in the past to get full control over the sort order is to have a sort order column on the colour dimension table in the database.  That needs the slightly tricky SQL I was referring to.  That way allows the sort order to be changed on the fly.  We had taken this approach to sorting marks in an early prototype of a visualisation tool we were thinking of developing before we discovered Tableau (we haven't touched the prototype since).  But I have hooked Tableau up to that database and controlled the sort order from the prototype while doing the visualisations with Tableau.

               

              For the record, the SQL looked something like this:

               

               

              
               SELECT
                  norm.dimX AS dimXNorm,
                  norm.dimY AS dimYNorm,
                  c2.colourName AS colourName
              FROM (
                  SELECT
                      FLOOR(X / XPixelRatio) * XPixelRatio AS dimX,
                      FLOOR(Y / YPixelRatio) * YPixelRatio AS dimY,
                      MIN(c.colourOrder) AS minColourOrder
                  FROM data d
                      INNER JOIN colour c
                      ON c.colourId = d.colourId
                  GROUP BY
                      FLOOR(X / XPixelRatio) * XPixelRatio AS dimX,
                      FLOOR(Y / YPixelRatio) * YPixelRatio AS dimY
                  ) norm
              INNER JOIN colour c2
                  ON c2.colourOrder = norm.minColourOrder
              
              


              • 4. Re: Improving performance by avoiding drawing multiple marks at the same location
                Richard Leeke

                Just applied a very light weight version of this trick (i.e. one simple calculated field) to a workbook I only use occasionally so hadn't previously worried about trying to speed up.

                 

                It went from 10-15 minutes to 10-15 seconds for a refresh.  One of the reasons for the extent of the saving was that it was fetching 250,000 rows from a remote Oracle database over a slow network.  The modifed version only brings back 2,000 rows.  The difference in the viz is not even noticeable.

                 

                It was when I tried printing a PDF of the original viz and found it was 40 MB (which was the original issue that prompted me to post about this trick) that I realised what was happening.

                 

                I might have to look back through my other postings and see what other bits of my own advice I've been ignoring!