
    Wildcard limitation?

      Hello!

        We've been using Tableau with great success on some awesome workbooks for the last 1.5 years, but I just ran into something I can't seem to get around.  Wildcard filters seem to work for "contains" queries, but is there a way to use true wildcarding / regex in a filter?

       

      IE: I have a parameter that I use in a boolean condition filter to determine which data to display.

       

      For simplicity of explanation, suppose this parameter looks at a country name field (it doesn't, but this exposes the challenge), and there are many different variants of the name Morocco.  In the dataset, suppose we have:

       

      Maracco

      Morocco

      Moracco

      Marocco

      Moracca

       

      I need to be able to search on M*R*CC* and have it filter to show all the variants above, but nothing else...  Our actual situation is far more magnified than this simple country example (we have on the order of 100+ variants per search), and running entity resolution beforehand won't work either: what we have aren't misspellings, and they are intended to be this way... instead, think of an encoded GUID where each digit position in the GUID is meaningful.  We actually want to pull up all the original representations (as they each mean something different), and we can't just break out the digit positions into their own dimensions because the digit positions vary :) (*sigh*)...  Fortunately, the encoding is such that we can use vowels and non-vowels much like the Morocco search above, so if we can get that to work, we're in good shape.  If we have to, I guess we can pre-process everything beforehand and make a whole new column mapping all variants to all possible searches, but that would border on insane levels of denormalization that we would rather not explore if we can just get the wildcards above to work.

       

      If there is a way to do this, how would I then wire that up through the condition filter I have set up for that parameter?  IE: right now, the best I can do is Contains(UPPER([Field1]), UPPER([Parameter1]))... is there an equivalent to Wildcard(UPPER([Field1]),UPPER([Parameter1]), '*') or something?

       

      Also, a completely separate question... any word on a 64-bit thick client version being released?  We have some extremely large workbooks with ~100GB of data that work great, but they always run out of RAM at the 4GB 32-bit limit (and we've got well over 100GB of RAM)...  I know we have to be careful not to draw too many points on the screen (else the visualization becomes a mess and is worthless), but we've pretty well solved that display problem through careful filtering, pre-computing all data before Tableau, and other techniques to aim the gun away from the foot... now we just need the "running low on memory" crashes to go away - so I was just curious if it was in the works :)

       

      Tableau rocks - thanks!

      -Jonathan

        • 1. Re: Wildcard limitation?

          Just an update: right after posting, I had an epiphany about one crazy way to get around this temporarily - but hopefully there is a more elegant solution.

           

          1. Create a new calculated field that replaces all vowels with '*', then optimize the extract.

          2. Change my filter to do a Contains() against both the original field and the new calculated field, joined as an "OR" condition (sketched below).
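
          Roughly, I'm picturing something like this (just a sketch - reusing [Field1] / [Parameter1] from my first post, with the masked copy saved as its own calculated field, here called [Field1 Masked]):

              // [Field1 Masked]: replace every vowel with the '*' placeholder
              REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
                  UPPER([Field1]), 'A', '*'), 'E', '*'), 'I', '*'), 'O', '*'), 'U', '*')

              // Condition filter: match the search text against either version
              CONTAINS(UPPER([Field1]), UPPER([Parameter1]))
              OR CONTAINS([Field1 Masked], UPPER([Parameter1]))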

           

          Of course, this only works if the user wants to wildcard ALL the vowel positions at once, as in the Morocco example above.  Any better ideas?

           

          Thanks,

          Jonathan

          • 2. Re: Wildcard limitation?
            Robert Morton

            Hi Jonathan,

             

            The Data Engine does not yet support regex for Tableau extracts, but if you are connected to a live database you may find that it will support your use case. Many databases provide a 'LIKE' predicate which you can use with the SQL wildcard '_' for a single-character match, or '%' for a match of any number of characters. To take advantage of this feature you will need to create a RAWSQL calculated field to pass the SQL syntax directly to your database. When you subsequently create an extract, the calculated field should get materialized (you may need Tableau 6.1.4 or newer for this).
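
            For example, a calculated field along these lines should work against most databases (just a sketch - I'm reusing the [Field1] / [Parameter1] names from your formula, and the parameter would hold a SQL-style pattern such as 'M%R%CC%' rather than 'M*R*CC*'):

                // Sketch: pass the comparison straight through to the database;
                // %1 and %2 are substituted with the Tableau field and parameter.
                RAWSQL_BOOL("UPPER(%1) LIKE UPPER(%2)", [Field1], [Parameter1])

            You can then use that calculated field directly in your condition filter.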

             

            Regarding 64-bit support, we already provide a 64-bit Data Engine, but we have not made any public commitment to a timeframe for broader support in Tableau. For Desktop applications the memory constraint will be most apparent when visualizing a very large number of points or when performing calculations at a very fine level of detail, for example when using fine-grained partitions in Table Calculations. Sometimes, however, the database driver is the cause of substantial memory consumption (we've seen this with drivers based on the PostgreSQL pgodbc driver, e.g. Greenplum), since it may cache data too eagerly. We've tried to address these problems as they arise, but Tableau 7.0 will make additional improvements by moving Tableau's connection layer for database drivers into a separate process (still 32-bit, though).

             

            I'm curious to learn more about your memory consumption. Do you have a large number of worksheets or data sources in your workbook? Do you have many complex table calculations, or visualizations with a substantial number of marks (this is easy to do with visualizations involving continuous date/time fields, the page shelf, or detailed maps)? What kinds of database connections do you use? Have you verified that you have the proper drivers and the latest version of Tableau? Last, can you pinpoint your memory consumption down to a single viz or two, and would you be willing to share that viz and the data with our support team?

             

            Thanks,

            Robert

            • 3. Re: Wildcard limitation?

              Hi Robert!

              Thanks for the quick reply - the pass-through RAWSQL function is a great idea I hadn't thought of, but unfortunately it won't work for us, since we have to generate Tableau extracts for performance reasons.  We have been using Hive as our backend for the last couple of years (it isn't exactly meant for quick queries, but it's awesome for large data), and we have tools that export the Hive data into Tableau-happy CSVs.  I did see that the engine was 64-bit (tdeserver64.exe gave it away :), so we are always hopeful for a 64-bit client (and a programmatic API to auto-magically drive the creation of .tde / .twbx files).  To get around the memory limit, we've crafted our workbooks very carefully to look at small swaths of our data at any point in time through global filters.  If we mess up at all, we run out of memory, so we just save often.  (Of course, if we aren't careful, plotting everything is quite useless anyway, as it produces a giant mess.)  We always pull everything into a Tableau extract, and we always optimize the extract after creating each calculated field (which presumably serializes an index to memory and disk for quick retrieval… this changes the render time from 4+ hours to about 5 seconds).

               

              To summarize: we've always just been importing a .csv into Tableau's extract engine and then optimizing it as much as possible so there are no dynamic calculations at Tableau's render time (IE: we do complicated "calculated fields" in Hive and only simpler on-the-fly calculations in Tableau).  Of course, we were very happy to see that Tableau now connects to Hive – that was certainly a pleasant surprise.

              As for our usage and memory consumption – I'm convinced it is entirely the visualizations with a substantial number of marks (or items on the page shelf, which we try to keep to a minimum).  By the time we render, everything has been optimized and no aggregate fields are used, so Tableau only does simple data lookups.  We abuse Tableau and go a bit crazy in that we often use it to produce rasterized images pivoted on a page dimension (IE: time) - the source data has x / y dimensions that we choose… and then we pivot that on a third dimension (IE: time) for, say, a billion points.  We load it all into Tableau, but then filter down to a small region for analysis / tuning / building out calculated fields.  At that point, we typically only show a couple million points by constraining our x / y window with a small dimensional range filter, and we constrain the page filter accordingly, holding about 12 steps in memory as well (so - around 24 million points loaded… 2 million rasterized at a given moment).

              If the x / y dimensions are dense and distributed enough, we get a beautiful rasterized image by taking the size setting down to the smallest pixel size, binding our calculated fields to color, and flipping through the pages.  Tableau then rocks: we have high-dimensional data and can start experimenting with different combinations of dimensions through calculated fields (optimized after creation) to see the effect on the raster, rather than building a whole rasterization tool ourselves.  Thus, we can test out hypotheses very easily by using Tableau to explore transformations on the data and bind them to the visualization to great effect.  These rasters keep changing until we are satisfied with a particular dimensional configuration… our take-away is then not just the actual workbook, but the finalized logic behind the calculated fields, which we export to our in-house raster tools configured with the values we discovered inside Tableau.

              All of this leads me to believe that it is simply the number of marks causing the out-of-memory errors we get.  We generally aren't using table calculations at all (we pre-compute these in Hive if needed), aren't using continuous date / time fields (in the views that run out of memory, that is), and keep our page shelf to a reasonably small set.  The database drivers shouldn't be an issue, since we turn the .csv into a Tableau extract right off the bat and never have live connectivity to Hive.  In general, the memory consumption applies to many of our visualizations, but always ones where we are using the rasterization trick or where we accidentally load too much data through the filter (or into aggregate fields that can't be optimized due to their dynamic nature, since they depend on the selected dimensions).

               

              I can probably get you a workbook exhibiting the problem if it helps… to give you an idea, let me make up an example that describes the technical problem (the dimensional storyline is the same, so you can get a picture of the technical challenge).  I'm entirely making this scenario up, so bear with my creative storytelling :)  Imagine you have weather data at some fine resolution (let's say to the nearest foot)… and you also have minute-by-minute weather readings for the past century at each of those points (an easy source of page-filter data [time] tied to an obvious x / y dimensional data binding).  Suppose we are scientists trying to research and display how the different weather dimensions (pressure / temperature / wind speed / etc.) all play out over a given area.  We might make a calculated field that combines pressure and temperature in a certain manner and then plot that out.  As you can imagine, trying to rasterize an area of any size (IE: 100 sq miles) would be intractable and result in out-of-memory errors (we would also have to limit the page filter to a very small timeframe, or else aggregate the readings in Hive to reduce the number of pages).  So you test out the combinations on a small area (say, an area the size of a football field) to see what effects your hypothesis calculated fields achieve.  For the sake of argument, say this works fine within the RAM constraints of a 32-bit system.  Expanding to a 64-bit system will substantially increase our rendering time, but it will give us more leeway on the size of the box - IE: we would love to give Tableau a couple hundred GB of RAM to play with for this exploratory process.

               

              Whew – sorry for the long explanation, but I wanted to give you all the information you might need to help justify a 64-bit client version.  If you have a support team member's contact information I can work with, I'll gladly pursue this further with them!

               

              Thanks,

              Jonathan

              • 4. Re: Wildcard limitation?
                Robert Morton

                Hi Jonathan,

                 

                This is very interesting, thanks for the details! Are you using the Hive connector in 6.1.4+ yet, or are you just waiting for the 7.0 release? With the Hive connector you have REGEXP_EXTRACT and REGEXP_REPLACE, which may address your original question. Just as we do for RAWSQL calculations, Tableau will materialize the results of calculated fields which the Data Engine cannot compute natively. This will allow you to create extracts directly from Hive instead of routing through CSV files.
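
                For instance, a boolean calculated field along these lines could capture your Morocco variants (again just a sketch - Hive's REGEXP_EXTRACT takes the string, a Java-style regular expression, and a capture-group index, and returns an empty string when nothing matches):

                    // Sketch: true when the field matches the vowel-wildcard pattern
                    REGEXP_EXTRACT(UPPER([Field1]), '^M[AEIOU]R[AEIOU]CC[AEIOU]$', 0) != ''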

                 

                As far as memory usage goes, another factor to consider is that Tableau caches query results. If you manipulate the visualization (e.g. by changing the color encoding), this may trigger a new query, so the query cache will gradually become quite large while you are exploring your data. However, when you hit 'F5', Tableau will flush the query cache and reconnect to the database (or Data Engine, in your case). Here's the workflow I suggest when exploring your data: load the workbook and switch to the worksheet of interest; turn off auto-query; change the viz (e.g. drag a new field to color); and hit F5 to flush the cache and trigger the new query.

                 

                I'll contact you via your forum email to discuss using your workbooks to help us internally evaluate performance.

                 

                -Robert