5 Replies Latest reply on May 2, 2011 8:32 AM by Jay Janarthanan

    In memory data source and load.

    Jay Janarthanan

      We need to load a text file that is CSV, about 4 M lines. This file we can keep it in memory when created . The question is how to load this into Tableau desktop.


      My options are write to disk (which takes time), then use MS Jet to export but i keep hearing about the 4gb limitation.


      I am willing to use a data base and keep the table in memory. For example MySQL Heap engine. Will moving data from that to Tableau be faster ?



        • 1. Re: In memory data source and load.
          Joe Mako

          When Tableau 6.1 is released, there may be another data loading option. See thread at The specified item was not found.


          Until then, if your data is large and in a flat file, then loading it into a database is a good option, something like Calpont/Infinidb or Infobright, and using an ETL application like Pentaho/Kettle is very useful for preparing data for Tableau. I am finding that unless there is a dedicated data warehouse, the Tableau Data Engine is fastest for Tableau queries when only using the local computer.


          I don't have a computer with enough memory to store all my data completely in memory, so I do not know if a MySQL Heap would be faster.

          • 2. Re: In memory data source and load.
            Richard Leeke

            Just expanding on Joe's comments a bit.  I've been experimenting with various options for loading large CSV files recently, so I'll summarise the pros and cons of the various approaches I've been trying.  I hadn't considered the MySQL MEMORY table option, but having had a quick read about it I'll certainly give that a try - it will be interesting to see how that performs for data that fits in memory.


            The options I've tried for large CSV files are:


            1) Break the file into <4GB chunks, create a custom SQL statement which joins all the chunks with UNION ALL and create a data extract from that.


            2) Create a MySQL CSV engine table to access the CSV file (basically you just create a CSV engine table and then replace the empty CSV file with an existing CSV file).  Then connect to that from Tableau desktop and create a data extract.


            3) Load the CSV file into Infobright (column store database) and connect directly to that with Tableau, leaving the data in Infobright.


            4) As 3) but then create a data extract from the Infobright database to try to get the advantages of the Tableau Data Engine back.


            5) Load the CSV file into InfiniDB (column store database) and connect directly to that with Tableau, leaving the data in InfinDB.



            Very brief pros and cons of each option, comparing performance on 100 million row example I've been using, on my high spec laptop, are as follows.  This was on an 8 core i7, 8 GB laptop running Windows 7 64 bit.  Your mileage WILL vary.



            1) Slow to load (would be many hours for 100 million rows, I gave up at 20 million), hard to track down data errors on loading (JET quietly ignores them, which can lead to misleading results), no ability to load data incrementally, analysis performance very good, especially when adding calculated dimensions as these can be added to the extract with the "optimize" option without needing to reload.


            2) Need to define schema by hand (unlike option 1 where Tableau/JET auto-detect format), quite sensitive to data format, no support for NULL values (i.e. you can't have ',,' in your CSV file), should reduce the load time when compared with JET because it skips the schema detection stage.  I tried this for the first time last night on that 100 million row file and twelve and a half hours later it claims still to be going, but it looks to me as if it has hung - the MySQL engine is using 100% of one CPU repeatedly retrying reading at end of file.  I might try a small one to see if the approach works.


            3) Need to define schema by hand, fast to load (26 minutes for 100 million rows), small resulting database (1.1 GB), incremental load available, load is very sensitive to data errors (aborts at first error, or sometimes with no error, unreliable reporting of error line), good resulting analysis performance (probably not as good as data engine - that's what I was trying to test with option 2), no ability to add 'materialised' calculated fields without reloading and an ETL process.


            4) Failed to load due to requirement for vast amounts of temporary storage to extract the data.  A 20 million row, < 1 GB  Infobright database would have required 350 GB of temporary space, which was more than I had available.


            5) Need to define schema by hand, fastest to load (15 minutes), much larger resulting database (18 GB), incremental load available, best approach to data quality (error lines reported to an exception file, load continues till error count reaches specified threshold), quite good analysis performance but a lot slower than Infobright for most analysis I tested,  no ability to add 'materialised' calculated fields without reloading and an ETL process.



            I'll have a go with the MySQL MEMORY engine when I get a chance and report back.


            As Joe mentioned, version 6.1 looks set to change the balance significantly here.

            • 3. Re: In memory data source and load.
              Richard Leeke

              I just tried option 2) (creating a Tableau Data Extract by reading a CSV file using the MySQL CSV engine) on the first 1 million rows of my big file.  That worked fine and completed in almost exactly 1 minute.  So the process works - not sure why it hung on the bigger file.


              I'll try the same just using the default Tableau method (the JET engine) to get a comparison on timing.

              • 4. Re: In memory data source and load.
                Richard Leeke

                Results of a few more experiments.


                I compared the time to load the first 10 million rows of my big sample file directly with Tableau (using JET) and using the MySQL CSV connector.


                JET: 21 minutes

                MySQL: 14 minutes


                Personally I don't think the speed difference there is enough to warrant the extra complexity and manual intervention needed to go the MySQL route for files which will load directly using JET (i.e. under the 4 GB file size limit).  It might be worth going that way for files larger than 4 GB - but beware of the issue I hit yesterday, where MySQL started looping at end of file on a much bigger file.


                From the hints I've picked up about 6.1, I think the story may change here, soon, anyway.



                I also had a bit of a play with the MySQL Memory engine.  Unfortunately the table structure for my test table has some large varchar fields, for which the typical length is much less than the maximum.  From what I've gathered from a quick look, the Memory engine seems to allocate maximum row sizes - so I found I couldn't get anywhere near as many rows in memory as the physical CSV file size would suggest.  I didn't experiment with varying column widths, though.


                I allocated 4 GB for the memory table and could only load somewhere between 1,000,000 and 2,000,000 rows.  1,000,000 rows was only a 150 MB CSV file.


                Loading 1,000,000 rows into memory (with no indexes) took between 5 and 6 seconds.  Analysis response times were very good, though I didn't do much with it.


                Load performance tailed off sharply as I tried to increase the size.  Allocating a maximum of 6 GB and loading 2,000,000 rows took 45 seconds - which is probably because I didn't have enough physical memory and started paging (I'm guessing, I didn't check).


                So this looks like a potentially viable way of loading a reasonable sized data set very quickly, as long as you have enough physical memory and not too many large varchar columns.  It's easy to script the load so that the memory table is automatically loaded when MySQL starts, and also to refresh the table simply by replacing the CSV file and recreating the memory table from it.


                Again, I think I'd only consider going this way for very specialised requirements, where being able to refresh the load of a few million rows in a matter of seconds rather than minutes was important.  Unless loading speed is really that critical, I think the data engine is still likely to be the best bet, because of tight integration, speed of analysis and ease of use.  Your 4 million row csv file is well within the comfort zone for the data engine. 


                But it was a very interesting question - I can definitely see times when the memory table option would be useful.

                • 5. Re: In memory data source and load.
                  Jay Janarthanan

                  Since my goal is to use Tableau Data Engine I am going to replicate the data into MySQL (using RabbitMQ)  and then do an extract. Ordered a motherboard with 32 GB of Ram (cant believe how cheap ram these days, around 400 bucks for 32 gigs) , once I get it by this weekend will do a test run and report back.


                  I have never tried InfiniDB but looks like instead of me replicating to MySQL I can send the data to InfiniDB. I will test that out today.


                  Another option is CSV in memory and load it from memory using RamDisk.