
    Losing rows when creating an extract from a large CSV file

    Richard Leeke

      Losing quite a lot of rows, actually.

       

      I have a large CSV file (21 million rows, 12 GB), from which I created a data extract.  The resulting extract has 7 million rows.  I'm wondering what happened to the others.

       

      I know that it's not just a case of invalid data, because I created the workbook and an extract from about half of the data initially.  The data is about 1 million rows per day.  The initial extract covered a period of ten days in January and all days were there in the extract (though I don't know for sure that all rows were there).

       

      Having spotted some interesting patterns, I concatenated in data for a selection of days last year and regenerated.  The resulting extract now only shows data for 3 of the original days in 2011 - and also doesn't have all rows for the dates in 2010.  I've checked that the raw data is still present in the input file with command-line tools (grep, word count, etc).

       

      I know it's not just that Jet is only returning the first 7 million rows, because I know the order in which I joined the original source data (2 files per day) and the missing dates are scattered through the file.

       

      I looked at the Tableau logs for the extract creation and it says it added 7 million rows and doesn't report any errors that I can see.

       

      I'm running Windows 7 64 bit with 8 GB memory and 500 GB disk, so I don't think there should be any resource constraints here.

       

      Has anyone had any joy with creating extracts from text files of that size?  I know Tableau has demonstrated 100 million row extracts, but I suspect those may have come from proper databases.  I'm just wondering if I need to import into a database first.  Any thoughts on whether SQL Server, Firebird or MySQL is likely to be the easiest/quickest if I go that route?

       

       

       

      I'll raise a support issue on this, but I fear it will be really difficult for support to answer until I can get the data to them - and I've promised the analysis for a meeting first thing in the morning.  ;-(

        • 1. Re: Losing rows when creating an extract from a large CSV file
          Joe Mako

          Yes, I have seen the same situation, and I blame it on MS Jet.

           

          I wonder if Tableau has any plans to create their own data loader?

           

          If you do not want to load the data into a proper database, find the first row that it does not see, move it to the end of the file, attempt the connection again, and repeat until all rows are seen by MS Jet.

           

          So in your CSV file, you have a header row, and when you connect Tableau live to the file and drop "Number of Records" onto the Text shelf, you get a number like 7,123,456. Move line 7,123,458 to the end of the file (accounting for the header row in the text file), and then do a Data->Refresh in Tableau. If all rows are still not seen, repeat until all are seen, and then you can perform your extract.
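
          If the file is too big to edit by hand, a small script can do the shuffle without holding the whole file in memory. Something along these lines ought to work - just a perl sketch, where the file names and the row number are placeholders, and a single header row with unix line endings is assumed:

              use strict;
              use warnings;

              # Usage: perl shuffle_row.pl big.csv big_shuffled.csv 7123458
              my ($in_name, $out_name, $target) = @ARGV;

              open my $in,  '<', $in_name  or die "can't read $in_name: $!";
              open my $out, '>', $out_name or die "can't write $out_name: $!";
              binmode $in;  binmode $out;    # keep the original line endings untouched

              my $held;
              while (<$in>) {
                  if ($. == $target) { $held = $_; next; }   # $. is the current line number
                  print {$out} $_;
              }
              print {$out} $held if defined $held;           # re-append the problem row at the end
              close $out;

          (7,123,458 in the usage line is just the example row number from above, allowing for the header.)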

           

          I cannot tell you why, but the technique I described worked for me. I claim it is MS Jet at fault because when I used another application that used MS Jet for data loading, it stopped at the same row, but when I used an application that has its own custom data loader, it loaded all rows with no issue.

           

          For me the biggest pain in loading data into a database is setting up the metadata for the table. I prefer a data loader that looks at the entire data file before guessing at the metadata.

          • 2. Re: Losing rows when creating an extract from a large CSV file
            Richard Leeke

            Great, thanks Joe.

             

            My first suspicion was MS Jet.  And the second and third, come to that.

             

            The fact that another application using Jet stopped at the same row seems pretty darned conclusive evidence to me.

             

            If I could just paraphrase your workaround, you are saying that I should shuffle the rows to the end, one by one, with an hour's worth of failed extract generation in between each time.  As it fails at 7 million out of 21 million, I rather fear that I'll die of old age before that approach bears fruit (the fact that it worked for you speaks volumes for your perseverance, though!).

             

            I agree 100% on the metadata pain with loading into a database.  I've had it on my to-do list for ages to write myself a utility to scan a file and do a better job than Jet does of guessing the metadata - though it's actually trickier than it sounds.

             

            I would love to see a Tableau built-in loader, too - it may be tricky, but it has to be a walk in the park compared to the data engine.

            • 3. Re: Losing rows when creating an extract from a large CSV file
              Richard Leeke

              So having confirmed it's reproducible I'm now trying to move lines around in the file as you suggest.  Too big for any of my editors so a bit of perl needed...

              • 4. Re: Losing rows when creating an extract from a large CSV file
                Richard Leeke

                Hmmm.  So much for that idea.

                 

                Before shuffling the line to the bottom it loaded 7,245,185 rows.  So I moved the next row to the bottom, and this time it only loaded 7,234,335.

                 

                But I think that's given me the clue as to what is happening.  I noticed that my perl script to try Joe's shuffle had increased the file size slightly (by replacing unix format new-lines with DOS CR/LF).  So with each row slightly longer it stopped slightly sooner.  Some rough sums say it stopped at about 4 GB - so it looks as if there must be a 4 GB file-size limit on text files for Jet.  It was all made more confusing because I'd got some typos in my script for building the source file, so the rows weren't in the order I thought - which is why I'd convinced myself it wasn't just taking the first part of the file.
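
                For what it's worth, the rough sums (assuming the limit is exactly 2^32 bytes, which is my guess rather than anything documented):

                    4,294,967,296 bytes / 7,245,185 rows  =  about 593 bytes per row (unix new-lines)
                    4,294,967,296 bytes / 7,234,335 rows  =  about 594 bytes per row (DOS CR/LF)

                One extra byte per row is just what the CR/LF conversion adds, so the numbers hang together with a 4 GB cap.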

                 

                So it looks like I need to go through the pain of loading into an intermediate database.  I might just try breaking it into chunks and using a UNION query first though - if it really is a physical file size issue that might just work...

                • 5. Re: Losing rows when creating an extract from a large CSV file
                  Joe Mako

                  If you want a data loader that just works, this software does just one thing, and it does it very well:

                  http://www.stattransfer.com/

                   

                  If you want to split a large text file into chunks, with consideration for line breaks, check out this tool (it does require a bunch of options to be set):

                  http://www.gdgsoft.com/gsplit/

                   

                  Here is a link to the steps for using gsplit to split a large text file: http://www.freewaregenius.com/2009/07/30/how-to-split-a-very-large-text-or-csv-file-by-a-specific-number-of-lines-rows/
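
                  Or, if you would rather script the split than set up gsplit, a rough perl sketch along these lines will break the file on line boundaries and repeat the header in each chunk (the file name and chunk size are placeholders):

                      use strict;
                      use warnings;

                      # Usage: perl split_csv.pl big.csv 6000000
                      my ($in_name, $rows_per_chunk) = @ARGV;
                      open my $in, '<', $in_name or die "can't read $in_name: $!";
                      binmode $in;

                      my $header = <$in>;
                      my ($chunk, $count, $out) = (0, 0, undef);

                      while (my $line = <$in>) {
                          if (!$out || $count >= $rows_per_chunk) {
                              close $out if $out;
                              open $out, '>', sprintf("chunk%d.csv", ++$chunk)
                                  or die "can't write chunk $chunk: $!";
                              binmode $out;
                              print {$out} $header;   # repeat the header so each chunk stands alone
                              $count = 0;
                          }
                          print {$out} $line;
                          $count++;
                      }
                      close $out if $out;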

                  • 6. Re: Losing rows when creating an extract from a large CSV file
                    Robert Morton

                    Hi Richard, Joe,

                     

                    Yes, Jet does have a file size limit around 4 GB -- it came up in a prior forum discussion here:  http://www.tableausoftware.com/support/forum/topic/date-related-data-not-showing-results

                     

                    This is an issue that affects many folks, including those of us at Tableau who enjoy exploring large, public data sets which Jet cannot handle.  Keeping in line with our goal to empower the individual knowledge-worker, this is a pain point we would like to address.  Stay tuned.

                     

                    -Robert

                    • 7. Re: Losing rows when creating an extract from a large CSV file
                      Richard Leeke

                      Thanks Robert.  I'm surprised I haven't hit this one before (or noticed that post).

                       

                      The UNION ALL approach did allow me to get all the rows in (well, actually I seem to have missed 1 row out of 21 million).  I split the data into 4 files (roughly 3 GB chunks) and used a custom SQL connection to get Jet to glue them together.  I left that running overnight, so I ought to have had the analysis ready for my meeting this morning.
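
                      In case it is useful to anyone else, the custom SQL was just along these lines (the column names here are made-up stand-ins for my real ones; from memory the Jet text driver wants each file referred to with the dot replaced by a '#'):

                          SELECT [date], [server], [response_ms] FROM [chunk1#csv]
                          UNION ALL
                          SELECT [date], [server], [response_ms] FROM [chunk2#csv]
                          UNION ALL
                          SELECT [date], [server], [response_ms] FROM [chunk3#csv]
                          UNION ALL
                          SELECT [date], [server], [response_ms] FROM [chunk4#csv]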

                       

                      Unfortunately I forgot to add the 4 chunks to my schema.ini - so the data is trashed - and I'm not sure if I have enough time to regenerate.
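
                      For the record, the entries I'd forgotten are just a section per chunk in schema.ini, something like this (again, column names and types are placeholders for my real ones):

                          [chunk1.csv]
                          ColNameHeader=True
                          Format=CSVDelimited
                          Col1=date DateTime
                          Col2=server Text
                          Col3=response_ms Long

                          [chunk2.csv]
                          ColNameHeader=True
                          Format=CSVDelimited
                          Col1=date DateTime
                          Col2=server Text
                          Col3=response_ms Long

                      ...and the same again for chunk3.csv and chunk4.csv.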

                       

                      But that UNION ALL trick can be useful for avoiding the pain of an intermediate database load.

                       

                      Joe: Thanks for all those links - I'll check them out.

                      • 8. Re: Losing rows when creating an extract from a large CSV file
                        Robert Morton

                        Hi Richard,

                        With UNION ALL you may find that Jet gets confused about whether or not each file has a header row -- perhaps that's where you're losing a single row?

                        -Robert

                        • 9. Re: Losing rows when creating an extract from a large CSV file
                          Richard Leeke

                          Good thought, but I don't think that can be it.  I regenerated the files with each file having a header row and allowed for the headers when counting rows with "wc -l".  Jet must be treating the 4 files as 4 separate tables and recognising the headers - the UNION query explicitly names the columns in each of the sub-queries.

                           

                          One more thought on this.  I know that the 4 GB limit is down in Jet, so there's nothing that Tableau can do to fix the problem (short of writing your own version) - but wouldn't it be nice if Tableau checked the file sizes of the files it's asking Jet to read and either warned about or disallowed files that are past the 4 GB limit?  That sounds like about 20 lines of code to me - or is even the directory/file-chooser wrapped up in Jet?  I know that some of the file management is at the Jet level (as per the other post today about the schema.ini file getting filled up with unwanted files http://www.tableausoftware.com/support/forum/topic/hanging-when-connecting-database-text-file).
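
                          Just to illustrate the scale of the check I have in mind - something of this order (a purely illustrative perl sketch, not a suggestion about how Tableau is actually built):

                              use strict;
                              use warnings;

                              my $JET_LIMIT = 2**32;   # the apparent 4 GB Jet text file limit

                              for my $file (@ARGV) {
                                  my $size = -s $file;
                                  warn "$file is over the 4 GB Jet limit - rows past it will be silently dropped\n"
                                      if defined $size && $size > $JET_LIMIT;
                              }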

                          • 10. Re: Losing rows when creating an extract from a large CSV file
                            Robert Morton

                            Hi Richard,

                            Giving a warning is a good suggestion - I'll make a note of it internally.

                            -Robert

                            • 11. Re: Losing rows when creating an extract from a large CSV file
                              Matt Morgenroth

                              Hello Richard,

                              Just saw your email come in to support confirming the findings on this thread.  I agree that a KB article would be good, but if we better identified the connection in the product as one that uses JET, then perhaps better information sharing would occur, since there is a huge pile of threads in these forums about JET annoyances that should be more widely known.  Again, the KB article is a good idea - I will submit that as well.

                              • 12. Re: Losing rows when creating an extract from a large CSV file
                                Richard Leeke

                                Robert and Matt - thanks for picking up on those suggestions.

                                 

                                On the knowledgebase one, Matt, it's interesting - I actually did several searches across the forums and the knowledgebase last night, using different combinations of JET and 4 GB and such like, but didn't find the thread Robert pointed to this morning confirming the 4 GB limit.  I very vaguely remembered Robert having posted something on Jet limits a while back, but couldn't flush it out.  There's a wealth of information buried in the forum, but even when you know it's there it can be hard to pinpoint sometimes - so pulling points like this into appropriate articles is really useful.  There does seem to be a bit of a forum harvesting process going on which feeds into ideas for the Knowledgebase - which is great.

                                 

                                One thing that has made it harder to find things in the forum is that with the forum changes the search no longer picks up the names of people who have posted on a thread.  I can often remember who said something useful, so adding "Morton" or "Blakemore" or whoever to the search when trying to find an old post used to narrow it down a lot ("Mako" and "Baker" were never quite such effective filters, though - which had nothing to do with quality of input!).

                                • 13. Re: Losing rows when creating an extract from a large CSV file
                                  Richard Leeke

                                  Just confirming that with the schema.ini in place I have now successfully loaded the 21 million rows (still 1 solitary row unaccounted for, but I don't care).  So the UNION ALL approach is definitely a viable workaround for the JET 4 GB limitation.

                                  • 14. Re: Losing rows when creating an extract from a large CSV file
                                    guest contributor

                                     I'm using the "UNION ALL" command for my text files, which has worked fine so far.  However, it seems that beyond a certain number of "UNION ALL" commands (50 commands), Tableau returns a "query too complex" error.  Has anybody else encountered this issue?
