Yes, I have seen the same situation, and I blame it on MS Jet.
I wonder if Tableau has any plans to create their own data loader?
If you do not want to load the data into a proper database, find the first row that Jet does not see, move it to the end of the file, attempt the connection again, and repeat until all rows are seen by MS Jet.
So in your CSV file you have a header row, and when you connect Tableau live to the file and drop "Number of Records" onto the Text shelf, you get a number like 7,123,456. Move line 7,123,458 of the file to the end (the extra line accounts for the header row in the text file), and then Data->Refresh in Tableau. If all rows are still not seen, repeat until they are, and then you can perform your extract.
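For a file too big to open in an editor, that shuffle can be scripted. A minimal sketch (the filenames and the 1-based line number are placeholders for your own); note it works in binary mode so the original line endings are preserved, which matters here:

```python
def move_row_to_end(src, dst, row_num):
    """Stream src to dst, moving the 1-based line row_num to the end.

    Binary mode keeps Unix newlines as-is -- rewriting them as CR/LF
    changes the file size, which can shift where Jet stops reading.
    """
    moved = None
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for i, line in enumerate(fin, start=1):
            if i == row_num:
                moved = line  # hold this line back until the end
            else:
                fout.write(line)
        if moved is not None:
            if not moved.endswith(b"\n"):
                moved += b"\n"
            fout.write(moved)
```

Usage would be something like `move_row_to_end("data.csv", "data_shuffled.csv", 7123458)`.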
I cannot tell you why, but the technique I described worked for me. I claim it is MS Jet at fault because when I used another application that used MS Jet for data loading, it stopped at the same row, but when I used an application that has its own custom data loader, it loaded all rows with no issue.
For me the biggest pain in loading data into a database is setting up the metadata for the table. I prefer a data loader that looks at the entire data file before guessing at the metadata.
Great, thanks Joe.
My first suspicion was MS Jet. And the second and third, come to that.
The fact that another application using Jet stopped at the same row seems pretty darned conclusive evidence to me.
If I could just paraphrase your workaround, you are saying that I should shuffle the rows to the end, one by one, with an hour's worth of failed extract generation in between each time. As it fails at 7 million out of 21 million, I rather fear that I'll die of old age before that approach bears fruit (the fact that it worked for you speaks volumes for your perseverance, though!).
I agree 100% on the metadata pain with loading into a database. I've had it on my to-do list for ages to write myself a utility to scan a file and try to make a better job than Jet does of guessing the metadata - though it's actually trickier than it sounds.
I would love to see a Tableau built-in loader, too - it may be tricky, but it has to be a walk in the park compared to the data engine.
So having confirmed it's reproducible I'm now trying to move lines around in the file as you suggest. Too big for any of my editors so a bit of perl needed...
Hmmm. So much for that idea.
Before shuffling the line to the bottom it loaded 7,245,185 rows. So I moved the next row to the bottom and this time it only loaded 7,234,335.
But I think that's given me the clue as to what is happening. I noticed that my perl script for Joe's shuffle had increased the file size slightly (by replacing Unix-format newlines with DOS CR/LF). So with each row slightly longer, it stopped slightly sooner. Some rough sums say it stopped at about 4 GB - so it looks as if there must be a 4 GB file-size limit on text files for Jet. All made more confusing because I'd got some typos in my script for building the source file, so the rows weren't in the order I thought - which misled me into thinking it wasn't just taking the first part of the file.
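Those rough sums can be made exact: compute the byte offset at the end of the last row Jet loaded and see whether it lands near 2^32 bytes. A small sketch (the path and row count are placeholders):

```python
def byte_offset_of_row(path, row_num):
    """Return the byte offset at the end of the 1-based line row_num,
    i.e. how many bytes a reader would consume to reach that point."""
    offset = 0
    with open(path, "rb") as f:
        for i, line in enumerate(f, start=1):
            offset += len(line)
            if i == row_num:
                break
    return offset

# If byte_offset_of_row("data.csv", 7245186) comes out close to
# 2**32 (4 GiB), the stop point is the file size, not the data.
```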
So looks like I need to go through the pain of loading into an intermediate database. I might just try breaking it into chunks and using a UNION query first though - if it really is a physical file size issue that might just work...
If you want a data loader that just works, this software does just one thing, and it does it very well:
If you want to split a large text file into chunks, with consideration for line breaks, check out this tool (it does require a bunch of options set):
Here is a link to the steps for using gsplit to split a large text file: http://www.freewaregenius.com/2009/07/30/how-to-split-a-very-large-text-or-csv-file-by-a-specific-number-of-lines-rows/
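If you'd rather script the split than use a tool, the same thing can be done in a few lines - break only at line boundaries and repeat the header in every chunk so Jet sees each chunk as a complete table. A sketch, with the chunk size and naming scheme as assumptions:

```python
def split_by_size(src, max_bytes, header=True):
    """Split src into chunk files of at most roughly max_bytes each,
    breaking only at line boundaries and repeating the header row
    in every chunk.  Returns the list of chunk filenames."""
    chunks = []
    with open(src, "rb") as fin:
        head = fin.readline() if header else b""
        out, size, n = None, 0, 0
        for line in fin:
            # Start a new chunk when this line would overflow the limit.
            if out is None or size + len(line) > max_bytes:
                if out:
                    out.close()
                n += 1
                name = "%s.part%d.csv" % (src, n)
                chunks.append(name)
                out = open(name, "wb")
                out.write(head)
                size = len(head)
            out.write(line)
            size += len(line)
        if out:
            out.close()
    return chunks
```

For the 12 GB file discussed here, `split_by_size("data.csv", 3 * 1024**3)` would give four roughly 3 GB chunks, each safely under the 4 GB ceiling.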
Hi Richard, Joe,
Yes, Jet does have a file size limit around 4 GB -- it came up in a prior forum discussion here: http://www.tableausoftware.com/support/forum/topic/date-related-data-not-showing-results
This is an issue that affects many folks, including those of us at Tableau who enjoy exploring large, public data sets which Jet cannot handle. Keeping in line with our goal to empower the individual knowledge-worker, this is a pain point we would like to address. Stay tuned.
Thanks Robert. I'm surprised I haven't hit this one before (or noticed that post).
The UNION ALL approach did allow me to get all the rows in (well actually I seem to have missed 1 row out of 21 million). I split the data into 4 files (roughly 3 GB chunks) and used a custom SQL connection to get Jet to glue them together. I left that running overnight so ought to have had the analysis ready for my meeting this morning.
Unfortunately I forgot to add the 4 chunks into my schema.ini - so the data is trashed - not sure if I have enough time to regenerate.
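For anyone following along: the schema.ini needs an entry for every chunk file, or Jet falls back to guessing the types for the chunks you missed. A sketch of generating those entries (the column names and Jet types here are placeholders for your actual metadata):

```python
import os

def write_schema_ini(directory, chunk_names, columns):
    """Append a schema.ini entry per chunk file so Jet applies the
    same explicit metadata to every chunk.  `columns` is a list of
    (name, jet_type) pairs, e.g. [("id", "Long"), ("note", "Text")]."""
    path = os.path.join(directory, "schema.ini")
    with open(path, "a") as f:
        for name in chunk_names:
            f.write("[%s]\n" % name)
            f.write("ColNameHeader=True\n")
            f.write("Format=CSVDelimited\n")
            for i, (col, jet_type) in enumerate(columns, start=1):
                f.write("Col%d=%s %s\n" % (i, col, jet_type))
            f.write("\n")
```

The `ColNameHeader`, `Format` and `ColN` keys are the standard Jet text-driver schema.ini settings; the rest is just bookkeeping.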
But that UNION ALL trick can be useful for avoiding the pain of an intermediate database load.
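With four or more chunks, the custom SQL that glues them together is easier to generate than to hand-type. A sketch that builds the statement as a string (the file and column names are placeholders); naming the columns explicitly in each sub-query is what makes Jet treat every chunk as its own table with its own header:

```python
def union_all_sql(chunk_names, columns):
    """Build a Jet custom-SQL statement that UNION ALLs the chunk
    files, naming the columns explicitly in every sub-query."""
    col_list = ", ".join("[%s]" % c for c in columns)
    parts = ["SELECT %s FROM [%s]" % (col_list, name)
             for name in chunk_names]
    return "\nUNION ALL\n".join(parts)

# Paste the result into Tableau's custom SQL box, e.g.:
# print(union_all_sql(["data.part1.csv", "data.part2.csv"],
#                     ["id", "amount"]))
```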
Joe: Thanks for all those links - I'll check them out.
With UNION ALL you may find that Jet gets confused about whether or not each file has a header row -- perhaps that's where you're losing a single row?
Good thought, but I don't think that can be it. I regenerated the files with each file having a header row and allowed for the headers when counting rows with "wc -l". Jet must be treating the 4 files as 4 separate tables and recognising the headers - the UNION query explicitly names the columns in each of the sub-queries.
One more thought on this. I know that the 4 GB limit is down in Jet so there's nothing that Tableau can do to fix the problem (short of writing your own version) - but wouldn't it be nice if Tableau checked the filesizes of the files it's asking Jet to read and either warned or disallowed files that are past the 4 GB limit? That sounds like about 20 lines of code to me - or is even the directory/file-chooser wrapped up in Jet? I know that some of the file management is at the Jet level (as per the other post today about the schema.ini file getting filled up with unwanted files http://www.tableausoftware.com/support/forum/topic/hanging-when-connecting-database-text-file).
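For what it's worth, the check really is only a handful of lines. A sketch of what such a pre-flight test might look like (the function name and error wording are mine, and the 4 GB figure is the limit discussed above):

```python
import os

JET_TEXT_LIMIT = 4 * 1024 ** 3  # 4 GiB, the apparent Jet ceiling

def check_jet_file_size(path):
    """Raise before handing Jet a text file it will silently truncate."""
    size = os.path.getsize(path)
    if size > JET_TEXT_LIMIT:
        raise ValueError(
            "%s is %.1f GB; Jet stops reading text files at 4 GB - "
            "split the file or load it into a database first"
            % (path, size / 1024.0 ** 3))
    return size
```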
Giving a warning is a good suggestion, I'll make note of it internally.
Just saw your email come in to support confirming the findings on this thread. I agree that a KB article would be good, but if we better identified the connection in the product as using JET, then perhaps better information sharing would occur, since there is a huge pile of threads in these forums on JET annoyances that should be more widely known. Again, the KB is a good idea; I will submit that as well.
Robert and Matt - thanks for picking up on those suggestions.
On the knowledgebase one, Matt, it's interesting - I actually ran several searches across the forums and the knowledgebase last night with different combinations of "JET", "4 GB" and the like, and didn't find the thread Robert pointed to this morning confirming the 4 GB limit. I very vaguely remembered Robert having posted something on Jet limits a while back, but couldn't flush it out. There's a wealth of information buried in the forum, but even when you know it's there it can be hard to pinpoint sometimes - so pulling points like this into appropriate articles is really useful. There does seem to be a bit of a forum harvesting process going on which feeds ideas into the Knowledgebase - which is great.
One thing that has made it harder to find things in the forum is that with the forum changes the search no longer picks up the names of people who have posted on a thread. I can often remember who said something useful, so putting "Morton" or "Blakemore" or whoever when trying to find an old post used to narrow it down a lot ("Mako" and "Baker" were never quite such effective filters though, which had nothing to do with quality of input!).
Just confirming that with the schema.ini in place I have now successfully loaded the 21 million rows (still 1 solitary row unaccounted for, but I don't care). So the UNION ALL approach is definitely a viable workaround for the JET 4 GB limitation.
I'm using the "UNION ALL" approach for my text files, which has worked fine so far. However, it seems that beyond a certain number of "UNION ALL" clauses (around 50), Tableau returns a "query too complex" error. Has anybody else encountered this issue?