8 Replies Latest reply on Jan 10, 2014 7:37 AM by Steve Goldsby

    Why are CSV extracts slower than Excel extracts

    Steve Goldsby

  I can't figure this one out.  I have a CSV file with about 920k rows and 25 columns.  Naturally I create a Tableau extract to work with this information.  After extraction, I change the data type of certain fields since Tableau guessed incorrectly, making most of them strings.  After fixing the date and number fields in the extract, I optimize the extract and get to work.  However, my viz's run painfully slowly, some taking over a minute to render.

       

      HOWEVER, if I import this same CSV into Excel, immediately save it as an xlsx file, and use that file to create my Tableau extract, my viz's render in a few seconds.  Generally, there's no need to change any of the fields, because Tableau uses Excel's definitions for date, string, and number fields as best I can tell.

       

      Since the schema hasn't changed between the CSV and Excel files, I would expect the resulting extracts to run identically from a performance perspective.

       

      Also of note, the CSV extract (.tde file) is 165,105 KB.  The Excel extract (.tde file) is 110,062 KB.  So apparently the extract engine did something different.

        • 1. Re: Why are CSV extracts slower than Excel extracts
          Robert Morton

          Hi Steve,

           

          Tableau uses the Microsoft JET engine to connect to CSV and Excel files. It's actually Jet that guesses the field types incorrectly, not Tableau. You will find many references to Schema.ini settings on these forums as well as the broader internet that explain how to dictate the proper data types for each field in your CSV file.

           

          Within Tableau, try changing the data type of each field prior to extracting. Does that improve performance? If not, try creating a Schema.ini (and to be safe, quit and re-open Tableau to force Jet to notice the new Schema.ini file), and then use Tableau to connect and extract.
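          As a rough sketch of what that Schema.ini might look like (the file and column names here are made up, not from Steve's data):

          ```ini
          ; Schema.ini must sit in the same directory as the CSV it describes.
          ; The [section name] is the exact CSV file name (hypothetical here).
          [sales-data.csv]
          Format=CSVDelimited
          ColNameHeader=True
          MaxScanRows=0
          Col1="OrderId" Integer
          Col2="OrderDate" DateTime
          Col3="Amount" Double
          Col4="Region" Text
          ```

          With this in place, Jet reads the declared types instead of guessing from the first rows of data.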

           

          I hope this helps,

          Robert

          1 of 1 people found this helpful
          • 2. Re: Why are CSV extracts slower than Excel extracts
            Phillip Burger

            Thumbs up to Robert Morton's reply.

             

            There are many sources of information for schema.ini to be sure, but this one link contains everything you need to get your schema.ini together:

             

            Schema.ini File (Text File Driver)

             

            Even for simple and small data sets in plain text format, i.e., tab-delimited and CSV, I consider using a schema.ini.  It's a little more work up front but always eliminates data type problems.  It gives control to the human.  For large text files, you have to be prepared to always use a schema.ini.

             

            Listed below is an example of a schema.ini file I use for text I import into Tableau.  Everything in the schema.ini is important.  If your data still looks wonky in Tableau after using the schema.ini, I bet it's a problem with the data types or the column ordering in the schema.ini.

             

            [tableau-input-with-ques-mapping-2.tab]

            ColNameHeader=True

            Format=TabDelimited

            MaxScanRows=0

            CharacterSet=OEM

            Col1="surveyId" Integer

            Col2="quesNumber" Integer

            ... (lines deleted here to make post shorter)

            Col23="facilityName" Char Width 20

            Col24="rpLastName" Char Width 10

            Col25="rpFirstName" Char Width 10

            • 3. Re: Why are CSV extracts slower than Excel extracts
              Steve Goldsby

              Gentlemen, thank you very much for your help.  I learned something new today!

               

              My understanding of the inner workings of the TDE was obviously faulty.  I assumed (and you know what they say about assuming) that once the data was in the Tableau extract, fixing the field types in Tableau and then optimizing the extract would fix the schema and data types.  A poor man's way to recast dimensions.  I was wrong.  Here's what I learned:

               

              1. Using Schema.ini drastically improves import performance.  Importing 920k records x 25 columns (365 MB) on a low-performance laptop: using schema.ini, the extract took 50 seconds vs. 15 minutes without.
              2. Using the CSV extract created without schema.ini: one particular viz took 3 minutes to render.
              3. Using an Excel version of the same data in an extract: the viz took about 4 seconds to render.
              4. Using a schema.ini-driven extract of the CSV file: the viz rendered instantly.

               

              My counsel to anyone consuming CSV files:

               

              1. Don't bother converting your CSVs to Excel.  Performance is better using schema.ini for the TDE extract, and eventually you'll hit that 1-million-row limit in Excel.  Save yourself the trouble.
              2. Place the schema.ini in the same directory as your CSV file (that's where the driver will look for it at extract time).  If you move CSV files around, remember to move your schema.ini too.
              3. Be careful using some of the schema.ini data types shown in Schema.ini File (Text File Driver).  I had issues using DateTime (my dates came back null in the extract).  Using "Date" fixed this with no loss of fidelity.
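              If you hit those NULLs, one way to find the offending rows (a sketch only; the column name and date format here are made up for illustration) is to pre-scan the date column in Python before extracting:

              ```python
              import csv
              from datetime import datetime

              def bad_dates(rows, column, fmt="%m/%d/%Y %H:%M"):
                  """Return (row_number, value) pairs that fail to parse with fmt."""
                  bad = []
                  for i, row in enumerate(rows, start=2):  # row 1 is the header
                      try:
                          datetime.strptime(row[column], fmt)
                      except ValueError:
                          bad.append((i, row[column]))
                  return bad

              # In-memory data standing in for the CSV file
              data = "Submit Date\n1/5/2014 23:12\nnot-a-date\n12/31/2013 08:00\n"
              rows = csv.DictReader(data.splitlines())
              print(bad_dates(rows, "Submit Date"))  # → [(3, 'not-a-date')]
              ```

              Any row it reports is one Jet would have silently imported as NULL.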

               

              Thanks again.

              1 of 1 people found this helpful
              • 4. Re: Why are CSV extracts slower than Excel extracts
                Robert Morton

                Hi Steve,

                I'm glad things worked out for you, and thanks for following up here to let us know.

                -Robert

                • 5. Re: Why are CSV extracts slower than Excel extracts
                  Steve Goldsby

                  I may have spoken too soon.  I was still pointing to a different extract after I imported the CSV file to a new extract.  Once I pointed my worksheet to the CSV data source, the date issues came back (NULLs).

                   

                  I'm having some issues with proper DateTimeFormat strings.  I keep getting Jet ODBC abends because it doesn't like any of the DateTimeFormat strings I've tried.  My dates in the CSV are of the form "m/d/yyyy HH:MM"; that is, month is 1 or 2 digits, day is 1 or 2 digits, year is always 4 digits, 24-hour time format (e.g. 23:12).

                   

                  If I set all Date fields to "DateTime" and try the following formats I get the results shown:

                  1. DateTimeFormat=mm/dd/yyyy H:mm  Jet says "Datetimeformat string is invalid"
                  2. DateTimeFormat=mm/dd/yyyy hh:mm  Jet says "Datetimeformat string is invalid"
                  3. DateTimeFormat=m/d/yyyy h:m  Jet says "Datetimeformat string is invalid"
                  4. DateTimeFormat=mm/dd/yyyy   all dates imported as null
                  5. DateTimeFormat=m/d/yyyy   all dates imported as null
                  6. commenting out DateTimeFormat: all dates imported as null

                   

                  And if I set the date fields to "Date" instead of DateTime, the data imports, but again, all NULLs.

                   

                  Any ideas?

                  • 6. Re: Why are CSV extracts slower than Excel extracts
                    Robert Morton

                    Have you tried with all caps, MM/DD/YYYY, etc.?

                    -Robert

                    • 7. Re: Re: Why are CSV extracts slower than Excel extracts
                      Steve Goldsby

                      Well, there's 6 hours of my life I'll never get back, but I finally found the answer.  The correct format string is:

                       

                      DateTimeFormat="yyyy/mm/dd hh:nn:ss"

                       

                      Yes, "nn" for minutes, not "mm" as any sane human would think.  It's odd that most of the MS Jet discussions are so old that Google doesn't put them near the top of your searches, nor does Microsoft.  Given enough keywords, though, you find a few gems like Those Little Things.
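                      Since the working DateTimeFormat expects "yyyy/mm/dd hh:nn:ss", one option (a hypothetical helper, not something from the thread) is to rewrite the CSV's m/d/yyyy HH:MM dates into that layout before extracting:

                      ```python
                      from datetime import datetime

                      def to_jet_form(value):
                          """Convert 'm/d/yyyy HH:MM' (e.g. '12/3/2013 20:15') into the
                          yyyy/mm/dd hh:nn:ss layout that matched the working
                          DateTimeFormat string above.  Note: Python's strftime uses
                          %M for minutes, where Jet uses nn."""
                          dt = datetime.strptime(value, "%m/%d/%Y %H:%M")
                          return dt.strftime("%Y/%m/%d %H:%M:%S")

                      print(to_jet_form("12/3/2013 20:15"))  # → 2013/12/03 20:15:00
                      ```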

                       

                      I amend my previous performance claims:

                      1. Using Schema.ini drastically improves import performance.  Importing 920k records x 25 columns (365 MB) on a low-performance laptop:
                        1. using schema.ini, the extract took 87 seconds (1.5 minutes)
                        2. without schema.ini, the extract took 64 seconds (1 minute), but all fields came in as strings and had to be recast.  The resulting viz's were painfully slow for reasons we've already discussed.  See previous post.
                        3. the extract created from an XLSX of the same data took 585 seconds (10 minutes), and a few fields had to be tweaked.  Performance of the resulting viz's remained good.

                       

                      Bottom line:  Use schema.ini.  I'm including my sample here for convenience.

                       

                      [IMO_DEC13fast.csv]

                      CharacterSet=65001

                      ColNameHeader=True

                      MaxScanRows=0

                      Format=CSVDelimited

                      ; date format is "2013/12/03 20:15:08"

                      ; DateTimeFormat="YYYY/MM/DD HH:MM:SS"  ERROR

                      DateTimeFormat="yyyy/mm/dd hh:nn:ss"

                      Col1="Number of Incidents" Integer

                      Col2="Incident ID" Text

                      ....<redacted to save space>

                      Col6="Submit Date" DateTime

                      Col7="Priority" Text

                      Col8="Previous Month" DateTime

                      ...<redacted to save space>

                      Col25="Assignee" Text

                      1 of 1 people found this helpful
                      • 8. Re: Re: Why are CSV extracts slower than Excel extracts
                        Steve Goldsby

                        Great tip.  One note - in testing, you don't have to restart Tableau.  You can modify schema.ini on the fly, and Jet will re-read it each time you refresh the extract (with Tableau 8.1, anyway).

                        1 of 1 people found this helpful