Skip navigation

Highlighting Potential Data Quality Issues When Using JET (Excel,Text,...)

score 34
You have not voted. Archived

From this post: http://community.tableausoftware.com/thread/110812

 

When we bring data in from Excel or Text files, we rely on JET to make appropriate assumptions about the data type. If JET gets it wrong, we stand to lose data. You can write a schema.ini to force assumptions, but the majority of users will have no idea about this, resulting in data quality issues.

 

An example:

  • Pull in 1000 rows with an Employee ID column
  • The first 900 rows are a numeric ID
  • The last 100 rows are alphanumeric
  • JET sees the first few rows, assumes the column is numeric
  • The last 100 rows now have NULL as the ID

 

To realize this has even happened requires work by the user. To fix it requires writing a schema.ini. I guarantee that there is a group of users that have data quality issues because of this, but have no idea data is missing.

 

When JET is used, Tableau should, at the very least, provide a data quality dialog box upon completion, highlighting exceptions - for example

 

Warning: 100 Nulls in Employee ID out of 1000 rows

 

Ideally, the next step would be for the user to be able to pop up a list of the imported columns, indicate what type they should be, and for Tableau to write the schema.ini and reimport.

 

 

Text of original post:

 

I have had plenty of issues with the Jet database engine - dates not being imported correctly most commonly. But I know and am aware of this issue and will commonly create a schema.ini. However, for less advanced users, Tableau's use of Jet is causing severe data quality issues. For example, I was working with a colleague who had pulled some data in from Excel. We were using one of the dimensions (location) when I noticed a much larger than usual NULL category (the data had some legitimate NULLS in it). It turns out the first 50,000 rows of the data had a numerical tag for the locations, and the last 20,000 were alphanumeric. Jet, having decided location was only numeric then replaced these latter locations with NULLs. I guarantee you (Tableau) that this is happening more often than you think and people are making decisions based on bad data. I cannot say this strongly enough; this is Tableau's worst failing by far. Suggestions for partially fixing this (in the absence of replacing Jet):

  1. If someone changes a data type (number to string), prompt the user that the Jet engine may have missed some data out. Automatically write a schema.ini out based on what Tableau now knows about the data, and offer to reimport
  2. Upon import, list the columns out and inform the user what Jet thinks they are, show the number of NULLs returned in each column. Allow the user to make changes, automatically create a schema....
  3. Dramatically increase the number of rows checked - I would also hope that the settings are set to Import Mixed Types to Text (rather than majority: http://office.microsoft.com/en-us/access-help/initializing-the-microsoft-excel-driver-HP001032159.aspx)
  4. After import, Tableau would look at any columns with a suspiciously high number of NULLs and do a fake import of just those columns this time checking many more rows (if not all), write a schema.ini out if there's a problem, and reimport.

I reiterate, this is not a problem for many of us on this forum who are used to the failings of JET. However, this is a problem for many. I would like members of the Tableau team to comment on this post please.

Comments

Vote history