Here's some information on data sampling below. To answer your specific question first: based on another post by Tableau Product Manager Isaac Kunen, my understanding is that Prep pulls in the actual rows being sampled and no extract is created behind the scenes; the output step then brings in all rows at time of generation.
Roughly, we look at the schema for the table and try to make a guess at the number of bytes in a row. We then adjust the number of rows we pull in to hit a target size. So we'll pull in fewer rows for wider tables with larger data types (like strings), and more rows for narrow tables.
BTW, we make the determination *after* looking at the choices you make in the input step. So if you can cull the columns that you pull in, we'll pull in more data for the columns that remain. And we also sample *after* the filters you apply in the input step; so adding filters there will give you a sample of the rows you actually care about.
Hope this helps,
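The sizing heuristic described in the quote above can be sketched roughly as follows. This is an illustration only, not Tableau Prep's actual implementation: the per-type byte estimates and the target byte budget are assumed values.

```python
# Illustrative sketch of the heuristic: guess bytes per row from the
# schema, then size the sample to hit a target byte budget.
# All numbers here are assumptions, not Tableau's real values.

TYPE_BYTES = {"int": 8, "float": 8, "date": 8, "string": 64}  # assumed widths

def estimate_sample_rows(schema, target_bytes=100_000_000):
    """Guess the row width from the schema and size the sample to a byte budget."""
    row_bytes = sum(TYPE_BYTES.get(dtype, 16) for dtype in schema.values())
    return max(1, target_bytes // row_bytes)

# A wide, string-heavy table yields fewer sampled rows than a narrow numeric one.
wide = {f"col{i}": "string" for i in range(50)}
narrow = {"id": "int", "amount": "float"}
```

Note that, as the quote says, in the real product this estimate is made after the column and filter choices in the input step, so removing columns there effectively narrows the schema and raises the row count.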
Data sampling defaults
Tableau Prep quickly determines whether a sample is necessary, and the default number of rows to bring into the sample, based on the number and type of fields present in the data. When a step is added to the flow, an indicator shows that the data is sampled, along with the number of rows included in the sample.
Input step for text file
Clean step showing sampled number of rows
In most cases, data sets over one million rows will likely be sampled; the default sample amount is based on the number of fields and their data types, not the number of records. Data sets with more fields will result in a sample with fewer records (rows) than data sets with fewer fields. This means if you have 300 fields, you'll get fewer rows in your sample than if you had 5 fields. Data type is also a factor: fields with a string data type are usually larger than fields with a numerical data type. Text-heavy data sets will therefore return fewer rows when sampled than data sets that are predominantly numerical.
Although Tableau Prep has helpful defaults for sampling, you may find that you need to adjust the sample, for reasons like:
- You need a more representative sample (e.g. the default settings only pulled data from 2005 when the data set covers 2005-2018). This is common when you have data that is ordered by date, or when you are using a wildcard union.
- You want to generate an even smaller sample (you know the data well and want to streamline the prep experience as much as possible).
- You want to generate a larger sample or use all of the data (there may be too many irregularities to clean the data effectively with a small sample).
Using the data sample options
Once you’ve trimmed unnecessary fields and values from the data set, you may still want to change the amount of data in the sample or how the sample is generated.
These settings are available on the Data Sample tab in the Input step:
Amount of Data: This option determines how much data is brought into the flow.
Default Sample Amount: The amount of data included in the default sample configuration. This isn't a fixed number of rows; rather, how many records are returned depends on the characteristics of your data.
Fixed amount: Alternatively, you can specify an exact number of records to include in the sample, increasing or decreasing from the default.
Use all data: If you don’t want the data to be sampled, you can select this option to force Tableau Prep to retrieve all rows in your data.
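The three Amount of Data choices behave roughly like the following hypothetical helper. The function name and logic are illustrative assumptions, not Tableau Prep's API:

```python
def rows_to_request(mode, total_rows, default_rows, fixed_rows=None):
    """Hypothetical sketch of the three Amount of Data choices."""
    if mode == "default":
        # Default Sample Amount: a data-dependent size, capped by the data itself.
        return min(default_rows, total_rows)
    if mode == "fixed":
        # Fixed amount: a user-specified record count.
        return min(fixed_rows, total_rows)
    if mode == "all":
        # Use all data: no sampling at all.
        return total_rows
    raise ValueError(f"unknown mode: {mode}")
```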
Sampling method: This option determines how the records are chosen from the data source.
Quick select: By default, the database returns the number of rows requested as quickly as possible. This might be the first rows based on how the data is sorted, or the rows that the database had cached in memory from a previous query. While this is almost always a faster result than random sampling, it may return a biased sample (such as data for only one year rather than all years present in the data, if the records are sorted chronologically).
Random sample: The database looks at every row in the data set and randomly returns records until it reaches the number of rows requested, making the sample more representative. However, this affects performance when the data is first retrieved, because the entire data set must be scanned (rather than just the first N results, as with Quick select). This can be useful if the Quick select sample doesn't contain the data that you need, if you are performing a wildcard union and want records from each file, or if joining two sampled tables returns few records.
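The bias difference between the two methods can be shown with a small sketch. This is an analogy in plain Python, not how the database executes either method:

```python
import random

def quick_select(rows, n):
    # Fast but order-dependent: just the first n rows the source returns.
    return rows[:n]

def random_sample(rows, n, seed=0):
    # Scans the whole data set; representative but slower on large sources.
    rng = random.Random(seed)
    return rng.sample(rows, n)

# Data sorted by year: quick select is biased toward the early years.
data = [("2005", i) for i in range(1000)] + [("2018", i) for i in range(1000)]
quick = quick_select(data, 100)      # every row comes from 2005
rand = random_sample(data, 100)      # typically a mix of both years
```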