Incremental Extracts in Tableau[1]

Version 2

    Purpose:

    This document contains scenarios to process incremental data loads in tableau. By default, tableau allows to load the full data set every time the report refreshes. This may at times be cumbersome and degrade the performance while processing reports. The document will capture 4 approaches depending on the structure and processing technique of the source data.

     

    Approaches for incremental data processing

    By default, every time a refresh runs all records are replaced with the data in underling tableau data source.

    The tableau reports are refreshed after the scheduled data extraction process for a report is completed. The data extraction process currently imports entire data set and replaces the previously extracted data set available on the server. Processing entire data extract in tableau can be an overhead in two scenarios:

    1. Large and continually increasing data sets
    2. Data sets are modified daily/hourly/or every minute.

    Incremental load ideally

    1. Captures changes to the existing data
    2. Adds new records to the existing data
    3. Updates the records for any modifications to the existing

     

    Note: Below approaches are highly dependent on the technique of data loading and processing

     

    Below are the approaches that can be programmed within tableau to perform Incremental refreshes.

    1. 1.    Full load with incremental refresh
    2. 2.    Incremental load for SCD
    3. 3.    Incremental load for UPSERT operations to underlying data model
    4. 4.    Append data from file

     

    Full load with incremental refresh

    By default, every time a Tableau refresh executes, all records in already extracted data set on tableau server are replaced with new data. An alternative to replacing the data extract is to add new content to the existing extract on the server. The extract can be purposed to add new records. Below are the steps to set incremental refresh http://onlinehelp.tableau.com/current/pro/desktop/en-us/help.htm#extracting_refresh.html

    1. Select a data source on the Data menu and then select Extract Data.
    2. In the Extract Data dialog box, select All rows as the number of Rows to extract. Incremental refresh can only be defined when you are extracting all rows in the database.
    3. Select Incremental refresh and then specify a column in the database that will be used to identify new rows. For example, if you select a Date field, refreshing will add all rows whose date is after that last time you refreshed.
    4. Alternatively, an ID column can be used, that increments, as rows are added to the database.
    5. When finished, click Extract.

     

     

    Incremental data load for Slowly changing dimensions

     

    Tableau data source is being refreshed daily for new data, every time the data source is refreshed its rebuild the entire data. The approach being discussed below, works seamlessly with the new data that is being added to the tableau extract particularly with Slowly changing dimensions.

    Below is a scenario for incremental loads for SCD type 2

    • Below is test data for “ORDERS” table consisting orders data. OrderID and MODIFICATION_WID is a unique identifier and FROM_DATE determines the active date for an order.

     

    ORDERID

    MODIFICATION_WID

    FROM_DATE

    PAYMENT

    103

    3003

    5/5/2017 12:00:00 AM

    2000

    103

    3004

    5/7/2017 12:00:00 AM

    20001

    104

    4001

    5/8/2017 12:00:00 AM

    3000

    104

    4002

    5/9/2017 12:00:00 AM

    3000

    100

    1001

    5/1/2017 12:00:00 AM

    1000

    100

    1002

    5/2/2017 12:00:00 AM

    1020

    101

    2001

    5/1/2017 12:00:00 AM

    2000

    101

    2002

    5/3/2017 12:00:00 AM

    2000

    101

    2003

    5/4/2017 12:00:00 AM

    2000

    • Incremental load in tableau should be able to capture the latest updated order information, if there are any modifications to the order data.
    • This can be accomplished by calculating the end date of the previous record for an order. the calculation is performed in tableau to determine the existing and new records.
    • Create calculated field “RecentOrderID” and add the below tableau code to the calculation.

              { FIXED [Orderid]:MAX([Modification Wid])}

    • Calculate the end date for an existing record if the record already exist in tableau extract.

     

    • Calculate the end date for an existing record if the record already exist in tableau extract. Create calculated field End_date and add the below code to the calculation

     

         IF ATTR([Order Wid]) = ATTR([RecentOrderID]) then

         NULL

         ELSE

         LOOKUP(ATTR([From Date]),1)

         END

     

    • Drag all the column that need to be displayed in tableau report.
    • Filter on the records where enddate of the order = null.
    • The tableau report will now have all valid / new and updated data.

    Incremental load for UPSERT operations to underlying data model

     

    ORDERWID

    PAYMENT

    LAST_UPDATE_DATE

    100

    500

    5/1/2017 12:00:00 AM

    102

    1500

    5/2/2017 12:00:00 AM

    103

    2500

    5/7/2017 12:00:00 AM

    104

    2500

    5/6/2017 12:00:00 AM

    105

    2000

    5/9/2017 12:00:00 AM

    106

    500

    5/10/2017 12:00:00 AM

    107

    1500

    5/12/2017 12:00:00 AM

    Fact_orders fact table

     

    ORDERWID

    ORDERID

    ORDERSTATUS

    100

    1001

    NEW

    102

    1002

    NEW

    103

    1003

    SHIPPED

    104

    1004

    SHIPPED

    105

    1005

    SHIPPED

    106

    1006

    NEW

    107

    1007

    SHIPPED

    D_Orders dimension table

     

    The above scenario describes data model where dimensional data is updated frequently without CDC (change data capture) in data warehouse. Therefore, the order status (in D_ORDERS) is modified without capturing the time when the modification was made.

    • Extract the data and set the extract to incrementally process on the last_update_date field from fact table.
    • Generate a “calculated date” calculated field in Tableau

    { FIXED [Orderwid]: MAX([Last Update Date])}

    • Generate a calculated field “DataValidationFlag”

              IF [Last Update Date] >= [calculated date] THEN

                   1

              ELSE

                   0

              END

    • Generate a report with required fields on the reporting pane and drag the two-new field “calculated date” and “DataValidationflag” in report.
    • Filter on the “DataValidationFlagField” to display records with filter condition “1”.
    • This solution will filter all the old records available in existing tableau extract.

     

     

    Append data from file

    To add content to the existing extract a manual option has been made available by tableau. This approach requires updates access to tableau desktop.

    Below are the steps to append data to the existing extracts

    1. Right click datasource --> extract --> Appen Data from file
    2. Select the file to append the data from the extract. The file could be csv, excel, twb or tde extension. Therefore, it is possible to append an extract to the existing extract.

     

    Cons:

    1. Appending data to an extract is a manual effort and needs to be accomplished via Tableau desktop.
    2. The merged data source need to be published to the server every time there is changes to the new data source.
    3. Refreshing the existing data source will replace the content of the merged data.