Connect Tableau Extracts to the ETL process of the Data Warehouse

Version 1

    Why?

    As you might know, integrating ETL and Tableau so that they work together is a hot topic. Out of the box, we can only schedule extract refreshes and subscriptions on a fixed schedule. For example, we schedule our data extract to refresh at 6 AM because our DBA told us that the Data Warehouse (DW) load usually finishes around 5:30 AM, and we attached our extract to that refresh schedule.

     

    That works in a perfect world, where the DW runs like clockwork. Unfortunately, reality is different. The DW load could finish at 8 AM, 10 AM, or later; it could finish at 5:30 AM but with lots of errors and inconsistent data. As a result, business users waste their time waiting for data, or worse, they build reports on bad data and send them to managers, or managers receive those reports through subscriptions. To make a long story short, this is a complete failure for the BI/DW team, and it happens a lot.

     

    I have seen lots of discussions about how to do this, but I have not seen detailed step-by-step instructions. Let's fill this gap.

     

    What?

    First of all, let's define what we want: a simple mechanism, built with open-source technology, that refreshes our extracts (or performs any other operation, such as publishing a packaged workbook or exporting PDF reports to a shared folder) as soon as the ETL finishes.

     

    What do we need?

    It depends on which tool we use (GUI or scripts). I will cover one of the most common setups, where the ETL process writes its log to a database table.

    The following components are required:

    1. ETL logs stored in one of the DW tables
    2. A way or tool to combine the SQL query and tabcmd
    3. A way or tool to query the ETL logs on a schedule (polling)

    How?

    Let's do this.

    First, you should understand the ETL logs in order to find which row or rows represent the end of the particular process. It is simple to figure out: look at your data source and ask the ETL developer which ETL processes are responsible for it. Usually, the FACT table is loaded after the DIM tables.

     

    For example, we have a table ETL_LOGS with all the logs, and we know that the fact-table load step is called 'FACT_TABLE':

     

    select *
    from ETL_LOGS
    where etl_module = 'FACT_TABLE'
      and start_date > trunc(sysdate)
    order by finish_date;

     

    This query returns all the rows that were logged during the nightly load of my fact table; in your case it may look different.

    In addition, we can find the last row, which for us marks the end of the process. For example:

    ID    | etl_module | etl_step           | start_date           | finish_date          | loaded_rows | error_rows
    87346 | FACT_TABLE | collect_statistics | 23-SEP-2016 05:40:01 | 23-SEP-2016 05:45:07 | 95865600    | 93482980

     

    In other words, 05:45:07 is the perfect time to refresh the extract or run other activities.

     

    The next step is to modify our initial SQL query so that it keeps failing until the condition becomes true. My example is for Oracle, but it can be adapted to any database:

    select
      to_number(decode(case when count(*) = 1 then 1 else 0 end, 0, 'X', 1)) -- 'X' makes to_number fail
    from etl_logs
    where start_date > trunc(sysdate)            -- only today's rows
      and etl_module = 'FACT_TABLE'
      and etl_step = 'collect_statistics'        -- the last step of the FACT_TABLE load
      and finish_date is not null                -- the step has finished
      and (select sum(case when error_rows is null then 0 else error_rows end)
           from etl_logs
           where start_date > trunc(sysdate)
             and etl_module = 'FACT_TABLE') = 0  -- no errors for FACT_TABLE today
      and sysdate - finish_date < 5/24/60;       -- it finished less than 5 minutes ago

    I added a comment to each line. The main idea of the query is that it fails whenever the count is not 1, because to_number raises an error when it gets the string 'X' instead of a number. Another interesting condition is sysdate - finish_date < 5/24/60. Oracle date arithmetic works in days, so 5/24/60 is five minutes: the difference between the current time and the finish time must be less than five minutes. Why five minutes? Because I am going to schedule this query to run every five minutes, and I want it to succeed only once per day - I don't want to refresh my extract multiple times.

     

    The next step is to create the batch file for tabcmd. In addition, I would like to get an email when the refresh has finished. For this purpose, I will use a free command-line email tool that is already used by Tableau users - Febooti Command Line Email.

     

    Here is an example of the tabcmd commands and the email command:

    rem log the run date and time
    echo %DATE%
    echo %TIME%
    rem build a timestamp string from the locale-dependent %date% and %time% values
    set datetimef=%date:~-4%_%date:~4,2%_%date:~7,2%_%time:~0,2%_%time:~3,2%

    rem log in to Tableau Server
    "D:\Tableau Server\9.3\bin\tabcmd" login -s https://localhost -u danoshin --password-file "C:\Users\danoshin\Desktop\hello.txt" --no-certcheck

    rem refresh the extract and wait until it finishes
    start "My Extract" /b /wait "D:\Tableau Server\9.3\bin\tabcmd" refreshextracts --datasource "My Extract" --project "My Project" --no-certcheck --synchronous

    rem send the notification email
    "D:\Program Files\Febooti Command line email\febootimail.exe" -SMTP my.smtp.com -PORT 25 -AUTH AUTO -USER myuser -PASS ******* -TO dmitry@gmail.com -FROM tableau_boss@tableau.com -SUBJECT "Woot Woot" -TEXT "Tableau Data Source My Extract is successfully refreshed. Good Vibes, dude!"
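    If you want the email to reflect the actual outcome, a minimal sketch (using the same example paths and names as above, and calling tabcmd directly so its exit code is easy to check) is to test the exit code before sending the success mail; the success mail from the example above would then go right after this block:

    rem run the refresh directly so the errorlevel holds the tabcmd exit code
    "D:\Tableau Server\9.3\bin\tabcmd" refreshextracts --datasource "My Extract" --project "My Project" --no-certcheck --synchronous
    if errorlevel 1 (
        rem the refresh failed - send an alert instead of the success mail
        "D:\Program Files\Febooti Command line email\febootimail.exe" -SMTP my.smtp.com -PORT 25 -AUTH AUTO -USER myuser -PASS ******* -TO dmitry@gmail.com -FROM tableau_boss@tableau.com -SUBJECT "Refresh FAILED" -TEXT "Tableau Data Source My Extract failed to refresh, check the logs."
        exit /b 1
    )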

     

    Finally, we need a way or tool to connect the SQL part and the batch script together and schedule them with Windows Task Scheduler. I used the Pentaho Data Integration (PDI) Community Edition. It does not require installation and works on Windows, Linux, and Mac; however, you should set environment variables such as JAVA_HOME and JRE_HOME. PDI has three components:


    • Spoon - the GUI
    • Kitchen - a command-line tool to run jobs (this is what I use)
    • Pan - a command-line tool to run transformations

    Using Spoon, we can quickly create a job with two steps:

    [Screenshot: the Pentaho job with two steps]

    Let's see what is inside:

    Execute SQL Script Step:

    [Screenshot: Execute SQL Script step settings]

    Execute Shell Script Step:

    [Screenshot: Execute Shell Script step settings]

    Spoon saves our job with a .kjb extension in XML format. Now we can create another batch file so that Windows Task Scheduler can run the job every 5 minutes:

     

    C:\Users\danoshin\Desktop\data-integration\kitchen.bat /file:C:\Users\danoshin\Desktop\jobs\extract.kjb /level:Basic
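    Kitchen needs Java, so if JAVA_HOME and JRE_HOME are not already set system-wide, the call can be wrapped in a small batch file. This is only a sketch - the Java path, the job path, and the log file name below are examples and have to be adjusted to your environment:

    rem run_extract_job.bat - example wrapper around Kitchen (adjust all paths)
    set JAVA_HOME=C:\Program Files\Java\jdk1.8.0
    set JRE_HOME=%JAVA_HOME%\jre
    rem run the job and append Kitchen's output to a log file
    call C:\Users\danoshin\Desktop\data-integration\kitchen.bat /file:C:\Users\danoshin\Desktop\jobs\extract.kjb /level:Basic >> C:\Users\danoshin\Desktop\jobs\extract_job.log 2>&1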

     

     

    The final step is to create a new task in Windows Task Scheduler:

    [Screenshots: creating the new task in Windows Task Scheduler]
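    If you prefer the command line to the Task Scheduler GUI, roughly the same task can be created with schtasks; the task name and the path to the wrapper batch file below are just examples:

    schtasks /create /tn "Refresh Tableau Extract" /tr "C:\Users\danoshin\Desktop\jobs\run_extract_job.bat" /sc minute /mo 5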

    That's all. Don't forget that this machine has to be running all the time; you can even schedule the job on the Tableau Server machine itself.

     

    For ETL tools with a GUI, you can simply add an extra step that runs tabcmd when the ETL load is finished.