As you might know, this is the hot topic of integrating ETL and Tableau in order make them work together. By default, we able to schedule Refresh of extract, subscriptions using default schedule. For example, we schedule ours have data extract to run at 6 AM in the morning because our DBA told to us that Data Warehouse (DW) is finished usually at 5.30 AM. We connected our extract with the Refresh Schedule.
This is the perfect world, where DW is working like a clock. Unfortunately, reality is different. Our DW could finish at 8AM, 10AM and etc. It could finish at 5.30 AM but with lots of errors and confused data. As a result, business users will waste their time without data or they even can build reports and send to the managers or maybe managers will get their report by subscriptions. To make the long story short, this is completely failed for the BI/DW team and this happened a lot.
I saw lot's of discussions how to do it, but I didn't saw detail step by step information. Let's fill this gap.
First of all, let's define what we want to get - we want to build simple mechanism using open source technology in order to refresh our extracts of any other operations, such as publish packaged workbook or pdf reports to the shared desired folder.
What do we need?
It will depend on the way what tool we use (GUI, scripts). For example, I will cover one of the most popular ways when ETL writes log to the database table.
There are required the following components:
- ETL logs in one of the DW tables
- The way or tool to join SQL query and TABCMD
- The way or tool to querying ETL tools all the time
Let's do this.
Firstly, you should understand ETL logs in order to find what row/rows represent finish of the particular process. It is simple to figure out. Look at your data source and ask ETL developer what ETL process are responsible for them. Usually, the FACT table is going after DIM tables.
For example, we have a table ETL_LOGS with all logs and we know that step of loading fact table calls 'FACT_TABLE'
select * from ETL_LOGS
and start_date > trunc(sysdate)
order by finish_date;
This query will return for me all rows that were logged at nightly load process for my fact table. Maybe in your case, it will be differently.
In addition, we can find the last row, that means for us the end of the process. For example:
|87346||FACT_TABLE||collect_statistics||23-SEP-2016 05:40:01||23-SEP-2016 05:45:07||958656009348298||0|
In other words, 05:45:07 is a perfect time to refresh extract or run other activities.
The next step is to upgrade our initial SQL query in order it failed all the time, till it meets the TRUE condition. My example is for Oracle but it can be adapted for any database:
to_number(decode(case when count(*)=1 then 1 else 0 end,0,'X',1)) --X will failed to_number function
start_date > trunc(sysdate) --get only actual data
and etl_action='collect statistics' --the last in FACT_TABLE.
and finish_date is not null --check that if finished
(select sum(case when error_rows is null then 0 else error_rows end) from dw.etl_log
where start_date > trunc(sysdate)
and etl_module = 'FACT_TABLE')=0 --this subquery will check that for FACT_TABLE there arent any errors
and sysdate-finish_date < 5/24/60; --it just tell us the diff between sysdate and finish date less than 5 minutes
I added comments for each row. The main idea of the query, that it will fail all the time if query returns not 1 because of function to_number that doesn't like string value as an argument. Another interesting condition is sysdate-finish_date < 5/24/60. It just means that difference between system date (current date time) and finish date should be less than 5 minutes. Why 5 minutes? Because I am going to schedule this query runs each 5 minutes and it could be successful only once per day. I don't want to refresh my extract multiple times.
The next step is to create the batch file for the tabcmd. In addition, I would like to get an email when refresh is finished. For this purpose, I will use free command line email tool that is already in used by tableau users - febooti command line email.
There is an example of tabcmd command and email command:
"D:\Tableau Server\9.3\bin\tabcmd" login -s https://localhost -u danoshin --password-file "C:\Users\danoshin\Desktop\hello.txt" --no-certcheck
start "My Extract" /b /wait "D:\Tableau Server\9.3\bin\tabcmd" refreshextracts --datasource "My Extract" --project "My Project" --no-certcheck --synchronous
"D:\Program Files\Febooti Command line email\febootimail.exe" -SMTP my.smtp.com -PORT 25 -AUTH AUTO -USER myuser -PASS ******* -TO email@example.com
-FROM firstname.lastname@example.org -SUBJECT Woot Woot -TEXT Tableau Data Source My Extract is successfully refreshed. Good Vibes, dude!
Finally, we need a way or tool to connect SQL part and Batch script together and schedule it with Windows Task Manager. I used Pentaho DI community edition. You don't need to install it and it works for Windows, Linux, and Mac. However, we should add parameters such as JAVA_HOME and JRE_HOME. It has 3 components:
It has 3 components:
- Spoon - GUI
- Kitchen - command line tool to run jobs (I used it)
- Pan - command line tool to run transformations
Using Spoon we can quickly create job with 2 steps:
Let's see what is inside:
Execute SQL Script Step:
Execute Shell Script Step:
Spoon will save our job with KJB extension in XML format. Now we can create another batch file in order to run our job each 5 minutes via Windows Task Manager:
C:\Users\danoshin\Desktop\data-integration\kitchen.bat /file:C:\Users\danoshin\Desktop\jobs\extract.kjb /level:Basic
The Final step is to create new task in Windows Task Manager:
That's all. Don't forget, that this machine should work all the time. You even can schedule it on the Tableau Server.
For the ETL tools with GUI, you can simply create the additional step to run tabcmd when ETL is finished.