thanks. I had not seen it until you sent it, but it's remarkable how similar it is to mine in terms of dynamically setting the column defs. See attached for what I did; someday I'll get around to blogging about it too, maybe once I have it fully working. The Python script works for creating the tde. The part I'm stuck on now is appending the tde onto an existing tde that was created with Tableau Desktop. It seems the metadata has to match exactly, so if the Tableau Desktop extract has calcs added or fields renamed, then the calcs must be part of the tde being appended, too. Well, back to the drawing board. Also, cursors seem too slow.
the main script is testscript.py, which is 100% generic. config_sql_bdi_call.py is then a passthrough to whatever SQL query, such as config_contentkeyword_base.sql
P.S. Congrats on being named Zen Master, is it new for you?
Python_dataextract_api.zip 3.1 KB
Yes, if you'd like to move existing workbooks to tde, that's pretty challenging. The best option is to build the tde first, then put the data source on top of it and add the calculations at the workbook level.
What is your use case here, why do you want to process SQL into a TDE instead of using Server?
Regarding Python and SQL/cursor performance: first of all, you should always use fetchmany with around 1000-5000 records instead of fetchone. Fetching records one by one is a waste of resources: it causes tons of context switches in the background. Second, this is a typical producer-consumer pattern, so you should consider implementing threading with async IO/message passing, or using Python coroutines. I'm not a huge Python fan, but as far as I know coroutines are supported natively.
Due to Python's GIL (global interpreter lock) it will never be a good language for high-speed data loads, but with an increased fetch size it should be better.
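The batched fetch looks roughly like this. A minimal sketch: sqlite3 stands in here for your GP connection, but fetchmany is the same DB-API call your driver exposes, so the loop shape carries over unchanged.

```python
import sqlite3

def iter_batches(cursor, batch_size=5000):
    """Yield lists of rows, batch_size at a time.

    One fetch call per batch instead of one per row -- that is the
    whole point of fetchmany over fetchone.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows

# Demo with an in-memory SQLite table standing in for a GP query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, "row%d" % i) for i in range(12)])
cur = conn.execute("SELECT id, val FROM t ORDER BY id")

total = 0
for batch in iter_batches(cur, batch_size=5):
    total += len(batch)   # a real loop would write the batch to the tde here
print(total)  # 12 rows, fetched in batches of 5, 5, 2
```

The producer-consumer split would then put `iter_batches` on one thread and the tde writer on another, with a queue in between.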
It's interesting to see that you are using GPDB as well. I'm curious about your experiences, and I can share mine as well (maybe in Vegas). And finally, thank you; yes, this Zen stuff is new for me.
our warehouse is on GP; not my first choice, but also not my decision.
I'm no expert in Python, so I don't know the full array of options available. But thanks for the tip about fetchmany; I'll try it and look up how to then write each row out to the tde.
In any case, I have authored a few designs to create a fully automated "back-dated incremental" refresh process. The latest design concept is to save a base copy of the data as of 90+ days ago and use this as the starting point each night for the extract. Then, on the 5th of each month, an append adds a new month onto the base extract.
The value in this is for datasets with late-arriving period start dates, where we are currently required to do a full extract refresh (150 million rows). We've looked at triggering the incremental off of create_ts, but if data is updated and gets a new create_ts, it will be duplicated in the extract. So wouldn't it be great if we could start with a static copy of the data (e.g., as of 90 days ago) and then increment off of that? I have a lightweight architecture blueprint for this that I'd be happy to share.
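In sketch form, the nightly decision that design implies looks something like this. The step names, the 90-day window, and the 5th-of-month trigger are just illustrations of the concept, not working code from the actual process:

```python
import datetime

def plan_refresh(today, base_age_days=90, append_day=5):
    """Return tonight's steps under the back-dated incremental design:
    always start from the static base copy, re-append everything newer
    than the base cutoff, and roll a finished month into the base once
    a month, after it is safely past the late-arriving-data window."""
    base_cutoff = today - datetime.timedelta(days=base_age_days)
    steps = [
        "restore_base_copy",  # static data as of base_cutoff
        "append_rows_since(%s)" % base_cutoff.isoformat(),
    ]
    if today.day == append_day:
        steps.append("roll_oldest_month_into_base")
    return steps

print(plan_refresh(datetime.date(2015, 10, 5)))
```

The point is that only the recent window is ever rebuilt; the 150-million-row full refresh disappears because the base never changes between monthly rolls.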
I wouldn't use the extract API for this particular use case. It's easier if you manage different extracts using Tableau Server and download/alter/refresh/rename them with the REST API.
For example, you can have a scheduled refresh every month for your base for this limited time period (using parameters or hard-coded relative filter values).
Every day you start the processing with:
- Copy this "base" data source as the "daily" one, overwriting the old data
- Trigger a refresh
There are other tricky ways to hide the interim period when the old data source has been republished but not yet refreshed; you can use automated Desktop refreshes or a temporary data source.
So I don't see any challenges with your requirements; it's simply a question of a few lines of script.
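Sketched as a sequence of REST calls, those "few lines" are roughly the following. The server URL, site id, and datasource id are placeholders, and the endpoint paths assume the Tableau Server REST API (version 2.0 here): sign in, download the "base" datasource content, and publish it back under the "daily" name with overwrite, before triggering the refresh.

```python
# Placeholder server; swap in your own.
API = "https://tableau.example.com/api/2.0"

def copy_base_as_daily_calls(site_id, base_ds_id):
    """Return the (method, URL) sequence for the copy-base-as-daily
    trick: sign in, download the base datasource, republish it as the
    daily one (overwrite=true replaces the old daily copy)."""
    return [
        ("POST", API + "/auth/signin"),
        ("GET",  API + "/sites/%s/datasources/%s/content" % (site_id, base_ds_id)),
        ("POST", API + "/sites/%s/datasources?overwrite=true" % site_id),
    ]

for method, url in copy_base_as_daily_calls("site-1", "ds-base"):
    print(method, url)
```

After the publish you trigger the refresh as usual (e.g. tabcmd refreshextract, or a server schedule on the daily datasource).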
Instead of struggling with spaghetti Python scripts, why don't you use a low-cost/free ETL tool?
thanks for your suggestions. What you describe (copying the base data source) was exactly the path I was headed down, but then I ran into issues fully automating it: I could not pass a parameter into tabcmd refreshextract to indicate whether it's base, incremental, or yearly/quarterly rolled-up time-period data. So I was looking at the data extract API as one alternative, to avoid having to download the data source, alter (hack) the parameter, and then re-upload it. I have a few ideas, though, that will let me revert to the original solution and keep it automated.
do you happen to have an example of a fetchone/fetchmany loop? I'm thinking I want a fetchone at the very beginning so that I can read the column types from GP to set up the extract, and after that the fetchmany can get the rest.
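Roughly the shape I'm picturing is below. If I read the DB-API right, cursor.description should already carry the column names and type codes after execute, without any initial fetch; sqlite3 stands in for the GP cursor here, and write_row is a hypothetical placeholder for the tde insert:

```python
import sqlite3

def export(cursor, write_row):
    """Build column metadata from cursor.description, then stream the
    rows out in batches; write_row stands in for the tde row insert."""
    # description is a sequence of 7-tuples per column: [0] is the
    # column name, [1] is the driver's type code (psycopg2 fills this
    # in for GP; sqlite3 leaves it as None).
    columns = [(d[0], d[1]) for d in cursor.description]
    while True:
        batch = cursor.fetchmany(2000)
        if not batch:
            break
        for row in batch:      # traverse the columns of each row here
            write_row(row)
    return columns

conn = sqlite3.connect(":memory:")
cur = conn.execute("SELECT 1 AS id, 'a' AS name")
out = []
cols = export(cur, out.append)
print([c[0] for c in cols])  # ['id', 'name']
print(out)                   # [(1, 'a')]
```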
Sure, I will prepare something in the next few days. With data15 coming up, it's not so easy to find time for anything.
thanks, let me know if you want to meet for 20 minutes at TC15 and we can work on it together. I tried doing it myself, but I get errors when trying to traverse the columns within the cursor.fetchmany loop.