3-step script solution for fast Google BigQuery Extracts

Version 3

    This is a follow-up to my previous post: High Throughput Google BigQuery extracts with Simba JDBC driver

     

    Here you will find an alternative high-performance approach to refreshing Tableau Hyper extracts from Google BigQuery datasets, based purely on Google and Tableau APIs (and Python), thereby completely bypassing Tableau's native and JDBC connectors. By leveraging Google's APIs, we can extract data from BigQuery and transfer it to the VM where Tableau Server lives at very high speeds. It also allows extracts to be taken using Google Service accounts.

     

    As an example, in my tests a refresh job of 4.7GB of BigQuery data (34M rows, 20 cols) takes around 41 minutes with Tableau's native connector vs 9 minutes with this approach. That's a whopping 4.5x improvement!

     

    As you will see, this is quite an extensive guide with many moving pieces, so it's only recommended if you really need to generate huge extracts and can't bear waiting for them to complete with the native connector.

     

    The solution works in three steps:

     

    1 - Extract data from Google BigQuery (as CSVs) and store it in a Google Cloud Storage bucket

    2 - Transfer these CSV files over the air to the local hard drive where Tableau Server lives.

    3 - Kick off an Extract Refresh job off the local CSVs

     

    The main bottleneck here is step 2, as we will be transferring gigabytes of data over the wire and storing it on disk. In my tests, I was able to transfer and save 11 CSV files totalling 5GB of data from a Google datacenter in the US to an AWS VM also hosted in the US in less than 3 minutes (avg. transfer speeds over 500Mbps). As we are going to be storing huge amounts of data on the local disk drives, write speeds play a crucial role here, so make sure your disks can sustain fast write rates.

     

    Note: This solution has only been tested on Windows, but I believe it works on Linux as well. Please let me know otherwise!

    Note 2: Feel free to tweak this code to your taste. By no means is this best practice!

     

    To run this solution, follow these steps:

     

    First things first: setting up the environment:

     

    • All code is implemented in Python 3.7+ (32-bit)
      • Once you have Python installed, install the Python packages the scripts below depend on (the command to install a package is: pip install <package-name>)
    • Install and initialise gsutil tool
      • This is needed for downloading data from Bucket to local hard drive from the Tableau Server VM.
      • Note: Although we could use the Google Cloud client libraries to interact with Cloud Storage directly from Python, the only way to make full use of the available network bandwidth and disk throughput is the multi-threaded copy feature of the gsutil tool, which is not (yet) available through the Python client. Instead, we will call the gsutil command-line utility from within our Python code. Again, that's why fast network and disk speeds are critical; otherwise this solution will not yield the expected performance benefits.
      • Step-by-step install guide: Install gsutil | Cloud Storage | Google Cloud
      • After installation, initialise gsutil and authenticate Google Cloud SDK, as per these instructions: https://cloud.google.com/storage/docs/gsutil_install#authenticate

     

    • Setting up the Service accounts needed for extracting data from BigQuery and storing it in a Cloud Storage bucket:

    Screenshot 2019-06-26 at 15.05.04.png

     

     

    Configuring the source code for your environment:

     

    • Update the configuration variables in the following Python files:

     

    1. File 1_export_data.py

    # --- START YOUR CONFIGURATIONS HERE --- #
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="<your-private-key-file.json>"
    
    project = "<your-google-project-id>"
    dataset_id = "<your-dataset-id>"
    table_id = "<your-table-id>"
    bucket_name = "<your-google-cloud-storage-bucket>" #the extracted files will go here
    filename_pattern = "<exported_files-*.csv>"
    # --- END YOUR CONFIGURATIONS HERE --- #
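
    This configuration block sits at the top of the export script. For context, here is a minimal sketch of how such an export step can be implemented with the google-cloud-bigquery client, under the assumption that the script uses BigQuery's extract-job mechanism (the actual file may differ in its details):

    # Sketch of the export step: run a BigQuery extract job that writes the table
    # out as sharded CSV files into the Cloud Storage bucket.
    import os
    from google.cloud import bigquery

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "<your-private-key-file.json>"

    project = "<your-google-project-id>"
    dataset_id = "<your-dataset-id>"
    table_id = "<your-table-id>"
    bucket_name = "<your-google-cloud-storage-bucket>"
    filename_pattern = "exported_files-*.csv"  # the "*" lets BigQuery shard large exports into several files

    client = bigquery.Client(project=project)
    source_table = "{}.{}.{}".format(project, dataset_id, table_id)
    destination_uri = "gs://{}/{}".format(bucket_name, filename_pattern)

    # Kick off the extract job (CSV is the default output format) and wait for it to finish.
    extract_job = client.extract_table(source_table, destination_uri, location="US")  # match your dataset's location
    extract_job.result()
    print("Exported {} to {}".format(source_table, destination_uri))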
    

     

    2. File download_bucket_files.py

    # --- START YOUR CONFIGURATIONS HERE --- #
    bucket_name = '<your-google-cloud-storage-bucket>'
    file_format = 'csv'
    destination_path = '<path-to-your-local-folder>' #local path where you want to save the files downloaded
    # --- END YOUR CONFIGURATIONS HERE --- #
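
    As with the export script, this is only the configuration block. Below is a minimal sketch of the download step itself, assuming (as discussed above) that it shells out to gsutil so the copy runs with the multi-threaded -m flag:

    # Sketch of the download step. Assumes gsutil is installed, on the PATH and
    # already authenticated against your Google Cloud project.
    import subprocess

    bucket_name = '<your-google-cloud-storage-bucket>'
    file_format = 'csv'
    destination_path = '<path-to-your-local-folder>'

    # -m runs the copy with multiple threads/processes, which is what lets this
    # step saturate the network link and the local disks.
    command = [
        "gsutil", "-m", "cp",
        "gs://{}/*.{}".format(bucket_name, file_format),
        destination_path,
    ]
    # Note: on Windows you may need shell=True (gsutil is installed as gsutil.cmd).
    subprocess.run(command, check=True)  # raises an error if the copy fails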
    

     

    3. File 3_refreshExtractByName.py - No changes needed to this one!
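
    For the curious, here is a hypothetical sketch of what a "refresh extract by data source name" script can look like when built on the tableauserverclient library: it signs in with the command-line arguments, finds the published data source by name, and runs the extract refresh task attached to it (which is why the "dummy" schedule described later in this guide is needed). The original script may implement this differently:

    # Hypothetical sketch of refreshExtractByName.py using tableauserverclient (TSC).
    import argparse
    import tableauserverclient as TSC

    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--server", required=True)
    parser.add_argument("-u", "--username", required=True)
    parser.add_argument("-pw", "--password", required=True)
    parser.add_argument("-dn", "--datasource-name", required=True)
    args = parser.parse_args()

    server = TSC.Server(args.server, use_server_version=True)
    with server.auth.sign_in(TSC.TableauAuth(args.username, args.password)):
        # Find the published data source by its (unique) name.
        options = TSC.RequestOptions()
        options.filter.add(TSC.Filter(TSC.RequestOptions.Field.Name,
                                      TSC.RequestOptions.Operator.Equals,
                                      args.datasource_name))
        matches, _ = server.datasources.get(options)
        if not matches:
            raise SystemExit("No data source named " + args.datasource_name)
        datasource = matches[0]

        # Find the extract refresh task targeting this data source and run it now.
        tasks, _ = server.tasks.get()
        task = next(t for t in tasks if t.target and t.target.id == datasource.id)
        server.tasks.run(task)
        print("Extract refresh triggered for " + datasource.name)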

     

    4. File RunThis_1_2_3_in_one.py - update the command below with your Tableau Server details and, later, the name of the published data source (see step 6 further down):

     

    os.system("python refreshExtractByName.py -s <your_Tableau_Server_URL_here> -u <your_admin_username> -pw <your_admin_password> -dn \"<your_unique_datasource_name_here>\" ")
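
    The rest of this wrapper is trivial: it just runs the three scripts in sequence. A sketch of what the whole file can look like (the script file names are assumed to match the ones used in this guide, so adjust them to your copies):

    # Sketch of RunThis_1_2_3_in_one.py: run steps 1, 2 and 3 back to back.
    import os

    # Step 1: export the BigQuery table as CSV files into the Cloud Storage bucket.
    os.system("python 1_export_data.py")

    # Step 2: download the CSV files to the local drive where Tableau Server lives.
    os.system("python 2_download_bucket_files.py")

    # Step 3: trigger the extract refresh job for the published data source.
    os.system("python refreshExtractByName.py -s <your_Tableau_Server_URL_here> "
              "-u <your_admin_username> -pw <your_admin_password> "
              "-dn \"<your_unique_datasource_name_here>\"")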
    

     

     

     

    If you are still with me at this point, you may be wondering how we are going to run an Extract Refresh job for a Tableau data source that doesn't yet exist on Tableau Server. One approach would be to use the Extract API 2.0 to generate this extract programmatically, but I will instead do it manually once with Tableau Desktop. In the near future, Tableau will release a new version of this API (the so-called Hyper File API), so I'll wait for that. For now, we only need to run scripts 1_export_data.py and download_bucket_files.py manually once to get a first batch of CSV files. Then we use Tableau Desktop to create the data source that points to these CSV files, publish it to Tableau Server manually and create a refresh job schedule for it. Once that has been done, any following runs are fully automated by my code, which in turn can be scheduled (e.g. via a cron job or Windows Task Scheduler). More details on this below:

     

    One-time manual run to create the Tableau Datasource:

     

    1. Manually run 1_export_data.py

    Note: This code has been designed to always overwrite existing files!

    Screenshot 2019-06-27 at 15.17.13.png

    Now we have several files created on our Google Cloud Storage Bucket:

    Screenshot 2019-06-27 at 15.19.49.png

     

    2. Manually run 2_download_bucket_files.py

    Screenshot 2019-06-27 at 16.36.46.png

     

    3. Create the datasource from these CSV files in Tableau Desktop:

     

    Use the union function and add a pattern, so that any new CSV files matching that pattern are automatically picked up when the extract is refreshed:

    Screenshot 2019-06-27 at 16.39.20.png

     

    4. Now generate an empty extract in Desktop and publish it to Tableau Server. Follow this guide for a step-by-step walkthrough: Easy empty local extracts | Tableau Software

     

    5. Create a "dummy" Extract Refresh Schedule in Tableau Server. We need this so that our code knows which extract refresh job to trigger.

    Screenshot 2019-06-27 at 16.47.38.png

     

     

    I like to call this schedule "Extract Refresh for API runs". The trick here is to keep this schedule "DISABLED": it should never run under Tableau's scheduling platform, only when triggered via our Python code. Here is how to disable the schedule:

     

    Screenshot 2019-06-27 at 16.52.16.png

     

    Add the Published Datasource to this schedule.

     

    Screenshot 2019-06-27 at 16.53.52.png

     

    6. Update our RunThis_1_2_3_in_one.py script so that it points to the name of the published data source. In my example, that would be "prices_paid_database (local CSV files with Union)".

     

    Great, now we have all the pieces of the puzzle in place and we are ready to run the script that ties steps 1, 2 and 3 together.

     

    --> python RunThis_1_2_3_in_one.py

     

    This script simply wraps the other three files, calling each one in turn, and then the magic happens.

     

    Screenshot 2019-06-27 at 17.27.52.png

     

    Done! Your data has now been extracted from BigQuery and transferred over the internet to your server, and Tableau is now refreshing the extract in the background.

     

    To see the status of Tableau Extract Refresh jobs, check the Admin view that ships out-of-the-box with Tableau Server:

     

    Screenshot 2019-06-28 at 09.14.00.png

     

     

     

    Here are some benchmark figures comparing this solution vs Tableau's native connector vs the other solution I presented here, which leverages the Simba JDBC driver with the High Throughput Storage API:

     

    Screenshot 2019-06-27 at 17.30.51.png

     

    As you can see, the method presented here is significantly faster than other options, but it comes at the expense of code/IT management costs.