1 Reply Latest reply on May 16, 2016 9:50 AM by John Sobczak

    Combing data from two "big" tables within Tableau

    Damian Lilla

      Hello dear Tableau community!

       

      SETUP

      We are trying to create dashboards where we need to combine data from multiple tables. The underlying database we use is HP Vertica. We have the following tables:

      • payments (an entry is created when the user buys the product's internal currency)
      • items_bought (one entry each time an item is bought with the product's internal currency)
      • logins (every time a user logs into the product he creates an entry here. We can aggregate this to the level that each user has one entry a day.)

       

      Example how the payment table would be defined (blue=dimensions, green=metrics):

      payments_table.jpg

      Example of logins table (pre-aggregated each user has only one entry per day, logins shows the count of logins for that user each day)

      logins_table.jpg

      Each time an user interacts with the product he will generate at least one login for that day but not necessarily will he do a payment or buy an item.

       

      We would like to provide a live connection to the database and show a history of about 4 months.The amount of data those tables would approximately have in a 4 month period is:

      • payments ~ 200.000
      • items_bought ~ 10.000.000
      • logins ~ 24.000.000

       

      GOAL

      Using the payments table as an example we need to have the following information from the logins table:

      • logins
      • unique logins (calculated within tableau based on count(pid))

      With that information we would like to create a simple graph of how many dollars the product earned per unique user for a particular event_date. So we need to be able to use columns from the logins table together with columns from the payments table in calculated fields. We would like to allow the user to filter by the dimensions registration_country, os, device etc. This should filter both tables payments and logins.

       

      APPROACHES/PROBLEMS

      • Data blending
        • We defined payments as our primary data source and blended with logins on pid, event_date, registration_country, os, device
        • Registration country has about 200 unique values, device about 1000 and os about 2000
        • We can use COUNT to count the unique logins as each user has one unique entry in the logins table for each event_date
        • The more fields we blended by the slower the query got and sometimes gave us an error saying query to complex. Also where payments is not that big the table for the items is even bigger than the logins table.
          • Checking the query that was sent we saw that tableau generated a very big where clause. It seems to us it filters the payments first, checks from which registration_countries, os' and devices we had payments and then puts the values that occur in a very long where clause for the logins query. It does that for every event_date that we use on the graph.
      • Parameters in the WHERE clause
        • We added the same parameters for the fields registration_country, os, device in both data sources and did a left join of logins to the payments by pid and event_date
        • The limitation that we ran into here were that we could not let the user choose "all" and he could only perform a single select (deal breaker for us)
      • Action filters / Dashboard filters
        • We kept payments and logins as separate data sources and created sheets listing registration_country, os, device. We then used them as filters for the other two data sources
        • In this setup however we can only filter the metrics from both data sources separately and can not combine them in calculated fields (deal breaker for us)

       

      SOLUTIONS/QUESTIONS

      So following these steps we have a couple of questions:

      • Is data blending a viable solution in this scenario? We found several people posting that data blending is mostly only for a quick adhoc analysis where the user needs to combine two data sources quickly (of which one is small) and where performance is not an issue.
      • Any possibility to have multi select parameters or any other approach we could take to filter both sources for registration_country, os, device and then do a join on pid, event_date with the already filtered data?

       

      The only solutions we now see are:

      1. Do a full outer join of both payments and logins on pid and event_date. In this case the payments will grow to at least the size of the logins table
      2. Pre-calculate the unique_logins, group them by all the possible combinations of registration_country, os, device. So create unique_logins_os, unique_logins_os_device, unique_logins_os_device_country etc. Then left join those values to the payments table.
        1. The problem here is that we then need to use parameters to switch between the different types of unique_logins that are needed in the calculations. So for instance if the user wants to see the payments per unique_logins by country then we need to use the unique_logins_country and if he wants to see them by country and os then switch to payments per unique_logins_country_os etc.
        2. It is less flexible as we need to pre-define the dimensions by which the unique logins need to be grouped by

       

      Are the any ideas/suggestion what the best approach would be to solve this in Tableau? Did we go wrong on any of the issues?

       

      I hope someone can help us a bit with this.

      Thank you and best regards,

      Damian