10 Replies Latest reply on Oct 6, 2016 4:57 PM by Russell Christopher

    Server Level Data Sources: Best Practices?

    Michele Brooks

      Hi there Tableau Community,

       

      My team and I have spent months trying to fine-tune our Tableau Server to optimize its ability to store, access, and extract data for our BI users. We've tried every approach we can think of and discussed solutions with Tableau, our DBAs, and a Tableau/Vertica consultant. No one has given us consistent answers, so I'm turning to this community. Here are the approaches we've tried to get our Tableau-Vertica environment running smoothly:

       

      NOTE: We're working with massive amounts of data. Vertica views with 46+ billion rows are standard for our business. Per Tableau's advice, we created multiple, extensive ETL processes to aggregate these views into more digestible tables. This detracts from the "on-demand" data access we would like our users to have, and it has added 50+ new tables to our catalog. We cannot realistically create ETL processes to build tables for each distinct Tableau tool, because this would muddy up our database environment further.
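
      As a rough illustration of one of these aggregation steps (the schema, table, and column names below are invented for the example, not our real objects), the ETL boils down to something like:

        -- Illustrative sketch only: roll the huge fact view up to one row per
        -- merchant per week so Tableau never scans the 46B-row view directly.
        CREATE TABLE agg.spend_by_merchant_week AS
        SELECT
            merchant_id,
            DATE_TRUNC('week', ad_date) AS ad_weekstart,
            SUM(spend)                  AS total_spend,
            COUNT(*)                    AS fact_rows
        FROM warehouse.ad_spend_fact
        GROUP BY
            merchant_id,
            DATE_TRUNC('week', ad_date);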

       

      General Development Process:

      1. Create a development workbook and build the data source.
      2. Push the data source to the server.
      3. Connect to the server-level data source in the production workbook.
      4. Push the production workbook to the server when ready.

       

      Different Development Approaches:

      • LIVE Connections
        1. APPROACH: Tableau server level data source w/ live connection to Vertica; Calculated fields in server level data source
        2. APPROACH: Tableau server level data source w/ live connection to Vertica; Calculated fields in workbook connected to server level data source
      • EXTRACT Connections
        1. APPROACH: Tableau server level data source w/ extracted connection to Vertica on refresh schedule; Calculated fields in server level data source
        2. APPROACH: Tableau server level data source w/ extracted connection to Vertica on refresh schedule; Calculated fields in workbook connected to server level data source

       

      In all of these approaches we are seeing varying levels of issues on the Vertica side. Our Vertica environment has 80+ direct users, and our BI team is the access point for everyone else in the company, so we are trying hard to balance that load and make sure our Vertica system is not overloaded by the queries coming from Tableau. Tableau sends off a dozen queries for any one click in the LIVE workbooks. In the EXTRACT workbooks, Tableau still seems to send off queries to Vertica for the calculated fields and hold them in temp tables on the Tableau Server.
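
      For anyone who wants to watch this load from the Vertica side, a query along these lines against the v_monitor.query_requests system table shows the volume per minute (a sketch; it assumes Tableau connects through a dedicated service account, here called tableau_svc):

        -- Illustrative: count queries per minute issued by the Tableau login.
        SELECT
            DATE_TRUNC('minute', start_timestamp) AS query_minute,
            COUNT(*)                              AS query_count
        FROM v_monitor.query_requests
        WHERE user_name = 'tableau_svc'   -- assumed Tableau service account
        GROUP BY DATE_TRUNC('minute', start_timestamp)
        ORDER BY query_minute DESC
        LIMIT 60;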

       

      If anyone has any light to shed on the best practices for creating server level data sources, please share! Thanks!

        • 1. Re: Server Level Data Sources: Best Practices?
          Toby Erkson

          Michele Brooks wrote:

           

          ...

          Different Development Approaches:

          • ...
          • EXTRACT Connections
            1. APPROACH: Tableau server level data source w/ extracted connection to Vertica on refresh schedule; Calculated fields in server level data source
            2. APPROACH: Tableau server level data source w/ extracted connection to Vertica on refresh schedule; Calculated fields in workbook connected to server level data source

           

          ...In the EXTRACT workbooks, Tableau still seems to send off queries to Vertica for the calculated fields and hold them in temp tables on the Tableau Server...

          It sounds like your extract connections aren't 100% extract solutions, since you have the workbooks connect live to the data source for calculated fields, correct?  That doesn't make sense to me at all, but I want to confirm.

          • 2. Re: Server Level Data Sources: Best Practices?
            Michele Brooks

            Hi Toby,

             

            That's right. In terms of how they connect to Vertica, we generally create extracts on the Tableau Server, but then have live connections from our workbooks to those server-level extracts. This is mainly so we can control the refresh schedules of the data sources in one place and have them apply to every workbook that relies on a given data source, rather than setting extract schedules per workbook or creating duplicate data sources for each workbook that connects to a server-level data source.

             

            Michele

            • 3. Re: Server Level Data Sources: Best Practices?
              jegan.sivaraj

              We had a few instances of Tableau queries locking up production databases. One of the solutions we discussed was a read-only mirror DB for reports. In the meantime, we have optimized our extracts in several ways:

              1. Reduce the number of refresh schedules

              2. Make sure calculations are not part of the extract. This may reduce performance in the workbook, but it is a huge gain in extract refresh time, especially if there are string/date calcs.

              3. Never schedule workbooks and always connect to the server instance of the extract.

              4. Hide unused fields before creating the extract.
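
              For #4, one way to enforce the narrow field list at the source (a sketch with made-up schema and column names) is to point the extract at a view that exposes only what the workbook actually uses:

                -- Illustrative: the extract connects to this view instead of the
                -- wide fact table, so unused columns never reach the extract.
                CREATE VIEW reports.sales_for_extract AS
                SELECT
                    order_date,
                    region,
                    product_id,
                    amount
                FROM warehouse.sales_fact;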

               

              So far this seems to work for us. Like Toby, I'm confused about why your extract workbooks would send queries to the DB. I'll be interested to learn more about that.

              • 4. Re: Server Level Data Sources: Best Practices?
                Toby Erkson

                Jegan Sivaraj & Michele Brooks, let me get this straight:  You have workbooks that use an extract AND a live connection, both pointing to the same data, in order to keep calculated fields out of the extract.  Correct?

                • 5. Re: Server Level Data Sources: Best Practices?
                  jegan.sivaraj

                  I don't have a live connection, just the extract.

                  • 6. Re: Server Level Data Sources: Best Practices?
                    Russell Christopher

                    (Spelling and grammatical errors ahead... this will take a while and I'm not going to stop for grammar.)

                     

                    Hi Michele

                     

                    At the highest level what you're trying to do is...well, it's just hard.

                     

                    If you want to think of this in terms of a cold - warm - hot strategy where...

                     

                    • Cold = Data lake; everything and anything. The ability to ask ANY question about unknown things.
                    • Warm = Big data warehouses (Vertica) and data marts, so you can ask "known" questions and get "unknown" answers.
                    • Hot = In-memory / acceleration layers (extracts, Vertica, OLAP-like technologies) and aggregates; the ability to ask known questions and retrieve known answers.

                     

                    ...it sounds like you want to cover both "warm" and "hot". Because of technologies like the Tableau Data Extract and the like, users have gotten... (well, I'll just say it)... a bit spoiled. It really is sort of magic that you can create an extract against many/most data sources and then just "get to work" with a fast, reliable data source. But at the scale you're talking about, it's just not that simple (yet... there are lots of smart people trying to figure this out right now across lots of companies).

                     

                    At scale, "fire and forget" (a single extract) just won't cut it. The approaches you're considering (various flavors of aggregates: Aggregate Tables, Extracts, and/or "new and improved OLAP-ish stuff like AtScale) are what people generally do....and none of these approaches are particularly clean, and all of them require much more expertise than just creating an extract against that relatively small 100M row fact table you used to worry about.

                     

                    You didn't come right out and ask this, but the tone of your question makes me think you may be wondering, "Am I missing something? This should be easier." I don't think you're missing anything, and it's not easy (yet). Sorry. Hard truth.

                     

                    A couple of thoughts on your possible approaches: Physical aggregate tables in Vertica will totally work, but they are labor intensive. The folks who own Vertica tend to be a different group than the Tableau Server people, so getting work done can also be more difficult from an organizational perspective. From a pure "I want to stay in control of as much of this process as possible" angle, extracts may be a better approach IF you can limit the columns you bring back to the bare minimum and always aggregate to a fairly high level. You will not be successful trying to create a 200+ column, 4B-row extract... You will do OK with a 50-column, 250M-row extract (if your batch window is big enough to allow the extract to occur).

                     

                    It is not unusual for an org to create 3-4 different extracts of the same data, each at a different level of granularity. These are then used together by viz authors to drive charts at varying levels of granularity in a dashboard, tied together with action filters. The same thing can be done with live, in-database aggregate tables as well: multiple data sources pointing at different agg tables, etc. Some folks will MIX the two approaches, using mainly aggregated extracts and then, 4-5 levels down, a single LIVE connection to the data source which shows leaf-level rows for the occasional foray into "root cause analysis". Any query against this data source will be heavily filtered by the time a user arrives through 4-5 levels of action filters.
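
                    To make that concrete, the aggregate tables behind such a set of data sources might look like this (a sketch; the names and grains are invented, pick grains that match your dashboards' drill path):

                      -- Illustrative: the same fact at two grains, each backing
                      -- its own published data source.
                      CREATE TABLE agg.spend_by_advertiser_year AS
                      SELECT advertiser_id,
                             DATE_TRUNC('year', ad_date) AS ad_year,
                             SUM(spend) AS total_spend
                      FROM warehouse.ad_spend_fact
                      GROUP BY advertiser_id, DATE_TRUNC('year', ad_date);

                      CREATE TABLE agg.spend_by_merchant_week AS
                      SELECT advertiser_id,
                             merchant_id,
                             DATE_TRUNC('week', ad_date) AS ad_weekstart,
                             SUM(spend) AS total_spend
                      FROM warehouse.ad_spend_fact
                      GROUP BY advertiser_id, merchant_id, DATE_TRUNC('week', ad_date);

                      -- The leaf level stays a LIVE connection, reached only
                      -- through several layers of action filters.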

                     

                    (And you wonder: doesn't this make authoring much more difficult? Yep, it certainly can. This authoring approach is more "hot" than "warm", and certainly not "cold".)

                     

                    You mentioned Tableau can thrash your server with many parallel connections:

                     

                    • First, this means you probably want to take a look at the complexity of your dashboards. It sounds like some of them may just be too "busy". Simplicity is good.
                    • You can control the maximum number of connections Tableau will open against a single data source to avoid this. You can do so for both Desktop and Server, but it's easiest through Server because you only have to make the change in one place. Consider forcing all Tableau access to Vertica through a Tableau Server Data Server data source and then changing the max # of connections. This will make your users unhappy, of course, because they'll have to wait longer. See: Configuring Parallel Queries in Tableau Server | Tableau Software

                     

                    You also mention that Tableau seems to be hitting Vertica even when extracts are in use. It shouldn't. I'd take some time to track down the offending workbook and explore it deeply; I suspect you'll find a mix of extract and live data sources. Once a data source has been flipped to "extract" mode, it doesn't touch the underlying data source unless refreshed.

                     

                    Hope this helps a little - if nothing else, take solace that you're on the right path...

                    2 of 2 people found this helpful
                    • 7. Re: Server Level Data Sources: Best Practices?
                      Damien Lesage

                      Hello,

                       

                      Just to add a few points that might help (though I agree with Russell, it's just not that simple).

                       

                      A bit of context:

                      - We use Tableau (200 publishers + 800 daily users on the server) and Vertica (billions of rows in a single table).

                      - Vertica is queried by Tableau Desktop and Tableau Server, and all our analysts can also run ad-hoc queries.

                      - We have almost no incidents with Vertica (and I don't think any of them were directly related to a Tableau query).

                      - There is a single team managing both Tableau Server and Vertica.

                       

                      What we do:

                      - Half our connections to Vertica are live, half are extracts (no more than 50 million rows, most far smaller). We found that mixing the two brings the most value.

                      - We have a dedicated resource pool in Vertica, mainly to limit query execution time and the resources queries are allowed to use (https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/SQLReferenceManual/SystemTables/CATALOG/RESOURCE_POOLS.htm). This blocks a lot of bad queries that could endanger our Vertica cluster (see the sketch after this list).

                      - When possible, we tell our users to work in Tableau Desktop with automatic refresh deactivated, or to generate a very small extract so they can work offline.

                      - We have developed a tool that allows our analysts to create their own aggregate tables in Vertica.
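
                      For illustration, a pool along these lines caps both runtime and concurrency (the values are made up, not our production settings; see the resource pool docs linked above):

                        -- Illustrative: a dedicated pool so runaway Tableau queries
                        -- can't starve the rest of the cluster.
                        CREATE RESOURCE POOL tableau_pool
                            MEMORYSIZE '4G'         -- memory guaranteed to the pool
                            MAXMEMORYSIZE '16G'     -- ceiling the pool may borrow up to
                            RUNTIMECAP '5 minutes'  -- kill queries running longer than this
                            MAXCONCURRENCY 20       -- max queries executing at once
                            QUEUETIMEOUT 300;       -- seconds a query may queue before erroring

                        -- Route the Tableau service account (an assumed name) through the pool.
                        ALTER USER tableau_svc RESOURCE POOL tableau_pool;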

                       

                      It's not perfect but we have managed to have a stable environment for more than a year. We are still improving though. Whatever you do, you should expect continuous effort to keep your environment alive and well.

                      • 8. Re: Server Level Data Sources: Best Practices?
                        Michele Brooks

                        Thanks Jegan Sivaraj!

                        1) We limit these to update after the ETL processes I mentioned have run. We have a couple of cases where we really need live connections, but settled for extracts because the live versions were sending off 700+ queries per day at the volume we were seeing. These extracts update hourly.

                        2) Thank you! This is the question I've been trying to get answered by Tableau and by consultants, and no one could give a straight answer. They would say we should really use live connections, but could never say how to fix the queries coming through on Vertica that looked like a mess of SQL from the calculated fields.

                        3) Done

                        4) Done

                         

                        We started seeing the calculations coming through as queries and stored in temp tables on the Tableau Server once we optimized the extracts and pushed them to the server. This was with the extracted data sources that keep the calculations in the data source and use strictly Tableau join architecture (no custom SQL). The optimization cleaned up the extract query so that it didn't send the calculations through while refreshing the extract, but Tableau was still sending them when the tool was accessed on the server. Here is an example of what we were seeing come through on the Vertica side for those data sources (Russell Christopher, you may find this interesting as well):

                        SELECT t3.__measure__1 AS temp_calculation1__3120847275__0_,
                               t6.__measure__3 AS temp_calculation1__331530080__0_,
                               t0.temp_attr_ad_weekstart_ok__1360203883__0_ AS temp_attr_ad_weekstart_ok__1360203883__0_,
                               t0.temp_attr_ad_weekstart_ok__2345850288__0_ AS temp_attr_ad_weekstart_ok__2345850288__0_,
                               t0.temp_attr_calculation_160159319692910592_nk__3808161248__0_ AS temp_attr_calculation_160159319692910592_nk__3808161248__0_,
                               t0.temp_attr_calculation_160159319692910592_nk__593359778__0_ AS temp_attr_calculation_160159319692910592_nk__593359778__0_,
                               t0.temp_attr_holdingcompanyname_nk__3010321490__0_ AS temp_attr_holdingcompanyname_nk__3010321490__0_,
                               t0.temp_attr_holdingcompanyname_nk__3997540029__0_ AS temp_attr_holdingcompanyname_nk__3997540029__0_,
                               t0.sum_spend_ok AS sum_spend_ok,
                               t0.wk_ad_weekstart_ok AS wk_ad_weekstart_ok,
                               t0.yr_ad_weekstart_ok AS yr_ad_weekstart_ok
                        FROM ( SELECT DATE_PART('WEEK', merchantcleaning_wks_tab_ext_calc.AD_WeekStart) AS wk_ad_weekstart_ok,
                               DATE_PART('YEAR', merchantcleaning_wks_tab_ext_calc.AD_WeekStart) AS yr_ad_weekstart_ok,

                         

                         

                        Toby Erkson:

                        We have created multiple versions of the same data source, live and extracted, using both Tableau join architecture and custom SQL, in order to test performance for each possible combination. We don't have any user-facing workbooks that connect to multiple forms of the same data source.

                         

                         

                        Russell Christopher

                        Thank you for the extensive answer! You are spot on with our use cases and frustrations. Glad to know that we aren't missing the obvious.

                         

                        I've passed your answers on to my team and we'll start exploring these options. Your recommendations to create multiple extracts at varying levels may help us in several workbooks and help us to gain better control of our filtering actions.

                         

                        As far as the parallel connections go, I believe our settings only allow two connections at a time. And while the 'junk' queries that Tableau is sending over aren't a huge load, they do take up those channels, and they make wrangling our Tableau connections in Vertica a huge mess when you dig into it.

                         

                        The extracts hitting Vertica is as baffling to me as it is to you. I've included above the scenario in which it starts to hit Vertica again and a sample of the query passing through when it does. It only started happening once we optimized the extracts before pushing them to the server, and it only seems to happen when using Tableau's architecture for joining tables. We have a custom SQL version of the same data source to test performance between custom SQL and Tableau joins, and it does not send the mess above for custom SQL extracts with calculations in the server-level data source.

                         

                         

                        Damien Lesage

                        Your success story gives us some light at the end of the tunnel. We will certainly start trying to implement some of your limits for extracts and discuss the resource pool approach with our DBAs. Currently, our BI users have their own resource pool covering both SSRS and Tableau connections to Vertica, but perhaps we need to explore dividing those resource pools further for easier management.

                         

                        Generally, our analysts aren't using any live connections or server-level extracts. They use either SQL or SAS to create their own tables in a temporary Vertica schema. When their workbooks reach a stage that needs server-level, automated resources, our BI team generally takes over to optimize the data and workbooks. So it sounds like we've got that approach in line with yours.

                         

                         

                         

                        Thank you, all of you, for your responses! This thread has been incredibly informative. I hope it helps other users struggling to establish a healthy Tableau-Vertica environment.

                        • 9. Re: Server Level Data Sources: Best Practices?
                          Toby Erkson

                          Remember that Custom SQL with a live connection will be a performance hit, typically a big one.

                          1 of 1 people found this helpful
                          • 10. Re: Server Level Data Sources: Best Practices?
                            Russell Christopher

                            Hey Michele -

                             

                            The behavior you're seeing vis-a-vis "live" queries being sent to your database isn't correct. Could you open a support case? The fine folks in support will probably need to get some logs which capture:

                             

                            • Server refreshing the extract
                            • A user interacting (on Server) with one of the vizzes this extract serves after the refresh, so that the "live" queries you're seeing get generated. We should be able to see this in the logs.

                             

                            Quick question: have you ruled out these queries coming from a random Desktop somewhere (does Vertica log the hostname/device connecting? I don't know) and/or a "live" data source which exists on the Server that you don't know about?
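
                            For what it's worth, a check on the Vertica side might look something like this (a sketch assuming the v_monitor.sessions system table and its documented client_hostname / client_label columns):

                              -- Illustrative: list sessions with the machine each one
                              -- connected from, to rule out a stray Tableau Desktop.
                              SELECT user_name,
                                     client_hostname,
                                     client_label,
                                     statement_start,
                                     current_statement
                              FROM v_monitor.sessions
                              ORDER BY statement_start DESC;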