3 Replies Latest reply on Mar 24, 2015 12:13 PM by Matt Coles

    Published data source - performance aversion

    Jeff Strauss

      I have recently discovered (via ?:record_performance=yes) that the "connect to datasource" and "query behavior" timings differ when running against an embedded extract vs. a published data source extract.  In fact, the view loads a lot faster with an embedded extract; I think the extract gets bundled in with the load of the workbook, so queries are essentially served from memory.  Can anybody confirm this?
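For anyone who wants to reproduce the comparison, the performance recorder is turned on by appending the parameter mentioned above to the view URL. A minimal sketch, where the server and view names are made-up placeholders, not from this thread:

```shell
# Hypothetical server and view names -- substitute your own.
base_url="https://tableau.example.com/views/SalesDashboard/Overview"

# Appending ?:record_performance=yes enables Tableau's performance
# recorder for that view load, so you can compare the timings of a
# workbook with an embedded extract against one using a published
# data source.
perf_url="${base_url}?:record_performance=yes"
echo "$perf_url"
```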

       

      Second question: is there a way to make a published data extract (extract once, shareable across all workbooks) act like a local extract?  I've tested creating an embedded extract from a published data source extract, and this does help, though it seems like an unnecessary extra step, since the tde files (whether embedded or published) are all saved under the dataengine folder on Server anyway.  Getting the published extract to behave like a local one would also save me the step of re-extracting from my DW.

        • 1. Re: Published data source - performance aversion
          Matt Coles

          We noticed the same performance difference on 8.3--it was particularly concerning to us because we have begun moving toward curated and documented published data sources. However, since we've been running 9.0 on our server, the performance difference we had noticed between embedded extracts and published data sources is now negligible on Server--which we are very happy about!

           

          Not sure I understand the second question. Yes, you can pull data from a published data source (whether or not it's backed by an extract) into a local extract in your workbook. That will get around the performance issue you mentioned, but it also duplicates the data on Server, as you noticed.

          • 2. Re: Published data source - performance aversion
            Jeff Strauss

            Matt, thanks for the insights.  I tested a simple case with Superstore data on 9.0 beta 9, and the difference does seem negligible, as you say.  That's really good news, as long as it holds up in practice with larger extracts.

             

            The gist of the second question is whether there is any way to take a published extract data source and make it act like an embedded extract.  From studying the performance recording, it looks like the embedded extract is loaded into memory as part of the workbook when a user requests a dashboard.  But the published data source seems to act more like a live database connection: it has to make a connection, read the metadata, and then query the columns it cares about.  From that perspective, the local embedded extract looks like it will be more efficient, i.e. faster--but perhaps I'm looking at it in the wrong light, and you can clarify?

            • 3. Re: Published data source - performance aversion
              Matt Coles

              I haven't looked at the difference in behavior that closely under the hood before, unfortunately. But I do know that if a TDE is involved, then you're going through the Data Engine process to query it. That means that your request still goes something like Browser->Apache gateway->VizQLServer->Data Engine->VizQLServer->Apache->Browser. The component you avoid if you keep your extract local to your workbook is Data Server, in between VizQLServer and Data Engine. (I'm sure it's more complicated than that in practice, but that's the simplified version of things). So there'd be a little less overhead, but it's probably not worth avoiding once you live in a 9.0 world.

               

              There are a bunch of other things I like to think about in terms of the differences between an extract-based data source and a local extract. This isn't intended to be comprehensive, but hopefully there are some nuggets of usefulness:

               

              1. If your workbook contains multiple extracts, they'll be refreshed serially when the refresh for the workbook is kicked off. If the connections were all to published data sources, the data sources could be refreshed in parallel. That might be helpful if they're being refreshed very frequently and take a long time to complete, or if you have a certain window they need to run in.
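To illustrate the parallel-refresh point in item 1, each published data source can be refreshed as its own job, for example via tabcmd's refreshextracts command. This is only a dry-run sketch under assumptions: the data source names are invented, and the commands are printed rather than executed.

```shell
# Hypothetical published data source names (not from this thread).
datasources="SalesDW InventoryDW ReturnsDW"

# Each published data source gets its own refresh job, so Server is
# free to run the jobs in parallel -- unlike multiple extracts embedded
# in one workbook, which refresh serially within a single job.
# Dry run: the tabcmd invocations are only printed here.
cmds=""
for ds in $datasources; do
  cmds="${cmds}tabcmd refreshextracts --datasource ${ds}
"
done
printf '%s' "$cmds"
```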

              2. For data that's really only useful to one workbook, or that you're still in the early phases of playing with, you can optimize the local extract by hiding the unused fields--a low-risk way to drastically reduce the data footprint of the extract, thereby decreasing Server's overall footprint and its CPU / memory / disk / network consumption, as well as the load on whatever back-end system the data is pulled from. With a published data source, you're generally going for a "whatever everyone finds useful and makes sense to have here" sort of resource, so it will inevitably be larger and consume more resources.

              3. Published data sources can make use of caching (especially in 9.0 with the external query cache), which can actually improve performance for workbooks based on them if they're frequently queried. You may not get the same benefit by keeping extracts local to the workbook.

              4. There is so much to be said for building a data source that has comments on its fields, a sensible folder structure, and a definitive owner responsible for its content. It's a great way to give new Tableau users easy access to a clean set of data so they can jump in and start analyzing, and it's a lot easier for the admin to address issues in a single data source than in 15 workbooks all pulling the same extract data from a given database.

              5. Consolidating onto shared data sources saves space, processing power, and so on.

               

              Hope at least some of that's helpful, Jeffrey!