
    Statistics for customizing data extracts to reduce size

    Katherine Woods

      We have several data extracts that are getting quite large, >=10G in size. If you ask users, they ALWAYS say that they need ALL of the data. Is there a way to get statistics from the PostgreSQL DB on actual usage of data inside published data sources, say date ranges or fields used? I seem to remember seeing an article or a presentation where someone showed a way to do this. If anyone knows where I can find a KB article, sample dashboard or video on doing the above, please let me know. It would be very handy to have some statistics when going into meetings with users regarding streamlining the data sources we publish.
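      A rough sketch of the kind of repository query this is after, assuming the built-in readonly user on the workgroup database (default port 8060) and an http_requests table with created_at and currentsheet columns; the repository schema differs between Server versions, so treat every name here as something to verify rather than a documented API:

      # Sketch: count view traffic from the Tableau Server repository so you can
      # see which published content actually gets used. The table and column
      # names (http_requests, created_at, currentsheet) are assumptions to check
      # against your own Server version.
      import psycopg2

      QUERY = """
          SELECT currentsheet, COUNT(*) AS hits, MAX(created_at) AS last_hit
          FROM http_requests
          WHERE created_at >= NOW() - INTERVAL '90 days'
          GROUP BY currentsheet
          ORDER BY hits DESC;
      """

      def usage_stats(host, password):
          # The readonly repository user must be enabled on the Server first.
          conn = psycopg2.connect(host=host, port=8060, dbname="workgroup",
                                  user="readonly", password=password)
          try:
              with conn.cursor() as cur:
                  cur.execute(QUERY)
                  return cur.fetchall()
          finally:
              conn.close()

      if __name__ == "__main__":
          for sheet, hits, last_hit in usage_stats("tableau.example.com", "secret"):
              print(sheet, hits, last_hit)

      Field-level usage inside a published data source (which columns a workbook actually references) generally is not recorded in the repository, so that part would more likely mean parsing the workbook XML rather than querying PostgreSQL.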

        • 1. Re: Statistics for customizing data extracts to reduce size
          Chris McClellan

          It's an argument that can go on forever

           

          I'd tackle it in a few ways:

           

          1) Disk is cheap (relatively), so who cares?

          2) Dashboards run faster when the extract is smaller.  I've seen a dashboard running slowly on 450 million rows (covering about 5 years of data); they created ANOTHER extract for the current year only (250 million rows) and the dashboard ran a lot quicker ... but now you have 2 extracts, not 1.

          3) Test 10.5 and the Hyper engine.  In theory your extracts should be smaller and faster using Hyper.

          • 2. Re: Statistics for customizing data extracts to reduce size
            Katherine Woods

            While I agree with you that disk is cheap (so who cares), I do have some concerns:

             

            1) Many of these large extracts are refreshed daily as full refreshes, so the bigger they are, the longer they take to refresh (a sketch at the end of this post pulls the actual refresh durations from the repository).

            2) I have watched these extracts grow substantially in size as the Tableau Server version changes.

                    For example, they were about 5G in size when we were on version 9. Now that we are on version 10.3 they are 11G. While I am sure we have grown the business a bit over the years, I am sure we did not double it, and I have not changed the parameters of the data extracts. Since they are all full refreshes with fixed date ranges, not incremental refreshes, I can only assume that Tableau Server updates have added to the size. For example, I have two servers, production and dev. Last week I took our dev server from 10.1.3 to 10.3.0. The exact same extract is 10G on version 10.1.3 and 11G on 10.3; the only difference is the Tableau Server version. I would like to understand why that is. I would also like to rebuild some of these extracts to make them smaller, if I can get some statistics on the data records actually being used on a regular basis. Hence the question.
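
            A sketch of how to quantify point 1, pulling extract-refresh durations out of the repository. It assumes the same readonly repository access as the earlier sketch; the background_jobs table, its job_name / title / started_at / completed_at columns, and the 'Refresh Extracts' job name are all assumptions to verify against your own Server version:

            # Sketch: average extract-refresh duration per task, from the Server
            # repository. Assumes readonly repository access; the background_jobs
            # table, its columns, and the 'Refresh Extracts' job name should all
            # be checked against your own Server version.
            import psycopg2

            QUERY = """
                SELECT title,
                       AVG(completed_at - started_at) AS avg_duration,
                       COUNT(*)                       AS runs
                FROM background_jobs
                WHERE job_name = 'Refresh Extracts'
                  AND completed_at IS NOT NULL
                GROUP BY title
                ORDER BY avg_duration DESC;
            """

            with psycopg2.connect(host="tableau.example.com", port=8060,
                                  dbname="workgroup", user="readonly",
                                  password="secret") as conn:
                with conn.cursor() as cur:
                    cur.execute(QUERY)
                    for title, avg_duration, runs in cur.fetchall():
                        print(title, avg_duration, runs)

            Tracking that over time would show whether refresh duration really scales with extract size, which is a more persuasive number to take into a meeting than the file size alone.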

            • 3. Re: Statistics for customizing data extracts to reduce size
              Chris McClellan

              Don't worry, I understand where you're going, but Hyper is coming in 10.5 and the "rules" will change completely.

              • 4. Re: Statistics for customizing data extracts to reduce size
                Katherine Woods

                Chris, may I ask an additional (related) question? I created a data extract and saved it as a packaged workbook. The total size of this packaged workbook is 3.69G. However, when I published the data extract to the server for user access, the disk space used on the server (and the related extract size in the data\tabsvc\dataengine folder) is 8.45G. Why the doubling in size here? I am seeing this difference across the board for my extracts, and the difference seems to be getting bigger with every update. Is this due to changes you are making in the product to prepare for the 10.5 Hyper change, or is it something else?

                • 5. Re: Statistics for customizing data extracts to reduce size
                  Chris McClellan

                  Hi, I'd raise a ticket with Support.  I don't look at Server in that much detail, but if you're not using 10.5 then there wouldn't be anything about Hyper involved there.