6 Replies Latest reply on May 8, 2013 5:47 AM by Michael Daivs

    Average volume over all days

    Michael Daivs

      I need to show the average number of occurrences per day for many events where data is sparse.

       

      Example:

      Event A occurred on 3/1, 3/1, 3/2, and 3/6. Thus the occurrence per 7 day week during that week would be 4/7 ~ .571. However, COUNTD() returns 4/3 ~ 1.333 thus seriously skewing the data.

       

      In order to automate this I created a calculated field which followed the below format.

       

      Count([Number of Records]) / (DATEDIFF('day', MIN(<Date Field>), MAX(<Date Field>)))

       

      This seemed like a good work around, but I was wondering if there a more standard way to go about preparing this data?

        • 1. Re: Average volume over all days
          Mark Holtz

          Hi Michael,

           

          It sounds like you essentially want to include values that are not present in your underlying data in a calculation. This subject is known as Domain Completion or Domain Padding.

           

          It's not the easiest thing to learn--I still struggle with it, but there is plenty of content on the forums around it.

          I know it was the topic of a Think Data Thursday not too long ago...

          http://community.tableau.com/groups/think-data-thursday

           

          In my opinion, the simplest solution is actually to force a value for every day into your data using custom SQL. We have a table we join to regularly that has every day (including the future through the next year) since our company started.  That way, we can join our data pulls to that source to ensure at least a record with no values gets included in the data.

          • 2. Re: Average volume over all days
            Michael Daivs

            How would you then need a secondary calculation (to replace "Number of Records") that corrected for the padded data?

             

            I thought about doing that while passing through the back end but didn't see a clean way to count if I padded each type of data.

             

            If my confusion isn't clear below is an example:

             

            For the below data I would like to look at the average volumes per day by site and product.

             

            Date Sold     State     Product

            1-1-13          NY          A

            1-1-13          CT          A

            1-3-13          NY          A

            1-4-13          NY          C

            1-5-13          NY          A

            1-5-13          CT          B

             

            To pad it out completely I would get the following table (truncated for space)

             

            Date Sold     State     Product     Padding

            1-1-13          NY          A               No

            1-1-13          CT          A               No

            1-1-13          NY          B               Yes

            1-1-13          NY         C               Yes

            1-1-13          CT          B               Yes

            1-1-13          CT         C               Yes

            1-2-13          CT          B               Yes

            1-2-13          CT          A               Yes

            1-2-13          NY          A               Yes

            1-2-13          NY          B               Yes

            1-2-13          NY         C               Yes

            1-2-13          CT          A               Yes

            1-2-13          CT         C               Yes

            ...

             

            This would cause both a programming and storage nightmare particularly with data sets of dozens of columns but ignoring that issue entirely how would I be able to count the transaction rate at this point? is the only way to do it having an identifying column telling me that the value is added for padding and not to count it? wouldn't the exclusion of those rows put me back into the same place I was without the padding?

             

            Thanks

            • 3. Re: Average volume over all days
              Mark Holtz

              Hello Michael,

               

              Generally, when completing a domain, you would cross join all dimension combinations possible and then leave NULL the measures (I prefer this over setting to 0 so that they are omitted in aggregations like SUM and AVERAGE.)

               

              So If you have 5 dates, 2 states and 3 products, you produce the cartesian product and generate 30 placeholding records. Then left join to your actual data and you will still get 30 records, but only those with data will have non-NULL measures--so the NULLs can be easily filtered back out if you don't want the padded data.

               

              Does that help?

              • 4. Re: Average volume over all days
                Michael Daivs

                That answers the question I had.

                 

                However, I would like to return to the issues I ignored initially. I can get more storage space so we can ignore that piece, but because I really have years worth of dates specific to the hour, and about 30 data fields each with at least 8 categories the Cartesian product would be on the magnitude of 8,000,000 rows and growing daily, wouldn't that much padding cripple the speed?

                • 5. Re: Average volume over all days
                  Mark Holtz

                  Tableau has handled 30M+ record data sets for us before.

                  If you create an extract, it shouldn't be too bad. The only thing I can say is try it and see. It does depend on how complex you make your Tableau views. If you have a lot of parameter-based calculated fields, that might not be so snappy, but if you're just doing reading and rendering, it shouldn't be crippling, as you put it.

                   

                  Additionally, you don't need to store the exploded table in your source system--just create a dimension table for each field (your date-time to the hour will be the largest) and cross join them in custom SQL as your Tableau source.

                   

                  The cross join becomes a "scaffold" upon which you attach real results data via a left join.

                  Jonathan Drummey's blog has had a couple segments around this too, if you are looking for more info.

                  http://drawingwithnumbers.artisart.org/wiki/tags/domain-padding/

                  • 6. Re: Average volume over all days
                    Michael Daivs

                    That sounds like a good methodology. Thank you.