A topic that I would like to explore is best practices for handling large quantities of daily POS data: getting it from its source into Tableau with relatively fast performance for the Tableau end user. What range and granularity of dates is practical to deal with, what is the best data source to use in Tableau, what ETL methods can keep up with the volume, where should the joins occur, what transforms are required to maintain type 6 dimensionality for retail data, etc.
Another topic is how to handle barcodes for retail. They are often composite codes with subsections that represent vendors, categories, etc. They also often have optional zeros on the front or back, and obscure character-replacement rules such as "if it starts with 9, it's seafood."
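To make the problem concrete, here is a minimal sketch of normalizing and decoding a composite retail barcode. The segment positions and the "starts with 9 means seafood" rule are purely illustrative assumptions — real layouts and prefix rules vary by retailer and symbology, which is exactly why the topic is worth discussing.

```python
# Hypothetical barcode decoder. Segment offsets and the department rule
# below are illustrative assumptions, not any real retailer's scheme.

def normalize(code: str, width: int = 12) -> str:
    """Strip whitespace and restore the optional leading zeros."""
    return code.strip().zfill(width)

def decode(code: str) -> dict:
    code = normalize(code)
    return {
        "vendor": code[1:6],      # assumed vendor/manufacturer segment
        "item": code[6:11],       # assumed item segment
        "check_digit": code[11],
        # Example of an obscure prefix rule mentioned above:
        "department": "seafood" if code[0] == "9" else "general",
    }

print(decode("912345678905"))
```

In practice each retailer's rule set would live in a lookup table rather than hard-coded conditionals, but the normalize-then-slice shape stays the same.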
I'm looking forward to participating in the discussion,
Thanks Matt Shoemaker for the suggestions. Both are certainly topics the larger group would be interested in. Personally I'd love to hear more about best practices for handling the massive POS data as we get ready to tackle a similar project. We've long had that rich data feeding through near-real-time, but producing a Tableau dashboard that will perform for our end-user community can still be a challenge. Like you, I'd love to hear more from someone who has tackled this.
The barcode topic could be interesting as well. In my case, all of the item-level data I work with is fully qualified with hundreds of attributes since it's in our 'warehouse zone', but I'd definitely love to learn a little more about how others are working with this data when that's not the case.
Thanks so much for the suggestions and welcome to the community.
A system that I work with gets about 5 million POS rows per day. That results in 1.8 billion rows for each year of data, so if you want to do YoY comparisons for the last 5 years you're working with a table of nearly 10 billion rows. And that's with the data at a relatively coarse granularity (1 record for every combination of day/product/location). Other analysts would like to use shopping-cart-level data (1 record for every item scanned in every shopping cart at checkout). That data set is several orders of magnitude larger than the 5-million-row-per-day feed.
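For anyone following along, the arithmetic works out like this (the bytes-per-row figure is just a guess for illustration; actual size depends on column width and compression):

```python
# Back-of-envelope sizing for the daily POS feed described above.
rows_per_day = 5_000_000
rows_per_year = rows_per_day * 365    # ~1.8 billion rows/year
rows_5_years = rows_per_year * 5      # ~9.1 billion rows over 5 years

# Assumed ~50 bytes per row, uncompressed -- a placeholder, not a measurement.
tb_raw = rows_5_years * 50 / 1e12

print(f"{rows_per_year:,} rows/year")
print(f"{rows_5_years:,} rows over 5 years (~{tb_raw:.2f} TB raw)")
```

Cart-level data being "several orders of magnitude larger" pushes these numbers from billions of rows into the trillions, which is where the storage question really starts to bite.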
The traditional BI approach would be to create a data mart and extract out small-ish sets of data that are just intended to support certain reports, storing the data in a form that is optimized for retrieval (e.g. a Tableau Data Extract, or in Vertica, Teradata, etc.). However, extracting all of those reporting sets is inefficient: every new report spawns another ETL job and another copy of the data to maintain.
Having access to the full, raw set of data directly in Tableau would give the analyst the most potential. I'm wondering if there is a realistic approach to storing the data in a single repository and querying it directly from there without all of the extra steps of creating those subsets and extractions, and still get acceptable performance. Can any combination of current database technology and Tableau keep up with the larger cart-level data set I described earlier?
I know there are some good retail examples and stories here and I'm looking forward to checking those out https://www.tableau.com/stories/topic/retail-wholesale
I'll also keep a look-out for retail sessions and "big data" sessions at this year's Tableau Conference.
Matt Shoemaker thanks for the detail on your situation. My DW environment is structured pretty similarly and we've had to work through many of those issues. (Disclaimer: I'm 80% Analyst/20% Tech, so I'll avoid trying to solution, we'll all be better off)
We've had a lot of success with putting our data in MSAS cubes. We've been able to recreate our entire DW (all fact groups: basket-level Sales, Inventory, Markdowns, etc. at the lowest levels w/ 3 historical years) and provide query response time on dashboards in the 5-10 second range. This is roughly a 2.5 TB cube that is heavily bloated with the inventory facts. If we isolated just the transaction-level sales fact, it would be sub-5 sec for everything. So that's one idea, but it's not without its drawbacks. If you go the cube route, there are some tradeoffs in what you can do w/ Tableau. Not sure if you've worked with cubes in Tableau before, so apologies if you already know this. Many of the tricks in my Tableau toolkit rely on being able to create dimensions on the fly (for highlighting, for calcs, LODs, swapping views within containers, etc.). With cubes, you either build the dimension into the cube, or you don't have it. So that's the one "gotcha".
All of that to say, we've had success with cubes with a very similar use case. Not sure if this is an option for you, but that's one example that has worked well for us. We serve the whole cube up to our entire user base, and it is able to handle the load of our entire merchandising community (~450 Users) in the 5-10 sec range.
Jeff Huckaby is another Retail resource who is a bit more technical who might be able to give you some alternate suggestions.
A separate community for retail! It's awesome.
I have just started working in the retail domain and this could be a one-stop solution for all my questions. Already loving it!
Hello Matt and welcome.
In the grand scheme of things, ~ 2 billion rows of data is not a lot of data.
In regard to extracts, a general rule of thumb: hundreds of millions of rows of data work great for Tableau extracts; at hundreds of billions of rows, go to the source system/database.
You also have to consider how wide your data is for the extract, how distinct the data is (such as a SKU or UPC), how much text you have in the columns, etc.
We certainly have companies with billions of rows of data in an extract, but the data table is not too wide. So width and type of data need to be considered.
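A quick sketch of why width matters so much. The per-column byte counts below are illustrative assumptions, not Hyper's actual storage model (columnar engines also compress low-cardinality columns far further than this), but the relative effect is what counts:

```python
# Rough illustration: the same row count at two different table widths.
# Byte sizes per column are assumptions for the sake of the comparison.

rows = 2_000_000_000  # the ~2 billion rows mentioned above

narrow = {"date": 4, "store_id": 4, "sku": 8, "units": 4, "sales": 8}
# Same table plus a few free-text/descriptive columns:
wide = dict(narrow, description=80, vendor_name=40, category_path=60)

for name, cols in (("narrow", narrow), ("wide", wide)):
    gb = rows * sum(cols.values()) / 1e9
    print(f"{name}: {sum(cols.values())} bytes/row, ~{gb:,.0f} GB uncompressed")
```

Adding a handful of text columns multiplies the footprint several times over at the same row count, which is why a billion-row narrow extract can be fine while a wide one isn't.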
I am not a fan of marts; I prefer having the data architected so that people can access all of the data and the queries are very fast (under a 3-5 second SLA).
But to answer your question, it is very realistic, and people are definitely storing their data in single systems such as SQL Server, Amazon Redshift, Teradata, Cloudera/Hortonworks, etc. and querying live data very, very quickly. We have a customer using a combination of Hortonworks and AtScale, both partners of ours, and they are querying billions and billions of rows of data and getting results in Tableau within their 3-second SLA. I've also heard of good performance out of Teradata and Tableau, with row counts in the billions there as well. Many customers have also had great success with the Redshift/Tableau combination.
Never a company to stand still and not innovate, our developer team will be adding capability to the data extract engine, enhancing it in the future with new technology obtained from our acquisition of HyPer.
I've included a recent article that we posted in regard to query performance gains within SQL Server 2016.
SQL Server 2016 - Query Performance
I hope this helps Matt
Yes Sai - we are excited as well. There are a lot of people in retail who struggle with analytics, and when they discover Tableau, they get really, really happy and really, really excited.
Retail is a fun area to work in, welcome to the family!
Thanks Jeremy and Jeff for the excellent information.
I tried using Oracle Essbase cubes with Tableau about 5 years ago and did find the limitations prohibitive at the time. I'm particularly interested in the scalability of non-cube solutions such as Redshift, BigQuery, and the various flavors of Hadoop that are available. The promise of Redshift is that you can just add more nodes, or upgrade to the bigger node types, to handle the scale, but I haven't tried it with those larger raw data sets yet. We're currently using Redshift as a mart with tens of millions of rows in each reporting source table, running on 4 of the smaller-size nodes for that project. Performance in Tableau is decent when connected to those sources (3-15 second wait times, with most in the 5-10 second range).
It will be great to hear some other user stories about analyzing big retail data with Tableau and to hear what systems have worked well and which systems couldn't keep up. I'll also continue to post more on my experience with it if I get assigned to analyze those larger data sets such as shopping-cart data.