4 Replies Latest reply on Mar 29, 2018 11:46 AM by Jonathan Drummey

    How do you deal with dirty data?

    Tracy Rodgers

      I always think that getting my data in the format and cleaned the way that I need it to bring it into Tableau is always the most difficult part. I'd love to hear more about some of the frustrations and issues you run into when you're preparing your data for yourself or for your organization.

       

      - How much time on average do you think you spend on preparing your data before bringing it into Tableau Desktop?

      - How often do you have to go back and update a data set that you've already cleaned, prepared, etc?

      - How do people in your organization share this prepared data? Is the same data set being cleaned by multiple people to be used for analysis?

       

      - Do you run into issues because people within the organization are unfamiliar with the data? How do you overcome these? What governance policies do you have in place (if any)?

      - Who in your organization is tasked with preparing the data you use in your analysis? If you have to rely on someone else, how long do you have to wait to get a data set you can use?

       

      What else do you run into?

       

      Thanks for sharing!

       

      -Tracy

        • 1. Re: How do you deal with dirty data?
          Cathy Bridges

          - How much time on average do you think you spend on preparing your data before bringing it into Tableau Desktop?

               Thankfully, not as much as I used to. There are others in my organization who are tasked with prepping data and bringing it into our data warehouse. Previously, I spent several days using Excel (on a monthly basis), a few days using MS Access, and then a few hours using Alteryx.

               Sometimes I get Excel files that need to be in Tableau and I have to pivot it or clean it up first. Can take an hour or several hours.

           

          - How often do you have to go back and update a data set that you've already cleaned, prepared, etc?

               Often, and it's a bit of a pain. If we need to bring in additional columns, we have to go through a long and arduous process. If I need totals vs. breakout, I need to duplicate the data source. If I have null dimensions, but still need to see them, I have to strongarm Tableau into showing the missing data (and will ultimately have to request that the data be backfilled with zeros whenever new accounts are added).

           

          - How do people in your organization share this prepared data? Is the same data set being cleaned by multiple people to be used for analysis?

               The vast majority of our data is available in our warehouse, available to anyone who requests and is approved for access. I don't think multiple people are trying to clean the same data, but they might be.

           

          - Do you run into issues because people within the organization are unfamiliar with the data? How do you overcome these? What governance policies do you have in place (if any)?

               Absolutely we have issues. If you don't know any SQL at all, you're lost and have to rely on IT to create a custom view or write custom SQL for Tableau. We have governance but people still end up using their own (sometimes incorrect) calculations.

           

          - Who in your organization is tasked with preparing the data you use in your analysis? If you have to rely on someone else, how long do you have to wait to get a data set you can use?

               I believe it is the EDW team. When changes are needed, it takes weeks at a minimum and often months.

          • 2. Re: How do you deal with dirty data?
            Tracy Rodgers

            Thanks for sharing this insight Cathy!

             

            So the EDW team prepares the data and puts it into the data warehouse, but it sounds like you have to go through a lot of steps to get it into a format where you can analyze it in Tableau. Are you exporting the data to an Excel sheet or can you connect directly to the data warehouse?

             

            Do most people in your org use Tableau or are they doing the analysis wherever they can?

            • 3. Re: How do you deal with dirty data?
              Cathy Bridges

              Hi, yes, our EDW team cleans and loads the majority of our data into the warehouse. However, sometimes there are additional processes or calculations that happen outside of warehouse, and that may come to me in a format that is designed more as a report than as a flat file that is useful in Tableau. So I have to reformat it. There are a few reports which are currently only provided in Excel, but the data being used to create the reports is from the warehouse, so we're working on making those EDW-resident reports so they can be both better managed and also structured the way I need them. It's no fun having to swap out Excel files every time they're updated, and they can't just be overwritten because the data is on separate tabs so we have to make a summary tab that consolidates them.

               

              We have, I believe, hundreds of Tableau licenses, so there is a good deal of analysis being done in Tableau. Some are connecting directly to the data warehouse, but often times there is just too much data for a direct connection. For instance, we're capturing second by second information on phone calls that is collected through our phone software, then cleaned and imported into our data warehouse. But when business users want the ability to see year over year totals but also drill down to seconds and person handling the call... well suffice it to say some of the dashboards simply stopped rendering. We had to trim the data down, push some of the joins back to the data server rather than do them in Tableau, etc.

               

              When people need a substantial amount of data or multiple complex joins, we encourage them to have an intermediate object built in the warehouse that already has a lot of the filters and joins, and is structured in a way that is helpful to Tableau. But there are major downsides to that, first being that it takes a LONG time to see the finished product and a LONG time to implement any necessary changes down the line.

               

              We do have users who need to consolidate disparate data sources - warehouse data, their own server data, data scraped from the web, excel reports from outside sources - and they've had great success using Alteryx to manage that.

              • 4. Re: How do you deal with dirty data?
                Jonathan Drummey

                Hi Tracy!

                 

                I'm responding for PATH here where we've got 200+ projects (and many with various sub-projects) active at one time and my team's work where our primary tools are Tableau, Alteryx, Excel, PowerQuery, and R.

                 

                Average time for most sources is several hours with a very high standard deviation. Some projects take several hours to several days, others just a few minutes. Note that this is assuming structured captured data, the process of getting to that point can take months of effort.

                 

                Updating data sets: There are roughly three inter-related cycles of updates...one is during the data prep/analysis phase where there will be updates with improved data quality (see next item). Another is more dimensions & measures as in new metrics, more healthcare facilities, etc. And a third is additional periods of data e.g. next week/month/quarter.

                 

                Same data set cleaned by multiple people: yes, it's typically that there's someone downstream (often me) and someone(s) upstream and the upstream person(s) is/are taking feedback from the downstream person to do cleanup, usually around data quality issues & availability of supplemental data sets like spatial coordinates, local population for denominators, etc.

                 

                Sharing data: presently cleaned data sets are mostly distributed as CSV/Excel files (eventually EXASOL) with some publishing to Tableau Server.

                 

                Issues w/unfamilarity: the upstream people are usually familiar with the data in general but haven't had the expertise or had available tools for rapid iteration like Tableau or Alteryx to be able to really get a handle on deeper issues in the data. So there are conversations around that.

                 

                Governance processes: Tableau is still relatively new at PATH and we typically have data management plans in place that don't always fully capture the kinds of things that are possible with Tableau so governance is evolving. At a higher level there are governance requirements from donors & country partners that also affect how data is managed, and there are projects with research components that need to meet IRB requirements as well.

                 

                Who is tasked with preparing data? Each project has one or more staff who are tasked with preparing data.

                 

                Jonathan