2 Replies Latest reply on May 18, 2018 7:11 PM by Tom Fuller

    Remove Duplicate Values

    William Graf

      Is there any way to use Maestro to remove duplicate values from a data set?

       

      I've got an employee census file that contains multiple entries for certain employees with a unique "employee ID" identifier. Essentially, I want to be able to remove rows with duplicate employee IDs.

        • 1. Re: Remove Duplicate Values
          Joshua Milligan

          William,

           

          Sure!  There are a couple of possible ways and which you choose may depend on exactly which record you want.  Let's talk about a couple of cases.

           

          Case 1: Exact Duplicates - All values for duplicate records are the same

           

          Employee IDNameDate Hired
          1Pentecost4/24/2012
          2Ryrie3/2/2015
          3Walvoord5/1/1997
          3Walvoord5/1/1997
          4Chafer2/27/1982

           

          Here, you can just use an Aggregate step and Group on Employee ID while Aggregating all other fields as MIN or MAX:

           

          The output is nicely deduped, because the aggregation only gives you one record per Employee ID:

           

           

          Case 2: Duplicate IDs with various values and you want to keep a specific row

              

          Employee IDNameDate HiredStatus UpdatedStatus
          1Pentecost4/24/20124/12/2018Full Time
          2Ryrie3/2/20154/12/2018Full Time
          3Walvoord5/1/19974/11/2018Part Time
          3Walvoord5/1/19974/12/2018Full Time
          4Chafer2/27/19824/12/2018Full Time

           

          Here, the duplicate occurs because the second row indicates an update of status.  Likely you'll want to get the most recent record.  At times there may be other logic, but it will likely follow the same pattern (see details here https://vizpainter.com/latest-snapshot-in-tableau-and-maestro)

           

          Basically, you'll use the aggregation step to group on Employee ID and get the MAX update date (or Row ID or other field or MIN of the field that identifies the row you want to keep).  You won't worry about other fields yet.

           

          Then, you can inner join that back to the previous step in the flow on both the Employee ID and the Status Updated fields.

          It looks a bit strange, but you can see that the row on 4/11/2018 is excluded by the join.  The net result is a nicely deduped data set:

           

          I've attached a sample packaged flow with these examples.

           

          Hope that helps!

          Joshua

          • 2. Re: Remove Duplicate Values
            Tom Fuller

            This is a great tip that will come in handy for a lot of people! A slight variant of case 1 would be when you have multiple records per ID, and some are exact duplicates.  If your goal is to remove ONLY the exact duplicate rows, you can group by ALL fields. In the case of the sample file, either approach works.