2 Replies Latest reply on Aug 10, 2018 12:30 PM by Jim Dehner

    Calculation error due to messy data


      Hi All,


      I have a lengthy two part question so bear with me please.



      I'm pulling in data from SQL and due to the nature of the data my table gets blown out resulting in duplicate rows. The data are test results for individuals. Each individual gets a distinct ID number and tied to that ID number are 41 [Targets] which each have a numerical [Result]. What I haven't included in the example workbook (privacy concern) is another column that gives descriptive statistics based on that ID. That column causes the table to give me duplicate rows for each test result, not the end of the world since the numerical result is the same for each [Target] so I can use ATTR or aggregate using Avg to get the correct numbers. It is however causing a problem when I am performing a series of calculations on the data.



      The end user wanted the data to be transformed and scaled using these steps:

      1. Take the results that fall within the 20th and 80th percentiles for each gender

      2. Take the average and standard deviation for the results that fall into that range for each gender

      3. Perform a z-score transformation (x - mean) / standard deviation (using mean and standard deviation of results in the range)

      4. Scale the data as such : (z-score/4)+2


      You'll see in my workbook a folder under Measures called "A10 Calculations", there you will see how I separated out the results by gender (A10 (female) and A10 (male)) and then performed the calculations stated above. So part one of my question is if this series of calculations can be done in a more condensed way. As you'll see it takes 14 calculated fields to come to the answer [Normalized A10] and I would hopefully like to trim that down.


      The second part of my question is that when I take the average of the results that fall into the percentile range I am getting an incorrect answer. In Sheet 1 you'll see the values for target A10 that fall into the range. In the summary to the right you can see the correct average of those values to be 826.9. In Sheet 2 I pulled in the [AVG (female)] calculated field and it shows 799.5. The reason the value is incorrect is because of the duplicate value issue stated above. my calculated field [AVG (female)] is taking the average of the summed duplicate rows when it should be taking the values showed in Sheet 1. Is there any way to resolve this?


      If you're still reading this thank you, I know it is an extremely lengthy post. Any insight would be wonderful.



        • 1. Re: Calculation error due to messy data
          Patrick Van Der Hyde

          Hello James,


          Can you use a version of the data source that has single rows for each individual for these statistics that require the unique id values?  Then run the rest of the report from  a second data source as needed?  you could then use data blending to bring them together. 



          1 of 1 people found this helpful
          • 2. Re: Calculation error due to messy data
            Jim Dehner

            good afternoon


            Sent a lot of time on this one - didn't chase it all the way through but stopped when I got to the avg female  result you referenced



            is the calculation


            after spending a lot of time with this it was the way the duplicate records were aggregated that was creating the problem - first need a "faux" disaggregate (that's the numerator based on a min value by ID) then summed over the total number of records in the range and the denominator divides by the number of records in the range



            If this posts assists in resolving the question, please mark it helpful or as the 'correct answer' if it resolves the question. This will help other users find the same answer/resolution.  Thank you.

            1 of 1 people found this helpful