3 Replies Latest reply on Jul 12, 2018 9:56 AM by Deepak Rai

    How to deal with data with multiple entries for an ID



      I have data in a CSV/Excel file that looks something like what I have attached, but with a larger number of records.


      Explanation of data:

      id is the ID for a person.

      date is the date of a particular entry (this will vary more, but for simplicity I just chose three random dates in January in different years; there could be more or fewer entries per ID).

      issue indicates whether this person had an issue with customer service on this date.

      issue any year is just the logical OR of issue across all of that person's dates.

      prob of issue overall comes from a statistical model that was developed to predict whether this person would have an issue with customer service. The value was computed using the data from all dates, so it is not valid for any single date entry, but rather as a total across all entries; for simplicity, the value was replicated on every row for an ID.


      What I'm trying to accomplish:

      I am trying to create a treemap that will show me a number of things:

      I am going to bin the probability measure into groups of 0.2 (basically 20-percent chunks of the probability output), and I want to count how many IDs fall into each probability bin. Then, when I hover over a bin, I want the tooltip to show the "truth", that is, how many of those people actually did have a problem with customer service. (In this example data, the last two entries should be examples of a false positive and a false negative.)
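      One way to sketch the binning is a calculated field like the following (the field name [Prob of Issue Overall] is a guess based on the description above; adjust it to the actual column name):

```
// Probability Bin: 0.2-wide buckets (0.0, 0.2, 0.4, 0.6, 0.8)
// Note: a probability of exactly 1 lands in its own 1.0 bucket.
FLOOR([Prob of Issue Overall] / 0.2) * 0.2
```

      Tableau's built-in bins (right-click the measure > Create > Bins, bin size 0.2) should produce the same grouping without a calculated field.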


      What I can do so far:

      I am able to bin the probability measure, but I'm concerned about the count of people, because I must make sure not to count an ID more than once. After doing some searching, I think that using a FIXED level-of-detail calculation might be the way to go, but I'm not entirely sure how to accomplish this. I'm also not sure how to deal with the truth data, which is given as true/false and is a replicated value like the probability score, so I don't know how to count it properly. The dataset I actually have is a few million rows, and in my experiments the FIXED calculation on ID seemed to take a while to process. I'm wondering if this is just how it's going to be, or if there's something I can do to speed it up.
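      A minimal sketch of the calculations described above, assuming column names [Id], [Issue Any Year], and [Prob of Issue Overall] (these names are assumptions based on the description, not the actual fields):

```
// Prob Per Person: collapse the replicated probability to one value per ID
// (MAX is arbitrary here, since the value is identical on every row for an ID)
{ FIXED [Id] : MAX([Prob of Issue Overall]) }

// People Per Bin: distinct IDs, so replicated rows are not double-counted
COUNTD([Id])

// People With Issue: distinct IDs whose any-year flag is true
// (IIF returns NULL for false, and COUNTD ignores NULLs)
COUNTD(IIF([Issue Any Year], [Id], NULL))
```

      Binning the first field and putting the two COUNTD measures on the tooltip should give both the per-bin total and the per-bin "truth" count. On millions of rows, using an extract (or pre-aggregating the source to one row per ID before loading) typically helps the performance of FIXED calculations.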


      Message was edited by: Aaron Albin