14 Replies Latest reply on Dec 25, 2017 7:16 PM by Shinichiro Murakami

    Data quality and Data available

    Sergio Brígida

      Hello people

       

      I want to do two things with the data:

       

      The first objective is:

      Calculate the percentage of erroneous measurements out of the total. The measured variable is the flow per id_sensor, and I have data for several years.

      The errors that may exist are flow values below zero.
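To make the first objective concrete, here is a rough sketch of the calculation I have in mind, written in plain Python only for illustration. It assumes each record is an (id_sensor, flow) pair; the function name and the sample values are made up and are not taken from my data.

```python
from collections import defaultdict

def error_percentage_by_sensor(records):
    """Return {id_sensor: percentage of readings with flow below zero}.

    `records` is an iterable of (id_sensor, flow) pairs (assumed layout).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for sensor_id, flow in records:
        totals[sensor_id] += 1
        if flow < 0:  # a negative flow is counted as a measurement error
            errors[sensor_id] += 1
    return {s: 100.0 * errors[s] / totals[s] for s in totals}

# Made-up sample: sensor "A" has 1 bad reading out of 4, sensor "B" has none.
sample = [("A", 12.5), ("A", -1.0), ("A", 3.0), ("A", 0.0),
          ("B", 7.0), ("B", 8.0)]
print(error_percentage_by_sensor(sample))  # {'A': 25.0, 'B': 0.0}
```

A flow of exactly zero is not treated as an error here, since only values below zero are errors.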

       

      The second objective is:

      Calculate the availability of the data, that is, the ratio between the observed data volume and the expected data volume.

      This sensor collected flow data every 5 minutes from 2012 through 2016, so a complete day has 288 records. I would like to calculate how many days have the full 288 records and how many days have fewer than that.

      The goal is to express the data availability as a percentage per day (288 records), per month (288 * 31, 288 * 30, or 288 * 28 or 29 for February), and per year for a given attribute.
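The per-day, per-month and per-year percentages could be sketched roughly like this in plain Python. This is only an illustration under my own assumptions: the readings arrive as a list of datetime values, duplicate timestamps count once, and the names expected_records and availability_percent are hypothetical.

```python
import calendar
from collections import Counter
from datetime import datetime, timedelta

RECORDS_PER_DAY = 24 * 60 // 5  # one reading every 5 minutes -> 288

def expected_records(year, month=None):
    """Expected number of records in a month, or in a whole year if month is None."""
    if month is None:
        days = 366 if calendar.isleap(year) else 365
    else:
        days = calendar.monthrange(year, month)[1]  # handles 28/29/30/31 days
    return RECORDS_PER_DAY * days

def availability_percent(timestamps):
    """Return three dicts: availability in percent per day, per month, per year."""
    unique = set(timestamps)  # assumption: a duplicated timestamp counts once
    per_day = Counter(ts.date() for ts in unique)
    per_month = Counter((ts.year, ts.month) for ts in unique)
    per_year = Counter(ts.year for ts in unique)
    by_day = {d: 100.0 * n / RECORDS_PER_DAY for d, n in per_day.items()}
    by_month = {ym: 100.0 * n / expected_records(*ym) for ym, n in per_month.items()}
    by_year = {y: 100.0 * n / expected_records(y) for y, n in per_year.items()}
    return by_day, by_month, by_year

# Made-up sample: one complete day followed by half a day, in February 2016.
start = datetime(2016, 2, 1)
stamps = [start + timedelta(minutes=5 * k) for k in range(288 + 144)]
by_day, by_month, by_year = availability_percent(stamps)
print(by_day[start.date()])  # 100.0
```

Days, months or years with no records at all do not appear in the result, so they would need to be added with 0% if the full calendar is wanted.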

       

      The following equations express the availability coefficient for each parking area:

      2.PNG

      In these equations, i is a day where i ∈ N and N is the number of days in the dataset, in this case 366. Vi is a boolean variable that considers a day valid if RPi >= 0.9. RPi represents the record percentage as a coefficient between the number of observed records in a day i and the expected number of records. Finally, the availability is a coefficient between all days that are considered valid (with record percentage higher than 0.9) and the total number of days, N. In sum, the availability coefficient is equal to the number of days with a record percentage higher than 0.9 divided by the total number of days, 366. In order to choose how to consider a specific day valid (with a high record percentage), the record percentage was calculated for each parking area. Results showed that all parking areas had an average between 70-90% of record percentage for most of the days, but hardly any parking area had more than 90% in any day. Due to this reason, it became clear that demanding a record percentage higher than 90% would render all parking facilities invalid, and the value of 0.9 was chosen as the required record percentage. Figure 3.3 shows the availability coefficient for each parking area. As shown in the figure, most parking areas have a coefficient higher than 0.8. This means that the majority of the parking areas have more than 80% of their days with a high record availability.
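As I read the description above, the availability coefficient could be sketched like this in plain Python. It assumes one observed record count per day (0 for a day with no data) and uses RPi >= threshold as the validity test, following the definition of Vi; the function and parameter names are my own and the sample counts are made up.

```python
def availability_coefficient(observed_per_day, expected_per_day, threshold=0.9):
    """Share of days whose record percentage RPi reaches the threshold.

    observed_per_day: one record count per day i in the dataset (length N).
    expected_per_day: expected number of records in a day.
    """
    n_days = len(observed_per_day)
    if n_days == 0:
        raise ValueError("the dataset must contain at least one day")
    # Vi is 1 when RPi = observed / expected is at least the threshold, else 0.
    valid_days = sum(
        1 for observed in observed_per_day
        if observed / expected_per_day >= threshold
    )
    return valid_days / n_days

# Made-up sample with 288 expected records per day: only 288 and 260 reach 90%.
counts = [288, 260, 259, 100, 0]
print(availability_coefficient(counts, expected_per_day=288))  # 0.4
```

With 288 expected records, a day needs at least 260 records to be valid at the 0.9 threshold, since 259 / 288 is just under 0.9.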

       

      I reduced the data and attached it here.