7 Replies Latest reply on May 18, 2015 2:42 PM by Mary Solbrig

    How? Compare cell value to cell value in all other rows, count matches

    daniel.ayoub.0

      Data is structured as follows

       

      Row ID | Latitude

           1           32            

           2           32.1

           3           32.12               

           4           35.1      

           5           35.3                               

       

      For each row, I want to take the current Latitude value, and compare it to all the other latitude values, and count how many times it is +/- 0.5 from the other values.

       

      Example of what the results should look like:

       

      Row ID | Latitude | Calculation

           1           32             3

           2           32.1          3

           3           32.12        3      

           4           35.1          2

           5           35.3          2



      In excel this can be done with a simple COUNTIFS(RANGE,CRITERIA)...how can I do this in Tableau?


      Thanks

      Daniel

        • 1. Re: How? Compare cell value to cell value in all other rows, count matches
          Mary Solbrig

          Best way I can think of doing this exact thing would be to use a self-join. Instructions below assume that this is an Excel file:

           

          Step 1: Create the connection

          1. Connect to data using the legacy Excel connection. See steps 1-3 of the following blog for instructions:

          https://www.interworks.com/blogs/kfontenot/2014/07/24/custom-sql-tableau-82

          2. Drag Sheet 1 onto the Data Connect pane twice.

          3. Click on the join-circles and define a join as ID <> ID (Sheet $1)

          4. Alias the fields as ID, Latitude, ID 2, and Latitude 2.

          5. Go to sheet

           

          Step 2: Create the Calculations

          1. Create the following calculated fields:

          Close:

          IF ABS([Latitude] - [Latitude 2]) < .5

          THEN 1 ELSE 0 END

          Count Close:

          WINDOW_SUM([Close])

          First Filter:

          FIRST() = 0


          Step 3: Create the view

          1. Drag ID to the Rows shelf

          2. Drag ID 2  and Count Close to the Columns shelf

          3. Right-click on Count Close and select Compute Using > Table Across

          4. Drag First Filter to the Filters shelf

          5. Right-click on First Filter and select Compute Using > Table Across

          6. Select True on the pop-up dialogue and select OK

          7. Right-click on ID 2 and deselect Show Headers

           

          Another option is to enable an R connection and use the following calculation:

          SCRIPT_REAL("

          x1 <- .arg1

          out <- rep(0,length(x1))

          for(i in 1:length(x1)){

            out[i] <- sum(abs(x1-x1[i]) < .5)

          }

          out

          ",

          AVG([Latitude])

          )

           

          The question I really have though is what question are you trying to answer? If you are just trying to find out are close to any individual point, then a parameter could be used to choose which point you were referring to. This would avoid a lot of hassle, but you would only be able to see the results for one point at a time. Another option might be to just round the points to the nearest whole number, and then count how many points lie in each whole-number bin. This is not quite the same information though.

          • 2. Re: How? Compare cell value to cell value in all other rows, count matches
            Jonathan Drummey

            I'd started on a response to this one, and then came back to it finding that @Mary had written a post that covered the two solutions I hadn't generated options for. Here's my description and a third option:

             

            COUNTIFS() is a function that demonstrates the flexibility of Excel's cell-based model. In a given cell we can write a COUNTIFS() calculation and have it reference any other range of cells, so for each Row ID it's easy to have it reference all the other Row IDs.

             

            Tableau has a different model where each record is aggregated at the level of detail of the view i.e. the dimension(s) in the view to generate marks. In this case, the level of detail would be the Row ID. Tableau does have ways to aggregate across the marks, techniques include table calculations, level of detail expressions, computed Sets, and data blends. However, this particular case is an edge case where Tableau's calculations don't quite have the native ability to do what we want. The way I think of this is it's "compare the value of each to each other" problem, in that for each Row ID we want to compare the value of the Latitude for this Row ID to the value of Latitude for every other Row ID.

             

            In Tableau terms, that means every Row ID ultimately needs to have all the values of Latitude for other Row IDs available to the computation for the individual Row ID for computation. Tableau doesn't have something like a THIS() function or an iterator function that would let us do that in a simple way. There are three approaches I can think of:

             

            1) Generate a cross product for all Row IDs in the data connection, then do the calculation in Tableau. This could be done using the self-join that Mary described, Custom SQL, or a view in the data source. A variation on this would just do all the calculation & aggregation in SQL.

             

            2) Use Tableau's R integration. I haven't worked out all the details of the R script, however I'm certain it's possible since R's vector operations are built for this sort of thing. Mary provided a solution for this as well.

             

            3) Pad out the data in Tableau using Tableau's densification behavior. http://drawingwithnumbers.artisart.org/comparing-each-against-each-other-the-no-sql-cross-product/ shows an example of this, and I used the same technique to build out this view:

             

            Screen Shot 2015-05-11 at 5.19.09 AM.PNG~~~screenshot

             

            Here's how it works:

             

            - With Row ID on Rows and Row ID (copy) on columns and a table calculation with the right Compute using, Tableau pads out the combinations of Row & Row ID using what is called 'domain completion' to generate N^2 cells.

            - One table calculation can the pad out the values of Latitude across each Row by having a Compute Using on the Row ID (copy). Here's the formula: PREVIOUS_VALUE(WINDOW_MAX(MAX([Latitude]))). The key part is the WINDOW_MAX(), the PREVIOUS_VALUE is an optimization that definitely helped improve performance through v8, I haven't heard about any changes for v9 so I'm guessing it's still useful.

            - Another table calculation pads out the values of Latitude down each column of Row ID (copy). It has a Compute Using on Row ID and uses the same formula as the prior calc.

            - A third nested table calculation with a compute using on Row ID (copy) does the comparison of the Latitude (across) to the Latitude (down) and returns 1 if they are within 0.5, that is summed up across all the Row ID (copy) values and finally only a single non-Null result is returned for Row ID (copy). Here's the formula:

             

            IF FIRST()==0 THEN

                WINDOW_SUM(IF ABS([Latitude (down for others)] - [Latitude (across this row)]) <= 0.5 THEN

                    1

                END)

            END

             

            Here's the workout worksheet where you can see this:

             

            Screen Shot 2015-05-11 at 5.19.18 AM.PNG

             

            To build the final view, I duplicated the workout worksheet, dragged off the unneeded calculation pills, Ctrl/Cmd+Dragged a copy of the Count pill to the Filters Shelf and set it to filter for non-Null values, turned off the header for the Row ID (copy) pill, added a Count Header field with a single value 'Count' and used that to generate header text.

             

            A v9 workbook is attached, though this technique would work from at least v7 onwards (if you need it in an earlier version, let me know and I can duplicate it pretty quickly).

             

            Finally, a note on volumes & performance:

             

            - With enough memory the domain completion solution I proposed can work, though it will get slower and slower to refresh, I'm guessing it would likely start running out of memory somewhere between 1K and 10K Row IDs because that would translate into 1M to 100M marks. Allan Walker has demonstrated a view with 600M marks, and that took something like 45 minutes to render on a 64GB machine.

             

            - The SQL cross product can definitely handle more Row IDs, though if Excel-based that may run into some limits due to MS JET limitations. Other DBs could handle a larger data set.

             

            - The R solution should scale the best of all the options, since instead of N^2 records (cross product) or N^2 addresses (domain completion) it's using N*2 vectors.

             

            Jonathan

            1 of 1 people found this helpful
            • 4. Re: How? Compare cell value to cell value in all other rows, count matches
              daniel.ayoub.0

              Jonathan, Mary,

               

              I really appreciate the answers; they both helped me answer many of the questions I had while looking for a solution.

               

              I tried the SQL self-join, however I have ~100,000 rows and the database quickly grew too large. Doing this in R looks like it could be the right path -- going to give this a try tonight.

               

              Thanks

              Daniel

              • 6. Re: How? Compare cell value to cell value in all other rows, count matches
                daniel.ayoub.0

                Thanks Jonathan! I ended up using the R script Mary wrote to do the calculation -- this worked when the tableau sheet was viewed at the level of detail of an individual row in the database (i.e. unique row id | R calculation). Your explanation was extremely thorough however, and I think both of these would be incredibly helpful to anyone looking to do these types of calculations.

                 

                New question: When I aggregate the data up 1 level, tableau no longer sends the row-level data to R. Instead it sends the aggregated data to tableau, which yields the wrong answer in this case (and maybe this is a topic for a different thread),

                 

                To summarize:

                 

                This works with an R-script calculated variable when viewing at the row-level level of detail:

                Row ID | Widget | Latitude | Calculation (R-script)

                     1            A           32             3

                     2            A           32.1          3

                     3            A           32.12        3     

                     4            B           35.1          2

                     5            B           35.3          2


                When I aggregate at the Widget level, I would like to take an average of the calculation column calculated at the row-level, however in this case it just re-does the same calculation with pre-aggregated data. That is to say, Tableau is passing R the AVG(Latitude) data pre-aggregated at the widget level. I would like to pass it the row-level Latitude, do the calculation, and then aggregate the calculation values to produce the following:

                Widget | avg(Latitude) | avg(Calculation)

                     A             32.07           3

                     B              32.1            2

                • 7. Re: How? Compare cell value to cell value in all other rows, count matches
                  Mary Solbrig

                  To clarify: Given the following data:

                   

                  Row IDWidgetLatitudeR Script
                  1A32.03
                  2A32.14
                  3A32.55
                  4A32.64
                  5A33.03
                  6B35.12
                  7B35.02
                  8B361

                   

                  Would you like to first compute the near values and then average, or average and then compute the near values?

                  As in the following would be computing the number of near values and then averaging:

                  A = (3+4+5+4+3)/5 = 3.8,

                  B = (2+2+1)/3 = 1.667?

                  The following would be averaging, and then computing the number of values within .5 of the average:

                  avg(A) = 32.5, Count near Average(A) = 5

                  avg(B) = 35.366, Count near Average(B) = 2

                   

                  Oddly enough the latter is easier to do (since it doesn't require comparing every value to every other value):

                  IF ABS(Latitude - {FIXED Widget: AVG(Latitude)}) <=.5 THEN 1 ELSE 0 END

                   

                  The former requires aggregating the results of an R call, or returning values at a different level of aggregation than you sent it.

                   

                  The most straight forward, and probably best solution for this situation:

                  Use a trick similar to the one in the article below: Keep Row ID in the view, either in the rows or details shelf, and then use a FIRST() filter to hide the duplicate marks:

                  Aggregating By a Dimension Not in the View | Tableau Software