3 Replies Latest reply on Jun 27, 2018 11:13 AM by Jason Kim

    Join of two csvs results in unexpected cross product

    Jason Kim



      I'm using Tableau Prep to join two datasets together but it is creating almost a million rows when there should be less than 30k total. Based on this behavior and the join preview, I believe a cross join is being done rather than a regular join because I see several copies of the same record being generated. 


      I have two data sources that I want to do an outer left join so that the left table (clean-collisions2) contains fields from the right table (TPS KSI dataset). This right table contains lat, lon, and neighbourhood name. The left table contains lat, lon, and descriptive info for car collisions. I'd like to essentially add this neighbourhood name column to the left table where there is a valid match between the two sets of rounded coordinates so that I can show in what neighborhoods collisions tend to occur.


      For both sources, I created calculated fields which round the latitude and longitudes to 3 places, then I did a left join on these rounded values for lon and lat. Rounding is necessary since the two tables have slightly different coordinates. 


      My problem is that joining on the rounded coordinates (lat and lon) leads to several hundreds of thousands of extra rows being created, when there should be less than 30k since there are 30k rows in the left table.


      Any idea why this is occurring and how I can get Prep to do what I want it to do?


      I've attached my flow file as well as the two datasets for reference.



        • 1. Re: Join of two csvs results in unexpected cross product
          Jason Kim

          Don't know why the attached flow file has zip but here it is again without

          • 2. Re: Join of two csvs results in unexpected cross product
            Joshua Milligan



            This is a very difficult type of geospatial problem that doesn't have an easy solution.  And I'm not an expert when it comes to solving, but hopefully I can give some pointers.


            You are trying to match individual points (collisions) to other individual points (multiple lat/lon points for each ward/hood).  But in reality, (if I understand what you are trying to do) what you really want to do is find out whether a given point (collision) is within an area (hood/ward).  Assuming you had the polygons/shapes for the hoods and wards, then there are platforms that would allow you to perform this geospatial match.


            With Tableau Prep, you will be able to approximate a match, but it won't be 100% perfect.  There are a few approaches you might take:

            1. Rounding Lat/Lon
            2. Matching all points to all points, calculating distance, and keeping shortest distance for each collision
            3. Hexbins
            4. Finding Min/Max latitude and longitude for each hood/ward, matching where the collision is between the extremes, discarding the duplicates.


            Here are my thoughts on each:

            1. You attempted #1 and found that you ended up with a lot of extra records.  This is actually not unexpected (though definitely not desired).  The reason is that multiple points are getting rounded to the same value and so instead of getting a 1-to-1 match (1 collision per ward/hood) you end up getting many-to-many (lots of collisions matched to lots of different ward/hood points) and that's why your record count exploded.  Furthermore, you'll get a lot of points that don't find a match at all.
            2. This is actually the solution that might get you the best accuracy.  However, if you join all-to-all (cross-join) you'll have a ton of records and that might not be tenable as a solution if your starting data sets are large.  But you can find some calculations for distance here: https://vizpainter.com/mapping-distances-an-alternative-approach
            3. The idea of the hexbin is that you can group the points together into bins (of a hexagon shape).  Depending on how you scale, you'll either have overlap or gaps (or both) and have to de-duplicate your data afterward.  Tableau Prep does include the HEXBINX() and HEXBINY() functions, so this is a possible approach, but I think it will probably get complex and frustrating.
            4. This is the approach I took to solve your particular case.  It's good and results in fairly decent looking matches, though quite a few points on the edges don't find matches (a more complete set of data for hood/ward might alleviate that issue)


            Here's the final result:


            As you can see in the lower left status bar, I have all the collisions and the matches to Wards/Hoods looks pretty good.  But it's obviously not 100% perfect.  The black marks on the south and west sides didn't fall within the known boundaries of Ward/Hood.  Is it enough for good analysis?  You'd have to decide.


            Also, you could visually supplement your data by manually selecting some of the unmatched points, exporting the data, and manually assigning a ward/hood and then including that in your Tableau Prep Flow.



            Here's how I did it:


            There's a lot here and I won't be able to describe it all.  I've attached the flow so you can look and ask specific questions as needed.


            The broad outline is this:

            For the Hoods/Wards:

            • Create a Key combined from Ward ID and Hood ID to match later (Step: Create Key)
            • Get a maximum and minimum for both Lat and Lon for each unique key (Min/Max Lat/Lon)
            • Also, make sure I only have one record per unique Ward and Hood (Unique Ward/Hood)


            For the Collisions:

            • One branch to retain all the original data (All Collisions)
            • Aggregation to get one record per Colission ID, because the data contained multiple records for every individual in the wreck (Lat/Lon per Colission)



            • Join the Collisions and Hood/Ward where the colission Lat/Lon was between min and max Lat/Lon of the Ward/Hood
            • Since there will be cases where a collision will find multiple matches, we aggregate to choose one match (this is somewhat arbitrary and loses some accuracy)
            • Match back in the ward & Hood details
            • Match back in the collision details
            • Cleanup extra fields and give a "<no match>" description to collisions that didn't find a ward/hood


            Hope that helps!


            1 of 1 people found this helpful
            • 3. Re: Join of two csvs results in unexpected cross product
              Jason Kim

              Wow Joshua, thank you so much for your detailed run-down of the scenarios. As someone who is new to using geo-spatial data, it helps a lot.


              I think your solution is pretty good to use for analysis -- I will need to do some more detailed analysis to see whether excluding the literal "edge cases" where matches are harder to make will heavily skew the data.


              I am working on something similar to solution #2 by calculating the Euclidean distance between a collision point and the closest point belonging to a hood -- this should be doable within Prep as far as I know.


              The best, most accurate solution I've found is outside of Prep/Tableau... I can use GIS software to see which points are inside a neighbourhood polygon. I don't believe there's a way for me to do the same spatial join in Tableau? I've made several dual axis maps using shapefiles in Tableau but spatial joins don't appear to be supported.


              I've found this article on how to do spatial joins in Tableau prior to when it supported shapefiles so maybe I'll study this more and see what I can come up.


              Thank you again!!