It’s all well and good to do calculations on geography within one table of data, but a lot of the fun analysis is the result of combinations of data from multiple tables – like distance from customers to stores, or intersection of hurricane force wind polygons with Census tracts, etc.  So, let’s explore how we can use PostgreSQL to combine and compare geographic data across tables.

 

Good things to keep in mind:

 

  • You’re often combining tables. Think carefully about how many rows are being compared in your query.  Every row against every row?  That adds up (actually, multiplies!) quickly.

 

  • The more spatial calculations you do, the slower the analysis.  Spatial indexes will help minimize unnecessary spatial calculations.

 

    • For use in Tableau, if the calculation doesn’t have to be dynamic, do it as a pre-processing step.  That will speed things up when you want to introduce interactivity in Tableau.   If the calculation does have to be dynamic, know that you’ll get a performance hit by having to pass the query back and forth a bunch to PostgreSQL, so you’ll want to be cognizant of the size and complexity of your datasets.

     

    When working with complex analyses that need to be done in Tableau, I suggest sticking with Initial SQL if possible, and not relying on custom SQL or RawSQL, if possible. That makes it so that the result becomes a new data source and the query isn’t re-run every time you update a viz. It’s a bit less ‘chatty’ than Custom SQL or RawSQL and you can use more complex queries more easily.  The query will update when you open the workbook, but otherwise will just result in a static data source.

     

    There are some limitations of Initial SQL, though – for instance, you can’t dynamically adjust the query with parameters on your worksheets.  You *can* do that with Custom SQL and RawSQL, however.

     

    What sort of things can we do?

    In this post we’ll walk through examples for a few common analysis functions and break them down based on spatial question...  To keep the examples simple I'm really going to just work from ONE table of data, but will create a second table using a subquery.  For instance, most of the examples will rely on a US Counties dataset and a subquery that contains a single county or a buffer around a county that we will use as our 'second table.'  If you want to try similar queries with a similar dataset, you can download some good cartographic boundary files from the US Census.

     

     

    Where are the nearest x things...

    What are my neighbors?  A common question is ‘what are the K closest points to another point of interest?’ For instance – where is the closest coffee shop?  Or, what are the three closest distribution facilities? What are the closest fire stations to each reported incident?  We can answer this type of K Nearest Neighbor (KNN) problems easily with SQL.    But, this is a big enough type of question that I’ve written a few separate posts on just this topic!

     

     

    These posts mostly discuss finding the nearest neighbors with Euclidean distance.   If you want to find the nearest point along a road network, you can use PGRouting in PostgreSQL.  Here are some additional posts discussing those (and related) problems:

     

    What is here / near / overlapping / inside...

     

    While questions about distance between points is super useful, there are additional interesting proximity-based questions that we can explore with polygon data.  Since polygons are two-dimensional we can look at spatial questions that involve containment (like inside or outside), adjacency (touching), as well as questions related to distance.

     

    If you want to get super-geeky talking about spatial relationships, take a look at the DE-9IM (Dimensionally Extended 9-Intersection Model). The model can be used to classify possible spatial intersections.

     

    I’m not going to get super geeky and will try my best to just cover the basic interesting spatial relationships and how you can use them.

     

    Here are a number of good spatial operations you can consider for your analytic workflow.  All of these spatial operations will require multiple geographies as input, and most can work with points, lines, or polygons as input (though remember that for anything involving overlap, you need an AREA to overlap with, so you'll want a polygon to define the overlap area).  Most of the time the geographies in these queries are going to come from different tables in your database and you’ll be looking for relationships between those tables.

     

    To make things easy for these examples, I'm just going to use geography from one dataset, but I'm going to create a little temp table with a subset of the geography so that it acts like two tables being compared.

     

    Evaluating adjacency

    ST_Touches (https://postgis.net/docs/ST_Touches.html)

    (GEOMETRY)

    Returns a BOOLEAN indicating whether the GEOMETRY touch

     

    ST_Touches can be used to look at relationships between polygons, lines, and points except for point-to-point relationships.  For this to be true, the only points in common between the features must be on a boundary (e.g., polygons that share an edge, but do not overlap at all).

     

    In this example, we'll use ST_Touches to find all of the neighbors for a specific polygon of interest.  For instance, if we want to compare attributes for a county to all of its directly adjacent neighbors. The example below compares area of a selected county (Powder River County) to all of its neighbors.  The result tells me that it is smaller (land area) than all but two of its neighbors.

     

    To run this query, I'm creating a temporary selection of ONE county using a subquery that returns a single polygon based on the geographic ID (that is the 'With' statement in the query below)

     

    A query to identify touching polygons (and to label the results as being the ‘selected polygon’ or a ‘neighboring polygon’ would look like this:

     

    -- select a single polygon as our target location

    -- This is particularly nice when paired with a parameter action in Tableau

    -- to allow a user to just click on a polygon to update the parameter

    with selected as (

        select wkb_geometry as selected_geom   

        from tl_2017_us_county

        where geoid = <Parameters.selected county>

    )

    -- now find everything that touches it

    select *,

    -- case statement so we can label the counties as

    -- selected, neighbor, or not neighbor

        case

            when geoid = <Parameters.selected county>

                then 'Selected'

           when ST_Touches(selected.selected_geom, county.wkb_geometry) then 'Neighbor'

           else 'Not Neighbor'

        End as neighbor_type

           

    from tl_2017_us_county as county, selected

     

    That would let us build up a visualization like this in Tableau - where we have a selected county and its neighbors highlighted, and a bar chart showing the area of the selected county compared to the neighbors.  Color encoding and the selection of counties to include in the bar chart are based on the 'neighbor_type' data returned in the query above (that's what we get from the case statement)

     

    Evaluating overlap

    ST_Intersects (https://postgis.net/docs/ST_Intersects.html )

    (GEOMETRY or GEOGRAPHY)

    Returns a BOOLEAN indicating whether the GEOMETRY or GEOGRAPHY share any portion of space.

     

    ST_Contains (https://postgis.net/docs/ST_Contains.html )

    (GEOMETRY)

    Returns a BOOLEAN indicating whether the GEOMETRY of B is completely inside the GEOMETRY of A

     

    ST_Intersects and ST_Contains give us two ways to look at whether or not two polygons, or a polygon and point or lines share space.  With ST_Intersects, the geometries just have to overlap.  With ST_Contains, the geometry for one feature has to be completely inside the geometry of the reference polygon.

     

    Here is an example of the difference (ST_Intersects on the left, ST_Contains on the right).  These examples find all of the polygons that intersect or are completely contained by a 150km buffer drawn around a selected county.

     

     

    The queries for these look very similar to the one that we used to find all of the neighbors that TOUCH - but our selected polygon is the BUFFER around the selected county.  See the bold text below for the differences:

     

    INTERSECTS

    -- select a single polygon as our target location

    -- This is particularly nice when paired with a parameter action in Tableau

    -- to allow a user to just click on a polygon to update the parameter

    with selected as (

        select ST_Buffer(wkb_geometry, 150000) as buffer_geom 

        from tl_2017_us_county

        where geoid = <Parameters.selected county>

    )

    -- now find everything that intersects with it

    select *,

    -- case statement so we can label the counties as

    -- selected, neighbor, or not neighbor

        case

            when geoid = <Parameters.selected county>

                then 'Selected'

           when ST_Intersects(selected.buffer_geom, county.wkb_geometry) then 'Neighbor'

           else 'Not Neighbor'

        End as neighbor_type

          

    from tl_2017_us_county as county, selected

     

     

    CONTAINS

    -- select a single polygon as our target location

    -- This is particularly nice when paired with a parameter action in Tableau

    -- to allow a user to just click on a polygon to update the parameter

    with selected as (

        select ST_Buffer(wkb_geometry, 150000) as buffer_geom 

        from tl_2017_us_county

        where geoid = <Parameters.selected county>

    )

    -- now find everything that intersects with it

    select *,

    -- case statement so we can label the counties as

    -- selected, neighbor, or not neighbor

        case

            when geoid = <Parameters.selected county>

                then 'Selected'

           when ST_Contains(selected.buffer_geom, county.wkb_geometry) then 'Neighbor'

           else 'Not Neighbor'

        End as neighbor_type

          

    from tl_2017_us_county as county, selected

     

     

    Creating new geography based on overlap

     

    What if we want to create new geography based on the overlap or intersection of geographic features?  In other words, what if we want to clip roads based on an outline or separate polygons based on an intersection?  We can use ST_Intersection for that...

     

    Here is an example in SQL Server that I wrote up a while back if you want more use cases:

     

     

    Here is what the intersection would look like in Tableau - notice that Park County is now split into two polygons...one red polygon that intersects our buffer, and one grey that is outside the buffer.

     

    And here is an example of what the query would look like in PostgreSQL with the counties dataset from other examples in this post - it looks a little more complicated because Tableau only allows use of a single geometry type at a time - and the intersection can return polygons and multipolygons (so we force everything to a multipolygon with ST_Multi), and because we have to reinforce the geometry type and coordinate system for the resulting geometry (that is the ::geometry(multipolygon, 4269) part of the query), but really it's not that difficult:

     

    with selected as (

        select ST_Buffer(wkb_geometry, 150000) as buffer_geom

        from tl_2017_us_county

        where geoid = <Parameters.selected county>

    )

    select

         counties.*,

         ST_Multi(ST_Intersection(counties.wkb_geometry, selected.buffer_geom))::geometry(multipolygon, 4269) as Neighbor,

         selected.buffer_geom as geom_buffer

    from

         tl_2017_us_county as counties, selected

    where ST_Intersects(selected.buffer_geom, counties.wkb_geometry)

     

     

    By calculating out the intersecting areas, we can then do other fun calculations like proportionally allocating attributes to the two segments of the polygon.  As an example - if we were looking at the area impacted by hurricane-force winds and wanted to know how many people were impacted, we might over-estimate if we just used the entire population for an overlapping county. BUT, if we know that only 20% of the county intersects the hurricane force wind zone, then we can make a new calculation and just use 20% of the population for that county.  It's easy to do that by comparing the area of the intersection with the area of the entire county.

     

     

     

    That is just a few options of how you can combine geographies to make selections.  There are plenty of other interesting operators to look into and I encourage you to explore the options!