There are a lot of spatial data sources that need to be geocoded to point lat/lon locations in order to map them easily in Tableau.  I’ve always done my geocoding as a data prep process using Python.  This week I’ve decided to get all fancy and see what how my workflow would change with the new Python integration in Tableau Prep Builder 2019.3.1 (still in pre-release as of 7/26/19 when I’m writing this).

 

All in all, it’s pretty nice to be able to build geocoding directly into a Prep workflow!  There are still some limitations (I think, maybe you don’t see them as limitations). Since I’ve taken some time to figure out how to make this all work, I figured that I’d do a quick write up to document in case it helps someone else.

 

The short version of the steps (more detail on each later in this post!):

  1. Install TabPy and get it running.  Make sure you’re using Tableau Prep Builder 2019.3+ with support for Python scripts
  2. If you want to use Mapbox or Google geocoding, get yourself an API key (Google and Mapbox)
  3. Learn a bit about how Python will work in Prep
  4. Add in your address data and either format like my sample dataset, or prepare to do some minor adjustments to the Python script so that it works with your address format (I’ll explain how the script works; it isn’t hard to edit)
  5. Add the Python script as part of your workflow.  I’ve written three sample scripts to get you started – they all look pretty similar and just call out to different services.  I’ll explain how they work later on in this post.

 

Geocoding_Mapbox

Geocoding_Google

Geocoding_OSM

 

And some sample data that I used to test

 

Step 1:  Install TabPy & get it running

 

Go to the Tableau Github repo and check out the code for TabPy.  Follow the instructions to install and get it running. 

 

To use it with Prep, I believe you need to be using Python 3.6 or later.  Part of the incentive for me to figure out geocoding in Prep was that I was finally getting around to setting up my Python environments on a new computer, so I was starting fresh…I can offer no help if you need to upgrade everything first.  My general solution has always been to just wipe everything out and start new – but I don’t have many Python environments to set up. 

 

While you’re at it, make sure you’re using a version of Tableau Prep Builder that supports Python (2019.3.1 pre-release does)

 

Step 2: Get your API keys

 

I’ve written three scripts for geocoding: Google Maps API, Mapbox API, and Open Street Map.  Google and Mapbox will require an API key to geocode using their service:

 

Google API Key Info

Mapbox API Key info

 

While you’re at it, read the terms of service for the different geocoding services and make sure that your use isn’t going to violate any of the terms!

 

Google terms of service

Mapbox terms of service

Open Street Map terms of service

 

Step 3. Learn a bit about how Python a python script will work in Prep

 

I will admit that I haven’t read much of the documentation, I just jumped in and tried to get things working.  Some other helpful blog posts if you want some more info:

 

The big things that I learned along the way:

  • Remember to start the TabPy server, otherwise your script won’t run.  Not that this problem happened to me or anything…
  • Data is passed from Prep as a Pandas data frame. 
  • Data is returned from Python as a Pandas data frame, and Prep will update the columns in your data source according to the return.
  • If you want a new column (or two or three) to be created, they either:
    • need to be specified in a get_output_schema function in your Python script, or
    • need to exist in your Prep data source before they are returned by your script, otherwise they just disappear into the Python void (I think of this as the Tableau python eating them like tasty little data morsels).  If they aren’t already in your data source, add these new columns in before you add your Python script to the workflow. In the example below, I added Latitude, Longitude, and Accuracy right after dropping in my table of addresses.

 

  • And one last thing that was super helpful to me… I wrote all of my scripts and tested them outside of Prep before converting and using in Prep.  This made them way easier to debug.  Maybe it’s just me, but I find it difficult to debug in Prep.  Since the input and output to the script are both Pandas dataframes, it’s pretty easy to write and test the initial script – just read in your data to a Pandas dataframe, write and test your processing, and then do some minor modifications to get it into Tableau (mainly remove your step to read in your data to Pandas dataframe, and change the input on your function to bring in a dataframe, since that is what Tableau is going to give you….)

 

Step 4. Drop it all in Prep and run it! (the actual step-by-step of what I did…)

 

Start with a data source that needs to be geocoded. 

 

I didn’t have my own dataset that I needed to geocode, so I downloaded a set of addresses from here.  I converted the .json to an Excel file so that it was easy to work with in Tableau. That converted data source is here if you want to play along.

 

Add the Excel file to Tableau Prep – I’m using the sheet called ‘json’ (silly naming, but it made sense at the time because my data was originally a big json string )

 

 

Check out the format of the table and either add in the columns we want for our geocoding output, or make a note of the existing schema for the table and get ready to add that into your Python script in the get_output_schema() function).

 

Here are two examples of this...

 

Option 1 - Add your new columns in advance (this is what I'll use for the examples in this post)

Here I’m adding Latitude, Longitude, and Accuracy.  Note that case of the field names is important!  This has to match in your data table and in your Python script (otherwise the returned data values will disappear like tasty little data snacks for your friendly Tableau Python).

 

 

Add all three fields as calculated fields (I just put in default values of 99.99 for latitude and longitude and ‘accuracy goes here’ for accuracy – that makes sure they have the right data type for the values that I’ll return from my Python script):

 

 

After adding all three new fields I have this

 

 

Option 2 - Use the get_output_schema() function in your Python script

If you prefer, you can just add a new function into your Python script that will let you return new fields without pre-creating them in Prep before you add the Python script.  If you opt to do that, you'll need to add a function that specifies the schema for the ENTIRE table you want to return and have available in Prep.  So, if you only want two fields to show up in the returned result, you just have two fields in the get_output_schema() function.  If you want the entire input table to be returned + any new fields, you'll have to define ALL OF THE FIELDS. 

 

I would opt to do whichever takes less time.  If you have a super-simple set of calculations being returned, it's probably faster to just add the new fields in advance as calculations in Prep.  If you have a large number of new fields being returned, it might be easier / faster to define the new schema (unless you're starting out with an enormous number of columns in your data source to begin with... do whatever is easiest for you).

 

The function looks like this - note that I've included it in the scripts on Github for reference (it's commented out so that it won't run by default), but am not using it in my example below.

 

 

 

 

Back to our step-by-step (using the method with the fields added in Prep before running the script)...

 

Now that we have the table set up, let’s look at the Python script that we’ll run (I’ll walk through adding it to Prep after we see what the script is going to do…).  Since all three of the scripts are pretty much the same, I’ll just use the Mapbox API geocoding as the example to explain the basics of how it works.

 

Here is the whole script, then we’ll break it down and walk through it part by part…this is just an image, but you can copy the individual text below – or just take a look on Github:

 

 

What is this doing?

 

First – we define the function that is going to be called by Tableau Prep, take in a data frame as input, and return a data frame to Tableau Prep as the output:

 

def Geocode_Mapbox(df):

<the code that does stuff goes here>

return df

 

Add in your mapbox API key.  To make it easier, I dropped a constant up top and just refer to the key by name (MAPBOX_TOKEN)

 

MAPBOX_TOKEN = YOUR_API_KEY_GOES_HERE

 

Then we fill in the magic geocoding in the middle.  We’ll walk through every row in the table from row 0 to the end (we get the count of rows from looking at the length of the data frame):

 

for i in range(0, len(df)):

 

For each row, we’ll build up an address string that we can pass to the geocoding service as part of a URL. To do that, we concatenate all of the individual address parts.  If your address is already in one column, you can skip this part and just move on to replacing the spaces with ‘+’ to use in the call to the geocoding service. 

 

loc = df.iloc[i]['Address'] + ' ' + df.iloc[i]['City'] + ' '  + df.iloc[i]['State'] + ' ' + str(df.iloc[i]['Zip'])

 

and then replace the spaces between words with a +

 

loc = loc.replace(' ', '+')

 

Then we create a URL to use to call out to the geocoding service -it’s just squishing the location (loc) into the right place in the URL and then tacking our geocoding API key onto the end:

 

geocode_url = 'https://api.mapbox.com/geocoding/v5/mapbox.places/' + loc + '.json?access_token=' + MAPBOX_TOKEN

 

Then we use the urllib Python library to send that URL out to the service and get our geocoding results back. The return is a json string that we load up so that we can read it easily using the Python json library.

 

req = urllib.request.Request(geocode_url)

r = urllib.request.urlopen(req)

req_body = r.read()

j = json.loads(req_body)

 

Then we just grab the details we want from the JSON string.  Sometimes there are multiple returns from the geocoding query in Mapbox, so we’ll take the first (feature index 0 in the json string):

 

feature = j['features'][0]

 

So long as there is an accuracy value that was returned, we’ll add that to our data frame:

 

if 'accuracy' in feature['properties']:

df.set_value(i, 'Accuracy', feature['properties']['accuracy'])

 

Same with the latitude and longitude – if there is a coordinate returned, grab the values and add them to the data frame.  The feature ‘center’ returns the coordinate as an array with longitude as the first element and latitude as the second element.

 

if 'center' in feature:

df.set_value(i, 'Longitude', feature['center'][0])

df.set_value(i, 'Latitude', feature['center'][1])

 

After running through all of the rows in the data frame, we return the results to Tableau Prep and enjoy having geocoded data:

 

return df

 

Now that you know a bit about how the script works, let’s add it into Tableau Prep.

 

Set the connection type to TabPy Server and connect. I run mine on localhost using port 9004 (the defaults).  Use whatever you set up for your own TabPy Server!

 

Now it’s as easy as selecting your script location and telling Tableau Prep what function to call…

 

The script should automatically run and will populate the table with new values:

 

If there are any problems with the script, an error message will come up on the Alerts icon in the upper right corner of the Prep window. 

 

And that is all there is to it!  I just walked through the details on the Mapbox geocoding example, but the Google and OSM examples are pretty much the same – just walk through the data table one row at a time, create a URL to send off to a geocoding service, get the result back and parse it into columns to return to Tableau.  There are a few minor differences in how the data is parsed because every service returns the matching addresses in a slightly different way, but otherwise the process is the same!

 

And, remember, you can either pre-create the columns for your returned values OR use the get_output_schema() function to define the schema for the returned dataframe. 

 

 

A few more things to think about

  • Geocoding services typically charge per address.  Both Google and Mapbox offer a generous number of free geocodes per month for accounts, but after that there is a small charge per address.  When you use these scripts as part of a Prep workflow, be cautious with how often the script runs.  You don’t need to keep geocoding the same data over and over – it’s not likely the location attached to an address will change! I recommend geocoding, then saving the result as a .hyper file and then just using that .hyper in the rest of your workflows so that you don’t re-geocode the same data repeatedly.

 

  • You could also just do a little bit of modification and run these outside of Prep and just hard-code the values into your table before even starting your data processing in Prep.  Do whatever works for you!

 

  • Geocoding services may all return slightly different location values. There are multiple ways to obtain latitude and longitude values.  Some addresses will match in some geocoding services and not in others.  Sometimes the results are just different. Don’t freak out if you geocode with multiple services and the results differ a little bit.  We can geek out on the meaning of ‘accuracy’ in geospatial analytics some time at TC. I have a lot of strong feelings on the topic and I’m not afraid to share them with people.

 

And, of course, there are many other ways to geocode your data and some good Python libraries to help out (e.g., the geocoder library is pretty cool). Use whatever works for your process! 

 

Good luck and good geocoding!