There are various ways you can make this quicker - though none of the ways I can think of has the point-and-click simplicity of using built-in geocoding.
The way that Tableau geocoding works is that it has a Firebird database with tables for each of the geographic roles (Country, State, ZipCode, etc). When you want to geocode something it retrieves all rows for the relevant role and loads them into memory to match up with the values in your data - in this case ZipCode. Note that "all rows" means a lot more rows than you really want for two reasons.
Firstly, it retrieves all rows for all countries, even if your viz is just concerned with a single country.
Secondly, the geocoding database has lots of aliases for various of the roles. The way the query is structured, that means that it retrieves multiple copies of the rows for any role with aliases.
So for Zip Code, you end up retrieving all Zip Code / Post Code rows for all countries, including multiple copies for countries with multiple aliases. In the case of the United States there are 42,000 Zip Code rows, but those are retrieved 9 times, because there are aliases like "United States", "United States of America", "America", "US", USA", "United States (the)", etc. So the United States accounts for 380,000 of the 500,000 rows you've seen loading.
As an aside - the number of aliases has been creeping up since version 7 first came out. There were originally 7 aliases for the United States, but "United States (the)" and one other have been added since. They will have slowed the gecoding step down by another 20% or so. There's a real trade-off here between functionality and performance. The aliases (and some other smart fuzzy matching that goes on under the covers) really make geocoding much more effective and useable - they mean that geocoding just works with more data sources without having to worry about translating identifiers to match the specific name that Tableau happens to have chosen. But there's a performance penalty, as you've seen,
I've noticed that I usually only see that 10 or 15 second delay with Zip Codes the first time I load a viz and then only occasionally, if at all, after that. This is because Tableau caches the 500,000 rows and doesn't need to go back to Firebird to get the rows again, unless it has run out of space in it's cache and had to ditch the cached geocoding. It sounds as if in your case it is retrieving it nearly every time. That may mean that you have more data loaded in your view competing for the space, or maybe that you just have less memory available. I'm running on Windows 7, 64 bit with 8 GB or memory, which means that Tableau has as much memory as possible available (it's only a 32 bit application, so it can only access a bit over 3 GB) in total. If you are on a 32 bit version of Windows Tableau will only be able to use around 1 GB of memory for this sort of thing, which could explain why you are needing to reload the geocoding all the time.
So what can you do about it?
1) If you're on a 32 bit version of Windows you could try it on a 64 bit version with plenty of memory and see if that makes it better. That might just be easier than any of the other approaches.
2) As you have done - load the Zip Code to lat/lon lookup into the datasource for every view that wants to geocode Zip Codes. That's very ugly.
3) Create a separate data source containing the Zip Code lat/lon lookup and use data blending merge it into your viz. If you know the structure of your data, and particularly if you are only dealing with a single country, that lookup table can be quite compact and this approach performs well. You can just save that data source and connect to it in any viz that wants zip codes without the delays.
One way of doing this is just to connect to the Tableau ZipCodes.tds directly as a data source and blend with that. A couple of gotchas to watch out for there, though:
- ZipCodes.tds has all the Country aliases, so make sure your relationships limit it to just one copy of the country or countries of interest. Make sure you have an appropriate Country field in your primary data source. If you are just dealing with the USA, and don't already have a Country field, just define a calculated field Country with a value of "United States", for example.
- Blending from a Firebird database will hit a Firebird limit if you have more than 1500 Zip Codes to look up. The easiest way around that is to create an extract from the ZipCodes data source. You can minimise the size of that by hiding fields you don't need (especially the geometry column - that has the zip code boundaries which would make the extract really big, and they are no use to you in the extract). Also add a filter on the extract to limit it to just (ome copy of) the countries of interest.
4) The other thing I do is to create a custom geocoding database and purge all the rows I don't need for my visualisation. That tool I shared for hacking extra filled map layers into the geocoding database lets you do that. I don't think that's necessary in this case, though.
I've attached a Superstore sales example demonstrating the speed difference you get by blending from a cut-down extract off the ZipCodes.tds data source, as opposed to just the default - plus some views showing some of the stuff I've been talking about here.
Blend Zip Codes.twbx.zip 1.6 MB
Thanks so much for taking the time to give such an extremely thorough explanation and creating a workbook to boot! Our machines (at least until this Fall) are 32-bit with 4Gb RAM -- 2.5Gb is used by an assortment of system services -- and so your explanation of the amount of data being cached (or probably not being cached in my case) is very helpful. I was also curious to know why there were 500k records being joined from the zip codes data source and now have the answer to that question too.
I have a much better understanding of what is being done and will certainly make use of the techniques you provided. Thanks again for your help!
By the way, in putting that example together I did notice that it was re-reading the database quite a few times for me, too. That seemed to be triggered by certain types of changes I made to the workbook - which presumably caused it to need to refresh its cached view of the ZipCodes. I didn't take the time to try to work out why. Just flagging that simply moving to 64 bit won't make the issue go away completely.
Also, I should point out that all of that explanation of what is going on is just my interpretation of what I've seen by observation and looking at the logs - I'm fairly sure it's close, but I may have got some of the details a bit wrong.
FYI, this should be MUCH better in 7.0.4 which has just come out. Tableau no longer retrieves multiple copies of the zip codes for the synonyms, so geocoding is much quicker, especially for zip codes.
Thanks for the heads up! I downloaded, installed and re-opened the orignal workbook and performance is fanatastic. Again, I appreciate the assistance.
I am using Tableau 7.0.2 to map zip codes of a set of locations in the U.S. The end result is so slow (700 entries) that I had to turn off the automatic updates on the filter. Additionally, the map seems to require my users to be connected to the internet. Can anyone direct me to two things:
1) How I can have all of the US state maps and full country map available inside the packaged workbook so users don;t have to be online, AND
2) A way to speed this up so I can turn on the automatic filter refresh to make my users happy.
Tracey, I can answer #2 for you: download and install the maintenance update 7.0.4 immediately! It is considerably faster.