You can certainly still refresh a Firebird extract with desktop version 7 and it stays as Firebird. You have to explicitly select the upgrade option to convert it to data engine format (although you will be prompted to do that when you open it in desktop).
I have done this once or twice for the very reason you mention (RAWSQL) - and in fact I've hung on to a copy of 5.2 so I can still create an extract in Firebird format and then upgrade to a later version of Tableau to get the other new features (though I haven't resorted to that for a long time).
You mentioned scheduled refreshes of extracts, which I presume means you have Server. I don't know for certain that the same applies there, but I'd be very surprised if it doesn't.
Thanks a lot for your feedback. This is quite reassuring, as so far I couldn't find any alternative way of getting to the result I'm after. So if at least 7.0 (hopefully on the server side as well) can safely refresh the extracts built in 5.2, that's a solution I can still envisage.
As I know you are really good with maps in Tableau, may I bother you a little about my problem?
My objective is to build a dashboard that allows filtering on a city or postcode (maybe both at the same time), most probably at country level for a start, but maybe later for multiple countries in Europe.
The end-goal is to be able to select a point (via the postcode/city) then filter on distance to find the number of prospects/customers available in the resulting area.
I had found various nice examples of similar dashboards on Clearly and Simply. However, I have additional requirements on top of what was built, including frequent data refreshes (thus this thread) and rather large data volumes given the need for on-the-fly calculation (already 2 million records just for the postcode-to-latitude/longitude data for the UK pilot, therefore not including customer data).
I had envisaged a double data source structure, keeping the reference postcode-to-lat/long data in a static Firebird TDE while having the customer/prospect data in a new TDE, but blending the data makes creating the proper distance calculation a bit tricky, at least with my knowledge of blending and aggregations in such calculations.
Is that something you are familiar with and could advise me on?
Now that you've explained a bit more about what you're doing I'd be interested to know a bit more about why you need the RAWSQL. One thing which might do away with the need for it is that in Tableau 7, a data extract materialises the results of (some) RAWSQL expressions - i.e. if it sees that the value of the RAWSQL expression for any particular row will always be the same, it evaluates it while creating the extract and stores the result. That partially gets around the restriction on RAWSQL with the data engine. It can't do that if the value will vary (for example if the RAWSQL expression involves the value of a parameter).
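To illustrate the distinction (field and parameter names here are hypothetical, not from the thread), the difference in a Tableau calculated field might look like this:

```sql
-- Could be materialised in a Tableau 7 extract: the result for any
-- given row never changes, so it can be evaluated once while the
-- extract is created and the value stored.
RAWSQL_STR("UPPER(%1)", [City])

-- Could NOT be materialised: the result depends on a Tableau
-- parameter, so it would have to be re-evaluated at query time,
-- which the data engine cannot do.
RAWSQL_REAL("%1 * %2", [Distance M], [Unit Conversion Parameter])
```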
On your wider question about what you are trying to do - I can see the challenges. I've done a lot in this space as you probably know. You mention that you've looked at Robert Mundigl's work on Clearly and Simply, so you probably saw my follow-ups to that which described some ways to get around some issues Robert was having.
I also did some analysis which sounds functionally very similar to what you want (aggregate metrics from one table, selected based on their distance from a selected point from another table), which I wrote about in this blog posting on New Zealand Earthquakes. But the technique I used there (for selecting a post code and getting its latitude and longitude) used a trick with parameters which wouldn't scale to the millions of rows you are talking about.
More recently I've been experimenting with doing some of the work in a spatial database. That can be either preparation of the data outside Tableau using spatial functionality and then importing into Tableau for visualisation, as in this blog post and Shawn Wallwork's accompanying Viz Talk post, or it can be a live connection to the spatial database to allow more dynamic selection (though you can only visualise points, not filled maps driven off the spatial database directly, unfortunately). I haven't written anything up anywhere about what you can do with a live feed to a spatial database. Whether or not that was any use to you would depend on how you want to visualise the results and how you want to deliver them (i.e. you need a live feed, so it can't be on Public or using packaged workbooks sent out to people).
I had a bit more of a think about this.
I think that the key to making it work well at the data volumes you are talking about (though you don't say how many customer rows you have) is likely to be finding a way to force the selection by distance from the selected centre to happen in the database back-end.
I had another read of that series I wrote on Clearly and Simply to remind myself about the issues. I discussed three approaches to doing the distance calculation in the first article in that series. The first two approaches (table calculation and data blending) result in the latitude and longitude of the centre only being available in Tableau - so the calculation of which ones were within range and the subsequent filtering all had to happen in Tableau. That wouldn't scale well if you have hundreds of thousands or millions of rows of customer data (I get the impression you may have that sort of amount).
The third approach I used was a bit of a bodge - I embedded the postcode and the latitude and longitude all within a single parameter, but then used an alias to show only the postcode. That let me pass the centre latitude and longitude down to the database back-end to do the calculation and filtering. But I really don't think that will scale to two and a half million postcodes. Also, when looking at the article again I noticed that Tableau's display of the parameter value has changed subtly since I wrote that (on Public anyway - probably with version 7). When you select the postcode to edit it, it now displays the long string with the embedded latitude and longitude. So with the number of postcodes you want to choose from and the fact that the display is now horrible I really don't think that way is an option.
The only other approach I can think of would be to use a "guided analytics" approach, whereby you have a view of your locations (cities, postcodes) which allows you to select a single location in some way, and then uses a single-select filter action to carry that location over to another view which uses a datasource combining your location list with your customer data in a cross-product. Once again, that approach would allow you to do the heavy lifting of selecting based on distance in the back-end. As long as you only ever use that view with a single location selected, it should perform fine.
The only drawback I can see with that approach is that you wouldn't be able to create data extracts from it: an extract would evaluate all possible combinations of locations and customer data, so it would give you two and a half million copies of your customer data.
Depending on the number of rows of data, it might be that you would get better performance by storing your data in a spatial database and using spatial functions to do the proximity searching, rather than just the formula Robert Mundigl used in his Clearly and Simply postings.
Let me know if any of this doesn't make sense and I'll try to explain it more fully.
Wow, fantastic post again, Richard.
First, sorry that I didn't reply to your message from Thursday; I wanted to try the various methods you detailed on Clearly and Simply before going back to you. However, I've so far been facing a problem with the distance calculation while trying the table calculation method, where I can't get it to calculate the correct distance once filtered on a postcode value, even though the calculated coordinates of the center postcodes are right.
Once that's solved, I'll probably try using blending, following your instructions, to see how that performs.
I indeed noticed your trick of putting the list of coordinates/postcodes in the parameter, but I wasn't willing to test it in my context, as the number of values required would probably have crashed Tableau. But for other countries that don't require the same number of values, this could be a good solution. I also noticed the changed display when looking at the workbook you had made available as an example, even though in our case we currently only need a single location selection.

One thing I could see being useful, though I'm not sure it's possible, is to pass the parameter to the underlying SQL, which could be a stored procedure. Maybe there are other ways to achieve a similar result, but so far I'm not familiar with such methods; we have mostly been working with simple views until now, which didn't require complex thinking like this.
As you guessed, my customer data could be rather large as well. In this UK pilot we will certainly end up with close to a million records, if not more. And this data is bound to increase, even if marginally, over time.
I follow you on the possibility of pushing the query to the underlying datasource (in my case MS SQL Server); however, I'm not too sure how I could achieve that. First of all, the selection "interface" is something I'm not too sure how to attack in the context of Tableau.
When you say spatial database, I suspect it also implies a dedicated database server application. Is that correct, or is this something that can be handled within SQL Server? (Sorry if this seems a dumb question, but I'm not familiar at all with spatial databases.)
By spatial database I meant a database with support for spatial datatypes and operations. That lets you do things like defining an area (which could be a circle or a square, or it could be the boundary of a region like a county) and running queries which test for things like containment, overlap, intersection etc. So you could say: "find all the customers that fall within <N> miles of this postcode".
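As a sketch of what such a query might look like in SQL Server (table and column names here are hypothetical, not from the thread):

```sql
-- Build a geography point for the selected postcode.
-- geography::Point takes (latitude, longitude, SRID); 4326 is WGS84.
DECLARE @centre geography =
    (SELECT geography::Point(Latitude, Longitude, 4326)
     FROM dbo.Postcodes
     WHERE Postcode = 'SW1A 1AA');

DECLARE @radius_m float = 20 * 1609.344;   -- 20 miles in metres

-- STDistance on the geography type returns metres. This predicate
-- shape (indexed column on the left, compared to a constant) is one
-- that a spatial index on CustomerLocation can satisfy.
SELECT CustomerId
FROM dbo.Customers
WHERE CustomerLocation.STDistance(@centre) <= @radius_m;
```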
SQL Server does have support for that type of operation, though I haven't used it much personally.
Keep this thread going, I'll be really interested to see how you get on. From the numbers of postcodes and customers I doubt you will find that any of the three methods in that Clearly and Simply post work for you - I think you'll need to look at the other approach I talked about. That should work well if you drive directly off SQL server - it won't work if you need an extract.
Just following up on our discussion of my problem. I've gone a bit further in my tests, following your guidelines.
I've now built working views using table calculations (I fixed my problem with the distance, which ended up being simply a quick table calculation type that I hadn't set right) and blending.
Blending is somehow rather simpler to set up, even though performance-wise it appears to deliver the same sort of results as the table calculation. However, I didn't go for the full data. Instead I've been trying with a subset of the postcode reference data (which totals 1.8 million records). First with 50k records, where everything was fine, obviously. I then tried with an intermediate amount of 500k records, but it was already going slower (more than a minute for every load of the view). As we would have this published on Tableau Server, it would definitely get too slow for our users, as you had guessed.
Therefore I'm now heading towards your fourth approach, which should push the distance calculation to the back-end SQL Server. I've looked into the geography functions that are included in SQL Server (and which I didn't know about so far) and could easily figure out how to calculate distance to a center point etc., at least the bits that look relevant to this case.
I think the idea of guided analytics makes sense, at least in that it would remove the need for our users to know the postcodes or cities they want to filter on, as in the table-calculation or blending cases. Here the analysis would make them choose a location on a map first, and this would then trigger the rest of the analysis.
That's where my problem arises now. I'm "missing" the link between the filter action and the SQL side that would run the selected-location filtering. How do I pass the selected latitude/longitude to the back-end datasource so it can filter the data that then gets used by Tableau?
With regards to performance, doing a simple count while filtering on distance seemed OK (about a minute on the 1.8 million postcodes, but that's without setting any index on the table first, so there's room to get a bit quicker).
At least I now want to see the whole logic in place before I can judge the performance. But as said, I'll probably need your knowledge there again to help me put that in place.
Once we have evaluated the different approaches and how they perform, we can always decide to roll up our geographic selection level (from postcodes to postcode areas for the UK) in order to reduce the number of records. This should help improve performance. But that's a decision we don't want to take before we have tried those various methods.
Looking forward to your reply and advice again.
Very brief reply as it's late here and I just spotted this as I was about to turn my laptop off.
The approach I was thinking of for passing the lat/long is like this.
You have one view built just against the postcode table. This is where the user selects the (single) row of interest. A filter action on that view transfers them to the main view, passing the selected postcode. I would make sure that the filter action is set to be a single select only (read on, you'll see why).
The second view is built on a datasource which uses a cross-product between postcode and your customer data. I'm not sure how strong your SQL is, so forgive me if I'm explaining the obvious here, but just to be clear, a cross-product will return every combination of rows in the two tables. So with 1.8 million postcodes and a couple of million customers that's quite a big result set. But don't panic, you're not ever going to let it try to evaluate the whole result set. By passing in that single postcode, the result set will now return all customers, joined to just that one postcode. As long as your postcode record has a latitude and a longitude, you now have the fields you need for your calculation (i.e. you need to know the lat and long; you can't rely on Tableau geocoding, because you need the explicit values available for your calculation).
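The shape of that datasource might look something like this (table and column names are hypothetical):

```sql
-- On its own, this returns every postcode/customer combination -
-- 1.8m x 2m rows, which you would never actually evaluate in full.
SELECT p.Postcode, p.Latitude, p.Longitude,
       c.CustomerId, c.CustLat, c.CustLong
FROM dbo.Postcodes p
CROSS JOIN dbo.Customers c
-- With the filter action supplying a single postcode, the effective
-- result set is just all customers joined to that one row:
WHERE p.Postcode = 'SW1A 1AA';
```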
(I have successfully used that cross-product approach to join a table to itself 5 times - which would have returned a result set which I worked out to be larger than the estimated entire contents of the Internet! But by ensuring the right filters were in place the response was virtually instant. Cross-products are often regarded as bad - but they are actually very powerful on occasions.)
As for how to filter by distance. What you really want is to be able to use a spatial "contains" function. Find all points that are contained in this circle. But you need to be able to dynamically generate a circle centred at your postcode with your desired radius. I'm pretty sure that that would allow you to use a spatial index on your customer locations - which would mean you should be able to get very snappy performance. A spatial index is a special sort of index which allows rapid retrieval by using a 2 dimensional (or even 3 or more dimensional) bounding box to narrow down the rows which need to be examined in detail. I would expect that you could find all customers in a 20 or maybe 50 mile radius in a couple of seconds if you can exploit a spatial index properly. (That's a complete guess - but that's the order of magnitude of saving I think you should be able to get.)
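A sketch of that idea in SQL Server terms (names are hypothetical, and see the caveat in the next post about which functions SQL Server actually provides):

```sql
-- Generate the circle dynamically: STBuffer on a geography point
-- takes a distance in metres and returns the buffered shape.
DECLARE @centre geography = geography::Point(51.501, -0.142, 4326);
DECLARE @circle geography = @centre.STBuffer(20000);   -- 20 km radius

-- Testing each customer point against the circle in this form is a
-- predicate shape that can exploit a spatial index on
-- CustomerLocation, so only rows near the circle are examined.
SELECT COUNT(*)
FROM dbo.Customers
WHERE CustomerLocation.STIntersects(@circle) = 1;
```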
But I had a quick look at SQL Server's spatial functions the other day and although there is a good set of functions, there wasn't the one I was hoping for. I'm fairly sure this would work with functions that I know are supported in some spatial databases - I'm just not quite sure if it can be done with SQL Server. I do have a couple of ideas and was planning to try to find time to experiment at the weekend.
If not, you can certainly evaluate the distance the hard way for each customer against the postcode location and filter on the result - it just won't be quick at your volumes.
Now I see what you meant by cross-product. I thought you had in mind some way I didn't know of to pass the parameter to the source; I didn't think about the cross-joined table (that's what I call it myself, which is why I got confused by "cross-product" - I didn't realize it was the same thing).
Anyway, I'm not worried about the methods for filtering on distance in SQL, as I could figure that out quickly (which corresponds to belonging to a circle, as you described; I'm not sure using a circular shape and an intersects/contains function would make it any faster, but that's something I'll have to test as well). Creating the center point is no problem either, so I think I now have all I need to get my hands dirty again.
Again, I'll give it a try and will let you know of my progress. Thanks once more for the clever guidance.
Be aware that Tableau doesn't explicitly support a cross join - but you can achieve it in two ways, one of which is a bit of a trick but can make a HUGE difference to how the query performs - in fact, in some circumstances with some databases it makes the difference between completely non-viable and working fine.
The obvious way is just to use a custom SQL statement. The trouble with that way is that Tableau then needs to wrap the entire query inside an outer query, which for some databases causes the database to evaluate the full result set of the cross join before applying any filters.
The much faster way is to use a multiple table connection and define the join on columns in such a way that all rows of one table will match all rows of another. Some people do that with a <> condition between two columns that will never share a common value. I prefer to add a dummy column to both tables, with a constant value. I usually create a column called join_key with a value of 'X' - and then join on that.
Expressing it this way allows Tableau to inject filters in a way that makes life much easier for the database optimiser.
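A minimal sketch of the join_key idea, with hypothetical table names:

```sql
-- Add a constant-valued column to both tables; the DEFAULT fills
-- existing rows when the column is declared NOT NULL.
ALTER TABLE dbo.Postcodes ADD join_key char(1) NOT NULL DEFAULT 'X';
ALTER TABLE dbo.Customers ADD join_key char(1) NOT NULL DEFAULT 'X';

-- Every row now matches every row, so an ordinary inner join on
-- join_key behaves as a cross join - but it is a join Tableau's
-- multiple-table connection can express, so Tableau can inject its
-- filters inside the join rather than around a wrapped subquery.
SELECT p.Postcode, c.CustomerId
FROM dbo.Postcodes p
INNER JOIN dbo.Customers c ON p.join_key = c.join_key;
```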
The point about the spatial containership query with a circular shape and a spatial index defined is that it allows the query engine to visit only a very small subset of the rows. Your way has to do the calculation for every customer. If I remember, I'll try to mock something up, and if I manage to get it going I'll post the timing results. Depending on circle size, that could reduce the query time by close to the ratio between the area of the circle and the area of the country - so for small circles it could be hundreds of times faster.
Well I've had a bit of fun playing with this today, and have some interesting feedback for you. I've also learnt a lot.
The first thing I did was to experiment to see how much difference the use of a spatial index would make. To do that I dug out a copy of an old workbook and database I was experimenting with a year or two ago when I was working on that "site catchment analysis" on Clearly and Simply. After Robert and I had done all those postings using various different techniques in Tableau, Robert went one step further and not only repeated the analysis in Excel but also added in the ability to calculate the "optimum centre" - i.e. the location which covered the largest population within a specified radius.
To do that calculation, you (conceptually) need to work through all locations in turn, calculating the distance to every other location and adding up the populations for all within range. After Robert had done that in Excel I couldn't resist trying it in Tableau. I tried various approaches with a few different databases and managed to get it going with several of them - but the calculation was nowhere near as quick as Robert's Excel workbook. But the idea has hung around in the back of my mind, and your question set me thinking about it again, so I decided to use that as the example to try out the effectiveness of the spatial index approach.
And the answer is that it's very effective - for a radius up to about 25 km it's now faster than Robert's Excel workbook (despite the fact that there's a mysterious delay of 7 seconds where Tableau reports a "query compilation exception" in the logs).
Today I started out by just running a query directly in SQL Server while I experimented with defining a spatial index and working out how to make use of it.
I wrote a query which would return the total population within <N> kilometres of each of the 8,000 locations in the dataset of German locations that Robert had used. I first ran it without a spatial index defined. I let it run until it had returned enough rows to give a good estimate of how long it would take. I reckoned it was going to take about 20 hours. Without an index, the run time would be independent of the radius chosen.
I then added the index (which built in no time) and re-ran the query. It took 2 seconds with a radius of 1 km, 4 seconds for 5 km, 13 seconds for 25 km and 54 seconds for 100 km.
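For illustration, the query might have looked roughly like this self-join (the table and column names are my guesses, not the original):

```sql
DECLARE @radius_m float = 25 * 1000;   -- e.g. a 25 km radius

-- For each of the ~8,000 German locations, sum the population of
-- every location within @radius_m of it. With a spatial index on
-- GeoPoint, only nearby rows are examined for each centre; without
-- it, every pair of locations must be compared.
SELECT centre.LocationName,
       SUM(other.Population) AS PopulationWithinRadius
FROM dbo.Locations AS centre
JOIN dbo.Locations AS other
  ON other.GeoPoint.STDistance(centre.GeoPoint) <= @radius_m
GROUP BY centre.LocationName;
```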
That was fine just running it in SQL Server directly, but the next challenge was to get it to work in Tableau. The problem there was that Tableau doesn't support cross joins, so you have to fake it. My first attempt was using that dummy "join_key" column that I mentioned before - but unfortunately that changed the SQL syntax just enough to convince the SQL Server optimiser that it should use a really bad query plan instead of the really good one. I had various other unsuccessful attempts (a custom SQL connection with a cross join and a few attempts at forcing an optimiser hint into the query). Eventually I went back to the join_key idea and added a second "join key". One was set to 'X' and one to 'Y' for all rows in the table, and I defined a join condition of:
[join_key] <> [join_key2]
That also joins every row to every other row, but for reasons best known to the optimiser, this way it chose to use the spatial index.
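In SQL terms, the shape that finally worked would have been something like this (a hypothetical reconstruction, using my guessed names for the German locations table):

```sql
-- join_key is 'X' and join_key2 is 'Y' on every row, so the <>
-- condition is always true and the join is a full cross join.
-- Functionally identical to joining on equal constants, but phrased
-- this way the optimiser chose the plan that uses the spatial index.
SELECT centre.LocationName,
       SUM(other.Population) AS PopulationWithinRadius
FROM dbo.Locations AS centre
JOIN dbo.Locations AS other
  ON centre.join_key <> other.join_key2
WHERE other.GeoPoint.STDistance(centre.GeoPoint) <= 25000
GROUP BY centre.LocationName;
```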
I was then able to prove that the approach we have been discussing using guided analytics works really well - that gives almost instant refresh of the view as you navigate from a chosen centre location.
I also produced this view, which shows the population within a 25 km radius of every location in Germany and labels the centre with the largest surrounding population - and that only took a few seconds to refresh.
Obviously your data volumes are very different to the volumes I've been using here, so I don't think you'll be producing maps like that off your 2 million customers and 2 million UK postcodes, but you'll definitely be able to do what you set out to do.
And along the way I've taught myself a few new tricks, too.
The way I had envisaged it was to build on top of a SQL view that contains the cross-join. I'm not too sure if using multiple tables would apply, as I need to include the distance calculation somewhere (do you know if that can be done within Tableau, working with the geography-typed fields?).
I didn't want to go for an indexed circle dimension, as it would require the radius, and I need that to be a parameter.
Being able to calculate the best center point would be great too, as in the end it's essentially what our customer is after. However, I'm a bit worried about performance with our data volume.
I'll have a look at your sample and will do further tests. Hopefully some of your tricks will come in handy.
Thanks again for the time spent; this just shows how passionate you are about doing these things.
Doing it on top of a view will have much the same problem as using custom SQL - Tableau won't be able to put the filter conditions into the cross-join itself. From past experience I think SQL Server is pretty good at figuring this out, so you probably won't end up evaluating the full Cartesian product - but you might. That certainly happens with other databases I've tried this approach with. But even so, based on what I was doing yesterday, I'd say there's a good chance that you won't be able to get it to use a spatial index if you use that approach. I really think that will make a big difference to viability for you.
The way I drove the distance calculation from Tableau was with a RAWSQL expression which took the radius from a Tableau parameter.
I only built a spatial index on the postcode locations (i.e. I created a POINT geography column built from lat and lon and indexed that column) - the circle was just constructed on the fly with a spatial function. The best way to do that is just to define a buffer around the centre point of interest, using the STBuffer() function, with the radius as a parameter. I'd been thinking of a different approach with PostGIS - but having worked out how to use STBuffer I think this is likely to be the best way anyway.
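A possible set-up for that indexed point column, with hypothetical names (note that SQL Server requires the table to have a clustered primary key before a spatial index can be created):

```sql
-- Persist a geography POINT built from the lat/long columns.
ALTER TABLE dbo.Postcodes ADD GeoPoint geography;

UPDATE dbo.Postcodes
SET GeoPoint = geography::Point(Latitude, Longitude, 4326);

-- The spatial index goes on the stored points; the circle never
-- needs indexing because it is built on the fly per query with
-- @centre.STBuffer(@radius_m).
CREATE SPATIAL INDEX IX_Postcodes_GeoPoint ON dbo.Postcodes(GeoPoint);
```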
So this is the RAWSQL expression I ended up with:
That returns 1 if the two locations being compared are within [Radius (km)] of each other.
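The expression itself doesn't appear in this copy of the thread. Purely as an illustration of the general shape described (the field names and exact SQL below are guesses, not the original):

```sql
-- Hypothetical sketch of a RAWSQL distance test: %1 and %2 stand in
-- for geography point fields and %3 for the [Radius (km)] parameter.
-- STDistance returns metres, hence the * 1000 conversion.
RAWSQL_INT("CASE WHEN %1.STDistance(%2) <= %3 * 1000
            THEN 1 ELSE 0 END",
           [Centre Geo], [Customer Geo], [Radius (km)])
```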
I'm just about to try this out on some of my own data, which is much more like the data volumes you are looking at - this idea has opened up some new possibilities for analysis I've been thinking about.
I've tried using the SQL view, and am getting response times of around a minute (1.8m rows against itself, for the sake of this test) when filtering on a postcode and radius (tried with 10 km at 53 sec and 100 km at 59 sec). This is not too bad, except I can't get the same sort of performance once in Tableau, where it seems to take a lot more time (I never let it finish, though, as I thought something was probably wrong somewhere for it to run like that).
I've set up the table with a spatial index on the geography (points) column. I also have an index on postcode to improve the guided-analysis/pre-action view.
I'm now trying to use a multiple-table source as you suggest, and tried to replicate your RAWSQL statement to produce the distance, for a start. However, I'm failing to get this calculation working, because it's trying to aggregate on my geography columns but not including them in the GROUP BY. I must be building my view the wrong way (simply displaying lat/long for the data and trying to filter on this distance calc).
What sort of benefit do you think the use of the buffer brings? (I guess you use it to cache the center point?)