I think this is a "Joe Mako" kind of problem. ;o)
You can do it with a cross-join of the table with itself.
Depending on the number of rows, you probably need to ensure that you have single-select filters selected for both instances of your table before you refresh the view.
I recall doing this successfully with about 20,000 properties (and time series data, so there were lots of rows).
I'll dig out an example later unless Joe beats me to it.
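To make the cross-join idea concrete, here is a minimal sketch in Python using SQLite (table and column names are hypothetical, not from the actual workbook): joining the tall table to itself on the user key turns each pair of properties per user into one row.

```python
import sqlite3

# Hypothetical tall table: one row per (user, property, value).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (user_id TEXT, property TEXT, value REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [("u1", "height", 180), ("u1", "weight", 75),
     ("u2", "height", 165), ("u2", "weight", 60)],
)

# Self-join on user_id: every pair of properties for a user becomes one
# row, so any two properties can be plotted against each other.
rows = conn.execute("""
    SELECT a.user_id, a.property, a.value, b.property, b.value
    FROM facts a JOIN facts b ON a.user_id = b.user_id
    WHERE a.property = 'height' AND b.property = 'weight'
    ORDER BY a.user_id
""").fetchall()
print(rows)  # [('u1', 'height', 180.0, 'weight', 75.0), ('u2', 'height', 165.0, 'weight', 60.0)]
```

The WHERE clause plays the role of the single-select filters: without it, the join produces every property pair for every user.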
If I cross-join, that will let me cross-tab up to 2 elements. To do 3 elements I'll need to cross-join again, right? And if my data set is huge (millions of rows) and the total cross-tabs go beyond 2 or 3 dimensions, then it's a quagmire, I think.
I toyed with the cross-join approach back in version 5.x and performance crashed quickly. 6.0 performs a lot better, but I don't think cross-joining sounds like a very viable solution. I really need to transform the data into wide(r) tables, I think.
I just ran a quick estimate on my data, and on one of my active projects I'd get over 4 million rows a year and would want to use a minimum of 4 or 5 crosstabs in a single worksheet (when you think of X, Y, Color, Size, Level of Detail).
Sorry, I sound like a negative nellie. :)
Depending on your database, and as long as you ensure that you have single-select filters in place before refreshing the worksheet, I think you might find the cross-join approach viable. It all depends whether the database optimiser is smart enough to transform the query which Tableau produces into a form which allows it to apply the filters before performing the cross-join.
I experimented quite a bit with large cross-joins a while back. With some databases the optimiser was smart enough; others tried to evaluate the full cross-product before applying the filters - which definitely would not fly. I confess that I haven't tried 4- or 5-way cross-joins, so it's quite possible that the optimiser will throw its hands in the air and admit defeat on those. But you've piqued my interest, so I'll give it a go - I've got an ideal data set to try it out on.
I confess that I haven't got my head around how you are envisaging that a wide structure will help - but I'll have a think about it some more.
Not sure how well this would work for really large data, but attached is my first pass.
I may also be misunderstanding what you are looking for. So let me know if this is close or what additional abilities you need.
Here is what I did:
1. created two parameters with possible values from your field "Property"
2. created two calc fields based on those parameters (very similar to your calc fields), like:
IF [Property]=[Column Dimension] THEN [Value] END
IF [Property]=[Row Dimension] THEN [Value] END
3. placed the "User" field on the Level of Detail Shelf as a dimension (there is now a mark for each user ID, and if this takes too long, this method will not work well)
4. created a calc field like:
IF FIRST()==0 THEN WINDOW_SUM(AVG(1),0,IIF(FIRST()==0,LAST(),0)) END
5. placed this on the Text shelf,
6. placed the column parameter and calc field on the Columns shelf, and same for the Rows
7. set the aggregation for both the Rows and Columns calc fields to MAX()
8. for both those pills, un-checked the "Ignore in Table Calculations" option from their context menus
9. showed the parameter controls, and tested it out
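The heart of steps 1, 2 and 7 can be sketched outside Tableau. Here is a hypothetical Python rendition of what the two calc fields compute for each underlying row, and how MAX() per user collapses the gaps (the mark-counting table calc from step 4 is omitted; data values are invented):

```python
# Hypothetical tall rows: (user, property, value), as in the thread.
rows = [
    ("u1", "height", 180), ("u1", "weight", 75),
    ("u2", "height", 165), ("u2", "weight", 60),
]

column_dimension = "height"   # parameter 1
row_dimension = "weight"      # parameter 2

def calc_fields(prop, value):
    # IF [Property]=[Column Dimension] THEN [Value] END
    col_val = value if prop == column_dimension else None
    # IF [Property]=[Row Dimension] THEN [Value] END
    row_val = value if prop == row_dimension else None
    return col_val, row_val

# Aggregating with MAX() per user (step 7) collapses the Nones, leaving
# one [column value, row value] pair per user.
per_user = {}
for user, prop, val in rows:
    c, r = calc_fields(prop, val)
    cur = per_user.setdefault(user, [None, None])
    if c is not None:
        cur[0] = c if cur[0] is None else max(cur[0], c)
    if r is not None:
        cur[1] = r if cur[1] is None else max(cur[1], r)

print(per_user)  # {'u1': [180, 75], 'u2': [165, 60]}
```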
If this does not work for you, there are other options that could be tried, and I would be interested in the profile of your data:
- how many rows
- how many distinct "User" values
- how many distinct "Property" values
- anything else that is a constraint or need?
user_table_tall_edit.twbx 16.3 KB
I hunted out this old thread which discussed this problem and how I managed to make it perform OK with the cross-product approach. I'm about to see if I can find the various databases I used for trialling that with SQLServer, Firebird and Access and see how it goes if I extend it to 3, 4 and 5 instances of the table in the cross product.
In reading that thread, bear in mind that this was back in version 5.0 days, so my comments about creating an extract and then using custom SQL against the extract to get a cross product don't apply - that was back when extracts used the Firebird database. This approach just won't work with an extract in version 6.0 - you can't use custom SQL against a version 6.0 extract so you would have to evaluate the full cross product and store the results in the extract - which is just not viable with the sort of numbers of rows and numbers of self-joins you are talking about.
I'll also take a good look at Joe's approach and see if I can form a view on which way would be best for you - or where the cut-off would come in terms of the properties of your data that Joe was asking about.
Well I've had a bit of fun exploring possibilities. As I thought, the cross-join approach is very viable even with large numbers of rows, with some database back-ends, at least. You do need to be careful about the order of operations when constructing the views, though, in order to avoid enormous run-away queries. With some datasources you may also have to work quite hard to define the connections in the first place.
To give you an idea of the scale that I'm talking about, I experimented with the test data I had used a couple of years ago at the time of that earlier thread I pointed at above. That dataset had a set of 34 timed samples (equivalent to 34 users in your example data), with 17,800 metrics per sample (equivalent to 17,800 distinct Properties for each user in your sample). That gave a total of just over 600,000 rows.
I had previously only tried a cross join of that with itself once (i.e. 2 instances of the table). I've experimented with 3, 4 and 5 instances cross joined (actually semi-joined on timestamp) and that worked really well, even with the 5-fold self-join. In fact, with the way I have the data normalised, that meant 11 table instances in the join for the 5-fold version.
I did the sums to see how many rows the result sets would have if I attempted to evaluate the queries with no filter conditions. It's worth spelling out the results - it makes it very clear why you need to filter appropriately before you let the view refresh.
3-fold : 191,945,539,448,944
4-fold : 3,417,782,275,427,900,000
5-fold : 60,857,031,196,269,100,000,000
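A quick back-of-envelope check shows where numbers of that order come from. Assuming (roughly - the real per-sample metric counts vary slightly, so the exact figures above differ) 34 samples and 17,800 metrics each, a k-fold self-join restricted to matching timestamps yields about samples × metrics^k unfiltered result rows:

```python
samples = 34
metrics = 17_800  # rough per-sample metric count (assumed uniform)

# Unfiltered result-set size for a k-fold self-join on timestamp:
# each sample contributes metrics**k combinations.
for k in (3, 4, 5):
    print(k, samples * metrics ** k)  # 3-fold is already ~1.9e14 rows
```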
But as long as I filtered appropriately before refreshing I got very acceptable performance with a SQL Server database (3 or 4 seconds to refresh the view with the 5-fold self join, or with a small multiples view of the 4-fold self join). I've attached a couple of images of the results and a screen-shot to show how I had the filters set up. (The data is performance statistics for my laptop from the Windows Performance monitor, by the way.)
I also tried with MS Access. I had a bit more of a struggle to define the connections - some of the queries Tableau uses to validate the connection either took several minutes or, as I extended the join further, started hitting Access limits, so I had to skip some of the validation. I only actually bothered to display views with the 3-fold cross join - that took a couple of minutes to display, so I didn't bother going any further.
You would need to experiment to see how it works with other databases.
There are a couple of crucial aspects to making this work successfully.
Firstly, you have much more chance of success if you define the connection as a multiple table connection rather than a custom SQL connection. This is because with a multiple table connection, Tableau understands the semantics, so can apply filter criteria in a way that generates SQL which the database engine is much more likely to be able to process efficiently. Whether or not this is really necessary varies by database engine, by the way - but I'd suggest you go with the multiple table connection rather than custom SQL if you can to improve your chances of success. I can explain more if you're really interested - otherwise just take my word for it. ;-)
In the case of my data (and in the sample I've attached with your data), defining it as a multiple table connection is easy - you just need a semi join (i.e. your join criteria don't uniquely identify a pair of rows). In the case of your sample data you join the table to itself on user ID, which forms each possible pair of Properties for each user ID.
But if you want a full cross join, you have to use a bit of a trick to make it a multiple table connection, because Tableau insists on you defining a join condition. The easiest way around that is to add an extra dummy column to the table, which has a constant value for all rows. Then you can just join the table to itself on that column - which allows every row to join to every other row.
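In SQL terms, the dummy-column trick looks like this (sketched against SQLite; names are hypothetical): equi-joining on a constant column is something Tableau's multiple-table connection accepts, but it produces the full cross product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a constant dummy column added.
conn.execute("CREATE TABLE t (id INTEGER, dummy INTEGER DEFAULT 1)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [(1,), (2,), (3,)])

# Joining on the constant column pairs every row with every row:
# an equi-join in form, a cross join in effect.
n = conn.execute(
    "SELECT COUNT(*) FROM t a JOIN t b ON a.dummy = b.dummy"
).fetchone()[0]
print(n)  # 9 = 3 * 3
```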
The second trick is to make sure you build your views in a way that prevents Tableau from trying to evaluate queries before you have filter conditions set. The best way I've found to do that is to turn off automatic updates and then add each field that you want to filter on to the filter shelf. Because updates are disabled, Tableau will prompt you to enter the filter criteria manually. Just leave the criteria blank and uncheck the "Include all values when empty" checkbox.
Once you have all instances of your filters created you can turn on updates and add each of the filters as a quick filter. Nothing will be displayed (because of that checkbox you cleared), but you will now be able to select from the quickfilters. Typically you will want to define them as single-select - but it can also be useful sometimes to have them as multi-select so you can generate small-multiples views, as in one of the example images I've attached.
Finally I took the sample data from your workbook, pulled that out into a CSV file and put a little example together with a cross-join, repeating the analysis Joe had done with table calculations in the version of the workbook he posted earlier. I included one sheet just showing the crosstab of one pair of Properties as Joe had it, and also another sheet with all properties included.
Joe's way certainly works for some kinds of analysis as long as your data volumes are manageable, and is the only way I can see to do any of this with a data extract, because of the lack of Custom SQL. But as long as your data source plays well with the cross-joins, I think that approach gives much more flexibility and scales much, much further.
Okay, you guys are amazing. I'm going to take some time on Monday to try to digest all of this and run it by more folks at the office. You've definitely taken the concept further than I got with it before I got frustrated with performance and complexity.
Immediate questions that come to mind... 1) Our data source will be MS SQL Server. Given the strategies you've described, do you expect it to work better using a live connection to SQL Server or to pull an extract (assuming you have good access/bandwidth etc.)? 2) Would the need to publish a workbook like this to Tableau Server change anything about your approach?
Thanks. I already owe you guys for your insight and time.
I really like the "Full Crosstab" sheet Richard made in his workbook, and suspect that will be a very useful viewpoint for you.
Based on the situation you described, I do not think an extract would be beneficial, because MS SQL Server can look at the SQL Tableau gives it and rewrite it in a more optimal structure, giving you better response times. If you create a Tableau Data Extract, you lose that benefit: the custom SQL will be evaluated separately from the worksheet layout SQL, and you lose the advantage of SQL optimization.
As long as Tableau Server can communicate with the MS SQL Server, it should work fine.