What would be interesting is if Tableau could partially evaluate the calculated fields that use parameters before crafting the query sent to the data source, so the query can be cleaned up to remove the unnecessary logic. I am not sure what it would take to implement that, but the structure of the calculated field you present here is a common one that causes slowness. So if there were some way to just parse that style of CASE statement, I think it would be a great improvement.
For example, if the user selected "State" for the parameter value, the entire CASE statement would be replaced with [State], and then Tableau could proceed to craft the SQL as it normally would.
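To illustrate the kind of client-side simplification being suggested, here is a minimal Python sketch. Everything here is hypothetical (the function name, the branch table) - it is not Tableau code, just a picture of how a parameter-driven CASE could collapse to a single column reference before any SQL is generated:

```python
# Hypothetical sketch: fold a parameter-driven CASE into a plain column
# reference at query-build time, since the parameter value is already known.

def fold_case(parameter_value, case_branches, default=None):
    """Return the single column a CASE statement resolves to.

    case_branches maps each possible parameter value to the column it
    selects, e.g. {"State": "[State]", "Gender": "[Gender]"}.
    """
    return case_branches.get(parameter_value, default)

branches = {"State": "[State]", "Gender": "[Gender]", "Age": "[Age]"}

# With the parameter fixed to "State", the whole CASE collapses to [State],
# so the generated SQL could GROUP BY the raw column directly.
column = fold_case("State", branches)
print(column)  # [State]
```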
Interesting - I've never noticed that. As I'm currently waiting for a view of my own to refresh (19 minutes and counting) I just had a quick go at reproducing your issue.
I created a similar calculated field driven off a parameter choosing between a set of dimensions in a 60 million row data extract. I compared the time with one of the dimensions on the color shelf against the time with the calculated field on the color shelf. I got 47.23 vs 47.27 seconds - so there must be something else going on in your case. I also tried with the calculated field in a filter and that made a bit of difference - but nothing like what you are seeing.
What are you using the calculated field for (e.g. is it in a filter or just on one of the other shelves)?
Maybe take a look in the Tableau logs at the queries being sent to the data engine. They're not SQL, they're in the data engine's own private language - but you can sort-of guess what it's doing.
Joe, that would be great if it could do that.
Coming from a programming background I was a bit puzzled by this, considering CASE statements resume normal execution once the correct path is found (i.e. via a break statement).
It's just on one of the shelves, to allow a user to select which dimension they wish to look at once the view is published. This worked brilliantly with 500,000 records of the same data; however, now that I've used the entire 20m records it has ground to a halt.
I did notice in the log that this line is repeated many times:
(0f78): GetNumberFormat(): invalid parameter
and after it is repeated several hundred times I get
(0f78): Finished rendering sheet: Sheet 12
From the log - where is all the time going? Is it in the actual execution of the data engine query, or is it in Tableau after the results have been returned?
2011-05-04 09:05:43.131 (077c): [Time] Running the query took 71.1423 sec.
That seems to be the bulk of it, and it appears right after about 50 entries like this:
2011-05-04 09:05:43.059 (077c): UpdateProgress: k = 19178054, lb = 19195783, n = 19195783, ub = 19195783
I believe this is because the data source has to evaluate the parameter for each row in your data source before it can aggregate.
That is why I think there would be a speed improvement if the calculated fields were evaluated before crafting the SQL, so the data source would not have to evaluate the parameter CASE/IF statement for each row in the database.
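A rough Python analogy for the suspected per-row cost (purely illustrative - we don't know the data engine's internals): with the CASE left in the query, the parameter test runs once per row, even though the parameter value is the same for every row. Folding it out hoists that test out of the loop entirely:

```python
# Illustrative only: per-row CASE evaluation vs. a pre-resolved column.
# The parameter value is fixed for the whole query, so testing it on
# every row is redundant work that could be done once up front.

rows = [{"State": "NSW", "Gender": "F", "Age": 34}] * 5

PARAM = "Age"  # the user's parameter choice, constant for the whole query

def per_row_case(row):
    # What the data source appears to be doing: test the parameter per row.
    if PARAM == "State":
        return row["State"]
    elif PARAM == "Gender":
        return row["Gender"]
    else:
        return row["Age"]

# Resolving the parameter first turns the loop body into a direct lookup.
column = {"State": "State", "Gender": "Gender"}.get(PARAM, "Age")

assert [per_row_case(r) for r in rows] == [r[column] for r in rows]
print(column)  # Age
```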
Hmmm. I understand what you're saying, and that seems to fit with the query it's written to the log:
restrict (aggregate (project
(table [Extract].[Extract]) (
(iif (= "Age" "State")
(= "Age" "Gender")
(= "Age" "Age")
Is there a solution to this that fits within the existing capabilities of the system?
There might be other methods to get the results you are looking for.
Can you provide a sample workbook that represents your situation?
It certainly sends (a version of) the case statement to the data source - but the data source should be able to evaluate that without having to access all possible fields for each row. The data engine query generated by the test I did last night on 60 million rows had a nested set of conditions listing the 10 columns which I'd included in a case statement in the calculated field, but there was literally a fraction of a second difference in run time - so in that case it clearly didn't force it to go and access all 11 columns from the column store.
That's why I asked Dan how the calculated field was used in the sheet (and the answer was it's just on one of the shelves). So I've no idea why the behaviour is different in this case.
If you want to post the section of the log showing the query we could have a look and see if we can guess anything, Dan, but it would be just that - guessing.
I agree, by the way, that there does seem to be scope for a bit of client-side optimisation before even sending the query to the datasource in this case, though like you, I've no idea how viable that would be.
The fact that you're saying your example of 60m records was fine means that there has to be something else going on here.
It doesn't appear to be anything to do with the extract: if I drag a dimension such as "Gender" to the shelf and then a measure of "Premium", it displays the results in half a second on 20m records. But when I replace the "Gender" pill with a calculated field that has a CASE statement based on a parameter (so the user can choose which dimension to display from a drop-down), it takes more than a minute to return a result. However, after the first time it runs fine (it must cache the results).
I'll upload an example workbook shortly.
I'd really like to get an answer to this as well... it is a bit of a spoiler for me, as I really like the concept of publishing limited "business-friendly" construction choices to certain user populations without them getting wrapped up in learning the desktop client, or being forced to play an "Analyst" role with respect to all the data choices available at the data source level.
I asked this question a week or two ago and included an example highlighting this very problem. You can get to the thread here:
Koz, yes I have faced that exact same issue with parameters not being optimized as they potentially could be. If you put the actual fields there instead of the parameters (what the parameters should evaluate to before the SQL is generated IMO), then there would be no speed issue.
I am with Richard that this would be a great thing to implement, but I do not know how feasible it would be. Nor do I believe that I fully understand all that is going on; these are just my guesses.
Perhaps, given that parameters are a newly implemented feature (just released in 6.0, I believe?), there is room to improve optimization in the coming months.
It would be nice to see if this is on the cards, given the fundamental requirement for it.
PS: I'm attending the Sydney users meeting next week - I'll bring it up then ;)
One workaround is to do the following:
- Create a sheet for each possible segment value (i.e. if your parameter allows State, Gender or Age, create a sheet for each of these).
- Add a filter that shows each sheet only when the user selects the corresponding value from the parameter.
- Add these sheets to a dashboard.
This way, when you select a segment from the drop-down, it's not executing a CASE statement on the data but rather showing or hiding the various sheets. The performance is as fast as you'd expect (sub-1 second for 20m records); however, there is the issue people have noticed before - hidden sheets in dashboards still take up a small amount of room.
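The principle behind this workaround can be sketched in Python (hypothetical data and names, just to show why it's fast): each possible breakdown is computed once up front, one per "sheet", and the parameter merely selects which precomputed result to display - no per-row CASE evaluation happens at selection time:

```python
from collections import Counter

rows = [
    {"State": "NSW", "Gender": "F", "Premium": 100},
    {"State": "VIC", "Gender": "M", "Premium": 250},
    {"State": "NSW", "Gender": "M", "Premium": 50},
]

# One "sheet" per dimension: aggregate Premium by each dimension once.
sheets = {dim: Counter() for dim in ("State", "Gender")}
for row in rows:
    for dim, totals in sheets.items():
        totals[row[dim]] += row["Premium"]

# The parameter just picks which precomputed sheet is shown.
selection = "State"
print(dict(sheets[selection]))  # {'NSW': 150, 'VIC': 250}
```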