Are you running the basic sql request against BigQuery? What is the query syntax you are using?
No, I have a duplicate version of my table in a PostGreSql database. This is where I have executed my sql request :
"SELECT DISTINCT COUNT(visitor_id) FROM visitors"
Result : 218.570
With Tableau 8 and CNTD(visitor_id), I have 215.487
The table schema and the data are exactly the same both on my google bigquery table and on my postgresql table.
3 of 3 people found this helpful
Ahh... Count distinct in Google BigQuery is a statistical estimate, not an exact calculation
See here for more detail... https://developers.google.com/bigquery/docs/query-reference#aggfunctions
If you want to check for duplicate primary keys then try putting Sum(number of records) on filters, set it to atleast 2 and then put your ID onto columns
Thank you for this information, I didn't enough read the documentation, now I understand why the values aren't identical... It explains a lot of other results I have in others workbooks.. I'm agree with your solution, it will work, although this is not the best way I would have liked to do. For example, if I have many Measure values from multiple datasources in a same sheet (like KPI), I can't put a Sum(number of records) on filters, because it will affect all values.
Cristian, I already read this article, but not tried yet.
Although the count distinct is by default a statistical estimate in BigQuery, there is possibility in the SQL syntax to force it to be the exact count :
"When you use the
DISTINCTkeyword, you can also specify a value for
n > 0). If the count is greater than
n, the function returns a statistical approximation. If the count is less than
n, the function returns an exact result. The default value for
nis 1000. Large values for
nmight substantially increase query execution time."
However there is actually no way in Tableau to do that
It would great if Tableau and Simba find a way for us to have the exact count.
As for now, there is some big limitations between BigQuery and Tableau. (ex: unable to do Big Join (with the JOIN EACH syntax), unable to do data blending when doing a count distinct)
2 of 2 people found this helpful
I understand that Google want to be as fast as possible, and a statistical approximation is of course a good way to reduce the time of a request. But It would be better if by default, the DISTINCT keyword return an exact result, and we can specify a value for n to return a statistical approximation.
The solution with a custom sql (8) from this article works fine : Tips & Tricks: Alternatives to Count Distinct
But I don't like to do this, not the way it should be done.
Now I just check for duplicate primary keys in a workbook, and I use COUNT() instead of COUNTD() in my others worbooks.
I would like to suggest you to investigate other options, beyond Google Big Query, for example a cloud solution hosting a real SQL database.
Thanks for the reply.
A standard SQL database is not a viable solution for us because we have too much data. Actually, we were previously using a MySQL database before switching to Google BigQuery. At the current moment, we haven't found any other cloud solution that have the speed, easiness to administrate and low pricing as Google BigQuery. The only other alternative that we found was Amazon Redshift but unless buying a lot of instances and tweaking a lot with the database, I have found that this option was much slower while using it with our data and Tableau (often ~15-20min. to load a report with Redshift vs ~1min. with BigQuery).
I've finally found a solution to get a correct count distinct with Bigquery, which works fine with my KPIs.
One of the many things I wanted to do was to be able to see in the same sheet how many users and buyers over the time I have.
Below I explain how I've make it works :
At first I had two duplicate datasources, linked by the calculated field "event_at" (="user_created_at" in my first DS and ="user_purchase_first_at" in the second). The first datasource to get the number of users, and the second to get the number of buyers.
To count the number of users, I just use the COUNT() function, and I check before in an other workbook that there are no duplicated keys (read my first post).
With COUNTD(user_id) from the second datasource, because of the statistical estimation of Count Distinct from Google Bigquery, the results were not perfectly exact.
To solve my problem, I have duplicated my first datasource and I have edited the connection with a custom sql request like :
SELECT user_id, user_purchase_first_at, count(1) AS buyers FROM analytics_orders GROUP BY user_id, user_purchase_first_at
This custom sql request do the same as a count distinct ( see (8) from this article ), because an user can only have one unique value for my field user_purchase_first_at in my data model (The first time an user make a purchase).
In my sheet, I put the COUNT(user_id) from my first datasource and the COUNT(buyers) from my new duplicate datasource in the Measure values, in the rows I put Measure Names, and for columns I put the calculated field "event_at".
It works like a charm.
I hope this can help you to get what you want. Otherwise, don't fear to think about make a change in your data model.
2 of 2 people found this helpful
Well, there are some routes to downtown, but i am afraid i cannot give you any price mark or rates value.
- rackspace is /will host Actian Vectorwise solution (Big Data on the Rackspace Cloud - Rackspace Hosting) , you could drop them an email and ask for details
- 1010data (1010data - Big Data Analytics) the little know solution for huge volumes of data
- Made in Germany - Exasol (EXASOL | High-speed database for business intelligence (BI) applications / solutions) in memory MPP cloud solution for BI
- Maybe you want to give MySQL a new chance and replace innodb storage engine with the one designed by Tokutek (Serious Compression, Increased Query Speed, Hot Back Up and Excellent Database Performance | Tokutek)
Gartner report for 2013 data warehouse vendors could be read here Magic Quadrant for Data Warehouse Database Management Systems
and for 2012 here Magic Quadrant for Data Warehouse Database Management Systems
Hope this helps.