3 Replies Latest reply on Dec 28, 2016 8:30 PM by Richard Leeke

    "Assume Referential Integrity" seems to assume rather more than that

    Richard Leeke

      I've just come across a rather surprising bug in one of my old workbooks. The impact of the error is really obvious (a simple view returns a wildly incorrect result for the count of rows in the data source), so I couldn't quite believe I wouldn't have noticed it when I first created the workbook. After a bit of detective work opening the workbook in old versions of Tableau I established that the cause of the problem is some undocumented behaviour of the "Assume Referential Integrity" setting. It so happened that the behaviour was masked by the way Tableau used to treat a calculated field back in 9.0 when I originally created the workbook, but was exposed from 9.2 onwards because Tableau now optimises the way it handles the calculated field.

       

      All of the documentation I have been able to find and most of the forum posts from the likes of Dan Cory, Russell Christopher and Bora Beran warn that "Assume Referential Integrity" can lead to incorrect results if your data is not referentially intact. But I haven't seen anyone warn that it can give wrong results even if your data is referentially intact. What I am seeing happens when you turn on "Assume Referential Integrity" even if you have declarative referential integrity in the database - but not if you don't turn it on. That fact alone shows that Tableau is assuming rather more than just referential integrity.

       

      My understanding is that Assume Referential Integrity tells Tableau to assume that there are no "orphan" children in the fact table, which allows it to omit the join to the corresponding dimension table (join culling). However, what I am seeing is that in some circumstances Tableau assumes "no children" rather than "no orphans". This behaviour seems to depend on the order in which the tables were added to the connection: if the fact table is not the first table to be added in some cases it may be omitted from the query. If all the view is doing is counting rows in the data connection, that gives completely the wrong answer.

       

      I have seen a couple of slightly enigmatic comments from Dan Cory in old forum posts which suggest that in some cases join culling is able to omit the fact table, so I suspect that in some circumstances this may actually be intentional and desirable behaviour, but in my particular case it just looks like a bug to me. At the very least I think it's something which needs to be spelled out in the documentation.

       

      I can't attach a packaged workbook showing the behaviour because it only happens with a live connection to a SQL datasource, so here are a few screenshots demonstrating what I'm seeing. This is with SQL Server, I haven't tried any other databases.

       

      I have a dimension table with three rows and a fact table with 6 rows, referencing the dimension table:

       

      d_id is the primary key of the dimension table and is defined as a foreign key in the fact table.

       

      I defined a data source using these two tables as follows:

       

      Using that data source I get the correct counts of rows by dimension name:

      This is correct whether or not I enable the "Assume Referential Integrity" setting.

       

      But if I add the tables to the data source in the opposite order, with the dimension first, the behaviour changes. I don't recall ever seeing anything saying that the order in which tables are added to a data source matters.

      The view still gives the correct answer if "Assume Referential Integrity" is not selected, but if it is selected the results are wrong:

       

      I get this behaviour whether or not there is actually referential integrity declared in the database - I have it declared here to prove that the issue isn't due to an issue in the data.

       

      Creating a data extract also gives the correct answer, even if "Assume Referential Integrity" is selected for the datasource when creating the extract. That makes sense - the extract has to get all the data.

       

      The SQL generated in the two cases is as follows:

       

      Without "Assume Referential Integrity"

      SELECT [Dim].[d_name] AS [d_name],

                 SUM(CAST(1 as BIGINT)) AS [sum:Number of Records:ok]

      FROM [dbo].[Dim] [Dim]

        INNER JOIN [dbo].[Fact] [Fact] ON ([Dim].[d_id] = [Fact].[d_id])

      GROUP BY [Dim].[d_name]

       

      With "Assume Referential Integrity"

      SELECT [Dim].[d_name] AS [d_name],

                 SUM(CAST(1 as BIGINT)) AS [sum:Number of Records:ok]

      FROM [dbo].[Dim] [Dim]

      GROUP BY [Dim].[d_name]

       

      Clearly in the latter case it has omitted the join to the fact table.

       

      I mentioned at the start that the behaviour was originally masked when I created the workbook back in 9.0, but has been exposed from 9.2 onwards. The reason for that is that join culling only removes tables which are not directly referenced in the view. My original view had a filter defined on a calculated field, where the calculated field referenced a boolean column in the fact table. The calculated field was defined like this:

       

      [Calc Field for Filter] = [Boolean Parameter Value] OR [Fact Table Boolean Column]

       

      In 9.0, the fact that the calculated field referenced the fact table was enough to prevent join culling. From 9.2 onwards Tableau inspects the value of the parameter before deciding - so if the parameter was true, it knew it didn't need to reference the fact table, which allowed join culling to kick in.

       

      This took a while to get my head around - and I'm intrigued to know whether this really is by design, or whether it has somehow just slipped through the cracks. I'm amazed I can't find any other forum posts reporting it.

       

      Now I just have to decide how to work around it. If the behaviour really is by design and the order of adding tables to a data connection really is supposed to be significant then I suppose the correct way is to redefine all of my data sources. The trouble is doing that changes the field naming in the data source and therefore breaks all the (zillions of) views using the data source. The bodgy alternative is to add a filter on a field from the fact table with all values selected. Has anyone got any other suggestions?

        • 1. Re: "Assume Referential Integrity" seems to assume rather more than that
          Bora Beran

          Hi Richard,

          Referential Integrity requires some notion of direction in which it applies.

           

          When there is PK - FK relationship, direction is clear. E.g. assume table Employee with the primary key of EmployeeID. It can only have 1 row for each EmployeeID.  A table like EmployeeDependents can have a FK of EmployeeID but not all employees have to have kids so some EmployeeIDs from Employee table may not appear in EmployeeDependents table but not EmployeeID that's not in Employee table can appear in EmployeeDependents because of the referential integrity constraints.

           

          So it is safe to drop Employee table from a join as it will have no impact on the results. While dropping EmployeeDependents is not safe since it can impact results as an employee can have multiple children or no children.

           

          When an integrity assumption is made, same has to be said about which of the tables is safe to drop. This is implied in the order tables are joined as there is no other hint in our UI to pick a direction. Since typically fact table is at the "center of the schema", which in our join diagram in data tab conceptually would correspond to the table on the left, that's where the fact table is assumed to be. This is why when you put the fact table on the right, you get different results and probably why there aren't reports of issues with this because most people naturally tend to put fact tables on the left.

           

          I think we can and should improve our documentation to spell it out more explicitly as you suggested. I will follow up with our documentation team and get this taken care of.

           

          Thank you,

           

          Bora

          • 2. Re: "Assume Referential Integrity" seems to assume rather more than that
            Richard Leeke

            Hi Bora

             

            Thanks for the lightening fast response. The fact that "there is no other hint in our UI to pick a direction" makes sense now that you say it. I was kind-of assuming that you took cardinality into account - but thinking about that some more I can see that's not going to be definitive.

             

            In my case the database I'm working with has nothing like a star schema (more of a spaghetti schema) - I just cast the post in star-schema terminology for clarity. I'm not sure it's even possible to add all the tables in an order which guarantees that all relationships are left to right PK -> FK - I'll have to think about that. Obviously the database doesn't have referential integrity constraints (and I can't add them) and "Assume Referential Integrity" does make a big difference in several views. I'll have to experiment to find the best way around this.

             

            Good to have it explained, though - thank you.

             

            Richard

            • 3. Re: "Assume Referential Integrity" seems to assume rather more than that
              Richard Leeke

              A couple more observations on this one.

               

              I think the reason I was assuming that Tableau would know the direction of the PK->FK relationship is that in the vase of the database I'm working with, all tables do have primary key constraints defined, even though not many foreign key constraints are defined. Just knowing the primary keys is actually enough to determine the directions of the relationships. I guess Tableau doesn't bother to look at this because who would define a schema with primary key constraints but no foreign key constraints? Allow me to introduce you to Microsoft... :-)

               

              The other thought that occurred to me is that I was surprised not to have found other people running into the issue. One reason for that is probably that it is quite unusual not to reference any fields at all on the Fact table. As soon as you reference any fields the join can't be culled.

              1 of 1 people found this helpful