7 Replies Latest reply on Jun 6, 2014 2:47 AM by Siraj Samsudeen

    Guidelines for posting data/tbwx files

    Alex Kerin

      As a common visitor to these forums, I will often ask people if they can provide some sample data. This often helps in more unusual situations with joins or blends or table calculations. However, twice recently I have seen people post a tbwx with clearly confidential information in them - one with medical records, and another almost 2 million rows of financial information for a public company. In both cases the origin company was trivial to find out. So maybe these guidelines will help:

       

      • If you can, use the superstore example data set - it is one of the best example databases I've seen (and I struggle with my own). It has lots of different data types, appropriate hierarchies, the dates are through and beyond present, and the data makes sense
      • If you need to post your own (and many will), make sure it is truly anonymous.
        • Reduce the row count to a few hundred at most,
        • Delete columns that are not relevant (delete from the source, not hide),
        • Completely obfuscate names/departments. If you delete location names but other names still exist (in the data, or from your name as author of the post) it is trivial via Linked In to find who you work for
        • Ideally change values as well. If it's an Excel source, then "+randbetween(0,100) will work with the caveat that if you are having difficulty getting a certain value you may need to keep the measure data

       

      Please don't risk lose your job by failing to do this.

        • 1. Re: Guidelines for posting data/tbwx files
          DJ Crane

          I'm new to Tableau, but this post makes a great deal of sense. Thanks for offering to assist in such a responsible and thoughtful manner.

          • 2. Re: Guidelines for posting data/tbwx files
            Alex Kerin

            Another point, it's great when the original poster comes back and says the solution works (or doesn't), just so there's some closure. Equally it means that other readers of the thread know the solution is okay.

            1 of 1 people found this helpful
            • 3. Re: Guidelines for posting data/tbwx files
              Richard Leeke

              Not to mention that a bit of common courtesy just oils the community's wheels and makes people far more inclined to continue to help. I can't speak for others (though I guess others feel the same), but I know that I feel a lot less inclined to respond to a question from someone if I remember that the last time I spent an hour labouring over a gnarly problem I didn't get so much as an acknowledgement of the response. I don't actively keep a blacklist - but I've been sorely tempted to start one once or twice...

               

              And you are quite right Alex that it's really good to have closure on threads to know whether the poster's actual problem was solved or not.

              2 of 2 people found this helpful
              • 4. Re: Guidelines for posting data/tbwx files
                Shawn Wallwork

                I thought it was just me. I'd post something, thinking wow that's kind of a cool solution. And then the questioner never returned to share in the fun of getting it working. But it's still a cool solution hanging out there waiting to get discovered. And every once in a while someone comes along months later and discovers it, which can kind of be fun too.

                 

                --Shawn

                • 5. Re: Guidelines for posting data/tbwx files
                  Dimitri.B

                  Hey, that would be a cool idea to include this into that "better" forum functionality that Joe and others were talking about. If user asks a question, gets some responses but doesn't bother responding in any way after 7 days or so, i.e. mark correct/helpful, etc. - they get a penalty point against their reputation. No need for a black list!

                  • 6. Re: Guidelines for posting data/tbwx files
                    Jonathan Drummey

                    Here are some additional pointers on blinding data for Tableau. I try to use the existing data as much as possible and do as little as necessary, so I can retain the variation and complexity of the original data. I'll also try to do this algorithmically (so I can unwind/undo what I did if I have to) and programmatically (so I can easily repeat the work).

                     

                    I'll use one or more of the following techniques to blind & anonymize data:

                     

                    - Removing sensitive fields/columns. For example, if address information is in your data source but it's not being used in the workbook, remove it by changing the original query, editing the connection in Tableau, and/or generating the extract with only fields in use.

                    - Null'ing data - As an alternative to totally removing the field, either setting it to Null in the query or updating the column in a duplicate data set with Null values.

                    - Replacing sensitive fields with blinded values. I use this one a lot, we have internal database IDs for things like patients & physicians & floors/units/offices, so I'll use the internal ID in place of a name, so "Jane Doe, MD" becomes "Provider 1125". Then I'll remove the name column using one of the above techniques. If your internal ID is the same as a public number (like a social security number), then you'll need to create a calculated field in your data source or use the export & re-import technique.

                    - Deleting data. I might have data for 10 hospital units, but for debugging or whatever I'll delete data for 8 of the units so there's only a little bit. Combined with the above techniques, that can often be enough.

                    - Scrambling sensitive fields. This is where things get tricky, because we want to scramble the data enough, but not so much that it loses the variation we need for demonstration purposes. (For example, what if the scrambling algorithm accidentally comes up with non-unique values for a column that we need to be unique, or values outside of any realistic range)? I'll do things like offset results by some arbitrary time period, add/subtract a random number from results, etc. Generally I don't scramble text fields, but instead replace them with a blinded value.

                    - Export data for re-import. For example, you might export just the columns you need for the test/demonstration and use that as the data source in Tableau.

                     

                    Some ways that Tableau helps this process are:

                     

                    - Editing the connection in Tableau. You can edit a multiple tables connection to only include certain fields, and/or use Custom SQL to only include certain fields, set certain fields to Null, etc.

                    - Generating a Tableau data extract with only fields in use. You could generate an extract for sharing that has only the fields in use in the workbook.

                    - Data->Replace Data Source. You could set up a separate query with blinded/anonymized data, or a separate connection, then use Tableau's replace data source function to swap out your real data with the anonymized data. This would let you keep all the column names the same.

                    - Replace References. You can right-click on any field in the Data window in Tableau and use Replace References… to replace it with another field. I might use this to replace a physician name with a blinded ID field that I just built in Tableau, then I can generate the extract for export with only the fields in use. This can be very helpful if you're working out issues with table calculations with complex compute using settings, when you replace the references the new field retains all of the settings of the old field.

                    - Worksheet->Copy->Data. I'll create a worksheet with a text table that has just the columns I need for the demonstration, then use the Worksheet->Copy->Data function to get the data that I'll immediately Ctrl+V to paste into a new blank workbook. Then I'll build the desired view with just that data.

                    - Calculated fields. You can use calculated fields to do all sorts of things to your text. Though Tableau doesn't have a built-in random number function, a couple of options are to use a table calculation like INDEX() or a RAWSQL call like RAWSQL_REAL("RND()").

                    3 of 3 people found this helpful
                    • 7. Re: Guidelines for posting data/tbwx files
                      Siraj Samsudeen

                      Hi Jonathan,

                       

                      Just now, I came across this thread containing golden nuggets of advice on masking data and the tableau features that would help that. Excellent post - worth making it a sticky! Can I bring this to the attention of @tracyfitzerald (hmmm. Why do I not get the auto-complete for tracy?)