Are you able to post a sample workbook (twbx file) to show what you mean?
Tracy I am not quite sure if I can attach my sample to the forum here. It is patient information. I am not quite sure about sharing the data on public forum. I apologize for this inconvenience.
I completely understand! If you'd like additional help and are able to send the workbook to our Support team, we will keep the data confidential. You can send this information to firstname.lastname@example.org
Have a good weekend!
1 of 1 people found this helpful
I work in a hospital and regularly deal with patient and practice information. When sharing data publicly in this forum (and even with Tableau tech support, since legally I can't give them patient- or provider-identifiable data without a whole lot of agreements and security processes in place), I take the following steps to anonymize and randomize the data:
- Start with a known data set (something I have some familiarity with, so I can validate the results).
- Export it to Excel/OpenOffice (if it's not there already).
- If the data started in Excel/OpenOffice, I'll unhide all columns and rows to make sure I can see everything.
- Add an ID column that is just a numeric ID, using Excel's Fill->Series function
- If patient age is necessary for the view in Tableau, I'll create a new column and use Excel's RAND() function add/subtract a number of years from the age. I'll also add a new column and randomly flip genders if that's necessary.
- If there are physicians/providers listed in the data, create additional ID columns for them as well.
- If there are specific units/offices listed in the data that could be used to identify them, replace those with ID columns. For these last two (providers and units) I've done this enough times that I've set up a couple of tables in my DB that map them to aliases so I can at least have consistent handles that have meaning for me.
- Continue on replacing identifiers as necessary
- For the measures themselves, for each create a new column and use Excel's RAND() function again to randomize the data a bit. This is often the hardest part, in order to not introduce artifacts into the data.
The next steps are to take the data and sanitize it.
- Copy this "master" data worksheet (that has the original and new numbers) to an intermediate "map" worksheet, using Copy/Paste Values (to get rid of the formulas, since Excel will re-calculate the RAND functions every time the worksheet is opened).
- Copy the "map" worksheet to a "sanitized worksheet"
- In the sanitized worksheet, delete unique identifiers like a visit number or account number
- Delete patient names, physician/provider names, original patient age, etc.
- Use Excel's Filter/Sort operations to identify a relatively random set of rows and delete them. This makes sure that any aggregates based on counting records (such as incidents or events) won't end up with a result that would match any reported total.
- In order to test edge conditions, I'll often add some rows as well.
Finally, I Copy/Paste that sanitized worksheet into an entirely new workbook, and that's the workbook that is used in the post. This sanitized workbook has the general shape of the original data, but doesn't have anything that could be used to identify patients, providers, units, or actual results. And the "map" worksheet lets me get back to the original data to verify results.