3 Replies Latest reply on Jan 21, 2014 11:31 AM by Bill Jindrich

    Data Shaping for Survey Data

    Bill Jindrich

      Hi All,

      I'm fairly new to Tableau. One of the areas that I'm exploring is the analysis of internal survey data. A couple of months ago, when I was first trying out Tableau, I had an XLS containing survey response data, with one row per user - it had many columns, which could be grouped into 2 categories: demographics and survey response data. I separated the data into what I thought made logical sense, 2 "tables" (Excel worksheets), with a User ID field that tied them together: a Demographics Table and a Survey Response Table.


      Bringing these 2 tables into Tableau, I obviously had to create 2 separate data sources. Things seemed to work mostly ok as I built out some sheets and dashboards but I ran into some issues, which I don't recall the specifics of right now. I know that I had tracked down a few of the issues to the fact that each sheet seemed to have a "primary" data source, and there were some limitations or restrictions based on the primary data source that is chosen, with regards to non-primary data sources included on a sheet.


      Flash forward to now, and I'm starting "fresh", although the survey data is similar. I want to ensure that I put the data in the very best shape in Tableau, so I'm hoping to get some input from the group here.


      I am familiar with Tableau's Excel Data Shaper add-in, and I can see how it can help, however I'm wondering if there are any best practices for shaping & pulling XLS survey data into Tableau when you have a number of columns of demographic data and a number of columns of survey response data.


      I've seen examples, where it seems that the demographic data isn't shaped, only the survey response data is shaped. With a number of demographic data columns, it seems like the demographic data is being replicated unnecessarily. Besides making that overall data set larger than necessary, I've seen this approach cause some challenges, with the need to add a "distinct" clause to every chart/visualization.


      So, how do you all do it?

      Separate data sources for demographics and survey response data or a single data source?

      Shape only the response data or shape the demographics data too?

      Any other tips or tricks?


      Thanks in advance!

        • 1. Re: Data Shaping for Survey Data
          Susan Baier

          Hi Bill!


          I work exclusively with survey data in Excel files, and like you I went through a lot of trial and error before figuring out a solution that makes sense for me. I'll explain generally, then if you want more info etc. I'm happy to go into more depth.


          The main value of reshaping is to make it possible to put multiple questions in a single chart, if they share a similar type of response. For example, I always have a bunch of items that respondents have rated on a Likert scale, and if I don't reshape the data I can only show one of these items at a time on a chart reflecting the average score by segment, for example. Reshaping allows me to have all of those items as elements in a "Question" field, and the scores for each as elements in a "Response" field, so that I can just filter to show the questions I want on a particular chart.


          There's no point in reshaping everything -- if you've got a question for which the responses are different from those you'll see for any other question in the survey (such as age ranges, income, etc.) you're only going to show that question's results alone anyway.


          The other thing with reshaping is that every response needs to be a number -- you can't have a mix of numbers and text. This work greats for Likert scale responses, obviously, but if you're wanting to show a group of items on a single chart and they all have the same response options but they're text, you're going to want to recode those into numbers. (I have questions like "How often do you do the following?" with a list of activities, all that can be rated with a response of "All the time", "Often", "Sometimes", etc. -- I recode those responses to a number so that I can show the percentage of respondents who selected each option for each item on the list, all on one chart.)


          The important thing to remember, though, is that you can't have the same number represent different things if you're going to want to label them as such in your workbook. For example, you can have a whole series of Likert score items (How important? How valuable? How influential?), and as long as you're going to label your chart with an annotation such as "1=Not at all valuable, 7=Extremely valuable" you're fine. But you can't have other items in the reshaping that you've identified with the same number and want to label differently -- Tableau won't let you change the alias for something just on a single sheet, as far as I know. So you have to map out carefully any plans for recoding so you don't duplicate numbers, and so you remember which response label goes with which number.


          So those are my key rules -- if you want to put a number of items on a single chart, you need to reshape. And if you don't, you don't. And if you do reshape, make sure you recode your items so they're all numbers, but no duplicate numbers meaning different things in your data set.


          I hope that's helpful. As I said, it took me a few tries with my dataset to know which items I will want to reshape, and which items I'll keep to the left of the column I'm reshaping from so they aren't included in my "Question" and "Response" set.


          I'd be happy to discuss further if this isn't clear -- please do let me know.


          All my best,



          1 of 1 people found this helpful
          • 2. Re: Data Shaping for Survey Data
            Jonathan Drummey

            Hi Bill,


            First, if you haven't already, check out Steve Wexler's stuff, including New Whitepaper from Steve Wexler - Visualizing Survey Data and his blog posts on survey data at http://datarevelations.com/.


            I do regular downloads of patient experience survey data that comes in a very wide Excel table with demographic information and answers, with one column per question and N columns for multiple answer questions. We use Microsoft Access for our database. I import the raw data into a "Respondents" table, and a reshaped version that only has the Respondent ID, Question, and Answer into a "Responses" table. (Steve has instructions for this in one of his blog posts).


            I also have a couple of useful dimension tables set up:

            • Elemental Questions identifies the fundamental questions that get used across surveys, like "Rate your provider".
            • Questions identifies each question, what survey it's from, has short & long versions of the question text, identifies whether the question is multiple choice, and what type of question it is (Yes/No, Likert, 0 to 10 rating, etc.), and the elemental question.
            • Answers has a row for ach question/answer combination. Our patient experience survey vendor provides answers in numeric form, so this table has a row for that numeric answer value and the appropriate label. They use positive answer values between 0 and 9999 are valid, answer values outside that are indicating invalid answers, did not answer, etc. There's one additional column to identify whether the Answer is Positive, Neutral or Negative for Likert Questions.


            The datasource for Tableau is a query that starts with all the Responses and then left-joins in the information on the Respondents, Questions, Answers, and Elemental Questions along with other useful data from our database such as benchmarks. I only grab the demographic fields from the Respondents table that are useful, I'm not trying to grab every field & original question. This setup could be done as a Multiple Tables query in Tableau, it's just easier for me to build & maintain the query in Access and then use a Tableau Data Extract (which I find necessary for performance, given how slow Access is).


            You had a comment about this repeating data unecessarily, and that's true. I have a few counterpoints to that:

            • Tableau data extracts are highly compressed, so fields with low cardinality (a Gender field being a good example) take up very little extra space.
            • Depending on your data source, Tableau can support billions of records. Data extracts are one of those sources, so having a row for every question & answer combination is not a big deal for us.
            • You don't have to have "the mother of all data sources", you can build different data sources for different purposes.


            Inside Tableau I have a calculated field called "Usable Response" that uses the Answer Value to identify whether the response is a usable, then a calculation for a % Always (which might be 9 or 10 on a 10 point scale, or "Always" on a a Never/Sometimes/Always scale, etc.) looks like this:


            SUM(IF [Positive] = "Yes" AND [Usable Response] = "Usable" THEN 1 ELSE 0 END)

              / SUM(IF [Usable Response] = "Usable" THEN 1 ELSE 0 END)


            Notice that I don't need a count distinct of Respondent IDs for this kind of analysis, all I need to do is count records so long as I'm clear in the filtering and dimensions in the view about what is being counted.


            I'm slowly working towards making datasources ready for Tableau users who aren't data analysts, I'll probably be moving more of the calculations into the data source, so for example the numerator above might become a "Positive" measure and explicitly use the "# of Usable Responses" measure I already have as the denominator. This would also enable the row-level calculations to be materialized in the extract and improve performance.


            One important thing to mention is about Tableau's reference lines. If you use an Average reference line, that is going to average the measure across whatever you define (Table/Pane/Cell) If you want the true group result, then you'd use the Total reference line option. For example, if you had results of 5/10, 6/10, and 7/14, then the Average reference line would be (.5 + .5 + .6)/3 = 53.3%, whereas the Total reference line would be (5 + 6 + 7)/(10 + 10 + 14) = 18/34 = 52.9%.



            1 of 1 people found this helpful
            • 3. Re: Data Shaping for Survey Data
              Bill Jindrich

              Jonathan and Susan, thanks so much for your responses. I still need to digest your replies as well as read Steve's whitepaper. If I have more questions, I'll come back to this forum.