1 2 Previous Next 18 Replies Latest reply on Jan 30, 2017 7:09 PM by Shao Yihao

    The Walking Dead - Sankey Viz

    JP Kelly

      Hi All,

       

      Relative newbie here looking for some guidance on how to put together a sankey viz that maps the relationship between characters in The Walking Dead (which will be the "nodes" on the left), and the episodes in which they appear (which will be the "nodes" on the right). I've found some really great sankey diagram tutorials online. My favourite is this one by Ryan Smith (screenshot from the tutorial below) in which he maps football players who played in the recent Euros (point A, left) to the domestic leagues in which they play (point B, right).

       

      Sankey.jpg

       

      As useful as this tutorial is (along with many others that I've found), it seems to differ from what I'm trying to achieve in one significant way: instead of one node on the left connecting to one node on the right (e.g. in the Ryan Smith example, one name of a footballer [left] connects to one league [right]), my diagram will connect one character [left] to multiple episodes [right]. E.g. the protagonist Rick Grimes is credited in 83 episodes to date (see below), so his line from the left would split off to 83 different points on the right.

       

      Walking Dead - Number of Appearances.jpg

      The answer to this may be painfully obvious, but as I'm new to the wonderful world of dataviz, I can't quite get my head around it. Would it be as simple as reversing the concept/points in the tutorial cited above, so that a league (or "character" in my example), occupies the left and splits out to various different players (or "episodes" in my example)? Hopefully that makes sense! If not, I'd be happy to elaborate.

       

      Thanks in advance everyone!

       

      PS - Simon Runc, seeing how helpful you were with my last query, perhaps you can help again?!

        • 1. Re: The Walking Dead - Sankey Viz
          Simon Runc

          hi JP,

           

          Thanks for the ping...so you are certainly jumping into the deep end with Tableau! (Sankey's are very much, as you are finding out, not "out of the box"!!).

           

          As I don't know your data structure, control on data-prep...etc. it's hard to be too exact. However you should be able to reverse engineer this Viz from Nicco (this is also one of my all time fave vizes!...really shows how few refugees are really coming to Europe, contradicting our media!!). Political persuasions apart!...this does have Refugees from a single country of origin going to multiple different "destination countries", which (I think) is what you want.

           

          Tableau Public 

           

          A few things to bear in mind on the way his method differs from the one you've found (...and again we are going into the rabbit hole!). In the Concentra version the padding (the path) is done in the data, so it creates multiple records from 1 to N (the N depends on how many points, so how smooth you want the curve to be). Nicco (who is a Tableau wiz!) uses a clever technique known as data padding. He just brings in each record twice, one with the indicator [RowType] "Dummy" and one with "Real". He has a field [topad] (if [Rowtype]=='Real' then 1 else 49 end), so it gives the "Real" rows the value of 1, and the "dummy" 49. He then created a "Bin" from this field [padded] with the Bin Size set to 1. What this means is that Tableau creates the Bins 1,2,3,4....49 (due to a little known, or documented, feature called (in general terms) data densification), and he can access these for his curve (as they are Table Calculation they can access these "densified" values). This is a very elegent solution, and is why his version only has 800 rows!

           

          Let me know how you get on...and if you get stuck, can you post a workbook, as it'll be much easier to diagnose what's gone wrong.

           

          Also came across this the other day

           

          Tableau Public

           

          Sankey's without duplicating the datasource (using LoDs)...not really had a good look, so not sure if it would work here, but interesting never the less (and again uses padding, rather than data-prep)

          2 of 2 people found this helpful
          • 2. Re: The Walking Dead - Sankey Viz
            JP Kelly

            Thanks Simon (yet again...)

             

            I like the deep end - it's less crowded!

             

            I think I'll have a stab at Nicco's method and will let you know how I get on. No doubt I'll stumble at the first hurdle and will be posting back here shortly. Watch this space...

             

            JP

            • 3. Re: The Walking Dead - Sankey Viz
              JP Kelly

              Remember what I said about falling at the first hurdle?...

              Hurdle.gif

              I got stuck at "He just brings each record in twice...". How exactly is this done? I would assume it's by opening a new project, connecting to the relevant data, and then choosing "duplicate data source" from the main menu? As for the second hurdle, I couldn't figure this out either (as in the gif above). How do you change the "indicator" for the data in order to create "real" and "dummy" versions? It might help to explain this to me as you would to a senile 90 year old.

               

              Believe it or not, I had much more success from hurdle three onwards, and managed to figure out how to create bins, measures, etc (which is actually quite an achievement as I've never even touched these before). Basically, if I can clear the first two hurdles then I should be okay.

               

              And for anyone else interested in sankeys there's a great tutorial here by Chris Love which is very similar to the method you describe above - though he uses an earlier version of Tableau so some things are probably different.

               

              JP

              • 4. Re: The Walking Dead - Sankey Viz
                Simon Runc

                hi JP,

                 

                Nice GIF!!!

                 

                My bad...I get so used to "data speak" that I forget how jargony I get!

                 

                So by bring it in twice, I mean in the same data-source, so you have 2 rows for every entry. This used to require some custom SQL, but is much easier with the union feature in T10.

                 

                So the easiest way, is just to copy your data onto another Excel sheet, call on "Real" & one "Dummy"...I've attached the Excel I've used in the below .gif so you can see what it does. Below shows this happening

                 

                Union Data.gif

                 

                You'll notice that I get a new field (well 2 actually) called Sheet (and Table Name), which means that I can now write the toPad formula from this field.

                 

                Good work on the other elements...they are by far the most complicated bit (apart from the curves...but you don't need to understand the maths to get them working!!). Like the link...Chris has done some great videos.

                 

                Hope that makes sense?

                • 5. Re: The Walking Dead - Sankey Viz
                  JP Kelly

                  Simon,

                   

                  Thanks - that's much clearer. Have cleared that hurdle now. Was easier than I expected!

                   

                  However, despite my bragging in the last post about nailing hurdles three onwards, things aren't working out quite as planned...

                   

                  Even though I've figured out how to create the ToPad field, there's an error which is stopping me from being able to move on to the next step (creating the bin for the padding). Screenshot of the error:

                   

                  ToPad Problem.jpg

                   

                  Again, it's probably something horribly obvious that I'm missing!...

                   

                  In case you can't make it out in that screen shot, the error reads: Field [ToPad] contains a circular reference.

                  • 6. Re: The Walking Dead - Sankey Viz
                    Simon Runc

                    hi JP,

                     

                    So I think your formula needs to be

                     

                    IF [Table Name] = 'Real' THEN 1 ELSE 49 END

                     

                    Assuming you've kept to the excel sheet naming convention of "Real" and "Dummy". [RowType] was just the name that Nicco used to define the thing we've done by Unioning the 2 sheets (he created the UNION in a data-prep tool before bringing into Tableau...as UNION wasn't available when he built this)

                     

                    btw Tableau understands most syntax types so = or == are the same.

                     

                    Hope that does the trick.

                     

                    Looks like it's Sankey week for me...not having built one for ages, I've had your question....and we went to Lazer Quest last night on a team night out. I have all the data of who shot who...and thinking a Sankey might be a good way to show what happened in the arena (I also have your same problem, that I need a one-to-many Sankey). btw I will only be a small footnote in the Viz...having come last on both games and basically being company "Cannon Fodder" (...I think I'll stick to Tableau!!)

                    • 7. Re: The Walking Dead - Sankey Viz
                      JP Kelly

                      A Lazer Quest Viz? I'd love to see that!

                       

                      Thanks for the tip btw, managed to get that working, but starting to think I may have bitten off more than I can chew (you did warn me!) I basically have no idea where to go from here. You original instructions kind of stop at this point. The tutorial by Chris Love gives some guidance on what to do next, but I'm having a similar problem. The next step in his tutorial is...

                       

                      "creating a third function [t]:

                       

                      [t] = (index()-25)/4"

                       

                      t Problem.jpeg

                      I have no idea what [t] is supposed to refer to in my data though? If I follow his tutorial to the letter, I get the error: "Reference to undefined field [t]". Clearly I need to put something else in place of [t]. The only thing that seems to work in that formula is [Number of Records]. Any thoughts? Hopefully this will be my last plea for help - though I wouldn't count on it... Step three seems pretty baffling too. It may be that I have to put this idea on the back burner until I gain a bit more experienced with Tableau.

                      • 8. Re: The Walking Dead - Sankey Viz
                        JP Kelly

                        Quick update: realised the rookie mistake I was making (the "[t]" doesn't actually go in the calculation!)

                         

                        I've also been reverse engineering the original viz you shared by Nicco which has really helped. However, his data is structured quite differently to mine so I'm not sure that my ranking functions are referring to the correct measures.

                         

                        His data looks like this:

                        Nicco Data.jpg

                        Meanwhile, mine looks like this.

                        JP Data.jpg

                        If I'm modeling my Sankey on Nicco's, then as far as I can figure out "Country of Asylum" would equate to "Episode ID", while "Origin" would equate to "Character ID" (or "Character"). The reason I'm assuming this is because are multiple rows for the same "Origin" in Nicco's viz, just as there are multiple rows for "Character ID" (or "Character") in mine. I.e. There are multiple entries for Afghanistan (which link to different "country of asylum" entries), just as I have multiple entries for Rick Grimes (which link to different "Episode ID" entries). Hopefully that makes sense?

                         

                        In Nicco's viz, the rank functions (see below) both refer to "Population end-2014" - though I can't figure out what the equivalent would be in my data. His Rank functions are as follows:

                         

                        [Rank 1] = RUNNING_SUM(sum([Population end-2014]))/TOTAL(sum([Population end-2014]))

                         

                        [Rank 2] = RUNNING_SUM(sum([Population end-2014]))/TOTAL(sum([Population end-2014]))

                         

                        In my viz I replaced "Population end-2014" with "Character ID" just to see what happens. Here's what my viz currently looks like based on this attempt (I'm clearly going wrong somewhere but am determined to figure this out!):

                        Walking Dead Sankey - In Progress.jpg

                         

                        Any advice would be most welcome!

                        • 9. Re: The Walking Dead - Sankey Viz
                          Simon Runc

                          hi JP,

                           

                          So yes...Sankey's are pretty complicated!!

                           

                          It'll be easier for me to have a play, and then let you know what is what (with regards which things map onto Nicco's version)...I suspect it's to do with the Compute Levels of the Table Calculations. Are you able to post your data here (or if it's not something you can share, happy for you to email it over to me).

                           

                          It'll be useful for me anyway...as I embark on the Lazer Quest battle viz!!

                          • 10. Re: The Walking Dead - Sankey Viz
                            JP Kelly

                            Thanks Simon - have just emailed you the workbook and the data!

                            • 11. Re: The Walking Dead - Sankey Viz
                              Simon Runc

                              Hi JP,

                               

                              Apologies for the delay...the last couple of weeks got away from me!! (for anyone else...JP emailed over his data and .twb and I said I'd explain a bit about how Nicco's version works, and which bits we needed to adapt to make the same, general, method work for this situation)

                               

                              ....So how do we adapt Nicco’s Sankey for your purposes? I’ll run through the basic method on how/why Nicco’s works and the bits we have to adapt for this situation….

                               

                              So as mentioned before we need to bring the data in twice...a “real” version and a “dummy” version (this is so we can create a series of numbers, using Tableau’s “domain padding” in which to run the curves over). We can do this with the Union feature. For simplicity you’ve just made a copy of the data on a second tab (in Excel), and called the original tab “real” and the copy “dummy” (we could, actually, just union the same tab twice, and use the fact the one is called “Table” and the other “Table1” in the “Table Name” field generated in the Union). So we just Union these together.

                               

                               

                               

                              Once we have the data in we need to make a few calcs…

                               

                              [toPad]

                              IF [Table Name] = 'Real' THEN 1 ELSE 49 END

                              And we then create a Bin from this field, called “padded”

                              As a “range aware” pill (a field which knows it’s MIN, MAX and increment..so a Bin or Date field), we can use this to Pad the domain. This means we can access (via a Table Calc) all the values between 1 and 49 (so 1,2,3...49).

                               

                              Then we have the Curve formula...you can understand this, or just look it up on Wikipedia when you need a curve (I'm in the 2nd option territory!!)

                              [Sigmoid]

                              1/(1+EXP(1)^-[t])

                               

                              We also have 3 further formulas….The first is Size

                              [Size]

                              RUNNING_AVG(COUNTD([Character ID + Episode]))

                              This is used to Size the Sankey lines to be proportional to the number of episodes each character is in.

                              This is the first change from Nicco’s version, which uses

                              RUNNING_AVG(SUM([Population end-2014]))...the difference is just due to fact we want to count episodes and his one SUMs a field (the population)

                               

                              NB. I created a [Character ID + Episode] calculated field, which just concatenates Character and Episode so we can run over the full grain of data. We need a Sankey line per Character ID and Episode.

                               

                              Then we also have 2 formulas which determine the Start and End Points of the Sankey curves (when combined)

                              [Rank 1]

                              RUNNING_SUM(COUNTD([Character ID + Episode]))/TOTAL(COUNTD([Character ID + Episode]))/2

                               

                              [Rank 2]

                              RUNNING_SUM(COUNTD([Character ID + Episode]))/TOTAL(COUNTD([Character ID + Episode]))/2

                              Both [Rank 1] and [Rank 2] are the same formula, but as Table Calculations, they are run-over slightly different levels, and so determine where a Sankey line starts (left-hand side), and where it finishes (right-hand side)...with the curve determining the path from one to another ([t] = -6 to [t] = 6...this is the [t] formula run when Padded = 1, then 2, then 3...49, although as a continuous field this is run over every 0.25...the number of these marks will determine how "smooth" you want the curves...for this purpose every 0.25 is sufficient).

                              This is again different from Nicco’s which used

                              RUNNING_SUM(sum([Population end-2014]))/TOTAL(sum([Population end-2014]))

                              Again as we are counting episodes/characters and he SUMing population.

                               

                              The general form of this formula is to make everything a %age of Total (in fact a running total of %age). This way we can fix the Axis between 0 and 1 (0% to 100%), and ensure the values are resized to always be between 0 and 1 (make it kind of responsive). We could use absolutes but would make the Sankey non-dynamic (eg. if the data was filtered, or we add new data, we’d need to perform some manual maintenance to define the start and end of the Y axis.

                               

                              Then we combine these to be the [Curve] formula

                              [Curve]

                              [Rank 1]+(([Rank 2] - [Rank 1])*[Sigmoid])

                               

                              The final formula we need is [t]

                              [t]

                              (index()-25)/4

                              This formula is accessed in the [Sigmoid]...and is what creates the curve from [t] = -6 to [t] = 6

                              So next bit is setting up the Sankey…

                              We create the pill arrangement exactly as per Nicco’s version.

                               

                              We then need to set up the Addressing/Partitioning on the Table Calculations. A few things to note first....

                               

                              We use the [Padded] bin on path. This means that Tableau completes the domain (this is due to the pill arrangement, the table calculations and their addressing/partitioning...see this video if you are interested in the murky world of Tableau Densification, in it’s many forms Tableau Request Live - Data Densification on Vimeo  ...please note this video is a little old, so some of the exact times when it does and doesn’t densify may have changed, but it’s generally still correct and gives a great insight into what ‘s going on)

                               

                              Only Table Calculations can access these densified points (so points between 2 and 48) which don’t “really” exist in the data.

                               

                              So the following Table Calculations are all just set to Compute Using = [Padded]

                              [Size], [t]

                               

                              The [Curve] table calculation set up is the mosty complicated...looking at the formula

                              [Rank 1]+(([Rank 2] - [Rank 1])*[Sigmoid])

                               

                              It contains 3 table calculations, and we set them up slightly differently. This is why they have been set up in 3 different formulas. This lets us run them independently (in terms of Addressing/Partitioning) from each other.

                               

                              [t] is set-up to compute using [Padded]

                              [Rank 1] is set up as follows

                              And [Rank 2]

                               

                              Notice the subtle difference in the [Rank 1] and [Rank 2] set up...the Episode and Character order is reversed.

                               

                              If you follow through all the calculations, this means that each Sankey line starts at a point on the Left Hand Side, which is the value (from 0 to 1) of where we want the Sankey line to start (we actually need a line for each Episode and Character, but by sorting by Character all the Character-lines are stacked on top of one another, to show their overall involvement, as a %age of all Character/Episode counts. And by sorting the Episodes the same happens right-hand side for Eposides)

                               

                               

                              After all this we need to do a bit of formatting, by making the [t] axis fixed to -5 to +5, and the [Curve] Axis from 0-1.

                               

                              To set up the 2 stacked bar charts either side of the Sankey curves, we set them up as follows. Where the COUNTD([Character ID + Episode]) is set up a Quick Table Calc “Percent of Total”, and Compute Using = [Character ID + Episode]...so this reflects the same order as the Sankey Curves (so again everything falls between 0 and 1)

                              And similar on the Episode bar chart, on the right-hand side

                              We then bring them all onto a Dashboard. We can then set up the following 2 actions

                              This one means by clicking in a Character, all the episodes they are in are highlighted (along with the Sankey lines to them)

                              And this one so when we click on an episode the Characters in that episode are highlighted (along with the Sankey lines to them)

                               

                              And that’s about it...simple ey?!!!

                               

                              Hope this helps you understand a bit of what's going on, and why...but if any bits don't make sense, or you want a bit if extra clarity let me know.

                              1 of 1 people found this helpful
                              • 12. Re: The Walking Dead - Sankey Viz
                                JP Kelly

                                Simon,

                                 

                                Thanks so much for the write up - it really helps me to make sense of it all, and I'm sure it'll be of use to many other tableau users.

                                 

                                I haven't done much with the viz since you originally helped me out with this (mostly because, as you pointed out, it's not the most effective type of viz for what I was trying to communicate) but I'll probably still do something with it at some point (FYI, I made this and this instead).

                                 

                                Given that there was clearly too much data in the one that I was working on (27 characters, 84 episodes), I think I'll have a go at making a tableau story with one page for each season (this will only be around 15 characters and 16 episodes at most). I'll post the result here if/when I get the time to do this!

                                 

                                Thanks again for the help and the write up.

                                • 13. Re: The Walking Dead - Sankey Viz
                                  Simon Runc

                                  hi JP,

                                   

                                  Glad you found it useful...and yes season by season will be more readable, but really like your 2 versions...really professional looking!

                                   

                                  While I was messing about with this, I did try another Viz-type (just got intrigued by the best way to show this)...I've posted here (see Dashboard 2). Could do with a bit more polishing, but is similar (in type) to your "TheWalkingDead-EpisodeandCharacterGuideSeasons1-6" one.

                                   

                                  Looking forward to the final version(s)...it's been quite a bit of fun, this one!!

                                  • 14. Re: The Walking Dead - Sankey Viz
                                    Alexander Mou

                                    Nice job Simon!

                                     

                                    Got into Sankey chart lately and found a way to save the dummy table.

                                    We don't need to duplicate the table any more.

                                    Vizible Difference: Creating Sankey Chart Made A Bit Easier

                                     

                                    Cheers.

                                    1 2 Previous Next