# Sankey Diagrams

**christopher.hawk**Nov 19, 2014 7:43 AM

Thanks to our meeting I went down the rabbit hole that is Sankey diagrams or graphs. I have to say it is a challenge to produce these graphs in any format let alone Tableau. I missed the name of the woman that showed us her version. But I can say I have both regretted it and appreciated it over the past day. There are some good discussions both in the Tableau user community and in some other excellent blogs.

Here is my first result. I'm still playing with details like labeling and colors but it is an actual result. And it is pretty cool.

The key learning over the last 2 days is that data prep is a primary requirement. I jumped in and built a Sankey diagram with an intermediate point. So each path has a starting location, an intermediate location, and a final location, positions 1, 2 and 3. Data needs to be generated to produce the curve of the path between these points, and to produce the ribbon effect.

First I needed to lay out the vertical positions I want. This was done in Excel using a running percentage total after sorting the data set on each category.

Producing the curve between the points is pretty straight forward. Tableau's primary limitation though is that the width of a line cannot be explicitly defined based on data content. If I want a line path that is 42% of the width of the plot area I cannot make it that explicit. The resulting view can be useful but is not the true Sankey view.

To get around the line width limitations, the standard response is to create a line for each percentage, essentially 100 different lines that follow their own path. My sample data set had 40 distinct paths through 2 transitions. The basic data set is pretty straight forward, a path defined by three categories, and the could of units passing through that path.

The first data prep step is to create the curve using the Sigmoid function. What this means in data prep terms is that each record has to be repeated with a newly created distinct value of T. My example uses T values from -6 to 6 incrementing every .25 for a row multiplier factor of 49. My first attempt was to cut/paste 49 times with a new T value. That got very boring fast so I decided to create a separate data set with the 49 T values and right join in tableau to create those records. On a new tab I created the 49 row data set for T values and joined to it.

Next I needed to spread the rows vertically to produce the width in the ribbons. My data at this point (after join) is one data row repeated 49 times to create a path. I now need to replicate the set of records defining a path to be as many paths as needed to fill the width of the ribbon. I defined the width in the data set using excel formulas (PctStart and PctEnd below). This is the width of each ribbon if the Y axis runs from 0 to 100.

Then because I hate cut/paste I used the same right join method to create the added rows. Tableau doesn't allow joins like "ON pct1 WHERE pct1 BETWEEN lowlimit and highlimit" and I couldn't figure out a join that actually worked. So I ended up adding all 100 percentages (100 rows) as new rows. My 40 row data set has now become 196,000 rows.

The join values are a constant of 1.

With the data prep complete I had to build out the calculated fields. First I built the Sigmoid function modified T value.

1/(1+EXP(1)^-[T])

Then I built the positions for each path. What I needed was the planned position from the original 40 paths. I can't use the psoitions explicitly, bu the offset from the start. I used the percentage value obtained by the linking to the percent dataset plus offsets for positions 2 and 3. In this way each produced path is identical in curve and shape, but simply offset from the original.

The curve calculation Is a bit trickier but is straight forward once you understand the direction changes.

IF [Pos 2]>[Pos 1] then

MIN([Pos 1],[Pos 2])+([Sigmoid Function]*ABS([Pos 2]-[Pos 1]))

ELSE MAX([Pos 1],[Pos 2])-([Sigmoid Function]*ABS([Pos 2]-[Pos 1]))

END

This is repeated for position 2 to 3.

Finally the piece I tripped on was to be sure to display the dimension T as well as the sum(T). Here is what one of the transitions looks like in design.

When I get a chance I will try to mask the data and package this for public consumption. For now I hope this is enough to get people started trying this on their own. It is a good viz, but tricky to set up and requires a lot of data prep. I have uses for it, but some real cost/benefit questions.