2 Replies Latest reply on Oct 31, 2018 6:22 AM by Gabe DeWitt

    Tabpy, SQL, and the Knapsack Problem - Combination Optimization inside Tableau

    Gabe DeWitt

      Here's some neat things I've pieced together that others might find helpful.

      You'll need to have set up Tabpy for these examples.


      Querying MS SQL Server with Tabpy

      I don't know much SQL, but have managed to figure a few things out.

      A neat ability within Tableau that I just recently became aware of is Performance Recording.

      Help > Settings and Performance > Start Performance Recording

      • Turn it on, do a few things that will pull the data from your SQL database, and turn it back off. Tableau then outputs a Dashboard of the results.
      • Put the Slider for "Show Events..." to '0'. This will then display all the queries Tableau made.
      • Select the Query of choice from the Events panel, and it's associated query will display in the Query panel.
      • Copy that Query.

      What's really especially great about this is that the query includes all of the complex If statements and logic that I used to create the calculated values in Tableau, and now i have them in a pure SQL format.

      Tableau Performance Recording.JPG


      From here I put that code into a python editor of choice. I mostly use Jupytar Notebook and Spyder.

      Here's the script I use to pull MS SQL Server data via python

      import pandas as pd
      import pyodbc
      server = 'SERVER NAME'
      db = 'DATABASE NAME'
      # Create the connection
      conn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + db + ';Trusted_Connection=yes')
      # query db
      sql = '''  
      df = pd.read_sql(sql, conn)


      Load the data into pandas, and viola, I have a dataframe of the same fields I calculated in Tableau to do some more advanced analysis stuff with in python.


      Aside -

      The initial reason I went down this path with Tabpy was for an Asset Management Dashboard I'm building. I wanted to be able to answer the question:

      "if I have a $5M dollar budget, what equipment should I spend it on?"

      I created a Priority Index based on the bucketed age of equipment past its replacement date, paired with an internal measure of priority (a value recorded in the database when the equipment was originally added). I used a parameterized modifier on both age priority and internal priority, allowing the combined Priority Index to vary based on a user's configurations of what he/she denotes as a higher weighting factor toward replacement priority.

      This worked well enough in Tableau, using a running sum to determine which and how many pieces of equipment to replace...but it would fall short of utilizing the whole budget, depending on the priory parameter modifier selections. Sometimes it would leave a lot left unspent if the next item on the index was a large value item (eg $1M), sending it over the running sum cap.


      So, I needed another way that could handle this with a more programmatic solution.

      I found some easy enough to understand Python that handles the Knapsack problem.


      Knapsack (Combination Optimization) with Tabpy

      For those not familiar with this problem, here are some wiki bits on the subject:

      The knapsack problem or rucksack problem is a problem in combinatorial optimization: Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. It derives its name from the problem faced by someone who is constrained by a fixed-size knapsack and must fill it with the most valuable items.

      As with many useful but computationally complex algorithms, there has been substantial research on creating and analyzing algorithms that approximate a solution. The knapsack problem, though NP-Hard, is one of a collection of algorithms that can still be approximated to any specified degree. This means that the problem has a polynomial time approximation scheme. To be exact, the knapsack problem has a fully polynomial time approximation scheme (FPTAS)


      After getting my code to work in python I had to make it work in Tableau...which was a bit tricky.

      Rather than get too long winded with explanations, here's the python code as it would appear in a calculated field in Tableau for a Boolean Result...

      I also made another version that returned the asset ID string, but found a simple Boolean response worked better (I included a note toward this in the below script).


      import pandas as pd
      import pyodbc
      server = 'ServerName'
      db = 'databaseName'
      # Create the connection
      conn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + db + ';Trusted_Connection=yes')
      # query db
      sql = '''
      SELECT pasted stuff from Performance Recording Query
      df = pd.read_sql(sql, conn)
      #reorder dataframe columns
      df= df[['AssetID','Cost','ReplacementLag', 'ReplacementPriority']]
      print df
      # mutiply by Tableau Parameters and then combine priority weighted values into one
      df['ReplacementWGT'] = (df['ReplacementLag']*max(_arg3)) + (df['ReplacementPriority']*max(_arg4))
      print df
      #reorder columns
      df= df[['AssetID','Cost', 'ReplacementWGT']]
      # fill NaN values with 0
      # Turn AssetID into String
      df['AssetID'] = df['AssetID'].apply(str)
      # Turn to Integers
      df['ReplacementWGT'] = df['ReplacementWGT'].astype(int)
      df['Cost'] = df['Cost'].astype(int)   
      # Sort dateframe by ReplacementWGT
      df=df.sort_values(by='ReplacementWGT', ascending=False)
      print df
      # Limit for Testing 
      ''' limit df for testing '''
      print df
      # turn df into tuples
      subset = df
      itemsList = [tuple(x) for x in subset.values]
      # Turns tuple list into tuple of tuples
      items = tuple(itemsList)
      print items
      # adding max() to budget parameter allowed this to work
      BUDGET= max(_arg2)
      # This is the start of the Knapsack script
          xrange = range
      def totalvalue(comb):
          ' Totalise a particular combination of items'
          totwt = totval = 0
          for item, wt, val in comb:
              totwt  += wt
              totval += val
          return (totval, -totwt) if totwt <= BUDGET else (0, 0)
      # Second part of the Knapsack scripts
      def knapsack01_dp(items, limit):
          table = [[0 for w in range(limit + 1)] for j in xrange(len(items) + 1)]
          for j in xrange(1, len(items) + 1):
              item, wt, val = items[j-1]
              for w in xrange(1, limit + 1):
                  if wt > w:
                      table[j][w] = table[j-1][w]
                      table[j][w] = max(table[j-1][w],
                                        table[j-1][w-wt] + val)
          result = []
          w = limit
          for j in range(len(items), 0, -1):
              was_added = table[j][w] != table[j-1][w]
              if was_added:
                  item, wt, val = items[j-1]
                  w -= wt
          return result
      # Output of the Knapsack Script 
      bagged = knapsack01_dp(items, BUDGET)
      print bagged
      # get list of solutions from Knapsack 
      lst1 = [i[0] for i in bagged]
      print lst1
      # get the list of Assets from Tableau Data
      lst2 = [i for i in _arg1]
      print lst2
      # Look for Knapsack solutions inside of Assets list
      lst3= []
      for i in lst2 :
          if i in lst1:
      ''' #if you want to return Real value instead of Boolean. You'll also need to modifying the initial entry Script from SCRIPT_BOOL to SCRIPT_REAL
      for i in lst2 :
          if i in lst1:
      return lst3
      ",ATTR([Asset Id (str)]), [Budget], [Replacement Lag Weight],[Replacement Priority Weight]     


      It took me a bit to figure out how Tabpy took values in, and how it liked to return them, but I'm feeling a little better now being able to get things to function how I initially imagined it.

      One of my main goals was to see if I could get Tableau Parameters to affect the python script.

      The trick to it was adding max() to the parameter input arguments.


      The only downfall to this solution is the solving time it takes for Tableau to return the results. Combination Optimization isn't a quick solution. 

      Also, when you change anything involving the Tabpy queries it runs the full python script again.

      As I increase the complexity of the problem (higher budget, more assets) it can take serval minutes to run and return Tableau visualized results.

      For UX, this is not a very good solution (from an executive user pov).


      After getting this all to work in Tableau, I feel there are some more efficient ways to return these same python based results as maybe endpoints, or resting lists in MS SQL Server (e.g. perhaps generated each morning via some python for multiple configurations of the Asset Replacement Dashboard parameters). I also plan to see what other python solutions for combination optimization exist, as to compare the resulting lists of selected assets (see which is doing a better job). I've read some things about the Numberjack python library that sounds promising... I'll try and play with it soon.


      I hope this helps a few people connect some dots with SQL, Tableau, and Python.