5 Replies Latest reply on Feb 12, 2019 2:59 PM by Kaz Shakir

    Multiple left self joins from .csv and min date subqueries

    Amy Pielow

      I have a dataset I need to find related rows on, with the data coming from .csv that I want to put in Tableau. I wrote code in SAS to do so, but I now need to rewrite the operation to only be done in Tableau, and I am having trouble writing the custom SQL to do. I can't use the built in joins because I need more than just = relationships, which is the only available option with the legacy connector.

       

      For context, each row of my data represents a call from an account. What I need to do is take each row then find if there exists another row (call) that matches on the account. Also, I have additional columns that denote call context (1 through 4). End result is looking for calls that match on any of these levels. I want to keep all calls and add offset columns with data about the next call that matches these account and context criteria.

       

      In SAS, my steps were

      1. Inner self join on my call_database on t1.account number = t2.account number, where I keep all t1 columns and t2 columns for record number, date, and account number

      2. Sort the above table to only keep one row for each call, the row where the t2 data corresponds to the first call after the t1 row's call

      3. Do steps 1 and 2 for just matching accounts, then for matching accounts and context 1, ..., through to matching accounts, context 1 through context 4

       

      So now I would have have 5 tables--my initial set of all calls and tables with info on a call and it's next call based on certain criteria. I then join all these tables back together, using left joins with the call_database as the leftmost/base table using the record number to join. This leaves me the same table I started with that has new columns with data on the next call with matching criteria.

       

      I am really struggling to migrate this to Tableau. I can join two tables (the base and the account matches) but cannot figure out the syntax for multiple joins. Additionally, I can't figure out how to only keep one row which corresponds to the next call. This is important because my data contains ~1 million rows, and that blows up without doing the preprocessing step. I've got pieces that run but haven't been able to put it all together successfully.

       

      Does this even seem possible with Tableau? Any direction on doing something complicated in a single step would be appreciated.

       

       

      This is what my SAS code looks like

       

       

      PROC SQL;

         CREATE TABLE WORK._MATCH_ACCT_NUM AS

         SELECT t1.*,

                /* Days_Between */

                  (datdif(datepart(t1.openeddate),datepart( t2.openeddate),'act/act')) AS Days_Between

            FROM WORK.call_database t1

                 Inner JOIN WORK.call_database t2 ON (t1.AccountNumber = t2.AccountNumber)

      WHERE (CALCULATED Days_Between) BETWEEN -7 AND 0 AND t1.OpenedDate >  t2.OpenedDate;

      QUIT;

       

      proc sort data=_acct_match_dates out=final_acct_match nodupkey;

         by accountnumber OpenedDate;

      run;

       

      PROC SQL;

         CREATE TABLE WORK._ACCT_MATCH_C1 AS

         SELECT t1.*,

                /* Days_Between */

                  (datdif(datepart(t1.openeddate),datepart( t2.openeddate),'act/act')) AS Days_Between

            FROM WORK.call_database t1

                 Inner JOIN WORK.call_database t2 ON (t1.AccountNumber = t2.AccountNumber) AND (t1.ContactType1 = t2.ContactType1)

      WHERE (CALCULATED Days_Between) BETWEEN -7 AND 0 AND t1.OpenedDate >  t2.OpenedDate;

      QUIT;

       

      proc sort data=_ACCT_MATCH_C1 out=FINAL_ACCT_MATCH_C1 nodupkey;

         by accountnumber contacttype1 OpenedDate;

      run;

       

      PROC SQL;

         CREATE TABLE WORK.FINAL_MATCHED_CALLS AS

         SELECT t1.*,

                /* Calls with Account Matching  */

                  (case

                  when missing(t2.RecordNumber) then .

                  else t2.RecordNumber1

                  end) AS Acct_Next_Call,

        (case

                  when missing(t2.RecordNumber) then .

                  else t2.OpenedDate1

                  end) AS Acct_Next_Call_Date format=datetime18.,

      (case

                  when missing(t2.RecordNumber) then .

                  else t2.Days_Between

                  end) AS Acct_Next_Call_Days_Btwn,

        /* Next_Call_Record_Number_C1 */

                  (case

                  when missing(t3.RecordNumber) then .

                  else t3.RecordNumber1

                  end) AS Acct_C1_Next_Call,

        (case

                  when missing(t3.RecordNumber) then .

                  else t3.OpenedDate1

                  end) AS Acct_C1_Next_Call_Date format=datetime18.,

      (case

                  when missing(t3.RecordNumber) then .

                  else t3.Days_Between

                  end) AS Acct_C1_Next_Call_Days_Btwn,

      /* CALC CLOSEST LEVEL OF MATCH */

                  (case

                  when not missing(t3.RecordNumber) then t3.RecordNumber1

                  when not missing(t2.RecordNumber) then t2.RecordNumber1

      else .

                  end) AS Acct_Max_Next_Call,

            FROM WORK.call_database t1

                 LEFT JOIN WORK.FINAL_ACCT_MATCH t2 ON (t1.RecordNumber = t2.RecordNumber)

                 LEFT JOIN WORK.FINAL_ACCT_MATCH_C1 t3 ON (t1.RecordNumber = t3.RecordNumber)

            ORDER BY t1.AccountNumber,

                     t1.OpenedDate;

      QUIT;

       

      This is custom SQL that only joins based on account, but keeps all matching rows. Any code for doing the multiple joins has not worked.

       

       

      SELECT [BaseCallData].[AccountNumber] AS [AccountNumber],

        [BaseCallData].[ContactType1] AS [ContactType1],

        [BaseCallData].[ContactType2] AS [ContactType2],

        [BaseCallData].[ContactType3] AS [ContactType3],

        [BaseCallData].[ContactType4] AS [ContactType4],

        [BaseCallData].[OpenedDate] AS [OpenedDate],

        [BaseCallData].[RecordNumber] AS [RecordNumber],

        [ACCTJOIN].[AccountNumber] AS [ACCTNMBER2],

        Format([ACCTJOIN].[OpenedDate],"mm/dd/yyyy") as [Acctjoin_OD],

        Format([BaseCallData].[OpenedDate],"mm/dd/yyyy") as [BASE_OD],

        DateDiff("d",[Acctjoin_OD],[BASE_OD]) as [Acct_OD2],

        [ACCTJOIN].[OpenedDate] AS [Acct_OpenedDate]

      FROM [Jan19_calls#csv] [BaseCallData] LEFT JOIN [Jan19_calls#csv] [ACCTJOIN]

      ON [BaseCallData].[AccountNumber] = [ACCTJOIN].[AccountNumber]

      WHERE [ACCTJOIN].[OpenedDate] >= [BaseCallData].[OpenedDate]

      AND  DateDiff("d", Format([ACCTJOIN].[OpenedDate],"mm/dd/yyyy"),Format([BaseCallData].[OpenedDate],"mm/dd/yyyy") ) >= -7

        • 1. Re: Multiple left self joins from .csv and min date subqueries
          Kaz Shakir

          Amy,

          It would be helpful if you could provide a little bit more information about what you are trying to accomplish - in other words, what will the final dashboard/visualization look like, and what sorts of information is required to be conveyed?  It seems from reading your post, that it has something to do with tracking the amount of time between similar type calls made on each account.  There might be different ways to tackle your problem in Tableau, but we really need some more information to be able to better understand.

           

          In the mean time, I did spend some time reviewing the SQL you posted, and I'm wondering if you might be able to do something like the SQL below, to address your query.  Naturally, since I don't have a dataset on which to test this, I don't know if it will execute, but perhaps it will give you some helpful ideas:

           

          SELECT

               [BaseCallData].[AccountNumber] AS [AccountNumber],

               [BaseCallData].[ContactType1] AS [ContactType1],

               [BaseCallData].[ContactType2] AS [ContactType2],

               [BaseCallData].[ContactType3] AS [ContactType3],

               [BaseCallData].[ContactType4] AS [ContactType4],

               [BaseCallData].[OpenedDate] AS [OpenedDate],

               [BaseCallData].[RecordNumber] AS [RecordNumber],

               [FINAL_ACCT_MATCH].[RecordNumber] AS [ACCT_NEXT_CALL],

               [FINAL_ACCT_MATCH].[OpenedDate] AS [ACCT_NEXT_CALL_DATE],

               [FINAL_ACCT_MATCH].[Days_Between] AS [ACCT_NEXT_CALL_DAYS_BETWEEN],

               [FINAL_ACCT_MATCH_C1].[RecordNumber] AS [ACCT_C1_NEXT_CALL],

               [FINAL_ACCT_MATCH_C1].[OpenedDate] AS [ACCT_C1_NEXT_CALL_DATE],

               [FINAL_ACCT_MATCH_C1].[Days_Between] AS [ACCT_C1_NEXT_CALL_DAYS_BETWEEN]

          FROM [Jan19_calls#csv] [BaseCallData]

          LEFT OUTER JOIN (

               SELECT

                    [BaseCallData].[AccountNumber] AS [AccountNumber],

                    [BaseCallData].[ContactType1] AS [ContactType1],

                    [BaseCallData].[ContactType2] AS [ContactType2],

                    [BaseCallData].[ContactType3] AS [ContactType3],

                    [BaseCallData].[ContactType4] AS [ContactType4],

                    [BaseCallData].[OpenedDate] AS [OpenedDate],

                    [BaseCallData].[RecordNumber] AS [RecordNumber],

                    [ACCTJOIN].[AccountNumber] AS [ACCTNMBER2],

                    Format([ACCTJOIN].[OpenedDate],"mm/dd/yyyy") as [Acctjoin_OD],

                    Format([BaseCallData].[OpenedDate],"mm/dd/yyyy") as [BASE_OD],

                    DateDiff("d",[Acctjoin_OD],[BASE_OD]) as [Acct_OD2],

                    [ACCTJOIN].[OpenedDate] AS [Acct_OpenedDate]

               FROM [Jan19_calls#csv] [BaseCallData]

               INNER JOIN [Jan19_calls#csv] [ACCTJOIN]

                                        ON ([BaseCallData].[AccountNumber] = [ACCTJOIN].[AccountNumber])

               WHERE [ACCTJOIN].[OpenedDate] >= [BaseCallData].[OpenedDate]

                              AND  DateDiff("d", Format([ACCTJOIN].[OpenedDate],"mm/dd/yyyy"),Format([BaseCallData].[OpenedDate],"mm/dd/yyyy") ) >= -7

               ) [FINAL_ACCT_MATCH] ON ([BaseCallData].[RecordNumber] = [FINAL_ACCT_MATCH].[RecordNumber])

          LEFT OUTER JOIN (

               SELECT

                    [BaseCallData].[AccountNumber] AS [AccountNumber],

                    [BaseCallData].[ContactType1] AS [ContactType1],

                    [BaseCallData].[ContactType2] AS [ContactType2],

                    [BaseCallData].[ContactType3] AS [ContactType3],

                    [BaseCallData].[ContactType4] AS [ContactType4],

                    [BaseCallData].[OpenedDate] AS [OpenedDate],

                    [BaseCallData].[RecordNumber] AS [RecordNumber],

                    [ACCTJOIN].[AccountNumber] AS [ACCTNMBER2],

                    Format([ACCTJOIN].[OpenedDate],"mm/dd/yyyy") as [Acctjoin_OD],

                    Format([BaseCallData].[OpenedDate],"mm/dd/yyyy") as [BASE_OD],

                    DateDiff("d",[Acctjoin_OD],[BASE_OD]) as [Acct_OD2],

                    [ACCTJOIN].[OpenedDate] AS [Acct_OpenedDate]

               FROM [Jan19_calls#csv] [BaseCallData]

               INNER JOIN [Jan19_calls#csv] [ACCTJOIN]

                                   ON (([BaseCallData].[AccountNumber] = [ACCTJOIN].[AccountNumber]) AND

                                             ([BaseCallData].[ContactType1] = [ACCTJOIN].[ContactType1]))

               WHERE [ACCTJOIN].[OpenedDate] >= [BaseCallData].[OpenedDate]

                              AND  DateDiff("d", Format([ACCTJOIN].[OpenedDate],"mm/dd/yyyy"),Format([BaseCallData].[OpenedDate],"mm/dd/yyyy") ) >= -7

               ) [FINAL_ACCT_MATCH_C1] ON ([BaseCallData].[RecordNumber] = [FINAL_ACCT_MATCH_C1].[RecordNumber])

          ORDER BY

               [BaseCallData].[AccountNumber],

               [BaseCallData].[OpenedDate]

           

          Hope that helps.  Please let us know any additional details, and I'm sure someone here will be able to help.

          • 2. Re: Multiple left self joins from .csv and min date subqueries
            Amy Pielow

            Your guess at the goal is correct, I've got ~5 dashboards (20ish sheets) that track time between calls for an account, account + contact type, etc. I also look at what percent of calls are follow up calls within a week, overall call patterns based on topic, call patterns based on who took the call, and a view other things. I initially built the dashboard trying to identify these "follow up" or "repeat" calls within Tableau with table indices and LOD calcs, but I needed to do some calculations like percent of calls that couldn't rely on lookups, so I moved the joins/associations between calls to the dataset itself.

             

            I was able to modify what you gave me to work for my dashboard, thank you for the help!

             

            I may still modify it to try to keep only the first row based on record number during the customSQL. I'd like to cut down on my row size (all those outer joins cause a combinatoric explosion!) since I only care about the *next* call, not all subsequent calls, but if nothing else I know how to do that with a context filter.

            • 3. Re: Multiple left self joins from .csv and min date subqueries
              Kaz Shakir

              Amy,

              I'm glad that my notes were able to provide some helpful guidance.  Thanks for sharing a little more about the dashboards you are trying to create.  I think I'm getting a better idea of what you are trying to accomplish.  It seems like I faced a similar task not long ago (although, not nearly as involved as yours).  One of the approaches I took in my task, was to "build a column" for the last call, and the penultimate call (or previous call).  This made for a complicated query, but it did help to reduce the number of rows, and then allowed some of the calculations to be pushed back to Tableau.

               

              Just as an example, I tried to use your datasource labels to build a similar query.  Please take a look at the attached, and perhaps you can adapt it to your needs.

               

              Kaz.

               

              p.s., I was also able to create a mock-up based on the information you have provided so far, just to give you an idea of how this might look.  I didn't use a CSV file, but rather, I used an excel file, and used the "Legacy Connector" so that I could use custom sql with it.  I'm attaching the packaged workbook, hopefully you can open it and see what I've done.  I'm using version 2018.1.1 of Tableau.

               

              SELECT

                   [BaseCallData].[AccountNumber] AS [AccountNumber],

                   [BaseCallData].[ContactType1] AS [ContactType1],

                   [BaseCallData].[ContactType2] AS [ContactType2],

                   [BaseCallData].[ContactType3] AS [ContactType3],

                   [BaseCallData].[ContactType4] AS [ContactType4],

                   [BaseCallData].[OpenedDate] AS [OpenedDate],

                   [BaseCallData].[RecordNumber] AS [RecordNumber],

                   [LastCall].[LastCallDate] AS [LastCallDate],

                   [PreviousCall].[PreviousCallDate] AS [PreviousCallDate],

                   DATEDIFF("d", FORMAT([LastCall].[LastCallDate], "mm/dd/yyyy"), FORMAT([PreviousCall].[PreviousCallDate], "mm/dd/yyyy")) AS           [ACCT_NEXT_CALL_DAYS]

              FROM [Jan19_calls#csv] [BaseCallData]

              LEFT OUTER JOIN (

                   -- this sub-query will give you the most recent call that was made to

                   -- each account

                   SELECT

                        [BaseCallData].[AccountNumber] AS [AccountNumber],

                        MAX([BaseCallData].[OpenedDate]) AS [LastCallDate]

                   FROM [Jan19_calls#csv] [BaseCallData]

                   GROUP BY [BaseCallData].[AccountNumber]

                   ) [LastCall] ON ([BaseCallData].[AccountNumber] = [LastCall].[AccountNumber])

              LEFT OUTER JOIN (

                   -- this sub-query will give you the second most recent call that was

                   -- made to each account

                   SELECT

                        [SubQuery2].[AccountNumber] AS [AccountNumber],

                        MAX([SubQuery2].[OpenedDate]) AS [PreviousCallDate]

                   FROM (

                        SELECT

                             [BaseCallData].[AccountNumber] AS [AccountNumber],

                             [BaseCallData].[ContactType1] AS [ContactType1],

                             [BaseCallData].[ContactType2] AS [ContactType2],

                             [BaseCallData].[ContactType3] AS [ContactType3],

                             [BaseCallData].[ContactType4] AS [ContactType4],

                             [BaseCallData].[OpenedDate] AS [OpenedDate],

                             [BaseCallData].[RecordNumber] AS [RecordNumber]

                        FROM [Jan19_calls#csv] [BaseCallData]

                        LEFT OUTER JOIN (

                             -- this sub-query will give you the most recent call that was made to

                             -- each account

                             SELECT

                                  [BaseCallData].[AccountNumber] AS [AccountNumber],

                                  MAX([BaseCallData].[OpenedDate]) AS [LastCallDate]

                             FROM [Jan19_calls#csv] [BaseCallData]

                             GROUP BY [BaseCallData].[AccountNumber]

                        ) [LastCall] ON (([BaseCallData].[AccountNumber] = [LastCall].[AccountNumber} AND

                                                      ([BaseCallData].[OpenedDate] = [LastCall].[LastCallDate]))

                   WHERE [LastCall].[LastCallDate] IS NULL

                   ) [SubQuery2]

                   GROUP BY [SubQuery2].[AccountNumber]

                   ) [PreviousCall] ON ([BaseCallData].[AccountNumber] = [PreviousCall].[AccountNumber])

              • 4. Re: Multiple left self joins from .csv and min date subqueries
                Amy Pielow

                Wow, thank you for all the effort and attention.

                 

                Could you please explain to me the logic behind this inner subquery?

                 

                LEFT OUTER JOIN (

                SELECT

                [BaseCallData].[AccountNumber] AS [AccountNumber],

                MAX([BaseCallData].[OpenedDate]) AS [LastCallDate]

                FROM [Tickets By Date Range Detailed for CWM_Jan19#csv] [BaseCallData]

                GROUP BY [BaseCallData].[AccountNumber]

                ) [LastCall] ON ([BaseCallData].[AccountNumber] = [LastCall].[AccountNumber] AND

                ([BaseCallData].[OpenedDate] = [LastCall].[LastCallDate]))

                WHERE [LastCall].[LastCallDate] IS NULL

                 

                What is the impact of the WHERE [LastCall].[LastCallDate] IS NULL?

                • 5. Re: Multiple left self joins from .csv and min date subqueries
                  Kaz Shakir

                  I'm glad to help - this stuff is fun for me to think through.

                   

                  That subquery is a little confusing.  The idea behind that one is to eliminate all the rows that are related to the last call.  So what you are left with is everything BUT the last call for each account.  Then when the SELECT clause takes the MAX of the OpenedDate, you are effectively getting the penultimate, or next to last, call.

                   

                  But in the portion of the query you pasted in your email, you left out the portion where the WHERE clause actually applies.  You need to look at this whole section:

                   

                       SELECT

                            [BaseCallData].[AccountNumber] AS [AccountNumber],

                            [BaseCallData].[ContactType1] AS [ContactType1],

                            [BaseCallData].[ContactType2] AS [ContactType2],

                            [BaseCallData].[ContactType3] AS [ContactType3],

                            [BaseCallData].[ContactType4] AS [ContactType4],

                            [BaseCallData].[OpenedDate] AS [OpenedDate],

                            [BaseCallData].[RecordNumber] AS [RecordNumber]

                       FROM [Jan19_calls#csv] [BaseCallData]

                       LEFT OUTER JOIN (

                            SELECT

                                 [BaseCallData].[AccountNumber] AS [AccountNumber],

                                 MAX([BaseCallData].[OpenedDate]) AS [LastCallDate]

                            FROM [Jan19_calls#csv] [BaseCallData]

                            GROUP BY [BaseCallData].[AccountNumber]

                            ) [LastCall] ON (([BaseCallData].[AccountNumber] = [LastCall].[AccountNumber} AND

                                                          ([BaseCallData].[OpenedDate] = [LastCall].[LastCallDate]))

                       WHERE [LastCall].[LastCallDate] IS NULL

                   

                  So, the inner most SELECT clause is pretty easy to understand: the result from that clause will be a "table" (that I call [LastCall]) that has two columns, AccountNumber and LastCallDate.  And because of the MAX (which is an aggregation) around OpenedDate, there can only be one result per AccountNumber; and the GROUP BY clause will make sure that every row of that subquery result contains a distinct AccountNumber.

                   

                  Now, with that inner SELECT result in mind, look at the outer SELECT.  It basically says: take your BaseCallData table, and join it with a table that has two columns: AccountNumber and LastCallDate.

                   

                       SELECT

                            [BaseCallData].[AccountNumber] AS [AccountNumber],

                            [BaseCallData].[ContactType1] AS [ContactType1],

                            [BaseCallData].[ContactType2] AS [ContactType2],

                            [BaseCallData].[ContactType3] AS [ContactType3],

                            [BaseCallData].[ContactType4] AS [ContactType4],

                            [BaseCallData].[OpenedDate] AS [OpenedDate],

                            [BaseCallData].[RecordNumber] AS [RecordNumber]

                       FROM [Jan19_calls#csv] [BaseCallData]

                       LEFT OUTER JOIN ( ... ) [LastCall] ON (([BaseCallData].[AccountNumber] = [LastCall].[AccountNumber} AND

                                                                                        ([BaseCallData].[OpenedDate] = [LastCall].[LastCallDate]))

                       WHERE [LastCall].[LastCallDate] IS NULL

                   

                  And the join is going look for rows where it can match both the AccountNumber up in both tables, and it's also going to try and match the LastCallDate to the OpenedDate.  But, since there is only one call for each account at that has that specific date/time (the date/time indicated in the field LastCallDate), it will only match one row in the BaseCallData table for each account.  Then, because we are using a LEFT OUTER JOIN, all of the rows from the BaseCallData table will still be there, but those where LastCallDate did not match with OpenedDate, will have a LastCallDate field that is NULL.  Finally, applying the WHERE clause to that result, means that the only rows we keep are the rows that are NOT the last call for each AccountNumber (the rows where LastCallDate did not match OpenedDate).

                   

                  That result of that sub-query is called [SubQuery2] in my example.  (I know, very creative, right?).  And it contains the same columns as the BaseCallData table, but has fewer rows.  The rows that were eliminated were the rows that contain the very last call for each account.  So, now when we execute the next part of the query (the ... in the SQL below is replaced by the sub-query we just discussed, shown above):

                   

                  SELECT

                       [SubQuery2].[AccountNumber] AS [AccountNumber],

                       MAX([SubQuery2].[OpenedDate]) AS [PreviousCallDate]

                  FROM ( ... ) [SubQuery2]

                  GROUP BY [SubQuery2].[AccountNumber]

                   

                  you can see that it's basically the same query we used to extract the last call for each account, but, now, since we eliminated the rows with the last call for each account, we are going to get a resulting table that has the next-to-last call for each account.

                   

                  Does that help make that clear?  Obviously, you can keep going, creating deeper and deeper sub-queries if you want to extract more calls, but the SQL will get really hard to read at that point.

                   

                  Kaz.

                   

                  p.s., I should mention that I noticed I have some typos in the SQL I posted in the last email, so please be careful with it.  For example, in one spot, at the end of [AccountNumber, I typed a "}" instead of a "])".