5 Replies Latest reply on Apr 11, 2014 11:57 AM by malcolm stanley

    Using R and Data Science Tool Kit to map IP addresses

    malcolm stanley

      My objective is to map physical locations using IP addresses as source data in Tableau.

       

      I have a log of several thousand entries of user sessions in a MySQL database.

      Each row in the database has a dotted quad IP Address in a field named [clientIP] .

      a data sample looks like this:

       

      clientIP

      208.54.39.177

      208.54.39.169

      208.54.4.141

      208.54.4.135

      ...

       

      In Tableau [clientIP] field is recognized as dimensional data.

       

      I have installed an R server and can connect to it successfully.

       

      On the R server I have installed the RDSTK package, which allows use of the Data Science Toolkit found at Data Science Toolkit.

      The Data Science ToolKit contains an IP address to Location function named ip2coordinates().

      in R, a sample output of this function is

       

      > ip2coordinates("8.8.8.8")

        ip.address dma_code latitude country_code3 area_code longitude  country_name

      1    8.8.8.8      807  37.4192           USA       650 -122.0574 United States

        postal_code region      locality country_code

      1       94043     CA Mountain View           US

       

      to create a single return value containing only  the latitude and longitude, you can enter the following:

      > paste(ip2coordinates("8.8.8.8")[c(3,6)], collapse= ',')

      [1] "37.4192008972168,-122.057403564453"

       

      Note that if I enter the IP address incorrectly, such as by using commas instead of periods in the dotted quad IP address, I get the following error:

      > paste(ip2coordinates("8,8,8,8")[c(3,6)], collapse= ',')

      Error in `[.data.frame`(ip2coordinates("8,8,8,8"), c(3, 6)) :

        undefined columns selected

       

      In my Tableau workbook, I have created a calculated field named calculatedCustomerLocation and in it I have entered the following:

      SCRIPT_STR( "require(RDSTK); paste(ip2coordinates(.arg1)[c(3,6)], collapse= ',')", [clientIP])

       

      Tableau refuses to accept this, giving an aggregation error. If I attempt to type the [clientIP] value as a string using STR([clientIP]), I get a similar aggregation error. If I attempt to type the [clientIP] value using ATTR([clientIP]), I get the following error:

       

      Error in `[.data.frame`(ip2coordinates(.arg1), c(3, 6)) :

        undefined columns selected

       

      I assume that I am failing to get a string value for [clientIP] into .arg1, or failing to pass that in an acceptable format to the R server.

       

      my questions: answers to either will be appreciated:

      a) how do I debug this? there appears to be no way to break this up into intermediate steps so I can see what Tableau is passing to the R server. There is no Tableau log I am aware of that will show me what the query looks like. Similiarly, the R server has no apparent method for logging the content of a remote call. Finally, the DSTK server is remote and no logs are available to look at. so I am at a loss to understand what value for the .arg1 field is being passed, in what format it is being passed, or how I am supposed to determine that. Any thoughts?

       

      b) how do I fix this? I cannot aggregate a string value; I am attempting to pass a text string to the R server and have it pass a text string back. it seems to me the SCRIPT_STR type is pretty useless if you cannot do this sort of thing, so my assumption is that I have some basic syntax incorrect, but cannot through experimentation discover what that might be. Can you help me understand what I am doing wrong?

       

      Thanks for reading...

        • 1. Re: Using R and Data Science Tool Kit to map IP addresses
          malcolm stanley

          UPDATE:

           

          I limited the query to a single IP address which I knew to be locatable by the DSTK.

          The query returned successfully.

           

          A) This implies that the error is being generated by IP addresses which are not found by the DSTK database.

          In order to correct this, I need some sort of error handling in the R query.
          Can anyone help me understand how to put a null value in the query response if it comes up empty?

           

          B) the location data returned successfully by the query does not register as a location in Tableau.

          How do I type this data and / or how should I format it so Tableau will recognize it as a location?

          • 2. Re: Using R and Data Science Tool Kit to map IP addresses
            malcolm stanley

            To answer B), above, break the query into two parts:

             

            calculatedCustomerLatitude =

            FLOAT(SCRIPT_STR( "require(RDSTK); paste(ip2coordinates(.arg1)[c(3)], collapse= ',')", ATTR([clientIP])))

             

            calculatedCustomerLongitude =

            FLOAT(SCRIPT_STR( "require(RDSTK); paste(ip2coordinates(.arg1)[c(6)], collapse= ',')", ATTR([clientIP])))

             

            Note that I am typing each as a FLOAT so that Tableau can then allow the assignment of a geographic role.

            Once this is in place, mapping can occur.

            however, many IP addresses still fail this query, so we need to stuff nulls or 0's in so we can avoid getting errored out.

            • 3. Re: Using R and Data Science Tool Kit to map IP addresses
              malcolm stanley

              so this is interesting:

               

              I added error handling like this:

               

              FLOAT(SCRIPT_STR( "require('RDSTK'); errorValue <- '0'; tryCatch(paste(ip2coordinates(.arg1)[c(3)]),error = function(e) errorValue);", ATTR([clientIP])))

               

              The outcome is pretty interesting: the error- handling works against a single value.

              when I run the sheet against an IP address that returns a valid value set, I get that.

              BUT I ONLY GET THE FIRST VALUE.

              all subsequent queries appear to return the returned value of the first query

               

              I tried adding a "rm('errorValue') to the query: that messed up the return value.

               

              so now I am very confused about how Tableau and the R server handling this query, how they store the variables, and how they release them.

              • 4. Re: Using R and Data Science Tool Kit to map IP addresses
                malcolm stanley

                The completed queries are:

                 

                calculatedLatitude:

                FLOAT(SCRIPT_STR( "require('RDSTK'); errorValue <- '0'; tryCatch(paste(ip2coordinates(.arg1)[c(3)]),error = function(e) errorValue);", ATTR([clientIP])))

                 

                and

                 

                calculatedLongitude:

                FLOAT(SCRIPT_STR( "require('RDSTK'); errorValue <- '0'; tryCatch(paste(ip2coordinates(.arg1)[c(6)]),error = function(e) errorValue);”, ATTR([clientIP])))

                • 5. Re: Using R and Data Science Tool Kit to map IP addresses
                  malcolm stanley

                  Yeah, still does not work entirely correctly;

                   

                  what I ended up doing was writing a native R script that would query the DSTK directly FROM R and not from Tableau, then export the data into a .csv that can be imported back into Tableau.

                   

                  I recently graduated from that to a PHP script that pulls the IP addresses directly from the database and then pushes the results back into a separate table on the database server.

                   

                  essentially the R integration is not useful here as I cannot make the calculated field value work properly for each row.

                  If you can make it work, it *would* be useful, but I found it easier to just go around the problem in the end...