6 Replies Latest reply on Jun 22, 2016 2:15 PM by Tom W

    Extract source code?

    Vincent Baumel

      Is there any API functionality that would allow me to go to a specified URL (or set of URLs) and extract a particular line of the source code?

        • 1. Re: Extract source code?
          Jeff Strauss

          I don't think so, but can you clarify what you're trying to do?

          • 2. Re: Extract source code?
            Tom W

            It sounds like you want to scrape data from a website?

            • 3. Re: Extract source code?
              Vincent Baumel

              The organization NASPO has a website that includes a map, and when you click on a state it brings you to a page with a little biography of the NASPO contact for that state. The profile pages all have very similar URLs (the last two characters are the state abbreviation), and when I view the source code I see that the contact name, title, phone number and fax number are all contained on line 588. I've got sales data for each state in the US, and currently I've written and blended a small spreadsheet with all of the NASPO contact data to display, depending on which state the user selects. Once in a while a new person takes over a position, and rather than habitually going through each state, I'd like a procedure/script that does the following:

              • (user selects a state abbreviation from a list of parameters)
              • go to the previously mentioned URL, replacing the last 2 characters with the parameter value (similar to a URL action)
              • Locate line 588 in the page source code
              • parse out the relevant string fields and assign them to variables so that my viz can display the correlated contact data

               

              I'm not much of a developer at all, but I'm really excited to see and learn some of what's possible. Do you think this type of data scraping could be done?

              • 4. Re: Extract source code?
                Tom W

                Is it possible? Absolutely yes.

                Can Tableau do this in its entirety? Definitely not.

                 

                The easiest thing within Tableau would be to add a field to your dataset that holds the address of the page containing the information, e.g. http://www.naspo.org/States/Profiles/state/TX. You could then hyperlink to it from within Tableau, or even display it with a Dashboard action within a Webpage Part.
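
                As a rough illustration of that URL field, here is a minimal Python sketch that builds the profile address from a state abbreviation. The base URL is the one quoted above; the function name profile_url is made up for this example, and the sketch assumes every state page follows the same pattern.

```python
# Base address taken from the example URL in this thread; assumed to be
# the same for every state, with only the trailing abbreviation changing.
BASE_URL = "http://www.naspo.org/States/Profiles/state/"


def profile_url(state_abbr):
    """Return the profile page address for a two-letter state abbreviation.

    profile_url is a hypothetical helper name used only for illustration.
    """
    abbr = state_abbr.strip().upper()
    if len(abbr) != 2 or not abbr.isalpha():
        raise ValueError("expected a two-letter state abbreviation, got %r" % state_abbr)
    return BASE_URL + abbr


# Example: build the value you would store in the extra dataset field.
print(profile_url("tx"))
```

                The same concatenation could equally be done as a calculated field or in the spreadsheet itself; the point is only that the address is the fixed prefix plus the state code.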

                 

                For anything more complex, you would need to run a script or process outside of Tableau that creates the dataset (or adds to the existing dataset) so Tableau can read it directly. As for how to build that external process, this isn't really the right forum for it, especially given that you aren't a programmer. I will tell you this much, though: it looks highly scrapable to me, so for someone with scripting experience it wouldn't be hard to do. You can either jump down the rabbit hole and start learning a scripting language, or engage someone with the skills to do this, for example by hiring someone on freelancer.com.

                • 5. Re: Extract source code?
                  Vincent Baumel

                  Thanks, Tom. I figured it wouldn't be tough for someone with the skills to do it. If I can conceptualize how it would work, it's just a matter of knowing the syntax to do it. I'll look into what options could automate this for me. I appreciate the insight!

                  • 6. Re: Extract source code?
                    Tom W

                    To do it yourself, you would need an understanding of HTML, perhaps some JavaScript, and then some type of scripting language to actually pull it all together.

                    As an example, you've referenced a line number from the HTML, but that's not really a good basis: the developer might add lines to their code, and things change frequently. You need to anchor onto pieces of information within the document, and that's easiest with CSS class lookups.
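
                    To make the class-lookup idea concrete, here is a minimal sketch using only Python's standard library. The sample HTML and the class names (contact-name, contact-phone) are invented for illustration; the real page will use different markup, so treat this as the shape of the approach rather than something that works on the NASPO site as-is.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that carries a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # > 0 while we are inside a matching element
        self.results = []     # one cleaned-up string per matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1   # a nested tag inside the matching element
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace so the result is a clean string.
            self.results.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def extract_by_class(html, class_name):
    """Return the text of each element whose class list contains class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical markup standing in for a state profile page.
SAMPLE = """
<html><body>
<div class="sidebar">Unrelated text</div>
<div class="contact">
  <span class="contact-name">Jane Doe</span>
  <span class="contact-phone">555-0100</span>
</div>
</body></html>
"""

print(extract_by_class(SAMPLE, "contact-name"))
```

                    Because the lookup keys on the class rather than on a position in the file, it keeps working if lines are added above or below the contact block. It only breaks if the class name itself changes.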

                     

                    At a high level, you need to run a script that:

                    • Extracts a list of states from the main page
                      • Gets the URL for each state
                      • For each URL, looks up the page
                        • Within the HTML returned for the page, looks for an item to anchor on in order to isolate the contact pane
                        • Finds the cleanest bit of the code it can isolate, reducing the amount of noise
                        • Further cleans up that chunk of text until it is refined to exactly what you need
                        • Outputs that chunk or stores it in a file
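
                    The steps above could be sketched roughly as follows in Python, using only the standard library. Several things here are assumptions: the class names in FIELD_CLASSES are hypothetical placeholders for whatever anchors the real page offers, the list of states is passed in rather than extracted from the main page, and the parser assumes each field sits in a simple element with no nested tags.

```python
import csv
import urllib.request
from html.parser import HTMLParser

# URL pattern quoted earlier in this thread.
BASE_URL = "http://www.naspo.org/States/Profiles/state/"

# Hypothetical class names mapped to output columns; inspect the real
# page source to find the actual anchors before relying on these.
FIELD_CLASSES = {
    "contact-name": "name",
    "contact-title": "title",
    "contact-phone": "phone",
}


class ContactParser(HTMLParser):
    """Store the text of each element whose class is listed in FIELD_CLASSES.

    Simplification: any closing tag ends the current field, so this only
    suits fields held in simple elements without nested tags.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # name of the field being read, if any

    def handle_starttag(self, tag, attrs):
        for cls in (dict(attrs).get("class") or "").split():
            if cls in FIELD_CLASSES:
                self._current = FIELD_CLASSES[cls]
                self.fields[self._current] = ""

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


def fetch(url):
    """Download a page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_states(states, fetch_page=fetch):
    """Return one dict per state with the fields found on its profile page."""
    rows = []
    for state in states:
        parser = ContactParser()
        parser.feed(fetch_page(BASE_URL + state.upper()))
        row = {"state": state.upper()}
        # Collapse stray whitespace so each value is a clean string.
        row.update({k: " ".join(v.split()) for k, v in parser.fields.items()})
        rows.append(row)
    return rows


def write_csv(rows, out):
    """Write the rows to a file-like object as CSV for Tableau to read."""
    writer = csv.DictWriter(out, fieldnames=["state", "name", "title", "phone"])
    writer.writeheader()
    writer.writerows(rows)
```

                    A run might look like write_csv(scrape_states(["TX", "NY"]), open("contacts.csv", "w", newline="")), after which Tableau can use contacts.csv as a data source. Passing the download function in as fetch_page makes it possible to try the parsing against saved HTML before hitting the live site.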

                     

                    Anything more specific than that is far too detailed to cover here. However, you could start your quest with a high-level intro to HTML and scraping, pick a scripting language, and start googling things like "How do I extract a list from a dropdown on a webpage using Python?".

                     

                     

                    Alternatives worth trying:

                    https://www.import.io/ - I've used this before and it worked OK. I was frustrated by the lack of control I had, though, so I went back to my normal scripting ways. Some knowledge of HTML would be helpful.

                    iMacros (iMacros Software) - I've used this add-on for Firefox before and it worked very well. Again, scripting knowledge is required, but it might serve as a better introduction because some of the concepts I outlined above are built into the tool. You can also record and alter macros, much as you can in Microsoft Excel or Word.
