13 Replies Latest reply on Mar 1, 2016 3:37 PM by James Baker

    How to remove html tags from data

    Ashish Singhal

      I have data coming from Share point list. I am using MS Access as a medium to connect tableau to sharepoint.

      The data I am getting has html tags, which I want to remove.

       

      This is how my data looks:

      <div><font size="2" face="Calibri">[12/10/2014] Discussed continued UPMDB needs with fellow SCO Developer and OR Specialists.</font></div>

      <div> </div>

       

       

      is there a way to remove these html tags ?

        • 1. Re: How to remove html tags from data
          Joe Oppelt

          Maybe someone has a better suggestion than this...

           

          You can trigger off the open- and close-brackets.  It almost seems like you need to do this iteratively, which is not a feature of tableau calcs.

           

          Can you count on the html tags being only at the beginning and at the end of the string?  If so, you can grab from position 1 of the string and search for the first occurrence of the close bracket that is NOT preceded by another open bracket.  (Successive tags always have this character pattern:  ><  )   Then grab from that last close bracket until you get your next open bracket, and that will be your target string.

           

          I'm not exactly sure how to find that last close bracket from the initial pile of html tags, though.  It'll take some clever thinking to figure that out.

           

           

          And if you can't be sure that all tags are at the beginning and end of the string (and that they are all concatenated in the string without spaces between them) ...  maybe this isn't something to ask Tableau to do.  Maybe your ETL steps can massage the data instead.

          1 of 1 people found this helpful
          • 2. Re: How to remove html tags from data
            Ashish Singhal

            Thanks Joe for your answer. I agree it could be a temporary solution, but since I have about 100 columns in my data and idea of doing it for all those which I am using sounds time consuming and error prone.

            • 3. Re: How to remove html tags from data
              James Baker

              I think this needs to be done in the data, outside of Tableau.  You need some sort of data cleaning, ideally a tool that can truly parse HTML/DOM to extract what you really want.

               

              I don't know (there might be, I have no knowledge in this area) if there's a way to tell Sharepoint that you want text-only content?

              1 of 1 people found this helpful
              • 4. Re: How to remove html tags from data
                pooja.gandhi

                Maybe use the REGEXP function in Tableau. I am not sure if Access supports it though, you can create a data extract and then use the function:

                 

                REGEXP_REPLACE([String], '<.*?>', "")

                 

                Basically saying replace everything between the <> with a space:

                 

                3 of 3 people found this helpful
                • 5. Re: How to remove html tags from data
                  Joe Oppelt

                  Hey, I had an idea.

                   

                  This makes a huge assumption that all the tags do not have spaces between them, and that you have tags only at the beginning ant ad the end of the string.

                   

                  Use the REPLACE function to replace >< with spaces (or with anything, actually.)

                   

                  REPLACE([mystring],"><", "")

                   

                  That will essentially change all your leading tags into one giant tag, and all the trailing tags into one giant tag.

                   

                  So this:

                   

                   

                  <first tag><second tag>and here is the actual string</second tag></first tag>

                   

                  will change to this:<first tagsecond tag>and here is the actual string</second tag/first tag>

                   

                  Now you can use FIND() to find the first close-tag, and to find the first open-tag after position1.

                   

                  So it would look something like this:

                   

                   

                  MID(  REPLACE([mystring,"><",""),  FIND(REPLACE([mystring,"><",""),">") +1 , (FIND(REPLACE([mystring,"><",""), "<" , 2) -  FIND(REPLACE([mystring,"><",""),">") +1)  )

                   

                   

                   

                  MID() essentially grabs a substring from the target string (first argument), starting at position-x (second argument), for y-many characters (third argument).

                   

                  I keep repeating the REPLACE functions so that we keep using the same target.  (Chop out all occurrences of "><".)

                   

                  The second argument says to use the character after the first ">".

                   

                  The third argument is a length, so I'm subtracting the position of the first char we're grabbing from the position where the first open-bracket is found, but only start looking for that open-bracket after position 2 (so you don't find the very first one.)

                   

                  Again, all of this assumes that all the tags do not have spaces between them, and that there are no additional sets of tags embedded in the middle of the string.

                  2 of 2 people found this helpful
                  • 6. Re: How to remove html tags from data
                    Joe Oppelt

                    Pooja -- How do you get that to repeat?

                     

                    Attached is a hacked up workbook testing this out.  (I never used REGEXP before, so thanks for pointing to that!)

                     

                     

                    My calc does:

                     

                    REGEXP_REPLACE('<abc><def> 1234567 </abc></def>','<.*?>', "")

                     

                    The output is:

                     

                    <def> 1234567 </abc></def>

                     

                    I tested this against a hard-coded string and against a parameter.  You'll also see a commented-out line using another pattern that also results in the same thing.  (And I guess there is a whole world of pattern logic I can play with!)

                     

                    Anyway, your post seems to take care of all the iterations of tags, but mine is just grabbing the first one.

                     

                    Looking at your output, REGEXP would clean up all the tags, regardless of spacing, and regardless of mid-string tags, etc.

                    1 of 1 people found this helpful
                    • 7. Re: How to remove html tags from data
                      pooja.gandhi

                      Joe!

                       

                      There was a bug pre-9.1 release, so it wasn't doing a global replace. Looks like you used an older version and hence the output replaced only the first match. See the comment here:

                       

                      Regular Expression REGEXP_REPLACEALL() or REGEXP_FINDALL()

                       

                      Hope that helps!

                       

                      Pooja.

                      1 of 1 people found this helpful
                      • 8. Re: How to remove html tags from data
                        Joe Oppelt

                        OK, yes.  I did that on 9.0.


                        Well that's a cool function.  Gotta put that in my bag of tricks.

                         

                        It certainly takes care of Ashish's question!

                        1 of 1 people found this helpful
                        • 9. Re: How to remove html tags from data
                          Ashish Singhal

                          Thanks Pooja!

                           

                          This is almost a perfect solution, just that I have to use Extract instead of a live connection. Now I have find out a way so that the extract can be updated based on a schedule.

                          Ah and one more thing I had to use nested Regexp_Replace to take care of "&nbsp", just mentioned in hope that you might suggest a better way for that .

                          • 10. Re: How to remove html tags from data
                            Ashish Singhal

                            Thanks Joe! Although Pooja's solution takes care of my case but your solution would be useful when I will absolutely have to use live connection.

                            • 11. Re: How to remove html tags from data
                              Joe Oppelt

                              We both learned something today from Pooja.

                              • 12. Re: How to remove html tags from data
                                pooja.gandhi

                                Hi Ashish!

                                 

                                Yeah, if your database doesn't support REGEXP you are going to have to use a data extract. REGEXP can be used multiple ways, this was just a demo based on your sample data, as you already noticed, you will need to alter the expression to better suit your needs while using real data, nested replace is what I would do too to get rid of other special characters in the string.

                                 

                                Glad it helped!

                                 

                                Pooja.

                                1 of 1 people found this helpful
                                • 13. Re: How to remove html tags from data
                                  James Baker

                                  Re Regex: works great, until the first time your data has those carets in it. 

                                   

                                  Snarky quote: "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

                                  http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/