Monday, July 23, 2012

HOW TO REMOVE HTML FROM A STRING (T-SQL)

I've been working on SSRS report using a source database which had very interesting value within the field. It looked like HTML on the web site:





It's really annoying for business users to see something like this on SSRS report:

SELECT TheName = 'Put your <span style="color: #ff0000; font-weight: bold; text-decoration: underline">name </span>in the box:'



This was a "name" column just below the ID in database and there was not other columns with no HTML. I am not so sure if someone took a wrong approach creating this field using HTML or that was an intension of a developer making his life easier on the web interface, but I have decided to find a quick solution to get rid of it and to deliver the report as soon as I can.

CREATE FUNCTION fn_RemoveHTMLFromText (@inputString nvarchar(max))
RETURNS nvarchar(MAX)

AS
BEGIN


  /*Variables to store source fielde temporarily and to remove tags one by one*/
  DECLARE @replaceHTML nvarchar(2000), @counter int, @outputString nvarchar(max)
  set @counter = 0
  SET @outputString = @inputString

  /*This was extra case which I've added later to remove no-break space*/
  SET @outputString = REPLACE(@outputString, '&nbsp;', '')


  /*This loop searches for tags beginning with "<" and ending with ">" */
  WHILE (CHARINDEX('<', @outputString,1)>0 AND CHARINDEX('>', @outputString,1)>0)
  BEGIN
    SET @counter = @counter + 1

    /*
    Some math here... looking for tags and taking substring storing result into temporarily variable, for example "</span>"
   */

   SELECT @replaceHTML = SUBSTRING(@outputString, CHARINDEX('<', @outputString,1), CHARINDEX('>',   @outputString,1)-CHARINDEX('<', @outputString,1)+1)

   /* Replace the tag that we stored in previous step */
   SET @outputString = REPLACE(@outputString, @replaceHTML, '')

   /* Let's clear our variable just in case... */
   SET @replaceHTML = ''

   /* Let's set up maximum number of tags just for fun breaking the loop after 15 tags */
  if @counter >15
      RETURN(@outputString);

  END

  RETURN(@outputString);

END
 


Let's use the function:


SELECT TheName = dbo.fn_RemoveHTMLFromText ('Put your <span style="color: #ff0000; font-weight: bold; text-decoration: underline">name </span>in the box:')

And finally, this is what I want to see:





Of course, this is just a simple solution which was applied on the field with not so many HTML tags, but can be easily transformed into more serious HTML cleaner.


Mike C.






4 comments:

  1. very good info..

    ReplyDelete
  2. Hi Mike, great code.

    I'm trying to use your function but we're experiencing problems with greater than or less than being interpreted. For Example if we have SPAN tag 44 SPAN tag, the function is truncating the string presumably because it's seeing >44 as legitimate text & therefore we are left with an open tag. Any ideas on ignoring this scenario? thanks,
    Tom, UK

    ReplyDelete
    Replies
    1. Hi Tom,

      Can you share an example of your string with me and I will take a look into it. Basically, if you add some code to exclude certain html tags like spam that might solve this issue. This will remove all html tags except those that you still want to keep though.

      I would love to see your html string and check what is it about.

      Mike

      Delete
    2. did you find a solution to this one?

      Delete