Using SQL functions, convert HTML to PlainText

 – Using table-valued functions in SQL convert HTML to plaintext

Execute this code in  SQL

CREATE FUNCTION [dbo].[Html_To_Plaintext_ITVF] (@HTMLText nvarchar(max), @linefeed nvarchar(10))
RETURNS TABLE
AS
  RETURN 
  WITH cteTagsToReplaceWithLF AS (
    SELECT a.tag
      , ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS R 
    FROM (
      VALUES ('</p>'), ('</tr>'), ('<br>'), ('<br />'), ('</li>'),('&nbsp;')
    ) a(tag)
  )
 
  , cteWithLineFeeds(n, Html) AS (
    SELECT 1, REPLACE(@HTMLText, cteTags.tag, @linefeed)
    FROM cteTagsToReplaceWithLF cteTags
    WHERE cteTags.R = 1
 
    UNION ALL
 
    SELECT n + 1, REPLACE(ctelf.Html, cteTags.tag, @linefeed)
    FROM cteWithLineFeeds ctelf
      JOIN cteTagsToReplaceWithLF cteTags ON ctelf.n = cteTags.R
  )
 
  , cteHtml (i, HtmlText) AS (
    SELECT TOP 1 0, Html
    FROM cteWithLineFeeds
    ORDER BY n DESC
  
    UNION ALL
 
    SELECT i + 1, CONVERT(nvarchar(MAX), STUFF(HtmlText, CHARINDEX(N'<', HtmlText), CHARINDEX(N'>', HtmlText, CHARINDEX(N'<', HtmlText)) - CHARINDEX(N'<', HtmlText) + 1, ''))
    FROM cteHtml
    WHERE CHARINDEX('<', HtmlText) > 0
      AND CHARINDEX('>', HtmlText, CHARINDEX('<', HtmlText)) > 0
      AND CHARINDEX('>', HtmlText, CHARINDEX('<', HtmlText)) - CHARINDEX('<', HtmlText) > 0
  )
 
  SELECT TOP 1 LTRIM(RTRIM(HtmlText)) AS PlainText
  FROM cteHtml
  ORDER BY i DESC
Go

Let’s take a look at what’s going on. The first thing we see is a CTE with the name cteTagsToReplaceWithLF.

SELECT a.Tagtoreplace, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS Row 
FROM (
    VALUES ('</p>'), ('</tr>'), ('<br>'), ('<br />'), ('</li>'),('&nbsp;')
) a(Tagtoreplace)

You can just execute this on any database to get a two-column result. The first column has a sequential number, and the second column has one of the tags that we defined. These are the tags that will be replaced by linefeeds.

Table Result

The second CTE on the list replaces our HTML tags with linefeed characters.

Now Execute this function:

select * from Html_To_Plaintext_ITVF('<p><strong>Fruits</strong></p><ul><li><strong>Apple</strong></li><li><strong>Mango</strong></li><li><strong>Cherry</strong></li> </ul>  <p>&nbsp;</p>  <p>&nbsp;</p>  <ol>  <li><strong>Veggies</strong></li>  <li><strong>Potatoes</strong></li> </ol>','')

Result :

– Using a User-Defined Function, Convert HTML to PlainText

The following User Defined Function accepts HTML input and returns only TEXT. Before passing as input to function, all single quotes in HTML should be replaced with two single quotes (not double quotes).

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText) SET @End = 
CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) 
SET @Length = (@End - @Start) + 1 WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO

Execute the above function as follows:

SELECT dbo.udf_StripHTML('<b>The CodeHubs </b> <a href="http://thecodehubs.com">Thecodehubs.com</a>') as [PlainText]

udf_html_to_plaintext

Submit a Comment

Your email address will not be published. Required fields are marked *

Subscribe

Select Categories