Solved

Strip HTML From Text


Is there a transformer for stripping HTML from text? I see the HTMLStripper, but it appears to replace HTML with XML. I want all HTML tags removed, including the <!DOCTYPE html>.

Thanks

Best answer by jdh

Using the StringReplacer as @erik_jan says, but with <[^>]*> instead should remove all opening and closing tags.


erik_jan
Contributor
  • April 3, 2017

Could you use the StringReplacer transformer using regular expressions:

Find "<.*>" and replace with nothing.

This will replace all <anything inside> occurrences in the text.


jdh
Contributor
  • April 3, 2017
erik_jan wrote:

Could you use the StringReplacer transformer using regular expressions:

Find "<.*>" and replace with nothing.

This will replace all <anything inside> occurrences in the text.

That's a greedy expression and will replace everything from the first opening tag to the last closing tag.
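To see the difference outside FME, here is a minimal sketch using Python's `re` module. Python is only a stand-in here: the StringReplacer uses its own regex engine, although these two simple patterns should behave the same way in most engines. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text, not from the original workspace.
html = "<p>Hello <b>world</b></p>"

# Greedy: .* runs on to the last '>' in the line, so the text between
# the first and last tag is swallowed along with the tags.
greedy = re.sub(r"<.*>", "", html)

# Negated class: [^>]* cannot cross a '>', so each tag is matched
# and removed on its own.
per_tag = re.sub(r"<[^>]*>", "", html)

print(repr(greedy))
print(repr(per_tag))
```

With the greedy pattern nothing is left of the line, while the negated character class keeps the text between the tags.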

 

 


jdh
Contributor
  • April 3, 2017

Using the StringReplacer as @erik_jan says, but with <[^>]*> instead should remove all opening and closing tags.


erik_jan
Contributor
  • April 3, 2017
jdh wrote:
That's a greedy expression and will replace everything from the first opening tag to the last closing tag.

 

 

Oops, you are right.

 

This expression is more selective: "<[a-z|0-9| !]*>"

jimo
  • April 3, 2017

I was able to strip the HTML tags with this little gem:

<[^>]*>

However, I couldn't remove the lines with only spaces or no content. I used another StringReplacer to remove the spaces, then a tester to see if the line was an "empty string". That worked.

Thanks
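For anyone doing the same cleanup in a script rather than with transformers, the two steps above (strip the tags, then discard lines that are empty or only whitespace) could be sketched roughly like this in Python. The function name `strip_html` and the sample input are hypothetical, and unlike the StringReplacer and tester chain this version also trims leading and trailing spaces on the lines it keeps.

```python
import re

def strip_html(text: str) -> str:
    """Remove anything that looks like a tag, then drop blank lines."""
    # [^>]* also matches newlines, so tags that wrap across lines are removed.
    no_tags = re.sub(r"<[^>]*>", "", text)
    # Trim each line and keep only those that still contain something.
    lines = [line.strip() for line in no_tags.splitlines()]
    return "\n".join(line for line in lines if line)

# Hypothetical sample document, including the DOCTYPE declaration.
sample = "<!DOCTYPE html>\n<html>\n  <body>\n    <p>Hello</p>\n  </body>\n</html>\n"
print(strip_html(sample))
```

Note that this treats every `<...>` run as a tag, so a literal `<` or `>` in the text itself would confuse it.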


jdh
Contributor
  • April 3, 2017
erik_jan wrote:
Oops, you are right.

 

This expression is more selective: "<[a-z|0-9| !]*>"
Except it doesn't match any tags with attributes, or closing tags for that matter.
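A quick way to check which tags that character class accepts, sketched in Python (an assumption, since FME's regex engine may differ in details). The list of test tags is made up. The class only allows lowercase letters, digits, `|`, space and `!`, so `/`, `=`, quotes and uppercase letters all break the match.

```python
import re

selective = re.compile(r"<[a-z|0-9| !]*>")

# Hypothetical test tags covering the cases mentioned above.
tags = ["<p>", "<h1>", "</p>", '<a href="x.html">', "<!DOCTYPE html>"]

# fullmatch requires the whole tag to fit the pattern.
matched = {tag: bool(selective.fullmatch(tag)) for tag in tags}

for tag, ok in matched.items():
    print(f"{tag!r:22} {'matched' if ok else 'missed'}")
```

Only the bare lowercase opening tags get through. The closing tag, the tag with an attribute, and the uppercase `<!DOCTYPE html>` are all missed.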

 

 


erik_jan
Contributor
  • April 3, 2017
jdh wrote:
Except it doesn't match any tags with attributes, or closing tags for that matter

 

 

Okay, it is not complete. And I do like your solution: <[^>]*>

 

I guess my regex knowledge needs some extending.

 


jdh
Contributor
  • April 3, 2017
jdh wrote:
Except it doesn't match any tags with attributes, or closing tags for that matter

 

 

There is a good explanation of lazy vs greedy at http://www.regular-expressions.info/repeat.html

 

 

The bonus is that it uses HTML tags as its example.
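As a small illustration of the lazy alternative, here is a Python sketch (again an assumption, as FME's engine is not Python's) comparing the lazy quantifier `.*?` with the negated class. On a single line they give the same result, but by default `.` does not match a newline, so the lazy form misses a tag that wraps across lines. The sample strings are hypothetical.

```python
import re

html = "<p>Hello <b>world</b></p>"

# Lazy: .*? stops at the first '>' it can reach.
lazy = re.sub(r"<.*?>", "", html)
# Negated class: same result on single-line input.
negated = re.sub(r"<[^>]*>", "", html)

# A tag broken across two lines.
multiline = '<a\n href="x.html">link</a>'

# '.' does not match '\n' by default, so the opening tag survives here.
lazy_ml = re.sub(r"<.*?>", "", multiline)
# [^>] does match '\n', so both tags are removed.
negated_ml = re.sub(r"<[^>]*>", "", multiline)

print(repr(lazy), repr(negated))
print(repr(lazy_ml), repr(negated_ml))
```

That difference is one reason to prefer `<[^>]*>` when the HTML may contain line breaks inside tags.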

 

 


jdh
Contributor
  • April 3, 2017
jimo wrote:

I was able to strip the HTML tags with this little gem:

<[^>]*>

However, I couldn't remove the lines with only spaces or no content. I used another StringReplacer to remove the spaces, then a tester to see if the line was an "empty string". That worked.

Thanks

You may be able to use the XMLFormatter before stripping the HTML tags to remove extraneous whitespace and empty elements. Make sure to set Whitespace Handling to "Remove excess whitespace".
