Is there a transformer for stripping HTML from text? I see the HTMLStripper, but it appears to replace HTML with XML. I want all HTML tags removed, including the <!DOCTYPE html>.

Thanks

Could you use the StringReplacer transformer with a regular expression?

Find "<.*>" and replace with nothing.

This will replace all <anything inside> occurrences in the text.


That's a greedy expression, so it will replace everything from the first "<" to the last ">" rather than each tag on its own.


Using the StringReplacer as @erik_jan says, but with <[^>]*> instead should remove all opening and closing tags.
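
For anyone who wants to see the difference outside FME, here is a minimal sketch using Python's `re` module. That is an assumption on my part: the StringReplacer's regex engine may differ in details, and the sample string is made up for illustration.

```python
import re

# Hypothetical sample input, not taken from the thread.
html = "<!DOCTYPE html><p>Hello <b>world</b></p>"

# Greedy: ".*" runs from the first "<" to the last ">" on the line,
# so the text between the tags is swallowed as well.
greedy = re.sub(r"<.*>", "", html)

# Negated class: each match stops at the first ">" after a "<",
# so only the tags themselves (including the DOCTYPE) are removed.
per_tag = re.sub(r"<[^>]*>", "", html)

print(repr(greedy))
print(repr(per_tag))
```

With this input the greedy pattern leaves an empty string, while the negated character class leaves only the text content.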


Oops, you are right.

This expression is more selective: "<[a-z|0-9| !]*>"

I was able to strip the HTML tags with this little gem:

<[^>]*>

However, that left behind lines containing only spaces or no content at all. I used a second StringReplacer to remove the spaces, then a Tester to check whether each line was an empty string. That worked.

Thanks
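
The two-step cleanup described above (strip the tags, then drop the lines that end up empty) could look roughly like this in Python. It is only a sketch of the same idea, not the FME workflow itself; the function name `strip_tags_and_blank_lines` and the sample text are hypothetical.

```python
import re

def strip_tags_and_blank_lines(text: str) -> str:
    """Remove anything shaped like <...>, then drop empty or whitespace-only lines."""
    # Step 1: remove the tags (plays the role of the first StringReplacer).
    no_tags = re.sub(r"<[^>]*>", "", text)
    # Step 2: keep only lines that still contain something besides whitespace
    # (plays the role of the second StringReplacer plus the Tester).
    kept = [line for line in no_tags.splitlines() if line.strip()]
    return "\n".join(kept)

# Hypothetical sample input.
sample = "<html>\n  <body>\n<p>First</p>\n   \n<p>Second</p>\n  </body>\n</html>"
cleaned = strip_tags_and_blank_lines(sample)
print(cleaned)
```

Testing `line.strip()` handles the "only spaces" and "no content" cases in one check, so a separate space-removal pass is not needed in this version.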


Except it doesn't match any tags with attributes, or closing tags for that matter, because the character class has no "=", quotes or "/" in it.
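
A quick way to check this is to run both patterns over a few sample tags. The sketch below uses Python's `re` module and reads the character-class pattern as `<[a-z|0-9| !]*>`; the list of tags is made up for illustration.

```python
import re

selective = re.compile(r"<[a-z|0-9| !]*>")  # the "more selective" pattern
general = re.compile(r"<[^>]*>")            # anything between "<" and ">"

# Hypothetical sample tags.
tags = ["<p>", "<h1>", "</p>", '<a href="x.html">', "<!DOCTYPE html>"]

for tag in tags:
    # fullmatch() reports whether the pattern covers the whole tag.
    print(tag, bool(selective.fullmatch(tag)), bool(general.fullmatch(tag)))
```

The selective pattern only accepts the plain lowercase opening tags. The closing tag fails on "/", the link fails on "=" and the quotes, and the DOCTYPE fails on its uppercase letters.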


Okay, it is not complete. And I do like your solution: <[^>]*>

I guess my regex knowledge needs some extending.

There is a good explanation of lazy vs greedy at http://www.regular-expressions.info/repeat.html

The bonus is that it uses HTML tags as its example.
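
To make the lazy option concrete, here is a small sketch comparing a lazy quantifier with the negated character class, using Python's `re` module. This assumes Python's defaults, where "." does not match a newline; FME's engine may behave differently. The sample string is hypothetical.

```python
import re

# Hypothetical sample: the <img> tag is split across two lines.
html = '<p>Hello <b>world</b></p>\n<img\n  src="a.png">done'

# Lazy: ".*?" stops at the first ">", but "." will not cross a newline.
lazy = re.sub(r"<.*?>", "", html)

# Lazy with DOTALL: "." now matches newlines too.
lazy_dotall = re.sub(r"<.*?>", "", html, flags=re.DOTALL)

# Negated class: "[^>]" matches newlines without any extra flag.
negated = re.sub(r"<[^>]*>", "", html)

print(repr(lazy))
print(repr(lazy_dotall))
print(repr(negated))
```

In this sketch the lazy pattern leaves the multi-line `<img>` tag in place unless DOTALL is set, whereas the negated class removes it either way.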


You may be able to use the XMLFormatter before stripping the HTML tags to remove extraneous whitespace and empty elements. Make sure to set Whitespace Handling to "Remove excess whitespace".
