Solved

Strip HTML From Text

8 years ago
April 3, 2017
9 replies
274 views

Anonymous

Is there a transformer for stripping HTML from text? I see the HTMLStripper, but it appears to replace HTML with XML. I want all HTML tags removed, including the <!DOCTYPE html>.

Thanks

erik_jan
Contributor
2181 replies
8 years ago
April 3, 2017

Could you use the StringReplacer transformer using regular expressions:

Find "<.*>" and replace with nothing.

This will replace all <anything inside> occurrences in the text.

+28

jdh
Contributor
1982 replies
8 years ago
April 3, 2017

erik_jan wrote:

Could you use the StringReplacer transformer using regular expressions:

Find "<.*>" and replace with nothing.

This will replace all <anything inside> occurrences in the text.

That's a greedy expression and will replace everything from the first opening tag to the last closing tag.
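To see the greedy behaviour outside of FME, here is a minimal sketch using Python's `re` module (an assumption for illustration; the StringReplacer's regex engine may differ in details). The function name `strip_greedy` is hypothetical.

```python
import re

def strip_greedy(text: str) -> str:
    # ".*" is greedy: it grabs as much as it can, then backs off only far
    # enough to find a final ">", so one match can span many tags.
    return re.sub(r"<.*>", "", text)

# The single match runs from the first "<" to the last ">".
print(repr(strip_greedy("<p>Hello <b>world</b></p>")))   # ''
print(repr(strip_greedy("Intro <p>Hello</p> outro")))    # 'Intro  outro'
```

Note that in Python `.` does not match a newline by default, so the overreach shown here is per line.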

+28

jdh
Contributor
1982 replies
Best Answer
8 years ago
April 3, 2017

Using the StringReplacer as @erik_jan says, but with <[^>]*> instead should remove all opening and closing tags.
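As a rough check of that expression, here is the same pattern in Python's `re` module (chosen only for illustration; `strip_tags` and the sample string are made up for this sketch). It is a quick approximation, not an HTML parser.

```python
import re

# "<", then any run of characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

def strip_tags(text: str) -> str:
    # Replace every tag-like span with nothing, leaving the text between tags.
    return TAG_PATTERN.sub("", text)

sample = '<!DOCTYPE html><html><body><p class="intro">Hello <b>world</b></p></body></html>'
print(strip_tags(sample))  # Hello world
```

The pattern also removes the `<!DOCTYPE html>` declaration, since it only looks at the angle brackets. It will not cope with a literal `>` inside an attribute value, and it leaves entities such as `&amp;` and the contents of script or style elements untouched.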

+18

erik_jan
Contributor
2181 replies
8 years ago
April 3, 2017

jdh wrote:

That's a greedy expression and will replace everything from the first opening tag to the last closing tag.

Oops, you are right.

This expression is more selective: "<[a-z|0-9| !]*>"

Anonymous
0 replies
8 years ago
April 3, 2017

I was able to strip the HTML tags with this little gem:

<[^>]*>

However, that left behind lines containing only spaces or no content. I used another StringReplacer to remove the spaces, then a Tester to check whether the line was an empty string. That worked.
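For comparison, the same cleanup (drop lines that are empty or whitespace-only) can be sketched in Python. This is only an analogue of the StringReplacer plus Tester approach, not what FME does internally, and `drop_blank_lines` is a hypothetical helper name.

```python
def drop_blank_lines(text: str) -> str:
    # Keep only lines that contain at least one non-whitespace character.
    return "\n".join(line for line in text.splitlines() if line.strip())

after_tag_removal = "\n   \nHello\n\t\nworld\n\n"
print(drop_blank_lines(after_tag_removal))
# Hello
# world
```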

Thanks

+28

jdh
Contributor
1982 replies
8 years ago
April 3, 2017

erik_jan wrote:

Oops, you are right.

This expression is more selective: "<[a-z|0-9| !]*>"

Except it doesn't match any tags with attributes, or closing tags for that matter.
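A quick way to check this is to run the expression against a few sample tags. The sketch below uses Python's `re` module, which is case-sensitive by default (an assumption that may not match the StringReplacer's settings); the sample list is made up for illustration.

```python
import re

SELECTIVE = re.compile(r"<[a-z|0-9| !]*>")

samples = ["<p>", "<h1>", "</p>", '<a href="x">', "<!DOCTYPE html>", "<a|b>"]

# fullmatch checks whether the whole sample is accepted by the pattern.
results = {tag: bool(SELECTIVE.fullmatch(tag)) for tag in samples}
for tag, matched in results.items():
    print(f"{tag!r}: {matched}")
```

Only `<p>`, `<h1>` and `<a|b>` match. The closing tag fails on `/`, the attribute fails on `=` and `"`, and the DOCTYPE fails on its uppercase letters. The last sample shows that `|` inside square brackets is a literal character, not an "or".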

+18

erik_jan
Contributor
2181 replies
8 years ago
April 3, 2017

jdh wrote:

Except it doesn't match any tags with attributes, or closing tags for that matter

Okay, it is not complete. And I do like your solution: <[^>]*>

I guess my regex knowledge needs some extending.

+28

jdh
Contributor
1982 replies
8 years ago
April 3, 2017

jdh wrote:

Except it doesn't match any tags with attributes, or closing tags for that matter

There is a good explanation of lazy vs greedy at http://www.regular-expressions.info/repeat.html

The bonus is that it uses HTML tags as its example.
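As a small illustration of lazy versus greedy, the sketch below compares the lazy form `<.*?>` with the negated class `<[^>]*>` in Python's `re` module (an assumption for illustration; the sample strings are made up). On a single line both give the same result, but they differ when a tag is split across lines.

```python
import re

html = "<p>Hello <b>world</b></p>"

# Lazy: ".*?" stops at the first ">" it can reach.
lazy = re.sub(r"<.*?>", "", html)
# Negated class: "[^>]*" can never run past a ">".
negated = re.sub(r"<[^>]*>", "", html)
print(lazy, "|", negated)  # Hello world | Hello world

# A tag broken across two lines.
multiline = '<a\n href="x">link</a>'
lazy_ml = re.sub(r"<.*?>", "", multiline)
negated_ml = re.sub(r"<[^>]*>", "", multiline)
print(repr(lazy_ml))     # '<a\n href="x">link'
print(repr(negated_ml))  # 'link'
```

The lazy version misses the opening tag because `.` does not match a newline unless `re.DOTALL` is set, whereas `[^>]` matches any character except `>`, newlines included.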

+28

jdh
Contributor
1982 replies
8 years ago
April 3, 2017

jimo wrote:

I was able to strip the HTML tags with this little gem:

<[^>]*>

However, that left behind lines containing only spaces or no content. I used another StringReplacer to remove the spaces, then a Tester to check whether the line was an empty string. That worked.

Thanks

You may be able to use the XMLFormatter before stripping the HTML tags to remove extraneous whitespace and empty elements. Make sure to set Whitespace Handling to "Remove excess whitespace".
