Is there a transformer for stripping HTML from text? I see the HTMLStripper, but it appears to replace HTML with XML. I want all HTML tags removed, including the <!DOCTYPE html>.
Thanks
Could you use the StringReplacer transformer with a regular expression?
Find "<.*>" and replace with nothing.
This will replace all <anything inside> occurrences in the text.
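Use the StringReplacer as @erik_jan says, but with <[^>]*> instead; that should remove all opening and closing tags. The pattern <.*> is greedy, so on a line with several tags it matches everything from the first < to the last >, including the text in between.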
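To illustrate the difference between the two patterns, here is a small sketch in plain Python using the standard re module rather than the StringReplacer itself. The sample string is made up for the example, and the StringReplacer's regex engine may differ in details, so treat this only as a demonstration of the patterns.

```python
import re

# Made-up sample line containing several tags.
html = "<!DOCTYPE html><p>Hello <b>world</b></p>"

# Greedy pattern: matches from the first "<" to the last ">" on the line,
# so the text between the tags is removed as well.
greedy = re.sub(r"<.*>", "", html)

# Negated character class: each match stops at the first ">",
# so only the tags themselves are removed.
negated = re.sub(r"<[^>]*>", "", html)

print(repr(greedy))
print(repr(negated))
```

The second pattern leaves the text between the tags intact, which is what the question asks for.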
This expression is more selective: "<[a-z|0-9| !]*>"
I guess my regex knowledge needs some extending.
The bonus is that it uses HTML tags as its example.
I was able to strip the HTML tags with this little gem:
<[^>]*>
However, that alone did not remove the lines containing only spaces or no content. I used another StringReplacer to remove the spaces, then a Tester to check whether each line was an empty string. That worked.
Thanks
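For anyone doing the same cleanup in a script, the two steps described above (strip the tags, then drop lines that are empty or contain only spaces) can be sketched in plain Python. This is an assumption-laden illustration, not the FME workflow itself: the function name strip_html and the sample text are hypothetical, and a simple regex like this will not handle a literal ">" inside comments or attribute values.

```python
import re

def strip_html(text: str) -> str:
    """Remove anything that looks like a tag, then drop blank lines."""
    # Step 1: remove every <...> span, including <!DOCTYPE html>.
    no_tags = re.sub(r"<[^>]*>", "", text)
    # Step 2: keep only lines that still contain non-whitespace characters.
    lines = [line for line in no_tags.splitlines() if line.strip()]
    return "\n".join(lines)

# Hypothetical sample input.
sample = (
    "<!DOCTYPE html>\n<html>\n  <body>\n"
    "<p>First line</p>\n   \n<p>Second line</p>\n"
    "</body>\n</html>"
)

print(strip_html(sample))
```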