Solved

Regex editor and run difference



Hello all,

I've come across a difference between the result in the Regex editor and the result of a workspace run.

As you can see below, I'm expecting 28 matches for the feature whose pdf_page_text attribute holds the value shown in the Test String box. However, the StringSearcher transformer returns none.

Does anyone know what causes this situation?

Best answer by mark2atsafe

So we made a fix for this issue, to better handle different types of line ending. I'm not entirely sure which version of FME it went into, but we tested it in builds 19756/20076, so I'm sure that 2019.2 and 2020 will include it. The fix means that the regex editor and runtime will produce the same results. The change is in the transformer action, rather than the regex editor, so what you saw in the editor will remain the same, but the transformer will now also pick up on it at runtime.

Hope this now works for you, and apologies for any inconvenience it caused.


19 replies

redgeographics
Celebrity

What is the exact string in your PDF?


  • Author
  • February 1, 2019
redgeographics wrote:

What is the exact string in your PDF?

It's the same string as in the attribute value. I copied and pasted it into the Test String box in the editor.

 

CROTONE,ITA.

10-9 / BLK

13-1 / 13-2

13-3 / 13-4

13-5 / BLK

DURHAM TEES VALLEY,U.K.

10-9 / 10-9A

EAST MIDLANDS,U.K.

10-1P / 10-1P1

10-1R / BLK

# 10-2 / 10-2A

# 10-2B / 10-2C


redgeographics
Celebrity

I think this is happening:

The Test String in the Regex editor considers each line separately. However, if your entire attribute value is that (multi-line) string it won't match that regex because of the ^ and $
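
To illustrate the anchor behaviour described here, a minimal sketch using Python's `re` module as a stand-in (FME's own regex engine may behave differently). The pattern is hypothetical, since the original regex is not shown in the thread, and the text is a shortened copy of the attribute value.

```python
import re

# Shortened copy of the multi-line attribute value from the thread.
text = "CROTONE,ITA.\n10-9 / BLK\nDURHAM TEES VALLEY,U.K.\n10-9 / 10-9A"

# Hypothetical line-anchored pattern (the original regex is not shown).
pattern = r"^[A-Z ]+,[A-Z.]+$"

# Default mode: ^ and $ anchor to the start and end of the whole string,
# so a multi-line value produces no match.
whole_string = re.findall(pattern, text)

# Multiline mode: ^ and $ also match at each line boundary.
per_line = re.findall(pattern, text, flags=re.MULTILINE)

print(whole_string)
print(per_line)
```

In Python the same switch can also be written inline as `(?m)` at the start of the pattern; whether a given FME version accepts that form is not confirmed in this thread.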


ebygomm
Influencer
  • Influencer
  • February 1, 2019
redgeographics wrote:

I think this is happening:

The Test String in the Regex editor considers each line separately. However, if your entire attribute value is that (multi-line) string it won't match that regex because of the ^ and $

Yes, if you have multiple lines in an attribute you need to look for the line breaks (\n).


  • Author
  • February 1, 2019
redgeographics wrote:

I think this is happening:

The Test String in the Regex editor considers each line separately. However, if your entire attribute value is that (multi-line) string it won't match that regex because of the ^ and $

So multi-line strings don't work at runtime, then. Is there a way to set flags, or at least some documentation on which flags FME works with? Like /m or /g.


  • Author
  • February 1, 2019
ebygomm wrote:

Yes, if you have multiple lines in an attribute you need to look for the line breaks (\n).

I've tried to replace ^ and $ with /n but that didn't work either.


ebygomm
Influencer
  • Influencer
  • February 1, 2019

If you are just aiming to match the place names, this will match each line that starts with a letter

^[A-Z].*?\n

  • Author
  • February 1, 2019
ebygomm wrote:

If you are just aiming to match the place names, this will match each line that starts with a letter

^[A-Z].*?\n

This text is just a part of the complete text, which contains other lines starting with a letter. So I wouldn't like to change the search parameters.

Besides, I'd like to know if this is a bug?


redgeographics
Celebrity
cdural wrote:

This text is just a part of the complete and it contains other lines starting with a letter. So I wouldn't like to change the search parameters.

Besides, I'd like to know if this is a bug?

I don't think it's a bug per se, but definitely something that doesn't work as expected and I wonder if the devs even considered this. @Mark2AtSafe could you take a look at this?

In the meantime, I wonder if you could remove the ^ and $ from your regex to get the correct result.


  • Author
  • February 1, 2019
redgeographics wrote:

I don't think it's a bug per se, but definitely something that doesn't work as expected and I wonder if the devs even considered this. @Mark2AtSafe could you take a look at this?

In the meantime, I wonder if you could remove the ^ and $ from your regex to get the correct result.

But I don't know at which index the string I'm searching is going to end or begin. So if I do that it would just bring the letters and characters up to 20


ebygomm
Influencer
  • Influencer
  • February 1, 2019
cdural wrote:

But I don't know at which index the string I'm searching is going to end or begin. So if I do that it would just bring the letters and characters up to 20

Can you split your string with an AttributeSplitter at the newline and then use your existing regex?
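
The split-then-match idea can be sketched in Python as follows. This is only an illustration of the approach, not FME's AttributeSplitter; the function name `match_lines` and the pattern are hypothetical.

```python
import re

def match_lines(value: str, pattern: str) -> list[str]:
    """Split a multi-line value into lines and keep the lines the pattern fully matches."""
    compiled = re.compile(pattern)
    # splitlines() treats \n, \r\n and \r (plus a few rarer separators)
    # as line breaks, so the line-ending style does not matter here.
    return [line for line in value.splitlines() if compiled.fullmatch(line)]

value = "EAST MIDLANDS,U.K.\r\n10-1P / 10-1P1\r\n# 10-2 / 10-2A\r\nCROTONE,ITA."
places = match_lines(value, r"[A-Z ]+,[A-Z.]+")
print(places)
```

Because each line is tested on its own, the pattern needs no multi-line flag and no knowledge of the line-ending characters.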


redgeographics
Celebrity

This regex seems to work with your text:

([A-Z]|[ ])+,([A-Z]+|[.])+
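
As a quick check outside FME, the regex above can be run against part of the posted text with Python's `re` module (a stand-in engine, so FME's results may differ). Since the pattern has no ^ or $ anchors, it does not depend on how line boundaries are treated.

```python
import re

# Part of the sample text posted earlier in the thread, joined with linefeeds.
sample = "\n".join([
    "CROTONE,ITA.",
    "10-9 / BLK",
    "13-1 / 13-2",
    "DURHAM TEES VALLEY,U.K.",
    "10-9 / 10-9A",
    "EAST MIDLANDS,U.K.",
    "10-1P / 10-1P1",
    "# 10-2 / 10-2A",
])

pattern = re.compile(r"([A-Z]|[ ])+,([A-Z]+|[.])+")

# group(0) is the full matched text; the numbered groups only hold
# the last repetition of each parenthesised part.
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)
```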

 


  • Author
  • February 1, 2019
redgeographics wrote:

This regex seems to work with your text:

([A-Z]|[ ])+,([A-Z]+|[.])+

 

That worked like a charm. Thank you.

 

However, I'm still curious about this issue. @Mark2AtSafe could you kindly explain why this happens and what I should do to be sure of getting correct results in future?


mark2atsafe
Safer
  • Safer
  • February 1, 2019
cdural wrote:

That worked like a charm. Thank you.

 

However, I'm still curious about this issue. @Mark2AtSafe could you kindly explain why this happens and what I should do to be sure of getting correct results in future?

Hi @cdural @redgeographics @ebygomm

So... the big question here is are these linefeed or carriage return characters in your PDF? I'm guessing they are carriage returns.

That's because if I set up your data using an AttributeCreator (where pressing the return key gives me linefeed characters), then it seems to work fine in my StringSearcher.

But if I manually change the end of line markers to carriage returns, the source looks almost identical, but the StringSearcher does not find the match.
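
Since the two versions look almost identical on screen, one way to tell them apart is to count the line-ending characters. A minimal Python sketch; the helper name `count_line_endings` is hypothetical and this is not an FME function.

```python
def count_line_endings(value: str) -> dict[str, int]:
    """Count CRLF pairs, lone carriage returns and lone linefeeds in a string."""
    crlf = value.count("\r\n")
    return {
        "CRLF": crlf,
        "CR": value.count("\r") - crlf,  # carriage returns not followed by LF
        "LF": value.count("\n") - crlf,  # linefeeds not preceded by CR
    }

lf_text = "CROTONE,ITA.\n10-9 / BLK"
cr_text = "CROTONE,ITA.\r10-9 / BLK"

print(repr(cr_text))  # repr() makes the otherwise invisible \r visible
print(count_line_endings(lf_text))
print(count_line_endings(cr_text))
```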

So... it's a guess but I'm thinking that the regex preview dialog uses linefeed characters the same way that the text edit dialog does.

If you did a replace on your source string, replacing the CR characters with LF, then the StringSearcher should (if I'm correct) work fine.

I will query the developers about that.

I should also mention that I tried your original regex in a regex tester and it said it was "catastrophic"! Really. Apparently it means that it will work fine where it does find the string you are looking for, but when it doesn't it will start to iterate round and round in an almost never-ending circle. This blog explains why (although I must say I don't understand the half of it!): https://www.regular-expressions.info/catastrophic.html


  • Author
  • February 4, 2019
mark2atsafe wrote:

Hi @cdural @redgeographics @ebygomm

So... the big question here is are these linefeed or carriage return characters in your PDF? I'm guessing they are carriage returns.

That's because if I set up your data using an AttributeCreator (where pressing the return key gives me linefeed characters), then it seems to work fine in my StringSearcher.

But if I manually change the end of line markers to carriage returns, the source looks almost identical, but the StringSearcher does not find the match.

So... it's a guess but I'm thinking that the regex preview dialog uses linefeed characters the same way that the text edit dialog does.

If you did a replace on your source string, replacing the CR characters with LF, then the StringSearcher should (if I'm correct) work fine.

I will query the developers about that.

I should also mention that I tried your original regex in a regex tester and it said it was "catastrophic"! Really. Apparently it means that it will work fine where it does find the string you are looking for, but when it doesn't it will start to iterate round and round in an almost never-ending circle. This blog explains why (although I must say I don't understand the half of it!): https://www.regular-expressions.info/catastrophic.html

Hello @Mark2AtSafe,

Thank you for the answer. As you guessed the problem was the carriage returns. The text contained both carriage returns and line feed at the end of the lines. So I removed carriage returns with a string replacer before the string searcher and my original regex worked.
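
The effect of that clean-up step can be reproduced in Python as a rough analogue (FME's engine may treat line endings differently, especially after the later fix). The pattern is hypothetical, since the original regex is not shown in the thread.

```python
import re

# Hypothetical line-anchored pattern; the original regex is not shown in the thread.
pattern = re.compile(r"^[A-Z ]+,[A-Z.]+$", re.MULTILINE)

# Text with CR+LF line endings, as found in the PDF text.
crlf_text = "CROTONE,ITA.\r\n10-9 / BLK\r\nEAST MIDLANDS,U.K.\r\n10-1P / 10-1P1"

# In Python, $ in multiline mode matches only before \n (or at the end of
# the string), so the \r left before each \n blocks the match.
before = pattern.findall(crlf_text)

# Normalise CRLF and lone CR to LF, then search again.
normalised = crlf_text.replace("\r\n", "\n").replace("\r", "\n")
after = pattern.findall(normalised)

print(before)
print(after)
```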

However, I insist there is a problem with the difference between the regex editor and runtime.

I'm describing my workflow so you can test it yourself if you like:

- I forwarded the data to the inspector transformer before the string searcher and opened the value window of pdf_page_text attribute and copied the text

- Then I pasted it back into the regex editor test string and coded my regex. It said there are 28 matches.

- Last I placed the string searcher transformer between the tester and the inspector. But it didn't match anything.

So I was in FME framework from beginning to end and had two different outputs for the same text.

 

Lastly, as for my original regex, I'm not a guru and it could be improved. However, I know that it does not cause any catastrophic situations, since I don't use any quantifiers like + or *. It will stop matching if the number of characters I'm looking for exceeds 20.


mark2atsafe
Safer
  • Safer
  • February 4, 2019
cdural wrote:

Hello @Mark2AtSafe,

Thank you for the answer. As you guessed the problem was the carriage returns. The text contained both carriage returns and line feed at the end of the lines. So I removed carriage returns with a string replacer before the string searcher and my original regex worked.

However, I insist there is a problem with the difference between the regex editor and runtime.

I'm describing my workflow so you can test it yourself if you like:

- I forwarded the data to the inspector transformer before the string searcher and opened the value window of pdf_page_text attribute and copied the text

- Then I pasted it back into the regex editor test string and coded my regex. It said there are 28 matches.

- Last I placed the string searcher transformer between the tester and the inspector. But it didn't match anything.

So I was in FME framework from beginning to end and had two different outputs for the same text.

 

Lastly, as for my original regex, I'm not a guru and it could be improved. However, I know that it does not cause any catastrophic situations, since I don't use any quantifiers like + or *. It will stop matching if the number of characters I'm looking for exceeds 20.

Yes, there's definitely an issue in there with the regex editor and runtime. I've passed it on to the developers and we'll see what they say.


mark2atsafe
Safer
  • Safer
  • Best Answer
  • September 5, 2019

So we made a fix for this issue, to better handle different types of line ending. I'm not entirely sure which version of FME it went into, but we tested it in builds 19756/20076, so I'm sure that 2019.2 and 2020 will include it. The fix means that the regex editor and runtime will produce the same results. The change is in the transformer action, rather than the regex editor, so what you saw in the editor will remain the same, but the transformer will now also pick up on it at runtime.

Hope this now works for you, and apologies for any inconvenience it caused.


gio
Contributor
  • Contributor
  • September 6, 2019

([A-Z]|[ ])+,([A-Z]+|[.])+

This uses one character class for the capital letters and a separate one containing just a space character.

More elegant:

([A-Z ]+,[A-Z.]+)

To not hide the space:

^([A-Z\s]+,[A-Z.]+)$

And in ^[A-Z].*?\n the ? is superfluous: ^[A-Z].*\n

 

This is not trivial, as it means more or less work for the parsing engine.
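
These equivalences can be checked on a small sample with Python's `re` module, used here only as a stand-in engine. The helper `full_matches` is hypothetical, and the lazy and greedy forms agree only while . does not match newlines (the default in Python).

```python
import re

sample = "CROTONE,ITA.\n10-9 / BLK\nDURHAM TEES VALLEY,U.K.\n10-9 / 10-9A\n"

def full_matches(pattern: str, text: str, flags: int = 0) -> list[str]:
    """Return the full text of every match (group 0)."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

# Two separate classes joined by alternation vs. one combined class.
original = full_matches(r"([A-Z]|[ ])+,([A-Z]+|[.])+", sample)
simplified = full_matches(r"([A-Z ]+,[A-Z.]+)", sample)

# Lazy vs. greedy: . stops at the newline either way, so both end at the same \n.
lazy = full_matches(r"^[A-Z].*?\n", sample, re.MULTILINE)
greedy = full_matches(r"^[A-Z].*\n", sample, re.MULTILINE)

print(original == simplified, lazy == greedy)
```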

 

Also, one of the best references around:

https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

 

 


gio
Contributor
  • Contributor
  • September 9, 2019

Yup, that is why regex engines have switches (flags) for that convenience.

They cause . to match newlines, ^ and $ to match at line boundaries, etc.

 

But there is more:

^   Start of line
$   End of line
\A  Start of string
\z  End of string
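
A short Python sketch of the difference between the line anchors and the string anchors. Note the assumptions: Python's `re` module spells the end-of-string anchor \Z rather than \z, and ^ and $ act as line anchors only in multiline mode; other engines, including FME's, may differ.

```python
import re

text = "CROTONE,ITA.\nEAST MIDLANDS,U.K."

# ^ in multiline mode matches at the start of every line.
line_starts = re.findall(r"^[A-Z]+", text, re.MULTILINE)

# \A matches only at the very start of the string, even in multiline mode.
string_start = re.findall(r"\A[A-Z]+", text, re.MULTILINE)

# $ in multiline mode matches at the end of every line.
line_ends = re.findall(r"[A-Z.]+$", text, re.MULTILINE)

# \Z (Python's end-of-string anchor) matches only at the very end.
string_end = re.findall(r"[A-Z.]+\Z", text, re.MULTILINE)

print(line_starts, string_start)
print(line_ends, string_end)
```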

 

Still, I advise reading up on the matter. See the link posted above.

