Hello all,

I've come across a difference between the result in the Regex editor and the result of a workspace run.

As you can see below, I'm expecting to get 28 matches for the feature whose pdf_page_text attribute holds the value written in the Test String box. However, the StringSearcher transformer returns none.

Does anyone know what causes this situation?

What is the exact string in your PDF?


It's the same string as in the attribute value. I copied and pasted it into the Test String box in the editor.

CROTONE,ITA.

10-9 / BLK

13-1 / 13-2

13-3 / 13-4

13-5 / BLK

DURHAM TEES VALLEY,U.K.

10-9 / 10-9A

EAST MIDLANDS,U.K.

10-1P / 10-1P1

10-1R / BLK

# 10-2 / 10-2A

# 10-2B / 10-2C


I think this is happening:

The Test String in the Regex editor considers each line separately. However, if your entire attribute value is that (multi-line) string it won't match that regex because of the ^ and $
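To illustrate the point about ^ and $, here is a minimal sketch in Python. It is only an analogy: Python's re module is not the engine FME uses, and the pattern is a hypothetical stand-in, since the original regex is not shown in this thread. Without a multi-line flag, ^ and $ anchor to the whole value; with it, they anchor to each line.

```python
import re

# Sample lines taken from the attribute value posted above, joined with line feeds.
text = "CROTONE,ITA.\n10-9 / BLK\nDURHAM TEES VALLEY,U.K.\n10-9 / 10-9A"

# Hypothetical stand-in pattern (the original regex is not shown in the thread).
pattern = r"^[A-Z ]+,[A-Z.]+$"

# Default mode: ^ and $ anchor to the start and end of the whole string.
whole_string = re.findall(pattern, text)

# Multi-line mode: ^ and $ anchor to the start and end of every line.
per_line = re.findall(pattern, text, flags=re.MULTILINE)

print(whole_string)
print(per_line)
```

In this sketch the first call finds nothing and the second finds the two place names, which mirrors the editor-versus-runtime difference described above.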


Yes, if you have multiple lines in an attribute you need to look for the line breaks (\n)


So multi-line strings don't work at runtime, then. Is there a way to set flags, or at least some documentation on which flags FME works with? Like /m or /g.


I've tried to replace ^ and $ with /n but that didn't work either.


If you are just aiming to match the place names, this will match each line that starts with a letter:

^[A-Z].*?\n

This text is just a part of the complete text, which contains other lines starting with a letter. So I'd rather not change the search parameters.

Besides, I'd like to know if this is a bug?


I don't think it's a bug per se, but definitely something that doesn't work as expected and I wonder if the devs even considered this. @Mark2AtSafe could you take a look at this?

In the meantime, I wonder if you could remove the ^ and $ from your regex to get the correct result.


But I don't know at which index the string I'm searching for is going to begin or end. So if I do that, it would just match any letters and characters up to a length of 20.


Can you split your string at the newline with an AttributeSplitter and then use your existing regex?
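The split-then-match idea can be sketched outside FME as follows. This is a Python analogy only, not how the AttributeSplitter is implemented, and the pattern is a hypothetical stand-in for the original regex, which is not shown in this thread. Once each line is tested on its own, no ^ or $ anchors are needed.

```python
import re

# Sample lines from the posted attribute value, joined with line feeds.
text = "CROTONE,ITA.\n10-9 / BLK\nEAST MIDLANDS,U.K.\n10-1R / BLK"

# Hypothetical stand-in pattern; fullmatch() makes ^ and $ unnecessary.
pattern = r"[A-Z ]+,[A-Z.]+"

# splitlines() breaks the value into one string per line
# (it splits on \n, \r\n and \r alike).
lines = text.splitlines()

# Keep only the lines that match the pattern from start to end.
place_names = [line for line in lines if re.fullmatch(pattern, line)]

print(place_names)
```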


This regex seems to work with your text:

([A-Z]|[ ])+,([A-Z]+|[.])+

 


That worked like a charm. Thank you.

 

However, I'm still curious about this issue. @Mark2AtSafe could you kindly explain why this happens and what I should do to be sure of getting correct results in future?


Hi @cdural @redgeographics @egomm

So... the big question here is are these linefeed or carriage return characters in your PDF? I'm guessing they are carriage returns.

That's because if I set up your data using an AttributeCreator (where pressing the return key gives me linefeed characters), then it seems to work fine in my StringSearcher.

But if I manually change the end of line markers to carriage returns, the source looks almost identical, but the StringSearcher does not find the match.

So... it's a guess but I'm thinking that the regex preview dialog uses linefeed characters the same way that the text edit dialog does.

If you did a replace on your source string, replacing the CR characters with LF, then the StringSearcher should (if I'm correct) work fine.

I will query the developers about that.

I should also mention that I tried your original regex in a regex tester and it said it was "catastrophic"! Really. Apparently it means that it will work fine where it does find the string you are looking for, but when it doesn't it will start to iterate round and round in an almost never-ending circle. This blog explains why (although I must say I don't understand the half of it!): https://www.regular-expressions.info/catastrophic.html


Hello @Mark2AtSafe,

Thank you for the answer. As you guessed, the problem was the carriage returns. The text contained both a carriage return and a line feed at the end of each line. So I removed the carriage returns with a StringReplacer before the StringSearcher, and my original regex worked.
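The effect of a stray carriage return on the $ anchor can be reproduced with a small Python sketch. The assumptions are stated plainly: Python's re module stands in for FME's engine, the pattern is a hypothetical substitute for the original regex, and str.replace plays the role of the StringReplacer.

```python
import re

# Hypothetical stand-in pattern (the original regex is not shown in the thread).
pattern = r"^[A-Z ]+,[A-Z.]+$"

# Lines ending in carriage return + line feed, as described for the PDF text.
crlf_text = "CROTONE,ITA.\r\n10-9 / BLK\r\nEAST MIDLANDS,U.K.\r\n"

# In multi-line mode $ matches just before \n, but a \r sits in between,
# so the place names no longer end where $ expects them to.
with_cr = re.findall(pattern, crlf_text, flags=re.MULTILINE)

# Removing the carriage returns first leaves plain line feeds.
lf_text = crlf_text.replace("\r", "")
without_cr = re.findall(pattern, lf_text, flags=re.MULTILINE)

print(with_cr)
print(without_cr)
```

Pasting text into an editor box may normalise the line endings, which would explain why the same visible text behaves differently in the two places.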

However, I insist there is a problem with the difference between the regex editor and runtime.

Here is my workflow, so you can test it yourself if you like:

- I sent the data to an Inspector transformer placed before the StringSearcher, opened the value window of the pdf_page_text attribute and copied the text.

- Then I pasted it into the Test String box of the regex editor and wrote my regex. It said there were 28 matches.

- Last, I placed the StringSearcher transformer between the Tester and the Inspector. But it didn't match anything.

So I stayed within FME from beginning to end and got two different outputs for the same text.

 

Lastly, as for my original regex: I'm not a guru and it could be improved. However, I know that it does not cause any catastrophic situations, since I don't use any quantifiers like + or *. It will stop matching if the number of characters I'm looking for exceeds 20.


Yes, there's definitely an issue in there with the regex editor and runtime. I've passed it on to the developers and we'll see what they say.


So we made a fix for this issue, to better handle different types of line ending. I'm not entirely sure which version of FME it went into, but we tested it in builds 19756/20076, so I'm sure that 2019.2 and 2020 will include it. The fix means that the regex editor and runtime will produce the same results. The change is in the transformer action rather than the regex editor, so what you saw in the editor will remain the same, but the transformer will now also pick up on it at runtime.

Hope this now works for you, and apologies for any inconvenience it caused.


([A-Z]|[ ])+,([A-Z]+|[.])+

That uses one character class containing the capital letters and another containing a single space character.

More elegant

([A-Z ]+,[A-Z.]+)

To not hide the space:

^([A-Z\s]+,[A-Z.]+)$

 

And in ^[A-Z].*?\n the ? is superfluous: ^[A-Z].*\n

 

This is not trivial, as it means more or less work for the parsing engine.
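As a quick check of the simplification above, this Python sketch compares the alternation-based pattern with the single-character-class version on a few lines from the posted text. Python's re module is assumed here purely for illustration; FME's engine may differ in details, and the helper name full_matches is made up for this example.

```python
import re

# A few lines from the posted attribute value, joined with line feeds.
text = (
    "CROTONE,ITA.\n10-9 / BLK\nDURHAM TEES VALLEY,U.K.\n"
    "10-9 / 10-9A\nEAST MIDLANDS,U.K."
)

original = r"([A-Z]|[ ])+,([A-Z]+|[.])+"  # alternation between small classes
simplified = r"[A-Z ]+,[A-Z.]+"           # one character class per side


def full_matches(pattern, s):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, s)]


original_hits = full_matches(original, text)
simplified_hits = full_matches(simplified, text)

print(original_hits)
print(simplified_hits)
```

Both patterns find the same three place names here; the simplified one gives the engine fewer alternatives to try at each position.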

 

Also, one of the best references around:

https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

 

 


Yup, that is why they have switches for that convenience.

 

Causes .. to match newlines . as end-of-lines etc

 

But there is more:

^    Start of line
$    End of line
\A   Start of string
\z   End of string
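The difference between line anchors and string anchors can be seen in a short Python sketch. Two assumptions apply: Python's re module is used only as an illustration, and it spells the end-of-string anchor \Z rather than \z.

```python
import re

text = "CROTONE,ITA.\nDURHAM TEES VALLEY,U.K.\nEAST MIDLANDS,U.K."

# With the multi-line flag, ^ and $ match at every line boundary.
line_starts = re.findall(r"^[A-Z ]+", text, flags=re.MULTILINE)
line_ends = re.findall(r"[A-Z.]+$", text, flags=re.MULTILINE)

# \A and \Z ignore the flag and match only at the very start and end.
string_start = re.findall(r"\A[A-Z ]+", text, flags=re.MULTILINE)
string_end = re.findall(r"[A-Z.]+\Z", text, flags=re.MULTILINE)

print(line_starts, line_ends)
print(string_start, string_end)
```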

 

Still advising reading up on the matter. See the posted link.

