Solved

Regex editor and run difference

6 years ago
February 1, 2019
19 replies
36 views

cdural
9 replies

Hello all,

I've come across a difference between the Regex editor and the workspace run result.

As you can see below, I'm expecting 28 matches for the feature whose pdf_page_text attribute holds the value shown in the Test String box. However, the StringSearcher transformer returns none.

Does anyone know what causes this situation?

Best answer by mark2atsafe

So we made a fix for this issue, to better handle different types of line endings. I'm not entirely sure which version of FME it went into, but we tested it in builds 19756/20076, so I'm sure that 2019.2 and 2020 will include it. The fix means that the regex editor and runtime will produce the same results. The change is in the transformer action, rather than the regex editor, so what you saw in the editor will remain the same, but the transformer will now also pick up on it at runtime.

Hope this now works for you, and apologies for any inconvenience it caused.

+50

redgeographics
Celebrity
3642 replies
6 years ago
February 1, 2019

What is the exact string in your PDF?

cdural
Author
9 replies
6 years ago
February 1, 2019

redgeographics wrote:

What is the exact string in your PDF?

It's the same string as in the attribute value. I copied and pasted it into the Test String box in the editor.

CROTONE,ITA.

10-9 / BLK

13-1 / 13-2

13-3 / 13-4

13-5 / BLK

DURHAM TEES VALLEY,U.K.

10-9 / 10-9A

EAST MIDLANDS,U.K.

10-1P / 10-1P1

10-1R / BLK

# 10-2 / 10-2A

# 10-2B / 10-2C

+50

redgeographics
Celebrity
3642 replies
6 years ago
February 1, 2019

I think this is happening:

The Test String in the Regex editor considers each line separately. However, if your entire attribute value is that (multi-line) string it won't match that regex because of the ^ and $
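A minimal sketch of that behaviour, using Python's re module as a stand-in (FME's regex engine is not Python, so the flag name is an assumption for illustration only). The pattern is hypothetical, not the one from the original post. By default ^ and $ anchor to the whole string, and only in multi-line mode do they anchor to each line:

```python
import re

# Hypothetical multi-line attribute value, built from the sample text in this thread.
text = "CROTONE,ITA.\n10-9 / BLK\nDURHAM TEES VALLEY,U.K.\n10-9 / 10-9A"

# Hypothetical pattern: a whole line of the form "PLACE NAME,COUNTRY".
pattern = r"^[A-Z ]+,[A-Z.]+$"

# Default mode: ^ and $ anchor to the start and end of the entire string.
whole_string = re.findall(pattern, text)

# Multi-line mode: ^ and $ also anchor at every line break.
per_line = re.findall(pattern, text, flags=re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['CROTONE,ITA.', 'DURHAM TEES VALLEY,U.K.']
```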

+39

ebygomm
Influencer
3308 replies
6 years ago
February 1, 2019

redgeographics wrote:

I think this is happening:

The Test String in the Regex editor considers each line separately. However, if your entire attribute value is that (multi-line) string it won't match that regex because of the ^ and $

Yes, if you have multiple lines in an attribute you need to look for the line breaks (\n)

cdural
Author
9 replies
6 years ago
February 1, 2019

redgeographics wrote:

I think this is happening:

The Test String in the Regex editor considers each line separately. However, if your entire attribute value is that (multi-line) string it won't match that regex because of the ^ and $

So multi-line strings don't work at runtime, then. Is there a way to set flags, or at least a document listing which flags FME works with? Like /m or /g.

cdural
Author
9 replies
6 years ago
February 1, 2019

ebygomm wrote:

Yes, if you have multiple lines in an attribute you need to look for the line breaks (\n)

I've tried to replace ^ and $ with /n but that didn't work either.

+39

ebygomm
Influencer
3308 replies
6 years ago
February 1, 2019

If you are just aiming to match the place names, this will match each line that starts with a letter

^[A-Z].*?\n

cdural
Author
9 replies
6 years ago
February 1, 2019

ebygomm wrote:

If you are just aiming to match the place names, this will match each line that starts with a letter

^[A-Z].*?\n

This text is just a part of the complete text, which contains other lines starting with a letter. So I wouldn't like to change the search parameters.

Besides, I'd like to know if this is a bug?

+50

redgeographics
Celebrity
3642 replies
6 years ago
February 1, 2019

cdural wrote:

This text is just a part of the complete text, which contains other lines starting with a letter. So I wouldn't like to change the search parameters.

Besides, I'd like to know if this is a bug?

I don't think it's a bug per se, but definitely something that doesn't work as expected and I wonder if the devs even considered this. @Mark2AtSafe could you take a look at this?

In the meantime, I wonder if you could remove the ^ and $ from your regex to get the correct result.

cdural
Author
9 replies
6 years ago
February 1, 2019

redgeographics wrote:

I don't think it's a bug per se, but definitely something that doesn't work as expected and I wonder if the devs even considered this. @Mark2AtSafe could you take a look at this?

In the meantime, I wonder if you could remove the ^ and $ from your regex to get the correct result.

But I don't know at which index the string I'm searching is going to end or begin. So if I do that it would just bring the letters and characters up to 20

+39

ebygomm
Influencer
3308 replies
6 years ago
February 1, 2019

cdural wrote:

But I don't know at which index the string I'm searching is going to end or begin. So if I do that it would just bring the letters and characters up to 20

Can you split your string at the newline with an AttributeSplitter and then use your existing regex?
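The idea can be sketched in Python, as a stand-in for an AttributeSplitter followed by a StringSearcher (this is not FME code, and the pattern and sample value are hypothetical). Once each line is its own string, an anchored pattern works without any multi-line flag:

```python
import re

# Hypothetical attribute value; splitlines() handles \n, \r\n and \r alike.
text = "CROTONE,ITA.\r\n10-9 / BLK\r\nEAST MIDLANDS,U.K.\r\n10-1R / BLK"

# Hypothetical anchored pattern, applied to one line at a time.
line_pattern = re.compile(r"^[A-Z ]+,[A-Z.]+$")

# Keep only the lines that match the pattern from start to end.
place_names = [line for line in text.splitlines() if line_pattern.match(line)]

print(place_names)  # ['CROTONE,ITA.', 'EAST MIDLANDS,U.K.']
```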

+50

redgeographics
Celebrity
3642 replies
6 years ago
February 1, 2019

This regex seems to work with your text:

([A-Z]|[ ])+,([A-Z]+|[.])+

cdural
Author
9 replies
6 years ago
February 1, 2019

redgeographics wrote:

This regex seems to work with your text:

([A-Z]|[ ])+,([A-Z]+|[.])+

That worked like a charm. Thank you.

However, I'm still curious about this issue. @Mark2AtSafe could you kindly explain why this happens and what shall I do to be sure about getting the correct results in future.

+45

mark2atsafe
Safer
2517 replies
6 years ago
February 1, 2019

cdural wrote:

That worked like a charm. Thank you.

However, I'm still curious about this issue. @Mark2AtSafe could you kindly explain why this happens and what shall I do to be sure about getting the correct results in future.

Hi @cdural @redgeographics @egomm

So... the big question here is are these linefeed or carriage return characters in your PDF? I'm guessing they are carriage returns.

That's because if I set up your data using an AttributeCreator (where pressing the return key gives me linefeed characters), then it seems to work fine in my StringSearcher.

But if I manually change the end of line markers to carriage returns, the source looks almost identical, but the StringSearcher does not find the match.

So... it's a guess but I'm thinking that the regex preview dialog uses linefeed characters the same way that the text edit dialog does.

If you did a replace on your source string, replacing the CR characters with LF, then the StringSearcher should (if I'm correct) work fine.

I will query the developers about that.

I should also mention that I tried your original regex in a regex tester and it said it was "catastrophic"! Really. Apparently it means that it will work fine where it does find the string you are looking for, but when it doesn't it will start to iterate round and round in an almost never-ending circle. This blog explains why (although I must say I don't understand the half of it!): https://www.regular-expressions.info/catastrophic.html
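To illustrate what the linked article describes, here is a small Python sketch. The patterns are textbook examples, not the regex from this thread (which is not shown here). A quantifier nested inside another quantifier gives a backtracking engine exponentially many ways to split the same run of characters, and it tries them all before reporting failure:

```python
import re
import time

# Textbook pattern with a nested quantifier, and an equivalent flat pattern.
nested = re.compile(r"^(a+)+$")
flat = re.compile(r"^a+$")

# A run of 'a' followed by a character that makes the overall match fail.
subject = "a" * 18 + "b"


def timed_match(pattern, s):
    """Return (match result, seconds taken) for pattern.match(s)."""
    start = time.perf_counter()
    result = pattern.match(s)
    return result, time.perf_counter() - start


nested_result, nested_seconds = timed_match(nested, subject)
flat_result, flat_seconds = timed_match(flat, subject)

# Both fail to match, but the nested version does far more backtracking.
print(nested_result, flat_result)
print(f"nested: {nested_seconds:.4f}s  flat: {flat_seconds:.6f}s")
```

Each extra 'a' in the subject roughly doubles the work for the nested pattern, so keep the run short when trying this.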

cdural
Author
9 replies
6 years ago
February 4, 2019

mark2atsafe wrote:

Hi @cdural @redgeographics @egomm

So... the big question here is are these linefeed or carriage return characters in your PDF? I'm guessing they are carriage returns.

That's because if I set up your data using an AttributeCreator (where pressing the return key gives me linefeed characters), then it seems to work fine in my StringSearcher.

But if I manually change the end of line markers to carriage returns, the source looks almost identical, but the StringSearcher does not find the match.

So... it's a guess but I'm thinking that the regex preview dialog uses linefeed characters the same way that the text edit dialog does.

If you did a replace on your source string, replacing the CR characters with LF, then the StringSearcher should (if I'm correct) work fine.

I will query the developers about that.

Hello @Mark2AtSafe,

Thank you for the answer. As you guessed the problem was the carriage returns. The text contained both carriage returns and line feed at the end of the lines. So I removed carriage returns with a string replacer before the string searcher and my original regex worked.
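The effect can be reproduced outside FME. This is a sketch in Python with a hypothetical pattern and sample value, so it only shows the general mechanism, not what the StringSearcher does internally:

```python
import re

# Hypothetical attribute value with CR+LF line endings.
crlf_text = "CROTONE,ITA.\r\n10-9 / BLK\r\nDURHAM TEES VALLEY,U.K.\r\n10-9 / 10-9A"

# Hypothetical anchored pattern (not the original regex from this thread).
pattern = re.compile(r"^[A-Z ]+,[A-Z.]+$", re.MULTILINE)

# In Python, $ in multi-line mode matches just before "\n", so the "\r"
# sitting between the text and the "\n" blocks the match.
before = pattern.findall(crlf_text)

# Stripping the carriage returns leaves plain LF line endings.
lf_text = crlf_text.replace("\r", "")
after = pattern.findall(lf_text)

print(repr(crlf_text[:16]))  # 'CROTONE,ITA.\r\n10'
print(before)  # []
print(after)   # ['CROTONE,ITA.', 'DURHAM TEES VALLEY,U.K.']
```

Printing the value with repr() is a quick way to see which line-ending characters are actually present, since they are invisible in most viewers.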

However, I insist there is a problem with the difference between the regex editor and runtime.

Here is my workflow, so you can test it yourself if you like:

- I forwarded the data to the inspector transformer before the string searcher and opened the value window of pdf_page_text attribute and copied the text

- Then I pasted it back into the regex editor test string and coded my regex. It said there are 28 matches.

- Last I placed the string searcher transformer between the tester and the inspector. But it didn't match anything.

So I was in FME framework from beginning to end and had two different outputs for the same text.

Lastly, regarding my original regex: I'm not a guru and it could be improved. However, I know that it does not cause any catastrophic situations, since I don't use any quantifiers like + or *. It will stop matching if the number of characters I'm looking for exceeds 20.

+45

mark2atsafe
Safer
2517 replies
6 years ago
February 4, 2019

cdural wrote:

Hello @Mark2AtSafe,

However, I insist there is a problem with the difference between the regex editor and runtime.

Here is my workflow, so you can test it yourself if you like:

- I forwarded the data to the inspector transformer before the string searcher and opened the value window of pdf_page_text attribute and copied the text

- Then I pasted it back into the regex editor test string and coded my regex. It said there are 28 matches.

- Last I placed the string searcher transformer between the tester and the inspector. But it didn't match anything.

So I was in FME framework from beginning to end and had two different outputs for the same text.

Yes, there's definitely an issue in there with the regex editor and runtime. I've passed it on to the developers and we'll see what they say.

+45

mark2atsafe
Safer
2517 replies
Best Answer
5 years ago
September 5, 2019

Hope this now works for you, and apologies for any inconvenience it caused.

+15

gio
Contributor
2252 replies
5 years ago
September 6, 2019

([A-Z]|[ ])+,([A-Z]+|[.])+

This uses one character class containing the capital letters and another containing a single space character.

More elegant

([A-Z ]+,[A-Z.]+)
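A quick check, sketched in Python (which may differ in details from FME's regex engine), that the merged character classes match the same text as the alternation version on part of the sample data from this thread:

```python
import re

# Part of the sample text from this thread, joined with LF line endings.
text = "\n".join([
    "CROTONE,ITA.", "10-9 / BLK", "13-1 / 13-2",
    "DURHAM TEES VALLEY,U.K.", "10-9 / 10-9A",
    "EAST MIDLANDS,U.K.", "10-1P / 10-1P1",
])

alternation = re.compile(r"([A-Z]|[ ])+,([A-Z]+|[.])+")
merged = re.compile(r"([A-Z ]+,[A-Z.]+)")

# Compare the full matched text (group 0) of each pattern.
alt_matches = [m.group(0) for m in alternation.finditer(text)]
merged_matches = [m.group(0) for m in merged.finditer(text)]

print(alt_matches == merged_matches)  # True
print(merged_matches)
# ['CROTONE,ITA.', 'DURHAM TEES VALLEY,U.K.', 'EAST MIDLANDS,U.K.']
```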

To not hide the space:

^([A-Z\s]+,[A-Z.]+)$

And in ^[A-Z].*?\n the ? is superfluous: ^[A-Z].*\n

This is not trivial, as it means more or less work for the parsing engine.
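A small Python sketch of why the lazy ? makes no difference to the result here, assuming . does not match newlines (the default in Python's re module, taken here as a stand-in for FME's engine). Both forms have to stop at the first line break:

```python
import re

# Hypothetical sample value with LF line endings.
text = "CROTONE,ITA.\n10-9 / BLK\nEAST MIDLANDS,U.K.\n"

# Lazy and greedy versions of the same pattern, in multi-line mode.
lazy = re.findall(r"^[A-Z].*?\n", text, flags=re.MULTILINE)
greedy = re.findall(r"^[A-Z].*\n", text, flags=re.MULTILINE)

print(lazy == greedy)  # True
print(greedy)          # ['CROTONE,ITA.\n', 'EAST MIDLANDS,U.K.\n']
```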

Also, one of the best references around:

https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

+15

gio
Contributor
2252 replies
5 years ago
September 9, 2019

Yup, that is why they have switches for that convenience.

They cause . to match newlines, ^ and $ to match at line ends, etc.

But there is more:

^    Start of line
$    End of line
\A   Start of string
\z   End of string

Still, I advise reading up on the matter. See the posted link.
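The difference between line anchors and string anchors can be sketched in Python. This is an illustration only: Python's re module spells the end-of-string anchor \Z, and other engines may differ. The pattern and sample value are hypothetical:

```python
import re

# Hypothetical two-line value with an LF line ending.
text = "CROTONE,ITA.\nEAST MIDLANDS,U.K."

# ^ and $ with the multi-line flag: one match per line.
line_anchored = re.findall(r"^[A-Z ]+,[A-Z.]+$", text, flags=re.MULTILINE)

# \A and \Z ignore the multi-line flag and only anchor to the whole string.
string_anchored = re.findall(r"\A[A-Z ]+,[A-Z.]+\Z", text, flags=re.MULTILINE)

# \A alone still pins the match to the very start of the string.
first_only = re.findall(r"\A[A-Z ]+,[A-Z.]+", text, flags=re.MULTILINE)

print(line_anchored)    # ['CROTONE,ITA.', 'EAST MIDLANDS,U.K.']
print(string_anchored)  # []
print(first_only)       # ['CROTONE,ITA.']
```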
