all_matches{}.startIndex of StringSearcher seems to skip Carriage Return

Question

Hi,

Yesterday I came across the situation where I wanted to split a text file into several portions. For 'new lines', the file contained both single Line Feeds (i.e. LF or \\n) as Carriage Return - Line Feed (i.e. CRLF or \\r\\n) combinations. Thus both the Unix End Of Line (EOL) methodology and the Windows EOL methodology were present.

I used the StringSearcher to match relevant parts of the text file, and then wanted to use (concurrent) all_matches{}.startIndex to fetch the relevant portions of the file.

However, I found out that when I then use the found startIndex in e.g. the SubstringExtractor, it doesn't doesn't return the same (starting) characters as found in the match. Digging a bit deeper, it seems that the StringSearcher all_matches{}.startIndex seems to skip counting CR (\\r). I'm guessing it identifies CRLF (\\r\\n) as an individual character c.q. index?

See the screenshot below for a demo workspace illustrating this issue. Also, I found that the equivalent @FindRegularExpression() does seem to find the correct startIndex.

Curious to hear other people's thought on this, I'm guessing this is a bug of the StringSearcher.

Kind regards,

Thijs

ps. While working on this issue, I also came across a couple of additional observations. Maybe these can also be looked at/investigated.

The StringSearcher uses multi-line-mode by default, wheras @FindRegularExpression() uses non multi-line-mode by default. I.e. I needed to add the Regex Mode modifier '(?m)' to make the @FindRegularExpression() string function behave as desired.
When optional 'Matched Result' parameter of the StringSearcher Transformer is left blank, an invisible attribute is created (with <space> for its name and the firstMatch for its value).

E.g. for the StringSearcher;

Examples of 'correctly implemented' optional attributes;
- Creator - Creation Instance (FME default '_creation_instance')
- ListExploder - Element Index (FME default '_element_index')
Examples of 'incorrectly implemented' optional attributes;
- StringSearcher - Matched Result (FME default '_first_match')
- XMLXQueryExtractor - Result Attribute (FME default '_result')

Implementation of SubstringExtractor and @Substring() function is slightly different. I.e. Both use startIndex, but SubstringExtractor uses 'End Index' for input, whereas @Substring() uses numChars as input. I personally prefer the latter.
FME documentation on String functions doesn't mention 'SubstringExtractor' as equivalent transformer for @Substring() transformer

all_matches{}.startIndex of StringSearcher seems to skip Carriage Return

0 replies

Reply

Community Stats

Reply

Community Stats

Sign up

An FME Account is required to contribute

Login to the community

An FME Account is required to contribute

Scanning file for viruses.

This file cannot be downloaded