Question

all_matches{}.startIndex of StringSearcher seems to skip Carriage Return

  • 15 December 2022
  • 0 replies
  • 5 views

Badge +3

Hi,

 

Yesterday I came across the situation where I wanted to split a text file into several portions. For 'new lines', the file contained both single Line Feeds (i.e. LF or \\n) as Carriage Return - Line Feed (i.e. CRLF or \\r\\n) combinations. Thus both the Unix End Of Line (EOL) methodology and the Windows EOL methodology were present.

 

I used the StringSearcher to match relevant parts of the text file, and then wanted to use (concurrent) all_matches{}.startIndex to fetch the relevant portions of the file.

However, I found out that when I then use the found startIndex in e.g. the SubstringExtractor, it doesn't doesn't return the same (starting) characters as found in the match. Digging a bit deeper, it seems that the StringSearcher all_matches{}.startIndex seems to skip counting CR (\\r). I'm guessing it identifies CRLF (\\r\\n) as an individual character c.q. index?

 

See the screenshot below for a demo workspace illustrating this issue. Also, I found that the equivalent @FindRegularExpression() does seem to find the correct startIndex.

imageCurious to hear other people's thought on this, I'm guessing this is a bug of the StringSearcher.

 

Kind regards,

 

Thijs

 

ps. While working on this issue, I also came across a couple of additional observations. Maybe these can also be looked at/investigated.

  1. The StringSearcher uses multi-line-mode by default, wheras @FindRegularExpression() uses non multi-line-mode by default. I.e. I needed to add the Regex Mode modifier '(?m)' to make the @FindRegularExpression() string function behave as desired. 
  2. When optional 'Matched Result' parameter of the StringSearcher Transformer is left blank, an invisible attribute is created (with <space> for its name and the firstMatch for its value).
  • E.g. for the StringSearcher;

image.png

  • Examples of 'correctly implemented' optional attributes;
    • Creator - Creation Instance (FME default '_creation_instance')
    • ListExploder - Element Index (FME default '_element_index')
  • Examples of 'incorrectly implemented' optional attributes;
    • StringSearcher - Matched Result (FME default '_first_match')
    • XMLXQueryExtractor - Result Attribute (FME default '_result')
  1. Implementation of SubstringExtractor and @Substring() function is slightly different. I.e. Both use startIndex, but SubstringExtractor uses 'End Index' for input, whereas @Substring() uses numChars as input. I personally prefer the latter.
  2. FME documentation on String functions doesn't mention 'SubstringExtractor' as equivalent transformer for @Substring() transformer

 


0 replies

Be the first to reply!

Reply