Question

Removing Duplicates in a Fixed-Width Text File

  • 30 November 2018

Hi,

 

I have a fixed-width text file and would like to remove rows where a certain portion of the row is duplicated. For example, see the extract below:

 

NE500086033 15.10.201831.12.2099Masterson

NE500085977 08.10.201831.12.2099Gilmore

NE500085699 24.09.201831.12.2021Doherty

NE500085699 24.09.201831.12.2099Banks

NE500085312 10.09.201831.12.2099Moyo

 

If the highlighted portions (the reference number at the start of the row, e.g. NE500085699 in rows 3 and 4) are the same, I would like to remove both records from the file, then output the file in exactly the same format with the duplicates removed. I am using the CAT reader to read the file, but I'm not 100% sure which writer is best to use. Any advice is much appreciated.
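To pin down the requirement, here is a minimal Python sketch (the 11-character reference at the start of each row is an assumption taken from the sample data; adjust the slice to the real column width). Every row whose key appears more than once is removed, not just the later copies:

```python
from collections import Counter

def drop_duplicated_keys(lines, start=0, end=11):
    """Remove ALL rows whose fixed-width key portion occurs more than once."""
    counts = Counter(line[start:end] for line in lines)
    return [line for line in lines if counts[line[start:end]] == 1]

rows = [
    "NE500086033 15.10.201831.12.2099Masterson",
    "NE500085977 08.10.201831.12.2099Gilmore",
    "NE500085699 24.09.201831.12.2021Doherty",
    "NE500085699 24.09.201831.12.2099Banks",
    "NE500085312 10.09.201831.12.2099Moyo",
]
for row in drop_duplicated_keys(rows):
    print(row)  # both NE500085699 rows are dropped
```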

 

Thanks,

Charlie


2 replies


If the key is always at the same position, you could use an AttributeSplitter to split that part off into a new list attribute, then a DuplicateRemover with that list attribute as the key.

A SubstringExtractor instead of the AttributeSplitter gives the same result.
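A rough Python equivalent of this approach (the 12-character key width is an assumption from the sample). Note that, like a DuplicateRemover, it keeps the first feature per key rather than removing both copies of a duplicated key:

```python
def dedupe_keep_first(lines, key_len=12):
    """Keep the first line seen for each fixed-width key (DuplicateRemover-style)."""
    seen = set()
    kept = []
    for line in lines:
        key = line[:key_len]  # the portion split off as the key attribute
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept
```

If both copies must go, counting keys first and keeping only keys that occur once achieves that instead.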


A possible way is:

  1. Read the text file line by line with the Text File reader.
  2. Extract the portion to be compared and save it as a new attribute with a transformer such as the SubstringExtractor.
  3. Send the features to the Matcher (Match Geometry: None, Attribute Matching Strategy: Match Selected Attributes, Selected Attributes: <attribute storing the portion to be compared>).
  4. Write the text line data from the features output via the NotMatched port to a destination text file with the Text File writer.

If you need to keep the original order of text lines, expose the "text_line_number" attribute in the reader and sort the features by line number with a Sorter before writing out.
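The steps above can be sketched in Python, with an in-memory list of lines standing in for the reader and writer, and a hypothetical 12-character key standing in for the extracted attribute:

```python
from collections import Counter

def matcher_not_matched(lines, key_len=12):
    """Emulate the Matcher's NotMatched port: keep only features whose key
    attribute matches no other feature, preserving the original line order."""
    # Step 2: extract the comparison portion of each line
    keys = [line[:key_len] for line in lines]
    # Step 3: features sharing a key would exit the Matched port;
    # features with a unique key exit NotMatched
    counts = Counter(keys)
    # Tag each kept line with its line number (text_line_number)
    numbered = [(n, line) for n, (line, key) in enumerate(zip(lines, keys))
                if counts[key] == 1]
    # Sorter step: restore the original order before writing (already in
    # order here, but explicit in case upstream processing reordered features)
    numbered.sort(key=lambda pair: pair[0])
    return [line for _, line in numbered]
```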
