Question

Removing Duplicates in Fixed Width text File

  • November 30, 2018
  • 2 replies
  • 16 views


Hi,

 

I have a fixed width text file and would like to remove some of the rows where a certain portion of the row is duplicated, for example please see extract below:

 

NE500086033 15.10.201831.12.2099Masterson

NE500085977 08.10.201831.12.2099Gilmore

NE500085699 24.09.201831.12.2021Doherty

NE500085699 24.09.201831.12.2099Banks

NE500085312 10.09.201831.12.2099Moyo

 

If the parts shown in bold (the bold formatting hasn't survived here; in the example, NE500085699 appears on two rows) are the same, I would like to remove both records from the file, then output the file in exactly the same format with the duplicates removed. I am using the CAT reader to read the file, but I'm not 100% sure which is the best writer to use. Any advice is much appreciated.

 

Thanks,

Charlie

2 replies

redgeographics
Celebrity

If it's always the same length you could use an AttributeSplitter to split off that bit into a new list attribute and then a DuplicateRemover with that list attribute as the key.

A SubstringExtractor instead of the AttributeSplitter achieves the same result.
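Outside FME, the same idea can be sketched in plain Python: take a fixed-width slice of each line as the key and drop repeats. Note that, like the DuplicateRemover, this keeps the first occurrence of each key rather than removing both copies. The 0:12 slice is an assumption based on the sample data (ID column plus trailing space); adjust it to the real column widths.

```python
def dedup_keep_first(lines, key_start=0, key_end=12):
    """Keep the first line for each fixed-width key; drop later repeats."""
    seen = set()
    out = []
    for line in lines:
        key = line[key_start:key_end]  # fixed-width slice acts as the key
        if key not in seen:
            seen.add(key)
            out.append(line)
    return out
```

With the sample data above, this would keep the Doherty row and drop the Banks row, since both share the key NE500085699.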


takashi
Contributor
  • November 30, 2018

A possible way is:

  1. Read the text file with the Text File reader line by line.
  2. Extract the portion to be compared and save as a new attribute with a transformer, such as SubstringExtractor.
  3. Send the features to the Matcher (Matched Geometry: None, Attribute Matching Strategy: Match Selected Attributes, Selected Attributes: <attribute storing the portion to be compared>).
  4. Write text line data in the features output via the NotMatched port into a destination text file with the Text File writer.

If you need to keep the original order of text lines, expose the "text_line_number" attribute in the reader, and sort the features by the line number with the Sorter before writing out.
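The steps above amount to: extract the key from each line, then write out only the lines whose key appears exactly once. A minimal plain-Python sketch of that logic (the 0:12 slice is an assumption based on the sample data; adjust to the real column widths). Iterating in input order preserves the original line order, which is what the Sorter/text_line_number step achieves in FME:

```python
from collections import Counter

def drop_all_duplicates(lines, key_start=0, key_end=12):
    """Remove every line whose fixed-width key occurs more than once."""
    # First pass: count how often each key occurs.
    counts = Counter(line[key_start:key_end] for line in lines)
    # Second pass: keep only lines with a unique key, in original order.
    return [line for line in lines if counts[line[key_start:key_end]] == 1]
```

Applied to the sample extract, both NE500085699 rows (Doherty and Banks) are removed and the other three rows pass through unchanged.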

