Question

Match sets of keywords in freeform text


jdh
Contributor
  • Contributor

Hi all,

 

 

I have one dataset containing a set of features with attributes like ID, Name, Date, Location.

 

While none of the individual attributes is unique, the combination of all of them is. (Record)

 

I have another dataset of features with one attribute containing freeform multiline text.  (Text)

 

 

Each Text feature contains ALL of the values of ONE feature of the Record dataset, but in no particular order, and generally not as an exact match on a single line.

 

 

I need to identify which Text feature corresponds to which Record feature.  Each Record should match zero or one Text feature.

 

 

I am assuming that Python and regex are the way to go, but I'm not sure of the most efficient way to process the data.

 

 

Record Features

ID   Name   Date          Location
24   AAA    23 MAY 2019   X
32   AAA    07 JUN 2019   Y
24   BBB    07 JUN 2019   Z

 

A sample text feature could contain something like:
SEE 24
2926m
7000'
Search shelter X
32
500 2000 2500
800 32 200
AAA/ABC
07 JUN 2019
Y

The correct record in this case is 32-AAA.
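
For reference, the matching rule described above (every value of one record appears somewhere in the text) can be sketched in plain Python outside FME. The helper names `value_pattern` and `record_matches` are made up for illustration, and matching each value as a whole token (so that 32 does not match inside 320) is an assumption rather than something stated in the question. The records and sample text are the ones from this post.

```python
import re


def value_pattern(value):
    """Compile a regex that finds `value` only as a whole token."""
    # The lookarounds reject matches embedded in a longer word or number,
    # e.g. '32' inside '320' or 'AAA' inside 'AAAB'.
    return re.compile(r"(?<!\w)" + re.escape(value) + r"(?!\w)")


def record_matches(record, text):
    """Return True if every value of the record occurs in the text."""
    return all(value_pattern(v).search(text) for v in record)


RECORDS = [
    ("24", "AAA", "23 MAY 2019", "X"),
    ("32", "AAA", "07 JUN 2019", "Y"),
    ("24", "BBB", "07 JUN 2019", "Z"),
]

SAMPLE_TEXT = "\n".join([
    "SEE 24",
    "2926m",
    "7000'",
    "Search shelter X",
    "32",
    "500 2000 2500",
    "800 32 200",
    "AAA/ABC",
    "07 JUN 2019",
    "Y",
])

# Keep only the records whose values are all present in the sample text.
matches = [r for r in RECORDS if record_matches(r, SAMPLE_TEXT)]
```

Note that 24-AAA fails only because its date is missing from the text; 24, AAA and X all appear, which is why every attribute has to be checked.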

 

8 replies

jdh
Contributor
  • Author
  • Contributor
  • June 19, 2019

I could guarantee that the Record features arrive before the text features.

 

 

Maybe something in a PythonCaller along these lines:

Add the record features to a dictionary.

For each text feature, loop through the record dictionary and regex search for all attributes of that record.

If there is a full match, pop that record from the dictionary, output the text feature, and break out of the inner loop.
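
The steps above can be sketched roughly as follows, in plain Python rather than inside a PythonCaller. This is a minimal sketch under assumptions: the function names `compile_record` and `match_texts`, the choice of dictionary keys, and the whole-token regex are all hypothetical, not part of FME or of the plan as posted.

```python
import re


def compile_record(values):
    """Precompile one whole-token pattern per attribute value."""
    return [re.compile(r"(?<!\w)" + re.escape(v) + r"(?!\w)") for v in values]


def match_texts(records, texts):
    """Assign each text to at most one record.

    records: dict mapping a record key to a tuple of attribute values.
    texts:   list of freeform text strings.
    Returns (matched, unmatched_records, unmatched_texts), where matched
    maps the index of a text to the key of the record it matched.
    """
    # Records still waiting for a text; patterns are compiled only once.
    pending = {key: compile_record(values) for key, values in records.items()}
    matched = {}
    unmatched_texts = []
    for index, text in enumerate(texts):
        # First pending record whose values all occur in this text, if any.
        hit = next(
            (key for key, patterns in pending.items()
             if all(p.search(text) for p in patterns)),
            None,
        )
        if hit is None:
            unmatched_texts.append(index)
        else:
            matched[index] = hit
            # Pop the record so later texts are not compared against it.
            pending.pop(hit)
    return matched, sorted(pending), unmatched_texts
```

Popping matched records shrinks the inner loop as processing goes on, and whatever is left in `pending` at the end is the set of records with no text at all.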

erik_jan
Contributor
  • Contributor
  • June 19, 2019

If the datasets are not too big, you could use an unconditional FeatureMerger (join on 1=1) to create a Cartesian join.

Then follow by a Tester to test on:

Text contains ID And Text contains Name ......

That would solve it using 2 transformers.


ebygomm
Influencer
  • Influencer
  • June 19, 2019

What format is your record dataset in? If it is stored as a csv/text or similar I'd be tempted to read the csv directly to a list of tuples and iterate over them for matches, e.g.

import csv

import fme
import fmeobjects


class FeatureProcessor(object):

    def __init__(self):
        # Load every non-empty CSV row once, as a tuple of attribute values.
        self.inputfilename = FME_MacroValues['SourceDataset_CSV2']
        with open(self.inputfilename) as f:
            self.data = [tuple(line) for line in csv.reader(f) if line]

    def input(self, feature):
        # Treat a missing 'text' attribute as empty text.
        text = feature.getAttribute('text') or ''
        for y in self.data:
            # Count how many of this record's values occur in the text.
            value = sum(1 for x in y if x in text)
            if value == len(y):
                # Every value was found: copy the record onto the feature.
                feature.setAttribute('value', value)
                feature.setAttribute('record', ','.join(y))
                feature.setAttribute('ID', y[0])
                feature.setAttribute('Name', y[1])
                feature.setAttribute('Date', y[2])
                feature.setAttribute('Location', y[3])
                self.pyoutput(feature)
                break

    def close(self):
        pass

jdh
Contributor
  • Author
  • Contributor
  • June 19, 2019
erik_jan wrote:

If the datasets are not too big, you could use an unconditional FeatureMerger (join on 1=1) to create a Cartesian join.

Then follow by a Tester to test on:

Text contains ID And Text contains Name ......

That would solve it using 2 transformers.

I would say that it averages to about 2000 records, and there are 8 attributes to check.

 


erik_jan
Contributor
  • Contributor
  • June 19, 2019
jdh wrote:

I would say that it averages to about 2000 records, and there are 8 attributes to check.

 

With 2000 records, I would give this a try, assuming you use FME 2019, which is a lot faster and better with memory.


erik_jan
Contributor
  • Contributor
  • June 19, 2019
erik_jan wrote:

If the datasets are not too big, you could use an unconditional FeatureMerger (join on 1=1) to create a Cartesian join.

Then follow by a Tester to test on:

Text contains ID And Text contains Name ......

That would solve it using 2 transformers.

And in FME 2019 using the FeatureJoiner instead of FeatureMerger


jdh
Contributor
  • Author
  • Contributor
  • June 19, 2019
erik_jan wrote:

And in FME 2019 using the FeatureJoiner instead of FeatureMerger

A FeatureJoiner with 2000 features on an unconditional merge would output 4 million features, all of which would have to be tested. Also, how do you differentiate a record that has no matching text from a record that just doesn't match that particular text?
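
One possible way to look at both concerns, sketched in plain Python rather than with transformers: iterate over the record/text pairs lazily, so the joined features are never all held at once, and afterwards report the records that never appeared in any passing pair. The function name `join_and_test` and the dictionary inputs are hypothetical, and the plain substring test is only an assumption about how a Tester "contains" check would behave.

```python
from itertools import product


def join_and_test(records, texts):
    """Test every (record, text) pair and separate out unmatched records.

    records: dict mapping a record id to a tuple of attribute values.
    texts:   dict mapping a text id to its freeform text.
    Returns (matched_pairs, unmatched_record_ids).
    """
    matched_pairs = []
    # product() yields the pairs one at a time; with 2000 records and
    # 2000 texts that is 4,000,000 pairs tested, but none are stored.
    for (rec_id, values), (text_id, text) in product(records.items(),
                                                     texts.items()):
        # all() stops at the first missing value, so most pairs are
        # rejected after one or two substring checks.
        if all(v in text for v in values):
            matched_pairs.append((rec_id, text_id))
    # A record with no matching text is one that passed with no text at
    # all, as opposed to one that merely failed against a particular text.
    matched_ids = {rec_id for rec_id, _ in matched_pairs}
    unmatched = [rec_id for rec_id in records if rec_id not in matched_ids]
    return matched_pairs, unmatched
```

The pair count is unchanged, so this does not remove the cost being objected to; it only shows that the failed pairs can be discarded immediately and that unmatched records fall out as a set difference at the end.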


jdh
Contributor
  • Author
  • Contributor
  • June 19, 2019
ebygomm wrote:

What format is your record dataset in? If it is stored as a csv/text or similar I'd be tempted to read the csv directly to a list of tuples and iterate over them for matches, e.g.


I like your idea of just tracking how many values passed.

