
Hello! I’m trying to figure out the best way to aggregate values by the most common words. I’ve found a few threads and documents but not quite what I was looking for.

For example, I have a list of building names and numbers where each entry can be a variation of a building name and number:

 

“1000 The Coolest Building Ever”

“1000 Coolest Building”

“Coolest Building”

“100 Coolest Building Dr.”

 

I would like the output to be “Coolest Building”, as it has common base words across all features. Is this possible?

 

Bonus points if variations such as “Bldg.” or “Bldg” can be included. Any advice/guidance is appreciated!

 

Hm, I think you could use AI to resolve this kind of fuzzy matching, if the number of features/distinct values is not too big.
 


Yeah, there is no easy way on this one. You would first have to build your list of values using something like the video “Normalize Data Using FME Desktop” on YouTube, or, as @alexbiz said, maybe use AI for this.

Then have an attribute mapper for short forms like bldg=building or st=street.

The first thing, I think, would be to get the “extras” (like the numbers) out of the attributes and get each value down to a pure line of text: no numbers, no special characters. Then remove things like Building, Bldg, st, street to get down to a “Name”, and then normalize. Keep track of this “list of values” for an Attribute Mapper.
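
In case it helps, here is a rough Python sketch of the clean-up idea above, e.g. for a PythonCaller or for prototyping outside FME. It is only an illustration under assumptions: the `ABBREVIATIONS` map and the function names are made up for this example, and the last sample value uses “Bldg.” to show the short-form handling. It strips numbers and special characters, expands short forms, and keeps only the words shared by every variation.

```python
import re

# Hypothetical short-form map; extend it with whatever appears in your data.
ABBREVIATIONS = {"bldg": "building", "st": "street", "dr": "drive"}

def normalize(name):
    """Lowercase, drop digits and special characters, expand short forms."""
    text = re.sub(r"[^a-z\s]", " ", name.lower())
    return [ABBREVIATIONS.get(word, word) for word in text.split()]

def common_words(names):
    """Return the words shared by every name, in the order of the first name."""
    token_lists = [normalize(n) for n in names]
    shared = set(token_lists[0]).intersection(*token_lists[1:])
    return " ".join(w for w in token_lists[0] if w in shared)

names = [
    "1000 The Coolest Building Ever",
    "1000 Coolest Building",
    "Coolest Building",
    "100 Coolest Bldg. Dr.",
]
print(common_words(names).title())  # Coolest Building
```

Note that this only works when the values have already been grouped so that every entry in `names` refers to the same building; deciding which entries belong together is the harder fuzzy-matching part.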

This is a challenge. Let us know how it goes. Hope that helps.


Hi ​@slustado ,

If your examples cover every string pattern that could appear, I think you can use StringSearcher with this expression to extract the part representing the building name (“Coolest” in the examples), which is captured by the third group.
(\d*\s+)?(The\s+)?(.+)\s+(Building|Bldg)
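
To see how the expression behaves before wiring it into a workspace, here is a small Python check. This is just a sketch using Python's `re` module rather than the StringSearcher itself, and `building_name` is a helper name made up for this example; the third capture group holds the name part.

```python
import re

# Same expression as above; group 3 is the part before "Building"/"Bldg".
PATTERN = re.compile(r"(\d*\s+)?(The\s+)?(.+)\s+(Building|Bldg)")

def building_name(text):
    """Return the third capture group, or None if the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(3) if match else None

samples = [
    "1000 The Coolest Building Ever",
    "1000 Coolest Building",
    "Coolest Building",
    "100 Coolest Building Dr.",
    "1000 Coolest Bldg.",
]
for s in samples:
    print(s, "->", building_name(s))  # each sample yields "Coolest"
```

The match is case-sensitive as written, so values like “coolest building” would need a case-insensitive option or prior case normalization.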


I also think this is an easy problem for an AI to solve.

