
Both data sets share a common address field (Subscriber address in one, address in the other). However, I believe there is a formatting issue: when I fuzzy matched the Subscriber address against the UPRN dataset, the ratio was poor, yet when I manually checked the fuzzy match output the matches were mostly correct. Is there an extra step, using transformers, to reduce the effect of factors such as spacing and extra information such as "Nottinghamshire"? I hope the blurry images below help explain the formats.

 

Thank you for any help or advice!!

I guess the UPRN data set originally contained line breaks, but those got lost along the way?

 

Maybe try to recover those line breaks, because without them matching the data is more complex.

 

Or you could try to split out the city names using a reference dataset of city names, and try to match on those first.
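As a rough sketch of that kind of clean-up, assuming you can run a bit of Python on each address before the fuzzy match: the snippet below lowercases the text, turns line breaks, commas and repeated spaces into single spaces, and drops words from a small list. The names `DROP_WORDS` and `normalize_address` are made up for illustration, and the list only contains the "Nottinghamshire" example from the question, so you would need to extend it for your own data.

```python
import re

# Hypothetical list of county/region words to drop before matching.
# Only "nottinghamshire" comes from the question; extend as needed.
DROP_WORDS = {"nottinghamshire"}


def normalize_address(raw: str) -> str:
    """Return a lowercased address with punctuation, line breaks and
    extra spaces collapsed, and any DROP_WORDS removed."""
    text = raw.lower()
    # Replace every run of non-alphanumeric characters (commas,
    # line breaks, multiple spaces) with a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # Drop the unwanted words and rejoin with single spaces.
    tokens = [t for t in text.split() if t not in DROP_WORDS]
    return " ".join(tokens)


print(normalize_address("12 High Street,\nNewark,  Nottinghamshire NG24 1AA"))
```

Applying the same function to both address fields before the fuzzy match should mean the ratio is no longer penalised for spacing or for a county name that only one side includes.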

 

In the Netherlands, postal codes have a very strict and recognizable format: 4 digits followed by 2 letters. They narrow the search down almost to the exact street within a town. So finding the postal codes first and matching on those would help a lot. Is the postal code perhaps always the last 7 characters of each address in your dataset?
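If the addresses are UK ones (an assumption based on the "Nottinghamshire" example), the postcode is not a fixed length, so slicing off a fixed number of characters may miss some. A minimal sketch that looks for a postcode-shaped token at the end of the address instead is below. The pattern is a simplified approximation of the UK format and not a full validator, and `POSTCODE_RE` and `extract_postcode` are just illustrative names.

```python
import re

# Simplified UK-style postcode at the end of the string:
# outward part (1-2 letters, a digit, optional letter/digit),
# optional space, inward part (a digit and 2 letters).
# This is an approximation, not a complete validator.
POSTCODE_RE = re.compile(r"([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\s*$")


def extract_postcode(address):
    """Return the trailing postcode as 'OUTWARD INWARD', or None if
    the address does not end in something postcode-shaped."""
    match = POSTCODE_RE.search(address.upper())
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


print(extract_postcode("12 High Street Newark ng241aa"))
print(extract_postcode("12 High Street"))
```

With a postcode column on both sides you can join on it exactly first and only run the fuzzy match on the remaining street part within each postcode.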

