Question

I have two address-based datasets that need matching: one with UPRNs and one containing all the subscriber addresses to be matched. How can I format my data to improve my fuzzy match output?

  • 19 July 2021

The two datasets share a common field: Subscriber address in one and address in the other. However, I believe there is a formatting issue, because when I fuzzy matched the Subscriber address against the UPRN dataset the similarity ratio was poor, yet when I checked the fuzzy match output manually the matches were mostly correct. Is there an extra step, using transformers, to reduce the effect of factors such as spacing and extra information such as "Nottinghamshire"? I hope the blurry images below help explain the formats.

 

Thank you for any help or advice!!


1 reply


I guess the UPRN dataset originally contained line breaks, but they got lost along the way?

 

Maybe try to recover those line breaks, because without them matching the data becomes more complex.

 

Or you could try to split out the city names using a reference dataset of city names, and match on those first.
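As a rough illustration of the clean-up and city-splitting idea, here is a minimal Python sketch (for example for a scripted step before the fuzzy match). The `PLACE_NAMES` set and the function names are hypothetical; in practice the list would come from a real reference dataset of town and county names. It only handles single-word place names, so multi-word names would need extra work.

```python
import re
from typing import List, Tuple

# Hypothetical lookup list (an assumption for this sketch); a real one would
# be loaded from a reference dataset of town and county names.
PLACE_NAMES = {"NOTTINGHAM", "NOTTINGHAMSHIRE", "MANSFIELD"}

def normalise(address: str) -> str:
    """Uppercase, replace punctuation and line breaks with spaces, collapse whitespace."""
    text = address.upper()
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def split_places(address: str) -> Tuple[str, List[str]]:
    """Return the address without known place names, plus the place names found."""
    tokens = normalise(address).split()
    found = [t for t in tokens if t in PLACE_NAMES]
    rest = [t for t in tokens if t not in PLACE_NAMES]
    return " ".join(rest), found

if __name__ == "__main__":
    print(split_places("12  High Street,\nMansfield, Nottinghamshire NG18 1AA"))
```

Running the same function over both datasets should mean that differences in spacing, case, and extra county names no longer drag the similarity ratio down.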

 

In the Netherlands, postal codes have a very strict and recognizable format: 4 digits followed by 2 letters. A postal code narrows the search down almost to the exact street within a town, so finding the postal codes first and matching on those would help a lot. Is it probably always the last 7 characters of each address in your dataset?
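To sketch the postcode-first idea in Python: the example below assumes the addresses are UK ones (UPRNs are a UK identifier) and that the postcode is the last thing in each address. UK postcodes vary between 5 and 7 characters plus an optional space, so it uses a pattern rather than a fixed "last 7 characters" slice. The pattern is a simplification, not a full postcode validator, and the function names are made up for this example.

```python
import re
from difflib import SequenceMatcher

# Simplified UK postcode pattern, anchored at the end of the string.
# Assumption: the postcode is the last element of the address.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\s*$")

def split_postcode(address):
    """Return (address_without_postcode, postcode), or (address, None) if none is found."""
    text = re.sub(r"\s+", " ", address.upper()).strip()
    match = POSTCODE_RE.search(text)
    if match is None:
        return text, None
    postcode = match.group(1) + " " + match.group(2)
    return text[:match.start()].rstrip(" ,"), postcode

def best_match(subscriber, candidates):
    """Fuzzy match only against candidates that share the subscriber's postcode."""
    sub_text, sub_pc = split_postcode(subscriber)
    best = (None, 0.0)
    for cand in candidates:
        cand_text, cand_pc = split_postcode(cand)
        if sub_pc is None or cand_pc != sub_pc:
            continue  # different or missing postcode: skip this candidate
        score = SequenceMatcher(None, sub_text, cand_text).ratio()
        if score > best[1]:
            best = (cand, score)
    return best

if __name__ == "__main__":
    uprn_addresses = ["12 HIGH ST MANSFIELD NG181AA", "12 HIGH STREET NEWARK NG24 1AA"]
    print(best_match("12 High Street, Mansfield, NG18 1AA", uprn_addresses))
```

Because the fuzzy comparison only runs within one postcode, a lower similarity threshold can be tolerated without picking up look-alike addresses from other towns.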
