Question

Analyze attribute values for whole words



Anyone know of a good way to analyze attribute values, and determine if a given value is a word in English?

Or maybe even check it against a custom dictionary of names? I'm trying to clean up a bunch of values that have spaces from when the PDFs were output to data via OCR, so it looks like this:

 

Attribute Value: Thi s is a sent e nc e.
What I want attribute to be corrected to: This is a sentence.
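As a starting point for the "is this a word?" part of the question, here is a minimal Python sketch that flags the tokens of a value that are missing from a word list. The function name `find_non_words` and the tiny `SAMPLE_WORDS` set are hypothetical and only for illustration; in practice the set would be loaded from a real English word list or a custom dictionary of names.

```python
import string


def find_non_words(value, dictionary):
    """Return the tokens of `value` that are not found in `dictionary`.

    Tokens are split on whitespace, stripped of surrounding punctuation,
    and compared case-insensitively.
    """
    known = {word.lower() for word in dictionary}
    suspects = []
    for token in value.split():
        core = token.strip(string.punctuation)
        if core and core.lower() not in known:
            suspects.append(core)
    return suspects


# Hypothetical toy dictionary, just large enough for the example value.
SAMPLE_WORDS = {"this", "is", "a", "sentence"}

print(find_non_words("Thi s is a sent e nc e.", SAMPLE_WORDS))
```

A value with no flagged tokens is probably clean; a value with many flagged tokens is a candidate for repair.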

4 replies

pratap
Contributor
  • April 21, 2018

Hi,

I'm not 100% sure, but I suggest using an AttributeSplitter to split the value on " " (space). If the resulting list contains a single letter other than "a", merge it with the list{x} elements to its left and right so that a word forms.

The example above would become:

Thi s is a sent e nc e --> Thisis a sentence

Based on the results, you can then add further rules.
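The merge rule described above can be sketched in plain Python, outside any particular tool. This is only one reading of the rule: the function name `merge_single_letters` and the `keep` parameter are assumptions made for illustration, and a lone letter is glued to both its left and right neighbors.

```python
import string


def merge_single_letters(value, keep=("a",)):
    """Glue single-letter fragments to their left and right neighbors.

    A token counts as a fragment when, ignoring surrounding punctuation,
    it is one letter that is not listed in `keep`.
    """
    merged = []
    glue_next = False  # True when the previous token was a fragment
    for token in value.split():
        core = token.strip(string.punctuation)
        is_fragment = (
            len(core) == 1 and core.isalpha() and core.lower() not in keep
        )
        if merged and (is_fragment or glue_next):
            merged[-1] += token  # attach to the word being built
        else:
            merged.append(token)
        glue_next = is_fragment
    return " ".join(merged)


print(merge_single_letters("Thi s is a sent e nc e."))
# Thisis a sentence.
```

The over-merged "Thisis" shows why the rule alone is not enough: a dictionary check is still needed to decide whether a fragment belongs to the word on its left, on its right, or both.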

Pratap


david_r
Evangelist
  • April 23, 2018

That's a pretty cool, but potentially difficult issue to deal with.

How do you plan on dealing with ambiguities, e.g. "a void" vs "avoid"?
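To make the ambiguity concrete, here is a small Python sketch that removes the spaces from a value and lists every way of splitting it back into dictionary words. The function name `all_segmentations` and the toy word set are hypothetical; any value that yields more than one result needs a tie-breaking rule or manual review.

```python
from functools import lru_cache


def all_segmentations(value, words):
    """Return every split of `value` (spaces removed) into dictionary words."""
    letters = "".join(value.lower().split())
    vocab = frozenset(word.lower() for word in words)

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end means the letters before `start` split cleanly.
        if start == len(letters):
            return ((),)
        results = []
        for end in range(start + 1, len(letters) + 1):
            piece = letters[start:end]
            if piece in vocab:
                for rest in split_from(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [" ".join(parts) for parts in split_from(0)]


# Hypothetical toy dictionary containing both readings.
print(all_segmentations("a void", {"a", "void", "avoid"}))
# ['a void', 'avoid']
```

An empty result means the dictionary cannot explain the value at all, which is also worth flagging.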


pratap
Contributor
  • April 23, 2018

Good question @david_r

In the end, English is a language, and the meaning of a sentence changes depending on the situation. In this context we are joining fragments to rebuild a sentence, so "a void" and "avoid" are both correct English :)

So it depends on @dmatranga to decide...

Pratap

