Question

Analyze attribute values for whole words

  • April 21, 2018
  • 4 replies
  • 21 views


Anyone know of a good way to analyze attribute values, and determine if a given value is a word in English?

Or maybe even check it against a custom dictionary of names? I'm trying to clean up a bunch of values that have spaces from when the PDFs were output to data via OCR, so it looks like this:

 

Attribute Value | What I want attribute to be corrected to
Thi s is a sent e nc e. | This is a sentence.


4 replies

pratap
Contributor
  • April 21, 2018

Hi,

I'm not 100% sure, but I'd suggest using an AttributeSplitter to split the value on " " (space). If the resulting list contains a single letter other than "a", merge it with the list{x} elements to its left and right so that a word forms.

The example above would become:

Thi s is a sent e nc e --> Thisis a sentence

Based on the results, you can add further rules.
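For anyone who would rather script this rule, here is a minimal Python sketch of the idea described above. The function name `merge_single_letters` and the `keep` parameter are made up for illustration, and stripping punctuation before testing for a single letter is my own assumption so that a trailing "e." still counts as a stray letter.

```python
import string


def merge_single_letters(text, keep=("a",)):
    """Glue stray single-letter tokens (except those in `keep`) to their neighbours.

    Hypothetical helper: it follows the split-on-space rule from this post and
    makes no attempt to check whether the merged result is a real word.
    """
    tokens = text.split()
    groups = []        # each group is a list of tokens that will be joined
    glue_next = False  # True when the previous token was a stray letter
    for tok in tokens:
        # Assumption: ignore surrounding punctuation when testing for a single letter.
        core = tok.strip(string.punctuation)
        stray = len(core) == 1 and core.isalpha() and core.lower() not in keep
        if groups and (stray or glue_next):
            groups[-1].append(tok)   # attach to the group on the left
        else:
            groups.append([tok])     # start a new word
        glue_next = stray            # a stray letter also pulls in the next token
    return " ".join("".join(group) for group in groups)


print(merge_single_letters("Thi s is a sent e nc e."))
```

As in the example above, this gives "Thisis a sentence." rather than "This is a sentence.", so a dictionary check would still be needed to split "Thisis" again. Adding "i" to `keep` is probably worthwhile for English text.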

Pratap


pratap
Contributor
  • April 21, 2018

david_r
Celebrity
  • April 23, 2018

That's a pretty cool, but potentially difficult, issue to deal with.

How do you plan on dealing with ambiguities, e.g. "a void" vs "avoid"?


pratap
Contributor
  • April 23, 2018

Good question @david_r

In the end, English is a language, and the meaning of a sentence changes depending on the situation. In this context, we are joining fragments to form a sentence, so "a void" and "avoid" are both correct in English :)
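To make the ambiguity concrete, here is a rough Python sketch that lists every way of joining adjacent fragments so that each resulting word appears in a dictionary. The function name `candidate_joins` and the tiny `WORDS` set are placeholders I made up; a real run would load a proper English word list or the custom dictionary of names. Note that it lowercases the input and does not handle punctuation.

```python
def candidate_joins(text, words):
    """Return every way to join adjacent fragments so each result is in `words`.

    Hypothetical helper for illustration; `words` is any set of lowercase words.
    """
    tokens = text.lower().split()
    results = []

    def extend(start, so_far):
        # All fragments consumed: record this candidate sentence.
        if start == len(tokens):
            results.append(" ".join(so_far))
            return
        joined = ""
        # Try gluing 1, 2, 3, ... consecutive fragments into one word.
        for end in range(start, len(tokens)):
            joined += tokens[end]
            if joined in words:
                extend(end + 1, so_far + [joined])

    extend(0, [])
    return results


# Placeholder dictionary, just large enough for the examples in this thread.
WORDS = {"a", "void", "avoid", "this", "is", "sentence"}

print(candidate_joins("a void", WORDS))
print(candidate_joins("Thi s is a sent e nc e", WORDS))
```

With this word set, the first call returns both "a void" and "avoid", while the second has only one valid reading. When more than one candidate comes back, something else (context, word frequency, or a manual review) has to pick the winner.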

So it depends on @dmatranga to decide...

Pratap