Question

PDF Reader Text

7 years ago
May 3, 2018
7 replies
47 views

franco69
Contributor
164 replies

I've extracted text from a PDF file,

but now I have to group or aggregate this text. As you can see in my screenshot, I have text such as "4.1" or "5.2", but the PDF Reader returns every character as a separate point with a text attribute.

How can I aggregate them into one feature with a single attribute that contains, for example, "4.1"?

I know it's simple, but I tried the NeighborFinder, the NeighborhoodAggregator and so on, and the result isn't right.

So for you specialists it should be no problem, I think.

Greetz and Cheers

Franco


redgeographics
Celebrity
3643 replies
7 years ago
May 3, 2018

Just thinking out loud here... If you extract the coordinate values for every point, sort them on the Y coordinate and then on the X coordinate, and then use a NeighborhoodAggregator, the points will enter it in the correct order. Build a list, concatenate it, and you should be done.

none2none.fmw

Of course this will only work with horizontal text...
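
Outside of a workspace, the sort-then-concatenate idea above can be sketched in plain Python. This is only an illustration under assumptions, not the contents of the attached workspace: the `(x, y, char)` tuple layout, the function name `group_glyphs`, and the tolerances `y_tol` and `max_gap` are all hypothetical and would need tuning to the actual data. It assumes horizontal text.

```python
from typing import List, Tuple

Glyph = Tuple[float, float, str]  # (x, y, character); hypothetical layout


def group_glyphs(glyphs: List[Glyph], y_tol: float = 0.5,
                 max_gap: float = 1.5) -> List[str]:
    """Concatenate single-character points into labels (horizontal text only)."""
    # Step 1: sort top-to-bottom and cluster glyphs into rows by Y.
    rows: List[List[Glyph]] = []
    for g in sorted(glyphs, key=lambda g: -g[1]):
        if rows and abs(rows[-1][-1][1] - g[1]) <= y_tol:
            rows[-1].append(g)
        else:
            rows.append([g])

    labels: List[str] = []
    for row in rows:
        # Step 2: within a row, sort left-to-right by X.
        row.sort(key=lambda g: g[0])
        current = [row[0]]
        # Step 3: start a new label whenever the horizontal gap is too large.
        for prev, cur in zip(row, row[1:]):
            if cur[0] - prev[0] <= max_gap:
                current.append(cur)
            else:
                labels.append("".join(c for _, _, c in current))
                current = [cur]
        labels.append("".join(c for _, _, c in current))
    return labels


glyphs = [
    (11.0, 20.0, "."), (10.0, 20.0, "4"), (12.0, 20.0, "1"),
    (30.0, 20.0, "5"), (31.0, 20.0, "."), (32.0, 20.0, "2"),
    (10.0, 10.0, "7"),
]
print(group_glyphs(glyphs))  # ['4.1', '5.2', '7']
```

Clustering rows with a tolerance, rather than sorting on the exact Y value, keeps characters together when their baselines differ slightly.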

jakemolnar
98 replies
7 years ago
May 3, 2018

In case a little background info is interesting: this effect is due to how the data is structured in your PDF file (assuming you are using "Text per Block" mode instead of "Text per Character" mode).

The PDF authoring software that created the data chose to create separate text objects for each letter. That means that the PDF2D reader has no idea which characters are supposed to be associated with each other; every glyph is drawn separately.

franco69
Author
Contributor
164 replies
7 years ago
May 4, 2018

Hi,

I've tried this, but the sorting is not right (I tested it with a counter afterwards),

as you can see in the screenshot... hmm...


gio
Contributor
2252 replies
7 years ago
May 4, 2018

@franco69

Considering the labels: they are numbers with decimals.

Use the spacing.

The spacing is equal, apart from the digit following the point (".") object.

And the thousands separator.

So the distance has 3 values, which you can use as moduli; any other distance means the character does not belong to the label.

So you can query the point layer by searching for the point (".") objects.

Then find the value to the right by the (smaller) distance (the angle does not matter). Check whether there are more decimals by searching at the (modulus) spacing distance, as in the left search.

Find the values to the left by the (modulus) spacing distance.

The spacing distance being the next-to-smallest distance, of course.

In case of overlaps (with a point distance smaller than the spacing distance; otherwise they pose no problem), do a query to separate those first.
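
A rough Python sketch of the spacing idea above, for illustration only. It seeds a label at each "." glyph and grows it right and then left, accepting a neighbour only if its distance matches one of the known spacings. The tuple layout, the function names, and the spacing values in the example are assumptions, and unlike the description above this sketch uses the X direction to tell left from right, so it assumes roughly horizontal text.

```python
import math
from typing import List, Sequence, Tuple

Glyph = Tuple[float, float, str]  # (x, y, character); hypothetical layout


def _dist(a: Glyph, b: Glyph) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def labels_from_decimal_points(glyphs: List[Glyph],
                               spacings: Sequence[float],
                               tol: float = 0.05) -> List[str]:
    """Build labels by chaining glyphs whose distance matches a known spacing."""
    used = set()
    labels = []
    for i, g in enumerate(glyphs):
        if g[2] != "." or i in used:
            continue
        chain = [i]
        used.add(i)
        for step in (+1, -1):  # +1: extend to the right, -1: to the left
            end = i
            while True:
                candidates = [
                    j for j, h in enumerate(glyphs)
                    if j not in used
                    and (h[0] - glyphs[end][0]) * step > 0
                    and any(abs(_dist(h, glyphs[end]) - s) <= tol
                            for s in spacings)
                ]
                if not candidates:
                    break
                nxt = min(candidates, key=lambda j: _dist(glyphs[j], glyphs[end]))
                used.add(nxt)
                if step > 0:
                    chain.append(nxt)
                else:
                    chain.insert(0, nxt)
                end = nxt
        labels.append("".join(glyphs[j][2] for j in chain))
    return labels


points = [
    (0.0, 0.0, "1"), (1.0, 0.0, "2"), (2.0, 0.0, "."), (2.6, 0.0, "5"),
    (10.0, 0.0, "9"),  # too far away: not part of the label
    (2.0, 5.0, "7"),   # different row: not part of the label
]
# Assumed spacings: 1.0 between digits, 0.6 after the decimal point.
print(labels_from_decimal_points(points, spacings=[1.0, 0.6]))  # ['12.5']
```

Glyphs that never connect to a decimal point are simply left out here; they would need their own handling if whole numbers also occur.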


redgeographics
Celebrity
3643 replies
7 years ago
May 4, 2018

franco69 wrote:

Hi,

I've tried this, but the sorting is not right (I tested it with a counter afterwards),

as you can see in the screenshot... hmm...

Are you grouping by text string by any chance?

jakemolnar
98 replies
7 years ago
May 4, 2018

franco69 wrote:

Hi,

I've tried this, but the sorting is not right (I tested it with a counter afterwards),

as you can see in the screenshot... hmm...

Those characters are directly above/below each other; I wonder if you are sorting in such a way that it associates features by Y instead of by X?
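
To see why the primary sort key matters, here is a tiny Python check with made-up coordinates: two labels, "4.1" and "5.2", stacked directly above each other. The data and the helper name `reading_order` are hypothetical.

```python
# Two labels stacked vertically: "4.1" at y=20 and "5.2" at y=10.
glyphs = [
    (10.0, 20.0, "4"), (11.0, 20.0, "."), (12.0, 20.0, "1"),
    (10.0, 10.0, "5"), (11.0, 10.0, "."), (12.0, 10.0, "2"),
]


def reading_order(glyphs, primary):
    """Return the characters in sorted order for the given primary axis."""
    if primary == "y":
        # Row by row (top first), then left to right within each row.
        key = lambda g: (-g[1], g[0])
    else:
        # Column by column, which interleaves the two labels.
        key = lambda g: (g[0], -g[1])
    return "".join(c for _, _, c in sorted(glyphs, key=key))


print(reading_order(glyphs, "y"))  # 4.15.2
print(reading_order(glyphs, "x"))  # 45..12
```

With X as the primary key, characters that sit above or below each other become neighbours in the sorted stream, which matches the symptom described here.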

jakemolnar
98 replies
7 years ago
May 14, 2018

Hey @franco69,

I saw a very similar problem and resolution here: https://knowledge.safe.com/questions/70028/conditional-feature-stringattribute-concatenator.html

You might want to check out their workflow!
