Question

PDF Reader Text



I've extracted text from a PDF file.

But I have to group or aggregate this text now. As you can see in my screenshot, I have text such as "4.1" or "5.2", but the PDF Reader returns every character as a separate point with a text attribute.

How can I aggregate them into one feature with one attribute that contains, e.g., "4.1"?

I know it's simple, but I tried NeighborFinder, NeighborhoodAggregator and so on... the result isn't right.

So for you specialists it's no problem, I think.

Greetz and Cheers

Franco


7 replies


Just thinking out loud here... if you extract the coordinate values for every point and sort them on the Y coordinate and then on the X coordinate, the features should enter a NeighborhoodAggregator in the correct order. Build a list, concatenate it, and you should be done.

none2none.fmw

Of course this will only work with horizontal text...
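
The sort-then-concatenate idea can be sketched outside FME in plain Python. This is only an illustration under assumptions: each character is a hypothetical `(x, y, text)` tuple, the text is horizontal, and the `row_tol` and `max_gap` thresholds are made-up values that would have to be tuned to the real data.

```python
from typing import List, Tuple

Char = Tuple[float, float, str]  # hypothetical layout: (x, y, single character)


def group_labels(chars: List[Char], row_tol: float = 0.5, max_gap: float = 1.5) -> List[str]:
    """Concatenate single-character points into labels (horizontal text only)."""
    # Step 1: collect characters into rows, top to bottom, by similar Y.
    rows: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: -c[1]):
        if rows and abs(rows[-1][0][1] - ch[1]) <= row_tol:
            rows[-1].append(ch)
        else:
            rows.append([ch])

    # Step 2: inside each row, read left to right and split on large gaps.
    labels: List[str] = []
    for row in rows:
        row.sort(key=lambda c: c[0])
        current = [row[0]]
        for prev, ch in zip(row, row[1:]):
            if ch[0] - prev[0] <= max_gap:
                current.append(ch)
            else:
                labels.append("".join(c[2] for c in current))
                current = [ch]
        labels.append("".join(c[2] for c in current))
    return labels


sample = [
    (0.0, 10.0, "4"), (1.0, 10.1, "."), (1.6, 9.9, "1"),
    (10.0, 10.0, "5"), (11.0, 10.0, "."), (11.6, 10.0, "2"),
]
print(group_labels(sample))  # ['4.1', '5.2']
```

Grouping rows with a tolerance before sorting on X avoids the case where tiny differences in Y put characters of the same label in the wrong order.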

In case a little background info is interesting: this effect is due to how the data is structured in your PDF file (assuming you are using "Text per Block" mode instead of "Text per Character" mode).

 

 

The PDF authoring software that created the data chose to create separate text objects for each letter. That means that the PDF2D reader has no idea which characters are supposed to be associated with each other; every glyph is drawn separately.

 

 


Hi,

I've tried this, but the sorting is not right (I tested it with a Counter afterwards),

as you can see in the screenshot... hmmm...


@franco69

Since the labels are numbers with decimals, use the character spacing.

The spacing is equal, apart from the digit following the decimal point object, and the thousands separator.

So the distance takes 3 values, which you can use as moduli; any other distance means the character does not belong to the label.

You can therefore query the point layer by searching for the decimal point objects first.

Then find the value to the right by the smaller distance (the angle does not matter). Check whether there are more decimals by searching at the (modulus) spacing distance, as in the search to the left.

Find the values to the left by the (modulus) spacing distance, the spacing distance being the next-to-smallest distance, of course.

In case of overlaps (where the point distance is smaller than the spacing distance; otherwise they pose no problem), run a query to separate those first.
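
One possible reading of this spacing approach, sketched in plain Python rather than as an FME workflow: start at a decimal point, treat characters at whole multiples of the spacing distance as digits to the left, and characters at the smaller distance plus multiples of the spacing as decimals. The `(x, y, text)` tuples, the function name `label_around`, and the `spacing`, `small`, `tol` and `max_digits` values are all assumptions for illustration; thousands separators and overlapping labels are not handled.

```python
import math
from typing import List, Optional, Tuple

Char = Tuple[float, float, str]  # hypothetical layout: (x, y, single character)


def label_around(dot: Char, chars: List[Char], spacing: float, small: float,
                 tol: float = 0.1, max_digits: int = 6) -> str:
    """Grow a label outward from a decimal point using distances only."""

    def at_distance(target: float) -> Optional[Char]:
        # Plain distance, so the text angle does not matter.
        for c in chars:
            d = math.hypot(c[0] - dot[0], c[1] - dot[1])
            if c is not dot and abs(d - target) <= tol:
                return c
        return None

    # Digits left of the point sit at 1x, 2x, 3x ... the regular spacing.
    left: List[str] = []
    for k in range(1, max_digits + 1):
        c = at_distance(k * spacing)
        if c is None:
            break
        left.append(c[2])

    # The first decimal sits at the smaller distance, later ones add the spacing.
    right: List[str] = []
    for k in range(max_digits):
        c = at_distance(small + k * spacing)
        if c is None:
            break
        right.append(c[2])

    return "".join(reversed(left)) + dot[2] + "".join(right)


points = [(0.0, 0.0, "1"), (1.0, 0.0, "4"), (2.0, 0.0, "."),
          (2.6, 0.0, "2"), (3.6, 0.0, "5")]
print(label_around(points[2], points, spacing=1.0, small=0.6))  # 14.25
```

This only works when the smaller distance is clearly different from a multiple of the spacing; otherwise left digits and decimals cannot be told apart by distance alone.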



Are you grouping by text string by any chance?

 

 



Those characters are directly above/below each other; I wonder if you are sorting in such a way that it associates features by Y instead of by X?
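
To illustrate why the sort order matters for labels stacked above each other, here is a small plain-Python sketch with made-up coordinates (hypothetical `(x, y, text)` tuples, not the real data). Sorting on X first interleaves the two labels, while sorting on Y first keeps each label together.

```python
# Two labels, "4.1" above "5.2", sharing the same X positions (invented values).
chars = [
    (0.0, 10.0, "4"), (1.0, 10.0, "."), (1.6, 10.0, "1"),
    (0.0, 8.0, "5"), (1.0, 8.0, "."), (1.6, 8.0, "2"),
]


def reading_order(chars, key):
    """Join the characters in the order produced by the given sort key."""
    return "".join(text for _, _, text in sorted(chars, key=key))


# X first, then Y (top to bottom): characters of both labels get mixed.
x_first = reading_order(chars, key=lambda c: (c[0], -c[1]))
# Y first (top to bottom), then X: each label stays contiguous.
y_first = reading_order(chars, key=lambda c: (-c[1], c[0]))

print(x_first)  # 45..12
print(y_first)  # 4.15.2
```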

 

 


Hey @franco69,

I saw a very similar problem and resolution here: https://knowledge.safe.com/questions/70028/conditional-feature-stringattribute-concatenator.html

You might want to check out their workflow!
