Question

PDF Reader Text


franco69
Contributor

I've extracted text from a PDF file,

 

but now I have to group or aggregate this text. As you can see in my screenshot, I have text such as "4.1" or "5.2" etc., but every character comes out of the PDF reader as a separate point with a text attribute.

 

How can I aggregate them into one feature with one attribute that contains e.g. "4.1"?

I know it's simple, but I tried NeighborFinder, NeighborhoodAggregator and so on... the result isn't right.

 

So for you specialists no prob i think..

Greetz and Cheers

Franco

7 replies

redgeographics
Celebrity

Just thinking out loud here... if you extract the coordinate values for every point, sort them on the Y coordinate and then on the X coordinate, and then try a NeighborhoodAggregator, they'll enter it in the correct order. Build a list, concatenate it, and you should be done.

none2none.fmw

Of course this will only work with horizontal text...
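
For anyone who would rather prototype the idea outside the workspace first, here is a minimal Python sketch of the sort-then-group approach described above. It is not the attached workspace and not how NeighborhoodAggregator works internally. The `Glyph` class, `group_glyphs`, and the `row_tol` and `max_gap` tolerances are made-up names for illustration, and it assumes horizontal text with Y increasing upward.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One extracted character point (hypothetical structure)."""
    x: float
    y: float
    text: str


def group_glyphs(glyphs, row_tol=1.0, max_gap=3.0):
    """Group single-character points into labels (horizontal text only)."""
    # Sort top to bottom (descending Y), then left to right (ascending X).
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))

    # Collect glyphs whose Y is within row_tol of the row's first glyph.
    rows = []
    for g in ordered:
        if rows and abs(rows[-1][0].y - g.y) <= row_tol:
            rows[-1].append(g)
        else:
            rows.append([g])

    labels = []
    for row in rows:
        row.sort(key=lambda g: g.x)
        current = [row[0]]
        for prev, g in zip(row, row[1:]):
            if g.x - prev.x <= max_gap:
                current.append(g)  # close enough: same label
            else:
                labels.append("".join(c.text for c in current))
                current = [g]      # gap too wide: start a new label
        labels.append("".join(c.text for c in current))
    return labels
```

The two tolerances would have to be tuned to the font size and label spacing in the actual PDF.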


jakemolnar
In case a little background info is interesting: this effect is due to how the data is structured in your PDF file (assuming you are using "Text per Block" mode instead of "Text per Character" mode).

 

 

The PDF authoring software that created the data chose to create separate text objects for each letter. That means that the PDF2D reader has no idea which characters are supposed to be associated with each other; every glyph is drawn separately.

 

 


franco69
Contributor
  • May 4, 2018

Hi,

I've tried this, but the sorting is not right (I checked it with a counter afterwards),

 

as you can see in the screenshot... hmmm...


gio
Contributor
  • May 4, 2018

@franco69

Considering the labels: they are numbers with decimals.

Use the spacing.

The spacing is equal, apart from the digit following the decimal point.

And the thousands separator.

So the distance takes 3 values, which you can use as moduli; any other distance means the characters do not belong together.

So you can query the point layer by searching for the decimal points (the "." characters).

Then find the value to the right by the (smaller) distance (the angle does not matter). Check whether there are more decimals by searching further at the spacing distance.

Find the values to the left by the spacing distance (as a modulus).

The spacing distance being the next-to-smallest distance, of course.

In case of overlaps (with a point distance smaller than the spacing distance; otherwise they pose no problem), do a query to separate those first.
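
One possible reading of the spacing idea above, as a rough Python sketch rather than an FME workflow. It assumes horizontal text, glyphs given as `(x, y, char)` tuples, a known regular `spacing`, and a digit after the decimal point that sits closer than that spacing. The function names and the `tol` tolerance are invented for illustration, and the thousands separator and overlap cases are not handled.

```python
import math


def _dist(a, b):
    """Planar distance between two (x, y, char) glyphs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def _step(current, digits, direction, spacing, tol):
    """Return a digit about one spacing away in the given X direction, or None."""
    for g in digits:
        dx = g[0] - current[0]
        if dx * direction > 0 and abs(_dist(g, current) - spacing) <= tol:
            return g
    return None


def labels_from_decimal_points(glyphs, spacing, tol=0.2):
    """Build one label per '.' glyph by walking outward from it."""
    digits = [g for g in glyphs if g[2] != "."]
    labels = []
    for dot in (g for g in glyphs if g[2] == "."):
        chars = ["."]

        # First decimal: nearest glyph to the right, no further than one spacing.
        right = [g for g in digits
                 if g[0] > dot[0] and _dist(g, dot) <= spacing + tol]
        cur = min(right, key=lambda g: _dist(g, dot), default=None)
        while cur is not None:
            chars.append(cur[2])
            # Further decimals sit one regular spacing to the right.
            cur = _step(cur, digits, +1, spacing, tol)

        # Integer part: walk left in steps of one regular spacing.
        cur = _step(dot, digits, -1, spacing, tol)
        while cur is not None:
            chars.insert(0, cur[2])
            cur = _step(cur, digits, -1, spacing, tol)

        labels.append("".join(chars))
    return labels
```

Starting from the decimal points means every label is anchored on a character that occurs exactly once per number, which is what makes the distance test workable.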


redgeographics
Celebrity
franco69 wrote:

Hi,

I've tried this, but the sorting is not right (I checked it with a counter afterwards),

 

as you can see in the screenshot... hmmm...

Are you grouping by text string by any chance?

 

 


jakemolnar
franco69 wrote:

Hi,

I've tried this, but the sorting is not right (I checked it with a counter afterwards),

 

as you can see in the screenshot... hmmm...

Those characters are directly above/below each other; I wonder if you are sorting in such a way that it associates features by Y instead of by X?
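
To illustrate the suspicion, here is a tiny Python sketch with made-up coordinates (not taken from the screenshot): two labels stacked directly above each other, sorted once with X as the primary key and once with the row (Y) as the primary key. It assumes Y increases upward.

```python
# Two stacked labels, "4.1" above "5.2", sharing the same X positions.
glyphs = [
    (0.0, 10.0, "4"), (2.0, 10.0, "."), (4.0, 10.0, "1"),
    (0.0, 5.0, "5"), (2.0, 5.0, "."), (4.0, 5.0, "2"),
]


def concat(points, key):
    """Concatenate the characters in the order given by the sort key."""
    return "".join(ch for _, _, ch in sorted(points, key=key))


# X first: characters from both rows interleave.
by_x_first = concat(glyphs, key=lambda g: (g[0], -g[1]))

# Row (Y) first, then X: each label stays together.
by_row_first = concat(glyphs, key=lambda g: (-g[1], g[0]))

print(by_x_first)    # 45..12
print(by_row_first)  # 4.15.2
```

If a counter downstream shows an order like the first result, the sort keys are probably in the wrong priority.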

 

 


jakemolnar

Hey @franco69,

I saw a very similar problem and resolution here: https://knowledge.safe.com/questions/70028/conditional-feature-stringattribute-concatenator.html

You might want to check out their workflow!

