Solved

Combine single character text features to create words (longer strings)


0xbox0
Contributor

Hi,

 

I'm trying to read PDFs and it looks like each character of each word is represented as individual features. Each PDF has thousands of these text features that I need to combine to make words. I'm stumped as to how to string these single character text features together to create words.

 

I have attached an example below.

 

Any help would be appreciated. Thanks in advance.


3 replies

david_r
Evangelist
  • October 23, 2023

In the reader parameters, make sure the spatial text is read in as one feature per (text) block:

[image]

In theory, that should fix it, but my experience has been mixed, particularly with narrow fonts or certain page sizes. For some documents, the only reliable results I got were from this Python module, which is exceptionally good at locating text blocks: https://pdfminersix.readthedocs.io/en/latest/ For the tightest integration with your workspace, you'll have to implement it in a PythonCaller, but it can apparently also be used from the command line.


0xbox0
Contributor
  • Author
  • October 30, 2023

Unfortunately, changing the Spatial Text parameter didn't help in my case and I haven't gotten around to trying to figure out the Python. Thanks tho for the suggestion. I'll give it a try again later on.

 

Cheers


david_r
Evangelist
  • Best Answer
  • October 30, 2023
0xbox0 wrote:

Unfortunately, changing the Spatial Text parameter didn't help in my case and I haven't gotten around to trying to figure out the Python. Thanks tho for the suggestion. I'll give it a try again later on.

 

Cheers

As a starting point, here's sample code for the PythonCaller that outputs one feature for each text block found in a PDF. The path to the PDF is read from the attribute "pdf_filename":

import fmeobjects
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal, LAParams
 
 
class ExtractPdfTextBlocks(object):
    def __init__(self):
        # LAParams documentation can be found here:
        # https://pdfminersix.readthedocs.io/en/latest/reference/composable.html
        self.params = LAParams(line_margin=0, 
                               detect_vertical=False, 
                               boxes_flow=None, 
                               all_texts=False)
 
    def input(self, feature):
        pdf_filename = feature.getAttribute('pdf_filename')
 
        for page_layout in extract_pages(pdf_filename, laparams=self.params):
            for element in page_layout:
                if isinstance(element, LTTextBoxHorizontal):
                    text_string = str(element.get_text()).strip()
                    text_x_pos = element.x0
                    text_y_pos = element.y0
                    text_bbox = element.bbox
                    
                    # Create a new feature for each text block found
                    new_feature = feature.clone()
                    new_feature.setAttribute('page_number', page_layout.pageid)
                    new_feature.setAttribute('text_x_pos', text_x_pos)
                    new_feature.setAttribute('text_y_pos', text_y_pos)
                    new_feature.setAttribute('text_bbox', str(text_bbox))
                    new_feature.setAttribute('text_string', text_string)
                    self.pyoutput(new_feature)
                               
    def close(self):
        pass

You'll want to expose the following attributes in the PythonCaller:

  • text_x_pos 
  • text_y_pos 
  • text_string 
  • text_bbox 
  • page_number
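
Note that text_bbox is written as the string form of a Python tuple, so it needs parsing before you can do arithmetic on it downstream. A minimal sketch of how that could look, assuming the value has the pdfminer.six order (x0, y0, x1, y1); the helper names parse_bbox and bbox_size are made up for illustration and are not part of FME or pdfminer.six:

```python
import ast
from typing import Tuple


def parse_bbox(bbox_text: str) -> Tuple[float, float, float, float]:
    """Turn a string such as '(72.0, 700.5, 200.25, 712.0)' back into floats.

    Assumes the string was produced by str() on a 4-tuple (x0, y0, x1, y1).
    """
    x0, y0, x1, y1 = (float(v) for v in ast.literal_eval(bbox_text))
    return x0, y0, x1, y1


def bbox_size(bbox_text: str) -> Tuple[float, float]:
    """Return (width, height) of the text block in PDF units."""
    x0, y0, x1, y1 = parse_bbox(bbox_text)
    return x1 - x0, y1 - y0
```

In a second PythonCaller you could call bbox_size(feature.getAttribute('text_bbox')) and store the result in new attributes, for example to filter out very small blocks.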

Sample output:

[image]

To install pdfminer.six, open a command prompt as a local administrator in the FME installation folder:

fme python -m pip install pdfminer.six

See also: https://docs.safe.com/fme/html/FME-Form-Documentation/FME-Form/Workbench/Installing-Python-Packages.htm

Documentation for pdfminer.six: https://pdfminersix.readthedocs.io/en/latest/index.html
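
If installing a package isn't an option, the basic idea of stitching single characters into words can also be sketched in plain Python. This is only a rough illustration, not how pdfminer.six does it internally: it assumes each character feature carries its text, left and right x coordinates and a baseline y, and the Char type, the function name and the two tolerance values are all hypothetical and would need tuning per document.

```python
from typing import List, NamedTuple


class Char(NamedTuple):
    text: str   # the single character
    x0: float   # left edge
    x1: float   # right edge
    y0: float   # baseline (PDF y grows upward)


def group_chars_into_words(chars: List[Char],
                           gap_tol: float = 1.0,
                           y_tol: float = 2.0) -> List[str]:
    # Step 1: cluster characters into lines, top of page first.
    lines: List[List[Char]] = []
    for c in sorted(chars, key=lambda c: -c.y0):
        if lines and abs(lines[-1][0].y0 - c.y0) <= y_tol:
            lines[-1].append(c)
        else:
            lines.append([c])

    # Step 2: walk each line left to right and split on large gaps
    # or on explicit whitespace characters.
    words: List[str] = []
    for line in lines:
        line.sort(key=lambda c: c.x0)
        current: List[Char] = []
        for c in line:
            starts_new = current and c.x0 - current[-1].x1 > gap_tol
            if c.text.isspace() or starts_new:
                if current:
                    words.append("".join(k.text for k in current))
                current = []
            if not c.text.isspace():
                current.append(c)
        if current:
            words.append("".join(k.text for k in current))
    return words
```

Inside FME you would first have to collect the character features for a page (for example in a group-based PythonCaller) before calling something like this. Fixed tolerances tend to break down with mixed font sizes, which is one reason I'd still reach for pdfminer.six first.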

