Solved

Combine single character text features to create words (longer strings)

1 year ago
October 23, 2023
3 replies
29 views

0xbox0
Contributor
6 replies

Hi,

I'm trying to read PDFs and it looks like each character of each word is represented as individual features. Each PDF has thousands of these text features that I need to combine to make words. I'm stumped as to how to string these single character text features together to create words.

I have attached an example below.

Any help would be appreciated. Thanks in advance.

david_r
8314 replies
1 year ago
October 23, 2023

In the reader parameters, make sure the spatial text is read in as one feature per (text) block.

In theory, that should fix it, but my experience has been mixed, particularly with narrow fonts or certain page sizes. For some documents, the only reliable results I got were from this Python module, which is exceptionally good at locating text blocks: https://pdfminersix.readthedocs.io/en/latest/

For maximum integration into your workspace, you'll have to implement it in a PythonCaller, but it's apparently also possible to use it on the command line.

0xbox0
Author
Contributor
6 replies
1 year ago
October 30, 2023

Unfortunately, changing the Spatial Text parameter didn't help in my case, and I haven't gotten around to figuring out the Python yet. Thanks for the suggestion, though. I'll give it another try later on.

Cheers

david_r
8314 replies
Best Answer
1 year ago
October 30, 2023

0xbox0 wrote:

Unfortunately, changing the Spatial Text parameter didn't help in my case, and I haven't gotten around to figuring out the Python yet. Thanks for the suggestion, though. I'll give it another try later on.

Cheers

As a starting point, here's sample code for the PythonCaller that will output a feature for each individual text block that is found within a PDF, as specified in the attribute "pdf_filename":

import fmeobjects
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal, LAParams


class ExtractPdfTextBlocks(object):
    def __init__(self):
        # LAParams documentation can be found here:
        # https://pdfminersix.readthedocs.io/en/latest/reference/composable.html
        self.params = LAParams(line_margin=0,          # keep each text line in its own box
                               detect_vertical=False,  # ignore vertical text
                               boxes_flow=None,        # no advanced layout analysis
                               all_texts=False)        # skip text inside figures

    def input(self, feature):
        # Path to the PDF, supplied by the incoming feature
        pdf_filename = feature.getAttribute('pdf_filename')

        for page_layout in extract_pages(pdf_filename, laparams=self.params):
            for element in page_layout:
                if isinstance(element, LTTextBoxHorizontal):
                    text_string = str(element.get_text()).strip()
                    # Lower-left corner of the block, in PDF page coordinates
                    text_x_pos = element.x0
                    text_y_pos = element.y0
                    # Full bounding box as (x0, y0, x1, y1)
                    text_bbox = element.bbox

                    # Create a new feature for each text block found
                    new_feature = feature.clone()
                    new_feature.setAttribute('page_number', page_layout.pageid)
                    new_feature.setAttribute('text_x_pos', text_x_pos)
                    new_feature.setAttribute('text_y_pos', text_y_pos)
                    # Stored as a string, e.g. "(x0, y0, x1, y1)"
                    new_feature.setAttribute('text_bbox', str(text_bbox))
                    new_feature.setAttribute('text_string', text_string)
                    self.pyoutput(new_feature)

    def close(self):
        pass

You'll want to expose the following attributes in the PythonCaller:

text_x_pos
text_y_pos
text_string
text_bbox
page_number
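
Since text_bbox is written with str(), it arrives downstream as text such as "(72.0, 700.5, 200.0, 712.5)". If you need the numbers back (for example in a second PythonCaller), a minimal sketch could look like the following. The helper names parse_bbox and bbox_size are made up for illustration and are not part of FME or pdfminer.six:

```python
import ast
from typing import Tuple


def parse_bbox(bbox_text: str) -> Tuple[float, float, float, float]:
    """Turn a stringified (x0, y0, x1, y1) tuple back into four floats."""
    # literal_eval only accepts Python literals, so it is safer than eval()
    x0, y0, x1, y1 = ast.literal_eval(bbox_text)
    return float(x0), float(y0), float(x1), float(y1)


def bbox_size(bbox_text: str) -> Tuple[float, float]:
    """Return (width, height) of the box, in the same units as the bbox."""
    x0, y0, x1, y1 = parse_bbox(bbox_text)
    return x1 - x0, y1 - y0
```

Alternatively, you could write the four bbox values to separate attributes in the first place and skip the parsing step.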

To install pdfminer.six, open a command prompt as local admin in the FME installation folder:

fme python -m pip install pdfminer.six

Documentation for pdfminer.six: https://pdfminersix.readthedocs.io/en/latest/index.html
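
If you'd rather stay with the single-character features the reader already gives you, the general idea is to group characters that share a baseline and then split each line wherever the horizontal gap gets too large. Here is a rough, plain-Python sketch of that idea, not something tested against your PDF. It assumes each character comes as a hypothetical (character, x, y) tuple, and the y_tolerance and max_gap defaults are arbitrary values you would need to tune to your font size and letter spacing:

```python
from typing import List, Tuple

# Hypothetical record for one single-character text feature: (character, x, y)
Char = Tuple[str, float, float]


def group_chars_into_words(chars: List[Char],
                           y_tolerance: float = 2.0,
                           max_gap: float = 8.0) -> List[str]:
    """Group single-character records into words by position.

    Characters whose y differs by at most y_tolerance from the first
    character of a line are treated as the same line. Within a line, a jump
    in x larger than max_gap starts a new word. Both thresholds are
    illustrative and use the same units as the coordinates.
    """
    # Step 1: cluster characters into lines, top of the page first
    # (PDF y coordinates grow upward).
    lines: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0][2] - ch[2]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])

    # Step 2: read each line left to right and split on large gaps.
    words: List[str] = []
    for line in lines:
        line.sort(key=lambda c: c[1])
        current = line[0][0]
        for prev, ch in zip(line, line[1:]):
            if ch[1] - prev[1] > max_gap:
                words.append(current)
                current = ch[0]
            else:
                current += ch[0]
        words.append(current)
    return words


if __name__ == "__main__":
    sample = [("o", 46.0, 100.0), ("H", 10.0, 100.0), ("e", 17.0, 80.0),
              ("i", 16.0, 100.5), ("m", 10.0, 80.0), ("t", 40.0, 100.0)]
    print(group_chars_into_words(sample))
```

Note that the gap is measured between character insertion points, since only positions (not glyph widths) are assumed to be available, so wide letters may need a larger max_gap.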
