Solved

Combine single character text features to create words (longer strings)

1 year ago
October 23, 2023
3 replies
42 views

0xbox0
Contributor
6 replies

Hi,

I'm trying to read PDFs, and it looks like each character of each word comes in as an individual feature. Each PDF has thousands of these single-character text features that I need to combine into words, and I'm stumped as to how to string them together.

I have attached an example below.

Any help would be appreciated. Thanks in advance.

david_r
8355 replies
1 year ago
October 23, 2023

In the reader parameters, make sure the spatial text is read in as one feature per (text) block:

In theory, that should fix it, but my experience has been mixed, particularly with narrow fonts or certain page sizes. For some documents, the only reliable results I got were from this Python module, which is exceptionally good at locating text blocks: https://pdfminersix.readthedocs.io/en/latest/ For maximum integration into your workspace, you'll have to implement it in a PythonCaller, but it's apparently also possible to use it on the command line.
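
To make "locating text blocks" a bit more concrete, here is a minimal pure-Python sketch of position-based grouping: characters are bucketed into lines by their y value, then joined into words wherever the horizontal gap is small. This is not pdfminer.six's algorithm, just an illustration of the idea. The `Glyph` structure, the `merge_glyphs` helper and both tolerance values are hypothetical and would need tuning per document.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One single-character text feature (hypothetical structure)."""
    char: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # vertical position (PDF-style: larger means higher on the page)


def merge_glyphs(glyphs, line_tol=2.0, gap_tol=1.5):
    """Group glyphs into lines by y, then into words by horizontal gap.

    line_tol and gap_tol are assumed tolerances in page units.
    """
    # Step 1: bucket glyphs into lines, topmost line first
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    # Step 2: within each line, read left to right and split on large gaps
    words = []
    for line in lines:
        line.sort(key=lambda g: g.x0)
        current = [line[0]]
        for g in line[1:]:
            if g.x0 - current[-1].x1 <= gap_tol:
                current.append(g)
            else:
                words.append("".join(c.char for c in current))
                current = [g]
        words.append("".join(c.char for c in current))
    return words


# Deliberately shuffled input: "Hi you" on one line, "ok" on the line below
sample = [
    Glyph("i", 5.5, 7.0, 100.0), Glyph("k", 5.2, 10.0, 80.0),
    Glyph("H", 0.0, 5.0, 100.0), Glyph("u", 22.3, 27.0, 100.0),
    Glyph("o", 0.0, 5.0, 80.3), Glyph("y", 12.0, 17.0, 100.0),
    Glyph("o", 17.2, 22.0, 100.0),
]
print(merge_glyphs(sample))  # ['Hi', 'you', 'ok']
```

A fixed gap tolerance like this tends to struggle with narrow fonts and mixed font sizes, which is one reason a dedicated layout library can be the more reliable route.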

0xbox0
Author
Contributor
6 replies
1 year ago
October 30, 2023

Unfortunately, changing the Spatial Text parameter didn't help in my case and I haven't gotten around to trying to figure out the Python. Thanks tho for the suggestion. I'll give it a try again later on.

Cheers

david_r
8355 replies
Best Answer
1 year ago
October 30, 2023

0xbox0 wrote:

Unfortunately, changing the Spatial Text parameter didn't help in my case and I haven't gotten around to trying to figure out the Python. Thanks tho for the suggestion. I'll give it a try again later on.

Cheers

As a starting point, here's sample PythonCaller code that outputs one feature per text block found in a PDF. The path to the PDF is read from the attribute "pdf_filename":

import fmeobjects
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal, LAParams


class ExtractPdfTextBlocks(object):
    def __init__(self):
        # LAParams documentation can be found here:
        # https://pdfminersix.readthedocs.io/en/latest/reference/composable.html
        self.params = LAParams(line_margin=0,
                               detect_vertical=False,
                               boxes_flow=None,
                               all_texts=False)

    def input(self, feature):
        # Path of the PDF to read, taken from the incoming feature
        pdf_filename = feature.getAttribute('pdf_filename')

        for page_layout in extract_pages(pdf_filename, laparams=self.params):
            for element in page_layout:
                # Keep only horizontal text boxes; other layout elements are skipped
                if isinstance(element, LTTextBoxHorizontal):
                    text_string = element.get_text().strip()
                    # x0/y0 is the lower-left corner of the box;
                    # bbox is the full (x0, y0, x1, y1) tuple
                    text_x_pos = element.x0
                    text_y_pos = element.y0
                    text_bbox = element.bbox

                    # Create a new feature for each text block found
                    new_feature = feature.clone()
                    new_feature.setAttribute('page_number', page_layout.pageid)
                    new_feature.setAttribute('text_x_pos', text_x_pos)
                    new_feature.setAttribute('text_y_pos', text_y_pos)
                    new_feature.setAttribute('text_bbox', str(text_bbox))
                    new_feature.setAttribute('text_string', text_string)
                    self.pyoutput(new_feature)

    def close(self):
        pass

You'll want to expose the following attributes in the PythonCaller:

text_x_pos
text_y_pos
text_string
text_bbox
page_number
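
Since text_bbox is written as a string and the coordinates follow the PDF convention (origin at the bottom-left of the page, so a larger y is higher up), it may help to see how the exposed attributes could be used downstream. The following is only a sketch: plain dicts stand in for FME features, and `parse_bbox` and `reading_order` are hypothetical helper names, not part of FME or pdfminer.six.

```python
import ast


def parse_bbox(text_bbox):
    """Turn the stored string, e.g. '(72.0, 700.5, 210.3, 712.5)', back into floats."""
    x0, y0, x1, y1 = ast.literal_eval(text_bbox)
    return float(x0), float(y0), float(x1), float(y1)


def reading_order(blocks):
    """Sort blocks by page, then top to bottom, then left to right.

    y is negated because PDF y coordinates grow upward from the bottom of the page.
    """
    return sorted(
        blocks,
        key=lambda b: (b["page_number"], -b["text_y_pos"], b["text_x_pos"]),
    )


# Dicts standing in for the features emitted by the PythonCaller (made-up values)
blocks = [
    {"page_number": 1, "text_x_pos": 72.0, "text_y_pos": 650.0, "text_string": "Second line"},
    {"page_number": 2, "text_x_pos": 72.0, "text_y_pos": 700.0, "text_string": "Next page"},
    {"page_number": 1, "text_x_pos": 72.0, "text_y_pos": 700.0, "text_string": "First line"},
]

ordered = [b["text_string"] for b in reading_order(blocks)]
print(ordered)  # ['First line', 'Second line', 'Next page']
print(parse_bbox("(72.0, 700.5, 210.3, 712.5)"))  # (72.0, 700.5, 210.3, 712.5)
```

Inside a workspace, a Sorter on page_number, text_y_pos (descending) and text_x_pos should give the same ordering without any extra Python.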

To install pdfminer.six, open a command prompt as local admin in the FME installation folder:

fme python -m pip install pdfminer.six
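
To check that the install is actually visible to the interpreter you are running, a small standard-library test like the one below can help. It is a sketch only: `module_available` is a hypothetical helper, and it needs to be run with FME's Python (for example from a PythonCaller) for the result to mean anything for your workspace.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# pdfminer.six is imported under the package name "pdfminer"
if module_available("pdfminer"):
    print("pdfminer.six is importable")
else:
    print("pdfminer.six was not found on this interpreter's path")
```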

Documentation for pdfminer.six: https://pdfminersix.readthedocs.io/en/latest/index.html
