
I want to read PDF files as images so that I can extract information from them with OpenAI's Vision API. I found that FME's PDF reader has trouble with PDFs whose text was created by digital signatures: it cannot read them either as text or as images. I therefore decided to use a PythonCaller to create image files from the PDFs so that no data is lost. Below is my code:
 

import pdf2image
import fme
import fmeobjects
import os

class FeatureProcessor:
    def __init__(self):
        pass

    def input(self, feature):
        pdf_path = feature.getAttribute('path_windows')
        images = pdf2image.convert_from_path(pdf_path, dpi=300)

        for index, image in enumerate(images):
            image_feature = fmeobjects.FMEFeature()
            image_feature.setAttribute("page_number", index + 1)

            temp_dir = "D:/temp"
            if not os.path.exists(temp_dir):
                os.makedirs(temp_dir)

            image_name = f"{feature.getAttribute('path_rootname')}_{index + 1}.png"
            temp_image_path = os.path.join(temp_dir, image_name)
            image.save(temp_image_path, "PNG")

            feature.setAttribute("image_path", temp_image_path)
            feature.setAttribute("image_data", image.tobytes())
            feature.setAttribute("image_name", image_name)

            self.pyoutput(feature)


feature_processor = FeatureProcessor()

The above code runs without errors; however, the features at the output port of the PythonCaller carry only attribute values. No raster geometry is included.

How can I get both the raster geometry and the input attributes at the output of the PythonCaller?

By the way, I feel that writing PNG files to temporary storage and then reading them back consumes a lot of memory and is slow. Is there a way to optimize the process of exporting and reading images back to make the output of PythonCaller more efficient and faster?

It does look like you’re going to have to read the PNG files. I’m not a Python expert at all, but if I read your code correctly, it tells pdf2image to take a PDF and transform it to PNG, neither of which ever makes it into FME.



Hi ​@redgeographics 
It works on FME very well.


I’ve had similar issues reading very specific PDF files where FME created features for each individual letter and digit, making parsing very complicated. I resolved it using pdfminer.six, which I found really excellent, although I cannot guarantee that it’ll work for your use case.

The great thing about pdfminer.six is that you can retrieve the geometry of each text block and create FME point or bounding-box geometries accordingly.
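To illustrate the idea, here is a minimal sketch of how text blocks and their bounding boxes could be pulled out with pdfminer.six. The function names `iter_text_blocks` and `bbox_to_ring` are made up for this example, and the FME geometry step is left out on purpose. pdfminer.six reports boxes as `(x0, y0, x1, y1)` in PDF points, with the origin at the bottom-left corner of the page.

```python
from typing import Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def bbox_to_ring(bbox: BBox) -> List[Tuple[float, float]]:
    """Turn a bounding box into a closed ring of corner coordinates."""
    x0, y0, x1, y1 = bbox
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1), (x0, y0)]


def iter_text_blocks(pdf_path: str) -> Iterator[Tuple[int, str, BBox]]:
    """Yield (page_number, text, bbox) for each text block in the PDF."""
    # Imported lazily so bbox_to_ring works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry text; skip figures, lines, etc.
            if isinstance(element, LTTextContainer):
                yield page_number, element.get_text().strip(), element.bbox
```

Inside a PythonCaller, each ring could then be turned into an FME polygon and each text into an attribute before calling `self.pyoutput`. Check the fmeobjects documentation for the exact geometry constructors, since they are not shown here.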

Concerning your Python script: you cannot create an FME raster simply by setting a sequence of bytes as an attribute; it’s a bit more complicated than that. The FMERaster object model should be a good starting point: https://docs.safe.com/fme/html/fmepython/api/fmeobjects/geometry/_rasters/fmeobjects.FMERaster.html There are also some older examples on these forums posted by @takashi, as well as a more recent discussion, that may give you some inspiration.
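On the raster point, it may help to see why the raw bytes alone are not enough. For an RGB image, Pillow's `image.tobytes()` returns one flat, row-major sequence with the three channels interleaved per pixel and no width or height information. An FME raster is instead built band by band from rows of cell values. The sketch below covers only the plain-Python reshaping step, with a hypothetical helper named `split_bands`. The FME side (raster properties, bands and a tile populator, as described in the FMERaster documentation) is not shown and has to be checked against the API reference.

```python
from typing import List

Band = List[List[int]]  # rows of pixel values for one band


def split_bands(raw: bytes, width: int, height: int, channels: int) -> List[Band]:
    """Split pixel-interleaved, row-major bytes into one 2-D grid per band."""
    expected = width * height * channels
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")

    row_stride = width * channels  # number of bytes in one image row
    bands: List[Band] = []
    for c in range(channels):
        rows = []
        for r in range(height):
            start = r * row_stride + c
            # Every `channels`-th byte of this row belongs to band c.
            rows.append(list(raw[start:start + row_stride:channels]))
        bands.append(rows)
    return bands
```

For a page image in mode "RGB", this would be called as `split_bands(image.tobytes(), image.width, image.height, 3)`. At 300 dpi the nested lists get large, so treat this as an illustration of the data layout rather than an optimized implementation.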

 

