How to split a PDF based on text on pages

  • 15 October 2020
  • 5 replies

So I have a pain in the backside task I'd like to try and cut down to size.

Basically, we use a piece of software that automatically generates reports and logs for us, and they can be generated either individually or as one large PDF.


For my current project I need to generate thousands of these reports, which would be a nightmare to do individually, so I want to generate the single large PDF and have FME scan through it and split out the pages.


Each report has a location reference number and a sheet number (e.g. 1 of 7), so what I would like to do is have FME scan through the PDF, read the reference and sheet numbers, split out the pages and recombine them into one PDF per location reference.
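
The grouping step described above can be sketched in Python once the text of each page has been extracted. This is a minimal sketch, not a tested solution: the label formats ("Location Ref: ..." and "Sheet N of M") and the names REF_PATTERN, SHEET_PATTERN and group_pages_by_reference are assumptions for illustration, so the patterns would need adjusting to match the real reports.

```python
import re
from collections import defaultdict

# Assumed label formats; adjust to match the text in the real reports.
REF_PATTERN = re.compile(r"Location Ref:\s*(\S+)")
SHEET_PATTERN = re.compile(r"Sheet\s+(\d+)\s+of\s+(\d+)")


def group_pages_by_reference(page_texts):
    """Map each location reference to its page indices, ordered by sheet number.

    page_texts is a list with one extracted-text string per PDF page.
    Pages where either label is missing are skipped.
    """
    sheets_by_ref = defaultdict(list)
    for page_index, text in enumerate(page_texts):
        ref_match = REF_PATTERN.search(text)
        sheet_match = SHEET_PATTERN.search(text)
        if not ref_match or not sheet_match:
            continue
        sheet_number = int(sheet_match.group(1))
        sheets_by_ref[ref_match.group(1)].append((sheet_number, page_index))

    # Sort each report's pages by sheet number and keep only the page indices.
    return {
        ref: [page_index for _, page_index in sorted(sheets)]
        for ref, sheets in sheets_by_ref.items()
    }
```

The result maps each reference to zero-based page indices in sheet order, which is the information a splitting step needs.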


Userlevel 2
Badge +23

It's been a few years, and the PDF writer may have improved in the meantime, but in a slightly similar case, what we did was read the PDF, extract the relevant information, and create a JSON object that indicated which pages belonged to which report. We then split the original PDF outside of FME in C#. This was part of a web application; you could presumably do the splitting in a PythonCaller instead.
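
As a rough illustration of that hand-off, the page-to-report mapping could be serialised as JSON for an external splitter to consume. The structure below (a "reports" list with "reference" and "pages" keys) and the function name build_manifest are assumptions; the layout of the original object is not described.

```python
import json


def build_manifest(page_references):
    """Build a JSON string listing which pages belong to which report.

    page_references is a list of (page_number, reference) pairs, as might be
    produced after extracting the reference from each page.
    """
    pages_by_reference = {}
    for page_number, reference in page_references:
        pages_by_reference.setdefault(reference, []).append(page_number)

    manifest = {
        "reports": [
            {"reference": reference, "pages": sorted(pages)}
            for reference, pages in pages_by_reference.items()
        ]
    }
    return json.dumps(manifest, indent=2)
```

Keeping the mapping in a plain JSON file means the splitting step can be written in any language, whether C#, Python or something else.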

Badge +2

You're going to be better off splitting it outside FME. You will run into a lot of issues with fonts and alignments in FME.


Take a look here and it will give you an idea of how to split the file in Adobe Acrobat if you have a copy.


I solved a similar problem by using Python:

import os

import PyPDF2


def split_pdf(input_pdf_path, output_folder):
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Open the input PDF file; the with block closes it when done
    with open(input_pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Loop through each page and save it as a separate PDF
        for page_num in range(len(pdf_reader.pages)):
            pdf_writer = PyPDF2.PdfWriter()
            pdf_writer.add_page(pdf_reader.pages[page_num])

            output_pdf_path = os.path.join(output_folder, f'page_{page_num + 1}.pdf')
            with open(output_pdf_path, 'wb') as output_pdf_file:
                pdf_writer.write(output_pdf_file)


if __name__ == '__main__':
    input_pdf_path = 'input.pdf'  # Replace with your input PDF file path
    output_folder = 'output_pages'  # Replace with the output folder path

    split_pdf(input_pdf_path, output_folder)

Badge +6

Similar issue, but to further complicate things, the number of pages in each split PDF varies. We need to read the text of the PDF, since John Doe has 2 pages but Mary Smith has 4 pages, etc. Ideally, if we can read the user name, we want to use it in the split file names: john_doe.pdf, mary_smith.pdf, etc.
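
One way to handle a varying page count is to start a new group whenever a name appears on a page and attach the following pages to it until the next name is found. The sketch below assumes the name sits behind a label such as "Name:" followed by a line break; that label, the pattern and the function names are hypothetical and would need adapting to the real page text.

```python
import re

# Assumed label; adjust to however the name appears in the real PDF text.
NAME_PATTERN = re.compile(r"Name:\s*([A-Za-z]+(?: [A-Za-z]+)*)")


def name_to_filename(name):
    """Turn 'John Doe' into 'john_doe.pdf'."""
    return "_".join(name.lower().split()) + ".pdf"


def group_pages_by_name(page_texts):
    """Map output file names to the page indices that belong to each person.

    A page without a name is attached to the most recently seen name.
    Pages before the first name are skipped.
    """
    groups = {}
    current_file = None
    for page_index, text in enumerate(page_texts):
        match = NAME_PATTERN.search(text)
        if match:
            current_file = name_to_filename(match.group(1))
        if current_file is not None:
            groups.setdefault(current_file, []).append(page_index)
    return groups
```

The same function also works if the name is printed on every page, because each matching page is simply added to the existing group for that name.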

Read the PDF with FME. Work out whether the name is on each page, then use the Python script provided above to split the pages.
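
Once FME (or any other step) has worked out which pages belong to which output file, the per-page script above can be adapted to write several pages into each file. This is a sketch under assumptions: the groups mapping of file name to zero-based page indices and the function name write_groups are hypothetical. It relies on PyPDF2's PdfReader and PdfWriter, as in the earlier reply.

```python
import os


def write_groups(input_pdf_path, groups, output_folder):
    """Write one PDF per entry in groups and return the paths written.

    groups maps an output file name to a list of zero-based page indices.
    """
    if not groups:
        return []  # Nothing to write, so the input is never opened.

    # Imported here so PyPDF2 is only required when there is work to do.
    import PyPDF2

    os.makedirs(output_folder, exist_ok=True)
    written = []
    with open(input_pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for file_name, page_indices in groups.items():
            pdf_writer = PyPDF2.PdfWriter()
            for page_index in page_indices:
                pdf_writer.add_page(pdf_reader.pages[page_index])
            output_path = os.path.join(output_folder, file_name)
            with open(output_path, "wb") as output_file:
                pdf_writer.write(output_file)
            written.append(output_path)
    return written
```

A call might look like write_groups("input.pdf", {"john_doe.pdf": [0, 1], "mary_smith.pdf": [2, 3, 4, 5]}, "output_pages"), where the file names and page indices come from whatever step read the names.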