Question

How to split a PDF based on text on pages

  • 15 October 2020
  • 5 replies
  • 141 views

So I have a pain in the backside task I'd like to try and cut down to size.

Basically, we use a piece of software that automatically generates reports and logs for us, and they can be generated either individually or as one large PDF.

For my current project I need to generate thousands of these reports. Doing that individually would be a nightmare, so I want to generate the single large PDF and have FME scan through it and split out the pages.

Each report has a location reference number and a sheet number (e.g. 1 of 7). What I would like is for FME to scan through the PDF, read the reference and sheet numbers, split out the pages and recombine them into one PDF per location reference.


5 replies

Userlevel 2
Badge +23

It's been a few years, and the PDF writer may have improved in the meantime, but in a slightly similar case, what we did was read the PDF, extract the relevant information, and create a JSON object that indicated which pages belonged to which report. We then split the original PDF outside of FME in C#. This was part of a web application, and you could presumably do the splitting in a PythonCaller instead.
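To make the page-mapping idea concrete, here is a rough Python sketch of building that kind of JSON object from per-page text. It is not the original C# code. The patterns "Location Ref: ..." and "n of m" are assumptions about how the reference and sheet number appear on the page, and build_page_map is a hypothetical helper name, so adjust both to your reports.

```python
import json
import re
from collections import defaultdict

# Hypothetical patterns: adjust to how the reference and sheet number
# actually appear in your reports.
REF_PATTERN = re.compile(r"Location Ref:\s*(\S+)")
SHEET_PATTERN = re.compile(r"(\d+)\s+of\s+(\d+)")


def build_page_map(page_texts):
    """Map each location reference to its 0-based page indices, in sheet order."""
    found = defaultdict(list)
    for page_index, text in enumerate(page_texts):
        ref = REF_PATTERN.search(text)
        sheet = SHEET_PATTERN.search(text)
        if not ref or not sheet:
            continue  # no recognisable reference or sheet number; skip the page
        found[ref.group(1)].append((int(sheet.group(1)), page_index))
    # Sort each report's pages by sheet number, then keep only the page index
    return {ref: [idx for _, idx in sorted(pages)] for ref, pages in found.items()}


if __name__ == "__main__":
    sample = [
        "Location Ref: A100  Sheet 1 of 2",
        "Location Ref: B200  Sheet 1 of 1",
        "Location Ref: A100  Sheet 2 of 2",
    ]
    print(json.dumps(build_page_map(sample), indent=2))
```

The resulting mapping can be handed to whatever does the actual splitting, inside or outside FME. Sorting by sheet number means the pages of one report do not have to be contiguous or in order in the big PDF.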

Badge +2

You're going to be better off splitting it outside FME. You will run into a lot of issues with fonts and alignment in FME.

Take a look here for an idea of how to split the file in Adobe Acrobat, if you have a copy:

https://helpx.adobe.com/acrobat/how-to/split-pdf-file.html

Badge

I solved a similar problem by using Python:

import os

import PyPDF2


def split_pdf(input_pdf_path, output_folder):
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Open the input PDF file; the with-block closes it when we are done
    with open(input_pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Loop through each page and save it as a separate PDF
        for page_num in range(len(pdf_reader.pages)):
            pdf_writer = PyPDF2.PdfWriter()
            pdf_writer.add_page(pdf_reader.pages[page_num])

            output_pdf_path = os.path.join(output_folder, f'page_{page_num + 1}.pdf')
            with open(output_pdf_path, 'wb') as output_pdf_file:
                pdf_writer.write(output_pdf_file)


if __name__ == '__main__':
    input_pdf_path = 'input.pdf'  # Replace with your input PDF file path
    output_folder = 'output_pages'  # Replace with the output folder path

    split_pdf(input_pdf_path, output_folder)

Badge +6

Similar issue, but to further complicate things, the number of pages in each split PDF varies. We need to read the text of the PDF, since John Doe has 2 pages but Mary Smith has 4 pages, etc. Ideally, if we can read the user name, we want to use it for the split file names: john_doe.pdf, mary_smith.pdf, etc.

Read the PDF with FME. Figure out if the name is on the page and use the Python script provided to split the pages.
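For the variable page count case, a minimal sketch of the grouping step might look like the following. It reads the page text with PyPDF2's extract_text rather than with FME, and it assumes each person's first page has a line like "Name: John Doe", which is a made-up layout. NAME_PATTERN, slugify, group_pages_by_name and split_by_name are all hypothetical names. Pages without a name are treated as continuation pages of the previous person.

```python
import os
import re

# Hypothetical pattern: assumes each person's first page has a line "Name: John Doe".
NAME_PATTERN = re.compile(r"Name:\s*([A-Za-z][A-Za-z .'-]*)")


def slugify(name):
    """Turn 'John Doe' into 'john_doe' for use as a file name."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def group_pages_by_name(page_texts):
    """Return {slug: [page indices]}; unnamed pages join the previous person's group."""
    groups = {}
    current = None
    for page_index, text in enumerate(page_texts):
        match = NAME_PATTERN.search(text)
        if match:
            current = slugify(match.group(1))
        if current is None:
            continue  # pages before the first name are skipped
        groups.setdefault(current, []).append(page_index)
    return groups


def split_by_name(input_pdf_path, output_folder):
    """Write one PDF per person, e.g. john_doe.pdf and mary_smith.pdf."""
    import PyPDF2  # imported here so the grouping helpers work without it

    os.makedirs(output_folder, exist_ok=True)
    reader = PyPDF2.PdfReader(input_pdf_path)
    # extract_text can return an empty result for scanned pages, hence the fallback
    page_texts = [page.extract_text() or "" for page in reader.pages]
    for slug, page_indices in group_pages_by_name(page_texts).items():
        writer = PyPDF2.PdfWriter()
        for page_index in page_indices:
            writer.add_page(reader.pages[page_index])
        with open(os.path.join(output_folder, f"{slug}.pdf"), "wb") as out_file:
            writer.write(out_file)
```

Calling split_by_name('input.pdf', 'output_pages') would then write one file per detected name. If the same name turns up again later in the document, its pages are appended to the same file, which may or may not be what you want.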
