Skip to main content

So I have a pain in the backside task I'd like to try and cut down to size.

Basically we use a piece of software that automatically generates reports and logs for us and they can either be generated individually or as one large PDF.

 

For my current project I need to generate thousands of these reports, which to do individually will be a nightmare, so I want to generate the single large PDF and have FME scan through the PDF and split out the pages.

 

Each report has a location reference number and the sheet number (e.g. 1 of 7) so what I would like to do is have FME scan through the PDF, read the reference and sheet numbers, split out the pages and recombine into the individual location references.

It's been a few years, and the PDF writer may have improved in the meantime, but in a slightly similar case, what we did was read the PDF, extract the relevant information, and created a jsonObject that indicated which pages belonged to which report, and then split the original pdf outside of FME in c#. This was part of a web application, and you could presumably do the splitting in a PythonCaller instead.


You're going to be better off splitting it outside FME. You will run into a lot of issues with fonts and alignments in FME.

 

Take a look here and it will give you an idea of how to split the file in Adobe Acrobat if you have a copy.

 

https://helpx.adobe.com/acrobat/how-to/split-pdf-file.html


I solved a similar problem by using python:

import PyPDF2

 

def split_pdf(input_pdf_path, output_folder):

  # Open the input PDF file

  pdf_file = open(input_pdf_path, 'rb')

  pdf_reader = PyPDF2.PdfReader(pdf_file)

 

  # Ensure the output folder exists

  import os

  if not os.path.exists(output_folder):

    os.makedirs(output_folder)

 

  # Loop through each page and save it as a separate PDF

  for page_num in range(len(pdf_reader.pages)):

    pdf_writer = PyPDF2.PdfWriter()

    pdf_writer.add_page(pdf_reader.pagesrpage_num])

output_pdf_path = os.path.join(output_folder, f'page_{page_num + 1}.pdf')

with open(output_pdf_path, 'wb') as output_pdf_file:

pdf_writer.write(output_pdf_file)

 

# Close the input PDF file

pdf_file.close()

 

if __name__ == '__main__':

input_pdf_path = 'input.pdf' # Replace with your input PDF file path

output_folder = 'output_pages' # Replace with the output folder path

 

split_pdf(input_pdf_path, output_folder)


Similar issue but to the further complicate things the number of pages in the split pdf is random.  Need to read text of the pdf since John doe has 2 pages, but mary smith has 4 pages etc.  Ideally if we can read the user name we want to copy that out to the split files names. So john_doe.pdf   mary_smith . pdf.  etc. 


Read the pdf with FME. Figure out if the name is on the page and use the python script provided to split the pages.


Do you have an issue splitting a PDF based on text on pages? Then it would help if you understood what splitting PDF is. When dividing an enormous PDF file into multiple little ones, it is necessary to separate the PDF Files. This splitting method can be useful when your device does not have a lot of storage space. In this case, I prefer the tool I used to split my PDF files into many files, which can also be useful to you, namely the Pcinfotools PDF Splitter Tool. This software is better suited for splitting PDFs and offers a variety of functionality such as copying, printing, and modifying files. While formatting, it maintains the integrity of the PDF file and can be useful for people who want a trusted application but are concerned about the content of the PDF document because some apps can delete the content of the PDFs and do not attach the attachments that are still attached to the file; they can even misplace or delete them. So, for these issues, this program can be useful for those who want to keep their PDF files secure. This software offers a free demo version and is compatible with all Windows versions.


 

 

 


Reply