Question

How to split a PDF based on text on pages

4 years ago
October 15, 2020
7 replies
970 views

rjwillshear
11 replies

So I have a pain in the backside task I'd like to try and cut down to size.

Basically we use a piece of software that automatically generates reports and logs for us and they can either be generated individually or as one large PDF.

For my current project I need to generate thousands of these reports, which to do individually will be a nightmare, so I want to generate the single large PDF and have FME scan through the PDF and split out the pages.

Each report has a location reference number and the sheet number (e.g. 1 of 7) so what I would like to do is have FME scan through the PDF, read the reference and sheet numbers, split out the pages and recombine into the individual location references.

+28

jdh
Contributor
1984 replies
4 years ago
October 15, 2020

It's been a few years, and the PDF writer may have improved in the meantime, but in a slightly similar case, what we did was read the PDF, extract the relevant information, and created a jsonObject that indicated which pages belonged to which report, and then split the original pdf outside of FME in c#. This was part of a web application, and you could presumably do the splitting in a PythonCaller instead.

jlbaker2779
194 replies
4 years ago
October 15, 2020

You're going to be better off splitting it outside FME. You will run into a lot of issues with fonts and alignments in FME.

Take a look here and it will give you an idea of how to split the file in Adobe Acrobat if you have a copy.

https://helpx.adobe.com/acrobat/how-to/split-pdf-file.html

sveinj
Contributor
5 replies
1 year ago
November 10, 2023

I solved a similar problem by using python:

import PyPDF2

def split_pdf(input_pdf_path, output_folder):

# Open the input PDF file

pdf_file = open(input_pdf_path, 'rb')

pdf_reader = PyPDF2.PdfReader(pdf_file)

# Ensure the output folder exists

import os

if not os.path.exists(output_folder):

os.makedirs(output_folder)

# Loop through each page and save it as a separate PDF

for page_num in range(len(pdf_reader.pages)):

pdf_writer = PyPDF2.PdfWriter()

pdf_writer.add_page(pdf_reader.pages[page_num])

output_pdf_path = os.path.join(output_folder, f'page_{page_num + 1}.pdf')

with open(output_pdf_path, 'wb') as output_pdf_file:

pdf_writer.write(output_pdf_file)

# Close the input PDF file

pdf_file.close()

if __name__ == '__main__':

input_pdf_path = 'input.pdf' # Replace with your input PDF file path

output_folder = 'output_pages' # Replace with the output folder path

split_pdf(input_pdf_path, output_folder)

jhawks
Contributor
76 replies
1 year ago
June 14, 2024

Similar issue but to the further complicate things the number of pages in the split pdf is random. Need to read text of the pdf since John doe has 2 pages, but mary smith has 4 pages etc. Ideally if we can read the user name we want to copy that out to the split files names. So john_doe.pdf mary_smith . pdf. etc.

+29

jkr_wrk
400 replies
1 year ago
June 14, 2024

Read the pdf with FME. Figure out if the name is on the page and use the python script provided to split the pages.

💡 Did you know... The FeatureWriter is more intuitive. 😏

landonperez
Participant
1 reply
1 year ago
August 6, 2024

Do you have an issue splitting a PDF based on text on pages? Then it would help if you understood what splitting PDF is. When dividing an enormous PDF file into multiple little ones, it is necessary to separate the PDF Files. This splitting method can be useful when your device does not have a lot of storage space. In this case, I prefer the tool I used to split my PDF files into many files, which can also be useful to you, namely the Pcinfotools PDF Splitter Tool. This software is better suited for splitting PDFs and offers a variety of functionality such as copying, printing, and modifying files. While formatting, it maintains the integrity of the PDF file and can be useful for people who want a trusted application but are concerned about the content of the PDF document because some apps can delete the content of the PDFs and do not attach the attachments that are still attached to the file; they can even misplace or delete them. So, for these issues, this program can be useful for those who want to keep their PDF files secure. This software offers a free demo version and is compatible with all Windows versions.

jhawks
Contributor
76 replies
11 months ago
August 21, 2024

Reply

Rich Text Editor, editor1

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

How to split a PDF based on text on pages