
Hi! I have a workspace where I mainly use a PDF reader to read around 5,000 PDFs in folders and subfolders (defined by the dataset “…\\**\\*.pdf”).

The workspace is large and quite complex, but the idea is to take the PDF file name, which is an object ID, and combine it with other files that contain the IDs' coordinates, etc. The PDFs contain other info on the objects, which is extracted via predetermined local coordinates.

The workspace has worked excellently, but recently it no longer completes. The only change I can recall is that the number of PDFs keeps increasing as time goes by. I’d say the last time it worked it contained around 3,000-4,000 files.

I also find the error to be sporadic. Typically the workspace runs through a few hundred PDFs, but eventually I get the message that the PDF cannot be opened “because the file is not in PDF format, or because it is corrupted”. Sometimes the program just crashes. The same PDF can be read individually with no issues. And if I run only the part of the workspace containing the PDF reader and its subsequent transformers, it runs through everything without problems.

I have tried running the PDFs through a Directory and File Pathnames (PATH) reader, followed by a FeatureReader (PDF), but the problem persists. I have also tried a WorkspaceRunner, but honestly I don’t understand how to use it when the child workspace has multiple readers and writers. All the WorkspaceRunner examples I’ve found are quite simple.

I'm hoping that someone out there recognizes this issue and can give me some pointers on what to do.

Thanks,

Victor

Is it at all possible that there are PDFs in the folder that are being created or updated/overwritten at the same time FME is trying to access them? Another possibility could be an issue with the file server where the PDFs are stored - perhaps there is some network problem?

 

The WorkspaceRunner is a good option if you want to perform the same process on many files and the files aren't related to each other in the process. The WorkspaceRunner would process one PDF at a time: you just change the input path for each PDF and leave the rest the same. If you are summarising data across PDFs, then this will not work (unless you can group the input accordingly).

 

You can take a PATH reader, use the path_windows attribute of each feature, and pass it into a WorkspaceRunner to trigger a workspace run with just that PDF. It also means you can run the jobs in parallel, which can speed up the overall process.
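If you ever want to script the same pattern instead of wiring up the WorkspaceRunner transformer, here is a minimal sketch using the fmeobjects Python API. The workspace path, "child.fmw", and the published parameter name "SourceDataset_PDF" are placeholders, not your actual names - adjust them to your child workspace.

```python
# Minimal sketch (fmeobjects): run a child workspace once per PDF.
# Assumes "child.fmw" publishes a parameter "SourceDataset_PDF" for the
# PDF reader's source dataset -- both names are placeholders.
import glob
import fmeobjects

runner = fmeobjects.FMEWorkspaceRunner()

for pdf_path in glob.glob(r"C:\data\**\*.pdf", recursive=True):
    try:
        # Each call is an independent translation that processes one PDF.
        runner.runWithParameters(r"C:\workspaces\child.fmw",
                                 {"SourceDataset_PDF": pdf_path})
    except fmeobjects.FMEException as exc:
        # Log and keep going so one locked or corrupt file
        # doesn't stop the whole batch.
        print(f"Failed on {pdf_path}: {exc}")
```

The try/except is the useful part for your symptom: a single unreadable PDF only fails its own run instead of killing the entire translation.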

 

If your workspace reads a lot of other data in the process, then this option might not make much sense, because you'd be reading the same data over and over again, which could hamper performance.

 

Another option would be to create 'groups' of input PDF files so that each trigger of the workspace runs, say, 20 or 30 PDFs (perhaps grouped by subfolder). This would cut down on the number of times the other data needs to be read while still allowing you to process in parallel.
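As a rough illustration of the grouping idea (plain Python, not FME-specific; the batch size of 25 and the source folder are just example assumptions), you could split the paths into batches before handing each batch to the child workspace:

```python
# Sketch of the batching idea: split the PDF paths into groups of ~25 so each
# child run processes a batch instead of a single file. Batch size and source
# folder are assumptions -- adjust them to your dataset.
import glob
from itertools import islice

def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

pdfs = sorted(glob.glob(r"C:\data\**\*.pdf", recursive=True))

for group in batches(pdfs, 25):
    # Hand each batch to the child workspace's source dataset parameter,
    # e.g. as a list or a space-separated string, depending on how the
    # parameter is set up.
    print(f"Batch of {len(group)} PDFs starting with {group[0]}")
```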

 

Be warned, though: the WorkspaceRunner, if running in parallel, can make it pretty tricky to figure out if or where anything went wrong. The parent process (the one with the WorkspaceRunner) will not fail or give any indication if a child job has failed, and the child's log file will, by default, be overwritten by the next child job.


Hi @virtualcitymatt!

Thank you for the tips. I'll sit down and play around more with this when I get the opportunity.

As a quick fix, I divided the workspace into parts, which made it work for now.

