Do you process all files at once? That could certainly cause issues, depending on what you do in the workspace and what your memory/temp space situation looks like.
Every now and then I process the Dutch Top10NL GML set (90+ GML files, totalling around 40-ish GB unzipped), and just writing them to a PostGIS database takes around 2.5 hours (the only processing I do is remove some attributes and filter 2 of the feature types into separate tables for areas/lines).
I tried to process them all at once. Any recommendations for the most efficient way to do this? I don't want to have to manually select a few files at a time when there are 15,000 of them.
As long as you don't have any blocking transformers in your workspace, it should process them one by one. I don't know how OS MasterMap is set up internally: is it possible for an object to be split across multiple files (map sheets)? If so, and you combine them, it will be a serious hit on performance.
I'd try breaking this into pieces as redgeographics suggests.
- Create workspace 1. It reads a single GML file and writes to a file geodatabase.
- Create workspace 2.
  - Add a reader of type Directory and File Pathnames.
  - Set up that reader to read a list of the GML files.
  - Add a WorkspaceRunner transformer.
  - Set the WorkspaceRunner workspace parameter to run workspace 1.
  - It should reveal a Reader Dataset parameter. Set it to the value of path_windows.
Now run workspace 2. It will read the name of the first GML file and send it to workspace 1 to process. It will be processed and added to the geodatabase. Then workspace 2 will send the name of the second dataset, third, fourth, etc.
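(Not part of Mark's plan, just an aside: the same loop could also be driven from a script using FME's Python API instead of the Directory reader + WorkspaceRunner. Here's a minimal sketch, assuming the worker workspace is saved as workspace1.fmw and exposes its source dataset as a published parameter named SourceDataset_GML; both names are placeholders for whatever your workspace actually uses.)

```python
# Minimal sketch: run workspace 1 once per GML file from Python.
# Assumes fmeobjects is available (e.g. run with the FME-installed Python).
import glob
import fmeobjects

runner = fmeobjects.FMEWorkspaceRunner()

# Hypothetical source folder; point this at the directory of 15,000 GML files.
for gml in sorted(glob.glob(r"C:\data\mastermap\*.gml")):
    try:
        # Pass the current file in as the reader dataset, one run per file.
        runner.runWithParameters(r"C:\fme\workspace1.fmw",
                                 {"SourceDataset_GML": gml})
    except fmeobjects.FMEException as e:
        print(f"Failed on {gml}: {e}")
```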
Additionally, the WorkspaceRunner lets you run up to 7 processes at the same time. I would hesitate to suggest this when writing to the same output, but Geodatabase should be OK (i.e. if one process locks the geodatabase while writing to it, other processes should wait for the lock to be released and then write).
Anyway, try the above using a directory of, say, twenty GML files. Once you know that works you can assess how long the total 15,000 should take. It might slow down over time, but not much I think.
By the way, which FME version are you using, and which Geodatabase writer (ArcObjects or Open API)?
Hi Mark,
Thanks very much for the suggestion, I will give that a try! I am using FME 2019.0 and, I believe, the Open API writer.
When OS MasterMap is supplied as geochunks, any features crossing a chunk boundary are delivered in both grids, so it would require some post-processing.
I think @egomm's comment about chunk boundaries is crucial: if the files are indeed "hairy tiles", in that features intersecting a tile may also appear in other tiles, you'll need some duplicate removal strategy to avoid those features ending up in your final dataset multiple times.
Using Mark's plan, which I think is a winning one, you'd need to a) modify the "worker" workspace (workspace 1) to check that each feature has not already been written out. How? Well, I'd suggest you add a SQLite database as a second writer in that workspace. That database will have a single table holding ONLY the IDs (I'm assuming MasterMap features have these) of the features you've written. You'll have to start with that table empty, and then set the writer to APPEND. Be sure the ID column is indexed so scanning it stays fast. Then, in the main flow of the workspace, use a DatabaseJoiner to join each MasterMap feature to the SQLite db: if there is a match, DO NOT write that feature to your file geodatabase. If there is NOT a match, write that feature to both the file geodatabase AND to the SQLite table of IDs. This is a high-level description but I hope it gets you going. You don't have to use SQLite, but it seems like a good potential choice here.
Oh yeah, and b) ensure you're only ever running 1 worker workspace at a time in the WorkspaceRunner (do it synchronously) so that the database of seen IDs stays healthy and can be queried consistently; if you have multiple processes pounding into the file geodatabase, your database of what has been written will not be correct at all times.
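If it helps to see the "seen IDs" logic spelled out, here's a minimal sketch in plain Python/sqlite3 of what the SQLite writer + DatabaseJoiner combination would be doing inside the workspace. The table name, column name and file path are made up for illustration; OS MasterMap features carry a TOID, which would be the natural candidate for the ID.

```python
# Minimal sketch of the duplicate check, outside FME, using plain sqlite3.
# Table/column names and the path are assumptions for illustration only.
import sqlite3

conn = sqlite3.connect(r"C:\data\seen_ids.sqlite")
# PRIMARY KEY gives the column an index, so lookups stay fast as it grows.
conn.execute("CREATE TABLE IF NOT EXISTS seen_ids (toid TEXT PRIMARY KEY)")

def is_new(toid: str) -> bool:
    """Record the ID and return True only the first time it is seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_ids (toid) VALUES (?)", (toid,))
    conn.commit()
    return cur.rowcount == 1  # 0 means the ID was already in the table

# In the workspace, only features passing this test would be written to the
# file geodatabase; matches are dropped as duplicates.
```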