I am reading a large XML dataset - an 80GB .gz file - and sending each feature to a PythonCaller that simply dispatches the content into different .gz files according to the feature type (since a multi-writer would keep all the content in memory/temporary files). It seems to work, since I can see the size of the different files increasing.
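
A minimal sketch of that kind of dispatcher is shown below; the xml_fragment attribute name and the output folder are assumptions, so adjust them to whatever your XML reader actually exposes:

    import gzip

    class FeatureTypeDispatcher(object):
        """Write each feature's XML text into a per-feature-type .gz file."""

        def __init__(self):
            self._files = {}  # feature type -> open gzip handle

        def input(self, feature):
            ftype = feature.getAttribute('fme_feature_type')
            fragment = feature.getAttribute('xml_fragment')  # hypothetical attribute holding the XML text
            if fragment is None:
                return
            handle = self._files.get(ftype)
            if handle is None:
                # one output file per feature type, opened lazily in append mode
                handle = gzip.open('C:/temp/%s.xml.gz' % ftype, 'at', encoding='utf-8')
                self._files[ftype] = handle
            handle.write(fragment + '\n')
            self.pyoutput(feature)  # pass the feature through unchanged

        def close(self):
            for handle in self._files.values():
                handle.close()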

 

 

However, since it has been running for 8 days now (yes, 8 days), I wonder if I could improve the speed of the workbench - parallel reading, processing, ... whatever could help!

 

 

Daniel
Hi,

 

 

that is indeed huge. If you have an 80GB gzip file containing XML, I'd guess that the uncompressed XML would be around 800GB and maybe even more. Unfortunately, parsing XML is relatively costly for the computer, particularly if the structure of the XML is rather complex.

 

 

To speed it up, consider unzipping the .gz onto a temporary disk first, so the decompression doesn't consume memory while FME is running. Using an SSD for the XML and for FME's temporary files (see the FME_TEMP environment variable) can help too.
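
For example, a streaming decompress onto a scratch disk (paths are just placeholders) could look like this:

    import gzip
    import shutil

    # Stream-decompress the .gz onto a fast (SSD) scratch disk before running the workspace.
    src = 'D:/data/huge_dataset.xml.gz'
    dst = 'E:/scratch/huge_dataset.xml'

    with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out, length=16 * 1024 * 1024)  # 16 MB chunks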

 

 

If your .gz contains several files, consider parsing only one at a time, perhaps running several instances of FME in parallel, one per file.
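
A rough sketch of launching one FME process per file from Python is shown below; the workspace name, the published parameter name and the paths are placeholders, so check them against your own workspace and the fme command line available on your system:

    import glob
    import subprocess

    workspace = 'C:/workspaces/split_by_type.fmw'
    processes = []
    for xml_file in glob.glob('E:/scratch/*.xml'):
        # one FME instance per extracted XML file
        cmd = ['fme', workspace, '--SourceDataset_XML', xml_file]
        processes.append(subprocess.Popen(cmd))

    for p in processes:
        p.wait()  # wait for all instances to finish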

 

 

If you're doing heavy processing on the XML data, consider staging the process: first load the XML into a proper database, e.g. PostgreSQL, with a schema definition as close to 1:1 as possible. Then, after the loading, process the data with FME, letting the database do as much of the work as possible, e.g. WHERE clauses rather than Testers, indexed JOINs rather than FeatureMergers, etc. Depending on your case this could lead to substantial performance gains.
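
As an illustration of the "let the database do the work" idea, the snippet below filters and joins in SQL before FME touches the data; the table and column names are made up for the example:

    import psycopg2

    conn = psycopg2.connect('dbname=osm_staging user=fme')
    with conn, conn.cursor() as cur:
        # index once, then join on it
        cur.execute('CREATE INDEX IF NOT EXISTS idx_tags_node ON tags (node_id)')
        cur.execute("""
            SELECT n.id, n.lat, n.lon, t.tag_key, t.tag_value
            FROM nodes n
            JOIN tags t ON t.node_id = n.id   -- indexed join instead of a FeatureMerger
            WHERE n.version > 1               -- WHERE clause instead of a Tester
        """)
        rows = cur.fetchall()
    conn.close()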

 

 

David
Right, it is a huge XML file - 1TB when uncompressed - and this is why I am dispatching the feature types into corresponding XML output datasets.

 

 

Reading the file with an XmlReader and writing the features with a small PythonCaller keeps the memory usage stable and low. Reading and writing .gz files does not seem to add significant overhead compared to the space it saves!

 

 

I am splitting the file by feature type so I can eventually load them into PostgreSQL! I will then be able to process each feature type (the resulting files) with multiple workbenches.
Hi,

 

For huge XML file(s) like this I would try to use the xfMap configuration type, since it is a lot better at handling large files.

 

You can also try to read the XML as text and map it via the XMLFeatureMapper, again using the xfMaps option.

 

This approach does require more knowledge of your XML schema.

 

Hope this helps,

 

Itay
Hello Itay,

 

According to the current log, it seems to read about 6000 features/second. Would xfMap provide a significant speed increase?
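
As a rough back-of-the-envelope check on that rate (assuming it is sustained for the whole run):

    # ~6000 features/s over 8 days
    features_per_second = 6000
    elapsed_days = 8
    print(features_per_second * 86400 * elapsed_days)  # ~4.1 billion features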

 

 

Instead of xfMap, I used the Feature Paths configuration on a small sample file to set up the XmlReader and then switched to the big file (one single XML file).
Hi Neildan,

 

Feature Paths will work well and fast on a small file, but xfMaps will perform much better with your machine's resources when using the bigger file.

 

I am actually surprised that with Feature Paths the whole translation didn't halt.

 

 
I think the magic comes from having used a small sample file when I added the XmlReader - once it was added to the workbench I was able to read the large file, since, if my memory is right, FME stalled when I attempted to add the reader using the large file...
Neat trick! But I still think xfMaps is better in the long run. Just out of curiosity, what kind of data are you dealing with?
OpenStreetMap history dump ...
I tried xfMaps and did not find significant differences 😞

Hi, a trick that I applied for reading a large GML file is to have a PythonCreator split the (zipped) file into blocks. Depending on the structure of the file, for me it was possible to separate a header block, a trailer block, and intermediate feature blocks (about 100,000 lines per block). I wrapped the feature blocks in the header and trailer blocks and wrote each (compressed, otherwise it would generate lots of network traffic) block into a database CLOB. After this process finished, I could process each of the written blocks as a (relatively small) independent block. However, the success of this approach will depend on the structure of the XML.
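
A sketch of that block-splitting idea is shown below; the element names, block size, paths and the store_block() helper are placeholders, since what counts as header, feature and trailer depends entirely on the GML structure:

    import gzip

    BLOCK_LINES = 100000
    FEATURE_TAG = '<gml:featureMember'       # hypothetical per-feature start tag
    FEATURE_END = '</gml:featureMember>'     # hypothetical per-feature end tag
    TRAILER = '</gml:FeatureCollection>\n'   # hypothetical document closing tag

    def store_block(header, lines):
        # Placeholder: re-wrap the lines, compress them and insert them into a database CLOB.
        payload = gzip.compress(''.join(header + lines + [TRAILER]).encode('utf-8'))
        # ... INSERT payload into the staging table here ...

    header, block = [], []
    in_features = False

    with gzip.open('huge.gml.gz', 'rt', encoding='utf-8') as f:
        for line in f:
            if not in_features:
                if FEATURE_TAG in line:
                    in_features = True
                else:
                    header.append(line)   # everything before the first feature is header
                    continue
            if line.lstrip().startswith(TRAILER.strip()):
                break                     # reached the document trailer
            block.append(line)
            # only split on a line that closes a feature, so each block stays well-formed
            if len(block) >= BLOCK_LINES and FEATURE_END in line:
                store_block(header, block)
                block = []

    if block:
        store_block(header, block)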


Thanks for the answer, it's been a while! I solved the problem in a similar way but using another tool :-)


I am also struggling with this. I created a script that uses an ATOM feed to detect download links from PDOK ruimtelijkeplannen and then downloads and reads the files dubbelbestemming.gml, enkelbestemming.gml and bouwvlak.gml, whose URLs are fetched from the feed. By the time they have been downloaded with the HTTPCaller more than an hour has passed, but in the next step, reading the GML files with the FeatureReader while connecting a small area to the Initiator, it has to read the entire record structure before it is able to filter out the right records, right!? That also takes more than one hour. Any tips?

