I am reading a large XML dataset - an 80GB .gz file - and sending each feature to a PythonCaller that simply dispatches the content into different .gz files according to the feature type (since a multi-writer would keep all the content in memory/temporary files). It seems to work, since I can see the size of the different files increasing.
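
A minimal sketch of that kind of dispatcher is shown below; the xml_fragment attribute name and the output folder are assumptions, so adjust them to whatever your XML reader actually exposes:

    import gzip

    class FeatureTypeDispatcher(object):
        """Write each feature's XML text into a per-feature-type .gz file."""

        def __init__(self):
            self._files = {}  # feature type -> open gzip handle

        def input(self, feature):
            ftype = feature.getAttribute('fme_feature_type')
            fragment = feature.getAttribute('xml_fragment')  # hypothetical attribute holding the XML text
            if fragment is None:
                return
            handle = self._files.get(ftype)
            if handle is None:
                # one output file per feature type, opened lazily in append mode
                handle = gzip.open('C:/temp/%s.xml.gz' % ftype, 'at', encoding='utf-8')
                self._files[ftype] = handle
            handle.write(fragment + '\n')
            self.pyoutput(feature)  # pass the feature through unchanged

        def close(self):
            for handle in self._files.values():
                handle.close()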

 

 

However, since it has been running for 8 days now (yes, 8 days), I wonder if I could improve the speed of the workbench - parallel reading, processing, ... whatever could help!

 

 

Daniel
Hi,

 

 

that is indeed huge. If you have an 80GB gzip file containing XML, I'd guess that the uncompressed XML would be around 800GB and maybe even more. Unfortunately, parsing XML is relatively costly for the computer, particularly if the structure of the XML is rather complex.

 

 

To speed it up, consider unzipping the .gz onto a temporary disk first, so the decompression doesn't consume memory while FME is running. Using an SSD for the XML and for FME's temporary files (see the FME_TEMP environment variable) can help too.
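
For example, a streaming decompress onto a scratch disk (paths are just placeholders) could look like this:

    import gzip
    import shutil

    # Stream-decompress the .gz onto a fast (SSD) scratch disk before running the workspace.
    src = 'D:/data/huge_dataset.xml.gz'
    dst = 'E:/scratch/huge_dataset.xml'

    with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out, length=16 * 1024 * 1024)  # 16 MB chunks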

 

 

If your .gz contains several files, consider parsing only one at a time, perhaps running several instances of FME in parallel, one per file.
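
A rough sketch of launching one FME process per file from Python is shown below; the workspace name, the published parameter name and the paths are placeholders, so check them against your own workspace and the fme command line available on your system:

    import glob
    import subprocess

    workspace = 'C:/workspaces/split_by_type.fmw'
    processes = []
    for xml_file in glob.glob('E:/scratch/*.xml'):
        # one FME instance per extracted XML file
        cmd = ['fme', workspace, '--SourceDataset_XML', xml_file]
        processes.append(subprocess.Popen(cmd))

    for p in processes:
        p.wait()  # wait for all instances to finish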

 

 

If you're doing heavy processing on the XML data, consider staging the process: first load the XML into a proper database, e.g. PostgreSQL, with a schema definition as close to 1:1 as possible. Then, after the loading, process the data with FME, letting the database do as much of the work as possible, e.g. WHERE clauses rather than Testers, indexed JOINs rather than FeatureMergers, etc. Depending on your case this could lead to substantial performance gains.
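
As an illustration of the "let the database do the work" idea, the snippet below filters and joins in SQL before FME touches the data; the table and column names are made up for the example:

    import psycopg2

    conn = psycopg2.connect('dbname=osm_staging user=fme')
    with conn, conn.cursor() as cur:
        # index once, then join on it
        cur.execute('CREATE INDEX IF NOT EXISTS idx_tags_node ON tags (node_id)')
        cur.execute("""
            SELECT n.id, n.lat, n.lon, t.tag_key, t.tag_value
            FROM nodes n
            JOIN tags t ON t.node_id = n.id   -- indexed join instead of a FeatureMerger
            WHERE n.version > 1               -- WHERE clause instead of a Tester
        """)
        rows = cur.fetchall()
    conn.close()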

 

 

David
Right, it is a huge XML file - 1TB when uncompressed - and this is why I am dispatching the feature types into corresponding XML output datasets.

 

 

Reading the file with an XmlReader and writing the features with a small PythonCaller keeps the memory usage stable and low. Reading and writing .gz files does not seem to add significant overhead compared to the space it saves!

 

 

I am splitting the file by feature type so I can eventually load them into PostgreSQL! I will then be able to process each feature type (the resulting files) with multiple workbenches.
Hi,

 

For huge XML file(s) like this I would try to use the xfMap configuration type, since it is a lot better at handling large files.

 

You can also try to read the XML as text and map it via the XMLFeatureMapper, again using the xfMaps option.

 

This approach does require more knowledge of your XML schema.

 

Hope this helps,

 

Itay
Hello Itay,

 

According to the current log, it seems to read about 6000 features/second. Would xfMap provide a significant speed increase?
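
As a rough back-of-the-envelope check on that rate (assuming it is sustained for the whole run):

    # ~6000 features/s over 8 days
    features_per_second = 6000
    elapsed_days = 8
    print(features_per_second * 86400 * elapsed_days)  # ~4.1 billion features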

 

 

Instead of xfMap, I used the Feature Paths configuration on a small sample file to set up the XmlReader and then switched to the big file (one single XML file).
Hi Neildan,

 

Feature Paths will work well and fast on a small file, but xfMaps will perform much better with your machine's resources when using the bigger file.

 

I am actually surprised that with Feature Paths the whole translation didn't halt.

 

 
I think the magic comes from having used a small sample file when I added the XmlReader - once it was added to the workbench I was able to read the large file, since, if my memory is right, FME stalled when I attempted to add the reader using the large file...
Neat trick! But I still think xfMaps is better in the long run. Just out of curiosity, what kind of data are you dealing with?
OpenStreetMap history dump ...
I tried xfMaps and did not find significant differences 😞

Hi, a trick that I applied for reading a large GML file is to have a PythonCreator split the (zipped) file into blocks. Depending on the structure of the file, for me it was possible to separate a header block, a trailer block, and intermediate feature blocks (about 100,000 lines per block). I wrapped the feature blocks in the header and trailer blocks and wrote each (compressed, otherwise it would generate lots of network traffic) block into a database CLOB. After this process finished, I could process each of the written blocks as a (relatively small) independent block. However, the success of this approach will depend on the structure of the XML.
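
A sketch of that block-splitting idea is shown below; the element names, block size, paths and the store_block() helper are placeholders, since what counts as header, feature and trailer depends entirely on the GML structure:

    import gzip

    BLOCK_LINES = 100000
    FEATURE_TAG = '<gml:featureMember'       # hypothetical per-feature start tag
    FEATURE_END = '</gml:featureMember>'     # hypothetical per-feature end tag
    TRAILER = '</gml:FeatureCollection>\n'   # hypothetical document closing tag

    def store_block(header, lines):
        # Placeholder: re-wrap the lines, compress them and insert them into a database CLOB.
        payload = gzip.compress(''.join(header + lines + [TRAILER]).encode('utf-8'))
        # ... INSERT payload into the staging table here ...

    header, block = [], []
    in_features = False

    with gzip.open('huge.gml.gz', 'rt', encoding='utf-8') as f:
        for line in f:
            if not in_features:
                if FEATURE_TAG in line:
                    in_features = True
                else:
                    header.append(line)   # everything before the first feature is header
                    continue
            if line.lstrip().startswith(TRAILER.strip()):
                break                     # reached the document trailer
            block.append(line)
            # only split on a line that closes a feature, so each block stays well-formed
            if len(block) >= BLOCK_LINES and FEATURE_END in line:
                store_block(header, block)
                block = []

    if block:
        store_block(header, block)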


Thanks for the answer, it's been a while! I solved the problem in a similar way but using another tool :-)


I am also struggling with this. I created a script that uses an ATOM feed to detect download links from PDOK ruimtelijkeplannen and then downloads and reads the files dubbelbestemming.gml, enkelbestemming.gml and bouwvlak.gml, whose URLs are fetched from the feed. By the time they have been downloaded with the HTTPCaller more than an hour has passed, but in the next step, reading the GML files with the FeatureReader while connecting a small area to the Initiator, it has to read the entire record structure before it is able to filter out the right records, right!? That also takes more than one hour. Any tips?

