
Hi all,

I created a small test to compare the performance of Python and FME.

For this test I created a CSV file containing records for 1 000 000 people (ID, SEX). The goal was to count the number of men and women in this list.

To answer this question I built a very basic workspace with a reader, a StatisticsCalculator and a Logger. The workspace completes in 45 seconds.

When I do the same using only Python code, the result is printed in 2 seconds.

Why is there such a big time difference between those two?

I know that the startup time of FME has to be taken into account (less than 3 seconds), and that the reader converts the input to the FFS format, which takes some time. But I am sure that if I changed the Python code so that the features were written to an SQLite database, it still wouldn't take 45 seconds to complete.

Can someone clarify this for me?

Thanks!

(The attached CSV contains only 500 000 instead of 1 000 000 features, since the full file was too big.)

I'm not a python programmer so this is not a definitive answer but...

You're using 2 StatisticsCalculators in FME, and they both need to get all features in memory to do their work, which takes time. Also, the CSV reader isn't exactly fast. In contrast, the Python you've posted seems to keep a running total, so it loops through the features only once and doesn't need to keep them in memory.

There is a trick to use the PointCloud reader rather than the CSV reader to at least get rid of that bottleneck, but I can't reproduce it right now. Another idea would be to store the data in a database and get it out using a SQLCreator, set to order by the sex field. Then set the StatisticsCalculator to be ordered by group.
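The original script isn't shown in the thread, so here is only a minimal sketch of what a single-pass "running total" count might look like in Python. The column names ID and SEX come from the question; the sample values ("M"/"F") and the function name `count_by_column` are assumptions for illustration.

```python
import csv
import io
from collections import Counter


def count_by_column(rows, column="SEX"):
    """Count the values of one column in a single pass.

    Only the running totals are kept; each row can be discarded
    as soon as it has been counted.
    """
    counts = Counter()
    for row in rows:
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for the real CSV file (made-up sample data).
sample = io.StringIO("ID,SEX\n1,M\n2,F\n3,F\n4,M\n5,F\n")
result = count_by_column(csv.DictReader(sample))
print(dict(result))  # {'M': 2, 'F': 3}
```

With a real file you would pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample. Memory use stays flat no matter how many rows there are, which is the contrast with a step that has to hold every feature before it can produce a result.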


In my experience it depends on how much work the Python script does. If the script performs just one step, the equivalent of a single transformer, it may even be slower and offers little advantage over FME. But if you can do a lot of the processing in Python, it can speed things up. This is only what I have noticed; I am sure a Safe employee could give a better explanation of why, and it will also depend on what you are doing.


There was a good answer from @daleatsafe on another question about reading CSV with Python.

Here's what he had to say (not all of which might apply to this question):

A fairer test would be to run the Python CSV module against a workspace that does what you want. In general, we strongly discourage using FME Objects instead of just workspaces.

If you did that, you could use the CSV reader, turn on the SORT option in its settings, and have it sort the fields by whatever the column was you wanted to remove duplicates by. Then, in FME 2016, route the results of that read into a DuplicateRemover and indicate that your input is ordered. That will perform drastically better.

I think the lesson is to work with FME's strengths as you approach problem solving with it. Watch out for things that block up features -- that will cause performance slowdowns. And try to thin out and remove data from the stream as early as possible.

Now, having said all that, I am still sure that the python raw CSV reading will beat us. For now. We are working on some revolutionary technology which will help us with these nasty-bad-boy-mega-CSV files and I so look forward to unveiling that. On a stage. With FME under a black tablecloth which I'll pull off with a flourish. Wearing a mock-turtleneck.

But that will have to wait for a while yet...
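As a rough Python analogy for the "input is ordered" point in the quote (this is not how FME is implemented internally, just an illustration of the idea): when rows arrive already sorted by the grouping key, each group can be finished and released as soon as the key changes, so nothing has to block until the whole input has been read. The helper name `count_sorted_groups` and the sample rows are made up.

```python
from itertools import groupby
from operator import itemgetter


def count_sorted_groups(rows, key):
    """Yield (group_value, count) pairs from rows already sorted by `key`.

    Because the input is ordered, a group is complete the moment the key
    changes, so only one running count is held at any time.
    """
    for value, group in groupby(rows, key=key):
        yield value, sum(1 for _ in group)


# (ID, SEX) rows, pre-sorted by the SEX column.
rows = [(2, "F"), (3, "F"), (5, "F"), (1, "M"), (4, "M")]
counts = list(count_sorted_groups(rows, key=itemgetter(1)))
print(counts)  # [('F', 3), ('M', 2)]
```

Note that `groupby` only merges adjacent rows, so on unsorted input the same key would show up as several separate groups. That is the same reason the ordered setting is only safe when the data really is ordered.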


Thank you for the answer!

Of course I know that using a database would give me a better-performing solution. If I had to implement this kind of model for real, I would just solve it with SQL.
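For anyone following along, the SQL route could look roughly like this with Python's built-in `sqlite3` module. It is only a sketch: the table name `people`, the in-memory database and the sample rows are assumptions, and no timings are implied.

```python
import csv
import io
import sqlite3

# Tiny in-memory stand-in for the real CSV file (made-up sample data).
sample = io.StringIO("ID,SEX\n1,M\n2,F\n3,F\n4,M\n5,F\n")
reader = csv.reader(sample)
next(reader)  # skip the header row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, sex TEXT)")
# executemany consumes the reader row by row, one INSERT per CSV line.
conn.executemany("INSERT INTO people (id, sex) VALUES (?, ?)", reader)
conn.commit()

query = "SELECT sex, COUNT(*) FROM people GROUP BY sex ORDER BY sex"
counts = dict(conn.execute(query))
conn.close()
print(counts)  # {'F': 3, 'M': 2}
```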

I am not looking for the option that gives the best performance. I am wondering why it takes FME so much longer to read all the data.

I just tried the PointCloud reader, but I can't get it working with a CSV: only one feature is read. Do you know what I could have done wrong? (I have never used point clouds or the PointCloud reader before.)

Yes, you should be able to use the PointCloud XYZ format - then use a PointCloudCoercer to turn it into points.

There's an exercise in chapter 2 of the advanced training that does that (I haven't updated that chapter to 2016 yet). I've put the workspace on Dropbox for you (or anyone) to get. You'd need to download the FMEData dataset to run it (safe.com/fmedata).

