We upgraded our FME 2023.1.1 to 2024.0.3. To test the new version, several tests were run on the new environment, among them a performance test against PostgreSQL and Oracle databases.
The workspaces that are run in an automation do three things:
1. Download open data and load it into our own PostgreSQL database
2. Read the data from the database
3. Delete the data from the database
We saw strange things going on in the numbers for both our PostgreSQL and Oracle tests. In the PostgreSQL test the CPU time was longer than the elapsed time. Furthermore, the running time in 2024 was generally longer than in 2021. Please find attached examples of the runs on FME 2021 and FME 2024; it is clearly visible that something is wrong. Can you explain why in the 2021 runs the elapsed time seems to be longer than the CPU time, while in the 2024 runs it is the other way around? The dataset used is exactly the same in both workspaces published on FME Server/Flow.
We repeated the same experiment 5 times for several dataset sizes ranging from 4,000 to 400,000 features. The pattern repeats itself in every run. For example, in the PostgreSQL test runs:
CPU time (400,000 features), FME 2021 vs FME 2024:
Run 1: 00:00:51:25 vs 00:01:30:20
Run 2: 00:00:51:54 vs 00:01:32:17
Run 3: 00:00:50:70 vs 00:01:24:03
For Oracle, an overview with CPU times is attached.
The new engine machines (2024) have 12 cores and 64 GB RAM.
The old machines (2021) have 4 cores and 24 GB RAM.
Could you please explain the difference in CPU time between the versions, and why the CPU time is sometimes longer than the elapsed time? If you need more examples or the raw test data, please let me know.
Thanks in advance,
Matthijs Kastelijns
Hi Matthijs,
Looks like you did your testing on FME Server. So, from the docs:
CPU Time: Total CPU time to run the job, as recorded by the FME Server REST API.
Note: The REST API record of CPU Time differs from FME Server and FME Desktop logs because it includes additional start and end scripting, and is a more accurate report of total processing time.
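If you want to pull those REST API numbers out directly for your test jobs, here is a minimal Python sketch; the host, token and job IDs are placeholders, and the exact field names in the completed-job response can vary per FME Server/Flow version, so print the raw response once to confirm them:

```python
import requests

# Placeholders: point these at your own FME Server/Flow instance and token.
FME_HOST = "https://fmeflow.example.com"
TOKEN = "my-api-token"
HEADERS = {"Authorization": f"fmetoken token={TOKEN}"}

def job_stats(job_id: int) -> dict:
    """Fetch one completed job and return its timing-related fields."""
    url = f"{FME_HOST}/fmerest/v3/transformations/jobs/completed/{job_id}"
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    job = resp.json()
    result = job.get("result", {})
    # cpuTime / elapsedTime / peakMemUsage are assumed field names; check the
    # full response on your version if these come back as None.
    return {
        "status": job.get("status"),
        "cpuTime": result.get("cpuTime"),
        "elapsedTime": result.get("elapsedTime"),
        "peakMemUsage": result.get("peakMemUsage"),
    }

if __name__ == "__main__":
    for jid in (101, 102, 103):  # hypothetical job IDs from the test runs
        print(jid, job_stats(jid))
```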
So that’s one thing. Another thing I noticed is that one of your jobs actually has warnings:
see the little orange triangle with the number 10, which indicates 10 warnings.
That would probably also account for some of the extra processing, because of the extra logging and error handling involved.
All of that said: your tests can’t be very conclusive, because there are a lot of network and external resources involved. Even if you run the two test versions completely simultaneously, there will still be differences in network traffic that will interfere with, for example, the download part of your test. Then there is the reading from and writing to the databases: are these dedicated databases on dedicated machines, so that nobody else can use them while you perform the tests? And is there nothing else running on those machines? Is there a direct connection between FME Server and the databases, or is there networking involved as well? Were there indexes on the tables into which the data was to be inserted? If not, did you specify that FME should create those indexes, or even create the tables? That could have an impact as well. Was the OS on all machines the same? Even the firewalls could have an impact.
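On the index question specifically: if it helps, a quick check on the PostgreSQL side could look like the sketch below. The connection details and target table name are placeholders; run it against the table the workspace writes into, in both test environments:

```python
import psycopg2

# Placeholders: point this at the database the workspace writes into.
conn = psycopg2.connect(
    host="db.example.com", dbname="geodata", user="fme", password="secret"
)
TARGET_TABLE = "bgt_import"  # hypothetical target table

with conn, conn.cursor() as cur:
    # pg_indexes lists every index defined on the table.
    cur.execute(
        "SELECT schemaname, indexname, indexdef FROM pg_indexes WHERE tablename = %s",
        (TARGET_TABLE,),
    )
    for schema, name, definition in cur.fetchall():
        print(f"{schema}.{name}: {definition}")

conn.close()
```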
On the whole, there are a ton of variables in the tests as you describe them that can all affect the end result FME Server shows. So when I see the rather small difference, I wouldn’t be worried, to be honest.
Thanks for the quick response,
I will check the API CPU time. Let's see if those numbers make more sense.
On your next point, the remarks about variables like connectivity, firewalls, etc.: the two versions are connected to the same data center, have the same security setup, and are connected to the same PostgreSQL/Oracle database. The tested dataset is also the same in both version tests. The test runs are executed on FME environments without other users or jobs, to make sure nothing else interrupts our tests. So basically we expect similar results. So far, it is not clear what causes the differences in CPU time/elapsed time. On our side we have made sure, to the maximum extent possible, that it is not something environment-related.
If our workspace were run by Safe or another user on their own database and gave (near) equal values for 2021 and 2024, that would rule out a performance regression. That would be useful to test; see also the command-line timing sketch at the end of this post.
Results that would be comforting would, for example, look like this:
400,000 features, FME 2021 vs FME 2024:
Run 1: 00:00:51:25 vs 00:00:50:16
Run 2: 00:00:51:54 vs 00:00:49:87
Run 3: 00:00:50:70 vs 00:00:52:33
instead of our:
400,000 features, FME 2021 vs FME 2024:
Run 1: 00:00:51:25 vs 00:01:30:20
Run 2: 00:00:51:54 vs 00:01:32:17
Run 3: 00:00:50:70 vs 00:01:24:03
The FME environment that we host includes applications that process millions of features on a daily basis. If a performance regression occurs at 400,000 features, the running time for those jobs will multiply. That's the main concern.
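To take FME Flow's extra start/end scripting and queueing out of the comparison, we could also time the same workspace directly against each engine from the command line. A minimal sketch, assuming fme(.exe) is on the PATH and the workspace needs no published parameters (the workspace path is a placeholder):

```python
import subprocess
import time

# Hypothetical workspace path; run this once on the 2021 engine machine and
# once on the 2024 engine machine, with nothing else running.
WORKSPACE = r"C:\tests\bgt_load.fmw"

def run_once() -> float:
    """Run the workspace with the fme command-line engine, return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(["fme", WORKSPACE], check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    durations = [run_once() for _ in range(3)]
    print("runs (s):", [round(d, 1) for d in durations])
```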
Well, then you’ll have to check and compare the log files for each job; that should tell you more. One of the things mentioned here is that
The difference between the session duration and CPU times is an indication that there are aspects of the translation outside of FME's control: slow database queries, network latency, or slow HTTP requests.
Besides that:
Basically we expect similar results
But you’re not getting them, and one thing could possibly be the warnings you are getting for one of your workspaces (you haven’t mentioned anything about that). You expect similar results, but if you’re not getting them, then that means a full-on game of find-the-differences…
One of the differences may even be caused by not upgrading all transformers to the newest version, for all I know. So running the same workspace on two different versions of FME might have an impact simply because (some?) transformers are not the version built for that version of FME, and thus might not be able to make full use of FME improvements.
The difference between the session duration and CPU times is an indication that there are aspects of the translation outside of FME's control: slow database queries, network latency, or slow HTTP requests.
Thanks for the tip; the next step for us is to look at the query metadata together with a DBA, or to look closely at the log timestamps for when the database actions start and end.
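To get those start/end times out of the job logs quickly, something like the sketch below could be a starting point. It is only a rough sketch: it assumes the usual FME job-log layout where each line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp followed by a '|' separator, and the log path and keyword are placeholders:

```python
import re
from datetime import datetime

# Placeholders: the job log to inspect and a keyword that marks the database
# reader/writer lines (feature type, writer name, table name, ...).
LOG_FILE = r"C:\tests\job_12345.log"
KEYWORD = "POSTGRES"

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\|")

def keyword_span(path: str, keyword: str):
    """Return the first and last timestamps of log lines mentioning the keyword."""
    stamps = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if keyword in line:
                match = TIMESTAMP.match(line)
                if match:
                    stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return (stamps[0], stamps[-1]) if stamps else (None, None)

if __name__ == "__main__":
    first, last = keyword_span(LOG_FILE, KEYWORD)
    if first and last:
        print(f"{KEYWORD}: {first} -> {last} ({(last - first).total_seconds():.0f} s)")
```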
But you’re not getting them, and one thing could possibly be the warnings you are getting for one of your workspaces (you haven’t mentioned anything about that). You expect similar results, but if you’re not getting them, then that means a full-on game of find-the-differences…
Those warnings are the same for both versions. They are dataset-specific; the tested data is Dutch BGT data (see attached image).
One of the differences may even be caused by not upgrading all transformers to the newest version, for all I know. So running the same workspace on two different versions of FME might have an impact simply because (some?) transformers are not the version built for that version of FME, and thus might not be able to make full use of FME improvements.
All the transformers were successfully upgraded from FME 2021 to 2024. There might still be an under-the-hood problem, but there is no mention of it in the FME Flow logging.