Solved

FME Cloud Performance


Badge +9

I have a large, relatively complex workspace that performs several tasks: it loads data, validates bearings, distances and areas, creates subsets of the data, and then creates a number of reports (PDF, HTML and XML).

The PostGIS database is located externally on a separate hosted server.

On my laptop the workspace takes less than 2 minutes to complete. My laptop is a pretty stock-standard HP running 64-bit Windows with 8GB RAM.

I published the workspace to FME Server Cloud on a Starter configuration with 2 virtual cores and 4GB RAM. This configuration caused the server to melt down, and the jobs just kept failing with no specific errors. I upgraded to the Standard configuration with 2 virtual cores and 8GB RAM. The workspace failed after about 30 minutes, following what appeared to be multiple attempts to run, but there were still no errors in the log.

Below is the memory monitoring. The job was started at 8:06, memory peaked a couple of times and then finally failed at 8:36

What I am seeking is some advice / answers in three areas.

1. How do I determine what is the best configuration for FME Server Cloud?

2. What is best practice when creating a workspace regarding number of transformers, location of data, etc.?

3. What are the recommendations to improve performance for FME Server Cloud?

 

 

Update 26/08/19:

I did get the workspace to work when the resources were bumped up to Professional (4 cores, 16GB), but it is still taking twice as long as the desktop version.

 

Update 27/08/19:

Data moved to the FME Cloud PostGIS database and instance set to Professional. Speed has improved to around 30 seconds to run the validation. Network throughput spikes when creating reports, but server load has reduced considerably. Job run at 14:40.

 


Best answer by redgeographics 26 August 2019, 10:34


14 replies

Userlevel 4
Badge +25

One big factor is that FME Cloud likes to be close to the data. If the data coming from the external database has to make a lot of hops to get to your FME Cloud instance, that might be a bottleneck.

Disc I/O to temporary files might be another bottleneck (especially when memory is maxing out). You can work around that by increasing the disc space on the FME Cloud temp drive (a bigger drive means it'll be faster).

If you take a look at the log files on both FME Cloud and your local machine and check the timings you may get an idea of where the slowdown is happening.
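
As a rough illustration of the timing check suggested above, here is a minimal Python sketch that finds the largest wall-clock gaps between consecutive log lines. It assumes the pipe-delimited line layout visible in the log excerpt posted later in this thread (timestamp, two numeric columns, level, message); the function names `parse_log_line` and `largest_gaps` are made up for this example, and the sample lines are copied from that excerpt.

```python
from datetime import datetime

def parse_log_line(line):
    """Return (timestamp, level, message), or None if the line has another layout."""
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None
    try:
        stamp = datetime.strptime(parts[0].strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return stamp, parts[3].strip(), parts[4].strip()

def largest_gaps(lines, top=5):
    """Return the `top` largest gaps as (seconds, message logged after the gap)."""
    parsed = [p for p in map(parse_log_line, lines) if p is not None]
    gaps = []
    for (prev_stamp, _, _), (stamp, _, message) in zip(parsed, parsed[1:]):
        gaps.append(((stamp - prev_stamp).total_seconds(), message))
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]

# Sample lines copied from the excerpt in this thread.
sample = [
    "2019-08-26 23:02:03|  21.7|  0.0|STATS |GeographicBufferer Output Output Renamer/Nuker (TeeFactory): Cloned 3 input feature(s) into 3 output feature(s)",
    "2019-08-26 23:02:13|  32.1| 10.4|INFORM|Aggregator (AggregateFactory): Preparing to divide 3 features into groups",
    "2019-08-26 23:02:14|  32.6|  0.4|INFORM|Aggregator (AggregateFactory): Dividing 3 features into groups",
]

for seconds, message in largest_gaps(sample):
    print(f"{seconds:6.0f}s before: {message}")
```

Running the same function over the laptop log and the cloud log and comparing the top few gaps should point at the step where the cloud run loses time.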

 

Badge +9


Thanks. The data is quite large as one of the datasets is the cadastre for the state. It is certainly worth testing by moving the data in to the cloud.

Userlevel 4
Badge +25


I've used an on-demand Enterprise instance to do Lidar processing. Once the data was up there (a few GB...) it ran fine. It still needed around 24 hours of processing, but I know none of the machines I have in my office would have been able to do it.

Badge +16


I would look into moving your data into the PostGIS database supplied with your FME Cloud instance.

Badge

It is strange that the job is failing on smaller instances. If the bottleneck was pulling the data, then the workspace might take longer to run than on your local machine but it shouldn't fail on a Starter. Since you are running the workspace locally on Windows and FME Cloud is on Linux, I am wondering if that might have something to do with it. Could you post the log file when you run on your Windows machine, the log file when the job fails on the Starter and the log file when the job succeeds on the Professional?

Badge

@deanhowell2009

During a low-memory condition, the Windows operating system will start to write to the disk of your machine to compensate for the lack of available memory. On Linux, this is usually accomplished by swapping. However, for cloud-based solutions swapping is usually avoided because it only addresses a symptom that is usually caused by allocating insufficient resources, which contradicts the idea of dynamic resource allocation in the cloud.

After several tests regarding the resiliency of FME Cloud instances, we also decided not to enable swapping. To guarantee stability during memory-intensive processing (e.g. point clouds, rasters), FME on Linux writes data to the FME Temp location, which on FME Cloud is directly mapped to the Temporary Disk (https://knowledge.safe.com/articles/65205/fme-cloud-how-to-speed-up-your-workflows-with-the.html).

To troubleshoot your scenario, I would first check if your workspace on Windows is caching more data on disk (Task Manager > Performance > Cached) while running.

If significantly more data is cached, then I would check the Temporary Disk usage on an FME Cloud instance that fails to run the workspace.

Not all transformers are able to write to the temp location so a log file and the workspace would be very helpful to investigate this further.
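
For the temporary-disk check described above, a small Python sketch like the following reports how full a temp location is. It assumes the `FME_TEMP` environment variable is set for the process and falls back to the system temp directory otherwise; the function name `temp_disk_report` is hypothetical, and where you run the script on the instance is left open.

```python
import os
import shutil
import tempfile

def temp_disk_report(path=None):
    """Summarise disk usage for the given path (default: FME_TEMP or the system temp dir)."""
    # Assumption: FME_TEMP points at the temp location; otherwise use the OS default.
    path = path or os.environ.get("FME_TEMP") or tempfile.gettempdir()
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
        "used_pct": used_pct,
    }

report = temp_disk_report()
print(
    f"{report['path']}: {report['used_gib']:.1f} of {report['total_gib']:.1f} GiB used "
    f"({report['used_pct']:.0f}%), {report['free_gib']:.1f} GiB free"
)
```

Sampling this a few times while the job runs gives a rough picture of whether the temp location fills up before the failure.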

Badge +9


Thanks @gerhardatsafe, there does appear to be lots more writing to disk when running the workspace locally. I will do some additional tests to see what I can figure out.

Badge +9


Here are the logs from the various configurations and locations.

The Starter configuration in the cloud took close to an hour but failed with explanation - starter_log.txt

The Professional configuration was successful in a few minutes - pro_log.txt

Running locally was successful in a few minutes - laptop_log.txt

 

Badge +9


Thanks @itay, I have now moved the base data into the cloud, and it does not really make a lot of difference on the Starter cloud instance. Temporary disk usage is low and network throughput is now very low, but server load is still very high.

Badge +21


TLDR: Can you try to disable or delete the GeographicBufferer (and route the data around it) and see if you can get the workspace running with no hiccups?

I see from the log files that you are using GeographicBufferer on the STARTER instance and GeographicBufferer_PythonCaller_2. Could the issue be Python 2.7 vs 3.4?

Also, on the PRO instance you seem to use GeographicBufferer_PythonCaller_2 and not GeographicBufferer.

@deanhowell2009

 

Badge +21

It seems the Standard version stops right after the GeographicBufferer, at the point where the Aggregator should start.

Both Pro and standard are using this: 

 2019-08-26 23:02:03|  21.7|  0.0|STATS |GeographicBufferer_AttributeCreator_2_OUTPUT_-___Rejected_ (TeeFactory): Cloned 0 input feature(s) into 0 output feature(s)
2019-08-26 23:02:03|  21.7|  0.0|STATS |GeographicBufferer_AttributeCreator_2_OUTPUT_-___Rejected__Output_-___Rejected_ (TeeFactory): Cloned 0 input feature(s) into 0 output feature(s)
2019-08-26 23:02:03|  21.7|  0.0|STATS |GeographicBufferer_Output1497550044 Output Collector (TeeFactory): Cloned 3 input feature(s) into 3 output feature(s)
2019-08-26 23:02:03|  21.7|  0.0|STATS |GeographicBufferer_<Rejected>1497550044 Output Collector (TeeFactory): Cloned 0 input feature(s) into 0 output feature(s)
2019-08-26 23:02:03|  21.7|  0.0|STATS |GeographicBufferer Output Output Renamer/Nuker (TeeFactory): Cloned 3 input feature(s) into 3 output feature(s)
2019-08-26 23:02:03|  21.7|  0.0|STATS |GeographicBufferer <Rejected> Output Renamer/Nuker (TeeFactory): Cloned 0 input feature(s) into 0 output feature(s)

But only Pro goes on with the Aggregator-part:

2019-08-26 23:02:13|  32.1| 10.4|INFORM|Aggregator (AggregateFactory): Preparing to divide 3 features into groups
2019-08-26 23:02:14|  32.6|  0.4|INFORM|Aggregator (AggregateFactory): Dividing 3 features into groups
2019-08-26 23:02:15|  33.7|  1.2|INFORM|Reprojector_6: Splitting feature table
2019-08-26 23:02:15|  34.0|  0.3|STATS |Sorter_6 (SortFactory): Finished sorting a total of 3 features.
2019-08-26 23:02:15|  34.0|  0.0|STATS |Sorter_6 SORTED Splitter (TeeFactory): Cloned 3 input feature(s) into 9 output feature(s)
2019-08-26 23:02:15|  34.0|  0.0|STATS |Reprojector_6 (TeeFactory): Cloned 3 input feature(s) into 3 output feature(s)
2019-08-26 23:02:15|  34.0|  0.0|INFORM|Processing base feature(s)...
2019-08-26 23:02:15|  34.0|  0.0|STATS |NeighborFinder_2 (ProximityFactory): Input Summary:  15 base feature(s), 3 candidate feature(s)
2019-08-26 23:02:15|  34.0|  0.0|STATS |NeighborFinder_2 (ProximityFactory): Output Summary: 15 matched feature(s), 0 unmatched base feature(s)
2019-08-26 23:02:15|  34.0|  0.0|STATS |Coord_System_Retriever_2 (TeeFactory): Cloned 3 input feature(s) into 3 output feature(s)
2019-08-26 23:02:15|  34.0|  0.0|STATS |Reprojector_4 (TeeFactory): Cloned 3 input feature(s) into 3 output feature(s)
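
The comparison above (find where one log stops and the other carries on) can be sketched in Python. This is only an illustration under the assumption that both logs use the line layout shown in the excerpts; `messages` and `first_missing` are names invented for this example, and the two short lists stand in for the real log files.

```python
import re

# Matches the "timestamp|number|number|LEVEL|" prefix so messages compare across runs.
PREFIX = re.compile(r"^\s*\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\|[^|]*\|[^|]*\|[^|]*\|")

def messages(lines):
    """Strip the timing prefix and return only the message text of each log line."""
    result = []
    for line in lines:
        match = PREFIX.match(line)
        if match:
            result.append(line[match.end():].strip())
    return result

def first_missing(failed_lines, ok_lines):
    """Return (last message of the failed run, first later message only the good run has)."""
    failed = messages(failed_lines)
    ok = messages(ok_lines)
    last = failed[-1] if failed else None
    seen = set(failed)
    start = ok.index(last) + 1 if last in ok else 0
    for message in ok[start:]:
        if message not in seen:
            return last, message
    return last, None

# Stand-ins for the failing and the successful log, using lines from the excerpts.
failed_log = [
    "2019-08-26 23:02:03|  21.7|  0.0|STATS |GeographicBufferer <Rejected> Output Renamer/Nuker (TeeFactory): Cloned 0 input feature(s) into 0 output feature(s)",
]
ok_log = failed_log + [
    "2019-08-26 23:02:13|  32.1| 10.4|INFORM|Aggregator (AggregateFactory): Preparing to divide 3 features into groups",
]

last, missing = first_missing(failed_log, ok_log)
print("Failed run stops after:", last)
print("Good run continues with:", missing)
```

In practice the two lists would come from reading the full log files line by line.
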
Badge


@itay & @deanhowell2009

Just a quick note here: we generally do not recommend using the built-in PostGIS for production, especially not on a Starter instance. It's more of a proof-of-concept tool to get started quickly. If lots of work is offloaded to the DB, it can impact FME Server performance because they share the same resources.

Best performance regarding PostGIS is most likely achieved with an RDS instance on AWS in the same region as your FME Cloud instance.
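
To get a rough feel for how "close" a database is to the machine running the workspace, one option is to time plain TCP connections to the database port. The sketch below is an assumption-laden illustration, not an FME feature: `tcp_connect_ms` and `summarize` are hypothetical helper names, port 5432 is the PostgreSQL default, and the host in the commented usage line is a placeholder. It measures connection setup only, not query or transfer time.

```python
import socket
import statistics
import time

def tcp_connect_ms(host, port=5432, attempts=5, timeout=3.0):
    """Time `attempts` TCP connections to host:port and return the durations in ms."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        # Open and immediately close a connection; only the handshake is timed.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings

def summarize(timings):
    """Return (min, median, max) of the measured connect times."""
    return min(timings), statistics.median(timings), max(timings)

# Example usage (placeholder host name, replace with your database host):
# low, mid, high = summarize(tcp_connect_ms("db.example.com"))
# print(f"connect time ms: min={low:.1f} median={mid:.1f} max={high:.1f}")
```

Comparing the numbers from the laptop and from the cloud instance against the same database host gives a first hint of whether network distance is part of the slowdown.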

Badge +9


Thanks @gerhardatsafe, it is our preference to have the database separated but it did show an amazing improvement in performance when the data was stored locally.

Badge +9


Thanks @sigtill for your input, it is very much appreciated. Both instances are running the same workspace, but I did notice the same thing: the STARTER does seem to slow down around the bufferer.
