Solved

How is it possible that 6 of 8 engines in FME Server fail during the night while running jobs? The error related to this failure is included in the text below. How can we prevent it from happening again?

  • 18 August 2022
  • 9 replies
  • 2 views

There is an FME Server setup with 8 engines, each on its own server.

 

Error: Authentication failed: java.sql.SQLException: COM.safe.fmeserver.database.FMEServerDBException: org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
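For anyone who wants to check how often this error shows up in their own logs, here is a minimal Python sketch. It only matches the PostgreSQL message text quoted above; the function names are made up for illustration and are not part of FME Server.

```python
import re

# Message text PostgreSQL reports when only reserved slots remain.
SLOT_EXHAUSTION = re.compile(
    r"FATAL:\s+remaining connection slots are reserved", re.IGNORECASE
)


def is_slot_exhaustion(log_line: str) -> bool:
    """Return True if the line reports exhausted PostgreSQL connection slots."""
    return SLOT_EXHAUSTION.search(log_line) is not None


def count_slot_exhaustion(lines) -> int:
    """Count the lines in an iterable that report the error."""
    return sum(1 for line in lines if is_slot_exhaustion(line))
```

Passing the lines of a log file to `count_slot_exhaustion` gives a rough idea of how frequently the database refused connections.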

 

Anonymized event before the engine failure:

INFORM  xxxxxxx  418049 : Received HTTP Basic credentials for user: xxxxxxxxx

 

A temporary solution is to restart the FME services on the servers.

 

Possible cause: a recursive connection error during the authentication check, which eventually shuts down the engine?

Best answer by mkastelijns 3 October 2022, 12:08

Hi @mkastelijns​ ,

 

Thanks for the question. It would be helpful to know the build of FME Server, and whether you are using the default Postgres database or your own. FME Server usage may also be useful, for example are you running many short jobs or if you are extensively using automation? Lastly, it would be helpful to know whether there are multiple cores / web applications. I encourage you to open a case so we can help narrow down the issue. That being said, based on the error, it sounds like we might need to increase max_connections on the Postgres database.
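As a rough illustration of why raising max_connections helps: PostgreSQL holds back a few slots for superusers (the superuser_reserved_connections setting, default 3), and ordinary logins are refused once the remaining slots are taken. The Python sketch below just does that arithmetic. The function name is hypothetical, and the inputs are assumed to come from your own database, for example from `SHOW max_connections;`, `SHOW superuser_reserved_connections;`, and a row count on `pg_stat_activity`.

```python
def available_slots(max_connections: int, reserved: int, in_use: int) -> int:
    """Slots still open to ordinary (non-superuser) roles, never below zero."""
    return max(max_connections - reserved - in_use, 0)


# Illustrative numbers only: the in_use value of 297 is an assumption,
# not a measurement from this thread.
before = available_slots(max_connections=300, reserved=3, in_use=297)
after = available_slots(max_connections=500, reserved=3, in_use=297)
print(before, after)
```

When the result reaches zero, new non-superuser connections fail with the "remaining connection slots are reserved" error.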

 

Hello Richard,

 

Thanks for the quick response. Some more information about the FME Server 2018 setup:

 

We have a build:

FME Server 2018.1 - Build 18520 - win64

We have our own PostgreSQL database that we can configure ourselves, including the number of connections.

There are multiple cores / web applications: every server has one engine, one core, and the web application.

 

If you need more information, please let me know.

 

Kind regards,

 

Matthijs Kastelijns

Hi @mkastelijns​ ,

 

Hopefully, you were able to test increasing the number of connections. I would suggest that 100 connections should be enough. If you want to open a case, we could review the logs and configuration to see if we can find any reason you're running out of connections.

Hey Richard,

 

We have increased the number of connections in the backend PostgreSQL database from 300 to 500. Since then there have been no more issues. The engines remain up and running. Thanks for your advice.

Hi @mkastelijns​ ,

 

Thanks for the response. I'm happy it's stable, but 500 or even 300 connections is much higher than we have ever seen. We are curious about the setup and the reason this is happening. If you are concerned and want to open a case, we would be happy to investigate and see if we can reproduce this scenario.

We run 14 FME Server machines on two clusters; every machine has its own core, Tomcat, and standard engine. So far the changes have greatly improved the machines. In the last 6 months there have been no more issues.

@richardatsafe​ 

 

Could you elaborate on "FME Server usage may also be useful, for example are you running many short jobs or if you are extensively using automation?"

 

We recently upgraded to FME Server 2022.1.1 and are experiencing jobs failing with no error message in the job logs. Nearly all of our jobs run under Automations, and most finish quickly. The jobs that fail run without issue when run independently on the server.

Hi @vn1​ ,

 

I don't think I am able to tie this to a specific usage pattern at this point. This issue is very uncommon and typically restricted to larger installations with 10 or more engines, but there may be a usage pattern in there as well. If you suspect you are seeing a similar issue, please create a case with the log files, particularly the database logs, and we can explore it further.

Thank you for the response. If anyone stumbles on this, we were able to resolve our issues by implementing automatic retries, as described in the article below. It is still unknown why a couple of our automations fail with no error messages in the job/system log files. They may continue to fail occasionally, but they eventually succeed with automatic retries.

 

https://community.safe.com/s/article/Configuring-Guaranteed-Delivery-in-FME-Server-Automations-with-Automated-Retries
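To illustrate the general idea behind automatic retries (this is not how FME Server Automations implement it; the article above covers the actual configuration), here is a small Python sketch. `run_with_retries` and `job` are hypothetical names, and the wait between attempts is an arbitrary choice.

```python
import time


def run_with_retries(job, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call job() until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(delay_seconds * attempt)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A job that fails intermittently will then succeed as long as one of the attempts goes through, which matches the behaviour described above: the underlying cause remains, but the overall run completes.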
