Question

Health check failures after 24hrs when running FME Server in AWS using Aurora PostgreSQL Serverless for the FME Server Database


I'm running FME Server in AWS using Aurora PostgreSQL Serverless for the FME Server Database and I check the health of the server by calling the fmerest/v3/healthcheck?textResponse=true&ready=true REST API periodically.
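
For readers who want to reproduce the periodic check, here is a minimal sketch of such a probe in Java 11+. Only the endpoint path and query string come from the post; the class name, the base URL, the timeouts, and the choice to treat only HTTP 200 as healthy are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of FME Server.
class HealthCheckProbe {
    // Appends the health check path quoted in the post to a base URL.
    static URI healthCheckUri(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return URI.create(trimmed + "/fmerest/v3/healthcheck?textResponse=true&ready=true");
    }

    // Assumption: any status other than 200, or any I/O failure, counts as unhealthy.
    static boolean isHealthy(HttpClient client, String baseUrl) {
        HttpRequest request = HttpRequest.newBuilder(healthCheckUri(baseUrl))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder host; pass the real FME Server base URL as the first argument.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost";
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        System.out.println(isHealthy(client, baseUrl));
    }
}
```

A scheduler or load balancer would call isHealthy on an interval and alert after a run of failures.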

 

Everything runs OK for 24 hours, but after that the health checks start to fail and never recover. From the errors logged (below), I think it might be a connection pooling issue. I don't get the same issue on an Express install, and https://aws.amazon.com/blogs/database/best-practices-for-working-with-amazon-aurora-serverless/ says "Aurora Serverless closes connections that are older than 24 hours. Make sure that your connection pool refreshes connections frequently.", so I suspect this is the cause.
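
To illustrate what "refreshes connections frequently" could mean in practice, here is a rough Java sketch of a maximum-age rule that a connection pool might apply. The class and method names are hypothetical, and nothing in this thread confirms that FME Server exposes such a setting; only the 24-hour cutoff comes from the AWS article.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical policy object; a real pool would consult it before handing out a connection.
class ConnectionAgePolicy {
    // Cutoff quoted from the AWS blog post for Aurora Serverless.
    static final Duration SERVER_CUTOFF = Duration.ofHours(24);

    private final Duration maxLifetime;

    ConnectionAgePolicy(Duration maxLifetime) {
        // The pool's limit must be strictly shorter than the server-side cutoff.
        if (maxLifetime.isZero() || maxLifetime.isNegative()
                || maxLifetime.compareTo(SERVER_CUTOFF) >= 0) {
            throw new IllegalArgumentException(
                    "maxLifetime must be positive and shorter than " + SERVER_CUTOFF);
        }
        this.maxLifetime = maxLifetime;
    }

    // True once the connection has reached maxLifetime and should be closed and replaced.
    boolean shouldRetire(Instant createdAt, Instant now) {
        return Duration.between(createdAt, now).compareTo(maxLifetime) >= 0;
    }
}
```

With a limit of, say, 12 hours, every connection would be replaced well before Aurora Serverless closes it, so the pool should never hand out a connection the server has already dropped for age.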

 

I've copied some errors from the logs below (I can provide full logs if required). Am I right in thinking this is caused by Aurora Serverless killing the connections, or is it something else? Is there a setting anywhere that will force FME to refresh its connections before 24 hours have passed, or do I need to look at a different database option?

 

fmescheduler log:

Tue-28-Jun-2022 08:04:38.933 AM   ERROR    fmehealthnodeclient   SQLException: An I/O error occurred while sending to the backend.
Tue-28-Jun-2022 08:04:38.936 AM   ERROR    fmehealthnodeclient   org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
COM.safe.fmeserver.api.FMEServerException: org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
at COM.safe.fmeserver.database.ManagerBase.execute(ManagerBase.java:146)
at COM.safe.fmeserver.database.healthNode.HealthNodeOps.healthNodeKeepAlive(HealthNodeOps.java:40)
at COM.safe.fmeserver.database.healthNode.HealthNodeClient.run(HealthNodeClient.java:103)
at java.lang.Thread.run(Thread.java:748)

fmeserver log:

Tue-28-Jun-2022 08:05:39.803 AM   ERROR    fmeenginemgrnodeclient   402902 : Failed to connect to Job Queue. Please ensure Job Queue is started.
Tue-28-Jun-2022 08:05:39.804 AM   ERROR    fmeenginemgrnodeclient   Could not get a resource from the pool
redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
at redis.clients.util.Pool.getResource(Pool.java:53)
at redis.clients.jedis.JedisPool.getResource(JedisPool.java:226)
at COM.safe.fmeserver.JobRouterConfig.checkActiveQueueNodeAlive(JobRouterConfig.java:205)
at COM.safe.fmeserver.FMEServerJobRouter.checkActiveQueueNodeAlive(FMEServerJobRouter.java:146)
at COM.safe.fmeserver.jobs.EngineManagerNodeOps.checkActiveQueueNodeAlive(EngineManagerNodeOps.java:107)
at COM.safe.fmeserver.jobs.EngineManagerNodeClient.executeLeaderOp(EngineManagerNodeClient.java:98)
at COM.safe.fmeserver.database.NodeClient.run(NodeClient.java:123)
at java.lang.Thread.run(Thread.java:748)

I'm running FME Server v2022.0.1.1


4 replies


Hi @allanb, I agree: according to Amazon's documentation, it does appear that the default 24-hour connection limit is causing FME Server to lose its connection to the database. Based on what I've found, this behaviour is controlled by the idle_in_transaction_session_timeout parameter. A default installation of Postgres sets this value to 0, which disables the timeout, but Amazon has set this value to 24 hours (in milliseconds). It seems possible to view and edit this parameter and change it to 0. We don't test FME Server with Aurora Postgres, so I'm not sure whether this will cause any other unexpected results on the Aurora Postgres or FME Server side.

I'll try it out, though it will obviously take a while to test the change adequately.

 

It's worth pointing out that https://aws.amazon.com/blogs/database/best-practices-for-working-with-amazon-aurora-serverless/ mentions this parameter as something that can block scaling, so I might run into some different issues instead. I'll find out.

Unfortunately, updating the idle_in_transaction_session_timeout parameter makes no difference. I set it via a DB cluster parameter group and confirmed the setting had been applied by running SHOW ALL against the DB; it was set to 0. I then created another DB without setting idle_in_transaction_session_timeout, ran SHOW ALL again, and this also returned 0, which leads me to think that this setting is not what's killing the connections.

 

I could try a non-serverless database, but connections going away for 10-20 seconds is stock behaviour for RDS; it's used to refresh hardware, fail over, and scale up and down, so it's likely I'd still have the same issue, just not on such a consistent basis. Wouldn't I have exactly the same issue if any DB server needed to fail over, regardless of whether it was an RDS DB or not? Shouldn't FME be able to recover from this without needing a service restart? After the DB connection is lost, the job router goes from active to offline, and the health checks fail and never recover.
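
To make concrete the kind of recovery I mean, here is a minimal Java sketch of retrying a failed operation with exponential backoff. It is a generic illustration, not FME Server code: the class name, attempt count, and delays are made up, and a real client would need to obtain a fresh connection inside the operation on each attempt.

```java
import java.util.concurrent.Callable;

// Hypothetical helper showing retry with exponential backoff.
class ReconnectingCall {
    // Runs the operation up to maxAttempts times, doubling the wait after each failure.
    // Rethrows the last failure if every attempt fails.
    static <T> T callWithRetry(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                // Wait before the next attempt so a 10-20 second outage can pass.
                Thread.sleep(delay);
                delay *= 2;
            }
        }
        throw last;
    }
}
```

With five attempts starting at a 2-second delay, the waits add up to 30 seconds (2 + 4 + 8 + 16), which would cover the 10-20 second gaps described above.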

 

Do you have any other suggestions I can try?

 

Thanks

Hi @allanb, looking back at idle_in_transaction_session_timeout, I realize it relates to transactions and not connections. It does seem that this issue requires extra configuration on the Aurora side to prevent connections from being closed automatically. I'd suggest checking whether the AWS documentation includes any information on this, and possibly contacting AWS support to see if it's possible to prevent old connections from closing. If there isn't an existing control within Aurora that allows this, we can investigate whether there's something FME Server can do to maintain the connections.
