Solved

ArcGIS Online Reader/Writer Error: Code 503

  • 28 December 2018
  • 34 replies
  • 104 views

Hello,

So we have some workbenches (FME 2018.1.1; 20181203 - Build 18578 - WIN64) that either read or write to AGOL after performing a bunch of operations on the data. These workbenches were working fine up until a week ago, when all of a sudden they started throwing a very generic 503 error message and failing completely.

If I understand correctly, a 503 means it's a server side error (so it's likely something with ESRI), right?

We've been having issues with our organization's ArcGIS Online portal for the last couple of days (layers being randomly dropped from maps, apps refusing to load) and I'm starting to think that their December 2018 AGOL update messed some things up, including the way FME's readers/writers communicate with it. Is anyone else having trouble working with AGOL lately?

Thanks!

 

Best answer by runneals 3 March 2020, 19:04

34 replies

Hi @rinfante,

Sorry to hear that you are getting this error. 503 is a generic server-side error (i.e. 503 Service Unavailable Error). This could also potentially be a security change - perhaps SAML auth was added?

Are you able to reach your portal through a web browser or do you only get this error when attempting to read/write data with FME?

Hi Chris,

 

Though we were able to reach the portal through the browser, it kept dropping layers randomly from the maps and the apps. This is why I was wondering if some instability in ArcGIS Online was somehow affecting the way FME communicated with it.

ESRI ended up identifying a bug that they're still looking into.

Anyway, we haven't had too much trouble with it over the last couple of days (we might get a 503 error the first time we try a workflow, but it'll work on a subsequent attempt right afterwards). It was just really bad when I posted this question; it must have failed about ten times before I decided to create a non-AGOL workflow.

Thanks for your reply!

Hello rinfante, we've been having the same issues in our AGOL organization, and unfortunately it hasn't gotten better.

Could you please tell me if your services come from https://services3.arcgis.com/? We have other organizations and only the one on that server is getting 503 errors.

Also, could you please tell me which country you/your organization are in?

Hi guys,

I've been seeing similar issues with a customer over the last few days (they have been seeing it throughout January). We have been getting the same 503 errors when trying to do reads or inserts/updates and deletes into layers via the AGOL FME Reader/Writer. This has happened with both FME 2016 and FME 2018.1 (build 18578). It's also intermittent: a Workspace will fail, but restart it straight away and it will run fine.

@ppgabmail that's an interesting comment about services3. I need to double check, but I'm pretty sure that their services are coming from that source.

Two things that "seemed" to help were reducing the number of features read/written so that the transactions were smaller, and using the FeatureHolder transformer to force FME to read all the data completely at the start of the run, so that all of it is available before writing begins. There's a performance hit, but reducing the number of simultaneous actions against the ESRI connection "seemed" to give better stability.

It may have been random chance, however, so I wouldn't swear by it, and at best it's a bit of a band-aid rather than a solution, sorry.

I've not had any issues with FME reading/writing to AGOL with other users in the last week, so it doesn't appear to be an issue with the FME reader/writer, and it does sound like there is an issue with the services3.arcgis.com endpoint.
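For anyone scripting writes outside of FME, the "smaller transactions" idea can be sketched roughly as below. This is only an illustration in plain Python: the `chunked` helper and the batch size of 200 are assumptions made for the example, not anything FME or ESRI provides, and each batch would still have to be sent by whatever client you use.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(features: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` features."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(features), size):
        yield list(features[start:start + size])


# Hypothetical stand-in for 450 features waiting to be written.
pending = list(range(450))
for batch in chunked(pending, 200):
    # Each batch would be sent as its own, smaller write request here.
    print(len(batch))
```

Sending the batches one after another, rather than in parallel, also keeps the number of simultaneous actions against the service low.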

We have been testing and trying to work around this error for almost a month now. We have managed to narrow it down, but have no solution, I'm afraid.

The 503 error comes from AWS CloudFront, which basically means an overload on the server.

We did not get it from FME but from the JS API. The error occurs directly at the REST API endpoint, though: you can get it just by navigating to the API in your browser.

And yes, it has a higher chance of happening when you do several queries in a row. In our case it happens almost 100% of the time, since we're gathering the attachments of a layer in batch and you must do a separate attachment query for each feature.

I do believe it's an overload on the services3 machine. It's an old server that hosts most of ESRI's sample layers. If you create a new AGOL account right now it will be on services9 (as of today), which gives you no errors.

This error has completely halted our development cycle and application deployment. We are still waiting for an official reply from ESRI.
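If you want to check whether a failing response passed through CloudFront, the response headers are the place to look. The sketch below assumes the usual CloudFront markers (`X-Cache` values such as "Error from cloudfront" and a `Via` header mentioning CloudFront); the function name and the sample header values are made up for illustration. Note that it only tells you the response came via the CDN, not which layer behind it actually failed.

```python
from typing import Mapping


def is_5xx_via_cloudfront(status: int, headers: Mapping[str, str]) -> bool:
    """Return True for a 5xx response whose headers show it passed through CloudFront."""
    if not 500 <= status <= 599:
        return False
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    return (
        "cloudfront" in lowered.get("x-cache", "")
        or "cloudfront" in lowered.get("via", "")
    )


# Hypothetical headers copied by hand from a failing response.
sample = {
    "X-Cache": "Error from cloudfront",
    "Via": "1.1 example.cloudfront.net (CloudFront)",
}
print(is_5xx_via_cloudfront(503, sample))
```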

Hey guys,

I'm sorry to hear about all of this. It's actually puzzling how hard it is to find any resources on this specific issue. Back when I wrote my initial post, it felt like we were the only ones having this problem.

It looks like our data runs on the services7 machine. Oh, and we are based in Calgary, Alberta, Canada, by the way.

We haven't had any issues since the end of December, so maybe whatever overload was going on there has just moved to services3.

Would you mind sharing whatever ESRI comes up with, just for information's sake? It is so frustrating to have this pop up and not be able to do anything about it, through no fault of your own.

Good luck.

@ppgabmail Do you have an esri case #/bug #? We have been hitting this constantly, and we have M2, which should increase capacity quite a bit.

I will say, after reading what you posted, it seems like almost all of the 4xx and 5xx errors are coming from CloudFront (the edge) or from the machines it connects to behind it.

Just opened a ticket with esri and working through this with them. Will keep you informed what we figure out.

I'm having slightly different errors when trying to write to AGOL using FME (504, Memory Pressure) but it sounds like my problem is similar to the ones described in this thread (see my other question for details). I only started working with FME and AGOL in December so I can't say if my workbench would have worked before.

In my case, the workbench usually works fine if I run it from within FME by hitting the "Run Translation" button. When I run it through Windows Task Scheduler, it sometimes finishes successfully, but at other times it fails and returns various errors.

Our map services are stored on the services3 AGOL instance. I've tried some suggestions from FME Support and others without success (reduce features written, FeatureHolder, reduced number of transformers, reauthenticate, etc).

Hi @bfausel, Do you have a support case open with us for this now? If you don't, can you please create one using this form: https://www.safe.com/support/report-a-problem/ We can help you dig into this some more and see if we can narrow down the source of the problem you're seeing.

Hi, we did the same and apparently it was due to a "confluence of several factors which each independently push the hive slightly closer to these 503 errors" (they've not disclosed what these are). ESRI have been working to fix these factors and we have found Hive 3 to be much more stable since. So hopefully it's related and you'll start to see the same stability!

Hi Laura, yes, I do have a support case open and they have suggested a number of things to try. I have been working through the suggestions as time permits. There was a lot to digest, though. I will post an update if one of the solutions works.

We started getting a 404 error that kept crashing our AGOL processes a few weeks ago. Sometimes they'd write fine while other times they'd just throw the error, claiming they couldn't find the URL they were trying to access.

We opened a ticket with ESRI and they gave us the usual, non-committal replies ('dashboard shows the systems are running fine', 'have you contacted SAFE instead?', etc)...but then they supposedly 'fixed a bug' (they haven't disclosed what the bug was, but I suspect it's related to the one you mentioned) and the problem went away (or seems like it has so far).

So yeah, if nothing else all these tickets we've opened with them seem to have helped.

@gavinpark3 @rinfante Still working through these issues with esri support. We were able to reproduce 502 errors, but haven't been able to reproduce the 503 errors yet. The weird thing is that we stopped seeing the bulk of the 503 errors on 2/20 (although we did see 2 on the 26th). They did have me create an index on the service for a field that I query with my FME job, although that didn't help.

I have a sneaking suspicion that whatever "it" is, it is related to the data store and/or indexes. We get another error (either a ConnectionResetError or a ConnectionAbortedError) quite a bit, which we found is resolved when we rebuild the spatial index.

Also, a tip for working with esri support... I usually refer to the "ArcGIS Data Interoperability extension" instead of "FME", as they can then open up your workspaces (although they don't have certain transformers, like the AttributeManager) and reproduce issues ;)

5xx errors are server-side; 4xx errors are client-side. So any 5xx errors are from ArcGIS Online. I have gotten some odd errors that seem to be an issue between the REST endpoint (where you write data to) and how the data gets to the database; these came with a SQL-specific error code.
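As a quick illustration of that rule of thumb, a log-triage script could bucket status codes along these lines. The `error_side` name is invented for this sketch; the ranges themselves are the standard HTTP status classes.

```python
def error_side(status: int) -> str:
    """Classify an HTTP status code by which side the error is attributed to."""
    if 400 <= status <= 499:
        return "client"
    if 500 <= status <= 599:
        return "server"
    return "not an error"


for code in (404, 502, 503, 504):
    print(code, error_side(code))
```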

Update: STILL working on the case with esri 2+ months later... I'm going to have another follow-up call with a developer in the next week, so I may have more updates.

  • In talking with one of the developers for ArcGIS Online, they did suggest trying to decrease the # of records being written (which didn't appear to help for us).
  • The developer also suggested using the append function (Python & REST API) instead of inserting each record directly into the REST service, to reduce load (still working on this). It appears viable for jobs that don't produce real-time or near-real-time data sets, as it requires ArcGIS Pro 2.3+ the way it's written now. The other issue we ran into with the current form is that if the script times out or doesn't finish, the item being appended to the service may not be deleted afterwards, which prevents future jobs from running successfully and the data from updating (there are ways around this, but I haven't had the time to deep dive into it and clean up the script). NOTE that this method is highly suggested for jobs that write a large number of records to ArcGIS Online and are not near-real-time. See the thread here for more information on implementation, and go vote for implementation into the writer here.

If you are still experiencing this issue, give me a shout so I can include your information on my case with esri and let them know others (besides myself) are also encountering this issue. (david.runneals@iowadot.us)

UPDATE 2019/05/30: Just chatted with esri, and we determined that our 5xx errors from ArcGIS Online when reading (query) are caused by Amazon CloudFront (the CDN caching mechanism). esri is going to look into this further. They also mentioned there is a better way of running queries against the server-level cache, which I'll share with the FME Product team once I get it. (For those of you interested in this, here's a great blog post on the caching implementation in ArcGIS Online.)

With regard to writing (applyEdits), they are still pushing the append functionality. Go vote for the idea implementation here. I'm going to be working with them to clean it up and make it more reliable/foolproof.

Hi David,

I too am experiencing the dreaded 503 error when trying to read my hosted feature layer on an ArcGIS Online organizational account. I run this script every minute from FME Server. The weird thing is that it runs fine for many hours of the day, then fails with the 503 error for 30 minutes or longer. It seems to start failing at around 3pm (PST). Our services are hosted on the services5 server and we are in BC, Canada.

Thanks for all the work you are doing trying to resolve this. I would like this to be resolved as the app that uses the service is mission critical (fire dispatch).

Regards,

Cameron

 

Update 2: One thing that esri suggested (although they cautioned against using it in production) is to try appending -nocdn to the host name in your service URL (i.e. https://services-nocdn.arcgis.com) to bypass the CDN, which is where these 5xx errors appear to originate. With a private service, you may have to create a new item with this URL and then save credentials with the item. With a public item, you should be able to use the ArcGIS Feature Service reader and simply point it at https://services-nocdn.arcgis.com.
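For those rewriting URLs in a script, here is a minimal sketch of that substitution using only the Python standard library. The `to_nocdn` helper is hypothetical, and the example path is a placeholder. Whether numbered hosts such as services3 follow the same naming pattern is an assumption I have not verified, so test against your own service first, and keep esri's caution about production use in mind.

```python
from urllib.parse import urlsplit, urlunsplit


def to_nocdn(url: str) -> str:
    """Append '-nocdn' to the first label of the URL's host name."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # Leave the URL alone if there is no host or it was already rewritten.
    if not labels[0] or labels[0].endswith("-nocdn"):
        return url
    labels[0] += "-nocdn"
    netloc = ".".join(labels)
    if parts.port:
        netloc += f":{parts.port}"
    # Note: any user:password@ prefix in the original URL is dropped.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Placeholder path; substitute your own service URL.
print(to_nocdn("https://services.arcgis.com/example/arcgis/rest/services"))
```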

Hello! Have had a ticket open with ESRI since February. Chatted with the dev/product team at the UC and finally got a bug logged. Feel free to contact your support people and have them add you to it, so you can keep tabs on it. It is: BUG-000123780 : Intermittent 503 and 502 errors with 'query' and 'applyEdits' requests to ArcGIS Online hosted feature service

Also per esri's suggestion, Safe implemented a change (still undergoing QA so it's not quite out yet) in build 19610 (2019.1.1) and build 19725 (2019.2) that should help reduce the failure of FME jobs due to these errors.

UPDATE: Per our esri account manager, the AGOL dev team implemented a fix to this yesterday. I have only seen 1 or 2 error notifications since then so it appears like it is working. :) cc @carsonlam @rinfante @gavinpark @bfausel @lauraatsafe

Thanks for the continued updates @runneals. I'll be upgrading to the latest FME version when released and will see if the error notifications have been reduced in our environment.

Would you happen to know if the fix is implemented on FME Server 2019.2?

@mariofederis There really wasn't a "fix", since the errors are all coming from ArcGIS Online. SAFE implemented the improvements suggested by the AGOL team (retry 3 times and then fail). The AGOL dev team also suggested reducing the number of records written in each write request to no more than 100-200 ("Features Per Request"), which should help improve things. I have seen a maximum of 5-10 5xx errors per day, which is a major improvement, although still a problem for those jobs that run less than once a day.
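For scripts that talk to the service outside of FME, the same "retry, then fail" behaviour can be sketched as follows. This is not Safe's implementation: the `ServerError` class, the `with_retries` helper, the one-second delay, and reading "retry 3 times" as three attempts in total are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServerError(Exception):
    """Hypothetical exception carrying the HTTP status of a failed request."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def with_retries(
    request: Callable[[], T],
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on 5xx errors until `attempts` calls have been made."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ServerError as exc:
            # Give up on non-5xx errors, or once the last attempt has failed.
            if not 500 <= exc.status <= 599 or attempt == attempts:
                raise
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

A caller would wrap each write request (ideally one of no more than 100-200 records) in `with_retries`, so that an intermittent 502/503 does not fail the whole job.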

@runneals can you explain the pitfalls of appending the nocdn string? Why would esri caution against it?
