Starting from around Wed, 02 Nov 2022 13:00 GMT, we've been seeing users receive 502 and 504 errors on the following endpoints:
/rest-read/rpc/me
/rest/rpc/last_seen_at
These errors aren't showing up consistently, but when a user does see them, they seem to be pinned to their session. The errors have also been getting worse over the last few hours, rendering our platform more and more unusable.
On our backend (the actual endpoints these redirects/rewrites lead to), I'm only seeing a few 50x errors from earlier in the day, before 06:00 GMT on 02 Nov 2022. My suspicion is therefore that some caching issue is going on here. But since the redirect/rewrite part is fully opaque to me, I can't investigate this further myself.
I collected these request IDs from the response headers of the failed 50x requests to Netlify:
x-nf-request-id: 01GGWK9KW5W53E92Q75631N6T6
x-nf-request-id: 01GGWK9KW5AKDW52AYE8R86BGA
x-nf-request-id: 01GGWN6YCRMQ20JZJPW97ZZ76V
x-nf-request-id: 01GGWN6YCRHSPJE3C2ZMAR48F4
x-nf-request-id: 01GGWPK1Z2YMQFE2FHBP6NARZ0
x-nf-request-id: 01GGWPK1Z28ZPW9W9CFX8YYKCP
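For reference, this is roughly how the IDs above can be captured on the client side. It's only an illustrative sketch, not our actual code; the endpoint path and the logging are simplified:

```typescript
// Minimal sketch: wrap fetch so that any 50x response logs the
// x-nf-request-id header Netlify attaches to its responses.
async function fetchWithRequestIdLogging(
  input: RequestInfo | URL,
  init?: RequestInit,
): Promise<Response> {
  const response = await fetch(input, init);
  if (response.status >= 500) {
    console.error(
      `${response.status} on ${response.url}`,
      `x-nf-request-id: ${response.headers.get("x-nf-request-id")}`,
    );
  }
  return response;
}

// Example usage against one of the affected endpoints:
// await fetchWithRequestIdLogging("/rest/rpc/last_seen_at", { method: "POST" });
```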
Towards the end of the day the number of errors has kept increasing, and the issue is still ongoing!
Hi @Wats0n and sorry to hear about the trouble! From our side, those connections look similar. For 3 of them, we returned a 502 timeout status, since this happened:
visitor connects to your netlify site
our CDN node immediately looks up how to handle the route and finds that you have a proxy redirect configured
it sends the request to your server…
…which fails to answer within 40 seconds, so we stop waiting and return a 502 timeout.
To fix this, ensure that your server starts to send content - at least HTTP response headers - within the first 30 seconds of a request. Presumably you can examine the logs on your server to understand what is causing those requests to be so slow there, since this is unlikely to be the experience you intend your visitors to have.
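In case it helps with checking that, here is a minimal sketch (the origin URL is a placeholder for your backend, called directly and bypassing Netlify) that measures how long the origin takes to return response headers - the part that has to complete before the proxy timeout:

```typescript
// Minimal sketch: fetch resolves as soon as response headers arrive,
// so the elapsed time approximates time-to-first-byte of the origin.
// The URL below is a placeholder - point it at your backend directly.
async function timeToHeaders(url: string): Promise<void> {
  const start = Date.now();
  const response = await fetch(url, { method: "GET" });
  const elapsedMs = Date.now() - start;
  console.log(`${url} -> ${response.status} after ${elapsedMs} ms (headers received)`);
}

// await timeToHeaders("https://origin.example.com/rest-read/rpc/me");
```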
The other 3 requests were all POST requests, which returned 504s. I don't see these reaching your server at all, which is unexpected, and I am not sure why that might have happened. I can see that, out of all requests to https://app.talentspace.io/rest/rpc/last_seen_at in the past week, there were over 115k successful accesses and only a handful of failures like this before today. There were quite a few today (hundreds), but I am not sure why; they do seem to have wrapped up around 4 hours ago.
Can you let me know for which x-nf-request-id (and timestamp) you're seeing the request time out because our server is not responding?
I had a quick look and it looks like none of them are reaching our server (ALB or WAF).
And will you be investigating further why the 504s failed on your end without reaching our server?
Hi @amelia thank you for continuing the investigation.
The problem is unfortunately still not resolved. I've set up dedicated monitoring for it and still see it happening today.
My original thought was that Netlify has a bad cache entry (the 504 response) that it sporadically returns. So I changed the redirect from /rest/rpc/last_seen_at to /v1/rest/rpc/last_seen_at, but this did not help:
As you can see, the proxied redirects fail with a 504 error, but the direct request succeeds without error. And as mentioned before, there are no 50x errors visible on my endpoint.
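A minimal sketch of that comparison follows. The direct backend URL, headers, and request body are placeholders; the proxied URL is the one mentioned above:

```typescript
// Minimal sketch: send the same request once through the Netlify proxy
// and once directly against the backend, logging status and request ID.
async function compareProxiedAndDirect(): Promise<void> {
  const targets = {
    proxied: "https://app.talentspace.io/v1/rest/rpc/last_seen_at",
    direct: "https://backend.example.com/rest/rpc/last_seen_at", // placeholder
  };

  for (const [label, url] of Object.entries(targets)) {
    const response = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: "{}", // whatever payload the RPC endpoint expects
    });
    console.log(
      `${label}: ${response.status}`,
      `x-nf-request-id: ${response.headers.get("x-nf-request-id") ?? "n/a"}`,
    );
  }
}
```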