Netlify Function invocations fail with 502 error ~0.1% of the time

I’ve observed that ~0.1% of my app’s Netlify function invocations fail with a 502, and the response headers suggest the error is coming from Netlify itself rather than from any intermediate server. Details below. This is a higher failure rate than I’d expect from a production service. I handle it gracefully with retries, of course, but I’d prefer to see the rate much lower and would like to understand what’s causing it.

Details
The 502 responses consistently have details like this:

{
  "status": 502,
  "body": "",
  "headers": [
    { "value": "Netlify", "name": "Server" },
    { "value": "01FWYDPYSG7948J52F2P6MCB3S", "name": "X-Nf-Request-Id" },
    { "value": "Sun, 27 Feb 2022 20:24:51 GMT", "name": "Date" },
    { "value": "0", "name": "Content-Length" }
  ]
}

Here are some example X-Nf-Request-Id values for 502s from the last few days:

  • 01FWYDPYSG7948J52F2P6MCB3S
  • 01FWS996KVNXBE2THMTXQT85DS
  • 01FWRNX159J8N33X3TJ60YWZ2H
  • 01FWS2N7MSF07SN9X6MYPACA47
  • 01FWQHNXRBNGM7HG5RNWX05GJD

My investigation so far has found the following:

  • It’s not specific to a particular function. I’ve observed occasional 502s in several functions.
  • It’s not specific to a particular Netlify site. I have two Netlify sites, and all functions in both experience occasional 502s. (The examples above are all from one site, for simplicity.)
  • It’s not due to my own code. My logs show that for these occasional 502s, the Netlify function isn’t being invoked at all.
  • It doesn’t appear to be due to any intermediate server between the function caller and Netlify. I see the occasional 502s for calls originating from different sources, i.e., both from my own backend within my cloud provider and from an external third-party service.
  • I’ve been unable to reproduce it myself. I created a new test function on the same site as the examples above and invoked it ~25k times via a local script with 5 requests in flight at a time, and none of those requests returned a 502. At the ~0.1% rate I’m observing, 25k requests should have produced roughly 25 failures, so seeing none suggests the issue isn’t a simple constant probability of failure but is tied to some other factor (like overall Netlify load). A sketch of that test script follows this list.
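
For reference, here’s roughly what that test script looks like. This is a simplified sketch of my own script, not anything Netlify-specific; it assumes Node 18+ for the built-in fetch, and the function URL is a placeholder:

// Simplified sketch of the load test: invoke the function TOTAL times
// with CONCURRENCY requests in flight, tally status codes, and log the
// X-Nf-Request-Id of any 502 so it can be looked up later.
// FUNCTION_URL is a placeholder, not my real endpoint.
const FUNCTION_URL = "https://example-site.netlify.app/.netlify/functions/test";
const TOTAL = 25_000;
const CONCURRENCY = 5;

const statusCounts = new Map<number, number>();
let issued = 0;

async function worker(): Promise<void> {
  while (issued < TOTAL) {
    issued++;
    const res = await fetch(FUNCTION_URL);
    statusCounts.set(res.status, (statusCounts.get(res.status) ?? 0) + 1);
    if (res.status === 502) {
      console.log("502:", res.headers.get("x-nf-request-id"));
    }
  }
}

async function main(): Promise<void> {
  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));
  console.log(Object.fromEntries(statusCounts));
}

main().catch(console.error);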

Hi there, thanks for your report on this.

We’re taking a look at what you’ve described. We don’t have anything specific to share yet, but we are actively investigating.

Is the problem still occurring?

Hi Perry,

Yes, it’s still occurring. In fact, it’s been going on for at least a few weeks, but I only now decided to investigate it more closely and raise it here.

Here are some additional examples from yesterday (out of several that occurred):

  • 01FX3NDP85ZBV1TTKP6VTN0CZS
  • 01FX3DHEAX1674WAFXCHA4J3FS
  • 01FX2S8JP3S467FF0YFZKJTGTV

Hi @AndrewK, thank you for your patience while we investigated!

Our logs show “read: connection reset by peer” errors when looking up the request IDs you’ve shared. We suspect that something on your end may not be respecting DNS TTL, and that the 502s happen when we occasionally restart ATS due to memory pressure.

Would you be able to take a look on your end to see if that is the case and keep us posted?

Hi Audrey, thanks for investigating and for the reply.

I’ll see what I can find on my end. However, as I mentioned, my Netlify function invocations come from two different sources: some from my own service and some directly from an external service. Since both sources experience 502s, it would surprise me if both had the same issue. Nevertheless, I’ll take a look.

I haven’t found anything on my side yet.

To make sure I understand your hypothesis, are you saying the following?

  • Netlify is running ATS (Apache Traffic Server)
  • From time to time, Netlify restarts ATS. I assume you do this in a blue-green deployment, i.e., first start up a new ATS instance and transition traffic to it by repointing DNS records, then stop the old one
  • You suspect my service is not respecting DNS TTL, and as a result, it is sometimes hitting the old ATS instance (either while or after it is stopped)?

Hi Andrew,

Netlify is running ATS (Apache Traffic Server).

Without going too deep into deployment methodology, we drain connections to existing servers and take them out of rotation before doing any planned operations that would affect visitors. Since our DNS TTL is 20 seconds, if you reuse a connection or the results of a DNS lookup for much longer than that, it can lead to downtime that only affects you, since “real visitors” will be routed appropriately based on their repeated DNS lookups.
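
For illustration only (this isn’t something we prescribe, and the right fix depends on your stack): one way a client can avoid holding connections or cached DNS results past that TTL is to recycle its keep-alive agent periodically. A minimal sketch using only core Node APIs, with an arbitrary refresh interval:

// Illustration only: recycle the HTTP keep-alive agent so no pooled
// connection (or the DNS result behind it) is reused much past the
// 20-second TTL. Uses only core Node https.Agent APIs; the refresh
// interval here is an arbitrary choice for this sketch.
import { Agent } from "node:https";

const DNS_TTL_MS = 20_000;

let agent = new Agent({ keepAlive: true });

setInterval(() => {
  agent.destroy();                        // close pooled sockets
  agent = new Agent({ keepAlive: true }); // new requests reconnect and re-resolve DNS
}, DNS_TTL_MS).unref();

// Pass currentAgent() as the `agent` option on each outgoing request.
export function currentAgent(): Agent {
  return agent;
}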

I hope this helps, and please keep us posted on what you find.


Hi Audrey,

I’m now able to reproduce the issue reliably. For context, I initially wasn’t able to reproduce it because I was invoking the deployed function directly from my local machine. After your message, I tried invoking the function from my local machine through my production service as a proxy, and that does produce occasional 502s (at about the 0.1% rate I observe). This is consistent with your hypothesis that somewhere in my production service’s networking stack there’s a stale connection or DNS lookup.

Unfortunately, I don’t have visibility into the resolved IP address that my production service uses for each function invocation. I do plan to investigate that, but it will be difficult, and before I do, I’d like to further validate the hypothesis. Here’s an idea for how:

If I provided you with, say, 200 recent Netlify request IDs for function invocations made over a short interval, one or two of which failed with a 502, would you have enough visibility on your side to identify the IP that each of those requests hit? We could then see whether the one(s) that failed with a 502 hit a different IP than the ones that succeeded. (If possible, you could also indicate whether the IPs behind the 502s were indeed known to be stale.)
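
To make the check concrete, here’s a sketch of the comparison I have in mind; requestIdToIp is hypothetical and stands in for whatever mapping you can pull from your logs:

// Sketch of the correlation check. requestIdToIp is a hypothetical map
// from X-Nf-Request-Id to the IP the request hit, as pulled from
// Netlify's logs; the client records come from my own request log.
type ClientRecord = { netlifyRequestId: string; httpStatusCode: number };

function checkStaleIpHypothesis(
  records: ClientRecord[],
  requestIdToIp: Map<string, string>,
): void {
  const ipsFor = (status: number): Set<string> =>
    new Set(
      records
        .filter((r) => r.httpStatusCode === status)
        .map((r) => requestIdToIp.get(r.netlifyRequestId))
        .filter((ip): ip is string => ip !== undefined),
    );

  const failedIps = ipsFor(502);
  const succeededIps = ipsFor(200);

  for (const ip of failedIps) {
    // If an IP that served a 502 also served successes in the same window,
    // that's evidence against the stale-IP hypothesis.
    console.log(ip, succeededIps.has(ip) ? "also served successes" : "only seen on failures");
  }
}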

Hi Andrew,

It’s a pretty manual process to trace a function invocation back through the CDN to the IP it hit. Would you be able to send the requests using a custom UA to help us filter them? Also, would it be possible to reduce the number of request ID lookups (200 requests seems like a lot :slight_smile: )?


Hi Audrey,

Thanks for being open to my suggestion. Understood that since this involves manual work on your side, mapping 200 request IDs to IPs would be a lot. I’ve included an example below with just 30, which I hope is more manageable.

I’m not able to set a custom user-agent string, but as you’ll see below, I’m using Hasura, and the user-agent in all of the requests is hasura-graphql-engine/v1.3.3. That’s not unique to me, but I hope it’s uncommon enough on your side to help with the filtering. I can also set an arbitrary HTTP header if that would be helpful.

To generate this example, I made 200 requests with a concurrency of 10. All of these requests went through the same Hasura server as a proxy to invoke the Netlify function. I ran batches of 200 a few times before ultimately observing a 502 error (consistent with the ~0.1% overall failure rate I’ve been seeing). In the run that generated a 502, it occurred about a third of the way through. From those 200 requests, I’ve included below just the one request that failed along with the successful requests sent shortly before and after, for a total of 30. Hopefully that’s a manageable number.
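
For reference, the harness is a sketch along these lines; the proxy URL and request shape are placeholders, and getNetlifyRequestId() stands in for however the X-Nf-Request-Id is surfaced back to my script in my setup:

// Sketch of the CSV capture: 200 requests with 10 in flight, each routed
// through the production proxy, one CSV row per request.
// PROXY_URL is a placeholder, and getNetlifyRequestId() stands in for
// however the upstream X-Nf-Request-Id comes back in my setup.
const PROXY_URL = "https://example-proxy.example.com/invoke";
const TOTAL = 200;
const CONCURRENCY = 10;

const rows: string[] = [];
let issued = 0;

function getNetlifyRequestId(res: Response): string {
  return res.headers.get("x-nf-request-id") ?? ""; // placeholder plumbing
}

async function worker(): Promise<void> {
  while (issued < TOTAL) {
    issued++;
    const requestAt = new Date().toISOString();
    const res = await fetch(PROXY_URL, { method: "POST" });
    const responseAt = new Date().toISOString();
    rows.push([requestAt, responseAt, res.status, getNetlifyRequestId(res)].join(","));
  }
}

async function main(): Promise<void> {
  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));
  console.log("requestAt,responseAt,httpStatusCode,netlifyRequestId");
  for (const row of rows) console.log(row);
}

main().catch(console.error);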

As you’ll see, the failed request doesn’t obviously stand out from the successful ones aside from having a longer response time. Several requests were issued just a few milliseconds before and after the failed one, and they succeeded. That doesn’t seem consistent with the stale IP hypothesis, but let’s see whether the IP address for the failed request is different from the surrounding successful ones.

Note: the timestamps below won’t exactly match what you see, since they’re from my own machine’s clock and include additional network latency.

CSV:

requestAt,responseAt,httpStatusCode,userAgent,netlifyRequestId
2022-03-23T01:31:44.808Z,2022-03-23T01:31:45.088Z,200,hasura-graphql-engine/v1.3.3,01FYT6DDVS12S043A457QM54JH
2022-03-23T01:31:44.809Z,2022-03-23T01:31:45.082Z,200,hasura-graphql-engine/v1.3.3,01FYT6DDVR0NK1WFG44M9H7EVX
2022-03-23T01:31:44.817Z,2022-03-23T01:31:45.113Z,200,hasura-graphql-engine/v1.3.3,01FYT6DDWQWRWF98FY9QD875KF
2022-03-23T01:31:44.819Z,2022-03-23T01:31:45.114Z,200,hasura-graphql-engine/v1.3.3,01FYT6DDWPWT9VPH5E7ZHMBPZN
2022-03-23T01:31:44.820Z,2022-03-23T01:31:45.115Z,200,hasura-graphql-engine/v1.3.3,01FYT6DDWQV36DZM8MBWJVRV56
2022-03-23T01:31:44.823Z,2022-03-23T01:31:45.117Z,200,hasura-graphql-engine/v1.3.3,01FYT6DDWP0THBSKDJV6S01SK7
2022-03-23T01:31:44.825Z,2022-03-23T01:31:45.111Z,200,hasura-graphql-engine/v1.3.3,01FYT6DDWBQFHDDCJMG0ETC93M
2022-03-23T01:31:45.060Z,2022-03-23T01:31:45.336Z,200,hasura-graphql-engine/v1.3.3,01FYT6DE42HWWQJHRTP70X445Z
2022-03-23T01:31:45.076Z,2022-03-23T01:31:45.387Z,200,hasura-graphql-engine/v1.3.3,01FYT6DE5B0QG2C652PV6079DR
2022-03-23T01:31:45.082Z,2022-03-23T01:31:45.367Z,200,hasura-graphql-engine/v1.3.3,01FYT6DE59XGR34VT8ECVC99YF
2022-03-23T01:31:45.088Z,2022-03-23T01:31:45.382Z,200,hasura-graphql-engine/v1.3.3,01FYT6DE5AWPAWA8FPNRQA2Y0X
2022-03-23T01:31:45.102Z,2022-03-23T01:31:45.397Z,200,hasura-graphql-engine/v1.3.3,01FYT6DE5X18MAC8XBJTAS38P1
2022-03-23T01:31:45.111Z,2022-03-23T01:31:45.372Z,200,hasura-graphql-engine/v1.3.3,01FYT6DE5AK26ZRPNDY1RWAE4K
2022-03-23T01:31:45.113Z,2022-03-23T01:31:45.392Z,200,hasura-graphql-engine/v1.3.3,01FYT6DE5K81D441M3S9MZ5SP3
2022-03-23T01:31:45.115Z,2022-03-23T01:31:45.887Z,502,hasura-graphql-engine/v1.3.3,01FYT6DE62EZ1RK1T21X7D093Z
2022-03-23T01:31:45.116Z,2022-03-23T01:31:45.390Z,200,hasura-graphql-engine/v1.3.3,01FYT6DE5KW2HG7Q5D7N4TNRAY
2022-03-23T01:31:45.117Z,2022-03-23T01:31:45.385Z,200,hasura-graphql-engine/v1.3.3,01FYT6DE5XHPN0DGRVBA9RNBXG
2022-03-23T01:31:45.338Z,2022-03-23T01:31:45.614Z,200,hasura-graphql-engine/v1.3.3,01FYT6DECN3XCPAS72WAB7YXMG
2022-03-23T01:31:45.368Z,2022-03-23T01:31:45.650Z,200,hasura-graphql-engine/v1.3.3,01FYT6DED9WR1TYDQ04M7CG91E
2022-03-23T01:31:45.373Z,2022-03-23T01:31:45.657Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEDAWNQZ29QEW2X16EX0
2022-03-23T01:31:45.382Z,2022-03-23T01:31:45.662Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEDMVMF6CA3RYEEY001H
2022-03-23T01:31:45.385Z,2022-03-23T01:31:45.671Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEDY9VEESTN1REBVE0V0
2022-03-23T01:31:45.387Z,2022-03-23T01:31:45.675Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEDZ3JC0WJEQH49CKRQP
2022-03-23T01:31:45.390Z,2022-03-23T01:31:45.678Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEDWVEXSF68XQG2J3ZTN
2022-03-23T01:31:45.392Z,2022-03-23T01:31:45.673Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEDZMXXT64TPK9PTKZSM
2022-03-23T01:31:45.397Z,2022-03-23T01:31:45.698Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEF5KAG87DFC7TPR1RES
2022-03-23T01:31:45.615Z,2022-03-23T01:31:45.890Z,200,hasura-graphql-engine/v1.3.3,01FYT6DENMCYZ363AZJQB5MVXW
2022-03-23T01:31:45.651Z,2022-03-23T01:31:45.925Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEPH64WCKH5B8XNR21KV
2022-03-23T01:31:45.658Z,2022-03-23T01:31:45.918Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEPH7K2GF8K08EDX2Q50
2022-03-23T01:31:45.662Z,2022-03-23T01:31:45.922Z,200,hasura-graphql-engine/v1.3.3,01FYT6DEPHNP71SMAFJ56764VZ

Hi Audrey,

Just wanted to check in on whether the example I provided is helpful. I’m also happy to generate a fresh one if the above has already fallen out of your log retention.

Thanks.

Hi Andrew,

Very sorry for the late reply! Thank you for sharing your data with us and for making it more manageable :slight_smile:

As you anticipated, the timestamps didn’t match exactly, but I was able to search within a window around them and pulled a sample of 10 requests (9 successful and 1 failed). The 502 occurred 2 minutes after a successful request and milliseconds before the next successful request with the same IP.

Timestamps are in EST:

@timestamp=1648001677.364 status=200 address=3.130.180.148 
@timestamp=1648001677.669 status=200 address=3.130.180.148
@timestamp=1648001737.710 status=200 address=3.139.139.191
@timestamp=1648001738.001 status=200 address=3.139.139.191
@timestamp=1648001798.089 status=200 address=3.130.180.148
@timestamp=1648001799.160 status=200 address=3.130.180.148
@timestamp=1648001858.371 status=502 address=3.139.139.191
@timestamp=1648001858.565 status=200 address=3.139.139.191
@timestamp=1648001918.943 status=200 address=3.139.139.191
@timestamp=1648001919.398 status=200 address=3.139.139.191

The hypothesis could still hold true here.


Thanks Audrey.

In my example, I included the Netlify request IDs in the last column; they may be scrolled out of view to the right. Sorry for not making that clearer.

Is it possible to get the IP address hit by those specific request IDs? That would be the best way to provide evidence for the stale IP hypothesis, because those requests were all issued over a very short interval.

Or is it only possible to identify the traffic by searching for approximately matching timestamps? If the latter, then I agree this path of investigation will be very difficult, since I have plenty of regular production traffic on top of these small tests that I’m doing.

Hey @AndrewK,

I’ve listed the IP addresses for the requests in your list above (just the first 10, as the query is pretty time-consuming to run individually):

52.73.153.209
52.203.36.44
52.203.36.44
52.73.153.209
52.73.153.209
52.73.153.209
52.203.36.44
52.73.153.209
52.73.153.209
52.203.36.44

I also tried a more generalised query based on your account ID and user agent, and these two IP addresses were the ones used most often.

Hi Hrishikesh,

Thanks for your reply.

It’s helpful to have these first 10, but to properly investigate Audrey’s hypothesis, we need to specifically look at the request that resulted in the 502 (01FYT6DE62EZ1RK1T21X7D093Z, which is in the middle of the list I provided) and the surrounding successful requests.

I realize it’s time consuming to query these individually. Nevertheless, would it be possible to find the IP hit by the request that 502’d (01FYT6DE62EZ1RK1T21X7D093Z) as well as a few of the surrounding successful requests? It’s especially important to see the IP address of the successful requests after the one that 502’d, because if any of those successful requests hit the same IP as the one that 502’d, that would be evidence against the hypothesis we’re investigating.

Hey there, @AndrewK :wave:

Thanks for your patience here.

We already have an issue open regarding 502s reported by users. There are multiple different error messages tracked in that issue, and we have added your situation to it as well. The developers have been working on fixes and have made some progress, but the situation is not completely resolved yet.

We will let you know or post an update once it’s completely resolved. Again, thanks for your patience here!

Hi Hillary, thanks for the update.

I’ll be on the lookout for that update/post when the issue is resolved, and then I’ll test again to see whether I can still reproduce the issue.


Sounds good, thanks so much!

Hey @AndrewK, apologies for the delay! We’ve heard from engineering and they have made a change that should help. Can you let us know whether you’re still seeing this behavior?


Hi @amelia, thanks for the update.

I haven’t seen any change in the rate of 502s over the last two weeks through yesterday.

However, today I haven’t had any new 502s (whereas on most days I would have by now). Maybe that’s because the change you alluded to was released today?

Also, I can no longer reproduce the 502s using the test script I was using to generate examples for my earlier posts. That’s encouraging.

Tentatively, it does look like the issue is resolved, but I’d like to see whether any additional cases crop up over the next few days.