Unresponsive Netlify Function (never executed, connection timeouts)

I am facing a hard-to-trace issue with Netlify function timeouts. The function is called as a webhook from Sanity.io, and is frequently timing out - not in the expected 10s execution-limit way, but by being completely unresponsive: it logs nothing, returns no errors, and seemingly isn’t executed at all.

The site in question is arthaus-tickets, hosted on tickets.arthaus.mt. This is a SvelteKit application with a number of API endpoints exposed (via SvelteKit itself). All of the endpoints are rolled into one render function, as is standard for any SvelteKit installation. The endpoint I’m mainly having trouble with is hosted under tickets.arthaus.mt/api/invoice, and is called as a webhook from Sanity.io. Multiple times over the past week or so (and the frequency seems to be increasing), this URL is hit but times out. As far as I can tell, this is not an issue with the 10s execution limit - nothing is logged to the console (the function logs as soon as it is called), there is no 10s timeout error, no error payload is returned, and the connection times out after 30s. Calling the API manually from my local machine, via Postman, executes the same function perfectly well (logging and all).
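For reference, here is a hypothetical sketch of the handler’s shape (the real endpoint is a SvelteKit route; I’m using plain Request/Response here so the sketch stands alone, and the function name and payload fields are placeholders). The relevant detail is that it logs on entry, so if nothing shows up in the function logs, the request never reached the handler at all:

```typescript
// Hypothetical sketch of a /api/invoice-style webhook handler.
// Names and payload fields are placeholders, not the real implementation.
export async function handleInvoiceWebhook(request: Request): Promise<Response> {
  // Log immediately on entry: if this line never appears in the function
  // logs, the request never reached the handler at all.
  console.log(`[invoice] webhook received at ${new Date().toISOString()}`);

  if (request.method !== 'POST') {
    return new Response('Cannot accept GET requests', { status: 405 });
  }

  const payload: any = await request.json();
  console.log('[invoice] processing document:', payload._id);

  // ... generate and send the invoice here ...

  return new Response(JSON.stringify({ ok: true }), {
    status: 200,
    headers: { 'content-type': 'application/json' },
  });
}
```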

A separate endpoint also handles Stripe webhook events, and I’ve noticed that these occasionally time out too. It seems that the Stripe retry mechanism may leave longer gaps between attempts, which could explain why its timed-out webhook posts eventually succeed, while the Sanity.io ones are either very delayed or fail completely.

This is proving impossible to debug, as there is no information in the logs for any of the failed runs, and the failures are sporadic (so it’s not a misconfigured URL or anything like that).

I’m quite at a loss as to where to look next.

I get an error saying the endpoint cannot accept GET requests, and I’m not sure what a POST (or any other method) should look like with data. Could you let us know how to reproduce the behaviour?

That’s the issue - it’s super hard to reproduce. The endpoint only accepts POST requests, but when I make a POST from my machine it goes through every time I’ve tried. A fair number of the webhook requests also go through, but some seem to be inexplicably swallowed by the void. I lost a few this morning. I’ve actually just checked, and the behaviour has repeated itself - a booking made at 16:44 UTC is trying to post the webhook payload, but keeps timing out. I’ve attached the last two failed logs from Sanity’s side here:

As you can see, the request is timing out after 30s, but no error is returned. The function logs don’t indicate anything (though the same function is called any time the SvelteKit site is accessed, so the logged runs could all be general page loads):

When the API endpoint does run, it logs something like this:

Nothing indicates a failed execution, it’s as if the network requests themselves aren’t getting through. Could the IPs be hitting some sort of firewall or rate limiting? The requests should be originating from these IPs:
34.79.12.229
35.205.99.116
35.190.215.189
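To work out which of these actually reach the function, I’ve started logging the source IP of every incoming request, along these lines (a sketch: `x-nf-client-connection-ip` is the header Netlify uses to pass the client’s IP, and falling back to `x-forwarded-for` is an assumption for other setups):

```typescript
// Sketch: log the source IP of each incoming request so we can see whether
// the 35.x addresses ever reach the function at all.
const SANITY_IPS = new Set(['34.79.12.229', '35.205.99.116', '35.190.215.189']);

export function logSourceIp(request: Request): string {
  // Netlify passes the caller's IP in x-nf-client-connection-ip;
  // x-forwarded-for is a fallback assumption for other environments.
  const ip =
    request.headers.get('x-nf-client-connection-ip') ??
    request.headers.get('x-forwarded-for')?.split(',')[0].trim() ??
    'unknown';
  console.log(`[webhook] request from ${ip} (known Sanity IP: ${SANITY_IPS.has(ip)})`);
  return ip;
}
```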

Update:
After manually triggering the same endpoint with Postman, the function completed and returned a 200 status code in 3.11s.

I manually triggered the webhook for another failed booking, and once I had done that, the Sanity webhook for that endpoint succeeded (resulting in a double execution, which is not great).
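To stop manual triggers and late retries from double-executing, I’ll probably need an idempotency guard, something like this sketch (the in-memory Set is only a placeholder - serverless instances don’t share memory, so a real version would need a shared store such as a database):

```typescript
// Sketch of an idempotency guard to avoid double execution when a manual
// trigger and a retried webhook both land for the same booking.
// NOTE: the in-memory Set is a placeholder for illustration only; serverless
// instances don't share memory, so a real deployment needs a shared store.
const processed = new Set<string>();

export function shouldProcess(bookingId: string): boolean {
  if (processed.has(bookingId)) {
    console.log(`[invoice] skipping duplicate webhook for booking ${bookingId}`);
    return false;
  }
  processed.add(bookingId);
  return true;
}
```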

Maybe Sanity isn’t able to connect to us at all? Do you have any response headers from the Sanity dashboard that show an x-nf-request-id, or perhaps the IPs Sanity is using?

It seems like it, for the instances where the webhooks fail. The IP addresses are in my previous message. No request headers, I’m afraid, and I’m not sure if I’d be able to get them from Sanity, but I could try to get in touch with their support. Then again, the requests seem to be timing out completely - there is no body on the failed ones.

Can we start by checking for successful connections from the above IPs? If we can only find successful connections from one or two of them, it would indicate that there’s an issue with one of the IPs communicating with Netlify, which would explain why the issue is so sporadic.

I’ve done some more research on my end, and it seems that events from the two 35.x IPs aren’t coming in at all. I’m not sure if this is a coincidence or part of the problem, given I’ve only been gathering this additional data for a day.

Any updates on this please?

Sorry for the delay. It would appear that the Sanity IPs might be getting blocked, but we don’t know why yet. We’ll confirm that with the devs and let you know.

Hi @james-camilleri,

After checking with the devs, we can reliably say that we do see traffic from one of those IPs (34.79.12.229). The other two, 35.205.99.116 and 35.190.215.189, have no blocks, but we also don’t see them at all: no connection attempts from those IPs have reached Netlify.

I’ll follow up with Sanity to see if there are any issues on their side, although they’ve already said that everything’s ok on their end.


Hey James!

I just wanted to get in touch here to let you know that we followed up with the devs to do some more digging into this one, and our networking team found that we did indeed have a block on one of the IPs you listed above. I truly apologize for the hassle, and for us not identifying this in our logs sooner. We’ve allow-listed these IPs - can you let us know if you’re still running into the same issue?

Sorry for coming back to this so late - I got caught up in the Christmas bustle. The application is less active now (we’re not selling tickets for any events currently), so it’s a bit harder to test. I will try to execute some transactions to see if we hit any issues, however.

Is there a way to ensure this won’t happen again? Or to make it easier to debug if it does? Sanity.io sometimes change their webhook IP addresses (according to their documentation, at least), and this problem essentially meant I had to manually execute a number of steps in an automated checkout system during a peak sales period, which was naturally rather troublesome, so I’d like to avoid it in the future.

We’ve currently added Sanity’s IPs (sanity.io/files/webhooks-egress-ips.txt) to the allow list. If Sanity changes those IPs, unfortunately this would break again.

Sanity recommend that the text file you linked to be ingested in an automated manner, as they always update it when their origin IPs change (see their GROQ-powered webhooks documentation). Is that something you would be able to configure?
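Ingesting that file could look something like this sketch (one IP per line, with blank lines and #-comments skipped, is an assumption about the file’s format):

```typescript
// Sketch of automatically ingesting Sanity's published egress-IP list.
// Assumes the file is plain text with one IP per line; blank lines and
// #-comments are skipped defensively.
export function parseIpList(text: string): string[] {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith('#'));
}

export async function fetchSanityEgressIps(): Promise<string[]> {
  const res = await fetch('https://www.sanity.io/files/webhooks-egress-ips.txt');
  if (!res.ok) {
    throw new Error(`Failed to fetch IP list: ${res.status}`);
  }
  return parseIpList(await res.text());
}
```

A scheduled job could then diff the parsed list against the current allow list and apply any changes.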

The devs are evaluating automatically updating the IPs. We don’t currently do this for Sanity, but we’ll try to implement it.

This is now done. Sanity IPs will be auto-updated.

That’s great news, thanks for letting me know!