URL Rewrites: Instant cache miss (empty page) or 502 response

Hello!

Since a couple of days ago we’re getting a lot of errors from one of our sites (gh-wxp-web).
We have a custom domain (goodhabitz.com) configured with several rewrite rules as we’re progressively migrating from a legacy site to our new setup.

We can consistently reproduce this right now, our tests are returning conneciton closed or timeouts when trying to reach our site via our custom domain + matching a rewrite rule. When visiting the upstream server everything works correctly.

An example is the following:

When cURL verbose we see the following:

❯ curl -vI https://goodhabitz.com/pl-pl
*   Trying 75.2.60.5:443...
* Connected to goodhabitz.com (75.2.60.5) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=careers.goodhabitz.com
*  start date: Jan  1 12:30:23 2024 GMT
*  expire date: Mar 31 12:30:22 2024 GMT
*  subjectAltName: host "goodhabitz.com" matched cert's "goodhabitz.com"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://goodhabitz.com/pl-pl
* [HTTP/2] [1] [:method: HEAD]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: goodhabitz.com]
* [HTTP/2] [1] [:path: /pl-pl]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> HEAD /pl-pl HTTP/2
> Host: goodhabitz.com
> User-Agent: curl/8.4.0
> Accept: */*
>

Eventually it just hangs there.
When retrying, sometimes we just get:

❯ curl -I https://goodhabitz.com/pl-pl/
curl: (56) Failure when receiving data from the peer

It is important to note that this setup was working without any problems for many months now. We have not changed anything in the recent days that could’ve caused this.

Can you help us check some of the internal logs to understand what’s happening with the rewrites?

Thanks

I’m getting a 200. Is this resolved? If not, could you let us know how to trigger this behaviour or share any request IDs?

Hey @hrishikesh!

Thanks for replying. Indeed by the time you replied things were looking good again. However, we’re having the same issue just now with another domain pointing to the same site (same behaviour):

❯ curl -I https://careers.goodhabitz.com/
HTTP/2 200
age: 0
cache-status: "Netlify Edge"; fwd=miss
content-type: text/plain; charset=utf-8
date: Mon, 29 Jan 2024 08:54:15 GMT
server: Netlify
strict-transport-security: max-age=31536000
x-nf-request-id: 01HNA6T8TB20V4P8BNEGTJFJJB
content-length: 0

Instant reply with cache-status: "Netlify Edge"; fwd=miss, however I’d expect the request to the upstream server to take place. The upstream server replies correctly:

❯ curl -I https://goo-careers-prod.azurewebsites.net/
HTTP/2 200
content-type: text/html; charset=utf-8
date: Mon, 29 Jan 2024 08:56:36 GMT
server: Microsoft-IIS/10.0
cache-control: public,max-age=3600
strict-transport-security: max-age=31536000; includeSubDomains; preload
request-context: appId=cid-v1:19f62e17-ecd4-4852-b6cd-045ea8c1760c

We would really appreciate your help with this matter as it is preventing all our public websites from serving correctly via Netlify

Are you saying that, that’s not happening? Based on my testing and the internal logs, this seems to be working fine.

hey again @hrishikesh, it is working again indeed.

When I made these posts it was consistently not working and nothing changed on our side since then, it just started working again. With the request Id’s (I may have more around) would you happen to have a chance to see what a given request is doing? I’d be specially keen on understanding why the direct “fwd=miss” reply with no attempt to load from upstream.

The business is asking me to provide more clarity during our post-mortem and right now the only thing I can say is: We don’t know because it is unclear why the Netlify CDN started responding this way.

Any input would be more than appreciated, also to prevent this from happening in the future.

Thanks!

For example we’re having some reports right now from Australia where the Netlify CDN is responding immediately with cache miss to our main local url GoodHabitz | Empower your people | GoodHabitz Online Training

age: 1
cache-status: "Netlify Edge"; fwd=miss
content-length: 0
content-type: text/plain; charset=utf-8
date: Wed, 31 Jan 2024 15:54:55 GMT
server: Netlify
strict-transport-security: max-age=31536000
x-nf-request-id: 01HNG3NZ11HABBANHER739TEXY

Can you help us identify what’s going on on that Request Id?

I think I have a clue and would greatly appreciate if you could help me understand this better @hrishikesh. There’s a piece of the puzzle I think only someone inside Netlify can help me confirm.

For full context, as I mentioned above, we have two legacy sites (let’s call them usptreamA.example.net and upstreamB.example.net). We also have two domains we pointed to Netlify: (www.)domain.com and careers.domain.com.

Our Apex should serve the new site (fully from Netlify), but we have several rewrite rules for local sites (sub-paths) under our main domain (eg. domain.com/en-gbupstreamA.example.net/en-gb) to serve the legacy site in parallel while we migrate fully away from it. The same goes for the careers subdomain (i.e. careers.domain.comupstreamB.example.net)

Now, as I mentioned this setup has worked beautifully for the last year, it has been for us a showcase of the power of Netlify in the organisation.

As I mentioned at the beginning of my post, we started having reports that this setup wasn’t working as expected some time between Tue and Wed last week, we didn’t change anything in our side that could’ve caused this and quickly started reviewing our _redirects configuration to see if anything looked out of the ordinary.

One thing we have noticed (and since fixed) is that the domains upstreamA.example.net and upstreamB.example.net where added as custom domains in the Netlify configuration panel. Of course these domains never pointed to Netlify and also were never intended to, they are our upstream servers for the legacy sites and will remain where they are.

We believe these were mistakenly added (during the initial setup of the redirect months ago) because of a misunderstanding. Never causing a problem before. Removing them seems to bring things back to normal.

My hunch here is that perhaps something improved in how Netlify does DNS for redirects/rewrites and that was causing the issues we saw this last week. For example perhaps now the domains configured in the site are automatically resolved from configuration to optimize response times (?).

Do you think this may have been the cause of this issue? If so, could you provide a little bit more insight into this? or if not please help us get in touch with someone inside Netlify? I tried reaching out support via the website but I don’t have a response yet.

I’m not sure why you feel the fwd=miss relates to load from upstream. We can load the data from upstream and still have fwd=miss.

Regarding the domains and DNS, unless the DNS points to Netlify, we don’t do anything for that domain (or rather, we can’t). The traffic would simply not hit Netlify for us to be able to do anything with it.

In all of the request IDs you have shared so far, I can only see we served a 200 succesfully. If we had encountered an error during proxy, we would have returned an error. But perhaps, it’s related to this thread: Proxy redirect have stopped working?