[Static web generators] - How to handle file-name changes on a new build with atomic deploys

I’m using https://nextjs.org/ as a static site generator (but I assume this also happens with Hugo/Gatsby/… and any other static site generator).

Scenario:

  • We build a website every hour (this is what’s happening in our use case)
  • During the deployment there are users on the website. Their already-loaded app expects certain files to be present so it can dynamically load them later.
  • An atomic deploy happens
  • The user tries to navigate the website, the file isn’t found, and the website crashes. (While writing this I’m thinking it should be handled at the application level, but I’m still posting the question to get some better insights and to confirm my assumption.)
  • After a rebuild this problem disappears

Ideas of what might happen:

  • The browser is still expecting files and caching them somehow even after a deploy?
  • It works in a private window. So it has to do with browser caching? Right? Starting to doubt everything I know…
  • Any other ideas are welcome, because I’m honestly getting a bit desperate haha :smiley:

Thanks in advance!

So I’ve continued to investigate why this would be happening, and I keep coming back to the same conclusion. There are two main files missing when our website crashes: either the _app-[chunk].js or the [lang]-[chunk].js.

My conclusion at this point is that somehow some CDN edges don’t have all the necessary files, or are still processing them. Is that something that could be possible?

From previous interactions I’ve understood that an atomic deploy means all the files should be present.

Then why is it that some files seem to be unrequestable in one browser, but not in another? What am I missing here? Any input at this point would help.

Looking forward to chatting about this, and to hearing whether this is even possible.

For now I’m assuming that this cannot be attributed to Next.js, because in most cases it still shows the website correctly (which means the files are definitely generated).

Any takers? :smiley:

Here you can see that the pages folder is present:

Here at the same time you can see they aren’t:


If anyone would be able to help me with this, that would be greatly appreciated! Spending a lot of time on this :frowning:

To make this issue even worse, it’s the app folder that initializes Sentry… Which means that if the files aren’t there, we have no way of actually knowing how many people are impacted by this.

hey @BramDecuypere, thanks for all of your research on this. The person most likely to have answers for you is on PTO this week, but he will be back next week, and I will make sure he sees your post. I bet we can figure this out. thanks for your patience.

Hey @perry, any news on this one?

This experience is starting to become incredibly painful. We have around 8k users per week. The website wasn’t functional for at least 2 hours today. I’m racking my brain over how this is happening, but I have no idea how the CDN responds to the build process.

We are using CircleCI to build and upload the files after tests succeed. But sometimes crucial files are just missing from the deploy.

Is there any chance I could get some insights on what is actually happening on an upload to the CDN? At what point does the ‘atomic deploy’ start rippling through?

Any insights are helpful at this point, because I’m really clueless about why this is happening and how to fix it. There doesn’t seem to be any logic behind it.

How long would it take for a full atomic deploy to ripple through to all edges? Is it possible some clients are still hitting an older edge, and when would they get that ‘refresh’?

Is it possible that two uploads at the same time could mess up the process? (I assume not, as I believe they would be deployed sequentially, from what I understand of an atomic deploy.)

Perry was waiting on me - sorry to be slow to get back to you!

This article describes your situation and our best suggestions for working around it:

TL;DR: each deploy must be self-contained, and NO files from an old deploy will be available in the new deploy unless you do something to cause that to happen (for instance, as mentioned in this response: Angular JAMstack - Scully not prerendering API data - #9 by fool)

So, our suggestion at a high level is not to asset-fingerprint your filenames.

Hey @fool, thanks for the response. Just to make sure I completely understand what’s going on here…

  • We are rebuilding the complete application on every build
  • Sometimes files seem to be missing

Could you answer these?

  1. Do you have an idea of how long it would take for a deploy to spread across all edge servers? Sometimes we just get a different result on two different devices, where one finds the files and the other does not (even across browsers on the same device).

  2. What I’m really asking is: is it possible to have incomplete deploys? And if that happens, how long before those deploys would be completely gone?

  3. So if I understand correctly, there is no way to support this use case, unless you keep the filenames consistent without a fingerprint:

    • User comes on website and react app is fully hydrated.
    • Deploy happens with different filenames
    • User navigates and requests old filenames
  4. Is there any way to check if the files really aren’t there? Or why they aren’t found?

Looking forward to your answer!

I don’t believe that recap is correct. I believe this is what is happening:

  • as you say, you rebuild the complete application on every build
  • no files are “missing” after build - what you build is what we deploy. I believe the symptoms you are seeing are caused by stale content cached somehow at a client, which might have a variety of root causes:
  1. Something like someone having your site loaded in a browser tab, you doing a deploy which does NOT contain chunk files by the same names, and them clicking (without a reload first) on something that tries to load the old files from that old deploy. This is “correct” behavior from the browser’s PoV - but from our side, we can say “it was correct when loaded, but that time has passed”. This situation is described in detail in the first article I linked and is the basis of my advice not to use asset-fingerprinted filenames. (One way to catch this at the application level is sketched just after this list.)
  2. Or, something like a service worker misconfiguration is causing old cached versions of content - which refer to files that, by design of your build pipeline, are not present in the new deploy - to be used in some browsers, with the observed effects. This is written up in more detail in this blog post: International Service Worker Caching Awareness Day
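For case 1, since you mentioned handling this at the application level: a rough sketch of one mitigation (not something we ship, and it assumes webpack reports a failed chunk request as a rejected promise whose error is named “ChunkLoadError”) is to hard-reload the tab whenever an old chunk can no longer be fetched:

```js
// pages/_app.js - a rough sketch, assuming webpack surfaces a failed chunk
// request as an unhandled promise rejection named "ChunkLoadError".
import React, { useEffect } from "react";

export default function MyApp({ Component, pageProps }) {
  useEffect(() => {
    const onRejection = (event) => {
      // A lazily loaded chunk from the *old* deploy 404'd: hard-reload so the
      // browser fetches the new deploy's HTML and its new chunk names.
      if (event.reason && event.reason.name === "ChunkLoadError") {
        window.location.reload();
      }
    };
    window.addEventListener("unhandledrejection", onRejection);
    return () => window.removeEventListener("unhandledrejection", onRejection);
  }, []);

  return <Component {...pageProps} />;
}
```

The reload costs the user one full page load, but it trades a crash for a refresh.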

Have you determined if that could be the case?

Regardless, I can still answer your questions!

  1. a deploy will be updated within about 2 seconds across all nodes of our CDN in normal operation. Could a node “miss an update” somehow, and serve stale content for longer, somehow? Absolutely! But that is super rare and almost certainly not the case here.

  2. No. Deploys are atomic. You’ll either have the deploy, or you’ll have no deploy. Partial updates are not possible in our system (except in the way I describe above - somehow a node doesn’t have the latest content, and your browser contacts more than one node in the course of browsing, so you get inconsistent results. But again, we monitor for and have not observed this behavior in recent months, and most certainly not since you started this thread!)

  3. There are several workarounds you can do as described in the article I linked. My favorite is: “upon any user action that would load a file, if you see a 404, check for a new deploy (you’d need to deploy a ‘version’ file that could be checked by the browser), and if you find one, hard reload the tab” (a rough sketch of that idea follows after this list). But to answer your actual question again, there is no direct support in our service for “promoting” old files to new deploys, or automatically doing some fallback behavior like the above.

  4. Not sure what you’re asking here. If you get a 404, the file is not there. I mean, I can look in our database for any deploy and confirm which files it contains, and extract individual content if there is a need. I guess you could try downloading a copy of the deploy to self-confirm (see the screenshot below for an icon that is on every successful build logs page, which downloads a copy of THAT SPECIFIC DEPLOY), or use this API call to confirm:

[screenshot: the download-deploy icon on a successful build’s logs page]
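To make the workaround in point 3 a bit more concrete, here is a rough sketch of the “version file” idea. The file name version.json, the NEXT_PUBLIC_BUILD_ID variable, and the function name are all made up for illustration; the only requirement is that your build writes some identifier both into the bundle and into a small file next to it:

```js
// A rough sketch of the "version file" workaround. Assumes the build writes
// a small version.json (e.g. { "buildId": "<git sha>" }) into the publish
// directory; the names here are illustrative, not a real API.
const CURRENT_BUILD_ID = process.env.NEXT_PUBLIC_BUILD_ID; // baked in at build time

export async function reloadIfNewDeploy() {
  try {
    const res = await fetch("/version.json", { cache: "no-store" });
    if (!res.ok) return;
    const { buildId } = await res.json();
    // A different id means an atomic deploy has replaced the files this tab
    // was built against, so hard-reload to pick up the new chunk names.
    if (buildId && buildId !== CURRENT_BUILD_ID) {
      window.location.reload();
    }
  } catch {
    // Network hiccup: do nothing and try again on the next failed load.
  }
}
```

You would call reloadIfNewDeploy() from whatever spot in your app notices a 404 on a navigation or dynamic import.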

If you still think some files are missing after build, we will absolutely keep working with you on it - that’s a major fail of our CDN if it is true and we want to be sure that isn’t happening! But, I feel fairly confident in my above analysis - our system is deploying what you built, and old content will refer to files that are not present in a newer deploy. This is informed by several dozen debuggings of similar situations, so while this is experience speaking, speaking as a self-identifying fool, I am open to being wrong once we’ve got a better handle on the symptoms :slight_smile:

Thanks for the very thorough explanation! I’ll investigate and come back with new information and update this answer.


Not using a service worker, so that is excluded.

I believe it absolutely could be a weird case of browser caching… (probably not weird, just not completely understood by me).
But even then, there are some things I haven’t seen before.

I could swear I’ve seen this happen in a FRESH private navigation window as well. But after looking at a problem like this for so long, I’m very well aware that it is hard to be 100% confident about that.

So I’ll check even more for browser caching, and have a look at how this could be handled… I’ll probably need to pull the Next.js guys into this (since they chose fingerprinted filenames as the way to go for static exports).

For now we’ve decided to move our deployment process completely back to Netlify (we had been going through CircleCI for a few months, and of course it’s a hunch, like most of these things, that the problem started then).

To be completely transparent, I haven’t seen it happen often enough to put my finger on what exactly goes wrong. But the business people have seen it more than they would like… (and thus even more than I would like).

The absolute most useful thing you could gather, if you get to a place where you can reproduce it, would be a HAR file: HAR Analyzer

It contains unique identifiers per request that will help us understand when a request was served, and from which deploy. More details about these are described in this article, but a HAR file is preferable since it will get that detail for EVERY asset, and getting the x-nf-request-id for a 404 isn’t as helpful as getting the one for the asset that referred to the 404’ing path (since that will show us “served long ago” → we’d expect the behavior you see, or “served fresh” → bug here).
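If it helps, here is a quick way to pull those IDs out of a HAR export once you have one. It’s a throwaway sketch that only assumes the standard HAR JSON layout (log.entries[].request / .response):

```js
// har-request-ids.js - throwaway sketch: print status, x-nf-request-id and URL
// for every entry in a HAR export. Assumes the standard HAR JSON layout.
// Usage: node har-request-ids.js recording.har
const fs = require("fs");

const har = JSON.parse(fs.readFileSync(process.argv[2], "utf8"));

for (const entry of har.log.entries) {
  const idHeader = entry.response.headers.find(
    (h) => h.name.toLowerCase() === "x-nf-request-id"
  );
  console.log(
    entry.response.status,
    idHeader ? idHeader.value : "(no x-nf-request-id)",
    entry.request.url
  );
}
```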

Thanks for your help in troubleshooting!

Hey @fool, I’ve created a video to show what’s happening. I haven’t worked with the HAR file yet.

But the video explicitly shows a private navigation window having this issue.

I hope this one is useful as well… I’ve added some comments to explain what we are talking about. It’s six minutes, but I hope you can give me better insights into why/what/… is happening, with your deeper understanding of how CDNs work (and internal knowledge of the platform).

Here is the video… I was ‘lucky’ enough to have someone on our team hit the issue, and I guided him through what I believe would be a way to debug it, or at least to understand it better:

Start full video link: debug session commented - YouTube
Time of opening the private navigation window: around 4:27, which was the most surprising part to me…

I’ll try to get a reproduction with a HAR file as well. To me it feels like some kind of ‘race condition’ is happening, so my first thought was the edge servers. But as you established in your previous answers, that shouldn’t be the case.

Thanks for your time @fool. Much appreciated. Hoping to get this sorted out pretty quickly since it has been eating at our conversions for at least a few weeks.

EXTRA INFO: (Just because I have the feeling it could be relevant)

  • We are using both the _redirects file and the redirects inside the TOML.
  • We are using ./node_modules/.bin/netlify deploy --site $NETLIFY_SITE_ID --auth $NETLIFY_ACCESS_TOKEN --dir=out --prod to push it online
  • There was some time overlap where it is possible that two builds were deploying at the same time from CircleCI
  • We have the feeling this issue started somewhere between three weeks and a month and a half ago (although it seems more frequent in the past two weeks)

[edit]: @fool, I’ve added links to two HAR files related to this request. This time it’s happening for another file, but an equally important one, with the same breaking result for the end user:

Looking forward to your feedback!

… The more I look at this, the more I get the feeling the redirects might have something to do with it. (It says ‘serving from disk cache’ for a file that should be fetched, so I guess it must be saved in some weird, more permanent way, such that even a (hard) refresh doesn’t work??)
But I certainly wouldn’t want to point you in the wrong direction, so it’s better to follow your own gut on this one!

@fool, did you have time to check it? Our team has the feeling (I’m sorry to say, but everything is still intuition on this one…) that it is happening less now that we are:

  1. Building on Netlify again (before, it was CircleCI building and uploading)
  2. Building with webhooks (before, we were rebuilding every hour; now it’s on changes, which is basically every 20 minutes or so)

I’ve just gone through 200 Hotjar video recordings. 32 showed this bug of a file not being there. That’s about 16% of the users. Of course it’s only a short timeframe and a low number of users, but if that’s the trend, it’s really painful. Still no idea what is going on here, since it looks like it’s happening on all types of devices, and very randomly.

Is there any way to see how many 404 hits we get on our website, without using Google Analytics (since that isn’t being loaded)?

[EDIT] - We had a build fail on Sep 4 at 4:22 PM on Netlify; the log text is below. Could something like this potentially result in files not being present while the website is still loaded onto the CDN?

4:27:32 PM: ┌─────────────────────────────┐
4:27:32 PM: │   Netlify Build Complete    │
4:27:32 PM: └─────────────────────────────┘
4:27:32 PM: ​
4:27:32 PM: (Netlify Build completed in 3m 53.7s)
4:27:32 PM: Caching artifacts
4:27:32 PM: Started saving node modules
4:27:32 PM: Finished saving node modules
4:27:32 PM: Started saving build plugins
4:27:32 PM: Finished saving build plugins
4:27:32 PM: Started saving pip cache
4:27:32 PM: Finished saving pip cache
4:27:32 PM: Started saving emacs cask dependencies
4:27:33 PM: Finished saving emacs cask dependencies
4:27:33 PM: Started saving maven dependencies
4:27:33 PM: Finished saving maven dependencies
4:27:33 PM: Started saving boot dependencies
4:27:33 PM: Finished saving boot dependencies
4:27:33 PM: Started saving go dependencies
4:27:34 PM: Finished saving go dependencies
4:27:34 PM: Build script success
4:27:34 PM: Starting to deploy site from 'out'
4:27:35 PM: Creating deploy tree asynchronously
4:27:35 PM: Failing build: Failed to deploy site
4:28:05 PM: Failed to inform the API about a failed build, please retry the build or contact support
4:28:05 PM: Finished processing build request in 5m9.067341459s

Hi,

I took a look at the HAR files and noticed that most of your files have a hash. Based on [Support Guide] How does Netlify’s CDN handle caching files?, you’ll want to try disabling hashing and see if you still get the errors. Let me know how it goes.
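In case it helps to see what that would even mean for a Next.js project, here is a very rough, unsupported sketch. I’m assuming the webpack hook in next.config.js lets you override the client output filenames; this fights the framework’s defaults and may break long-term caching, so treat it purely as an illustration rather than a recommendation:

```js
// next.config.js - hypothetical, unsupported sketch of "disabling hashing":
// strip [contenthash] from the client chunk names. Illustration only; this
// may conflict with Next.js's own build manifests and caching assumptions.
module.exports = {
  webpack(config, { isServer, dev }) {
    if (!isServer && !dev) {
      config.output.filename = "static/chunks/[name].js";
      config.output.chunkFilename = "static/chunks/[name].js";
    }
    return config;
  },
};
```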

That would mean changing our core framework, Next.js. At this point, that isn’t really an option. Thanks for your suggestion though. The more I look into this, the more I get the feeling Vercel/ZEIT Now might be a better fit for this use case.
Although I’m really in love with Netlify for the most part, and most of my other projects will stay here, the Next.js ones will get more flexibility and future-proofing on their platform (although I know you guys have superior specs on other levels).

Thanks for the help!

Although I’m no SSG whiz, we do have next-on-netlify which you may want to check out?

Hey @Scott, thanks for checking in! The plugin is great for reducing the build time, but it doesn’t really help with the problem we are having here, as there will still be name changes (because of how the framework works).

hey @BramDecuypere! tried to glean as much from the video as possible. thanks so much for sharing so much detail! would love to try to diagnose what’s going on with you. i’ve recently started work on next-on-netlify and have suspicions this is, unfortunately, specific to Next.js and its existing compatibility with Netlify (we’re on our way to improving that!). i myself have seen weird caching behavior with only specific chunked page files that Next generates. can you tell me the cache-control header on the failing and succeeding chunk requests?
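if it helps, something like this run from the site’s own browser console will print those headers (the chunk url below is just a placeholder; swap in the real failing and succeeding urls from your network tab):

```js
// quick sketch: grab the cache-control header for a chunk, run from the
// browser console on the site itself (same origin, so headers are readable).
// the url is a placeholder, not a real path from this thread.
const url = "https://example.com/_next/static/chunks/example-chunk.js";

fetch(url, { cache: "no-store" })
  .then((res) => {
    console.log("status:", res.status);
    console.log("cache-control:", res.headers.get("cache-control"));
    console.log("x-nf-request-id:", res.headers.get("x-nf-request-id"));
  })
  .catch((err) => console.error("request failed:", err));
```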

Hey @BramDecuypere, I assist @lindsaylevine on the next-on-netlify package :slight_smile:

This is a shot in the dark, but could you share your _redirects and netlify.toml file with us? We have seen some issues related to those and NextJS in the past.