Need help to understand recent Netlify issues

Hi.
I would like to start with big kudos for quickly resolving Netlify’s system failure that we’ve recently seen.
I really don’t want to rub salt in Netlify’s wound, but for my own peace of mind I would like to know what could have caused the whole CDN system, dashboard, and sites to go down.
As a non-expert, I understand that a DNS failure could be to blame, but usually there is DNS failover in place to prevent such things.
Does anyone really know what happened and wouldn’t mind sharing? I’m not familiar with the term “Origin Servers”.

Cheers.

Hey @webshaped.biz, I can try to answer some of what you’ve asked about here, although as a community volunteer, please don’t take my word over a response from a Netlify employee.

I had to go check status.netlify.com, since I didn’t recall hearing of any major outages. It looks like prior to October 8th, the only other recent issue with the origin servers was on September 22nd, when there was some latency trouble. With the most recent outages, it looks like customers on Pro and Enterprise tiers were largely unaffected, so those who have invested in Netlify to guarantee a lower risk of outages are getting what they’re paying for. If you depend heavily on your site being online 99.9% of the time (e.g. you run an API that needs to be up at all times, a storefront that could cost your business thousands of dollars or more if it goes down, or infrastructure for clients who have trusted you to keep it up) and are currently using Netlify’s free plan, moving to a paid plan will definitely give you the peace of mind you’re looking for.

Computing infrastructure isn’t perfect, and for a company like Netlify that has a massive free-user base there are bound to be hiccups in the system and overtaxing of limited resources shared by thousands. Even with fail-safes in place, sometimes they just aren’t enough to cope with a critical issue that brings the whole system to a halt.

In the case of the issue that took place on October 8, the CDN, dashboard and sites appeared to remain online; only the origin servers and build pipeline were down. In a Content Delivery Network setup there are two different types of machines: origin servers and edge servers. Since it’s not practical to deploy a site to hundreds of servers all over the world, the site is usually deployed to one “origin server” and that is in turn mirrored to many “edge servers”. The edge servers serve the site to the users that are geographically closest to them to keep latency low. On a regular basis, the resources cached on the edge servers expire, which forces the edge servers to re-pull from the origin. Netlify takes this a step further by instantly invalidating the edge server cache so that your updated site can be made available immediately.
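
To make the origin/edge relationship a bit more concrete, here is a minimal Python sketch of the idea described above. It is only a toy model: the `Origin` and `Edge` classes, the `ttl_seconds` parameter and the `invalidate` method are hypothetical names made up for illustration, and none of this reflects how Netlify’s CDN is actually implemented.

```python
import time


class Origin:
    """Hypothetical origin server: holds the one authoritative copy of each file."""

    def __init__(self):
        self.files = {}

    def fetch(self, path):
        return self.files[path]


class Edge:
    """Hypothetical edge server: caches files pulled from the origin for a limited time."""

    def __init__(self, origin, ttl_seconds, clock=time.monotonic):
        self.origin = origin
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the example can be tested without sleeping
        self.cache = {}             # path -> (content, expires_at)

    def get(self, path):
        entry = self.cache.get(path)
        if entry is not None and self.clock() < entry[1]:
            return entry[0]         # cache hit: serve the local copy
        # Cache miss or expired entry: re-pull from the origin and cache again.
        content = self.origin.fetch(path)
        self.cache[path] = (content, self.clock() + self.ttl)
        return content

    def invalidate(self, path):
        """Drop the cached copy so the next request re-pulls from the origin."""
        self.cache.pop(path, None)
```

With plain expiry, an edge keeps serving the old copy until its TTL runs out. Calling `invalidate` right after a deploy is the "instant invalidation" step: the very next request goes back to the origin and picks up the new version.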

When the origin servers go down though, this creates a problem, as now the deployment system has nowhere to deploy files to, meaning that no builds can go out. Secondary to this, no sites can get updated because the edge servers can only serve what they have cached until the origin servers come back online and a new copy can be pulled and cached.
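
The outage scenario described above can be sketched in a few lines of Python. Again, this is a simplified illustration under my own assumptions, not Netlify’s actual behavior: `deploy`, `serve_from_edge`, `OriginDown` and the `origin_online` flag are all hypothetical names, and the edge is assumed to refresh from the origin on every request while the origin is reachable.

```python
class OriginDown(Exception):
    """Raised when the (hypothetical) origin server cannot be reached."""


def deploy(origin_online, origin_files, path, content):
    """Publish a new version of a file; this needs a reachable origin."""
    if not origin_online:
        raise OriginDown("nowhere to deploy to: origin is offline")
    origin_files[path] = content


def serve_from_edge(edge_cache, origin_online, origin_files, path):
    """Serve a file from an edge, refreshing from the origin when it is reachable."""
    if origin_online:
        # Origin reachable: pull the current copy and cache it.
        edge_cache[path] = origin_files[path]
    if path in edge_cache:
        # Origin unreachable: fall back to whatever copy is already cached.
        return edge_cache[path]
    raise OriginDown(f"{path} is not cached and the origin is offline")
```

While `origin_online` is `False`, `deploy` fails, so no new builds go out, and `serve_from_edge` keeps returning the last cached copy. Visitors still see the site, just not an updated one, until the origin comes back.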

In a perfect world, there wouldn’t be any downtime at all, but as I said above, if uptime is critical to your business then a Pro or Enterprise plan would definitely give you more peace of mind. Thankfully, redundancies and backups exist, so when servers do go down they can be brought back up in a matter of minutes to hours rather than days.

I’m sure the Netlify team can cite more specifics, but in the meantime hopefully this explanation helps some.

For more info, check out this article from Cloudflare on how CDNs work: https://www.cloudflare.com/learning/cdn/what-is-a-cdn/, or explore Netlify’s product pages at: https://www.netlify.com/products/.

Cheers!


Hi @noelforte, and let me start with wow! With this being a community help forum, the time and effort you put into this elaborate answer just left me speechless. Thank you very much.

Yes, on October 8th there was an incident, but in my case, and for a few others retweeting me, everything went down: the Dashboard at app.netlify.com and all of my sites.
You helped me understand more about the Origin servers and their role in a CDN, of which I know something from several AWS entry-level courses.

I understand what the problem with deployments could be when Origin servers are down, but I was trying to find out what kind of event, aside from DNS failure, could cause all my sites to be down at the same time as my Dashboard. With all this data replication and redundancy of cache/edge servers, it was nevertheless all wiped out for 10 minutes or so.

Being in the middle of several sales pitches where I had mentioned the reliability of a CDN as a static website delivery system, I was a bit shaken by the aforementioned event, even though it was resolved so quickly and communicated on the @netlifystatus Twitter account. Kudos to the team for that.

Once again, thank you for your time and effort @noelforte

Hey there,

I think the best way to sum it up would be that our www and app sites both run on Netlify. So, when our customers experience issues, so do we. Using an alternative service would go against everything we stand for. However, when anomalies like this arise, it can hit hard. That said, they are few and far between!