[Incident] Netlify service outages affecting ability to depend on Netlify - please comment here

hi there google5,

by yellow, do you mean this:

on netlifystatus.com?

it is our policy to leave incidents in monitoring until we feel sure that the issue is resolved - we do this for all incidents. We will move to resolved as soon as we can confidently say that it is resolved. I would check again in 5-10 minutes.

as far as categorizing the incident is concerned, i am seeing that there is currently a meeting underway to discuss and that we will be releasing as much information as we safely can when we can. Unfortunately i don’t have a timeline on that, it could be later today, or it could be once we have had time to get the team together to do a retrospective. Once/If that timeline gets clarified, i’ll bring that info here asap.

regardless of the label on the status page, i do agree with you and everyone else who feels this was a severe, impactful incident - and i don’t think anyone at netlify disagrees. I promise we will treat it as such, that promise is based on seeing department heads etc etc who are jumping into a call to discuss as i am typing this.

again, definitely not the experience we want you or anyone else to have, and i totally get that this was a rough one, i would be upset too if i were in your position.

for now I’m glad we fixed this for y’all, and we will be moving to resolved (if things stay stable), shortly. :pray:

Thanks for the reply.

Yes. I meant the status page.

No issue with leaving it in monitoring. The issue is it never turned to “partial outage” or “outage” at any time.

So I genuinely don’t know if engineering is aware that a subset had an outage - and therefore don’t know if our outage issue is an unknown secondary issue which needs to be raised separately. Latency is not the same issue as 503 / 500 status.

Hi folks, and thanks for the thoughtful and reasonable feedback you’ve provided. We especially appreciate the folks who can be kind to our Support staff even when things are going wrong.

I run the Support team at Netlify and we are in charge of customer communications during incidents. Perry is a leader on my team so everything they’ve said also reflects my opinion. Nonetheless, I take full responsibility for what was said and its impact on you and your business. I do want to start with an apology for the problematic messaging and spotty updates on this incident, and also for the increased frequency of incidents in the past month which is also reflected on our statuspage. I take personal ownership in how our service (both customer service, and our web hosting and other features) works, and I am truly, personally sorry for the impact that this had on your websites and businesses.

One of our company values is transparency, and that’s why we both have a status page and have conversations like this in public. I really and truly appreciate your feedback on the wording and do understand that it didn’t reflect your observed reality. We are not trying to hide anything, but as we could tell from our monitors and logs, the impact was truly partial across all of our services - many requests were served correctly, and some websites were served perfectly at all times due to the nature of our CDN. We attempted to reflect that status while we wrote messaging. It would have been incorrect to say that everything was completely down, because it wasn’t - the majority of our frequently served content is cached at the CDN edge and that cache continued to be served correctly. Our team is aware that yes, the impact was much higher for some customers than others, and we made the call to use generic “one size fits most” wording in the moment, which I understand is the cause of much of the dissatisfaction in this thread. We will work to improve this!

So, how will we do better in the future? We have a few plans:

  1. For this and every impactful incident we have, there is always a company-wide learning review meeting during which we discuss not only what went wrong, but what we can improve and how we have worked to prevent the same problem from resurfacing. I bring feedback like what has been provided in this thread (and via twitter, and via our helpdesk) to these meetings, so your voice will be heard.
  2. As Perry mentioned, I have added a discussion point to our scheduled learning review for this incident, about how we can better (more obviously and from your point of view, correctly) classify this impact in future incidents. But, I do see your point that it seemed like “much more than a degradation” from your point of view and we can and will do better!
  3. We are always open to your suggestions. There have been some good ones in this thread but if you think “gee, if only they’d do X that nobody has mentioned yet, everything would be so much better!”: please let us know that feedback. Another one of our company values is that the best idea can come from anywhere, including our customers, so I would be happy to discuss further your suggestions for improving this and related processes in the future. I can assure you that “improve service stability” is already on the list as a top priority, so we are probably looking more at suggestions around communications than anything else, but feel free to post other ideas and we will pass them on to the relevant teams on your behalf.

I want to close by echoing what Perry had to say earlier, that this is not the level of service we intend to provide, and assure you that we are working to improve both stability and our communications with customers around problems when we miss the mark.

Please let me know if you have any remaining concerns or suggestions, and my team will be happy to discuss further and ensure that other stakeholders and decision makers in the business see any feedback which could help us improve in the future.

The one thing I am missing here is: “We will make sure to send out AT LEAST an email to all affected customers, the moment it’s clear that sites are down”

Hi @fool thank you for the thorough explanation.

Is it possible that there’s a gap in the logging - e.g. containers that didn’t start didn’t produce logs, or a service-connection bug stopped logs from being transported?

This would explain the gap between what you describe and what we observed.

We saw 100% of requests fail with 500-series errors (for hours).

Regarding the frustration: imagine your site goes down and your carrier refuses to acknowledge even in hindsight there is a downtime incident.

This is why I think there were failure modes you didn’t see in your charts.

Could you tell me what sort of site you have? The impact as I understand it was more about “what your site layout is” - do you use proxying, or functions? password protection? On demand builders? Is your site used constantly (and thus would be in cache), or rarely (and thus, would be uncached content with which we were struggling)?

Certain features were heavily impacted; most customers do not use many of those features though.

Literally just static HTML. Nothing that could break. We also tried connecting directly to alternative routes, bypassing the domain and just using the Netlify app URL, etc. and all was 500 or 503 throughout

We were mid-build when the incident started. Is it possible that the update and build happening together caused the service to be “lost” in a way which didn’t emit logs (example: internal auth became invalid or IP address changed and lost communication with the pod, which could include log transport)? Or some other form of corruption.

Our site was literally static HTML. No tooling. No build step. Obviously HTML can’t emit 500-series errors. And it wasn’t intermittent and certainly wasn’t a latency issue. I think this was not the issue you describe in your incident report but likely some complication of it.

The issue you describe is definitely not what we experienced. The issue we experienced was certainly an outage.

No advanced features. Pure HTML. But a fresh build would have invalidated the cache. Could the build during an issue have written a corrupt cache?

Something along these lines would make sense, as you seem quite sure there was no complete outage for any users. We are certain there was. So this leaves failure modes which are invisible to observability.

EDIT: I’ve hit my max replies for first day on the forum. So I can only reply by editing here @perry my API ID is ccf08d9a-467a-458e-bfdd-9ea35462e6a4

Hey madsem,

given that sometimes parts of Netlify go down when our customers’ sites are impacted by an incident, the safest and best way for us to interact with customers such as yourself is to use an external service like Statuspage. Prior to using Statuspage, we did have our emergency infrastructure impacted by incidents, and were unable to communicate reliably at all, which is of course terrible and very unhelpful on about a bazillion levels. So, we try to make things as simple and off-service and user-error-by-our-team-when-s***-hits-the-fan proof as possible.

While I understand that you’d like to have a real time way to hear about service impacts, an email from us directly isn’t going to be part of what we can offer.

you can, though, subscribe to status page alerts - there is more information on that available here:

let us know if you have more questions!

hey again,

could you share the API ID (safe to share) or netlify site name of one of the sites that was impacted the way you are describing so we can do some more sleuthing?

Thanks @perry for enabling me to make more posts.

APP ID: ccf08d9a-467a-458e-bfdd-9ea35462e6a4

I too would be interested – all 4 of my sites are RedwoodJS so they have functions and content, frontend to backend via GraphQL. I use offsite databases and images from an S3 bucket – one of the apps’ API IDs is: 3da55a61-0072-4669-90f0-44b6166ac31e – thanks!!

Any news on this?

A bunch of sites went down with 500 codes. There were many many posts. The incident log, text, and categorization still don’t acknowledge a partial outage 20 hours later.

Hey there, @google5 :wave:

Thanks so much for following up. Our team is working hard to analyze not only what went wrong, but also what we can improve for the future. We will post a full write up on our status page. Additionally, I will share the link here when it is live.

Hello folks,

Here is the write up of yesterday’s incident, shared on our status page: Netlify Status - Increased errors and latency affecting multiple services

Please do not hesitate to follow up with additional questions,
Hillary

Hi Guys,

We also experienced very significant downtime on our sites. We are close to going live with one of these sites - and are of course concerned about yesterday’s events. A way in which you could help is to provide suggestions on what services we could use - and how best to integrate them with Netlify - to serve a cached version of the site when Netlify has issues. Given what happened yesterday - we need to set up such a service before we can consider going live with the site.

Hi folks and thanks for all your continued thoughtful input! We brought a lot of it to the team including our head of engineering and sales teams and we have made a lot of plans to continue improving not only how we communicate but how we can help our network and your sites be more resilient to similar issues in the future.

Wanted to provide some specific followup to folks who gave additional feedback or asked about specific sites, and again in the spirit of transparency I’ll share this here since these factors influenced everyone who experienced trouble loading their sites yesterday:

Having reviewed the sites reported by @google5 and @ajoslin103 I can tell you why you received 500s: your sites were not cached on our CDN. Your usage was extremely low before the outage, and thus, your content generally was not in cache. While this is less of a “how to avoid this problem” which @allansolutions has requested, it is some guidance as to why your sites were not as resilient as those of many other customers during the incident yesterday; they had very low usage. I won’t out you specifically on numbers, but I looked over the preceding 2 days in our internal logs and your usage was rather low on both of those API IDs.

Our caching is opportunistic, meaning we don’t “add things to cache” unless people are actively requesting those assets. So, if nobody requests any assets for a while, they will drop out of cache. Our cache is also per-CDN-node, so having content in cache on one node will only help people who are routed to that node; our other 80+ nodes don’t share that cache directly.
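To make the caching behaviour described above concrete, here is a minimal sketch of a single edge node with an opportunistic cache. It is an illustration only, not Netlify's implementation: the `EdgeNode` class, the `idle_ttl` eviction rule, and the one-hour default are all assumptions chosen to show why a rarely requested site can return errors while a busy one keeps serving from cache.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeNode:
    """One CDN node with its own private cache (hypothetical model)."""
    idle_ttl: float  # assumed: seconds an asset survives without being requested
    last_hit: dict = field(default_factory=dict)  # path -> time of last served request

    def request(self, path: str, now: float, origin_up: bool) -> int:
        """Return an HTTP-like status code for a request arriving at time `now`."""
        seen = self.last_hit.get(path)
        cached = seen is not None and (now - seen) <= self.idle_ttl
        if cached:
            self.last_hit[path] = now  # a hit keeps the asset warm
            return 200
        if origin_up:
            self.last_hit[path] = now  # a miss fills the cache from origin
            return 200
        return 503  # uncached content while origin is unavailable


def simulate(request_times, outage_start, idle_ttl=3600.0):
    """Replay requests for one page against one node; origin fails at `outage_start`."""
    node = EdgeNode(idle_ttl=idle_ttl)
    return [
        node.request("/index.html", t, origin_up=t < outage_start)
        for t in request_times
    ]
```

With these assumptions, `simulate(range(0, 7201, 600), outage_start=3000)` returns 200 for every request, because steady traffic keeps the page in cache through the outage. `simulate([0, 7200, 7800], outage_start=3000)` returns `[200, 503, 503]`: the page aged out before the second request and could not be refetched. Since each node caches separately, the same reasoning applies independently to every node a visitor might be routed to.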

I don’t have any good and cheap suggestions for you, @allansolutions . Either you can:

  • trust us to do better in the future based on our past track record and our transparency during incidents like this (this is the largest problem we’ve had in the past 5 years by my count - last one this big was here: Learning Review for our Feb 2 Origin Outage)
  • try to game the system so that your content is in cache - but that will be expensive since you’ll need something like a geographically distributed monitor that downloads most of your content continuously to keep the cache warm, and you’ll pay the bandwidth bills for it in addition to the monitoring bills.
  • decide that “at a small price like $0, $19 or $99/mo, I can live with some downtime” since you presumably get as much value out of our platform typically as you are paying, or you would have picked a different platform that offers you better value and potentially better uptime.
  • decide that your website is worth getting an SLA for, and upgrade to an Enterprise account here - which not only provides an SLA with refund guarantees in case of downtime, but also puts you on a separate CDN that is both more insulated from problems, and has far less cache contention, so your content stays in cache even if rarely requested.
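For the "keep the cache warm" option in the list above, a rough sketch of one warming pass might look like the following. It uses only the Python standard library; the function names `fetch_status`, `warm_once`, and `failures` are hypothetical, and the URL list is something you would supply yourself. It is not a Netlify-provided tool.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Fetch a URL and return its HTTP status code (0 on network failure)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # download the body so the asset is actually served
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions
    except OSError:
        return 0  # DNS failure, timeout, refused connection, etc.


def warm_once(urls: Iterable[str],
              fetch: Callable[[str], int] = fetch_status) -> Dict[str, int]:
    """Request every URL once and record the status code seen for each."""
    return {url: fetch(url) for url in urls}


def failures(results: Dict[str, int]) -> List[str]:
    """Return the URLs that did not come back with a 200."""
    return [url for url, status in results.items() if status != 200]
```

A single pass from one machine would only touch whichever CDN node that machine is routed to, so this would need to run on a schedule from several regions to have the effect described, and the bandwidth cost scales with the size of the URL list. The `fetch` parameter exists so the logic can be exercised without network access, for example `warm_once(urls, fetch=lambda u: 200)`.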

One thing that we have immediately changed is that we’re going to suggest on any incident that affects uncached content not to deploy. Deploying clears the cache! So if you try to deploy to fix the problem…that will only make it worse. Leaving that note here since that is something immediately actionable you can consider in future similar situations, should there be any.

These aren’t suggestions that you need to change anything. But with an uptime of 99.99% in Pingdom over the past year on the CDN everyone in this thread uses, I feel like we are doing pretty well even for our $0 customers. Not that we don’t want to do better, and aren’t always trying - but we aren’t down most of the time or even much of the time.
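For a sense of scale on that 99.99% figure, the arithmetic can be sketched as below. The function names are hypothetical, and the three-hour figure in the note afterwards is an illustrative assumption, not a measured duration for any site in this thread.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 365.0) -> float:
    """Minutes of downtime implied by an uptime percentage over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def uptime_percent(downtime_minutes: float, period_days: float = 365.0) -> float:
    """Uptime percentage implied by a number of downtime minutes over a period."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)
```

Over a 365-day year, `allowed_downtime_minutes(99.99)` comes to roughly 52.6 minutes. By the same arithmetic, a single site that was unreachable for a hypothetical three hours would sit at about 99.966% for that year (`uptime_percent(180)`), which is why an aggregate number measured on a CDN and the experience of an individual uncached site can differ.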

Hopefully that transparency enables everyone here to make the right choice for their business.

You guys do amazingly.

To be clear - the people who follow up after the issue are the people who care MOST, not least.

Thank you @fool for the write-up. This is approximately what I suspected had occurred.

The issue I have is the discrepancy between what we know occurred (serving outage for some users) - vs. the official stance (no mention of hosting outage even in the postmortem writeup).

From all the anecdotal reports, and from the investigation you mentioned - it’s an outage which affected some users but not all. Surely this is the definition of the “partial outage” tag. There’s no mention of different infrastructure-reliability tiers on the pricing & plans page.

Core infrastructure (serving) going down should be treated as an outage regardless of significance of the user. The same people building free sites also have day-jobs with large-enterprises.

It’s TOTALLY fair for lower-plan users to get hit worse. This shows good design. But transparency would be:

  • [degradation - latency & errors]
  • secondary incident [partial outage - hosting outage for some lower tier users]

When a secondary, more severe issue hits less important users, why wasn’t that logged/communicated? It’s TOTALLY fine that less expensive tiers get hit, but how can you say that serving completely stopping for those users doesn’t even count as an incident?

Howdy Y’all,

Thank you for the lively conversation and the push to help us be better! I am incredibly sorry for the incident and, just as the awesome Netlify support team have pointed out, we are all committed to doing better at every opportunity, and this was a BIG opportunity. In the future we will ensure we are clearly articulating the user experience in the event of failure. During the incident we chose what we thought was best to describe the user experience based on our understanding of impact, system monitors, and our runbooks. We are committed to being transparent and as accurate as possible when issues arise and have taken this feedback seriously. Although we did not have a hard system outage, it was still a partial outage for those who did not have cached content, and the user experience felt like an outage. Our systems were up, just in a degraded state, but that is not what was experienced. I completely agree we can do better and I am committed to ensuring we do. I appreciate the feedback and commentary and look forward to serving y’all better!

Thanks @danaiszuul ,

That makes total sense.

The thing is - it was a partial outage (as you say). 9 days on from the incident it’s clear its official categorization will remain degradation.

To me, where the records don’t match reality, that’s a bad look. And at the time it was certainly very frustrating.

This is basically a documented case of under-reporting downtime.
If that’s the practice - it casts significant doubt over uptime figures & other stats.

When Netlify’s record is genuinely excellent, why cast that shadow?