[Incident] Netlify service outages affecting ability to depend on Netlify - please comment here

Thanks for reaching out, @niels-nijens.

Our mitigation was not fully effective and we are seeing more latency, timeouts, and errors for serving uncached content, API responses, and builds for all customers. Our team is working hard on a fix.

We will continue to update our status page: https://www.netlifystatus.com/

2 Likes

Let’s see how soon the Netlify team will fix this.

1 Like

Honestly, the reason we came to Netlify and Jamstack in general was precisely so we wouldn’t have to deal with this so often. And I’m not saying this because I want to troll, but because I am running a business. We will at least go have a look at Vercel now.

1 Like

Hey there, @madsem :wave:

Thank you for taking the time to reach out and share this feedback; we take customer experience very seriously, and I assure you that I will share your feedback with all of the relevant teams as we work to fix this.

Thanks @hillary, I am mostly annoyed by the fact that there was no notification or anything. I found out by looking at our client’s paid ad campaign stats that something wasn’t right.

Outages can happen, but I believe it’s important to be upfront about this and not downplay it. Notify your customers.

1 Like

@hillary, I and many others in this thread (likely tens of thousands who didn’t post) agree that outages do happen, but refusal to acknowledge the outage is extremely frustrating when our sites go down.

The site is back up, but we need a correction to the “degradation” narrative; it feels like a shady policy to pretend no outage occurred in order to protect the uptime stats.

Without acknowledgment we don’t know whether Netlify engineering is even aware, or whether there are gaps in the monitoring that we need to inform somebody about.

I put a more elegant version of this in another thread.

3 Likes

hi everyone,

Support Leadership here - thank you for your patience as we continue to support the folks working to remediate the underlying problems.

I understand your frustrations - I hear you loud and clear - and of course that is never the experience we want you to have. As soon as we have a better idea of the what, when, how, and why of these incidents’ impact on our services, we will be happy to share as much information with you as we can.

I promise that we understand the impact and are working as hard as we can to remediate. :pray:

3 Likes

Thank you for the update - but please, correct the categorization to “partial outage” at least. We are not lying or mistaken. It’s also a very clear-cut definition.

1 Like

Agreed; if this doesn’t rise to the level of at least a “minor outage,” I’m not quite sure what does. What I and many others in directorial or client-facing positions need today is a cogent and transparent report we can point to that explains what happened and what steps are being taken to address it going forward.

Netlify, truly, I love you, but you’re bringing me down. As a smaller agency we’ve come to rely on your services, you’ve been critical to our growth, and we’ve been huge boosters for what your platform provides. I still am.

But I struggle with the status page’s response, because ultimately we have clients of our own that we answer to, and “degraded performance” doesn’t capture their experience or give me what I need to calm them down.

I know this is a rough day for you all, and I appreciate that you are working as hard as you can. The extent of the problem might not yet be fully understood. But what our clients understand right now is that their sites were fully broken, and the status doesn’t make them (or many here) feel heard. Thank you, and I hope you all get to take a vacation after this one.

Edit: I see “We will be writing a public root cause analysis describing what led to the issue and how we’ve resolved it.” Thank you! (Also, everything appears to be resolved on our sites.)

3 Likes

It’s like somebody at Netlify said, “OK, let’s communicate this, but use the word ‘degradation’, not ‘downtime’ or ‘outage’.”

totally hear you mfan, and i will ask someone with a higher-level view to weigh in on the process of classifying the incident as soon as we can.

for the time being, we think we have things fixed -

are you still seeing issues? if yes, can you report back here with some information about where and what you are seeing? thank you.

1 Like

google5 - i promise that this was not malicious. I understand the impact, and i hear that you are angry, frustrated, and feel like your trust has been broken. But accusing us of trying to mislead our customers isn’t appropriate, and will never be appropriate - we have built our track record not on infallibility, but on transparency and honesty, and this incident is no exception. Please be mindful that complexity isn’t always readily apparent, and incidents like this one are among the times when we prioritize speed over accuracy.

more details as soon as they become available, and please do share with us if you are still seeing errors.

2 Likes

I’m not still seeing errors. All I want is an acknowledgment that this was a downtime incident for myself & many others here.

It’s not purely principle… it’s also pragmatism. Many of us checked the status page, saw an update about latency while the reality was hours of 500 responses, and spent that time troubleshooting - work we wouldn’t have needed to do if the status page had reported the situation clearly and correctly.

Troubleshooting info:

If it helps with your troubleshooting: I had a build in-flight when the downtime hit, which could have put the cache into an unknown state. It’s plausible that some timing component caused some of us to have a full outage. Rebuilds and rebuild-with-cache-clear did not resolve it.
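For context, the probe we ran while diagnosing looked roughly like this (a minimal sketch; the URL is a placeholder for our actual site) - it logged a timestamped status code once a minute and showed nothing but 5xx for hours:

```python
# Minimal status probe: log a timestamped HTTP status once a minute.
# The URL below is a placeholder, not our real site.
import time
import urllib.request
import urllib.error

URL = "https://example.netlify.app/"

while True:
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code          # 5xx responses arrive here as HTTPError
    except Exception as e:
        status = f"error: {e}"   # timeouts, DNS failures, connection resets
    print(time.strftime("%Y-%m-%dT%H:%M:%S"), status)
    time.sleep(60)
```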

Please don’t dismiss the people who had 500 status for hours.

Even if this only affected the 0.1% who happened to have builds in flight, or to be on a certain pod, or whatever - it’s an outage incident for those users, and not calling it that feels very dismissive. If it affected only some users, that’s a “partial outage” status. We shouldn’t have to campaign just to essentially log a ticket (in this case, telling you we experienced an outage, not degradation).

Can you directly answer the question of why the categorization remains yellow?

hi there google5,

by yellow, do you mean the status indicator on netlifystatus.com?

it is our policy to leave incidents in monitoring until we feel sure that the issue is resolved - we do this for all incidents. We will move to resolved as soon as we can confidently say that it is resolved. I would check again in 5-10 minutes.

as far as categorizing the incident is concerned, i am seeing that there is a meeting currently underway to discuss it, and we will be releasing as much information as we safely can, when we can. Unfortunately i don’t have a timeline on that - it could be later today, or it could be once we have had time to get the team together for a retrospective. Once/if that timeline gets clarified, i’ll bring that info here asap.

regardless of the label on the status page, i do agree with you and everyone else who feels this was a severe, impactful incident - and i don’t think anyone at netlify disagrees. I promise we will treat it as such; that promise is based on seeing department heads and others jumping into a call to discuss it as i am typing this.

again, definitely not the experience we want you or anyone else to have, and i totally get that this was a rough one; i would be upset too if i were in your position.

for now I’m glad we fixed this for y’all, and we will be moving to resolved (if things stay stable), shortly. :pray:

1 Like

Thanks for the reply.

Yes. I meant the status page.

No issue with leaving it in monitoring. The issue is it never turned to “partial outage” or “outage” at any time.

So I genuinely don’t know if engineering is aware that a subset had an outage - and therefore don’t know whether our outage is an unknown secondary issue which needs to be raised separately. Latency is not the same issue as 500/503 responses.

Hi folks, and thanks for the thoughtful and reasonable feedback you’ve provided. We especially appreciate the folks who can be kind to our Support staff even when things are going wrong.

I run the Support team at Netlify and we are in charge of customer communications during incidents. Perry is a leader on my team, so everything they’ve said also reflects my opinion. Nonetheless, I take full responsibility for what was said and its impact on you and your business. I do want to start with an apology for the problematic messaging and spotty updates on this incident, and also for the increased frequency of incidents in the past month, which is also reflected on our status page. I take personal ownership of how our service (both customer service, and our web hosting and other features) works, and I am truly, personally sorry for the impact that this had on your websites and businesses.

One of our company values is transparency, and that’s why we both have a status page and have conversations like this in public. I really and truly appreciate your feedback on the wording, and I do understand that it didn’t reflect your observed reality. We are not trying to hide anything, but as far as we could tell from our monitors and logs, the impact was truly partial across all of our services - many requests were served correctly, and some websites were served perfectly at all times due to the nature of our CDN. We tried to reflect that while we wrote the messaging. It would have been incorrect to say that everything was completely down, because it wasn’t - the majority of our frequently served content is cached at the CDN edge, and that cache continued to be served correctly. Our team is aware that, yes, the impact was much higher for some customers than others, and we made the call to use generic “one size fits most” wording in the moment, which I understand is the cause of much of the dissatisfaction in this thread. We will work to improve this!
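As a rough illustration of what “partial” means here (a sketch, not an official diagnostic - the exact headers a given site sees can vary), standard response headers hint at whether a response came from the edge cache or required an origin fetch:

```python
# Sketch: inspect standard HTTP headers to guess whether a response
# was served from the CDN edge cache. The URL is a placeholder.
import urllib.request

req = urllib.request.Request("https://example.netlify.app/", method="HEAD")
with urllib.request.urlopen(req, timeout=10) as resp:
    print("status:", resp.status)
    # A non-zero Age header usually means a cached copy was served.
    print("age:", resp.headers.get("age"))
    print("cache-control:", resp.headers.get("cache-control"))
```

Sites whose traffic kept hitting warm cached copies like this saw little impact, which is the sense in which the incident was partial.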

So, how will we do better in the future? We have a few plans:

  1. For this and every impactful incident we have, there is always a company-wide learning review meeting during which we discuss not only what went wrong, but what we can improve and how we have worked to prevent the same problem from resurfacing. I bring feedback like what has been provided in this thread (and via twitter, and via our helpdesk) to these meetings, so your voice will be heard.
  2. As Perry mentioned, I have added a discussion point to our scheduled learning review for this incident, about how we can classify impact in future incidents more obviously and - from your point of view - more correctly. I do see that this seemed like much more than a degradation from where you sit, and we can and will do better!
  3. We are always open to your suggestions. There have been some good ones in this thread but if you think “gee, if only they’d do X that nobody has mentioned yet, everything would be so much better!”: please let us know that feedback. Another one of our company values is that the best idea can come from anywhere, including our customers, so I would be happy to discuss further your suggestions for improving this and related processes in the future. I can assure you that “improve service stability” is already on the list as a top priority, so we are probably looking more at suggestions around communications than anything else, but feel free to post other ideas and we will pass them on to the relevant teams on your behalf.

I want to close by echoing what Perry had to say earlier, that this is not the level of service we intend to provide, and assure you that we are working to improve both stability and our communications with customers around problems when we miss the mark.

Please let me know if you have any remaining concerns or suggestions, and my team will be happy to discuss further and ensure that other stakeholders and decision makers in the business see any feedback which could help us improve in the future.

5 Likes

The one thing I am missing here is: “We will make sure to send out AT LEAST an email to all affected customers, the moment it’s clear that sites are down”

Hi @fool thank you for the thorough explanation.

Is it possible that there’s a gap in the logging - e.g. containers that didn’t start didn’t produce logs, or a service-connection bug stopped logs from being transported?

This would explain the gap between what you describe and what we observed.

We saw 100% of requests fail with 500-series errors (for hours).

Regarding the frustration: imagine your site goes down and your carrier refuses to acknowledge, even in hindsight, that there was a downtime incident.

This is why I think there were failure modes you didn’t see in your charts.

Could you tell me what sort of site you have? The impact, as I understand it, was more about what your site’s layout is - do you use proxying, or Functions? Password protection? On-demand Builders? Is your site used constantly (and thus would be in cache), or rarely (and thus would be uncached content, with which we were struggling)?

Certain features were heavily impacted; most customers do not use many of those features though.

Literally just static HTML - nothing that could break. We also tried connecting via alternative routes, bypassing the custom domain and using the netlify.app URL directly, etc., and everything returned 500 or 503 throughout.
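Concretely, the cross-check looked roughly like this (both hostnames are placeholders) - the same 5xx from both the custom domain and the netlify.app URL points at the platform rather than at DNS or our domain configuration:

```python
# Sketch of the cross-check: request the same site via both hostnames.
# Both URLs below are placeholders.
import urllib.request
import urllib.error

for url in ("https://www.example.com/", "https://example.netlify.app/"):
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code  # 5xx responses surface here
    print(url, "->", status)
```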

We were mid-build when the incident started. Is it possible that the update and the build happening together caused the service to be “lost” in a way that didn’t emit logs (for example: internal auth became invalid, or an IP address changed and communication with the pod was lost, which could include log transport)? Or some other form of corruption?

Our site was literally static HTML. No tooling. No build step. Obviously, HTML can’t emit 500-series errors on its own. And it wasn’t intermittent, and it certainly wasn’t a latency issue. I think this was not the issue you describe in your incident report, but likely some complication of it.

The issue you describe is definitely not what we experienced. What we experienced was certainly an outage.

No advanced features. Pure HTML. But a fresh build would have invalidated the cache. Could a build running during the issue have written a corrupt cache?

Something along these lines would make sense, as you seem quite sure there was no complete outage for any users, while we are certain there was. So this leaves failure modes which are invisible to your observability.

EDIT: I’ve hit my max replies for my first day on the forum, so I can only reply by editing here. @perry my API ID is ccf08d9a-467a-458e-bfdd-9ea35462e6a4