Improve outage information flow

First off, thank you all for your work today. I’m sure it was a tense and difficult day. I appreciate all you do at Netlify, so please take my note here in the spirit of constructive criticism. We all know outages happen, but information makes a critical difference.

Today’s outage of sites was reported ten minutes after I got my first alert from my monitoring service, and a full nine minutes after my first client wrote to find out what was going on. During that time I refreshed the status page and waited for tweets, but I had nothing to tell my clients. I don’t know if there’s any sort of standard; perhaps I’m expecting too much. But I can say it was a lonely time.

Those of us who build sites for clients work with non-technical people who are responsible for their organization’s website. Each and every one of them feels as though their site is mission critical, and they want to know what’s going on. I realize that you may not have much more information than what you publish, but more detail and more frequent updates would be a welcome change.

The coincidence of today’s outage following so quickly after last week’s build issue made some of our clients very nervous. It reflects poorly on our judgement, even though we have hard data on the exceptional uptime our sites experience.

But, essentially, we have to tell the story of what’s happening to our clients so they feel reassured. If we have nothing to say, it becomes an agonizing time; if we are empowered with knowledge, our clients feel confident.

Thanks!
Bud


Hey Bud, and sorry to have been involved in this clearly awful situation. I can speak to your specific points, but at a high level: we tried hard to be communicative, I can see we missed the mark, and we are looking to do better in the future.

We did file a status page incident as quickly as we could, but I’ve added a workflow step that will hopefully get an “investigating” post out before we understand the problem. For today’s problem, we waited until we fully understood it before posting, which is what led to it taking a few minutes longer than our availability monitors, and which roughly matched your experience: ~10 minutes from alerts to status page post. This was not a situation that escalated slowly, so we all found out about it at the same time; when we see a problem before customers do, we’ll often beat it to the punch and announce it if we can’t turn things around before service impacts are felt. To address your (I think non-rhetorical) question: our standard for publishing anything on our status page is, and will likely remain, “as quickly as we can get you useful, non-misleading information”.

We do intend to publish more frequently. Frantic work was going on for the entire duration of the outage, but we should get better at updating even if the situation doesn’t change, just so you and your clients know we’re still working on things. We have some dreams of automating that process but have not yet managed to do so; that doesn’t mean we can’t manually do better there in the future. Few of our outages run this long, so it’s not usually a problem, but I understand that makes it all the more important that we communicate well when something DOES go wrong and stays wrong.
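For what it’s worth, the automation we have in mind is roughly this shape: a small heartbeat job that posts a “no change yet, still working” update on a timer until the incident is closed. The sketch below is Python against a made-up endpoint; the URL, token, field names, and the is_resolved hook are placeholders for illustration, not our actual status page tooling.

import time
import requests  # third-party HTTP client: pip install requests

STATUS_API = "https://status.example.com/api/incidents"  # hypothetical endpoint
API_TOKEN = "REDACTED"  # hypothetical auth token

def post_heartbeat(incident_id: str, minutes_elapsed: int, interval_minutes: int) -> None:
    # Post a "no news yet, still working" update so readers know the incident
    # is still being actively handled even when nothing has changed.
    requests.post(
        f"{STATUS_API}/{incident_id}/updates",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "status": "investigating",
            "body": (
                f"Still investigating; no change in impact in the last "
                f"{minutes_elapsed} minutes. Next update in {interval_minutes} "
                f"minutes, or sooner if anything changes."
            ),
        },
        timeout=10,
    )

def heartbeat_loop(incident_id: str, is_resolved, interval_minutes: int = 15) -> None:
    # is_resolved: a callable supplied by the incident tooling that returns True
    # once the incident has been closed out.
    elapsed = 0
    while not is_resolved():
        time.sleep(interval_minutes * 60)
        elapsed += interval_minutes
        post_heartbeat(incident_id, elapsed, interval_minutes)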

Our usual pattern is not to try to explain during the outage, but afterwards in a learning review post, such as this one:

We don’t do this for every case, as sometimes the root cause analysis is simply “database configuration issues were seen with a new config; we’ve reverted”, but in this case I’ll at least get something public-facing here that talks about the problem and how we are mitigating or preventing its recurrence. But hopefully you can understand that we don’t always know the shape of what’s wrong, in customer-ready wording, during an incident.

I’ll post a follow-up in this thread, and we’ll try to do better at communication in the future. Thanks again for your thoughtful and level-headed post; I’ll be happy to address things further if there are follow-up questions.


Wanted to follow up with this post from our learning review: Basic ADN service degradation incident report