Stability & Downtime Issues - Standard Network (Pro/Business)

There seems to have been an increase in the frequency of “downtime” and other issues with sites on the “Standard Network”.

Overnight, a client reported that their site was “down” 10 minutes before the Netlify status page was updated.

Is there a post that I’ve missed that explains the increase in incidents this year compared to previous years?

The overview page currently indicates:
March - 22 Incidents
April - 13 Incidents
May - 15 Incidents

It could be that it’s the same as always, that you’ve improved reporting, and that we were just lucky in previous years, but even our own monitoring indicates fewer issues historically.

In recent months we’ve been contacted by every client at least once to report “downtime”, and having some overarching understanding as to “why” would help when trying to allay their concerns.

Howdy @nathanmartin, I appreciate you reaching out! This has been a tough time and there isn’t any denying that. We have tightened our reporting and want to ensure we report system issues as “accurately as possible”. Yes, you will have seen more events on the status page, because we don’t want any surprises for you and our other customers that rely on Netlify. There isn’t a significant uptick in events per se, just better reporting as we continue to refine our culture of reliability.

We have a LARGE global network, and due to network attacks some nodes get hammered harder than others. Calculating the scope and impact of one of these attacks is something we continue to refine, since we have traffic rules in place that ensure uptime and move traffic as events happen. Most customers on affected nodes don’t experience issues, just some short-term latency as traffic is routed around; however, some customers might experience different side effects.

Uptime for our standard edge is our #1 priority, and we will continue to improve this service as we also continue to improve our overall incident response and status reporting. We want to get to a place where we can drill in even finer and show where customers are specifically impacted, rather than just reporting that the standard edge is impacted. Hopefully that makes sense, and if you would like to chat about it more, let me know! I am happy to schedule a call with you, Nathan!

Thanks @danaiszuul

No need for a call. We’re currently jumping through hoops migrating away from Netlify (unrelated to downtime, just fallout from the recent massive price increase and the associated poor communication), so it’d be a waste of time for both of us.

It’s also better if details are public and transparent, to assist others with similar questions and provide confidence in the overall system.

Improved reporting would explain the higher totals per month on your status page, but it doesn’t directly account for the increase in issues reported by our own monitoring and raised as support tickets by clients.

The only explanation that I can see in your response is “network attacks”?
Is this problem now more prevalent than it was in previous years?
Has something about the network changed that gives these attacks a greater impact?

I had heard via other channels that you were performing a “cloud migration” as a possible explanation for downtime/issues this year. Was/Is that the case?

It’s a bummer that you are moving on and that we couldn’t serve you :frowning: However, since you are still open to providing feedback and helping our customers, I hope we can win you back one day!!

Regarding your questions: as our global network advances, the opportunity for bad actors increases, and we are not immune to people hitting us maliciously at greater volume. We did perform a system upgrade/migration in February, but that was in a different part of the stack, so these issues are not related to that work.