What's up with the constant high latency and errors lately?

There have been incidents almost every day in July.

What’s going on here?

I’m about ready to move all my apps away from Netlify.

None of the resolution statuses indicate what actually happened, why it happened, or how Netlify intends to reduce these errors in the future.

I think clearer communication is required at this point.

Hey @darcy,

I run the Support team and would be happy to talk with you about this.

Let’s start with a message from our head of engineering - my boss Dana. This is a statement she made a while ago which still holds true:

Her message highlights a big part of why you’ve seen so many incidents listed there: we often list incidents that impact only a small handful of customers, because we have a company value of transparency and try to live it even when the news isn’t good. Why might you not be impacted by a reported outage? For instance, if your site is cached well on our CDN, most incidents will usually have no impact on serving your site.

Another thing to consider is that we generally don’t use our statuspage for root cause analysis. We have posted a very small number there, but we usually send them 1:1 to customers who request them in the helpdesk, and we get few such requests. If the impact is larger, we’ll publish a blog post like these:

…but the reality is that while we do a root cause analysis for every customer-facing incident (and some that are not even directly customer-impacting, such as failures in our backups), we do it in a call a day or two after the incident, where all affected teams can join and collaborate on improving our service to prevent recurrence. That level of analysis does not happen during an incident, so we can’t report meaningfully on next steps directly on our statuspage at that point.

None of this is me trying to deny what you’ve observed: we have reported many service degradations, particularly in the past 2 quarters! Our network and operations teams are constantly working to mitigate problems with our CDN and service, and the result is that our uptime and performance have generally been improving. Even though we are reporting more downtimes, they tend to be shorter and less impactful, which means the work our teams are doing is paying off.

Of course, if you need a contractual guarantee of uptime, that is available on our Enterprise plans. I understand that isn’t what you were asking for, but for the next person who may be looking for a mitigation such as a refund in case of problems: it does exist, at our higher account levels.

That’s great to know.

So my questions concerning transparency regarding another issue (which has now recurred) were just missed then, and not intentionally ignored?

It was a direct request for the information, but based on this:

Would I need to request the information via the help desk and then re-post it myself here?

Your desire for transparency is commendable, but it cannot lead to our team spending (more) hours (than we already do) telling you everything in the world about our business. We do our best, and appreciate your tolerance of our needing to support millions of other customers as well.

As you posit, we weren’t trying to hide from anyone why drag and drop was misbehaving for some users lately.

For the most part, it has never worked well because folks try to upload massive sites that require too much memory in the browser for their machine, or their network is not stable, and those things cause the upload to hang. So, that’s why our team wrote this support guide 3 and a half years ago, one of the first ones we wrote: [Support Guide] My drag and drop deploy is stuck in "Uploading" status

This has been the primary cause of failure for this least-desirable and least-capable upload method in the past, and it will be in the future too.

But why did drag and drop uploads break more than usual for some people with some sites lately? A bug, since fixed, that popped up while we were making otherwise invisible-to-you improvements to our authentication in the browser while people use our UI. It mostly impacted sites with larger-than-average files, so our tests did not catch that misbehavior.

@fool While I understand where you’re coming from, you can’t have your cake and eat it too: you can’t both claim to be a company with a core value of transparency and simultaneously wax lyrical to me about how much of a waste of time it is for the team to provide it.

It also doesn’t matter if the “Drag & Drop” functionality is the “least-desirable and least-capable upload method”. You’re still offering it as a feature, and it’s not a great look to have it just hang when the builds are too large, even if you deem it to be “user error”.

I can’t imagine there’s anything preventing your team from engineering the solution better, such that it actually provides a contextual error on the build page sooner.

Your customers can only go on the information they’re provided, which is generally why they seek transparency. This has been an issue I’ve encountered several times with Netlify: being berated in a passive-aggressive and often condescending fashion because I didn’t know something that the internal staff know.

It’s quite simple: if you don’t tell people, they don’t know.

Your support guides are actually quite difficult to find, and the forum itself is difficult to search, so you can’t just point at something that isn’t contextual and go… “we said it here 3 years ago, we never need to say it again or improve upon our message delivery”.

I’m not sure whether you’re suggesting that, for the last 7 or so people reporting “drag & drop” issues, the reason was always that their “projects were too large”. But had you mentioned it in any of the previous threads, and especially at the point where I asked for transparency, it would have changed the way I responded to everyone encountering the issue.

For example, I would have advised them to check the size of their build and visit that support guide.