Why is Netlify down so often the last few days?

Hey, my site is https://geobingo.io/ and it’s been going down a lot over the last couple of days, since I switched back to Netlify :frowning: Look at the gif.
It seems like the issues on https://www.netlifystatus.com/ are not really resolved.

Hi

This is related to the incidents we have publicly announced on our StatusPage:

This is not the service level we aim to provide. This has been a challenging time for our reliability, and we are actively and holistically working to improve stability of our entire platform with several teams of senior staff focusing on this right now.

We apologise for these issues, but as I mentioned, we’re actively working on stability as a whole.

Every time I checked, it said All Systems Operational. I think I am just unlucky. When I made the post there was no incident listed for the 31st :frowning: Also, it happened again after the last incident. Please don’t mark it as resolved if it’s not.

@s0er3n thank you for your feedback.

The incidents are not related; they happened at different times and have different resolutions marked on them.

My advice is to keep monitoring the status page. There’s a natural lag between an event happening and the status page update, because our team needs time to analyse what’s going on and properly convey the message externally.

Usually we update the status page within the same hour of the event (most of the time sooner).

Hope this helps!

Hey @gualter, thank you for your honesty, and I am glad to hear Netlify are working on it. I am also very thankful for Netlify’s status website, which reports outages.

It is quite worrying that there have been 26 incidents in August, and there hasn’t been a month since January with fewer than 10 incidents. There hasn’t been an incident-free month since August 2019.

From my perspective as a developer, I get a lot of heat from my employer asking why the website is always down. My answer is usually ‘our hosting provider is having problems’, but there’s only so many times I can give this answer, so I have started to actively look at other providers.

Can you provide any other information about the reliability work the teams are doing? Do they know the root cause, and what’s the plan to fix it? Is there a rough time frame for when the work will be completed? How many incidents does Netlify deem acceptable? What steps are you taking to make sure that this unreliability doesn’t happen again in the future?

I really don’t want to move off Netlify as I love the service and the support team are fantastic, however without answers to the above questions I just look silly when I am regularly asked what the issue is with our site.


@gualter do you have a time frame for when it’s going to be resolved? Then I can move my site off Netlify to my own server or Railway for the time being. Is the DNS server also affected by the outages? Even the Netlify site is down for me. I am running my own monitoring tool and I get the status almost immediately. Shouldn’t this be possible for your status site too? The Netlify status page feels a lot like it’s hiding some information on purpose, like uptime etc. Also, it seems like incidents only get shown after they are resolved, so it always shows operational. It’s not very helpful if you are searching for the cause of a problem and the status page doesn’t reflect the real status.
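
For readers who want a similar independent check, here is a minimal sketch of a self-hosted uptime probe in Python, using only the standard library. It is not the poster’s actual tool: the function names, the 60-second interval, and the rule that any 2xx/3xx response counts as "up" are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Perform one GET and return the HTTP status code.

    Network-level failures (DNS, refused connection, timeout) raise OSError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib reports 4xx/5xx responses as exceptions; keep the status code.
        return exc.code


def is_up(url, fetch=fetch_status):
    """Assumption: the site counts as 'up' only for a 2xx or 3xx status."""
    try:
        return 200 <= fetch(url) < 400
    except OSError:
        return False


def uptime_percent(results):
    """Share of successful probes in percent, or None if there are no samples."""
    if not results:
        return None
    return 100.0 * sum(results) / len(results)


def monitor(url, checks, interval_s=60.0, fetch=fetch_status, sleep=time.sleep):
    """Probe the URL `checks` times, waiting `interval_s` between probes."""
    results = []
    for i in range(checks):
        results.append(is_up(url, fetch))
        if i + 1 < checks:
            sleep(interval_s)
    return results
```

Calling `monitor("https://geobingo.io/", checks=60)` would probe the site once a minute for about an hour, and `uptime_percent(...)` on the result gives an observed uptime figure that does not depend on the provider’s status page.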

Unfortunately it may be as this reply says: [Incident] Netlify service outages affecting ability to depend on Netlify - please comment here - #39 by fool

  • you decide that “at a small price like $0, $19 or $99/mo, I can live with some downtime” since you presumably get as much value out of our platform typically as you are paying, or you would have picked a different platform that offers you better value and potentially better uptime.

We understand the impact recent degradations have had on our customers and apologize for the service interruption. Netlify considers the performance and availability of customers’ sites our top priority and we did not live up to that standard here. In this post, we’ll detail the measures taken and the work underway to mitigate the network performance as well as the improvements we have planned.

Any customer downtime is taken seriously and reflected in our commitment to transparency. Whenever there is customer impact we seek to provide clear and honest information to our customers on our status page and we strive to have metrics that demonstrate the impact as accurately as possible.

In August, CDN availability was impacted as a result of DDoS attacks against a number of critical service providers. While we have mitigated much of the intended harm, we know some customers have been significantly impacted by these events.

When an attack occurs the impact is typically increased latency or 5xx HTTP response errors to visitor web requests in a subset of CDN locations. When netlifystatus.com reports an event it does not necessarily indicate a complete outage, in fact, customer sites are primarily impacted by increased latency as our CDN tries to route traffic to less-busy regions, though sometimes impact will include a percentage - rarely 100% - of error responses as well.

We are committed to providing the best customer service and the best development experience possible. We have completed recent updates to our defensive tooling and are prioritizing improvements to deliver advanced load shedding and circuit breaking to prevent impact from significant traffic volumes associated with these events.

The Netlify Engineering team has prioritized the work planned on our roadmap to implement these advanced services as quickly as possible. This work includes, but is not limited to, further scaling capacity to allow us to absorb extremely large DDoS attacks until properly mitigated, advanced traffic processing, attack fingerprinting, IP reputation flagging, and architectural changes.

We look forward to a brighter future having gained considerable knowledge from these events. Netlify works continuously to minimize and prevent these threats to its network. Our forthcoming infrastructure improvements will provide resiliency to all our points of presence. As attack sizes grow in the future, advanced application logic will quickly detect attacks, isolate traffic, and scale to meet traffic needs.

As a company that hosts critical systems for our customers, we are trusted with the responsibility of ensuring those systems remain available. We hope the transparency and active measures in this post can regain some of that trust.

We migrated away from Netlify because of all those outages.

As we figured out, it doesn’t matter whether you pay $99 per seat or nothing: your app will run on a shared network (infra) with millions of other apps, and there is NO SLA guarantee at all. The only way to get a reliable service, according to Netlify, is upgrading to ENTERPRISE, which starts at $3,000 per month.
But even then they have many incidents (as mentioned in this thread).

Also, the status page does not show small disruptions that can make your app unavailable on the shared network. It’s primarily monitoring the “premium” network.

It took us two days to migrate to GCP (our cloud platform).
Not only does it have a 99% SLA, it’s also much cheaper and faster (since it’s not low-quality shared infra that runs millions of apps).

The only Netlify features we were using were the CDN, redirects, and request rewrites, all of which can be easily achieved with GCP and configured with Terraform.

Deployments can be executed in GitHub Actions with caching.
That’s also much faster than Netlify builds.

The only things we lost were rollbacks and atomic deployments.

But we never really had to roll back, and just running rsync + a CDN cache invalidation works well for us.
If anyone needs technical help to achieve the same with GCP, I’ll be glad to help!
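
The "rsync + CDN cache invalidate" deploy described above could look roughly like the following Python sketch. It is not the poster’s actual setup. It assumes the `gsutil` and `gcloud` CLIs are installed and authenticated, and the build directory, bucket name, and URL map name are hypothetical placeholders.

```python
import subprocess


def build_deploy_commands(build_dir, bucket, url_map):
    """Return the two CLI invocations for a deploy: sync files, then purge the CDN.

    All three arguments are placeholders supplied by the caller.
    """
    # Mirror the local build output into the bucket; -d deletes remote files
    # that no longer exist locally, -m runs the transfer in parallel.
    sync = ["gsutil", "-m", "rsync", "-r", "-d", build_dir, f"gs://{bucket}"]
    # Invalidate every cached path on the load balancer's URL map.
    invalidate = [
        "gcloud", "compute", "url-maps", "invalidate-cdn-cache",
        url_map, "--path", "/*",
    ]
    return [sync, invalidate]


def deploy(build_dir, bucket, url_map, run=subprocess.run):
    """Run the commands in order and stop at the first failure."""
    for cmd in build_deploy_commands(build_dir, bucket, url_map):
        run(cmd, check=True)
```

A CI job would call something like `deploy("dist", "my-site-bucket", "my-site-url-map")` after the build step. As the post says, this gives up atomic deployments: visitors may briefly see a mix of old and new files while the sync runs.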

When we reached out to Netlify they said:

All our plans besides the ENTERPRISE plan are for hobbyists who can accept occasional downtime (even the Business plan).
If we had known that in advance, we would never have chosen Netlify in the first place.

Can Netlify report on the uptime across time periods (August, July, June etc.), so we can make informed decisions on whether to host production sites with them? We use Netlify for our development infrastructure, but without this kind of transparency it will likely result in us hosting our site elsewhere, and ultimately migrating away from Netlify, all because we don’t know what the uptime is.

They have a “premium” network tier for the Enterprise plan.
The outage reports are mainly for this tier.

All other plans go to the “other tier”,

where millions of sites are hosted on the “other infrastructure”.
They don’t provide ANY SLA for the “other tier”.

PRO plan and FREE plan get the same “infra”.
If reliability is important for you, I suggest not using Netlify for production, unless you are an ENTERPRISE client.

In previous years Netlify referred to a reported uptime on Pingdom of 99.9% for their free plans: not a guarantee, just an observation of actual uptime. This was useful information.

We are a paying customer at $240/month and reliability is indeed important. Knowing that the observed uptime this year is xx.x%, even without an SLA, could be acceptable. However, we have no such information. There must be a lot of customers who require a site to be reliably online and will happily pay $240 (or even a bit more), but cannot take the leap to $3,000/month (Enterprise plan).
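
To put the uptime percentages quoted in this thread (99% and 99.9%) into concrete terms, here is a minimal Python sketch that converts between an uptime percentage and minutes of downtime. The function names and the 30-day month are assumptions chosen for illustration; neither provider’s SLA is defined this way in the thread.

```python
def allowed_downtime_minutes(uptime_percent, period_days=30):
    """Minutes of downtime that a given uptime percentage permits over the period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


def observed_uptime_percent(downtime_minutes, period_days=30):
    """Uptime percentage implied by a measured amount of downtime over the period."""
    total_minutes = period_days * 24 * 60
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must fit inside the period")
    return 100 * (1 - downtime_minutes / total_minutes)
```

Under these assumptions, 99% over 30 days permits 432 minutes (7.2 hours) of downtime, while 99.9% permits about 43 minutes. Going the other way, total outage minutes from your own monitoring can be turned into an observed uptime figure with `observed_uptime_percent`.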

If we wanted to migrate to an alternative means of hosting our production site, where we could have more information regarding expected uptime, what would be good options in this price range? Perhaps just a VPS where a merge to master triggers a GitHub Action that performs the production deploy? Or an alternative platform such as Cloudflare Pages?

I suggest using a reliable cloud provider like AWS, GCP, …

GCP is the cheapest.

  • create a multi-region bucket
  • configure a site for that bucket
  • enable CDN on that bucket + Cloud Armor if you wish
  • GitHub Action to rsync or rclone to that bucket
  • configure cache in GitHub Actions so the build and deploy take < 1m (faster than Netlify)

That’s it. I can share a single Terraform file that does it all in < 100 lines of code,
and the GitHub Actions setup if you need it. NP

I’ll just put it here

Thanks for digging up several different stale threads to post this resource.