The 404.html page is shown to people who enter a nonexistent URL on the website. As the docs say: "If you add a 404.html page to your site, it will be picked up and displayed automatically for any failed paths."
I expect it to return a 404 status so Google knows this is a nonexistent page…
So I know now where the problem lies… I am using React with React Router, and for this to work on Netlify I have to redirect every path to index.html with the following rule in netlify.toml:
[[redirects]]
  from = "/*"
  to = "/index.html"
  status = 200
This is why even a page that does not exist returns status 200. Sorry for the confusion…
In React Router I can specify a “Not Found” route, and it is shown when no route matches. However, the response still has status code 200 instead of 404, which is logical, but not what I want. Any recommendation on how to display 404s in SPAs? I am really stuck here…
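With a catch-all rewrite to index.html, the status code is already sent before the client-side router runs, so a real 404 can only come from something that runs before the response and knows which paths the router handles. Here is a minimal sketch of that decision. The route list, the `:param` handling, and the function names are assumptions for illustration, not React Router’s or Netlify’s API.

```typescript
// Hypothetical list of the routes the client-side router knows about.
const KNOWN_ROUTES: string[] = ["/", "/about", "/posts/:id"];

// Split a path into its non-empty segments: "/posts/42" -> ["posts", "42"].
function splitPath(path: string): string[] {
  return path.split("/").filter((segment) => segment.length > 0);
}

// A pattern matches when it has the same number of segments as the path
// and every segment is either a ":param" placeholder or an exact match.
function matchesRoute(pattern: string, pathname: string): boolean {
  const patternParts = splitPath(pattern);
  const pathParts = splitPath(pathname);
  if (patternParts.length !== pathParts.length) {
    return false;
  }
  return patternParts.every(
    (part, i) => part.charAt(0) === ":" || part === pathParts[i]
  );
}

// 200 for paths the router can render, 404 for everything else.
function statusForPath(routes: string[], pathname: string): number {
  return routes.some((route) => matchesRoute(route, pathname)) ? 200 : 404;
}
```

For example, `statusForPath(KNOWN_ROUTES, "/posts/42")` gives 200 and `statusForPath(KNOWN_ROUTES, "/nope")` gives 404. The trade-off is that the route list has to be kept in sync with the router configuration.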
I know @hrishikesh’s reply was not for me, but I’ll answer for my own site as well.
Could you explain a use case for your users going to /404.html on your site?
That is not a normal use case for my site. That’s the problem.
I stand with @gregraven and @coelmay. Since Netlify did not make an error in serving /404.html, sending a 404 status code does not seem valid to me.
404.html should not be served successfully in this case though, due to its special meaning.
The soft 404 issue being talked about here says that a 404 page is returned for a page that’s not found but a 200 status code is sent, or in other words, the “server” sends an OK status code, but the client-side page shows an error.
That’s not quite correct. The problem is that the server sends both statuses, one in the header and another in the body, which are then received by the client. Only the status in the page body is correct in this case however.
If it’s being excluded automatically, that sounds like a good thing to me.
It is! However, search engines may fail to detect a soft 404 correctly for a variety of reasons. In addition, when one is detected, they may potentially penalize the rest of the website along with it for not following the standards.
Also, while the discussion mostly focused on SEO so far, I would like to remind people that there are other web crawlers beyond those used by search engines which may not check this discrepancy at all, causing additional issues.
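To make the “two statuses” point concrete, here is a rough sketch of what a soft 404 check might look like from a crawler’s side. The phrase list and function name are made up for illustration. Real search engines use their own undisclosed heuristics, which is exactly why the result cannot be relied upon.

```typescript
// Hypothetical phrases that suggest the body is an error page.
const NOT_FOUND_PHRASES: string[] = ["page not found", "404", "does not exist"];

// A response looks like a soft 404 when the header reports success (2xx)
// while the body text reports that the page was not found.
function looksLikeSoft404(status: number, bodyText: string): boolean {
  const headerSaysSuccess = status >= 200 && status < 300;
  const lower = bodyText.toLowerCase();
  const bodySaysNotFound = NOT_FOUND_PHRASES.some(
    (phrase) => lower.indexOf(phrase) !== -1
  );
  return headerSaysSuccess && bodySaysNotFound;
}
```

A page such as “How to fix a 404 error” served with a 200 would also be flagged by this sketch, which is the kind of false positive a text-based heuristic invites. A bot that only reads the status header never runs a check like this at all.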
You requested a file, we served that file successfully. It should be a 200
I’m confused by this. If Netlify supports custom 404 files per Redirect options | Netlify Docs, I would expect this to be handled all the way. Anyway, I have liked the initial comment to upvote this. Hopefully, we will have a solution.
Of course, /404 is not a page developers intend to make public; it’s rather a shortcut for handling all not-found pages. However, a web server is utility software serving content over HTTP, and it should not contain application logic such as checking whether a page has error text in it. Its responsibility is the HTTP level.
Google’s search engine logic is their business application logic. The “punishment” for having a soft 404 is that the page is not indexed. This is basically what we want, isn’t it? So there is no harm in /404.html being excluded from search automatically by Google.
However, as we’re all bothered with those warning messages in the Web Search console, we can fix it by simply disallowing /404 and /404.html in robots.txt.
Problem solved
P.S. This logic and this question have existed for as long as web servers have been around. No need for redirects.
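For reference, the robots.txt suggestion above amounts to a file with two Disallow rules. Here is a small sketch that builds that content. The helper name is hypothetical, and the only paths used are the two mentioned in the post.

```typescript
// Build robots.txt content that asks all crawlers to skip the given paths.
function buildRobotsTxt(disallowedPaths: string[]): string {
  const rules = disallowedPaths.map((path) => "Disallow: " + path);
  return ["User-agent: *"].concat(rules).join("\n") + "\n";
}

const robotsTxt = buildRobotsTxt(["/404", "/404.html"]);
// robotsTxt is:
// User-agent: *
// Disallow: /404
// Disallow: /404.html
```

Note that this only asks well-behaved crawlers not to fetch those paths. It does not change the status code the server returns for them.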
This conclusion does not take into account that soft 404 detection is a heuristic, or that search engine crawlers aren’t the only bots around.
The fact remains that as long as we have a 404 page that returns an HTTP 200 status, the contradictory nature of the information will cause issues one way or another for humans and bots alike, regardless of the backend implementation. The argument that “fetching the 404 page should return a 200” is a logical fallacy based on Netlify forcing the exposure of the 404 page against a dev’s intentions.
I did take the heuristic nature into account. The point here is that heuristics are the business logic of specific applications (search engines and other crawlers, content scrapers, hacker scripts, etc.).
The second point is that Netlify is a convenient web hosting service (pipeline, optimized engine, web server). So it has to be application-agnostic and have as few non-standard settings as possible, while still providing the best value for customers. That is why I think staying close to standard web server behaviour is a good decision.
The third point: why do you care about those crawlers? Since you mentioned the SEO issue and cited Google’s docs, I assume you want to play nice with the Google Search engine.
Of course, robots.txt does not prevent indexing, because it is only a suggestion from the website owner to crawlers and other bots about what is useful there. But if you put your 404 page in Disallow and don’t have it linked anywhere on your site or on other websites, you’re good. Google won’t index it. I do it for all my websites and I don’t have warnings about it in the console.
If your page is already in the index and you want to remove it, you can add a “noindex” tag, return a 410 status, or do whatever you want. I have no idea why the Google bot would look for /404 pages in the first place. Let’s leave it to them.
What you called a misconception was about people who think robots.txt would hide their pages from the public. So Google suggests adding password protection. We’re discussing SEO here, as you started with the “soft 404” complaint.
My point was that if those heuristics are prone to failure, they can’t be relied upon.
I agree, and that’s exactly why I created this thread: so that Netlify aligns with web standards, which it currently does not. (Obviously some people in this thread disagree with me about said standards, but I have yet to see authoritative evidence in support of their arguments.)
I care about my website having a good user experience, regardless of whether a human or a bot uses it.
You have no control over that beyond your own websites though, hence why it’s not a good solution to prevent search engine indexing despite a good probability of success. That’s one way search engines may find pages disallowed by robots.txt. And besides, it doesn’t resolve the core issue anyway.
Returning HTTP 410 in Netlify for the 404 file requires the same workaround from my initial post to be effective, so that doesn’t help. As for the “noindex” solution, I would like to point out that it doesn’t work along with a robots.txt Disallow directive, and that it’s just a worse workaround for the soft 404 problem than the one I proposed in my initial post as it only applies to search engines and has other negative SEO consequences.
I have no idea how anybody would ever believe such a thing, as it should be obvious from the name itself that robots.txt only applies to robots. This is not what I was referring to, and I’m pretty sure that’s not what the Google documentation I linked to in my previous post meant either. The big red warning box there clearly explains that the misconception is about hiding pages from search results, not from the public.
That was never my intention. SEO is just one aspect of the soft 404 problem and the most obvious one to web administrators, hence why I gave it as an example. The Internet Archive relies on HTTP response codes for archival. API users rely on HTTP response codes in their business logic. Security assessment tools partially rely on HTTP response codes for analysis. Returning the wrong response may have far-reaching, unexpected, and undesired consequences for a bunch of systems. That’s what I want Netlify to fix.
I believe the existence of a contradiction between HTTP headers and body should be enough to prove that the current behavior is obviously wrong, unless some kind of standard authority has allowed said contradiction to exist for historical reasons, hence my previous comment on that matter.
That said, that is a good point @nathanmartin. While not conclusive in itself, I do have some authoritative evidence: RFC 9110.
What this basically says is:
The concept of a “resource” is arbitrary in nature.
HTTP only defines an interface to resources.
The 2xx class of status codes indicates that the client’s request was accepted.
The 200 status code for GET requests means that the response represents the target resource.
The 4xx class of status codes indicates a client error.
The 404 status code means that the target resource was either not found, or that the server does not want to disclose its existence.
As such, the question becomes: is a 404 page a resource, and if so, should a server allow and disclose said resource through its HTTP interface?
While the argument “a 404 page is a resource” makes sense in the context of Netlify since it currently interprets all deployed HTML files as resources, I can’t think of a good reason why anybody would want to accept such a request and/or disclose the existence of such a target resource due to the issues raised so far in this thread. Even if someone had the best custom 404 page in the world, I would argue that it would be better represented as a separate, more adapted resource rather than a contradictory one, as 404s should only represent client errors.
So while I don’t believe there is an authoritative source that strictly allows or disallows soft 404s, this reasoning is why they are generally considered bad practice by experts in the field, and why I personally consider them bugs.
Regardless, I believe preventing disclosure of the 404 page should at the very least be an option to Netlify users since RFC 9110 allows it and as web administrators should ultimately be the ones in control of their HTTP interface.
For what it’s worth, I agree with you that the existing handling is bizarre.
I’m not proposing this as the solution, but imagine if the string response for a “page not found” was set via a different means, perhaps the Netlify UI. It would be hard for an argument to be made that requesting /404 should return a 200 status. With no file there, it’d be crazy to return 200 for that route.
It’s the chosen implementation, where the string response for a ‘page not found’ is defined in a convenient /404.html file, that results in it being a file, deployed the same as any other, whereby it seems sane that a direct request would return a 200 status.
But why is it a deployed resource the same as any other?
Let me try to gather the conclusions we have reached in this discussion:
1. There is no standard that says that an existing page (even one consisting of an error message) should be returned with a 404 status.
2. By the standard (recommended logic), a 404 status stands for “isn’t there” or “existence is not disclosed”.
3. The main problem (though not 100% of it) is that, although it is not clear how, your /404 page got into Google’s index and generated a warning for you in Search Console.
4. Netlify has a redirects config that you can use to work around the issue.
5. You believe the 404 status for /404 and /404.html should be a simple config option rather than a non-obvious workaround.
Have I stated it correctly?
Regarding point 5: when you configure error_page 404 /404.html, nginx behaves the same as Netlify, and a direct request to /404.html gets a 200 OK. If you want a 404 status, you need to add an extra rule:
location = /404.html {
    # "internal": direct client requests get a 404,
    # while error_page can still serve the file.
    internal;
}
Honestly, I can’t see any problematic consequences in a real-world scenario that would justify that change in behaviour. Perhaps you could tell us more about your case, and we’d be able to think of a solution.
As for a Netlify feature: if I were making that decision on their side, I’d probably either give you some response status override command similar to redirects or, more likely, ignore it as a rare case. Any feature you add accumulates on the legacy pile that you need to support and deal with in the future, when users change and want other things. I’m not convinced that a special heuristic rule to send a 404 status based on a page name (or even based on which page is set to handle the 404 error code) is logic that belongs in the generic web server part of Netlify.
I should have mentioned this earlier, but my problem with using an edge function for this use case is that it is anti-Jamstack. One of the core principles of Jamstack is pre-rendering, and having an edge function that always returns the same content all the time violates that principle. This can be partially mitigated by configuring manual caching for that edge function, but this would still increase latency for some users.
That said, I believe this is a better workaround than my original one with the silly redirects, but I wouldn’t consider this a proper solution still.
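For readers who want to picture the edge function workaround, the core of it is a small status override. The sketch below shows only that decision logic in TypeScript. The function name and path list are assumptions, and the wiring into an actual edge function (fetching the upstream response and re-emitting it with the new status) is omitted because it depends on the runtime.

```typescript
// Paths at which the custom 404 page is directly reachable (assumed here).
const NOT_FOUND_PAGE_PATHS: string[] = ["/404", "/404.html"];

// Decide which status to send: a direct, successful request for the 404 page
// is downgraded to 404; every other response keeps its upstream status.
function statusForNotFoundPage(pathname: string, upstreamStatus: number): number {
  const isDirectRequest = NOT_FOUND_PAGE_PATHS.indexOf(pathname) !== -1;
  return isDirectRequest && upstreamStatus === 200 ? 404 : upstreamStatus;
}
```

With this, `statusForNotFoundPage("/404.html", 200)` gives 404, while `statusForNotFoundPage("/about", 200)` stays 200. The body served is the same pre-rendered file either way, which is why running code on every such request still feels at odds with the pre-rendering principle mentioned above.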