How do I prevent Google from crawling domains that are not my custom domain?

davidf · December 8, 2021, 3:37pm

My site is available at: https://rubber-ring.co.uk

But since I’m deploying through Netlify, the content of my site is also available at a number of other domains:

The base .netlify.app domain name - https://rubberring.netlify.app
The staging branch of the netlify.app domain name - https://staging--rubberring.netlify.app
Deploy preview urls - e.g. https://61b0cc8ce76df90007091d1b--rubberring.netlify.app

My understanding is that Netlify have automatically added a canonical link tag to the response headers of the base .netlify.app domain; checking rubberring.netlify.app there is indeed a canonical link tag.

Unfortunately, this link tag does not appear to be present in the response headers of the staging branch or the deploy preview urls.

I’m wondering if there is a way to prevent Google from indexing any of them at all?

I’ve tried creating a _headers file in my build path with the following code, but I’m not seeing any difference in the response headers:

https://rubberring.netlify.app/*
   X-Robots-Tag: noindex

http://rubberring.netlify.app/*
   X-Robots-Tag: noindex

https://staging--rubberring.netlify.app/*
   X-Robots-Tag: noindex

http://staging--rubberring.netlify.app/*
   X-Robots-Tag: noindex

https://*rubberring.netlify.app/*
   X-Robots-Tag: noindex

http://*rubberring.netlify.app/*
   X-Robots-Tag: noindex

Could you please guide me in the right direction for how I need to set up my _headers file? And also, how

Thanks

hrishikesh · December 8, 2021, 5:13pm

_headers, unlike _redirects, doesn’t support domain-based matching. You can either:

add a robots.txt on those deploys
add noindex meta tag in those deploys

davidf · December 8, 2021, 7:42pm

Hi Hrishikesh

Thanks for responding. Since the code in my staging branch is more or less the same as the code in my main branch, it sounds like option 1 is going to be the best approach. I have a follow up question…

Do you know what the contents of my robots.txt file needs to be in order to

Keep the main branch indexed
Keep the staging branch (https://staging--rubberring.netlify.app) and deploy previews ( e.g. https://61b0cc8ce76df90007091d1b--rubberring.netlify.app) un-indexed

It would be great if you could provide an example of how the file should look

Thanks

nathanmartin · December 8, 2021, 11:50pm

@davidf Ideally to achieve the indexing of the main branch, but not the staging/preview deploys, you should output different robots.txt files depending on the “Deploy Context”.

I explain the approach here regarding passwords, but it’s much the same for adjusting the output for any purpose:

Effectively you would configure your contexts to run a build command that outputs a permissive robots.txt for your main branch and a restricted one for any other branch.
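
As a rough sketch of that idea (not the code from the linked post): Netlify sets a `CONTEXT` environment variable during builds (`production`, `deploy-preview` or `branch-deploy`), so a small Node script called from the build command could pick the file contents. The file name `write-robots.js`, the helper names and the `public` output directory are assumptions for illustration only.

```javascript
// write-robots.js (hypothetical file name), called from the build command.
const fs = require('fs');
const path = require('path');

// Return the robots.txt body for a given Netlify deploy context.
// Only "production" is left crawlable; every other value is blocked.
function robotsTxtFor(context) {
  const rule = context === 'production' ? 'Disallow:' : 'Disallow: /';
  return `User-agent: *\n${rule}\n`;
}

// Write robots.txt into the publish directory and return the file path.
function writeRobotsTxt(outDir, context) {
  fs.mkdirSync(outDir, { recursive: true });
  const file = path.join(outDir, 'robots.txt');
  fs.writeFileSync(file, robotsTxtFor(context));
  return file;
}

// Only write the file when CONTEXT is set, i.e. during a Netlify build.
// "public" is an assumed publish directory; adjust it to your project.
if (process.env.CONTEXT) {
  writeRobotsTxt('public', process.env.CONTEXT);
}
```

Defaulting to the restrictive file for anything that isn’t `production` means a new or unexpected context stays blocked rather than accidentally crawlable.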

davidf · December 9, 2021, 7:22am

Hi Nathan

Thanks for sharing.

I did some Googling around the details you shared, and I found an article here which discusses your solution:
https://www.jondjones.com/frontend/jamstack/how-to-create-a-robots-txt-with-netlify-that-works-on-any-environment/

I noticed that it mentioned a further solution using next-sitemap, which is a package I am actually already using in my project.

It seems that I’ll just have to set the config to generate the robots.txt file based on whether the ‘ENVIRONMENT’ .env variable is production or not. The one part of this that I’m unsure about is how do I make sure the ‘ENVIRONMENT’ variable is production in my main branch but ‘DEVELOPMENT’ in the rest of my branches? This article suggests that it’s possible, but I’m not sure that Netlify allows different env variables for each environment?

Edit:

So did a bit of further digging through the docs and it looks like you can set the environment variables in the toml file.

My toml file now looks as follows (not sure if all of these are necessary, but making sure I cover all bases)

[context.production.environment]
  ENVIRONMENT = "prod"
[context.branch-deploy.environment]
  ENVIRONMENT = "dev"
[context.staging.environment]
  ENVIRONMENT = "dev"
[context.deploy-preview.environment]
  ENVIRONMENT = "dev"

My next-sitemap.js file looks as follows

let policy = {
  userAgent: '*',
};

if (process.env.environment !== 'prod') {
  policy.disallow = '/';
}

module.exports = {
  siteUrl: process.env.NEXT_PUBLIC_ROOT,
  generateRobotsTxt: true,
  robotsTxtOptions: {
    policies: [policy],
  },
};

Is this correct, and is there a way for me to test that the correct robots.txt files are successfully being generated?

Also, is there a way to delete all of the previous deploy previews?

nathanmartin · December 9, 2021, 8:13am

If it works, it works.
I’d just check the robots.txt by either visiting the /robots.txt url of the generated site, or by downloading the output of a specific build e.g.

There is no way to delete deploy previews (other than perhaps making a request to Netlify support).

There’s a thread here that you can lend your voice to:

davidf · December 9, 2021, 9:34am

Thanks for the further info Nathan.

I’ve tried this out and, after capitalising ‘process.env.ENVIRONMENT’ in next-sitemap.js, I’m now getting the result I want for the main/production branch and the staging branch.

Just an issue I’m still dealing with:

I’m still having problems with the deploy previews (the versions of the site that are accessed by clicking the ‘Preview Deploy’ link in your screenshot). They seem to have the same robots.txt as the main/production site, despite the ENVIRONMENT variable being set to ‘dev’ in the netlify.toml file under [context.deploy-preview.environment]. Do you know why this might be the case?

Digging into the docs on deploy contexts a bit further, the deploy-preview is defined as “a deploy generated from a pull request or merge request”. I’m not sure that this matches the scenario I have here as these preview links are not generated from a pull or merge request, they’re simply generated, along with an update to the main/production site, each time I push a change to the main branch. Is there another way I should be targeting these previews?

nathanmartin · December 9, 2021, 9:56am

I’ve never utilised the [context.deploy-preview.environment] context myself, but I believe that link points to the result of the build which in this case would be the result of the main build… which should be indexed (hence having the robots.txt for main/production).

davidf · December 9, 2021, 10:04am

Hi Nathan

Thanks for clarifying. So it looks like we’re still without a way to prevent these previews from being indexed. I’m hoping someone from Netlify can chip in to say how to either target them with the netlify.toml file or disable them completely

nathanmartin · December 9, 2021, 10:24am

Netlify may also be able to clarify the “preview” terminology and features too.

Ultimately I’d imagine the previous deployments not being deleted is tied to both the CDN and the fact that Netlify lets you instantly re-deploy any previous build.

hrishikesh · December 9, 2021, 11:59am

Wow, long discussion to catch up on. Happy to share some additional insights.

On deleting previous deploy previews: yeah, there’s no way at all. Even contacting us is not a solution here, as deploys cannot be deleted.

On keeping those previews out of search engines: Netlify automatically does that for you. Those preview links send the noindex HTTP header:

(the last line in the above screenshot). This header is automatically applied to any permalink-based deploy so that it doesn’t end up in search engines. Almost all popular search engines respect this header and I’ve never seen my preview deploys indexed by Google.
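
If you want to check this yourself, a small Node script along these lines can help. It is only a sketch: it assumes Node 18+ (for the built-in `fetch`), and `hasNoindex` and `checkUrl` are helper names made up for this example.

```javascript
// Returns true when an X-Robots-Tag header value contains the noindex directive.
function hasNoindex(headerValue) {
  return typeof headerValue === 'string' && /\bnoindex\b/i.test(headerValue);
}

// Fetch only the response headers of a URL and report its X-Robots-Tag value.
async function checkUrl(url) {
  // A HEAD request is enough; the body is not needed here.
  const res = await fetch(url, { method: 'HEAD' });
  const value = res.headers.get('x-robots-tag');
  return { url, xRobotsTag: value, noindex: hasNoindex(value) };
}

// Usage (not run automatically):
// checkUrl('https://61b0cc8ce76df90007091d1b--rubberring.netlify.app')
//   .then(console.log);
```

Running `checkUrl` against both a deploy permalink and the custom domain should make the difference between the two visible.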

With that being said, here’s a little clarification about the contexts. Each build will show you what context it’s running with:

All the links of this deploy (the custom domain, the Netlify subdomain and the deploy permalink, that’s the preview deploy button), have the same context. So, you cannot individually target these links. You can use _redirects to redirect all those domains to the same domain, but that’s about it.

The contexts that were being discussed above (branch, deploy-preview) are the other two contexts usually used apart from production. They can be individually targeted with netlify.toml.

davidf · December 9, 2021, 1:22pm

Amazing! Thanks very much for sharing, not sure how I missed the x-robots-tag when checking the preview links as I did check
