JS files missing when running split tests

Site name: ae-prod (audioeye.com)

Recently started seeing an elusive issue related to split testing which I suspect could be Netlify related.

Context: We have 2 branches being served using split testing, each at 50% distribution. We’re using Gatsby V4.7.1, and I’ll post the package.json at the bottom of this post. I have successfully run split tests without this issue for the past few months, but something has changed in the last 1-2 weeks that has created issues.

We started noticing within the last few days that we are often seeing 400s and 404s on individual Webpack-bundled JS files, causing errors and partial page loading issues, but only on one of the two branches, and not always the same branch.
Unfortunately, this behavior has been hard to reproduce consistently, but I have confirmed that switching which branch is being served by manually editing the nf_ab cookie value and refreshing gives different results for a given Gatsby build/deploy. In some cases, refreshing the page seems to fix the issue, in others it does not. If I trigger a manual rebuild/deploy of both branches, the issue will sometimes go away completely, and in other cases it will change to the other branch. This is unexpected.
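
For anyone who wants to repeat the cookie-switching check outside the browser, here is a minimal sketch. It assumes the nf_ab cookie can be set to a branch name to pin a request to that branch (as I understand Netlify's split testing docs), and the function names are mine, not part of any Netlify tooling.

```python
import urllib.error
import urllib.request


def build_request(asset_url, branch):
    """Build a GET request that pins the split-test branch via the nf_ab cookie.

    Assumption: nf_ab accepts a branch name as its value.
    """
    return urllib.request.Request(asset_url, headers={"Cookie": f"nf_ab={branch}"})


def probe(asset_url, branches, timeout=10):
    """Request one asset once per branch.

    Returns {branch: (status_code, x-nf-request-id or None)} so that
    failing requests can be reported to support by request ID.
    """
    results = {}
    for branch in branches:
        req = build_request(asset_url, branch)
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                results[branch] = (resp.status, resp.headers.get("x-nf-request-id"))
        except urllib.error.HTTPError as err:
            # 4XX/5XX responses raise HTTPError; the headers are still available.
            results[branch] = (err.code, err.headers.get("x-nf-request-id"))
    return results
```

Calling `probe(url, ["branch-a", "branch-b"])` with one of the hashed bundle URLs from the page source (the branch names here are placeholders) should show whether the same file resolves under one cookie value and 404s under the other.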

Basically, it seems like one of the JS files isn’t being connected or routed correctly for a given deploy, causing parts of the page that rely on JS, such as embedded videos and forms, not to load. We suspect this is somehow connected to split testing, but it is hard for us to trace back more than we already have, as there are no errors in our deploy logs or consoles and we can’t reproduce it locally.
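
To find out which bundle is the one going missing, it helps to list every script a page references and then request each one. A rough sketch of the listing step, using only the standard library (the class and function names are hypothetical, and it only looks at `<script src>` tags, not preload links):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptCollector(HTMLParser):
    """Collect the src of every external <script> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            # Resolve relative paths such as /app-<hash>.js against the page URL.
            self.urls.append(urljoin(self.base_url, attrs["src"]))


def script_urls(html, base_url):
    """Return absolute URLs for all external scripts in the given HTML."""
    collector = ScriptCollector(base_url)
    collector.feed(html)
    return collector.urls
```

Each returned URL can then be fetched on its own to see which one comes back with a 400 or 404.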

I’ve attached several screenshots of the errors including request header info.
Any help would be greatly appreciated.

Package.json:

"scripts": {
    "build": "gatsby build",
    "clean": "gatsby clean",
    "dev": "gatsby develop -o -H 0.0.0.0",
    "develop": "gatsby develop",
    "serve": "gatsby serve",
    "lint": "eslint . --ext .js,.jsx",
    "lint:fix": "eslint . --ext .js,.jsx --fix",
    "lint:ci": "yarn lint --format junit -o results/eslint/result.xml",
    "lint:staged": "eslint --fix --ext .js,.jsx",
    "format": "prettier \"**/*.md \" --write",
    "cy:open": "cypress open",
    "cy:run": "cypress run",
    "cy:run:ci": "cypress run --browser chrome --reporter junit --reporter-options 'mochaFile=results/cypress/result.xml'",
    "prettier": "prettier",
    "prettier:fix": "prettier --write \"**/*.{js,jsx}\"",
    "test:e2e:dev": "cross-env CYPRESS_SUPPORT=y start-server-and-test dev http://localhost:8000 cy:open",
    "test:e2e:run": "cross-env CYPRESS_SUPPORT=y start-server-and-test develop http://localhost:8000 cy:run",
    "test:e2e:ci": "cross-env WAIT_ON_TIMEOUT=600000 CYPRESS_SUPPORT=y start-server-and-test develop http://localhost:8000 cy:run:ci",
    "test": "jest"
  },
  "lint-staged": {
    "*.{js,jsx}": [
      "yarn prettier --write",
      "yarn lint:staged"
    ],
    "*.{yaml,yml}": [
      "yarn prettier --write"
    ]
  },
  "dependencies": {
    "@emotion/css": "^11.1.3",
    "@emotion/react": "^11.0.0",
    "@emotion/styled": "^11.0.0",
    "@material-ui/core": "^4.12.3",
    "@netlify/functions": "^0.11.0",
    "@reach/skip-nav": "^0.16.0",
    "@use-it/interval": "^1.0.0",
    "@vimeo/player": "^2.16.3",
    "airtable": "^0.11.1",
    "aws-amplify": "^4.3.14",
    "emotion": "^11.0.0",
    "emotion-server": "^11.0.0",
    "gatsby": "^4.7.1",
    "gatsby-plugin-canonical-urls": "^4.7.0",
    "gatsby-plugin-emotion": "^7.7.0",
    "gatsby-plugin-image": "^2.7.0",
    "gatsby-plugin-manifest": "^4.7.0",
    "gatsby-plugin-material-ui": "^4.1.0",
    "gatsby-plugin-prismic-previews": "^5.0.3",
    "gatsby-plugin-react-helmet": "^5.7.0",
    "gatsby-plugin-remove-console": "^0.0.2",
    "gatsby-plugin-remove-serviceworker": "^1.0.0",
    "gatsby-plugin-robots-txt": "^1.7.0",
    "gatsby-plugin-sitemap": "^5.7.0",
    "gatsby-source-prismic": "^5.1.0",
    "get-contrast": "^3.0.0",
    "idx": "^2.5.6",
    "material-ui-popup-state": "^2.0.0",
    "netlify-lambda": "^2.0.15",
    "node-fetch": "^2.6.2",
    "prismic-reactjs": "^1.3.4",
    "react": "^17.0.2",
    "react-colorful": "^5.5.1",
    "react-dom": "^17.0.2",
    "react-focus-trap": "^2.7.1",
    "react-helmet": "^6.1.0",
    "react-hubspot-form": "^1.3.7",
    "react-lottie-player": "^1.4.0",
    "react-phone-number-input": "^3.1.46",
    "react-reveal": "^1.2.2",
    "react-select": "^5.1.0",
    "react-share": "^4.4.0"
  },
  "devDependencies": {
    "@babel/eslint-parser": "^7.15.8",
    "@netlify/plugin-gatsby": "^2.0.2",
    "@testing-library/cypress": "^8.0.1",
    "@testing-library/react": "^12.1.2",
    "babel-jest": "^24.7.1",
    "babel-plugin-transform-remove-console": "^6.9.4",
    "babel-preset-gatsby": "^2.7.0",
    "cross-env": "^7.0.3",
    "cypress": "^8.6.0",
    "eslint": "^7.32.0",
    "eslint-config-airbnb": "^18.2.1",
    "eslint-config-prettier": "^8.3.0",
    "eslint-plugin-cypress": "^2.12.1",
    "eslint-plugin-import": "^2.25.2",
    "eslint-plugin-jsx-a11y": "^6.4.1",
    "eslint-plugin-prettier": "^4.0.0",
    "eslint-plugin-react": "^7.26.1",
    "eslint-plugin-react-hooks": "^4.2.0",
    "gatsby-cypress": "^1.14.0",
    "gatsby-plugin-netlify": "^4.1.0",
    "husky": "^7.0.2",
    "identity-obj-proxy": "^3.0.0",
    "jest": "^24.8.0",
    "lint-staged": "^11.2.3",
    "prettier": "^2.4.1",
    "start-server-and-test": "^1.14.0"
  },
  "peerDependencies": {
    "gatsby": "^4.0.0"
  }
}

Hi, @Nicky_Evers. We replied to the support ticket about this. Please reply to the email we sent so we can troubleshoot this issue in our helpdesk.

However, if you don’t see that reply, please reply here to let us know.

Really hoping to hear back regarding this ticket.

Hi, @Nicky_Evers. Sorry for the delay. I did reply on the ticket (#84100). Please do feel free to reply to the ticket, as I had additional questions there. However, if you don’t see my reply there, please reply here to let us know.

Thanks @luke , I got your response to the ticket and will respond there.

@Nicky_Evers @luke we are having the exact same issue with our Gatsby 4 site when running split testing. I came upon this thread from Google (netlify split testing webpack 404), and it would be really helpful to update this thread with findings, as this is a site-breaking issue for us currently and is preventing us from reliably using split testing.

@jlevy-io Helpful to hear we are not the only people having this issue. We have heard from Netlify support that this is a bug on their side and something that they are working on, but we haven’t heard a response from them in 1.5 weeks even after sending several followup emails :frowning: Feeling pretty disappointed and stuck as this is also a huge blocker for us. Really hope to hear something soon Netlify…

@Nicky_Evers we were able to find a “temporary” solution for this by using gatsby-plugin-remove-fingerprints

I say temporary because this package was last updated 3 years ago and does not appear to still have support. That being said, I can confirm it does still work with Gatsby 4 and our split test issues have been resolved. We have had one running in production since about 4pm yesterday without any issues. YMMV

hi there nicky,

could you share how/when you got that response and also where you sent followup emails so i can take a closer look? thank you!

Hi @perry, sure, I submitted a support form around the same time as this post was originally created (2/18) and have traded a few emails with Luke Lawson from your support team. The last response I got was on 2/28, and I have sent 3 emails since then with no response. Here is the body of the last email Luke sent:

I took a look at the three x-nf-request-id headers provided. The third value (01FW4H5PHVGES4MPWKKN8R5B9D) was a 400 response. The 400 response was caused by an issue in our service but it is unrelated to the split tests.

The other two x-nf-request-id headers do show 404 responses for content not in the deploy requested. It also shows that the request was proxied to a Netlify function on the master branch.

I’m showing the IP address that made those 404 requests was browsing the site on the test-variant-1 branch. In the middle of this content being served a request is made for this URL (for both 01FW2FHNREZ74ND3J5QPMK6THB and 01FW2FD0JNDQJGZQJDHC9JBN5M):

https://www.audioeye.com/app-ad333b2b5d718c147907.js

What is interesting is that this file does exist in the test-variant-1 branch deploy at that time (which is the deploy below)

[link removed here]

You can download the deploy to see that the file is there. This means that the file should have been served. Instead I see the request being proxied to the master and the Gatsby DSG function which is being invoked for files which don’t exist in the deploy. There is no DSG page defined for that URL so the DSG is returning the 404.

Again, though, the DSG function on the master branch should never have been invoked. Do you happen to know if the client that made the request included the required nf_ab cookie header for those two requests?

Our logging doesn’t include cookies for privacy reasons so I cannot see if the cookie was sent in the logs. If the cookie was omitted, that is a client side error and the root cause of the issue.

However, if the cookie was sent then the error happened on the Netlify side. I’ve asked our engineers to take a look at the research I’ve done. I’m certain they will want to confirm the details about the cookie header though so I wanted to ask about that before they do.

To summarize, the 400 response is a known bug unrelated to split tests. The 404s we are still researching but knowing about the nf_ab cookie header will be key to finding that root cause. Would you please let us know if the nf_ab cookie was included for the two 404s?

END OF EMAIL

I have responded with screenshots confirming that the nf_ab cookie was and is always present on each request header.

Since then, we set up a Canary test to gather data on how many 4XX errors we were getting on individual resources (js files, etc) and we’re dismayed to see that even with the split testing off we are seeing 4XX’s on either individual resources or on entire pages about 4% of the time, and that is on a canary which is only hitting 2 pages out of hundreds. Seems like some bug in your CDN routing logic? Maybe the split testing just increases the frequency and visibility of an always present, and pretty scary, issue on your side?
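
For anyone wanting to track the same numbers, the tallying side of a canary like this can be sketched as follows. This is not our actual canary, just an illustration; it assumes each check is recorded as a `(url, status, request_id)` tuple, where `request_id` is the x-nf-request-id response header if one was present.

```python
from collections import Counter


def summarize(results):
    """Summarize canary checks given as (url, status, request_id) tuples.

    Returns the 4XX count and rate, a per-status breakdown, and the
    request IDs of failing responses so they can be passed to support.
    """
    results = list(results)
    failures = [r for r in results if 400 <= r[1] < 500]
    rate = len(failures) / len(results) if results else 0.0
    return {
        "total": len(results),
        "client_errors": len(failures),
        "rate": rate,
        "by_status": dict(Counter(status for _, status, _ in failures)),
        "request_ids": [rid for _, _, rid in failures if rid],
    }
```

Keeping the request IDs alongside the rate means every failure in the tally can be looked up individually on the Netlify side.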

Here are two x-nf-request-id’s that have the 400 errors in case that is helpful:

Req ID 400 on a JS file:
01FXR5H25RV88PZG8B7H381SVJ

Req ID Whole page 400:
01FXQVSBFNAYN3N2NWASX9N7NF

We will continue to monitor the rate of errors and figure out what our options are, and hope to hear back from you soon.

Update, our Canary test is now showing that we are missing some resource for ~9% of all requests with split testing turned off. This is really concerning…

Hey there, @Nicky_Evers :wave:

Thanks so much for the thorough follow-up. I acknowledge that this has been a slow back and forth, and I assure you that we have not forgotten about this question! Our team works very hard to reply to everyone in the forum and helpdesk as quickly as we can. With the recent incidents, we have had more on our plate than usual.

I have surfaced your helpdesk ticket to the Support Engineers again, and a member of our team will follow up there with you.

Thanks for your patience as we work to deliver Support to our Netlify customers. We appreciate it!