Sporadic deployment failures for Large Media site

Hi,

We have a Netlify Large Media-enabled site (afs-media.netlify.app) that fails deployment about once every two weeks with the following error:

6:09:12 PM: ────────────────────────────────────────────────────────────────
6:09:12 PM:   Internal error during "Deploy site"                           
6:09:12 PM: ────────────────────────────────────────────────────────────────
6:09:12 PM: ​
6:09:12 PM:   Error message
6:09:12 PM:   Error: Deploy did not succeed: Failed to execute deploy: [PUT /sites/{site_id}/deploys/{deploy_id}][500] updateSiteDeploy default  &{Code:0 Message:}
6:09:12 PM: ​
6:09:12 PM:   Error location
6:09:12 PM:   During Deploy site
6:09:12 PM:       at handleDeployError (file:///opt/buildhome/node-deps/node_modules/@netlify/build/src/plugins_core/deploy/buildbot_client.js:87:18)
6:09:12 PM:       at deploySiteWithBuildbotClient (file:///opt/buildhome/node-deps/node_modules/@netlify/build/src/plugins_core/deploy/buildbot_client.js:68:12)
6:09:12 PM:       at processTicksAndRejections (node:internal/process/task_queues:96:5)
6:09:12 PM:       at async coreStep (file:///opt/buildhome/node-deps/node_modules/@netlify/build/src/plugins_core/deploy/index.js:45:5)
6:09:12 PM:       at async fireCoreStep (file:///opt/buildhome/node-deps/node_modules/@netlify/build/src/steps/core_step.js:39:9)
6:09:12 PM:       at async tFireStep (file:///opt/buildhome/node-deps/node_modules/@netlify/build/src/time/main.js:20:59)
6:09:12 PM:       at async runStep (file:///opt/buildhome/node-deps/node_modules/@netlify/build/src/steps/run_step.js:88:7)
6:09:12 PM:       at async pReduce.index (file:///opt/buildhome/node-deps/node_modules/@netlify/build/src/steps/run_steps.js:91:11)
6:09:12 PM:       at async runSteps (file:///opt/buildhome/node-deps/node_modules/@netlify/build/src/steps/run_steps.js:51:7)
6:09:12 PM:       at async runBuild (file:///opt/buildhome/node-deps/node_modules/@netlify/build/src/core/main.js:610:7)
6:09:12 PM: ​
6:09:12 PM:   Resolved config
6:09:12 PM:   build:
6:09:12 PM:     base: /opt/build/repo/media
6:09:12 PM:     command: npm run build-media
6:09:12 PM:     commandOrigin: ui
6:09:12 PM:     environment:
6:09:12 PM:       - INCOMING_HOOK_BODY
6:09:12 PM:       - INCOMING_HOOK_TITLE
6:09:12 PM:       - INCOMING_HOOK_URL
6:09:12 PM:       - NETLIFY_LFS_ORIGIN_URL
6:09:12 PM:       - ONEGRAPH_AUTHLIFY_TOKEN
6:09:12 PM:     publish: /opt/build/repo/media/dist
6:09:12 PM:     publishOrigin: ui

The site consists only of image files, but there are a lot of them (~9000 / 2GB). While the site is connected to a GitHub repo, we keep builds deactivated until we need one, then we activate, call a build hook, then deactivate. This is to avoid needlessly triggering builds when no images are affected. Most of the time this works perfectly, but every so often this error occurs.

Is there something we can do to avoid this error? Any help would be much appreciated!

Hi, @gsjen123. I wish I could determine the root cause but I cannot. Our developers will need to research to find the root cause here.

I’ve filed an issue to track this and we will follow up here to let you know if the issue is resolved. In the meantime, triggering a new deploy is the only workaround. While I agree this workaround is far from ideal, we don’t count build minutes for failed deploys so it won’t impact your build minutes usage.
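
For anyone scripting the "trigger a new deploy" workaround, here is a minimal Python sketch of a retry loop. It is an illustration only: `run_deploy` is a hypothetical callable you would supply yourself (for example, one that calls your build hook and then checks whether the deploy succeeded), and the attempt count and delay are arbitrary assumptions, not Netlify recommendations.

```python
import time
from typing import Callable


def deploy_with_retry(
    run_deploy: Callable[[], bool],
    max_attempts: int = 3,
    delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_deploy until it returns True; return the attempt number that succeeded.

    run_deploy is a hypothetical, caller-supplied function that starts one
    deploy and returns True if it succeeded, False if it failed.
    """
    for attempt in range(1, max_attempts + 1):
        if run_deploy():
            return attempt
        # Wait before retrying, but not after the final failed attempt.
        if attempt < max_attempts:
            sleep(delay_s)
    raise RuntimeError(f"deploy failed after {max_attempts} attempts")
```

Because the error appears to be intermittent, a small number of retries should usually be enough; a deploy that fails every time points to a different problem.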

I do see this happening more frequently for sites with many files (as you mentioned, this site has over 9000). However, it also happens for sites with fewer than 50 files (but far less often).

I do think the error itself is happening randomly and isn’t specific to what your site does. For example, we see this error for sites not using Large Media so it isn’t Large Media causing it.

Larger sites tend to spend longer uploading (more files means it takes longer to upload) and I believe this is why it is seen more frequently for larger sites. It is a random error so the longer spent in this stage the more likely the error is to occur.
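
To make that reasoning concrete, here is a small Python sketch. It assumes, purely for illustration, that the error strikes independently with some fixed probability in each second of the upload stage; the real failure rate is not known, and the function name and parameters are made up for this example.

```python
def failure_probability(p_per_second: float, upload_seconds: float) -> float:
    """Chance of at least one failure during an upload of the given length.

    Assumes an independent, constant per-second failure probability,
    which is a simplification and not a measured property of the platform.
    """
    # Probability of surviving every second, subtracted from 1.
    return 1.0 - (1.0 - p_per_second) ** upload_seconds
```

Under this model the probability only grows as `upload_seconds` grows, which matches the observation that sites with more files see the error more often even though nothing about the site itself causes it.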

Again, I don’t have a fix today but I wanted to let you know it is being tracked now. If there are other questions or concerns, please let us know.

Any update on this? We are now experiencing this error on nearly every build.

Hi @gsjen123,

This issue was later dropped in priority. I have pinged the developers again to verify the current status.

Hi, @gsjen123. The issue has been determined to be caused by the use of Large Media for this site.

The recommended solution here is to stop using Large Media for this site. You can still use Git LFS for the repo and deploy the site at Netlify without Large Media, and this is what we recommend.

We have a support guide about uninstalling Large Media here:

Would you please read that support guide and let us know if you are ready for us to remove the Large Media add-on for this site?

Please note, the only difference between using Large Media and Git LFS by itself is that Large Media allows for browse-time image transformations. This site is not using that feature in any way. I see only six image transformations for the site in question in the last 30 days. All six came from a single IP address and none sent a referer: header. So, it appears to have been someone testing the feature and not real-world use.

As you are not using the image transformation feature for this site, Large Media isn’t providing any benefit at all and it is the cause of the deploy errors. Removing it will prevent these deploy errors. There is nothing to lose and much to gain by removing Large Media.

Again, please let us know if there are questions and/or if you are ready for us to proceed with uninstalling the Large Media add-on.

Hi @luke. With Large Media removed, will requests for LFS pointer files continue to be proxied to your LFS service?

Hi, @gsjen123. If you follow the directions in the support guide above, then Git LFS API calls will be sent to the same Git host that hosts the repository.

This is done by this specific step:

Once our support team replies to the topic that the Large Media add-on was removed from the site, delete the file .lfsconfig from the repo and commit that deletion.

It is this file that redirects Git LFS API calls to Netlify. Once that file is deleted it will default to the Git host for the repo.
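
As a rough way to check what a repo's .lfsconfig currently points at, here is a Python sketch that reads the `lfs.url` setting from the file's text. The sample contents, including the site name and the `/.netlify/large-media` path, are assumptions for illustration; check your own file for the actual value.

```python
import configparser
from typing import Optional


def lfs_url_from_lfsconfig(text: str) -> Optional[str]:
    """Return the lfs.url value from .lfsconfig text, or None if it is not set.

    .lfsconfig uses git-config syntax; configparser handles simple files
    like the one below, but it is not a full git-config parser.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser.get("lfs", "url", fallback=None)


# Assumed example contents; "example-site" is a placeholder name.
SAMPLE_LFSCONFIG = "[lfs]\n\turl = https://example-site.netlify.app/.netlify/large-media\n"

if __name__ == "__main__":
    print(lfs_url_from_lfsconfig(SAMPLE_LFSCONFIG))
```

If the function returns None (or the file does not exist), Git LFS has no override from this file and falls back to the endpoint of the Git host for the repo.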

If you are ready to proceed or if there are other questions, please reply here anytime.

My question is about requests to our website, not requests made by Git. For example, today this request:

GET https://afs-media.netlify.app/posts/1600x500-giftcard.png

is reverse-proxied to the Netlify LFS service, so instead of returning the deployed text pointer file, the image is returned. My question is: if we remove Large Media, will this proxying still take place?

No, it won’t. The file will be fetched from your Git repo and deployed to the CDN directly just like any other file that doesn’t use LFS.
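
A quick way to confirm what a given URL is actually serving is to look at the first bytes of the response body. The Python sketch below is only an illustration: it relies on Git LFS pointer files starting with the `version https://git-lfs.github.com/spec/v1` line and on the standard PNG signature, and the function name and return labels are made up for this example.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def classify_body(body: bytes) -> str:
    """Label a response body as an LFS pointer file, a PNG image, or other."""
    if body.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    if body.startswith(PNG_SIGNATURE):
        return "png"
    return "other"
```

Passing in the bytes downloaded from an image URL on the deployed site should give "png" for a PNG that was deployed as a real file, and "lfs-pointer" if only the text pointer made it into the deploy.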

Thanks @hrishikesh. So disabling Large Media would mean not only would we have to set up another LFS service, but we would also incur longer build times by having to download 3GB+ every build. We would much prefer to keep using Large Media. Can you not provide more info on what is going wrong? Maybe we can do something different in our builds?

There’s nothing on your end that can solve this, unfortunately. You have too many Netlify Large Media files, and our API has a limitation of 28 seconds of execution. If the API takes longer than that to respond, the request fails. This is the same reason why sites with too many files in their deploy cannot be downloaded: the API is not able to complete the request within 28 seconds.

You’re basically hitting a system limitation of the platform which cannot be lifted anytime soon.

Hi, @gsjen123. I do think not using Large Media is still an option here because the following isn’t true when the build cache is used:

So disabling Large Media would mean not only would we have to set up another LFS service, but we would also incur longer build times by having to download 3GB+ every build.

Only new or changed files need to be downloaded when builds are cached. Note, with both processes, either Large Media or Git LFS alone (without Large Media), the files need to be sent to Netlify. That happens in either case.

It is true that the first build of a new repo would take longer because the whole repo has to be downloaded. However, that only happens once. After the first build is cached, subsequent builds will only need to update the repo - not to download it completely. You will not need to download all media with every build with caching.

Long story short, I don’t think you are getting much of a savings on builds with Large Media and removing it will fix the deploy errors. Are there any other reasons you want to stick with Large Media? Again, if you continue using it with so many files you will continue to see errors.