Between 12:39 PM and 1:10 PM EST on Sunday, May 13th, our monitoring on https://console.aserto.com (site name aserto-console.netlify.app) was returning errors. The monitoring tool alerted multiple times with the error “TCP connection failed”. In addition to the automated monitoring, we directly observed that the site was intermittently unavailable during this time.
Occasional partial failures of a complex service like Netlify are understandable. What gives us cause for concern, however, is that during the outage period our engineers intermittently experienced the following security error when trying to visit the site in a browser:
Having the site fail in a state where it gives the user a reason to question the security of our site is something we would like to avoid no matter what the failure mode. Could we request your assistance to explain what happened here and help us determine how to prevent this sort of bad user experience from happening in the future?
First off, I am incredibly sorry for the incident that occurred last week, and I understand completely how seeing this error and having your monitors alert caused you and your team stress. It is our goal to not harm any developer workflow and to provide world-class service, and we let you down. We have learned an incredible amount from this incident and are committed to doing better in the future. I appreciate your kindness throughout this event.
So what is up with this error? Basically, the incident was triggered by an increase in latency to our backend systems. Because of that, when a request for your site reached the point in the TLS handshake where the SNI is resolved, we were unable to resolve the hostname to your site and defaulted back to our own domain name, netlify.com, whose certificate is a mismatch for your site. Without SNI, there is no way for the client to indicate to the server which hostname it is trying to reach. As a result, the server may present the SSL certificate for the wrong hostname. In this case, our logic defaulted back to netlify.com. If the name on the SSL certificate does not match the name the client is trying to reach, the client browser returns an error and usually terminates the connection. Hence the error that you received.
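To make the mechanism concrete, here is a minimal Python sketch of the behavior described above. It is an illustration only, not Netlify's code: the `CERTS` table, the `DEFAULT_HOST` fallback, and the function names are all assumptions, and the name matching is a simplified version of what a real browser does.

```python
from typing import List, Optional

# Hypothetical certificate store: SNI hostname -> names covered by its certificate.
CERTS = {
    "console.aserto.com": ["console.aserto.com"],
    "netlify.com": ["netlify.com", "*.netlify.com"],
}
DEFAULT_HOST = "netlify.com"  # assumed fallback, per the explanation above


def select_certificate(sni: Optional[str]) -> List[str]:
    """Return the names on the certificate the server would present."""
    if sni is None or sni not in CERTS:
        # SNI missing or not resolved in time: fall back to the default certificate.
        return CERTS[DEFAULT_HOST]
    return CERTS[sni]


def name_matches(pattern: str, hostname: str) -> bool:
    """Simplified matching: exact name, or one wildcard in the left-most label."""
    if pattern.startswith("*."):
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == pattern[2:]
    return pattern == hostname


def browser_accepts(requested: str, sni: Optional[str]) -> bool:
    """True if the presented certificate covers the hostname the user asked for."""
    return any(name_matches(name, requested) for name in select_certificate(sni))


if __name__ == "__main__":
    # Normal case: SNI resolves, the site's own certificate is presented.
    print(browser_accepts("console.aserto.com", "console.aserto.com"))
    # Incident case: SNI lookup fails, the netlify.com certificate is presented.
    print(browser_accepts("console.aserto.com", None))
```

In the second call the server presents a certificate for netlify.com while the browser asked for console.aserto.com, which is the mismatch that surfaces as a security error.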
In addition to the resiliency we are adding to our systems to ensure this doesn’t happen again, we are updating our logic to fail more gracefully in the event the TLS handshake is interrupted or SNI is not returned in a timely manner.
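One way to picture "failing more gracefully" is a lookup that aborts the handshake instead of falling back to a default certificate. The sketch below is our own illustration under that assumption, not a description of the change Netlify actually made; `HandshakeAborted`, `select_certificate_strict`, and the `lookup` callable are hypothetical names.

```python
from typing import Callable, Optional


class HandshakeAborted(Exception):
    """Raised when no certificate can be safely selected for the connection."""


def select_certificate_strict(
    sni: Optional[str],
    lookup: Callable[[str], Optional[str]],
) -> str:
    """Return the certificate for `sni`, or abort rather than guess.

    `lookup` stands in for a backend call that may be slow; it returns a
    certificate identifier, returns None if the host is unknown, or raises
    TimeoutError if the backend does not answer in time.
    """
    if not sni:
        raise HandshakeAborted("client sent no SNI")
    try:
        cert = lookup(sni)
    except TimeoutError as exc:
        raise HandshakeAborted(f"certificate lookup timed out for {sni}") from exc
    if cert is None:
        raise HandshakeAborted(f"no certificate known for {sni}")
    return cert


if __name__ == "__main__":
    store = {"console.aserto.com": "cert-for-console"}
    print(select_certificate_strict("console.aserto.com", store.get))
    try:
        select_certificate_strict("other.example", store.get)
    except HandshakeAborted as err:
        print("aborted:", err)
```

With this approach the visitor would see a failed connection, which a retry can recover from, rather than a certificate warning that calls the site's security into question. In a real TLS server the abort could be signaled with an alert such as `unrecognized_name`, defined alongside SNI in RFC 6066.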
Hopefully, this answers your question and if not, please hit me back and I would be happy to continue this conversation!
Thanks very much for taking the time to give us such a detailed explanation. It’s reassuring to know your team understands the issue and is on top of it. I’ll let everyone here know you’re planning updates to prevent it.
Thanks, Dana, for the awesome response - you are, and always will be, the best!