OpenAI Blamed Its Three Hour Outage on a “New Telemetry Service”

IBL News | New York

OpenAI blamed the major disruption of its ChatGPT, Sora, and API on December 11 on a new telemetry service to collect Kubernetes that went awry. The company admitted that “it had fallen short of our own expectations.”

“This event resulted from an internal change to roll out new telemetry across our fleet,” said the company.

The outage of three hours, between 3:16 PM PST and 7:38 PM PST, was one of the most prolonged outages in its history.

A security incident or a recent product launch didn’t cause the downtime. In a postmortem explanation, OpenAI explained, “The issue stemmed from a new telemetry service deployment that unintentionally overwhelmed the Kubernetes control plane, causing cascading failures across critical systems.”

Kubernetes is an open-source program that helps manage containers or packages of apps and related files used to run software in isolated environments.

“Our Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large Kubernetes clusters.”

OpenAI said that it could detect the issue “a few minutes” before customers ultimately started seeing an impact but that it couldn’t quickly implement a fix because it had to work around the overwhelmed Kubernetes servers.

“This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways,” the company wrote.

OpenAI said that it’ll adopt several measures to prevent similar incidents from occurring in the future.