Infamous Software Bugs: AT&T Switches
In this third installment of our ‘Infamous Software Bugs’ series – a series with the goal of examining software blunders and identifying the bugs behind them – we will be focusing on AT&T’s Long-Distance Network Collapse that left callers waiting… and waiting… and waiting.
AT&T’s Long-Distance Network Upgrade Bug
AT&T had deployed a complex software upgrade intended to speed up long-distance calls. Instead of bringing customers better service, a single line of code from the upgrade caused the long-distance network to shut down on January 15, 1990.
The Cost: About nine hours of disrupted service, an estimated $60 million in lost long-distance charges, approximately 75 million missed phone calls, and an estimated 200,000 lost airline reservations.
The Error: At one of AT&T’s 114 switching centers, a single 4ESS switch, let’s call it Switch A, experienced a minor mechanical problem and sent out a congestion signal (basically a “do not disturb” message) to switches it was linked to. Before the upgrade, Switch A, after reinitializing itself, would send a message out to say it was working again and the connected switches would then reset themselves to acknowledge this.
To speed up calls, AT&T had tweaked the software to send messages faster and more efficiently. Switch A no longer had to send a message saying it was back in service. Instead, the reappearance of traffic from Switch A would tell the others that it was working again, and as the switches received messages from A, they would reset themselves.
So, after reinitializing, Switch A began processing calls and sending out call routing signals. The problem occurred after Switch A sent a call attempt to another switch; let’s call it Switch B. Switch B acknowledged A’s return and began the reset process, but while it was still resetting, it received a second call attempt from A. This obscure load- and time-dependent defect caused Switch B to think it needed to reinitialize itself, so B in turn sent out congestion signals. Once B was back, it sent Switch C a call attempt (causing C to reset) followed by a second call attempt (during C’s reset process), causing C to reinitialize itself. The problem cascaded through the entire CCS7 network as switches kept resetting and sending signals during one another’s resets.
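To make the timing dependence concrete, here is a minimal sketch of the cascade in C. It is a toy model, not AT&T's software: the struct, the function names, the chain topology (each switch talks only to the next one), and the millisecond values are all assumptions made for illustration. A switch that receives a second call attempt before its status update has finished is treated as faulting and reinitializing, after which it does the same thing to its neighbor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one switch; names and timings are illustrative only. */
struct model_switch {
    bool neighbor_marked_down; /* upstream neighbor is flagged out of service */
    bool updating;             /* status update ("reset") is in progress */
    long update_done_at_ms;    /* time at which the current update finishes */
    bool reinitialized;        /* set once this switch faults and restarts */
};

/* Deliver one call attempt from the upstream neighbor at time now_ms.
 * window_ms is the assumed duration of a status update.
 * Returns true if this message forced the switch to reinitialize. */
bool receive_call_attempt(struct model_switch *sw, long now_ms, long window_ms)
{
    assert(sw != NULL);
    if (sw->updating) {
        if (now_ms < sw->update_done_at_ms) {
            /* Second message arrived mid-update: the defect path. */
            sw->updating = false;
            sw->reinitialized = true;
            return true;
        }
        sw->updating = false; /* update finished before this message arrived */
    }
    if (sw->neighbor_marked_down) {
        /* First traffic from a recovered neighbor: start the status update. */
        sw->neighbor_marked_down = false;
        sw->updating = true;
        sw->update_done_at_ms = now_ms + window_ms;
    }
    return false;
}

/* Switch 0 recovers and sends two call attempts, gap_ms apart, to switch 1.
 * Every switch that reinitializes then does the same to the next switch.
 * Returns the number of downstream switches that reinitialized. */
int simulate_cascade(int n_switches, long gap_ms, long window_ms)
{
    int faulted = 0;
    long now_ms = 0;
    for (int i = 1; i < n_switches; i++) {
        struct model_switch sw = { true, false, 0, false };
        receive_call_attempt(&sw, now_ms, window_ms);
        if (!receive_call_attempt(&sw, now_ms + gap_ms, window_ms)) {
            break; /* second message arrived after the update: cascade stops */
        }
        faulted++;
        now_ms += gap_ms; /* the next hop starts after this switch faulted */
    }
    return faulted;
}
```

In this model, `simulate_cascade(5, 5, 10)` takes down all four downstream switches because each second message lands inside the update window, while `simulate_cascade(5, 20, 10)` takes down none. That mirrors why the defect only surfaced under load, when messages arrived close together.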
The Reason: A single line of buggy code from the complex software upgrade, related to the fact that the switches were receiving messages too soon during the reset process, caused the cascading switch failures. A “pseudo-code” translation of the actual buggy code and a more in-depth analysis of what it means can be found here.
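Published accounts of the outage commonly attribute the defect to a C `break` statement placed inside an `if`/`else` that was itself nested in a `switch` case. In C, such a `break` leaves the whole `switch`, not just the `if`. The sketch below illustrates that class of mistake; it is not AT&T's code, and every name in it (`switch_state`, `handle_message_buggy`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message kinds and state; none of these names are AT&T's. */
enum msg_kind { MSG_INCOMING_CALL, MSG_OTHER };

struct switch_state {
    bool sender_out_of_service; /* sender is still marked down in the status map */
    bool write_buffer_empty;    /* no data is pending in the write buffer */
    int  messages_processed;    /* count of fully processed messages */
};

/* Buggy version: the break in the else branch exits the entire switch,
 * so the incoming message is silently skipped. */
void handle_message_buggy(struct switch_state *s, enum msg_kind kind)
{
    assert(s != NULL);
    switch (kind) {
    case MSG_INCOMING_CALL:
        if (s->sender_out_of_service) {
            if (s->write_buffer_empty) {
                s->sender_out_of_service = false; /* mark sender in service */
            } else {
                break; /* meant to leave the if, but leaves the switch */
            }
        }
        s->messages_processed++; /* process the incoming message */
        break;
    default:
        break;
    }
}

/* One possible fix: fold the conditions together so no inner break is needed. */
void handle_message_fixed(struct switch_state *s, enum msg_kind kind)
{
    assert(s != NULL);
    switch (kind) {
    case MSG_INCOMING_CALL:
        if (s->sender_out_of_service && s->write_buffer_empty) {
            s->sender_out_of_service = false; /* mark sender in service */
        }
        s->messages_processed++; /* the message is always processed */
        break;
    default:
        break;
    }
}
```

With the sender marked out of service and the write buffer busy, the buggy handler returns without processing the message, while the fixed handler processes it. The skipped step only happens on a specific combination of state and timing, which is what makes this kind of bug so easy to miss in testing.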
Keep an eye out for our next installment of the Infamous Software Bugs series!
Olenick & Associates has a full-time staff of experienced software testers who work to ensure your software deploys successfully. We use proven approaches and methodologies so that you won’t be added to the list of infamous victims of poor-quality software testing. With a focus on software testing for 15+ years, we specialize in finding bugs and offering solutions.
By Nicole Gawron, Marketing and Communications Intern – Olenick & Associates, Chicago