The AT&T Story
We all knew the day would come. And at least some of us were prepared for it. But, as usual, the vast majority had absolutely no idea what was going on.
AT&T was hit hard by a computer worm on January 15. That is a fact. And after reading the technical explanation below, you'll see why this is so.
But AT&T wasn't the only entity hit by this worm. We all were, some far more than others. The inability to get through, the denial of access, coupled with the blind faith we put in technology, the unwillingness to spread information so we can all understand the process. Yeah, it was fun for the phone phreaks as we watched the network crumble. But it was also an ominous sign of what's to come.
In the words of a high-ranking AT&T person, "very little could have gone worse." According to AT&T, of 148 million attempts, only 50 million went through, or roughly one call in three. Many claim it was far worse than that.
But what was it that actually happened? Here's what we were able to determine:
The problem started in a 4ESS machine in New York. The 4ESS is used to route calls and is basically, in the words of a Bell Labs technician, "nothing more than a big computer." New York, for reasons unknown, sent out a Broadcast Warning Message (BWM), which triggered all of the 113 other 4ESS machines around the nation to do likewise.
Why did this happen now? Well, back in the late seventies, Bell Labs developed a Common-Channel Signaling system known as Signaling System 6 or SS6. International standards have been developed over the past couple of years which necessitated some change on AT&T's part. So CCS 7, or Signaling System 7, was introduced. Somewhere inside SS7 is where the problem lurked, undetected, until January 15.
According to experts, SS7 is a much more flexible system and that's why it's become the international standard. It's actually more of a protocol to which each company must adjust. They don't all use the same software. AT&T uses its own software, British Telecom uses something different, U.S. Sprint uses something else, etc. Some AT&T people, aided by well-meaning but ignorant media, were spreading the notion that many companies had the same software and therefore could face the same problem someday. Wrong. This was entirely an AT&T software deficiency. Of course, other companies could face completely different software problems. But, then, so too could AT&T.
The 114 4ESS machines around the country have new software installed periodically. When this is done, it's done gradually, circuit by circuit, one machine at a time. The network is presently configured so that the 4ESS machines have some circuits consisting of both SS6 and SS7. Eventually, though, all ties to the SS6 will be eliminated. "There's no reason to be concerned with this," AT&T says. "We've had some major changes in the network in the last ten years. In fact, we've had quite a few in the last three or four. They've always been for the better."
But what caused the problem? Exactly the right situation occurred at the right moment for a particular event to occur. Possibly the fact that January 15th was a holiday had something to do with it. Traffic was fairly low, which was unusual for a Monday. It's assumed that the problem originated in a particular component known as the Common Network Interface (CNI) Ring. There is a component of that ring that allows the 4ESS to transmit messages across the ring and across the Common-Channel Signaling network. What apparently happened was that there was a flaw of some kind in the software in one of those rings. The bogus BWM from New York was sent out and it caused an excess of messages going to other 4ESS locations. A snowball effect began and the congestion spread and grew rapidly. All of the 4ESS machines were affected within half an hour.
Sounds like a worm to us. Not the kind that gets spread deliberately. There are plenty of programming errors that cause accidental worms. It could happen to any computer system.
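To get a feel for how quickly such a snowball can grow, here is a toy sketch in Python. It is not AT&T's software, and nothing in it comes from AT&T. The function name, the fan_out parameter, and the rule that a machine rebroadcasts the first time it hears a warning are all our own assumptions, chosen only to mirror the description above: one machine sends a warning, and every machine that receives it does likewise.

```python
def simulate_cascade(num_switches=114, fan_out=113):
    """Toy model of a warning-message snowball (hypothetical, not AT&T's code).

    Switch 0 broadcasts a warning to the next `fan_out` switches. Every switch
    that hears a warning for the first time rebroadcasts it in the next round.
    Returns a list of (round, newly_affected, messages_sent) tuples.
    """
    if not 0 < fan_out < num_switches:
        raise ValueError("fan_out must be between 1 and num_switches - 1")

    affected = {0}   # switches that have already seen a warning
    frontier = [0]   # switches that will broadcast in this round
    history = []
    round_no = 0

    while frontier:
        messages = 0
        newly_affected = []
        for src in frontier:
            for step in range(1, fan_out + 1):
                dst = (src + step) % num_switches
                messages += 1
                if dst not in affected:
                    affected.add(dst)
                    newly_affected.append(dst)
        history.append((round_no, len(newly_affected), messages))
        frontier = newly_affected
        round_no += 1

    return history


if __name__ == "__main__":
    for round_no, new, sent in simulate_cascade():
        print(f"round {round_no}: {new} newly affected, {sent} messages sent")
```

With the defaults (114 machines, each broadcasting to all 113 others), a single message from one machine becomes 113 messages in the first round and 12,769 in the second. The real network was certainly more complicated than this, but the arithmetic shows why congestion could spread to every machine so fast.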
Phone calls were forced off of SS7 and onto SS6. The problem was fixed by overwriting part of the software, in effect, bypassing it. But, at press time, the specific cause still hadn't been made known.
The name of the organization of Bell Labs software people trying to figure all of this out is NESAC, National Electronic Switching Assistance Center. They're working out of Lisle and Indian Hill, Illinois.
Lack of Redundancy
One expert said, "There's been a tendency in this company to save money by centralizing operations and making things bigger. And that has made the whole system more vulnerable."
There is much less redundancy in today's system, meaning there is less of a backup. The current infatuation with fiber optics that certain long-distance companies have (AT&T included) spells certain trouble because of the lack of redundancy in these cheap systems.
The problem occurred in a part of the signaling system that doesn't carry voice traffic. It's known as "out-of-band signaling" because it's outside the band that carries the actual conversation. Data, such as the number called and the number calling, is sent over this path. Among other things, this prevents Blue Boxing since subscribers have no access to the routing signals.
And that's basically all we know at this stage. What we don't know is how a major force in communications like AT&T could be so sloppy. What happened to backups? Sure, computer systems go down all the time, but people making phone calls are not the same as people logging onto computers. We must make that distinction. It's not acceptable for the phone system or any other essential service to "go down." If we continue to trust technology without understanding it, we can look forward to many variations on this theme.
AT&T owes it to its customers to be prepared to instantly switch to another network if something strange and unpredictable starts occurring. The news here isn't so much the failure of a computer program, but the failure of AT&T's entire structure.
The Non-Technical Problems
At the height of the crisis, Laura Abbott, an AT&T spokesperson, said callers shouldn't try using any of the other companies. She recommended repeated tries over AT&T. "If you don't get through the first time, you'll get through the second time."
AT&T operators, hours after the crisis began, refused to tell customers how they could place their calls over other long-distance companies. It went against company policy. This, despite the fact that most long-distance companies tell the customer how to access AT&T if he/she needs to.
The media once again let us down by not doing enough to educate themselves, let alone the public. All that had to be done was to alert the public as to how to make a long-distance call using another company. Nobody had to be inconvenienced on that day.
Breaking up the Bell System was essential in the name of fairness. But it doesn't end there. The general public has to be educated on how to use the new system to their advantage. What good is a fair system if most people don't know how to use it? Why are people so afraid to do this? Why are they discouraged?
Many institutions and businesses choose to block access to the 10XXX system, thinking that somehow it will generate more bills. Many of them now realize belatedly the usefulness of that system.
The Carrier Access Code list we printed in our last issue should be available to everybody in the country. Possession of this list is really the only way consumers will find alternative long-distance companies that could be a life-saver in a situation like this.
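For anyone who wants to see what using the 10XXX system amounts to, here is a minimal sketch in Python. It assumes the usual pattern of dialing 10XXX, then 1, then the area code and seven-digit number. The function name is ours, the carrier code 123 is a placeholder rather than a real carrier, and the phone number is fictional. Real codes should come from the Carrier Access Code list.

```python
def carrier_dial_string(carrier_code, area_code, number):
    """Build the digits to dial for a long-distance call over a chosen carrier.

    Assumed pattern: 10 + XXX (carrier code) + 1 + area code + 7-digit number.
    Punctuation in `number` (dashes, spaces) is ignored.
    """
    digits = "".join(ch for ch in number if ch.isdigit())

    if len(carrier_code) != 3 or not carrier_code.isdigit():
        raise ValueError("carrier code must be exactly three digits")
    if len(area_code) != 3 or not area_code.isdigit():
        raise ValueError("area code must be exactly three digits")
    if len(digits) != 7:
        raise ValueError("local number must contain exactly seven digits")

    return "10" + carrier_code + "1" + area_code + digits


if __name__ == "__main__":
    # Placeholder carrier code and fictional number, for illustration only.
    print(carrier_dial_string("123", "212", "555-0100"))
```

That's all there is to it: five extra digits in front of the call you were going to make anyway.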
During the California earthquake last October, AT&T made a decision for us. They decided that incoming calls weren't as important as outgoing calls to the people there. They were probably right. But, by blocking virtually all attempts, they were making a categorical assumption that simply doesn't hold up to individual reasoning. For those of us who knew the alternative ways to route our calls, calling in was no problem. But so few of us knew this.
There obviously have to be more alternatives, so that there are more choices for each of us. But there has to be a level of awareness among the end-users, or else, what's the point?