[vox] [fwd] [svlug] mission critical computing and air safety

Bill Kendrick nbs at sonic.net
Wed Sep 22 10:30:27 PDT 2004


Here's a nice long post with some more details about the air traffic issue
last week (and his thoughts on the Windows conversions) from Rick Kwan over
at SVLUG.  Just thought it was interesting reading...


----- Forwarded message from Rick Kwan -----

Date: Wed, 22 Sep 2004 08:43:02 -0700
From: Rick Kwan
Subject: [svlug] mission critical computing and air safety
To: SVLUG <svlug at lists.svlug.org>

Synopsis:  the SoCal ATC radio outage was presumably due to technician
error, a year after an upgrade from UNIX to Windows 2000.  Slashdot has
an intense discussion, full of dissection and speculation.  While it is
correctly pointed out that this was a failure of the application,
not the OS, some question how the FAA could allow Windows in a
safety critical application.  My personal sense is that we will
see more, not less, of this sort of thing.


A three-hour outage of the air traffic control radio system in the
SoCal (Southern California) airspace on September 14, 2004, left
800 airplanes in the air without radio communication.  The FAA says
communication was re-routed to other centers, and the problem did
not present a danger to planes or passengers.  However, other
reports mention that controllers were clearly shaken as they
witnessed near-tragedies.

Initial reports cite human error.  A technician forgot to reboot
Windows 2000 before 49.7 days elapsed.  The OS was installed last year
as part of an upgrade of the FAA's Voice Switching and Control
System from UNIX to Windows 2000 Advanced Server.  This system is
being (or has been) rolled out to all 21 Air Route Traffic Control
Centers (ARTCCs) of the U.S. National Airspace System (NAS).
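
The 49.7-day figure is consistent with a 32-bit counter of elapsed
milliseconds wrapping around to zero.  The reports cited here do not say
which counter is involved, so the counter width and units in this sketch
are assumptions, shown only to make the arithmetic concrete:

```python
# Assumption (not stated in the reports): the uptime counter is an
# unsigned 32-bit count of milliseconds since boot.
COUNTER_BITS = 32
MS_PER_DAY = 24 * 60 * 60 * 1000

# Number of milliseconds the counter can hold before it wraps to zero.
wrap_ms = 2 ** COUNTER_BITS
wrap_days = wrap_ms / MS_PER_DAY


def tick_after(uptime_ms):
    """Return the value such a counter would report after uptime_ms."""
    return uptime_ms % wrap_ms


print(f"counter wraps after {wrap_days:.2f} days")
print(tick_after(wrap_ms - 1), tick_after(wrap_ms))
```

Any code that compares two readings of such a counter without allowing
for the wrap would misbehave at that point, which would fit a procedure
of rebooting before the interval elapses.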

Yesterday, a heavy Slashdot discussion ensued.  Many expressed
shock that Windows 2000 was used in a safety critical system.
Rather than blame the technician, some say the FAA shouldn't have
let Windows 2000 be used in the first place.  Others say the FAA
is depending on the judgement of system contractors, in this case,
Harris Corporation.  Others correctly note that this was a problem
of the application, not the OS.  Yet others note that an application
problem should not require a server reboot.  (Now in fairness to
the FAA and other parties, perhaps radio communication is mission
critical, but not safety critical.  Where does one draw the line?)

Like many other areas of society, the NAS is becoming increasingly
dependent on commodity computing power.  This case "merely" involved
voice communication.  Imagine if it involved trajectories and
kinematics.

Frankly, since I am not inside Harris or the FAA, the details still seem
a little murky to me.  But given Microsoft's recent Trustworthy
Computing Initiative, claims of five 9s, et cetera, I can imagine
how a Windows 2000 system crept in.  Someone must have done a
wonderful cost/benefit study to justify this and presented it to
FAA managers, who are aviation specialists, not computing folks.

Certainly, the FAA will do an in-depth investigation (although not
to the depth of Columbia).  But after identifying the causes, and
recommending and implementing revised maintenance procedures, I
expect more procurement of Windows-based servers.  This is not
because I don't trust the FAA; it's because that's what I see in
other industries.

For folks like SVLUG members, I expect that fundamental architecture
is a big issue.  In some measure, we operate on soundness of design
by white box inspection.  But government and other procurements
are usually a matter of requirements and getting a black box to
satisfy them.  Which is frequently as it should be; therefore, I
don't see this changing anytime soon.

Which leads me to the conclusion... we're in for more of the same.
We were simply lucky this time that no one died.

--Rick Kwan

----- End forwarded message -----

