[gclist] Mars Rover info from comp.risks

Ken Anderson kanderson@bbn.com
Mon, 5 Jan 1998 19:07:45 -0400


[kanderson: Henry is right, this is important.  It shows how dangerous
preemptive threads can be (is your Java application immune?), and how
important a dynamic langauge and debugging environment can be for a remote
distributed application.

BTW, many NASA missions start on July 4th, such as the 1976 Voyager mission
to Mars, that i worked on.  My contribution was a Basic program to display
seismic data that turned out to be very popular for making birthday cards.
We made sure the press made a big deal about the Viking missing being
programmable (while the Russian space flights weren't), but every Viking
scientific application affected every other, and the state of the universe
was also involved, so a programming sesssion involved 100 shouting people.
Once a program changes was agreed apon, we simulated it for 22 days to make
sure it would work.  That's how porgrammable it was!  Once we waited an
hour while two people violently argued.  It turned out they had
misinterpreted a TLA (three letter achronym).  Never mind.]

[hbaker: I thought that the following info was so important to this group
that I copied it in toto.  My apologies if you have already seen it.]

Date: Sunday, December 07, 1997 6:47 PM
From: Mike Jones <mbj@MICROSOFT.com>
Subject: What really happened on Mars Rover Pathfinder

The Mars Pathfinder mission was widely proclaimed as "flawless" in the early
days after its July 4th, 1997 landing on the Martian surface.  Successes
included its unconventional "landing" -- bouncing onto the Martian surface
surrounded by airbags, deploying the Sojourner rover, and gathering and
transmitting voluminous data back to Earth, including the panoramic pictures
that were such a hit on the Web.  But a few days into the mission, not long
after Pathfinder started gathering meteorological data, the spacecraft began
experiencing total system resets, each resulting in losses of data.  The
press reported these failures in terms such as "software glitches" and "the
computer was trying to do too many things at once".

This week at the IEEE Real-Time Systems Symposium I heard a fascinating
keynote address by David Wilner, Chief Technical Officer of Wind River
Systems.  Wind River makes VxWorks, the real-time embedded systems kernel
that was used in the Mars Pathfinder mission.  In his talk, he explained in
detail the actual software problems that caused the total system resets of
the Pathfinder spacecraft, how they were diagnosed, and how they were
solved.  I wanted to share his story with each of you.

VxWorks provides preemptive priority scheduling of threads.  Tasks on the
Pathfinder spacecraft were executed as threads with priorities that were
assigned in the usual manner reflecting the relative urgency of these tasks.

Pathfinder contained an "information bus", which you can think of as a
shared memory area used for passing information between different components
of the spacecraft.  A bus management task ran frequently with high priority
to move certain kinds of data in and out of the information bus.  Access to
the bus was synchronized with mutual exclusion locks (mutexes).

The meteorological data gathering task ran as an infrequent, low priority
thread, and used the information bus to publish its data.  When publishing
its data, it would acquire a mutex, do writes to the bus, and release the
mutex.  If an interrupt caused the information bus thread to be scheduled
while this mutex was held, and if the information bus thread then attempted
to acquire this same mutex in order to retrieve published data, this would
cause it to block on the mutex, waiting until the meteorological thread
released the mutex before it could continue.  The spacecraft also contained
a communications task that ran with medium priority.

Most of the time this combination worked fine.  However, very infrequently
it was possible for an interrupt to occur that caused the (medium priority)
communications task to be scheduled during the short interval while the
(high priority) information bus thread was blocked waiting for the (low
priority) meteorological data thread.  In this case, the long-running
communications task, having higher priority than the meteorological task,
would prevent it from running, consequently preventing the blocked
information bus task from running.  After some time had passed, a watchdog
timer would go off, notice that the data bus task had not been executed for
some time, conclude that something had gone drastically wrong, and initiate
a total system reset.

This scenario is a classic case of priority inversion.

HOW WAS THIS DEBUGGED?

VxWorks can be run in a mode where it records a total trace of all
interesting system events, including context switches, uses of
synchronization objects, and interrupts.  After the failure, JPL engineers
spent hours and hours running the system on the exact spacecraft replica in
their lab with tracing turned on, attempting to replicate the precise
conditions under which they believed that the reset occurred.  Early in the
morning, after all but one engineer had gone home, the engineer finally
reproduced a system reset on the replica.  Analysis of the trace revealed
the priority inversion.

HOW WAS THE PROBLEM CORRECTED?

When created, a VxWorks mutex object accepts a boolean parameter that
indicates whether priority inheritance should be performed by the mutex.
The mutex in question had been initialized with the parameter off; had it
been on, the low-priority meteorological thread would have inherited the
priority of the high-priority data bus thread blocked on it while it held
the mutex, causing it be scheduled with higher priority than the
medium-priority communications task, thus preventing the priority inversion.
Once diagnosed, it was clear to the JPL engineers that using priority
inheritance would prevent the resets they were seeing.

VxWorks contains a C language interpreter intended to allow developers to
type in C expressions and functions to be executed on the fly during system
debugging.  The JPL engineers fortuitously decided to launch the spacecraft
with this feature still enabled.  By coding convention, the initialization
parameter for the mutex in question (and those for two others which could
have caused the same problem) were stored in global variables, whose
addresses were in symbol tables also included in the launch software, and
available to the C interpreter.  A short C program was uploaded to the
spacecraft, which when interpreted, changed the values of these variables
from FALSE to TRUE.  No more system resets occurred.

ANALYSIS AND LESSONS

First and foremost, diagnosing this problem as a black box would have been
impossible.  Only detailed traces of actual system behavior enabled the
faulty execution sequence to be captured and identified.

Secondly, leaving the "debugging" facilities in the system saved the day.
Without the ability to modify the system in the field, the problem could not
have been corrected.

Finally, the engineer's initial analysis that "the data bus task executes
very frequently and is time-critical -- we shouldn't spend the extra time in
it to perform priority inheritance" was exactly wrong.  It is precisely in
such time critical and important situations where correctness is essential,
even at some additional performance cost.

HUMAN NATURE, DEADLINE PRESSURES

David told us that the JPL engineers later confessed that one or two system
resets had occurred in their months of pre-flight testing.  They had never
been reproducible or explainable, and so the engineers, in a very
human-nature response of denial, decided that they probably weren't
important, using the rationale "it was probably caused by a hardware
glitch".

Part of it too was the engineers' focus.  They were extremely focused on
ensuring the quality and flawless operation of the landing software.  Should
it have failed, the mission would have been lost.  It is entirely
understandable for the engineers to discount occasional glitches in the
less-critical land-mission software, particularly given that a spacecraft
reset was a viable recovery strategy at that phase of the mission.

THE IMPORTANCE OF GOOD THEORY/ALGORITHMS

David also said that some of the real heroes of the situation were some
people from CMU who had published a paper he'd heard presented many years
ago who first identified the priority inversion problem and proposed the
solution.  He apologized for not remembering the precise details of the
paper or who wrote it.  Bringing things full circle, it turns out that the
three authors of this result were all in the room, and at the end of the
talk were encouraged by the program chair to stand and be acknowledged.
They were Lui Sha, John Lehoczky, and Raj Rajkumar.  When was the last time
you saw a room of people cheer a group of computer science theorists for
their significant practical contribution to advancing human knowledge? :-)
It was quite a moment.

POSTLUDE

For the record, the paper was:

L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance Protocols: An
Approach to Real-Time Synchronization. In IEEE Transactions on Computers,
vol. 39, pp. 1175-1185, Sep. 1990.