[gclist] What really happened on Mars (priority diversion)

Henry G. Baker hbaker@netcom.com
Sun, 11 Jan 1998 08:28:14 -0800 (PST)


[hbaker: Since I had earlier forwarded the original message, I thought that
I should forward the follow-on, as well.]

Date: Fri, 9 Jan 1998 14:13:58 -0800
From: Mike Jones <mbj@MICROSOFT.com>
Subject: Re: What really happened on Mars? by Glenn Reeves (RISKS-19.49)

> Date: Monday, December 15, 1997 10:28 AM
> From: Glenn E Reeves <Glenn.E.Reeves@jpl.nasa.gov>
> Subject: 	Re: [Fwd: FW: What really happened on Mars?]
>
>What really happened on Mars?
>
>By now most of you have read Mike's (mbj@microsoft.com) summary of Dave
>Wilner's comments given at the IEEE Real-Time Systems Symposium.  I don't
>know Mike and I didn't attend the symposium (though I really wish I had now)
>and I have not talked to Dave Wilner since before the talk.  However, I did
>lead the software team for the Mars Pathfinder spacecraft.  So, instead of
>trying to find out what was said I will just tell you what happened.  You
>can make your own judgments.
>
>I sent this message out to everyone who was a recipient of Mike's original
>that I had an e-mail address for.  Please pass it on to anyone you sent the
>first one to.  Mike, I hope you will post this wherever you posted the
>original.
>
>Since I want to make sure the problem is clearly understood, I need to step
>through each of the areas which contributed to it.
>
>THE HARDWARE
>
>The simplified view of the Mars Pathfinder hardware architecture looks like
>this.  A single CPU controls the spacecraft.  It resides on a VME bus which
>also contains interface cards for the radio, the camera, and an interface to
>a 1553 bus.  The 1553 bus connects to two places: the "cruise stage" part
>of the spacecraft and the "lander" part of the spacecraft.  The hardware on
>the cruise part of the spacecraft controls thrusters, valves, a sun sensor,
>and a star scanner.  The hardware on the lander part provides an interface
>to accelerometers, a radar altimeter, and an instrument for meteorological
>science known as the ASI/MET.  The hardware which we used to interface to
>the 1553 bus (at both ends) was inherited from the Cassini spacecraft.  This
>hardware came with a specific paradigm for its usage: the software will
>schedule activity at an 8 Hz rate.  This **feature** dictated the
>architecture of the software which controls both the 1553 bus and the
>devices attached to it.
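>
>(To make the 8 Hz paradigm concrete, here is a minimal vxWorks-style sketch
>of a cycle task released on each 125 ms boundary.  This is illustrative
>only, not the flight code; the names and the interrupt hookup are invented
>for the example.)
>
>    #include "vxWorks.h"
>    #include "semLib.h"
>
>    SEM_ID cycleSem;                   /* given on each 8 Hz (125 ms) boundary */
>
>    void busCycleIsr (void)            /* attached to the 1553 interface interrupt */
>        {
>        semGive (cycleSem);            /* release the waiting cycle task */
>        }
>
>    void busCycleTask (void)
>        {
>        cycleSem = semBCreate (SEM_Q_PRIORITY, SEM_EMPTY);
>        for (;;)
>            {
>            semTake (cycleSem, WAIT_FOREVER);   /* block until the boundary */
>            /* set up and collect this cycle's 1553 transactions here */
>            }
>        }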
>
>THE SOFTWARE ARCHITECTURE
>
>The software to control the 1553 bus and the attached instruments was
>implemented as two tasks.  The first task controlled the setup of
>transactions on the 1553 bus (called the bus scheduler or bc_sched task) and
>the second task handled the collection of the transaction results, i.e., the
>data.  The second task is referred to as the bc_dist (for distribution)
>task.  A typical timeline for the bus activity for a single cycle is shown
>below.  It is not to scale.  This cycle was constantly repeated.
>
>     |< ------------- .125 seconds ------------------------>|
>
>     |<***************|                    |********|   |**>|
>
>                      |<- bc_dist active ->|    bc_sched active
>     |< -bus active ->|                             |<->|
>
>
> ----|----------------|--------------------|--------|---|---|-------
>     t1               t2                   t3       t4  t5  t1
>
>The *** are periods when tasks other than the ones listed are executing.
>Yes, there is some idle time.
>
>t1 - bus hardware starts via hardware control on the 8 Hz boundary. The
>transactions for this cycle had been set up by the previous execution of
>the bc_sched task.
>t2 - 1553 traffic is complete and the bc_dist task is awakened.
>t3 - bc_dist task has completed all of the data distribution
>t4 - bc_sched task is awakened to set up transactions for the next cycle
>t5 - bc_sched activity is complete
>
>The bc_sched and bc_dist tasks check each cycle to be sure that the other
>has completed its execution.  The bc_sched task is the highest priority task
>in the system (except for the vxWorks "tExec" task).  The bc_dist is third
>highest (a task controlling the entry and landing is second).  All of the
>tasks which perform other spacecraft functions are lower.  Science
>functions, such as imaging, image compression, and the ASI/MET task are
>still lower.
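>
>(As a rough illustration of the priority structure -- the entry points,
>priority numbers, and stack sizes below are invented for the example, not
>the flight values -- the tasks are spawned something like this:)
>
>    #include "vxWorks.h"
>    #include "taskLib.h"
>
>    extern void bcSchedTask (void);    /* sets up 1553 transactions      */
>    extern void bcDistTask (void);     /* distributes 1553 data          */
>    extern void asiMetTask (void);     /* meteorological science         */
>
>    #define BC_SCHED_PRI   10          /* highest application priority   */
>    #define EDL_PRI        20          /* entry and landing (2nd highest)*/
>    #define BC_DIST_PRI    30          /* third highest                  */
>    #define ASI_MET_PRI   100          /* well below the "medium" tasks  */
>
>    void startBusTasks (void)
>        {
>        taskSpawn ("tBcSched", BC_SCHED_PRI, 0, 8192,
>                   (FUNCPTR) bcSchedTask, 0,0,0,0,0,0,0,0,0,0);
>        taskSpawn ("tBcDist", BC_DIST_PRI, 0, 8192,
>                   (FUNCPTR) bcDistTask, 0,0,0,0,0,0,0,0,0,0);
>        taskSpawn ("tAsiMet", ASI_MET_PRI, 0, 8192,
>                   (FUNCPTR) asiMetTask, 0,0,0,0,0,0,0,0,0,0);
>        }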
>
>Data is collected from devices connected to the 1553 bus only when they are
>powered.  Most of the tasks in the system that access the information
>collected over the 1553 do so via a double buffered shared memory mechanism
>into which the bc_dist task places the latest data.  The exception to this
>is the ASI/MET task, which receives its information via an interprocess
>communication (IPC) mechanism.  The IPC mechanism uses the vxWorks pipe()
>facility.  Tasks wait on one or more IPC "queues" for messages to arrive.
>Tasks use the select() mechanism to wait for message arrival.  Multiple
>queues are used when both high- and low-priority messages are required.
>Most of the IPC traffic in the system is not for the delivery of real-time
>data.  However, again, the exception to this is the use of the IPC mechanism
>with the ASI/MET task.  The cause of the reset on Mars was in the use and
>configuration of the IPC mechanism.
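>
>(A bare-bones sketch of that style of IPC, with invented names: a pipe
>device is created, and the receiving task waits for messages on it with
>select().  This is not the flight code, only the shape of the mechanism.)
>
>    #include "vxWorks.h"
>    #include "pipeDrv.h"
>    #include "selectLib.h"
>    #include "ioLib.h"
>    #include "fcntl.h"
>
>    void asiMetReceive (void)
>        {
>        char   msg[64];
>        int    fd;
>        fd_set readFds;
>
>        pipeDevCreate ("/pipe/asimet", 8, sizeof (msg));  /* 8 messages deep */
>        fd = open ("/pipe/asimet", O_RDONLY, 0);
>
>        for (;;)
>            {
>            FD_ZERO (&readFds);
>            FD_SET (fd, &readFds);
>            /* select() protects its file-descriptor "wait list" with a
>             * mutex semaphore -- the semaphore at the center of the
>             * failure described below.                                  */
>            select (fd + 1, &readFds, NULL, NULL, NULL);
>            if (FD_ISSET (fd, &readFds))
>                read (fd, msg, sizeof (msg));         /* consume message */
>            }
>        }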
>
>THE FAILURE
>
>The failure was identified by the spacecraft as a failure of the bc_dist
>task to complete its execution before the bc_sched task started.  The
>reaction to this by the spacecraft was to reset the computer.  This reset
>reinitializes all of the hardware and software. It also terminates the
>execution of the current ground-commanded activities.  No science or
>engineering data that has already been collected is lost (the data in RAM is
>recovered so long as power is not lost).  However, the remainder of the
>activities for that day were not accomplished until the next day.
>
>The failure turned out to be a case of priority inversion (how we discovered
>this and how we fixed it are covered later).  The higher priority bc_dist
>task was blocked by the much lower priority ASI/MET task that was holding a
>shared resource.  The ASI/MET task had acquired this resource and then been
>preempted by several of the medium priority tasks.  When the bc_sched task
>was activated to set up the transactions for the next 1553 bus cycle, it
>detected that the bc_dist task had not completed its execution.  The
>resource that caused this problem was a mutual exclusion semaphore used
>within the select() mechanism to control access to the list of file
>descriptors that the select() mechanism was to wait on.
>
>The select mechanism creates a mutual exclusion semaphore to protect the
>"wait list" of file descriptors for those devices which support select.  The
>vxWorks pipe() mechanism is such a device and the IPC mechanism we used is
>based on using pipes. The ASI/MET task had called select, which had called
>pipeIoctl(), which had called selNodeAdd(), which was in the process of
>giving the mutex semaphore.  The ASI/MET task was preempted and semGive()
>was not completed.  Several medium priority tasks ran until the bc_dist task
>was activated.  The bc_dist task attempted to send the newest ASI/MET data
>via the IPC mechanism, which called pipeWrite().  pipeWrite() blocked trying
>to take the mutex semaphore.  More of the medium priority tasks ran, still not
>allowing the ASI/MET task to run, until the bc_sched task was awakened.  At
>that point, the bc_sched task determined that the bc_dist task had not
>completed its cycle (a hard deadline in the system) and declared the error
>that initiated the reset.
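>
>(Schematically, the inversion looks like this.  This is a cartoon of the
>sequence above, not the actual code paths; the function names are invented
>and the medium priority tasks that keep ASI/MET off the CPU are omitted:)
>
>    #include "vxWorks.h"
>    #include "semLib.h"
>
>    SEM_ID selMutex;   /* select()'s wait-list mutex, created by selectLib
>                        * without the inversion-safe (priority inheritance)
>                        * option                                           */
>
>    void asiMetSide (void)                 /* low priority task            */
>        {
>        semTake (selMutex, WAIT_FOREVER);  /* taken inside select() via
>                                            * pipeIoctl()/selNodeAdd()     */
>        /* ...preempted here, before semGive() completes; with no priority
>         * inheritance its priority is never raised, so the medium priority
>         * tasks run instead...                                            */
>        semGive (selMutex);
>        }
>
>    void bcDistSide (void)                 /* high priority task           */
>        {
>        semTake (selMutex, WAIT_FOREVER);  /* inside pipeWrite(): blocks
>                                            * behind the low priority task
>                                            * and misses its deadline      */
>        semGive (selMutex);
>        }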
>
>HOW WE FOUND IT
>
>The software that flies on Mars Pathfinder has several debug features within
>it that are used in the lab but are not used on the flight spacecraft (not
>used because some of them produce more information than we can send back to
>Earth).  These features were not "fortuitously" left enabled but remain in
>the software by design.  We strongly believe in the "test what you fly and
>fly what you test" philosophy.
>
>One of these tools is a trace/log facility which was originally developed to
>find a bug in an early version of the vxWorks port (Wind River ported
>vxWorks to the RS6000 processor for us for this mission).  This trace/log
>facility was built by David Cummings who was one of the software engineers
>on the task.  Lisa Stanley, of Wind River, took this facility and
>instrumented the pipe services, msgQ services, interrupt handling, select
>services, and the tExec task.  The facility initializes at startup and
>continues to collect data (in ring buffers) until told to stop.  The
>facility produces a voluminous dump of information when asked.
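>
>(In spirit, the trace/log facility is a set of instrumentation points that
>drop fixed-size records into ring buffers.  The toy version below is only a
>sketch of the idea, not their implementation, and all of the names are
>invented:)
>
>    #include "vxWorks.h"
>    #include "tickLib.h"
>
>    #define TRACE_ENTRIES 1024
>
>    typedef struct
>        {
>        ULONG tick;                     /* timestamp in system clock ticks */
>        int   event;                    /* event code (semTake, pipeWrite, ...) */
>        int   arg;                      /* event-specific detail           */
>        } TRACE_REC;
>
>    static TRACE_REC traceBuf[TRACE_ENTRIES];
>    static int       traceIdx = 0;
>    static BOOL      traceOn  = TRUE;
>
>    void traceLog (int event, int arg)  /* called from instrumented code   */
>        {
>        if (!traceOn)
>            return;
>        traceBuf[traceIdx].tick  = tickGet ();
>        traceBuf[traceIdx].event = event;
>        traceBuf[traceIdx].arg   = arg;
>        traceIdx = (traceIdx + 1) % TRACE_ENTRIES;   /* wrap, keep newest  */
>        }
>
>    void traceStop (void)               /* freeze the buffers for dumping  */
>        {
>        traceOn = FALSE;
>        }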
>
>After the problem occurred on Mars we ran the same set of activities over
>and over again in the lab.  The bc_sched task was already coded to stop the
>trace/log collection and dump the data for this error (even though we knew
>we could not get the dump in flight).  So, when we went into the lab to test
>it we did not have to change the software.
>
>In less than 18 hours we were able to cause the problem to occur.  Once we
>were able to reproduce the failure, the priority inversion problem was
>obvious.
>
>HOW WAS THE PROBLEM CORRECTED
>
>Once we understood the problem, the fix appeared obvious: change the
>creation flags for the semaphore so as to enable priority inheritance.
>The Wind River folks, for many of their services, supply global
>configuration variables for parameters such as the "options" parameter for
>the semMCreate used by the select service (although this is not documented
>and those who do not have vxWorks source code or have not studied the source
>code might be unaware of this feature).  However, the fix is not so obvious
>for several reasons:
>
>1) The code for this is in the selectLib() and is common for all device
>creations.  When you change this global variable all of the select
>semaphores created after that point will be created with the new options.
>There was no easy way in our initialization logic to modify only the
>semaphore associated with the pipe used for bc_dist-to-ASI/MET task
>communications.
>
>2) If we make this change, and it is applied on a global basis, how will
>this change the behavior of the rest of the system?
>
>3) The priority inheritance option was deliberately left out by Wind River
>in the default selectLib() service for optimum performance.  How will
>performance degrade if we turn priority inheritance on?
>
>4) Was there some intrinsic behavior of the select mechanism itself that
>would change if priority inheritance was enabled?
>
>We did end up modifying the global variable to enable priority
>inheritance.  This corrected the problem.  We asked Wind River to analyze the
>potential impacts for (3) and (4). They concluded that the performance
>impact would be minimal and that the behavior of select() would not change
>so long as there was always only one task waiting for any particular file
>descriptor.  This is true in our system.  I believe that the debate at Wind
>River still continues on whether the priority inheritance option should be
>on by default.  For (1) and (2) the change did alter the characteristics of
>all of the select semaphores.  We concluded, both by analysis and test, that
>there was no adverse behavior.  We tested the system extensively before we
>changed the software on the spacecraft.
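>
>(The essence of the change, schematically.  selectLib() creates its
>wait-list mutex with an options word that can be overridden through the
>undocumented global mentioned above; the effect is equivalent to going from
>the first call below to the second.  The wrapper functions are invented for
>illustration only:)
>
>    #include "vxWorks.h"
>    #include "semLib.h"
>
>    /* before the fix: priority queuing, but no priority inheritance      */
>    SEM_ID waitListMutexBefore (void)
>        {
>        return semMCreate (SEM_Q_PRIORITY);
>        }
>
>    /* after the fix: a task holding the mutex inherits the priority of
>     * the highest priority task blocked on it, so ASI/MET can no longer
>     * be held off by the medium priority tasks while bc_dist waits        */
>    SEM_ID waitListMutexAfter (void)
>        {
>        return semMCreate (SEM_Q_PRIORITY | SEM_INVERSION_SAFE);
>        }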
>
>HOW WE CHANGED THE SOFTWARE ON THE SPACECRAFT
>
>No, we did not use the vxWorks shell to change the software (although the
>shell is usable on the spacecraft).  "Patching" the software on the
>spacecraft is a specialized process.  It involves sending the
>differences between what you have onboard and what you want (and have on
>Earth) to the spacecraft.  Custom software on the spacecraft (with a whole
>bunch of validation) modifies the onboard copy.  If you want more info you
>can send me e-mail.
>
>WHY DIDN'T WE CATCH IT BEFORE LAUNCH?
>
>The problem would only manifest itself when ASI/MET data was being collected
>and intermediate tasks were heavily loaded.  Our pre-launch testing was
>limited to the "best case" high data rates and science activities.  The fact
>that data rates from the surface were higher than anticipated and the amount
>of science activities proportionally greater served to aggravate the
>problem.  We did not expect nor test the "better than we could have ever
>imagined" case.
>
>HUMAN NATURE, DEADLINE PRESSURES
>
>We did see the problem before landing but could not get it to repeat when we
>tried to track it down.  It was not forgotten nor was it deemed unimportant.
>
>Yes, we were concentrating heavily on the entry and landing software.  Yes,
>we considered this problem lower priority.  Yes, we would have liked to have
>everything perfect before landing.  However, I don't see any problem here
>other than that we ran out of time to get the lower priority issues completed.
>
>We did have one other thing on our side; we knew how robust our system was
>because that is the way we designed it.
>
>We knew that if this problem occurred we would reset.  We built in
>mechanisms to recover the current activity so that there would be no
>interruptions in the science data (although this wasn't used until later in
>the landed mission).  We built in the ability (and tested it) to go through
>multiple resets while we were going through the Martian atmosphere.  We
>designed the software to recover from radiation-induced errors in the memory
>or the processor.  The spacecraft would have even done a 60-day mission on
>its own, including deploying the rover, if the radio receiver had broken
>when we landed.  There are a large number of safeguards in the system to
>ensure robust, continued operation in the event of a failure of this type.
>These safeguards allowed us to designate problems of this nature as lower
>priority.
>
>We had our priorities right.
>
>ANALYSIS AND LESSONS
>
>Did we (the JPL team) make an error in assuming how the select/pipe
>mechanism would work?  Yes, probably.  But there was no conscious decision
>not to have priority inheritance enabled.  We just missed it.  There are
>several other places in the flight software where similar protection is
>required for critical data structures and the semaphores do have priority
>inversion protection.  A good lesson when you fly COTS stuff - make sure you
>know how it works.
>
>Mike is quite correct in saying that we could not have figured this out
>**ever** if we did not have the tools to give us the insight.  We built many
>of the tools into the software for exactly this type of problem.  We always
>planned to leave them in.  In fact, the shell (and the stdout stream) were
>very useful the entire mission.  If you want more detail send me a note.
>
>SETTING THE RECORD STRAIGHT
>
>First, I want to make sure that everyone understands how I feel in regard to
>Wind River.  These folks did a fantastic job for us.  They were enthusiastic
>and supported us when we came to them and asked them to do an affordable
>port of vxWorks.  They delivered the alpha version in 3 months.  When we had
>a problem they put some of the brightest engineers I have ever worked with
>on the problem.  Our communication with them was fantastic.  If they had not
>done such a professional job the Mars Pathfinder mission would not have been
>the success that it is.
>
>Second, Dave Wilner did talk to me about this problem before he gave his
>talk.  I could not find my notes where I had detailed the description of the
>problem.  So, I winged it and I sure did get it wrong.  Sorry Dave.
>
>ACKNOWLEDGMENTS
>
>First, thanks to Mike for writing a very nice description of the talk.  I
>think I have had probably 400 people send me copies.  You gave me the push
>to write the part of the Mars Pathfinder End-of-Mission report that I had
>been procrastinating on.
>
>Special thanks to Steve Stolper for helping me do this.  The biggest thanks
>should go to the software team that I had the privilege of leading and whose
>expertise allowed us to succeed: Pam Yoshioka, Dave Cummings, Don Meyer, 
>Karl Schneider, Greg Welz, Rick Achatz, Kim Gostelow, Dave Smyth, 
>Steve Stolper.   Also, Miguel San Martin, Sam Sirlin, Brian Lazara (WRS), 
>Mike Deliman (WRS), Lisa Stanley (WRS)
>
>Glenn Reeves, Mars Pathfinder Flight Software Cognizant Engineer
>glenn.e.reeves@jpl.nasa.gov