Jonathan Moffett, Moderator, hise-safety-critical mailing list
From this point on the text is supplied by Gérard.
This contribution is prompted by the fact that there is still no
widespread agreement on the nature of the failure of Ariane 5 flight
501 (June 1996). It is also prompted by discussions I have had with
Peter Ladkin, whom I thank for helping to improve the presentation of
the arguments that follow.
An Inquiry Board (IB) was formed to identify the cause(s) of the 501
failure. The IB report concludes that the causes are software (S/W)
design and S/W implementation errors [ESA, 1996], a view which is
disputed; see [SCS], [RISKS], [Ladkin, 1998], [Le Lann, 1996] for
examples. (Of course, these analyses, as well as this contribution,
assume that all causal factors appear in the IB report.) In fact, it
is almost straightforward to show that the 501 failure has a unique
cause, which is a system engineering (SE) fault.
This is so because this SE fault is the root of the causal graph that
leads to the 501 failure. Stated differently, among the other causal
factors (such as, e.g., the BH overflow), none precedes this one. (I
leave it to Peter Ladkin to give a more refined definition of
"cause".)
Back to the facts. The alignment task was running, despite the fact
that, after lift-off, realignment of the inertial platform, needed
with Ariane 4 (A4), is useless in the case of Ariane 5 (A5). This task
contains the conversion procedure that computes integer BH from
horizontal velocity.
What if someone had had the idea of disallowing the execution of this
task after lift-off? Simple: the scenario that led to the 501 failure
could not have occurred.
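
To make the point concrete, here is a minimal sketch in Python. It is
an illustration only, not the flight code. The names (Phase,
convert_bh, alignment_task, allowed_after_liftoff) are hypothetical,
and the 16-bit signed range for BH is an assumption taken from the
description of the conversion in [ESA, 1996].

```python
from enum import Enum

# Assumption: BH is stored as a 16-bit signed integer.
INT16_MIN, INT16_MAX = -32768, 32767


class Phase(Enum):
    PRE_LAUNCH = "pre-launch"
    FLIGHT = "flight"


def convert_bh(horizontal_velocity: float) -> int:
    """Convert a velocity-derived value to integer BH; fail if it does not fit."""
    bh = int(horizontal_velocity)
    if not INT16_MIN <= bh <= INT16_MAX:
        raise OverflowError(f"BH {bh} does not fit in 16 bits")
    return bh


def alignment_task(phase: Phase, horizontal_velocity: float,
                   allowed_after_liftoff: bool):
    """Run the alignment computation only if the system-level requirement allows it."""
    if phase is Phase.FLIGHT and not allowed_after_liftoff:
        return None  # task inhibited: the conversion is never reached
    return convert_bh(horizontal_velocity)
```

With allowed_after_liftoff set to False (the A5 requirement), a large
in-flight velocity value never reaches convert_bh, whether or not that
procedure is protected against overflow. With it set to True (the A4
requirement), the same value raises OverflowError.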
Now the argument.
How could this someone know that this was the right thing to do?
Obviously, only by correctly capturing the problem to be solved by
those engineers in charge of the A5 computer-based system, i.e. by
correctly specifying the interface between this particular A5
subsystem and the A5 inertial platform subsystem.
Decomposition of a launcher into subsystems, and specification of
appropriate interfaces (capture of requirements and assumptions)
between these subsystems, are SE activities, which depend on which
satellite launcher technologies are selected. Only the main architect
of a launcher can conduct such SE activities correctly, because only
the main architect is responsible for deciding how to decompose the
launcher into subsystems, given the technological choices made.
Consequently, this someone can only be an Ariane 5 engineer. Indeed,
only an engineer aware of the technology retained for the A5 program
can say: "Given A5 technology, there is no need to have the strap-down
inertial platform aligned after lift-off".
That system engineering-dependent knowledge is totally independent of
whether the alignment "thing" which, after lift-off, happens to be
needed (A4) or not needed (A5) is implemented in hardware, in
software, or in melloware, correctly or incorrectly. That knowledge is
also totally independent of whether the "thing" is a reused "thing" or
a newly developed "thing". It is also totally independent of whether
inhibition of the "thing" after lift-off is instantiated via, e.g., a
boolean set to false, or a mechanical switch activated after lift-off.
Hence, the 501 failure does not result from "how" the "what" (was
needed or not needed) was instantiated. The 501 failure was caused by
overlooking the "what", which is a requirement capture fault. And
given that the knowledge at stake is system engineering-dependent, the
cause is a SE fault.
It has never been the intent of ESA, of CNES, of Aérospatiale, or of
Arianespace, to plan, commission, build and operate a launcher based
on A5's technology and which needs inertial platform alignment after
lift-off, a fictitious launcher that could be labelled Ariane 4.5,
half-way between A4 and A5.
End of the argument.
Therefore, stricto sensu, all the work that has been invested in
"inspecting the code" and ironing out the "S/W errors" from the
alignment task, and all the contributions - including ours [Le Lann,
1996], [Le Lann, 1997] - to the "Is the 501 failure due to software or
system engineering mistakes?" debate, apply to this fictitious Ariane
4.5 launcher, which will never be operated and whose unique flight is
labelled 501, not to the Ariane 5 program.
The real qualification flights of A5 have been (successful) flights
502 and 503, which were conducted with the alignment task inhibited
after lift-off. Consequently, success with these flights cannot result
from having "inspected the code and corrected the bugs" of the
alignment task (since this task was not in use after lift-off).
It is certainly interesting to keep discussing the 501 failure, until,
maybe, our community reaches a consensus on one of the three
prevailing views, namely:
1) The 501 failure could have been avoided by "inspecting the S/W"
(group G1),
2) No way! The failure has been caused by a requirement fault,
which is further split into two diagnoses:
2.1) The failure could have been avoided by resorting to a "good" S/W
Engineering method (group G2),
2.2) Maybe, with luck. The failure would have been avoided for sure
(it's easier to say, now that we know what happened) by resorting to a
"good" System Engineering method (group G3).
Still, we should not forget that these discussions make sense only in
the context of the fictitious Ariane 4.5 launcher. Neither should we
ignore that the issue of "correcting the bugs" of the alignment task
lost all practical relevance as early as 1996.
As a member of G3, I am interested in continuing to interact with
representatives of G2 (the most populous group, it seems, at this
time), and in discussing at greater length why I believe it does not
make sense to shift responsibilities from System Engineering to S/W
Engineering or to H/W Engineering.
In the particular case of flight 501, I have argued in [Le Lann, 1996]
and [Le Lann, 1997] that those errors which have been identified in
the IB report are causal consequences of System Engineering faults.
They are not causes of the 501 failure, but manifestations of more
"profound" causes.
Yes, maybe, with luck, following some "good" S/W Engineering method
(or some "good" H/W Engineering method, if a H/W implementation had
been chosen), someone could have been led to ask such questions as Q1:
"Under which conditions should this function be available, be
inhibited?" and Q2: "What's the range of possible values for
horizontal velocity?". It's much less likely that Q3: "What's the
failure model assumed for processors?" or Q4: "Can the assumption that
there is no common mode failure (of the SRI module) be violated?"
would have been raised.
But why take chances, anyway? This knowledge (questions and responses)
is natural and obvious to Ariane 5 engineers (Q1 and Q2), natural and
obvious to (system-level) designers of the Ariane computer-based
system (Q3 and Q4). With a "good" System Engineering method at hand,
it would have been normal practice for these engineers to
spontaneously "propagate that knowledge", via specifications handed
over to S/W (to H/W) engineers, relieving them from the burden of "not
forgetting to ask (the right questions?, all of them?)".
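
As a rough illustration of what "propagating that knowledge" via a
specification could look like, the sketch below records the answers to
Q1 and Q2 alongside the function they apply to. The FunctionSpec
format, the check_call helper, and the numeric range are all
hypothetical placeholders, not values or formats from the Ariane
program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionSpec:
    """System-level requirements handed over with a function (hypothetical format)."""
    name: str
    active_phases: frozenset  # answer to Q1: when the function must be available
    input_range: tuple        # answer to Q2: (min, max) of the input value


def check_call(spec: FunctionSpec, phase: str, value: float) -> str:
    """Classify a call against the specification before any computation runs."""
    if phase not in spec.active_phases:
        return "inhibited"
    low, high = spec.input_range
    if not low <= value <= high:
        return "out-of-range"
    return "ok"


# Placeholder numbers, for illustration only.
spec = FunctionSpec("alignment", frozenset({"pre-launch"}), (-100.0, 100.0))
print(check_call(spec, "flight", 5000.0))  # prints "inhibited"
```

The implementer no longer has to think of asking Q1 or Q2: the answers
arrive with the specification, and how they are enforced (in S/W or in
H/W) is a separate, downstream choice.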
It seems there is a temptation to consider that a S/W (or a H/W)
Engineering method is "good" not only if it guarantees correct
implementations of specifications but, furthermore, if it also
guarantees that the specifications under consideration are correct
with respect to some higher-level problem. Why should a S/W (or a H/W)
Engineering method compensate for lack of consideration for System
Engineering issues? Where do these specifications meant to be S/W
(or H/W) implemented come from? Is there not a boundary to the
"universe" that is tractable with S/W (or H/W) concepts?
Besides this, concerning the Ariane 5 program, a really interesting
question is as follows: Was the S/W used for flight 501 - with the
exception of the alignment task - found to be "erroneous", and if so,
have experts found fatal S/W errors, i.e., errors which, if not
corrected, would have led to a failure of flight 502?
As of now, I have received only one response that is not content-free
(i.e., other than "it's secret", which might be understandable). I
have been told by some experts - in group G1 - that they had found
non-fatal S/W errors. This demonstrates that bug-free S/W is not a
necessity, given that Ariane 4 has been operated for over 10 years
very successfully, despite the existence of these S/W errors.
[ESA, 1996] European Space Agency, "Ariane 5 - Flight 501 Failure",
Board of Inquiry Report, 19 July 1996, 18 p.
[http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html].
[Ladkin, 1998] P. Ladkin, "The Ariane 5 Accident: A Programming
Problem?", Article RVS-J-98-02, Bielefeld University, Germany, March
1998 [http://www.rvs.uni-bielefeld.de].
[Le Lann, 1996] G. Le Lann, "The Ariane 5 Flight 501 Failure - A Case
Study in System Engineering for Computing Systems", INRIA Research
Report 3079, Dec. 1996, 26 p.
[http://www.inria.fr/RRRT/publications-fra.html].
[Le Lann, 1997] G. Le Lann, "An Analysis of the Ariane 5 Flight 501
Failure - A System Engineering Perspective", 10th IEEE Intl. ECBS
Conference, March 1997, 339-346.
[RISKS] The RISKS Forum [http://catless.ncl.ac.uk/Risks].
[SCS] Safety Critical Systems Mailing List [ftp.cs.york.ac.uk, directory
hise_reports/sc.list].
PS: comments more than welcome.
**********************************
Gérard Le Lann
INRIA - Projet REFLECS - BP 105
78153 LE CHESNAY Cedex, France
Fax: +33 (0)1 39.63.58.92
**********************************