[gclist] What to say about GC.

Richard A. O'Keefe ok@cs.rmit.edu.au
Thu, 25 Jul 1996 14:31:59 +1000 (EST)


	From: Tom Lord <lord@LanMinds.Com>
	To: gclist@iecc.com
	Subject: Re: [gclist] What to say about GC.

	The false idea that unneeded objects are automatically freed by a
	collector stems from the true fact that collectors automatically free
	all objects not reachable by reference chains in the program's data
	structures.  What collectors don't do, as I'm sure we all know, is to
	free those objects reachable by reference chains, but unreachable by
	the running program.  For example, if there is a list, L, and the
	program will never look at (cdr L) then (cdr L) isn't needed, but it
	won't be freed.

There are so many assumptions behind this argument that I am at a loss
where to start.

1.  The major assumption appears to be "if it isn't perfect, then it isn't
    better than people".  In this specific example, it's patently false:
    if L exists and L->tail will not be used, the number of C programmers
    or Ada programmers or whatever programmers who would free L->tail is
    vanishingly small.

2.  A similar assumption appears to be "people almost always get it right".
    This also is patently false.  The C programmers who _do_ remember to
    free L->tail will usually do it by calling free(L->tail), which will
    probably result in a leak.  Hen's teeth, if people were good at getting
    this stuff right, there wouldn't be any demand for Purify, or Sun's
    'bcheck', or lclint's amazing pointer-tracking annotations.  And the
    applications that *don't* crash my Mac would not be confined to the
    Prolog, Xlisp-Stat, Gambit, and Clean systems (all of which use GC).

    It would be *extremely* surprising if a sufficiently large proportion
    of C/C++/Pascal/Ada/Modula/... programmers were better at live/dead
    analysis than compilers.

3.  Another assumption strongly linked to 1 is "if it isn't perfect, it
    isn't worth having".  This is also clearly false.  As I see it, my
    requirements for garbage collection are
    - it must never reclaim live storage
    - it must reclaim a *useful* amount of dead storage *soon enough*
    - it must run *fast enough*
    where "useful", "soon enough", and "fast enough" depend on the
    application and the environment.

4.  The specific case is one that can readily be handled with compiler
    assistance:  whenever the compiler detects that a variable has just
    become dead, it can insert code to null out that variable.  This
    analysis can also be carried into records (just as the SPARCompiler
    Pascal compiler's "-Rw" option carries uninitialised variable
    tracing into records, and just as the lclint C checker carries
    uninitialised variable tracing into records).

	Any claim that collectors make it
	trivial to ensure that all unneeded objects are freed is in essence a
	claim that it is easy to manually maintain the invariant that unneeded
	objects are not referenced -- a claim that is at best unsupported and
	that can be seen to be highly unlikely.

This is a straw man.  Nobody who knows anything about garbage collection
would ever make it; such people can construct obvious "halting problem"
counterexamples in a few seconds.  What can be reasonably claimed is that
    - automatic storage management
    - makes it EASIER to reclaim MOST storage SOON ENOUGH
    - for all but a few real-time applications
    - in most modern environments
    - while eliminating a large class of storage management errors
    that are known to plague existing commercially important software.

	The sometimes false idea, that only the GC implementor has to worry
	about GC internals, is true for programmers working in high-level
	languages on a fixed platform using a stable language implementation.
	But the idea is misleading for programmers who use GC with lower level
	languages like C or C++, who extend an existing high-level language
	implementation with new built-in primitives or types, or who must port
	a GC beyond the platforms supported by its author.

There are three answers to this.
(a) Yes, garbage collection is tricky with C (and, despite an explicit
    claim to the contrary in "Object Oriented Extensions to Pascal",
    in Pascal too).  This is not a problem with GC; it is a problem
    with C and Pascal.
(b) "fixed platforms" is a red herring.  I don't have to worry about GC
    internals on *any* platform to which Haskell, CAML, Clean, Lisp, Prolog,
    ... have been ported, any more than I have to worry about string
    storage management when using TCL, AWK, Perl, or Ksh.  These things
    have stable implementations on many platforms.
(c) Of course people who port a GC have to worry about GC internals, but
    that is _not_ a lot of people.

	In simple, short-running, non-critical or otherwise permissive
	programs GC can be used to automate memory management -- it won't
	necessarily be 100% correct but will tend to err in the less harmful
	direction of core leaks.  Most of the good press GC gets seems to
	focus on this style of using it.

There is no reason why long-running programs cannot use GC.  Hen's teeth,
reference-counted GC is *built into* operating systems such as UNIX and
Windows NT:  you can't have a long-running Windows NT program that *doesn't*
use a form of GC.  Back in the early days of Poplog, people at Sussex
University would log in to VMS, start Poplog, and stay inside Poplog as
long as VMS was up.  Erlang uses garbage collection, and that language is
explicitly intended for long-running programs.

You can very easily have something which is 100% correct with respect to
the specification its designers have in mind; it won't be 100% correct
with respect to a designedly unachievable straw man specification, BUT
NEITHER WILL MANUAL STORAGE ALLOCATION BE 100% CORRECT WITH RESPECT TO
SUCH A SPECIFICATION.

The reason is obvious.  Complex systems have to be built in pieces,
using best software engineering practice.  This practice includes
***information hiding***.  A compiler is allowed to use information
about the internals of another package when compiling this one, but
a programmer is not (because the internals of the other package may
*change*; the compiler is allowed because if such a change is made,
the program will be recompiled).  And that means that the compiler
(hence the garbage collector) has MORE information available to it
than a programmer.

	Looking at the state of things today, the decision to use a compiler
	is much easier -- but an analogy to GC still makes sense: using a
	compiler doesn't necessarily save programmers from having to work very
	carefully to make programs efficient.

Yes, but be very careful to understand what form this work takes.
In the case of using a compiler, it means understanding the structure
and character of the problem, and selecting (not necessarily implementing)
appropriate data structures and algorithms, and then measuring to see what
performance problems, if any, remain.  [I note that some component libraries
claim to be "only" 2-3 times slower than handwritten code.]
In the case of storage management, it doesn't mean inserting hand code to
destroy storage, but designing data structures and algorithms that don't
turn over a lot of storage in the first place, and it most especially
means *measuring* the program's behaviour instead of *guessing*.

Programmer intuition is very often wrong.

For example, everyone knows that Scheme is slow, right?
Well, yesterday I tried a three-way race between
    Ada (Gnat, checks switched off, optimisation high)
    Pascal (SPARCompiler Pascal, checks off, optimisation high)
    Scheme (Stalin, optimisation high, same gcc backend as Gnat).
The problem was 2-dimensional integration.
Ada and Pascal took about the same time.
Scheme was three times faster.
All programs computed the same result.

If you weren't expecting that result, why should you trust your
intuition about what's going on with storage management any more than that?

The real lesson is MEASURE, MEASURE, MEASURE.