[gclist] What to say about GC (and free GC support SW for C/C++)

Paul R. Wilson wilson@cs.utexas.edu
Fri, 26 Jul 1996 05:57:57 -0500


I agree with Richard's argument that GC bugs are always possible and
can be extremely annoying, but generally less common than compiler
bugs.

In general, any reasonable GC for a language designed for GC is much less
complicated than a good optimizing compiler.

If GC'd languages were more popular, GC's would be even more reliable.
We'd see a lot more *reuse* of good GC's, and they'd get heavily banged
on (hence very well debugged) in implementations of multiple languages.

A couple of thousand lines of code used continually by thousands of
programmers and hundreds of thousands of users daily could be a very
reliable piece of software.

If a GC ever got to be as popular as (say) the GCC compiler, it'd be
a pretty reliable GC---at least, if it got one percent of the development
and debugging attention that GCC has gotten.

We've had many fewer headaches from our own home-grown GC's, used
by a comparatively tiny number of people, than we've had with bugs
in GCC and other C++ compilers.  GC just isn't that hard.  Code optimization
is much harder.

Tom may have been bitten by SCM, which is IMHO not a particularly
carefully designed system.  This is understandable, because it was
not designed by professional language-design-and-implementation experts,
but by people who wanted a free, portable, and featureful Scheme 
implementation in a hurry.  Its support for GC could be much improved,
and I'm sure its reliability improves somewhat over time, except
occasionally when tricky-to-implement new low-level features are added.

Background: SCM is a C-coded Scheme interpreter that relies very
heavily---and in my view, unnecessarily---on conservative pointer
finding, for simple interoperability with code written in C. (Simple
for random application programmers, that is, not for developers.)  I prefer
a Scheme-to-C interface that emphasizes Scheme more and C less, and makes
clearer what the GC issues are when the two interact.  (This is clearly
a matter of personal preference, to a large degree, as well as a difference
in system design goals.  But I think it's good to provide a GC-safer
interface to programmers who want to use it, as well as a "we'll take
care of it" conservative interface to programmers who want to rely
on that.)

By the way, our group has developed a "runtime type descriptor" system
for extracting object layout information from C and C++ object files,
so that C and C++ data can be traced precisely IF the programmer uses
a special allocation routine and avoids unsafe casts.

(You compile the file with debugging on, and our tool extracts layout
information from the debug output.  Then you can strip the .o file,
or compile it with debugging off, to avoid bloat and allow full
optimization.  This is automated by simple makefiles.)
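
To make the idea concrete, here is a minimal sketch of what tracing
through a type descriptor might look like.  This is not the actual RTTD
interface: TypeDescriptor, node_descriptor, and for_each_pointer are
hypothetical names made up for illustration, and the descriptor is
written by hand where a real tool would generate it from debug output.
It also assumes that object pointers and void* share a representation,
which holds on common platforms but is not guaranteed by the language.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical descriptor: one per type, recording where pointers live.
struct TypeDescriptor {
    std::size_t size;                          // size of the described type
    std::vector<std::size_t> pointer_offsets;  // byte offsets of pointer fields
};

// An example type that a GC-cooperative program might allocate.
struct Node {
    int    key;
    Node*  left;
    double weight;
    Node*  right;
};

// Hand-written here; a layout-extraction tool would emit this table.
inline const TypeDescriptor& node_descriptor() {
    static const TypeDescriptor d{
        sizeof(Node), {offsetof(Node, left), offsetof(Node, right)}};
    return d;
}

// Precise tracing: visit exactly the pointer fields and nothing else.
// Assumes Node* and void* have the same representation (see lead-in).
template <typename Visit>
void for_each_pointer(const void* obj, const TypeDescriptor& d, Visit visit) {
    const char* base = static_cast<const char*>(obj);
    for (std::size_t off : d.pointer_offsets) {
        void* target;
        std::memcpy(&target, base + off, sizeof target);  // avoids aliasing trouble
        visit(target);
    }
}
```

A collector would call for_each_pointer on each object it reaches and
mark or copy the targets; the int and double fields are never looked at,
so they can't be mistaken for pointers.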

You could use this with a conservative GC to improve its precision and
reliability.  GC-cooperative C or C++ code can be compiled this way,
and you could still use plain malloc and conservative tracing for
random libraries you need to link into your program.
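
A rough sketch of how the two modes might coexist in one tracer follows.
All of the names (Layout, Block, HeapRange, candidate_pointers) are
hypothetical, and the design is only one plausible arrangement: a block
that carries layout information is traced precisely, and a block without
it has every aligned word checked against the heap bounds.  It assumes
that a pointer fits in a std::uintptr_t and that blocks are word-aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Layout {
    std::vector<std::size_t> pointer_offsets;  // byte offsets of pointer fields
};

struct Block {
    const void*   start;
    std::size_t   size;
    const Layout* layout;  // null means "no layout known, scan conservatively"
};

struct HeapRange {
    std::uintptr_t lo, hi;  // heap occupies addresses [lo, hi)
};

// Returns the words in the block that the tracer would treat as pointers.
inline std::vector<std::uintptr_t> candidate_pointers(const Block& b,
                                                      HeapRange heap) {
    std::vector<std::uintptr_t> out;
    const char* base = static_cast<const char*>(b.start);
    auto read_word = [&](std::size_t off) {
        std::uintptr_t w;
        std::memcpy(&w, base + off, sizeof w);
        return w;
    };
    if (b.layout) {
        // Precise: only the fields the layout says are pointers.
        for (std::size_t off : b.layout->pointer_offsets) {
            std::uintptr_t w = read_word(off);
            if (w != 0) out.push_back(w);
        }
    } else {
        // Conservative: any word that falls inside the heap might be a pointer.
        for (std::size_t off = 0; off + sizeof(std::uintptr_t) <= b.size;
             off += sizeof(std::uintptr_t)) {
            std::uintptr_t w = read_word(off);
            if (w >= heap.lo && w < heap.hi) out.push_back(w);
        }
    }
    return out;
}
```

The difference shows up when a block holds an integer that happens to
look like a heap address: the conservative path reports it and so keeps
whatever it "points" at alive, and the precise path ignores it.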

Currently, we use this with our persistent store for C++, and our
real-time GC for C++.  (The RTGC currently uses smart pointers, not
conservative stack scanning and pagewise write barriers, for real-time
reasons, but it would be easy to make it use those at some cost in
conservatism and lost real-time guarantees.)

Our RTTD code is free, under the GPL.  (This doesn't affect application
programs that use it, since it does its work at compile time---you
can use it with commercial apps.)  This system can use code from gdb
to understand debug output of a variety of UNIX compilers, and there's
also a module that lets it read debug output of the IBM compilers
for OS/2.  Ports to other compilers should be straightforward, and
often trivial if they can emit debug output in a format we already
handle.  (We haven't tried any Windows compilers, but don't see any
serious problems.  Since most Windows compilers can emit debug output
in several formats, we may be able to find one or two formats that
will handle essentially all compilers.)

We haven't developed a C allocator interface yet, but that should be
easy.  (The C++ interface is more sophisticated and interesting to
us.  One nice feature is that we can find and fix up virtual function
table pointers, which allows us to dynamically link persistent C++ objects
to the vtbl's in a given executable, transparently and on the fly, as
pages of data are faulted into memory.)

There's a draft of a paper on the RTTD system available from our 
web page, http://www.cs.utexas.edu/users/oops.  The code is available
there, too, with the Texas Persistent Store distribution.  (Texas is
also free.  It's an orthogonal page-faulting p-store with zero
overhead for accesses to in-memory data, and minor overhead on
first access to a page.)

This might be useful in conjunction with Hans's simple GC-safe compilation,
to reduce the conservatism of conservative GC.  (If only for people who
are overly worried about it---I'm *not* one of the people who's paranoid
about conservative GC techniques myself.)

One thing we haven't done is precise scanning of statically-allocated
data.  In general, this looks pretty easy, if annoyingly system-dependent.
For normal executables, UNIX systems generally provide an "nm" utility that
gives the addresses of statically-allocated data.  (OS/2 provides a similar
utility, and I'd guess Windows and the Mac do too.)  In conjunction with our
type descriptor tool, these could be scanned precisely.  If nothing else,
this would let you identify large pointer-free objects to avoid scanning
them.  (This could reduce the base cost of GC for programs with large
images but little live heap data.)
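
As a sketch of the first half of that, the following picks static data
objects out of symbol-table output and drops the ones known to be
pointer-free.  It assumes input shaped like the output of GNU "nm -S"
(hex address, hex size, one-letter type, name, with B/b and D/d marking
uninitialized and initialized data); other systems' tools will differ.
StaticRoot and roots_to_scan are hypothetical names, and the set of
pointer-free names stands in for whatever a type descriptor tool would
report.  We have not built this.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct StaticRoot {
    std::uint64_t address;
    std::uint64_t size;
    std::string   name;
};

// Keep sized data symbols that may contain pointers; skip everything else.
inline std::vector<StaticRoot> roots_to_scan(
        const std::string& nm_output,
        const std::set<std::string>& pointer_free) {
    std::vector<StaticRoot> roots;
    std::istringstream lines(nm_output);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string addr, size, type, name, extra;
        // Require exactly four fields; unsized or undefined symbols have fewer.
        if (!(fields >> addr >> size >> type >> name) || (fields >> extra))
            continue;
        if (type != "B" && type != "b" && type != "D" && type != "d")
            continue;  // not a data symbol (e.g. T for code)
        if (pointer_free.count(name))
            continue;  // known to hold no pointers, so never scanned
        try {
            roots.push_back(StaticRoot{std::stoull(addr, nullptr, 16),
                                       std::stoull(size, nullptr, 16), name});
        } catch (const std::exception&) {
            continue;  // malformed numbers: ignore the line
        }
    }
    return roots;
}
```

Given a megabyte bitmap in the pointer-free set and an eight-byte list
head outside it, only the list head comes back as a root, which is the
saving described above.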

With stuff like this, I think it would be easy to support mostly-precise
GC for C++.  (It would be good if implementors of C++ realized this.)