(course on) garbage collecting

Raul Deluth Miller rockwell@nova.umd.edu
Tue, 27 Dec 1994 09:55:01 -0500


Francois-Rene Rideau:
      Now, I shall again remind you the problem of garbage
   collection.  This is a foremost problem; it is compulsory to have
   one in any dynamic system (which our system should be).
      Persistency does not mean anything without a garbage collection
   policy, and any trusted module must be garbage collection aware. We
   *must* provide standards for this, and it will most surely be a
   very basic module in the system ...
      Believe me; GC is a most important feature for our system. It is
   what distinguishes a high-level-able system from a low-level-only
   one. See C, FORTH, and so on. Even newest C++'s will have to
   include a GC !

I think we'll eventually need to elucidate our reasons for garbage
collection a little more carefully than this.  However, I'll agree
that garbage collection is a HLL feature, not a LLL feature.

[Is it true that garbage collection is our only distinction between
HLL and LLL?  Also, there's the related matter of memory management
implicit in object migration -- how to coalesce memory left behind by
departed objects, for example.]

In general, garbage collection is necessary because of the difficulty
of locating the "last reference" to an object.  This is a specialized
instance of the halting problem -- garbage collection typically does
not solve the problem, but is still a useful approximation (garbage
collection will not reclaim memory if it's potentially accessible --
even if that potential is never realized).

			   Reference Counts

Reference counts are not always a bad solution to the garbage
collection problem -- it is, however, a low level solution.  This
mechanism is distinguished by a fixed cost overhead per "reference".
Since this method requires no global awareness of pointers, it is
amenable to implementation in systems where such awareness is
impossible (because of the fluidity of "pointers").

Reference counts are bad when there are many "references" to otherwise
low cost objects.  Reference counts require explicit programmer
intervention when dealing with "circular references".

The fragmentation Francois-Rene associates with reference counts
arises from the lack of global knowledge of pointers.  Pointers, in
general, are numbers (symbols) which refer to some underlying location
in memory.  Memory is typically arranged as a sequence, and a pointer
typically refers to several locations in that sequence.  Programs
which have a global awareness of pointers can, for example:
(*) copy the data starting at location A to location B, and
(*) change every reference to location A to location B.  If this can
be accomplished "atomically" then program execution can continue
unhindered, and the memory at location A can be reclaimed and
coalesced with adjacent free memory.  Another way of accomplishing
this is to introduce a level of indirection (e.g. "memory management
hardware") and force all uses of pointers to use this indirection.

I put "reference" in quotes because I'm refering to an abstract
concept, not a particular language implementation.  A notable special
case of "reference counting" is the case where the count occupies a
single bit.  All other memory management strategies may be viewed as
algorithms which work with an underlying single-bit reference count
(it's there, or it's not).

   
			    Mark and Sweep
   
This mechanism optimizes the "single bit reference count" by "walking
through all references" occationsionally.  It explicitly represents
the reference count by manipulating the objects.  Note that there are
several variations on this theme depending on issues such as the
representation of free memory.  Francois-Rene implicitly suggested an
implementation where the heap is manipulated as a stack.

			    Stop and Copy

Stop and copy is similar to Mark and Sweep, but represent the single
bit reference count using flow-of-control, rather than explicitly.
This mechanism seems to have some strong simularities with our
proposed "object migration" methodology.  I think we should favor it
for that reason.

Francois-Rene Rideau wrote:

      So, I had a discussion with Mike about GC; the unavoidable
   requirement for a GC is that at any PAUSE and/or call to a GC-aware
   memory manager, the system has some easy way to differentiate
   pointers from other data, to solve problem #2. So, either we spill
   enough registers to separate pointer registers and stacks from
   integer/data registers and stacks, or we have to tag every single
   register/stack value (and thus lose a bit or more per word) to make
   the difference.

There are systems which amortize this PAUSE (for example, generational
garbage collection) and achieve some efficiency gains in the process.
Basically, a lot of computation has data properties "almost like a
stack".  We can take advantage of this by doing garbage collection
less often on the older objects.

Ways of differentiating pointers from other data include:

(1) reserving bits throughout memory which indicate the distinction
(2) placing all pointers in a certain region of memory
or, in general,
(3) placing all pointers in some data structure which makes them
evident.

There is lots of room for application-specific optimization here.  In
addition to the general mechanisms above, some algorithms implicitly
free data structures at certain points in the calculation -- garbage
collection here might be a simple reseting of the heap pointer.

I should note that option (1) gets tricky if we allow packed strings
of 8-bit characters.

      Well, enough for today. I await your remarks. But please read
   all the message. This is a *serious* and *complex* problem, yes
   sir, and you're all directly concerned, as anything you do, a GC
   must be aware of, and the most simple GC we use, the more efficient
   we can make it.

I could say more, by I'll instead refer you to another paper by Henry
Baker: 
   ftp://ftp2.netcom.com/pub/hb/hbaker/ForthStack.html

(or ACM's Sigarch Computer Architecture News, March 1994 pp34-43).

If you have the time, browse his home page:
  ftp://ftp2.netcom.com/pub/hb/hbaker/home.html

There's a lot of good stuff on garbage collection here.

Ultimately, I think we need to factor GC into its underlying
mechanisms, and provide an interface to those mechanisms.  However,
such factoring will require a fair bit of understanding of the nature
of GC (and related concepts such as virtual memory, program migration,
and optimization for efficiency).  I think Henry Baker's work sheds a
lot of light on these sorts of details.

-- 
Raul D. Miller          N=:((*/pq)&|)@                 NB. public e, y, n=:*/pq
<rockwell@nova.umd.edu> P=:*N/@:#               NB. */-.,e e.&factors t=:*/<:pq
                        1=t|e*d    NB. (,-:<:)pq is four large primes, e medium
x-:d P,:y=:e P,:x                  NB. (d P,:y)-:D P*:N^:(i.#D)y [. D=:|.@#.d