[gclist] Thread safe GC

Eric Jeschke jeschke@cs.indiana.edu
Tue, 15 Apr 1997 13:59:15 EST

Jim Larson <larson@kesey.jpl.nasa.gov> writes:
| In message <9704140924.ZM13104@hoh.mti.sgi.com> Hans Boehm writes:
| >- for a massively parallel system, you want to allow more than one
| >thread doin g useful GC work at the same time.  I don't know of too
| >many implementations tha t support this.  (One of ours does, but only
| >to a very small extent.)  I don't believe there are any fundamental
| >problems here.  But I suspect it's hard to get the details right.
| I wrote a mildly-parallel (up to 24-processor) parallel GC as part
| of the runtime system for a committed-choice logic language (similar
| to Mercury).  The details aren't too hard to get right - at least
| no harder than any other part of a parallel runtime system - the
| difficulty lies in getting good parallel speedups out of it.
| We found that a huge problem was that many of the heaps we built
| contained a lengthy 'spine' of live data which had to be scanned
| sequentially.  Even though we had two dozen processors trying to
| do GC work, all but one of them were idle after an initial flurry
| of activity, and had to wait for a long, deep list to be traversed.
| Any parallel code has to be tuned to eliminate bottlenecks or hot
| spots, but parallel GC introduces a problem that's harder still.  The
| programmer must find a way to let the parallel GC run, but cannot do
| by optimizing the GC code.  Instead, the programmer must create data
| structures in such a way that the GC scanning can run in parallel.  So
| the code to optimize in inaccessible and its data set is created
| indirectly - and all the while the programmer has to make sure that
| normal execution runs quickly too.
| For this reason, most parallel GC'd systems tend to focus on having
| good single-threaded concurrent collectors rather than
| 'stop-the-world' parallel collectors.

I implemented a parallelized mark-sweep collector for DSI, a Lisp-like
VM, that we are were using on a 20-node BBN Butterfly (anyone recall
these?  interesting machine) which was a distributed shared memory
architecture (NUMA). It used *three* phases to collect: mark-traverse,
compact, and update references.  There was a collector thread running on
each node (it was stop the world), and the processors would barrier
synchronize between phases, but otherwise ran concurrently.

The compact phase ran full bore with complete locality, since each node
compacted only its own memory space.

The update phase had "reasonable" locality; each node scanned its own
space, but would occasionally need to read from other spaces to
locate non-local forwarding addresses.

The mark phase could be all over the board, depending on the amount of
nonlocal references in each space. We tried to keep this low by
allocating as much locally as possible (stacks, etc.), but at the same
time trying to avoid hot spots by distributing general list allocation

We seemed to do reasonably well, parallel-wise, looking at traces of
collector activity, but I was uncomfortable with the sheer heuristic
nature of allocation decisions and how little we knew about user program
behavior. I suspect that this is the basic problem that you are
referring to above. I'm just as uncomfortable forcing the programmer to
make those allocation decisions.
I guess allocation profiling might be one way to go. Anybody have any
experience with this?


Eric Jeschke                      |       Visiting Research Associate
jeschke@cs.indiana.edu            |          Indiana University   
                                  |       Computer Science Department