[gclist] GC for multi-threaded and multi-processor systems?

Hans Boehm boehm@hoh.mti.sgi.com
Wed, 8 Jul 1998 10:30:26 -0700


On Jul 8,  2:33pm, David Bruce wrote:
> The documentation for the BDW collector (v4.11) explains how to
> configure it for Solaris threads (we're using Suns, so that's OK), which
> would make it MT-safe, but I'm concerned that this might not be
> sufficient for a truly parallel system: alloc.c/solaris_threads.c
> suggest that it stops the world (all the LWPs in the process) to do its
> collection, thereby forming a potentially serious parallel bottleneck,
> but how severe is this in practice?  (Or have I misunderstood how it
> works?)  Does the BDW collector have a truly parallel version (and if
> not, does its incremental mode help?), or can anyone recommend another
> garbage collector that does?  Finally, are there any nasty gotchas I
> should know about with any of the above?
>
Unfortunately, I conjecture that the answer to all questions of the form
"Is X adequate for a truly parallel system?" is "It depends."  This one is no
exception.

If you are planning on taking advantage of something on the order of 2 to 4
processors, and your application doesn't spend too much time in the allocator,
it may work adequately as it is.

If you need more processors/runnable threads and/or they spend a lot of time
allocating, you are likely to run into problems.  Probably the most serious
bottleneck is the fact that there is a global lock that is acquired each time
you allocate.  That can easily result in contention.  GC_malloc_many, which
returns a linked list of several free objects with a single lock acquiaition,
possibly combined with thread-local storage, can often be used to get around
this one.  (Note that this problem is typically worse with malloc/free, since
you need two lock acquisitions per object roundtrip, instead of one + one per
GC.)

The collector will have to temporarily stop all other threads.  That's
basically unavoidable, and not always a big deal.  A better way to look at it
is that you would like to continue to utilize all processors during a
collection, either by forking new GC threads, or by having the existing threads
perform GC work while they're waiting.  The current collector doesn't do that.
 I believe others have modified it to do that, but I don't believe the
modifications are freely available.

I would guess that if pause time is not an issue, the nonincremental collector
will outperform the incremental one, even in this setting.  That's probably
even more true for the default Solaris incremental mode, which uses a very
expensive /proc call with the world stopped.  (At least it was very expensive
last I checked.  If you decisde to experiment, you might want to try
configuring MPROTECT_VDB instead of PROC_VDB.)

There was another version of the collector that allowed more mutator/collector
concurrency.  That helps, but only by a limited amount.  Eventually the GC
thread still becomes the bottleneck.  Sooner or later you need to parallelize
the GC itself.  Once you do, Mutator/collector concurrency no longer buys you
much.

Hans

-- 
Hans-J. Boehm
boehm@sgi.com