[gclist] reference counting

Boehm, Hans hans_boehm@hp.com
Tue, 5 Sep 2000 11:19:21 -0700

> -----Original Message-----
> From: David Chase [mailto:chase@world.std.com]
> >(3) It's also not clear to me that pointer-bump allocation is
> >as huge a win as people think.  A good non-compacting organization
> >(as in the BDW collector) can allocate objects almost as fast.
> How's that go on a multiprocessor?  A compacting collector
> makes it relatively easy to carve your heap up into pieces,
> one per thread, and allocate from those lock-free (*).  I imagine
> you could do the same with your free lists (given that you're
> using 64 or fewer, it's not that much overhead) but I haven't
> hashed out all the details, and there surely are details.
This is done in the "6.0alpha" releases, though there's not enough mileage
on those to say exactly how well it works.

Larger objects are always allocated from global free lists.  Each thread
gets a small number of per-thread free lists for different small object

Small objects are also initially allocated from a global free list.  Each of
the local free list entries initially contains a count of how many objects
of that size were allocated.  When that count passes a threshold, objects of
that size are subsequently allocated through the local free list.  The local
free list is refilled by sweeping a section of the heap, something that can
be done with reasonable concurrency between threads.

The intent here is that whenever a thread does large amount of allocation
for a given object size, it should go through the local free list.  But we
may introduce additional fragmentation by carving up the heap in this way,
so we don't want to do it if a thread only allocates a single object of a
given size.

Unfortunately, one of the down sides of doing this in an environment in
which I don't control the APIs is that the obvious ways to find the thread
local free lists, in particular pthread_getspecific(), seem to be fairly
expensive.  I ended up reimplementing a subset of the
thread-specific-storage functionality outside the thread library to get
somewhat better performance.  If I could keep a pointer to the
thread-specific free list array in a register, or keep the free list headers
at a known offset from the thread pointer that's already in a register, this
problem would go away.