[gclist] ref-counting performance cost
Tue, 05 Sep 2000 10:43:07 +0100
At 2000-09-05 07:27:31+0000, Simon Peyton-Jones writes:
> There is a fourth cost, namely the effect on allocation cost.
The overall trade-off for of inline allocation are unclear to me. To
reach peak efficiency, you really have to have total control over the
ABI, which severely complicates and restricts your interoperability.
It is possible to get the best of both worlds, but it's very hard
I wrote, rewrote, and optimized the pips out of the memory management
code for MLWorks. I can say some things about this without breaching
my NDA, which in any case is somewhat moot these days.
MLWorks will allocate any object in three inline instructions on a
SPARC (it is also inline on a MIPS, but I recall we made it
out-of-line on x86, because of register pressure). One of the
instructions takes a trap if the nursery is full, and the handler
fixes things up and schedules a GC.
The GC is a multi-generational flexible copying collector: the number
of generations, the promotion policy, and the collection scheduling
are all automatically varied to suit the survival pattern of the
particular application. It's not incremental, because the extra
complexity wasn't worth it. Nursery collections took a small number
of milliseconds and ran five to ten times every second in typical
code. A two generation collection would take about fifty milliseconds
and run a couple of times per minute. A three generation collection
would take one or two hundred milliseconds and run every ten minutes
or so. And so on. The collector as a whole was rarely more intrusive
than paging or other OS delays.
MLWorks comes with its own thread system, and the single 1MB nursery
generation is shared between threads. I never got around to porting
this to a native threading system, although I had a plan for doing so
along the same general lines as the SML/NJ threads system (in which
each native thread has its own nursery, and collection is by global
The scan/copy loop is in tightly bummed assembly, with each control
flow branch carefully analyzed for performance on a number of sample
application. I could once have told you the exact number of
instructions, loads, stores, branches, and likely cache behaviour for
scanning any given pointer type.
The overall cost of allocating and freeing each object, including the
amortized cost of the next minor GC (which would be performed slightly
sooner as a result of the allocation), was around seven clock cycles
on the SparcStation I had on my desk (I seem to recall this was a
10/40). That's averaged over our usual test, which was MLWorks
compiling itself: an operation which took ten to twenty minutes and
allocated around 1.5 billion objects totalling about 20 GB (apologies
for some inaccuracies here, this was several years ago).
The fastest out-of-line allocator would have added around 20 cycles to
the per-object cost, which would have made MLWorks as a whole
something like 75% slower. Inline allocation is really a big win for
On the other hand, we took total control over the ABI. Switching
between ML and C can take several hundred cycles (including a system
trap to clean register windows on SPARC). Debugging the ML/C
interface was a major challenge. The collector is exact, so we had to
maintain tight control over the form and content of stack frames.
Talking about ML heap objects in C is a tricky business, which tripped
us up a number of times. The allocator uses two global registers
which were unavailable for other use. We tickled a number of OS bugs
on Solaris. Threading systems were a headache. And we had to have
first refusal on the integer overflow signal handler, which meant we
couldn't necessarily interoperate with anything else that used that
At the end of the day, I'm not sure it was worth it. Most of the pain
was due to the exactness of the collector; maybe a mostly-copying
inexact collector would have been better.
FreeBSD 2.2.8-RELEASE: up 9 days, 13:43
last reboot Sat Aug 26 21:11 (lightning strike)