[gclist] Fast-allocating non-copying GC's
David Chase
chase@centerline.com
Thu, 14 Mar 96 10:45:45 EST
> Could you please be more specific about "customizing allocator" ?
> It all sounds interesting, but I am afraid I miss couple of details...
The idea behind it is that you build an allocator that starts with,
say, about 8 free lists, plus a big block list. The free lists are
for buckets (bucket = object + overhead) of size 8, 16, 32, 64, 128,
512, 1024, and 2048. Anything bigger is a "page" or collection of
pages.
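For concreteness, here is a minimal sketch of that layout in C, using exactly the bucket sizes named above; the names and the linear-scan lookup are illustrative, not the actual implementation.

```c
#include <stddef.h>

/* The bucket sizes quoted in the post; anything bigger is a "page". */
#define NUM_LISTS 8
static const size_t bucket_sizes[NUM_LISTS] =
    { 8, 16, 32, 64, 128, 512, 1024, 2048 };

/* Each free bucket holds a link to the next free bucket of its size. */
struct free_node { struct free_node *next; };
static struct free_node *free_lists[NUM_LISTS];

/* Map a request to the smallest bucket that holds it (linear scan,
   kept simple for illustration).  Returns -1 for "page" allocations. */
static int bucket_index(size_t n) {
    for (int i = 0; i < NUM_LISTS; i++)
        if (n <= bucket_sizes[i]) return i;
    return -1;
}
```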
Right away, you can see that there are possible internal fragmentation
problems -- if I request a large number of objects of size 520, each
lands in a 1024-byte bucket and I waste 504 bytes per allocation. I
cannot wave this away with uniformity
assumptions, either, because we know that the distribution of allocation
sizes is almost certainly not uniform. The "solution" is to monitor
this waste and, at some suitable time, create a new list for the
smaller-sized objects. With the page size chosen here, the best
choice is 4096/(4096/520) = 585 (integer division), rounded down to a
multiple of 8, giving 584 bytes. Seven allocations
will fit in one page, instead of only four. That's a big win. Note
the goal of always choosing the largest possible free list size --
given that these are always allocated out of pages, there's no sense
choosing a smaller one.
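That arithmetic can be written down directly; this is a sketch, and PAGE_SIZE and the function name are my own labels:

```c
#include <stddef.h>

#define PAGE_SIZE 4096

/* Largest size, rounded down to a multiple of 8, that still fits the
   same number of objects per page as 'actual_size' does.  Both
   divisions are integer divisions. */
static size_t refined_bucket(size_t actual_size) {
    size_t per_page = PAGE_SIZE / actual_size;   /* e.g. 4096/520 = 7  */
    return (PAGE_SIZE / per_page) & ~(size_t)7;  /* 4096/7 = 585 -> 584 */
}
```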
An additional heuristic that you build into one of these is the
"not-worth-the-trouble" heuristic. What that means is that there
are gains that may not be worth pursuing, because they aren't
that large, and they only increase the risk of pages being assigned
to an object size that isn't really being allocated. For instance,
you can fit 32 128-byte objects into a 4096 byte page. It is probably
not worth it to create a special free list for 120-byte objects,
because you could only allocate one more of them per page (that is,
you're looking at a 3% savings. Big Deal).
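One way to sketch that heuristic in C; the 10%-more-objects-per-page threshold is my assumption -- the post only says that small gains are skipped -- and per-object headers are ignored here:

```c
#include <stddef.h>

#define PAGE_SIZE 4096

/* Skip the new list unless it packs meaningfully more objects into a
   page.  The 10% threshold is an assumed cutoff, and per-object
   headers are ignored, so the exact counts differ from the post's. */
static int worth_new_list(size_t candidate, size_t current_bucket) {
    size_t n_new = PAGE_SIZE / candidate;       /* objects/page with new list */
    size_t n_old = PAGE_SIZE / current_bucket;  /* objects/page today */
    return (n_new - n_old) * 10 >= n_old;
}
```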
Now, as to the specific allocator I recall building: I combine "monitor"
and "sample" into the action of "I need a new page for freelist X".
Each time I get a new page, I record the "actual size" (request,
rounded-up to a multiple of 8, plus headers, if any) at the time
of the request. I also check the "actual size" from the last time
I needed a page for this list. I take the max of those two numbers,
do the arithmetic to figure out the allocation size I could be using,
and if it is worthwhile, create that list. I realize that this sounds
phenomenally dumb, but it worked well enough (on the "industry-standard
benchmarks" of interest, as well as all the random little malloc-intensive
programs I had at hand) for me to quit looking. I measured success
against the time taken by the Kingsley allocator (fast, but big)
and the space used by the cartesian tree allocator (small, but slow)
supplied with the version of SunOS I was using at the time. (Note
that whatever allocator was chosen had to be fast for the benchmark,
but since most customers (unlike benchmarkers) did not use machines
stuffed with the maximum amount of memory, it would be helpful if
it was also not wasteful of memory). This allocator had a couple
of other useful frobs added to it -- information about "big pages"
was stored in a more compact table, even after they were freed. This
had a beneficial effect on the paging behavior of modules that were
"good citizens" by trying to free all their memory pools upon module
shutdown (the pool data structures also have to avoid links through
the bulk storage, but that is a SMOP).
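Putting the refill-time sampling together, here is a sketch under the same assumptions; the struct fields, the create_list stub, and the worth-it threshold are all illustrative, not details from the post:

```c
#include <stddef.h>

#define PAGE_SIZE 4096
#define ROUND8(n)  (((n) + 7) & ~(size_t)7)

/* In a real allocator this would build a new free list; here it just
   records the size so the effect is observable. */
static size_t created_size;
static void create_list(size_t bucket) { created_size = bucket; }

/* Assumed cutoff: at least ~10% more objects per page. */
static int worth_new_list(size_t candidate, size_t current_bucket) {
    size_t n_new = PAGE_SIZE / candidate, n_old = PAGE_SIZE / current_bucket;
    return (n_new - n_old) * 10 >= n_old;
}

struct size_list {
    size_t bucket;       /* bucket size this list hands out */
    size_t last_actual;  /* "actual size" at the previous refill (0 = none) */
};

/* Called when the list runs dry and needs a fresh page; 'request' is
   the allocation that triggered the refill. */
static void on_page_refill(struct size_list *list, size_t request) {
    size_t actual = ROUND8(request);  /* plus headers, if any */
    size_t hi = (actual > list->last_actual) ? actual : list->last_actual;
    size_t candidate = (PAGE_SIZE / (PAGE_SIZE / hi)) & ~(size_t)7;
    if (candidate < list->bucket && worth_new_list(candidate, list->bucket))
        create_list(candidate);
    list->last_actual = actual;
}
```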
You can probably imagine other sampling methods that aren't too expensive,
and would keep better information. For instance, for each size,
you could keep a counter of the number of times you requested that
size since the last list refill. This wouldn't hurt your allocation
time too much. There are also some silly variations on the accounting
-- for instance, if you have an exceptionally high allocation rate
for 520-byte objects, but you also free them immediately (so that
there's only one of them allocated at any given time) then it doesn't
matter that they're landing in a 1024-byte bucket. HOWEVER, I was
just talking about malloc-free; this will work differently for a
garbage collector, because it will allocate some number of them before
realizing that only one is active.
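The per-size counter variation might look like this; the slot granularity and table size are assumptions for illustration:

```c
#include <stddef.h>

#define MAX_SIZES 512

/* Hypothetical alternative: count requests per 8-byte-granular size
   class since the last refill, instead of sampling only at refill time. */
static unsigned request_counts[MAX_SIZES];

static void note_request(size_t n) {
    size_t slot = (n + 7) / 8;
    if (slot < MAX_SIZES) request_counts[slot]++;
}

/* At refill time, pick the most-requested size and reset the counters. */
static size_t popular_size(void) {
    size_t best = 0;
    for (size_t i = 1; i < MAX_SIZES; i++)
        if (request_counts[i] > request_counts[best]) best = i;
    for (size_t i = 0; i < MAX_SIZES; i++) request_counts[i] = 0;
    return best * 8;
}
```

Note that, as the post says, a raw request count still tells you nothing about how many are live at once; that is where the GC case differs from malloc-free.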
-----
On a different note, I did think of one major difference between
Smalltalk systems (as I understand them) and Modula-3 systems. At
the moment, most M-3 programs start, run for a bit, and then stop,
and their address spaces go away. Thus, the variation in their
allocation styles is limited to what is found in one program (variation
in their allocation styles, meaning the change in the "favorite object
size" over time). A Smalltalk system runs, and runs, and runs, and
may run different programs, and thus, though any single program has
an allocation pattern well-suited to multiple free lists, the sequence
of allocations, over time (run the programs one after another, not
all together at once) is not as well suited to multiple free lists.
Furthermore, this is also different from long-running programs like
file servers, because those are, after all, just one program.
Does this ring true, to those with more experience in Smalltalk
implementation?
David Chase