[gclist] ratio of heap size to live data?

Nick Barnes Nick.Barnes@pobox.com
Wed, 04 Oct 2000 10:52:10 +0100


At 2000-10-03 21:47:09+0000, Norman Ramsey writes:

> I'm looking for any research that would tell me about what ratio
> of heap size to live data is needed for a modern copying or mark-sweep
> collector to perform well.  Can anyone point me to papers?

This rather depends on what you mean by "live data" (and, for that
matter, "heap size").  The heap H consists of:

- live objects L (i.e. data which will be used again);

- dead but reachable objects R (i.e. data which can be reached by
  chasing pointers from the roots);

- unreachable objects U (i.e. objects which could be collected by a
  GC); this may be divided into "strongly unreachable" (S) and "weakly
  unreachable" (W), where W is treated as reachable by the collector
  (due to the approximations which it makes).

- allocation overhead A (object headers, padding, etc);

- collection overhead C (collector data structures, to-space, etc).

L is hard to measure (halting problem).  L+R+W is easy to measure (do
a full GC).  It should be trivial to instrument your collector to tell
you A, C, and H.

There's also F, the footprint in virtual memory, which is not as
interesting as H.  Backing store which you don't ever touch is very
cheap indeed.  This is particularly relevant to a copying collector,
in which to-space is part of F but only a small fraction of it is part
of H.

So the question is, what ratio interests you?  The most interesting
one is probably (L+R)/H, but to obtain this you need to know L+R, for
which you need to do an accurate full GC, which may not be possible.
In a conservative collector this is very hard (multiple runs with
different base addresses, etc).  In an exact collector it may be
doable (although hard because a full GC in a generational collector
involves collecting all generations together).

As I recall, in MLWorks (generational exact copying collector), when
L+R is 20-30MB, U is typically 10-20MB (mainly garbage building up in
older generations).  A was small.  C was very small except during a
collection of an older generation, when it could reach 16MB.  So the
total heap most of the time would be 50-75% larger than the reachable
set.

Nick B

--
FreeBSD 2.2.8-RELEASE: up 38 days, 14:10
last reboot Sat Aug 26 21:11 (lightning strike)