[gclist] ratio of heap size to live data?

Nick Barnes Nick.Barnes@pobox.com
Thu, 05 Oct 2000 10:06:11 +0100


At 2000-10-04 16:57:33+0000, "Boehm, Hans" writes:
> 
> 
> > -----Original Message-----
> > From: Nick Barnes [mailto:Nick.Barnes@pobox.com]
> > Sent: Wednesday, October 04, 2000 2:52 AM
> > To: Norman Ramsey
> > Cc: gclist@iecc.com
> > Subject: Re: [gclist] ratio of heap size to live data? 
> > 
> > 
> > At 2000-10-03 21:47:09+0000, Norman Ramsey writes:
> > 
> > > I'm looking for any research that would tell me about what ratio
> > > of heap size to live data is needed for a modern copying or 
> > mark-sweep
> > > collector to perform well.  Can anyone point me to papers?
> > 
> > This rather depends on what you mean by "live data" (and, for that
> > matter, "heap size").  The heap H consists of:
> > 
> > - live objects L (i.e. data which will be used again);
> > 
> > - dead but reachable objects R (i.e. data which can be reached by
> >   chasing pointers from the roots);
> > 
> > - unreachable objects U (i.e. objects which could be collected by a
> >   GC); this may be divided into "strongly unreachable" (S) and "weakly
> >   unreachable" (W), where W is treated as reachable by the collector
> >   (due to the approximations which it makes).
> > 
> > - allocation overhead A (object headers, padding, etc);
> > 
> > - collection overhead C (collector data structures, to-space, etc).
> > 
> ...
> > So the question is, what ratio interests you?  The most interesting
> > one is probably (L+R)/H, but to obtain this you need to know L+R, for
> > which you need to do an accurate full GC, which may not be possible.
> > In a conservative collector this is very hard (multiple runs with
> > different base addresses, etc).
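
To make the bookkeeping concrete, here's a throwaway C fragment that
just adds the components up and prints the ratio.  The struct, field
names, and numbers are purely illustrative (loosely in the ballpark of
the MLWorks figures below), not taken from any particular collector.

    #include <stdio.h>

    /* Illustrative only: heap components in MB, H = L + R + U + A + C. */
    struct heap {
        double L;   /* live: will be used again                */
        double R;   /* dead but still reachable                */
        double U;   /* unreachable (strongly + weakly, S + W)  */
        double A;   /* allocation overhead: headers, padding   */
        double C;   /* collection overhead: to-space, etc.     */
    };

    int main(void) {
        struct heap h = { 20.0, 5.0, 15.0, 1.0, 0.5 };
        double H = h.L + h.R + h.U + h.A + h.C;
        printf("H = %.1f MB, (L+R)/H = %.2f\n", H, (h.L + h.R) / H);
        return 0;
    }
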
> For a conservative collector running with C/C++ code, it's also sometimes
> possible to get a lower bound on L+R, since many of the benchmarks used in
> the past were written for explicit deallocation, and some of those don't
> leak.  Thus you can count the amount of live memory in the explicitly
> managed version of the program.  I remember that Zhong Shao tried this on
> some of the Zorn benchmarks, and ended up with very small bounds on W in
> most but not all cases.  (The bound may be large either because W is large,
> or because the explicitly managed program deallocates memory without
> clearing the pointers to it.)
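
For what it's worth, that measurement boils down to something like
wrapping malloc/free with a byte counter and taking its high-water
mark in the explicitly managed build.  A rough sketch (the
xmalloc/xfree names are made up, and alignment niceties are ignored):

    #include <stdlib.h>

    /* The peak of current_bytes is a lower bound on L+R, provided the
       explicitly managed program doesn't leak. */
    static size_t current_bytes = 0;
    static size_t peak_bytes = 0;

    void *xmalloc(size_t n) {
        size_t *p = malloc(n + sizeof(size_t));  /* header holds size */
        if (p == NULL) return NULL;
        *p = n;
        current_bytes += n;
        if (current_bytes > peak_bytes) peak_bytes = current_bytes;
        return p + 1;
    }

    void xfree(void *q) {
        if (q == NULL) return;
        size_t *p = (size_t *)q - 1;
        current_bytes -= *p;
        free(p);
    }
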
> > 
> > As I recall, in MLWorks (generational exact copying collector), when
> > L+R is 20-30MB, U is typically 10-20MB (mainly garbage building up in
> > older generations).  A was small.  C was very small except during a
> > collection of an older generation, when it could reach 16MB.  So the
> > total heap most of the time would be 50-75% larger than the reachable
> > set.
> It would be nice to get some more numbers along these lines.  For our
> collector U is normally kept at 33-50% for L+R on the order of 10MB, i.e.
> large enough to swamp additive constants.  Fragmentation overhead (part of
> A?) is typically small, but has certainly been reported to reach 50% in some
> cases, and can theoretically be much worse.  (It seems to also be possible
> to trade off fragmentation overhead in a nonmoving collector against
> collection frequency.  If you collect only with a full heap, the entire heap
> is broken up into smaller blocks during each collection cycle, violating
> Phong Vo's Wilderness preservation principle.  In this way, nonmoving
> collector fragmentation is different from malloc/free fragmentation.) 
> 
> My impression is that for copying collectors the to-space component of C is
> often an issue, though probably less so for ML.  Even if the to space for
> the older generation is only used rarely, it still needs to be resident at
> that point.  That means it either needs to remain resident, or it needs to
> be paged in and out.  Paging in (and out) something on the order of (L+R) is
> often a nontrivial operation.

I'm not sure I follow you here.  Let's look at a specific example for
MLWorks.  Generation 0 is the nursery, collected many times per CPU
second, putting maybe 1 MB per second into generation 1.  Generation 1
gets collected maybe every ten CPU seconds, putting maybe 2 MB per CPU
minute into generation 2, so perhaps generation 2 gets collected every
five CPU minutes.  (These numbers are all very approximate; the actual
collection policy, including scheduling and tenuring, is adaptive,
depending on survival rates, and tries to minimise the total cost of
memory management.)
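
Spelling that arithmetic out, with the same very approximate numbers
(this is just a back-of-the-envelope model, not the actual adaptive
policy):

    #include <stdio.h>

    int main(void) {
        double into_gen1 = 1.0;        /* MB per CPU second into gen 1 */
        double gen1_period = 10.0;     /* gen 1 collected every ~10 s  */
        double into_gen2 = 2.0 / 60.0; /* MB per CPU second into gen 2 */
        double gen2_period = 5 * 60.0; /* gen 2 collected every ~5 min */

        printf("allocated into gen 1 per gen 1 cycle: ~%.0f MB\n",
               into_gen1 * gen1_period);
        printf("promoted into gen 2 per gen 2 cycle:  ~%.0f MB\n",
               into_gen2 * gen2_period);
        return 0;
    }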

Look at a generation 2 collection.  Plucking numbers from the air:
suppose generation 0 is empty, generation 1 has 4 MB, generation 2 has
12 MB, and there's maybe 10 MB in higher generations.  So L+R+U is 26
MB.  If the policy is working well, maybe 6 MB will survive this
collection (i.e. 6 MB of the 12 MB in generation 2 gets copied).

The generation 2 collection proceeds like this:

- map 12 MB of to-space from /dev/zero.  Our VM footprint is now 38
  MB.  Note that the operating system doesn't actually make anything
  resident at this point; the mapping is filled on demand.  If
  everything else is resident, we still have a resident set size of
  26 MB.

- collect generation 2 into to-space.  As we collect, the OS provides
  6 MB of zero-filled pages for us on demand.  These pages aren't
  being paged in; they are being manufactured on the fly.  Shame about
  the zero-filling, but that's another thread.  If real memory is
  tight, then other pages (hopefully from other processes :-) may be
  paged out to provide the page frames.

- At the end of the collection, just before we start unmapping, the
  resident set size is 32 MB and C = 6 MB.  Our VM footprint is still
  38 MB.

- unmap the remainder of to-space.  The only effect this has is to
  trim 6 MB from our VM footprint (now 32 MB).  This memory was never
  actually resident, and cost us next to nothing.

- unmap the generation 2 from-space.  This doesn't get paged out, it
  just gets discarded.  Now the resident set size is 20 MB (L+R+U),
  and so is the VM footprint, and C is zero again.

So during this whole collection (which takes epsilon longer than
scanning 10 MB), C rises from 0 to 6 MB and back again, but it's never
12 MB, and the only paging is (maybe) 6 MB worth of paging out to free
the frames for our 6 MB worth of /dev/zero.
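
To make the mechanism concrete, here's roughly the system-call pattern
involved.  This is a minimal sketch, not MLWorks source: the function
and variable names are made up, the sizes are hard-wired for this
example, and error handling and page-rounding are glossed over.  (On
systems without MAP_ANON you map /dev/zero instead, as above.)

    #include <stddef.h>
    #include <sys/mman.h>

    #define MB (1024 * 1024)

    /* Sketch of the VM traffic in the generation 2 collection above. */
    void collect_gen2(void *from_space)
    {
        size_t gen2_size = 12 * MB;    /* from-space / to-space size  */
        size_t survivors =  6 * MB;    /* what actually gets copied   */

        /* Reserve to-space: address space only, nothing resident yet;
           zero-filled pages are manufactured on first touch. */
        char *to_space = mmap(NULL, gen2_size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANON, -1, 0);
        if (to_space == MAP_FAILED)
            return;                    /* error handling elided */

        /* ... copy the reachable generation 2 objects into to_space;
           only the ~6 MB of survivor pages ever become resident ... */

        /* Trim the unused tail of to-space: it was never resident, so
           this only shrinks the VM footprint. */
        munmap(to_space + survivors, gen2_size - survivors);

        /* Discard from-space: nothing is paged out, the pages are
           simply dropped. */
        munmap(from_space, gen2_size);
    }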

Does this make sense?

> It's possible to make this even less frequent by increasing the size of
> younger generations.  For slightly different reasons, this seems to be
> common practice for HotSpot.  But then you may end up with a significantly
> larger value of U, and you reduce the possibility of fitting the youngest
> generation into cache.

Indeed.

Nick B

--
FreeBSD 2.2.8-RELEASE: up 39 days, 13:24
last reboot Sat Aug 26 21:11 (lightning strike)