[gclist] ratio of heap size to live data?
Nick Barnes
Nick.Barnes@pobox.com
Thu, 05 Oct 2000 10:06:11 +0100
At 2000-10-04 16:57:33+0000, "Boehm, Hans" writes:
>
>
> > -----Original Message-----
> > From: Nick Barnes [mailto:Nick.Barnes@pobox.com]
> > Sent: Wednesday, October 04, 2000 2:52 AM
> > To: Norman Ramsey
> > Cc: gclist@iecc.com
> > Subject: Re: [gclist] ratio of heap size to live data?
> >
> >
> > At 2000-10-03 21:47:09+0000, Norman Ramsey writes:
> >
> > > I'm looking for any research that would tell me about what ratio
> > > of heap size to live data is needed for a modern copying or
> > > mark-sweep
> > > collector to perform well. Can anyone point me to papers?
> >
> > This rather depends on what you mean by "live data" (and, for that
> > matter, "heap size"). The heap H consists of:
> >
> > - live objects L (i.e. data which will be used again);
> >
> > - dead but reachable objects R (i.e. data which can be reached by
> > chasing pointers from the roots);
> >
> > - unreachable objects U (i.e. objects which could be collected by a
> > GC); this may be divided into "strongly unreachable" (S) and "weakly
> > unreachable" (W), where W is treated as reachable by the collector
> > (due to the approximations which it makes);
> >
> > - allocation overhead A (object headers, padding, etc);
> >
> > - collection overhead C (collector data structures, to-space, etc).
> >
> ...
> > So the question is, what ratio interests you? The most interesting
> > one is probably (L+R)/H, but to obtain this you need to know L+R, for
> > which you need to do an accurate full GC, which may not be possible.
> > In a conservative collector this is very hard (multiple runs with
> > different base addresses, etc).
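To make that bookkeeping concrete, here's a toy C sketch (entirely
illustrative: the struct, the numbers, and especially the L/R split are
invented, since separating L from R needs an oracle):

  /* Illustrative only: the numbers are made up and the L/R split is
     arbitrary (you can't separate L from R without an oracle).
     Units are MB. */
  #include <stdio.h>

  struct heap_breakdown {
      double live;            /* L: will be used again              */
      double dead_reachable;  /* R: reachable, but never used again */
      double unreachable;     /* U: collectable garbage (S + W)     */
      double alloc_overhead;  /* A: object headers, padding, ...    */
      double gc_overhead;     /* C: collector structures, to-space  */
  };

  static double heap_size(const struct heap_breakdown *h)
  {
      return h->live + h->dead_reachable + h->unreachable
           + h->alloc_overhead + h->gc_overhead;
  }

  int main(void)
  {
      /* Roughly MLWorks-like figures: L+R about 25 MB, U about 15 MB,
         A and C small between collections. */
      struct heap_breakdown h = { 15.0, 10.0, 15.0, 1.0, 0.5 };
      double reachable = h.live + h.dead_reachable;

      printf("H = %.1f MB, (L+R)/H = %.2f\n",
             heap_size(&h), reachable / heap_size(&h));
      return 0;
  }

With those made-up numbers, H is 41.5 MB and (L+R)/H is about 0.6, i.e.
the heap is roughly 65% larger than the reachable set, which is in the
range I give for MLWorks below.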
> For a conservative collector running with C/C++ code, it's also sometimes
> possible to get a lower bound on L+R, since many of the benchmarks used in
> the past were written for explicit deallocation, and some of those don't
> leak. Thus you can count the amount of live memory in the explicitly
> managed version of the program. I remember that Zhong Shao tried this on
> some of the Zorn benchmarks, and ended up with very small bounds on W in
> most but not all cases. (The bound may be large either because W is large,
> or because the explicitly managed program deallocates memory without
> clearing the pointers to it.)
> >
> > As I recall, in MLWorks (generational exact copying collector), when
> > L+R is 20-30MB, U is typically 10-20MB (mainly garbage building up in
> > older generations). A was small. C was very small except during a
> > collection of an older generation, when it could reach 16MB. So the
> > total heap most of the time would be 50-75% larger than the reachable
> > set.
> It would be nice to get some more numbers along these lines. For our
> collector U is normally kept at 33-50% for L+R on the order of 10MB, i.e.
> large enough to swamp additive constants. Fragmentation overhead (part of
> A?) is typically small, but has certainly been reported to reach 50% in some
> cases, and can theoretically be much worse. (It also seems to be possible
> to trade off fragmentation overhead in a nonmoving collector against
> collection frequency. If you collect only when the heap is full, the
> entire heap is broken up into smaller blocks during each collection cycle,
> violating Phong Vo's wilderness preservation principle. In this way,
> nonmoving collector fragmentation is different from malloc/free fragmentation.)
>
> My impression is that for copying collectors the to-space component of C is
> often an issue, though probably less so for ML. Even if the to-space for
> the older generation is only used rarely, it still needs to be resident at
> that point. That means it either needs to remain resident, or it needs to
> be paged in and out. Paging in (and out) something on the order of (L+R) is
> often a nontrivial operation.
I'm not sure I follow you here. Let's look at a specific example for
MLWorks. Generation 0 is the nursery, collected many times per CPU
second and promoting maybe 1 MB per CPU second into generation 1.
Generation 1 gets collected maybe every ten CPU seconds, promoting
maybe 2 MB per CPU minute into generation 2, so perhaps generation 2
gets collected every five CPU minutes. (These numbers are all very
approximate; the actual collection policy, including scheduling and
tenuring, is adaptive, depending on survival rates, and tries to
minimise the total cost of memory management.)
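As a quick sanity check on those rates (nothing here beyond the
approximate figures above):

  /* Back-of-the-envelope only, using the rough rates quoted above. */
  #include <stdio.h>

  int main(void)
  {
      double gen1_fill = 1.0;        /* MB/s promoted into gen 1        */
      double gen1_gap  = 10.0;       /* seconds between gen 1 collections */
      double gen2_fill = 2.0 / 60.0; /* MB/s promoted into gen 2        */
      double gen2_gap  = 5.0 * 60.0; /* seconds between gen 2 collections */

      printf("gen 1 accumulates about %.0f MB per cycle\n",
             gen1_fill * gen1_gap);
      printf("gen 2 accumulates about %.0f MB per cycle\n",
             gen2_fill * gen2_gap);
      return 0;
  }

So each generation accumulates on the order of 10 MB between its
collections, roughly consistent with the 12 MB sitting in generation 2
in the example below.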
Look at a generation 2 collection. Plucking numbers from the air:
suppose generation 0 is empty, generation 1 has 4 MB, generation 2 has
12 MB, and maybe there's 10 MB in higher generations, so L+R+U is 26
MB. If the policy is working well, maybe 6 MB of generation 2 will
survive this collection (so the other 6 MB is garbage: S = 6 MB for
this collection).
The generation 2 collection proceeds like this:
- map 12 MB of to-space from /dev/zero. Our VM footprint is now 38
MB. Note that the operating system doesn't actually make anything
resident at this point; the mapping is on-demand. Even if everything
else is resident, we still have a resident set size of 26 MB.
- collect generation 2 into to-space. As we collect, the OS provides
6 MB of zero-filled pages for us on demand. These pages aren't
being paged in, they are being manufactured on the fly. Shame about
the zero-filling, but that's another thread. If real memory is
tight then other pages (hopefully from other processes :-) may be
paged out to provide the page frames.
- At the end of the collection, just before we start unmapping, the
resident set size is 32 MB and C = 6 MB. Our VM footprint is still
38 MB.
- unmap the remainder of to-space. The only effect this has is to
trim 6 MB from our VM footprint (now 32 MB). This memory was never
actually resident, and cost us next to nothing.
- unmap the generation 2 from-space. This doesn't get paged out, it
just gets discarded. Now the resident set size is 20 MB (L+R+U),
and so is the VM footprint, and C is zero again.
So during this whole collection (which takes epsilon longer than
scanning 10 MB), C rises from 0 to 6 MB and back again, but it's never
12 MB, and the only paging is (maybe) 6 MB worth of paging out to free
the frames for our 6 MB worth of /dev/zero.
Does this make sense?
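In case it helps, here's a minimal sketch of that to-space lifecycle
(not MLWorks code, obviously; just the mmap/munmap pattern, with the
sizes from the example above):

  /* Sketch only: shows the mmap/munmap pattern described above, with
     the example's sizes.  It is not MLWorks code. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define MB (1024u * 1024u)

  int main(void)
  {
      size_t tospace_size  = 12 * MB; /* size of generation 2 (to-space) */
      size_t survivor_size = 6 * MB;  /* what actually gets copied       */

      /* "map 12 MB of to-space from /dev/zero": grows the VM footprint
         by 12 MB but makes nothing resident; pages appear on first
         touch. */
      int zfd = open("/dev/zero", O_RDWR);
      if (zfd < 0) return 1;
      char *tospace = mmap(NULL, tospace_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE, zfd, 0);
      close(zfd);
      if (tospace == MAP_FAILED) return 1;

      /* "collect generation 2 into to-space": copying survivors touches
         only 6 MB of to-space, so the OS manufactures just 6 MB of
         zero-filled pages; the other 6 MB stays non-resident. */
      memset(tospace, 0x42, survivor_size);  /* stands in for the copy */

      /* "unmap the remainder of to-space": trims 6 MB of never-resident
         address space from the VM footprint, at next to no cost. */
      munmap(tospace + survivor_size, tospace_size - survivor_size);

      /* "unmap the generation 2 from-space" would happen here in the
         real collector: those pages are discarded, not paged out. */

      printf("new generation 2 occupies %zu MB\n", survivor_size / MB);
      munmap(tospace, survivor_size);
      return 0;
  }

The untouched half of to-space only ever costs address space, never
page frames, which is why C peaks at 6 MB rather than 12 MB.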
> It's possible to make this even less frequent by increasing the size of
> younger generations. For slightly different reasons, this seems to be
> common practice for HotSpot. But then you may end up with a significantly
> larger value of U, and you reduce the possibility of fitting the youngest
> generation into cache.
Indeed.
Nick B
--
FreeBSD 2.2.8-RELEASE: up 39 days, 13:24
last reboot Sat Aug 26 21:11 (lightning strike)