[gclist] When to garbage collect
Taura Kenjiro
tau@is.s.u-tokyo.ac.jp
Sat, 15 Nov 1997 00:07:15 +0900 (JST)
> > I would like to know whether there exist references to papers
> > dealing with when to really do a Mark-Sweep garbage collection.
> > There are many moments when the system can decide to garbage
> > collect. However, to reduce overhead, a rule is used to
> > determine whether it is worthwhile to go through all the memory.
> > A (particularly bad) rule that I encountered in some
> > program is that only if less than 5 percent of the declared
> > memory is free, (i.e. not used, and no garbage), then
> > the process collects the garbage.
>
> It is important to know what _bad_ means for you. If you want to
> minimize the total amount of time spent garbage collecting, it is best
> to do as few collections as possible and to do them as late as possible.
> So collect only when the available memory is exhausted.
>
That's not true. If the program has a large amount of live data at the
moment when the available memory is exhausted, invoking collection
earlier (hopefully when the program has a small live-data/heap-size
ratio) may cost less collection time overall.
As an illustrative example, suppose that a program uses no permanently
live data and repeatedly allocates two arrays of 500KB each, does
something useful with these arrays, and then discards them. An oracle
collector that triggers collection every time the two arrays have just
been discarded obtains 1MB for free at each collection, since there is
no live data to scan. If, on the other hand, the system has 1.5MB of
available memory and the collector is invoked only when that memory is
exhausted, the collector must scan 0.5MB of live data to obtain 1MB.
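To see the arithmetic concretely, here is a minimal Python sketch (the
workload, constants, and policy names are mine, purely for illustration)
counting the bytes scanned under each trigger policy:

    KB = 1024
    HEAP = 1536 * KB      # 1.5MB of available memory
    ARRAY = 500 * KB      # each array is 500KB
    CYCLES = 1000         # allocate and discard the pair 1000 times

    def oracle_policy():
        # Collect right after each pair is discarded: nothing is live,
        # so every collection reclaims 1MB at zero scan cost.
        return 0          # total bytes scanned

    def exhaustion_policy():
        scanned = used = live = 0
        for _ in range(CYCLES):
            for _ in range(2):            # allocate the two arrays
                if used + ARRAY > HEAP:   # memory exhausted: collect
                    scanned += live       # mark phase scans live data
                    used = live           # sweep reclaims the garbage
                used += ARRAY
                live += ARRAY
            live -= 2 * ARRAY             # both arrays are discarded
        return scanned

    print("oracle:    ", oracle_policy())      # 0 bytes scanned
    print("exhaustion:", exhaustion_policy())  # ~0.5MB scanned per cycle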
In general, assuming the program allocates the same total amount of
memory regardless of when collections happen, it spends less time on
collection when each collection frees more space for the same amount
of effort (i.e., scanning live data).
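To put a rough formula on it (my own back-of-the-envelope accounting,
not taken from any paper): if every collection scans L bytes of live
data in an H-byte heap, it frees H - L bytes, so a program that
allocates A bytes in total performs roughly A / (H - L) collections and
spends time proportional to

    A * L / (H - L)

on scanning. The cost per allocated byte, L / (H - L), is small exactly
when the live-data/heap-size ratio at collection time is small.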
Back to the original question: typical collectors I know of collect
when the program runs out of the current heap AND there have been
"enough allocations" since the last collection. They typically consider
the allocations "enough" when the program has allocated a constant
factor of the live data size since the last collection. For example, if
the last collection found 1MB of live data (i.e., the collector scanned
1MB) and the constant is two, the collector allows 2MB of allocation
before the next collection; if the heap fills up sooner, the heap is
expanded instead. This effectively guarantees a lower bound on
PROFIT/EFFORT without expanding the heap too much.
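In Python again, the rule might look roughly like this (a toy model of
the heuristic described above, not any particular collector's code):

    FACTOR = 2   # allow 2 bytes of allocation per byte of live data

    class Heap:
        """Toy model of the trigger rule; all sizes are in bytes."""
        def __init__(self, size):
            self.size = size                 # current heap size
            self.used = 0                    # live data plus garbage
            self.live = 0                    # reachable data only
            self.allocated_since_gc = 0
            self.last_live = size // FACTOR  # seed value before any GC

        def alloc(self, n):
            if self.used + n > self.size:
                # Collect only if enough has been allocated since the
                # last collection for this one to be profitable.
                if self.allocated_since_gc >= FACTOR * self.last_live:
                    self.collect()
                if self.used + n > self.size:
                    # Otherwise (or if collection freed too little),
                    # grow the heap rather than thrash the collector.
                    self.size = max(2 * self.size, self.used + n)
            self.used += n
            self.live += n
            self.allocated_since_gc += n

        def free(self, n):
            self.live -= n   # dropped by the program: garbage now

        def collect(self):
            # A real mark-sweep traces from the roots; here we simply
            # pretend the mark phase found exactly self.live bytes.
            self.last_live = self.live   # the EFFORT of this collection
            self.used = self.live        # the freed space is the PROFIT
            self.allocated_since_gc = 0

With FACTOR = 2, a collection that finds 1MB of live data licenses the
next one only after a further 2MB of allocation, as in the example
above.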
It is true that you could potentially do better, but then you would
have to predict the live-data/heap-size ratio during the course of the
computation, and also how that ratio will change in the future.