[gclist] why malloc/free instead of GC?

Igor Pechtchanski pechtcha@cs.nyu.edu
Tue, 18 Feb 2003 12:28:51 -0500 (EST)


On Tue, 18 Feb 2003, Charles Fiterman wrote:

> At 03:49 PM 2/18/2003 +0100, Basile STARYNKEVITCH wrote:
> > >>>>> "Charles" == Charles Fiterman <cef@geodesic.com> writes:
> >
> >     Charles> Consider a large online application with the following
> >     Charles> common requirement.  90% of all requests will be filled
> >     Charles> in one second. All requests will be filled in ten
> >     Charles> seconds.
> >
> >There are two meanings of large here:
> >
> >     1. application with a big memory requirement at runtime
> >
> >     2. application with a big amount of code
>
> Both.
>
> >Since copying a hundred megabytes per second is realistic on today's
> >machines, I would believe that a full major garbage collection of a
> >gigabyte heap (which for me is a big heap) should require less than 10
> >seconds.
>
> The commercial world is approaching 10 gigabyte heaps. This means trouble.
> Programmers in such environments are starting to manage their own heaps to
> avoid garbage collection. This only makes storage requirements expand even
> faster.

I would think that there are three major classes of applications with huge
heap sizes:
1) those where the large heap size comes not from the amount of data
handled in any one transaction, but rather from the number of concurrent
transactions.  Since each transaction is (usually) a separate entity,
approaches like region-based GC or transaction-specific heaps might work
well there (see the per-transaction heap sketch after this list).
2) those actually handling massive amounts of data for each transaction
(such as search engines).  Such applications mostly do not have a lot of
simultaneous live data, just a large data stream.  Since the cost of,
say, a copying GC is proportional to the amount of live data rather than
to the total heap size, the large stream by itself shouldn't hurt
collection times (the copy-loop sketch further below shows why).
3) those with a lot of shared data.  This data would probably be
long-lived, and, once promoted to the oldest generation, rarely collected
anyway.  And there are techniques, like pretenuring, that allow the
relevant data to be promoted sooner.
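
To make (1) a bit more concrete, here is a rough sketch (in C) of what
I mean by a per-transaction heap.  All the names and sizes here (struct
arena, ARENA_CHUNK, handle_request) are made up for illustration, not
taken from any existing allocator:

#include <stdlib.h>
#include <string.h>

#define ARENA_CHUNK (64 * 1024)        /* arbitrary chunk size */

struct chunk {
    struct chunk *next;
    size_t used;
    char data[ARENA_CHUNK];
};

struct arena {
    struct chunk *head;
};

static void *arena_alloc(struct arena *a, size_t n)
{
    void *p;

    n = (n + 7) & ~(size_t)7;          /* keep allocations aligned */
    if (n > ARENA_CHUNK)
        return NULL;                   /* oversized requests not handled */
    if (a->head == NULL || a->head->used + n > ARENA_CHUNK) {
        struct chunk *c = malloc(sizeof *c);
        if (c == NULL)
            return NULL;
        c->next = a->head;
        c->used = 0;
        a->head = c;
    }
    p = a->head->data + a->head->used;
    a->head->used += n;
    return p;
}

static void arena_free_all(struct arena *a)
{
    struct chunk *c = a->head, *next;

    while (c != NULL) {
        next = c->next;
        free(c);
        c = next;
    }
    a->head = NULL;
}

/* One arena per request: everything the transaction allocates dies
   with it, so none of it is ever traced by a collector. */
static void handle_request(void)
{
    struct arena a = { NULL };
    char *buf = arena_alloc(&a, 256);

    if (buf != NULL)
        strcpy(buf, "per-transaction scratch data");
    /* ... build whatever the request needs from the arena ... */
    arena_free_all(&a);
}

The transaction's garbage never becomes the collector's problem; it is
reclaimed wholesale when the transaction finishes, which is roughly what
the region-based schemes try to automate.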

The above are all amenable to existing non-reference-counting GC
techniques, albeit with some tuning.  It'd be interesting to know if
there is a fourth class of applications that actually maintain a large
amount of simultaneously live, short-lived data throughout the
execution.
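
As for (2), the reason a copying collector's cost tracks live data is
easiest to see from the copy loop itself.  Below is a very stripped-down
Cheney-style version; the object layout (struct obj with its field[]
array and forward pointer) is invented for the example, and the
semispace setup and flip are omitted:

#include <stddef.h>
#include <string.h>

struct obj {
    size_t size;            /* total size in bytes, header included */
    size_t nfields;         /* how many leading fields are pointers */
    struct obj *forward;    /* set once the object has been copied  */
    struct obj *field[1];   /* pointer fields; raw data follows     */
};

static char *to_free;       /* next free byte in to-space           */

static struct obj *copy(struct obj *o)
{
    if (o == NULL)
        return NULL;
    if (o->forward == NULL) {              /* first visit: copy it */
        struct obj *n = (struct obj *)to_free;
        memcpy(n, o, o->size);             /* sizes assumed aligned */
        to_free += o->size;
        o->forward = n;
    }
    return o->forward;
}

/* roots[] stands in for the program's stacks and globals. */
static void collect(struct obj **roots, size_t nroots)
{
    char *scan = to_free;   /* to-space empty before roots are copied */
    size_t i;

    for (i = 0; i < nroots; i++)
        roots[i] = copy(roots[i]);

    while (scan < to_free) {            /* scan objects already copied */
        struct obj *o = (struct obj *)scan;
        for (i = 0; i < o->nfields; i++)
            o->field[i] = copy(o->field[i]);
        scan += o->size;
    }
    /* Both loops visit only *live* objects; garbage left in from-space
       is never touched, so a mostly-dead heap collects cheaply. */
}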

> Languages gain power more from their restrictions than their capabilities.
> Functional languages gain referential transparency and composition from
> the loss of side effects. Type safe languages can be used in places where
> people fear viruses. Giving up circular data structures buys finalizers and
> large applications.

AFAICS, the only thing non-RC GC doesn't scale to is applications with a
large dynamic (i.e., fluid) working set (the fourth class above).  I'm not
at all sure any real applications fall into that category, although it
would be interesting to be proven wrong.
	Igor
-- 
				http://cs.nyu.edu/~pechtcha/
      |\      _,,,---,,_		pechtcha@cs.nyu.edu
ZZZzz /,`.-'`'    -.  ;-;;,_		igor@watson.ibm.com
     |,4-  ) )-,_. ,\ (  `'-'		Igor Pechtchanski
    '---''(_/--'  `-'\_) fL	a.k.a JaguaR-R-R-r-r-r-.-.-.  Meow!

Oh, boy, virtual memory! Now I'm gonna make myself a really *big* RAMdisk!
  -- /usr/games/fortune