[gclist] Name that hypothesis

Paul R. Wilson wilson@cs.utexas.edu
Tue, 10 Dec 1996 08:55:29 -0600


>From majordom@iecc.com Fri Dec  6 07:18:30 1996
>To: gclist@iecc.com
>Subject: Modeling and randomness (was Re: Name that hypothesis)
>Date: Fri, 06 Dec 1996 13:07:21 +0000
>From: Nick Barnes <nickb@harlequin.co.uk>
>
>> Talking about half-lives can be very misleading, because programs
>> aren't stochastic processes.  (The allocator literature is a disaster
>> because people took the notion of stochastic processes and applied
>> it where it wasn't appropriate---see my allocator survey on my
>> web site if you're interested.)
>
>I've read and enjoyed your allocator papers, and agree with the
>substance of them. Stochastic models are overly simple. However, I
>don't think that using continuous random processes to model the
>behaviour of an unknown set of unknown mutators is, in itself, a bad
>thing.

Hmmm... I think it's a bit dangerous to say that continuous random
processes are an acceptable model of an unknown set of unknown anythings.
(I'm not sure we disagree, but I think the vocabulary can be misleading.)
If something's unknown, you can't model it with any real confidence,
and using "continuous random processes" is likely to introduce huge
biases.  Program behavior is often discontinuous and almost always
nonrandom (i.e., it's usually phased and strongly patterned).

What we found in studying allocators is that simple Markov-like models
as traditionally applied to allocation problems are simply wrong, and
deeply wrong.  About two thirds of the relevant behavior of programs is
lost when you do statistics about object lifetimes but fail to replicate
sequence and phase behavior.  (Lifetime distributions are useful for
getting a very rough idea of how much memory a program uses, across
most allocators, but they're much less useful for comparing the performance
of different allocators.  Unfortunately, the latter is usually what you
want, so you can pick a winner.)

For generational GC's, the pig-in-the-snake problem is a crucial
example of discontinuity and nonrandomness.  To capture this problem
in a continuous random model, it has to be a model of pigs (large
data structures' lifetimes), not a model of
object lifetimes per se.  Using traditional lifetime distributions misses
the point.

I can believe that you can come up with an interesting parameterized
model that models *phases* and nestings of phases using random
variables, and generate an interesting variety of program-like
behaviors.  In that case, you don't really have a model of object
lifetimes per se, but of phase behavior.  You also don't have a single
parameter, or even a pair of parameters.  You have a bunch of
parameters that represent program behavior at different timescales,
or something like that.  You may have to model the interactions
between phase structure at different timescales.  But then you're left
with the problem of trying to find combinations of parameters that generate
realistic large-scale behaviors, and to do that, you need real data and
sophisticated analyses.

If you model individual object lifetimes using traditional stochastic
models, the law of large numbers says that you're likely to lose most 
of the phase behavior that matters at any timescale above a few object
allocations, where "a few" is roughly the order of the model in
Markov terms.
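
To make that concrete, here's a toy sketch in Python (the phase shapes
and numbers are invented, not measured from anything): two "programs"
with exactly the same lifetime distribution, one phased and one with
the lifetimes drawn independently of their position in the trace.  The
live-memory profiles come out very different, and the difference is
exactly what a lifetime-only model smooths away.

    import random

    def live_counts(lifetimes):
        # lifetimes[t] = lifetime (in allocations) of the object born at time t
        n = len(lifetimes)
        live = [0] * n
        for t, life in enumerate(lifetimes):
            for u in range(t, min(t + life, n)):
                live[u] += 1
        return live

    # Phased trace: phases of 200 allocations, alternating between objects
    # that live to the end of their phase and objects that die almost at once.
    phased = []
    for phase in range(20):
        for i in range(200):
            phased.append(200 - i if phase % 2 == 0 else 2)

    # Same multiset of lifetimes with the order destroyed -- i.e., what an
    # i.i.d. draw from the lifetime distribution looks like.
    shuffled = phased[:]
    random.shuffle(shuffled)

    middle = slice(1000, 3000)   # ignore startup and shutdown effects
    for name, trace in (("phased", phased), ("shuffled", shuffled)):
        lc = live_counts(trace)[middle]
        print(name, "min live:", min(lc), "max live:", max(lc))
    # The phased trace swings between nearly empty and a full peak; the
    # shuffled one hovers around a bland average, even though both traces
    # have the same lifetime distribution.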

> When designing a garbage collector, one often wants to cater to
>a wide range of mutator behaviours. There are a few obvious ways of
>preparing for this:
>
>(1) trace and measure the mutators one has available,
>(2) construct benchmark mutators,
>(3) construct mathematical models of mutators.
>
>One would be a fool not to pursue (1), but the set of available
>large-scale mutators is often very small; it is easy to end up with a
>GC specialized to a couple of programs which then performs abysmally
>with another user's code.

Right.  But the only good solution is to gather more data, and to
have a realistic idea of what behaviors matter.  This is where
allocator researchers went wrong in the 60's, and it led to
decades of wasted effort.  (That's why I'm pretty vehement about
this.  The GC literature is much better than the allocator literature,
and I'd like it to stay that way.)

>(2) is quick and easy, but most benchmarks are far too trivial to
>stress a GC in realistic ways. Again, it should be done (if only to
>eliminate trivial worst-case behaviour), but it is clearly
>insufficient.

Right.

>This leaves us with (3) as an essential tool for the GC researcher and
>designer. And if one's observations from real large-scale mutators are
>that some aspects of their behaviour can be realistically modelled by
>a continuous random process, one is led naturally to investigate the
>mathematical properties of continuous random models.

I think there's a big if there.  Before we get too wrapped up in
continuous random models, I think we need to show that the assumptions
of continuity and randomness don't affect the things we're trying
to measure.  In general, both *do* affect the things we're trying
to measure, so we need to make sure that we acknowledge that the
models are "somewhat wrong" at the outset, and characterize when
and how they're wrong.

Or if we're modeling the higher-level behavior that is directly
relevant to the systems we're studying, we need to make that very
clear, and point out how _ad_hoc_ the models are.  (Ad hoc models
are often good, but it's important to make clear which _hoc_ they're
_ad_ to, i.e., which facts we're modeling, and which we're not,
and why it's appropriate.)

If we're making ad hoc models, we don't usually need randomness.  We
can write benchmark programs that do "realistic" things, even if they're
not real programs.  We can deterministically generate a variety of
programs that do deterministic things. 

In general, I think this is a better approach than throwing in randomness.
Whenever you randomize data, you smooth it, and you're likely to
introduce biases.

We might randomly choose features of deterministic programs, but each
time we choose a feature randomly, we need to argue that the probability
of a program exhibiting that feature is pretty much independent of
our other choices.

>The principal danger here is that one must not pick too simple a
>model. I agree that my use of the term "half-life" implies too simple
>a model; I was using it to avoid discussing the second-order
>derivatives which actually come out of the simplest actual such model.

Okay.  But I find talk of continuity, randomness, and second-order
derivatives a little unsettling.  It always sounds like we understand
these things in those mathematical terms, and I don't think we do.
I'd rather hear about pigs in snakes and things like that.

I don't want to discourage modeling.  I *do* want to discourage
over-mathematization of things that aren't mathematically tractable.

I think the paradigm for constructing synthetic benchmarks should
be fundamentally nonrandom and noncontinuous.  It should be more
like a grammar than like an equation.

Here's an example.  Suppose our grammar consists of ramps, peaks,
and plateaus, with phase sequences.  I can generate a ramp program with
nested peaks, where peaks have nested plateaus, and the plateaus
have little nested peaks in them.  Or I can generate a plateau
program with a superimposed sequence that consists of a ramp,
a big peak, and three smaller ramps of increasing size.
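
Here's a toy sketch of that kind of generator in Python (the particular
shapes, heights, and lengths are placeholders I'm making up, not anything
measured).  A "program" is a shape for the amount of live data over time,
and shapes compose by nesting, so the two example programs above are just
different sentences in the same little grammar:

    def plateau(height, length):
        return [height] * length

    def ramp(height, length):
        return [height * (i + 1) // length for i in range(length)]

    def peak(height, length):
        up = ramp(height, length // 2)
        return up + up[::-1]

    def superimpose(base, make_child, period):
        # Nest a child shape on top of a base shape every `period` steps.
        out = list(base)
        for start in range(0, len(base), period):
            for i, v in enumerate(make_child()):
                if start + i < len(out):
                    out[start + i] += v
        return out

    # "A ramp program with nested peaks, where the peaks have nested plateaus":
    profile = superimpose(
        ramp(height=1000, length=10000),
        lambda: superimpose(peak(height=400, length=1000),
                            lambda: plateau(height=50, length=100),
                            period=300),
        period=1500)
    # profile[t] is the target amount of live data at time t; a synthetic
    # mutator would allocate and drop objects so that its live data tracks
    # this profile.

Note that nothing here is random; the variety comes from which rules you
apply, how you nest them, and what parameters you plug in.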

I claim that you need this kind of vocabulary to express the basic
things programs do that are relevant to generational
GC. (For example, if you had a good opportunistic GC, but your
synthetic programs didn't exhibit these kinds of behaviors, your
synthetic programs would be systematically biased against your
GC, and any experimental results might be worse than useless.)

Unfortunately, this kind of grammar-like characterization is very
powerful, and not very tractable.  You not only have a combinatorial
space of "sentences" (whole-program behaviors), but you have to
put quantities in, too (e.g., how big a ramp?).  But this is
appropriate, because real programs are even worse in those ways.
(Far worse.)

So you have to heuristically search this space of program behaviors
to see if you can approximate the known behaviors of a bunch of real
programs.  Then you make generalizations that you hope capture the
regularities, and you use those generalizations to generate new
important "plausible" synthetic behaviors.  At each step when you
generate a fake program, you need to justify any random choice of 
parameters.

For example, you might semi-randomly pick peak heights for a particular
"grammar rule", and justify it.  The peak heights are relevant to any
particular generational GC configuration (because they affect whether
data get advanced out of a particular generation), but that on average
the randomly chosen peak sizes will affect randomly-chosen generation
sizes in approximately the right ways the right percentage of the time.
Sometimes these justifications can be analytic, but sometimes they'll
have to be based on empirical data not only for programs but for
interactions between programs and GC configurations.

In general, I think we need to think qualitatively before we think
quantitatively and statistically.  Here's an example of the kind
of question we need to ask and answer, analytically if possible
or empirically if necessary:

  If a program goes through a series of peak phases of a certain
  average height, does it matter whether the peak heights are
  all identical or if they vary around a mean?

The answer to this depends on what we want to analyze.  If we're analyzing
a particular configuration of a generational GC, the variation can
matter a lot.  If all the peaks are the same height, and fit in a
generation, we may not advance any data at all to the next generation.
But if they vary, the bigger ones will cause advancement and the little
ones won't.
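
A back-of-the-envelope sketch of that in Python (the generation size and
peak sizes below are invented, and "a peak" is crudely treated as the
amount of data still live at the next young-generation collection):

    GEN_SIZE = 100   # arbitrary units

    def promoted(peaks, gen_size=GEN_SIZE):
        # Crude assumption: a peak is advanced only to the extent that it
        # overflows the young generation at collection time.
        return sum(max(0, p - gen_size) for p in peaks)

    uniform = [90] * 10                                        # mean 90
    varying = [40, 120, 60, 140, 80, 100, 50, 130, 70, 110]    # also mean 90

    print(promoted(uniform))   # -> 0: every peak fits, nothing is advanced
    print(promoted(varying))   # -> 100: the big peaks alone cause advancement

The two traces have the same average peak height, but one of them never
touches the older generation and the other keeps feeding it.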

On the other hand, if we're talking about average performance for
a particular GC over different configurations (e.g., choice of
generation size), the variation is less critical, but maybe
still significant.  A benchmark with a random variation may give
reasonable results that resemble the average performance across several
programs with fixed peak sizes (but with each program having a different
fixed peak size).  Then again, there may be significant error here because
the performance for a "randomized average" program may not be the same
as the average of the performance across a random selection of non-random
programs.

(In general, the answer to such a question may depend on other aspects
of the program.  For example, whether I advance data out of a particular
generation may or may not affect whether the next generation fills
up and advances data to yet another generation.   It also matters
whether you GC the next generation only when it fills up, or if
you GC it incrementally every n younger-generation GC's, or what.)

For synthetic benchmarks, I'd like to see programs that get by without
random number generators.  For example, if you want to make a benchmark
that doesn't depend critically on the size of the youngest generation,
you can have one that goes through peak phases of increasing size.
The early peaks will fit in the youngest generation of any GC, and
the later ones won't, with a different crossover point for different
generation sizes.  Then you can pick some plausible distributions
of peak sizes that give "realistic" lifetime distributions.
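
A sketch of that in Python (the growth factor, base size, and generation
sizes are invented; a real benchmark would pick them to match a lifetime
distribution you can defend):

    def peak_sizes(n_peaks=20, base=1 << 12, growth=1.5):
        # The k-th peak holds roughly base * growth**k bytes of live data.
        return [int(base * growth ** k) for k in range(n_peaks)]

    def crossover(gen_size, sizes):
        # Index of the first peak that no longer fits in the youngest generation.
        for k, size in enumerate(sizes):
            if size > gen_size:
                return k
        return len(sizes)

    sizes = peak_sizes()
    for gen in (1 << 16, 1 << 20, 1 << 24):   # 64KB, 1MB, 16MB young generations
        print(gen, "-> first peak that overflows:", crossover(gen, sizes))

The point is that the crossover falls out of the interaction between the
benchmark and the particular GC configuration, instead of being wired into
the benchmark by a random number generator.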

What I like about this idea is that it makes it clear that the benchmark
may be unrealistic, and only measures certain things.  Too often, people
take programs that use random number generators, and interpret the
results as being "representative" of real programs, which they often
aren't.  It's better to have a clearly phony benchmark that only models
certain phenomena than an unclearly phony benchmark that seems more
general, but isn't.