[gclist] Guardians

Hans Boehm boehm@hoh.mti.sgi.com
Mon, 14 Apr 1997 11:08:49 -0700


On Apr 13, 12:34pm, Carl Bruggeman wrote:
> [I wrote:]
> > c) The finalization facility should make it hard to accidentally
> > access finalized objects.  If I have to write subtle code to get
> > finalization order between a file buffer and the underlying file
> > right, in my opinion something is wrong.  If the scheme breaks with
> > a concurrent collector, something is wrong.
>
> Its the characterization that an approach is "wrong" here that bothers
> me.  In this finalization debate there there is no "wrong" and no
> "right" way; there are only design tradeoffs between the two
> approaches that should be illuminated as clearly as possible (hence my
> posts to try to dicover what weak-for-finalization pointers were and
> how they were used to influence finalization order).

OK.  I'll refrain from using the word "wrong" here.  As I previously
pointed out, people are programming with Java and they are presumably using
Guardians.  Finalization is needed rarely enough that any of these mechanism
are
sufficient in practice.  Further, they all prevent dangling pointers for purely
garbage collected code, which I think is the next most important criterion in
evaluating finalization mechanisms.

However, beyond that, it still seems to me that topologically ordered
finalization is a clear win over the alternatives.  The reason I'm pushing on
the point, in spite of the fact that I don't believe it's among the most
important language design decisions, is that I was hoping it's one of those
rare opportunities to prune the design space.
>
> My understanding so far is:
>
> Topological finalization:
>
> You are proposing a language restriction that enforces a topological
> finalization order.  This restriction correctly handles finalization
> automatically N% of the time (I think you have characterized this as
> 90%, but the important part here is that N<100%).  Thus, the
> programmer must use an alternate mechanism in 100-N% of the cases.
> Your mechanism to allow the programmer to handle these cases where
> topological ordering is inappropriate is weak-for-finalization
> pointers.
>
> In exchange you are willing to place responsibilty for timely
> finalization in the hands of a programmer, who must use
> weak-for-finalization pointers in those cases where there are long
> dependency chains or cycles.  In other words, you do not get
> _automatic_ timely finalization because it requires programmer
> intervention in some cases.

Right.  But this sort of programmer intervention is required (on similarly rare
occasion) for any sort of garbage collection.  I claim that if you have a
garbage collector, sooner or later you will have to take some care that
pointer-reachability is close to the "will be accessed" property.  Certain data
structures don't have this property (see the paper just preceding Carl's in the
PLDI 93 proceedings).  Large data structures remain accessible if only a small
piece is needed.  Similarly, there are rare cases in which reachability from
finalizers doesn't model whether or not something will be accessed during
finalization.  Either case can be fixed by restructuring the data, or clearing
pointers at the right time.

The general observation is that this is rare.  (The nonfinalization problem is
even more rare in Scheme than it is in C++ or Modula-3.  This is largely
historical accident, in that it's harder to build the offending data structures
out of cons-cells.)  When it does become a problem, it can usually be remedied
by local changes.

Timely finalization is elusive in either scheme.  It's hard to define what you
mean by timely in a world in which that object in question may have been
promoted to a generation that will be collected next week.  I've never
encountered it as a real issue.  People don't build lists of a billion
finalizable objects.  Linear lists of that length are rarely a reasonable data
structure.  And there are rarely anywhere near that many finalizable objects.
 (Except if you need finalizers to deallocate memory in a mixed language
program. But then you really want the safety of topological ordering.)  It's
trivial to break such chains when they do arise.
>
> It has also been asserted that weak-for-finalization decisions can be
> made using only local information.  If this were true and could be
> determined automatically (i.e.  done by the compiler), it would in
> essence eliminate the need for programmer declarations about weak
> pointers and you would again have automatic timely finalization.  I
> have proposed a counter example for this assertion, namely that for
> cycles, the programmer must have information about an entire cycle to
> break a cycle correctly.  I also still believe that more than local
> information is required for chains of dependencies.
I don't claim that the decision is always local.  I do claim that in most
interesting cases it is.  A typical example (proposed by Christian Jacobi) is a
finalizable object A (representing a window?) that has a Lisp-style property
list attached.  The property list can include arbitrary references, and may
indirectly refere back to A.

This introduces a potential cycle.  But finalization will never refer to the
property list.  Hence it is perfectly safe to treat it as weak for finalization
(or to split the object A suitably).

Technically, this does require a global check that there will be no accesses to
the property list by other finalizers.  The interface to A should document that
the property list will be unvailable after the object first becomes
inaccessible.  In practice this is unlikely to be an issue.
>
>
> With guardians:
>
> Timely finalization is automatic and requires no intervention on the
> part of the programmer.
>
> Finalization order is not automatic, but is completely under the
> control of the programmer.  In M% of the cases, however, the guardian
> code can be encapsulated in the object creation functions where it is
> nearly as automatic as in the case of topological finalization.
> Unfortunately, the value for M% is rather fuzzy and will remain that
> way until we have more experience with guardians.  Even if the simple
> mechanism that I proposed for dealing with chains of dependencies
> becomes overly complex, it does not rule out the possibility of other
> ways of using guardians to handle finalization dependencies in a
> simple modular fashion.  In order to consider alternate mechanisms I
> would appreciate an example of finalization dependencies where
> weak-for-finalization pointers shine.  I have seen someone (I think
> it was Hans) allude to an "editor" with "cords" (Cedar ropes with
> file backing?) as being an example.  I would appreciate as detailed
> a description as possible (documented source if you wish).
The source is included in the gc distribution.  By itself, it's not a good
example for this discussion.  Finalization is only used to close read-only
files.  There are no finalization order issues.  And I agree with you that this
is the typical case.

Let me try to describe the example I alluded to previously.  This assumes that
we use cords (the string implementation included with the garbage collector) or
equivalently Cedar ropes as the standard string type.  Cords have the following
charactersistics:

- Strings are represented as binary trees.  The leaves contain either C-style
character strings or a function closure for computing the i'th character.  The
internal nodes represent concatenation.  Thus (nondestructive) concatenation
can usually be performed in constant time.  Essentially concatenation is lazy.

- In-place updates are not supported.  Cords can be updated by taking
substrings of the original cord and concatenating with other strings.  The
substring operation is roughly logarithmic in the length of the string.
 Computing a long substring of a cord represented by a function closure results
in a new function closure (with suitable hacks so that repeated substring
operations perform acceptably).

- Cords provide a function CORD_from_file(FILE * f) that turns a read-only file
into a cord.  The file is either read immediately (if it is short enough) or
lazily.  In the latter case, the string representation is actually a function
closure that reads the file on demand (with a suitable buffering scheme).

- If I repeatedly modify a string that was generated from a file, dropping the
original version, it is hard to tell when the file is no longer needed.  Hence
the file must be closed implicitly by a finalizer.

Now consider another object B with a finalizer which simply logs the objects
print name in a log file of known-deceased objects.  The name is stored as a
cord, which is passed into the object creation procedure.  (This is a bit
contrived, but it's not hard to believe that a finalizer might want to access a
string.)

In order for this finalizer to run correctly, I need to ensure that it's name
(the cord) is still fully intact.  This means that any files embedded in it
must be closed AFTER B's finalizer is run.  B's implementation has no idea
whether there are any such files.  (In fact it's unlikely unless B has a very
long name.)  But I still need to deal with that case.

Topologically ordered finalization just does the right thing in this case,
without programmer intervention.  I don't know how to handle it well with
Guardians.


>
> My first cut at an analysis:
>
> There are I% of the cases where the finalization order is independent
> of any other finalization actions.  In my very limited experience, I%
> has been 100% in production programs, but your experience with
> real-world problems clearly shows otherwise.  In either case I% must
> be less than N% (because topological ordering is one possible order
> for independent finalization actins) or M% (because guardians can be
> used to implement more than just independent finalization actions with
> only trivial work on the part of the programmer).
>
> Question:  What is a realistic value for I% ?  I suspect that I% is
> very large, on the order of 50-80%, but I would be interested in what
> you have found out.
>
> Summary:  I% < N% < 100%
>           I% < M% < 100%
>
> We don't have any hard data or whether N<M or M<N.  You clearly feel that
> M<N, and I wouldn't currently dispute that, but would like to point out
> that since the programmer has complete control, each new insight into the
> use of guardians pushes M higher.

I would guess that I is on the order of 80, maybe higher.  But I haven't looked
at all the Cedar uses of finalization.

We have to be a little careful in defining I.  In most cases, I would guess
there is no dynamic dependence.  But it may be that a finalizer for C does in
fact perform an operation on a subobject D.  C should not know whether D points
to another object, whether that object requires finalization, or whether in
fact it might just be a small integer.  Thus I would expect there to be
relatively many cases in which it's hard to exclude the existence of a
dependence without violating modularity, but there in fact is never a dynamic
dependence.  But I would still guess that in 80% of the cases, it's easy to
guarantee that there is no dynamic dependence.
>
> > If the scheme breaks with a concurrent collector, something is wrong.
>
> You have mentioned this point twice without explaining how guardians
> suffer in a concurrent collector in ways that topological finalization
> do not, so I am still completely in the dark about this assertion.
>

It would be nice to get this clarified.  I think I'm beginning to understand
how this would have to work, so perhaps you can correct me.  Let's deal with
the above example in which B refers to a cord.  I could run, every once in a
while:

<finalize all type B objects>
<finalize all objects I can retrieve from the file guardian>

But that would be wrong, as we discussed before, since a collection can occur
between the two.  So I could use

while I can retrieve a file F from the file guardian
    <finalize all type B objects>
    <finalize F>

That looks correct to me, but it seems to have some unpleasnt modularity
properties.  The file finalization code has to explicitly know about all
objects that may refer to files, and what their internal dependencies are.  It
seems to me that if I tried to use this in cases which required ordering, I
would probably build something like Charles' static finalization ordering
scheme on top of it, so that I could hide all the finalization ordering code in
one module?

Hans

-- 
Hans-Juergen Boehm
boehm@mti.sgi.com