[gclist] Finalizers: An alternative

Jerry Leichter leichter@smarts.com
Mon, 13 May 1996 15:52:07 -0400


For all the talk about finalizers, there really seems to be exactly one 
application that keeps coming up:  Freeing externally-defined resources, like 
file descriptors.

Perhaps the right thing to do is to forget about finalizers for the moment and 
concentrate on exactly this issue.  By doing so, I think we can eliminate many 
difficulties.

A garbage collector deals with a single resource, memory.  Let's extend it by 
allowing the definition of new "resources".  A "resource" is an abstract object, 
introduced to the GC by a call of the form:

	rd = newResource(id,after,before,options,handler)

rd is a unique internal "resource descriptor" for the resource.

id is some kind of programmer-meaningful identifier for the resource; multiple 
calls with the same id value get the same d values back.  before and after are 
lists of resource descriptors.  The GC maintains a total ordering of all rd's 
such that the rd returned falls after all the "before" rd's and before all the 
"after" rd's.  If the before and after values given produce a loop, the call to 
newResource fails.  (If the id names a previously-existing resource, such a call 
has no effect on it.)

The only option I can think of is "counted".  Resources will be associated with 
objects; a "counted" resource associates a count of the "number" of such 
resources associated with the object.  A non-counted resource maintains no such 
count - or, logically, the count is always 0 or 1.

handler is a function that receives resources back from the GC, as described 
below.

The "counted" option may not be changed for an existing resource.

There is a built-in counted resource with id "memory" and "well-known" rd which 
represents the contents of the memory allocated for an object.  The count 
associated with the object is the object's in-memory size.  The "memory" 
resource is always implicitly added to the "before" list of any newResource 
call.

Once a resource has been created, any GC'ed object may be flagged as containing 
that resource by calling:

	addResource(rd,object[,count])

The count argument is ignored for non-counted resources; it must be present 
for counted resources, and is added to the count value maintained for the 
object.
The new value of the count is returned (it's always 1 for a non-counted 
resource).  (Normally, count>0, but who knows, the general definition might be 
useful to someone.  I suppose a count of -1 for a non-counted resource could 
"turn it off".)

When the GC locates an unreachable object, if walks through the list of 
associated resources, respecting the total order determined by newResource.  For 
each resource, it calls its handler, passing it a pointer to the object and its 
count (always 1 for a non-counted resource).  The handler cannot "resurrect" the 
object, nor prevent the remaining resource handlers from being called.  It may, 
of course, cause other objects to become unreferenced.  The "memory" resource, 
which always comes last, logically destroys the contents of the the memory 
allocated for the object and places it on the free store.

A program may at any time call:

	needResource(rd,n)

which requests a garbage collection that returns n "units" of the resource.  
What this means is simply that the GC algorithm runs until it has been able to 
find and free objects containing the resource in a total quantity of at least n.
"Freeing" is no different in this case than in any other:  All associated 
resource handlers are called.  However, the GC may choose to ignore unreachable 
objects that don't contain any of the resource (i.e., which either are not 
associated with the resource at all, or have a count value that is <=0).  If 
necessary, the GC may run repeatedly to find "enough" of the resource, which 
could gain something if either handlers for the resource might render new 
objects inaccessible, or if concurrent threads might do the same.  If an 
"optimized" pass - ignoring objects not containing the resource - fails to find 
enough units, the GC has to fall back and simply clean up everything.  Needless 
to say, a well-designed program will try to avoid this by adding the resource to 
the "outermost externally-visible object".  In some situations (e.g., real-time) 
there might need to be a way for the program to specify how long the GC should 
try to find resources before giving up and returning a failure indication.

An obvious resource is "file-handle".  The allocator for any object that 
contains a file handle would declare that fact to the GC.  The allocator and the 
handler for the file-handle resource would keep track of exactly how many file 
handles had been "released".  If the allocator found it needed a handler when 
too many were in use, it would call needResource() to try to free one up.  It 
could call needResource with a request for a large number of units to force all 
dead file-handles to be freed.  A program that wanted to make sure that 
file- handles were closed "soon" might arrange to call needResource with a large 
n value periodically.  Of course, the program would have to establish and 
enforce some kind of convention about where within the object the file-handle 
could be found.  hasResource is intended to be used as part of a package that 
creates certain kinds of objects encapsulating resources; it's a not intended to 
be usable by completely arbitrary code.

It would be reasonable for an implementation to assume that almost all objects 
contain no resource other than memory.  In that case, the resource information 
could be stored separately from the objects.  For a non-moving collector, a hash 
table based on the memory address of the object would be appropriate; for a 
moving collector, one would need something a bit different.  An incremental GC 
might not wish to call all the handlers associated with an object as soon as it 
found it was non-referenced, because of the time involved.  In that case, a 
queue of objects awaiting resource cleanup might develop.  The separate resource 
records could allow objects on that queue that actually contain the resource to 
be handled when a request of the resource was issued.

Long chains of objects containing a given resource would still require multiple 
GC passes to "kill off"; that's inherent in the nature of garbage collection.
In this case you do better with a collector whose performance is based on t
the amount of *garbage*, not on the amount of live data!  (Of course, a clever 
handler might break such chains, allowing recovery during the following GC 
pass.)

Needless to say, a conservative collector may potentially leak any resource, 
just as it potentially leaks memory.

"Counted resources" may be overkill - allocator and handler could probably keep 
track of counts themselves.
							-- Jerry