[gclist] Collecting distributed cyclic garbage.

Giuliano Carlini giuliano@ix.netcom.com
Wed, 5 Jun 1996 12:23:29 -0700


This message desicribes an algorithm for reclaiming distributed
cyclic garbage.

I hope that posting it to the list is within the list's charter.
I'll take this off line if either it is not, or if a list member
objects.

If you are interested in this, please take a look at the "big
picture" first, and comment upon that. If after reviewing that you
have the energy to comment upon the surely many details which are
incorrect, inconsistent, or otherwise wrong, I'd be very
appreciative.

I apologize in advance to any one who has already developed an
algorithm like this. My ties to the academic world are non-existent,
and I'm unaware of much that happens there. Please let me know if
I just re-invented the wheel.

This algorithm is heavily influenced by the work at DEC on Network
Objects, and the work at INRIA on SSP chains. This message presents
a great deal of work devised by these researchers. I'm making no
claims that I developed these approaches. I present this material
only because back-searhing is built upon them, and one must
understand them to understand back-searching.

I'm also heavily indebted to those who pioneered the use of using
write barriers to detect changes to previously scanned data. This
served as an inspiration for detecting newly created references to
objects that were previously scanned. This is crucial to detecting
race conditions where a reference moves from an unscanned space to
a scanned space, leading to the incorrect reclaiming of an object.

I also apologize to anyone whose work I use but haven't mentioned.
Just let me know, and I'll correct that.

Thanks,

g






Back-Searching to Collect Distributed Cyclic Garbage.

ABSTRACT:

This algorithm reclaims distributed cyclic garbage by
"back-searching". It is designed for a client/server style of
processing using proxies to represent objects located in a remote
address space.

Back-searching inverts the traditional tracing collector's operation.
Rather than tracing live objects, it traces through "zombie" objects.
A zombie appears to be dead, but may on inspection still be live.
A zombie object is one which is:
	remotely referenced
	locally unreachable
	has not been recently mentioned in an RPC

This would be a horrible approach for a local collector. But, because
of the differences between local and distributed collectors, I expect
that it will work well.

The improvements offered by this approach over the very little
research I've seen are:
	- It does not require a central "Oracle" which is responsible for
		all distributed collection
	- It should scale well since the work performed by any space is
		dependent only upon the number of references through that
		space, and not upon the number of references in the system
		as a whole.
	- I believe that it is resilient to communication errors. While
		communication errors may delay the reclaiming of garbage,
		they should not cause an object to be reclaimed in error, 
nor
		prevent an unreachable object from being reclaimed forever.
	- It does not require a distributed clock
	- It should be relatively efficient. It should be O(n) where n is
		the size of a cycle.


PROGRAMMING MODEL

This algorithm assumes that a program is run on many address spaces,
abbreviated as "spaces". Data are shared by passing between the
spaces. An object is located in exactly one space. It may reference
other objects. A reference is either local or remote. It is local
iff the target object is in the same space as the subject object;
otherwise it is remote. Local references are represented as pointers.

Remote references are represented as a pointer to a "stub" proxy.
The stub stands in for the target in the remote address space. All
function calls intended for the target are made on the stub. The
stub forwards the function call to the target by sending a message
to a "scion" proxy in the remote space. The scion receives the
message, and makes the call on the target. If the function returns
a result, the scion sends it back to the stub, which in turn returns
it to its caller.

There is a 1-1 mapping between stubs and scions. That is, stubs do
not share scions. In addition, every object which is the target of a
scion has a "visa" which maintains data used for communication and
garbage collection. All scions which share a common target, also
reference a common visa.

The local roots of a space are its registers, stack, and global data.
The roots of a space are its local roots plus its scions. An object
is locally reachable when it can be reached from its spaces local
roots. An object is remotely visible if it is reachable from its
spaces scions. It is remotely reachable if it is reachable from
some spaces local roots.

LOCAL GARBAGE COLLECTION:

Most objects are reclaimed by a local collector. This collector
retains all objects reachable from the roots. If an object is
reachable from no root, it is garbage and should be reclaimed.

DISTRIBUTED ACYCLIC COLLECTION:

When a stub is no longer reclaimed by the local collector, it sends
a delete message to its scion. Any stub reachable from only that
scion in turn becomes garbage, and will be reclaimed on the next
local collection of that space. This guarantee's that over time all
distributed acyclic garbage is collected.

DISTRIBUTED CYCLIC COLLECTION STRATEGY:

A zombie is an object which is probably dead, but which may not be.
A zombie is:
	Locally unreachable.
	The target of at least one scion.
	Not recently mentioned by an RPC.

Distributed cycles are collected by finding zombies, and searching
transitively backwards through the objects which reference the zombie
looking for an object which is locally reachable. If there is no
such object, the zombie is unreachable. It and all its ancestors can
be reclaimed.

Three different collectors are used. The first is a traditional local
collector which uses registers, stacks, and globals as its roots. The
second is the "Remotely Visible" - abbreviated RV -scanner. It uses
scions as its roots, and is responsible for inializings the data
structures used by the third. The third is the back-search scanner.
It checks if a locally unreachable objects is reachable from the
local roots of some remote space.

DATA STRUCTURES:

Every object must maintain a remotely visible bit denoting that it is
reachable from only a scion. It is set iff the object is not locally
reachable, but is reachable from a scion.

Every address space has a globally unique identifier.

Back-searches have globally unique identifiers. The identifier is
composed of the zombie's space identifier and zombie's address in
the space.

Visa's record:
	- The time an RPC last mentioned their target
Visa's also record the following if their target is a zombie:
	- The next scheduled time to perform a back search
	- The stubs which were reachable from the zombie during the last
		run of the local collector.
	- A count of the number of stubs which were reachable from the
		zombie, but not included in the above list.
	- The back-search id of the back-search which is looking for
		a reachable ancestor of the zombie.

Stubs which are locally unreachable record:
	- Pointers to all visa's whose targets can reach it.
	- A count of the number of visas whose targets can reach it, but
		are missing from the above list.

LOCAL GARBAGE COLLECTION:

Very few constraints are placed upon the local collector. Even though
all objects must have a remotely visible mark bit, the local GC can
be done using a copying or compacting collector. Only two
requirements are imposed on it by the distributed collector. First,
it must clear the remotely visible mark bit of the objects it retains.
Second, it must run the RV scanner when it finishes.

REMOTE VISIBILITY SCANNER

The remotely visible scanner uses the scions as it roots. It is
responsible for setting the remotely visible bit for all objects
it scans. It is also responsible for adding stubs to visas' reachable
list, and visas' to the stubs reached from list. If one of these lists
is too small, it may not be possible to allocate memory to expand them
to add new entries. In this case, the RV scanner will increment the
deficiency counter for the list. After the local GC completes and the
allocator state is consistent, these lists can be expanded. They will
then be ready for the next run of the RV scanner.

When a stub or scion's target is locally reachable, there is
no need to maintain the cross reference list. In fact the list may
be deleted. The first run of the RV scanner after a proxy becomes
locally unreachable will simply increment the proxy's deficiency
counter. The next run will then fill in the cross reference lists.

As the RV scanner runs, it will schedule scion targets which are
newly locally unreachable for a back search. When a locally
unreachable scion's target is mentioned in an RPC, the scheduled
date is slipped. Further, the object and those reachable from it
may be migrated to the source space which most often mentions
the target in it's RPC's.

BACK SEARCH SCANNER:

Back-searches are scheduled for visas whose targets are:
	- locally unreachable
	- not recently mentioned by an RPC.
The scheduled back-search time is a function of the object's age,
and the time the object was last mentioned in an RPC. The older
an object and the more recent it was last mentioned, the more
distant the scheduled back-search time.

In order to detect cycles, and to minimize redundent back-searches,
visas record the id of the back-search which is searching for a
locally reachable ancestor. If a back-search visits a zombie whose
back-search id is equal to the current back-search's id, then the
back search has circumnavigated a cycle. If two zombies share a
common ancestor, a back-search from each may reach the common
ancestor. There is no need for both to conduct a complete back-
search for the visa's target.

In this case, the first will set the visa's back-search id. The
second will see that visa's back-search idis set and not equal to
the second back-search's id. The second and subsequent back-searces
will suspend themselves. Suspended back-searches resume when the
first returns. They then return the result returned by the first
back-search.

Suspending backsearches can cause a deadlock. How this can
occur, and how to avoid it are discussed later.

Back-searches run in two passes. The first pass searches for a
locally reachable stub which can reach the zombie. A back-search
is started for the zombie. The zombie's visa's backsearch id is set
to the newly created back-searches. Immediately, distinct back-
searches are cloned for each of the zombie's scions. Each inherits
the id of the original. Each back-search goes back to the scion's
stub. If the stub is locally reachable, then the zombie is remotely
reachable. If the stub is locally unreachable, the search then
visits the stub's list of visas whose targets can reach it. If a
visa's back-search id is null, it is set to the current back-
search's id. Then a new back-search is cloned for each of the
visa's scions. Each of these back-searches retains the parent
back-searches id.

Every back-search will either:
	- find an unreachable stub
	- cycle back to the zombie
	- find a locally reachable stub in which case the zombie is
		remotely reachable
If a back search finds an unreachable stub, the stub is deleted.
The back-search terminates and reports back that it found no locally
reachable stub.

If a back search cycles back to a zombie, then the zombie is not live
along that path. The back-search is terminated and reports back that
it found no local reference.

If a back search finds a locally reachable stub, the zombie is
reachable. The search terminates and returns the id for the space
back down the search path. The target space can then migrate the
target to one of the spaces with a local reference which reaches the
target.

The result of a back-search with cloned daughters is dependent upon
the daughter's return values. The back-search returns that it found
a locally reachable stub, iff some daughter found a locally reachable
stub.

If the first pass of all back-searches report that they found no
locally referenced stub, then the zombie is probably unreachable.

But, there is a race condition. A back search may examine a stub,
and start searches on each scion which can reach the stub. Each of
these returns that no local reference was found. The back-search
reports that no local reference was found. Then, a new scion is
created which can reach the stub. The stub is now reachable, so the
zombie is reachable. But, the back search for the stub has already
reported that the stub was unreachable. Thus, the zombie may be
collected.

The solution is to make a second pass, and to implement a "new
scion" barrier. This is similar to the write barrier used by
local collectors to determine that they must rescan an area of
memory.

Whenever a stub has been scanned, the space is notified that new
scions should be marked dirty. If the dirty scion is the first for
the target, then the RV scan must be run on the scion. We then need
only look at the target's visa. Any stub cross referenced by the visa
needs to be marked as reachable. When the back-search returns to the
stub, it must wait for all dirty scions created after the first scan
to be checked. After they have, it can report on whether it is
reachable or not.

If all of a zombie's stubs make it through both passes of the back-
search without being reachable, then the zombie and all proxies which
reach it are garbage and can be reclaimed.

Each space maintains a list of the back-searches which have completed
the first, but not the second pass. Each entry in this list records
the dirty in proxies that it watches. A newly created scion is dirty
and is added to the latest entry. After the RV scanner processes a
scion, the scion is clean is removed from its dirty entry. When the
second pass of a back search comes through, it scans its entry and
all later ones. This causes all stubs which might be affected to be
updated. The second pass returns reachable, iff its stub had a scion
added to its list of scions.

AVOIDING DEADLOCK:

The last problem to handle is avoiding a deadlock. A deadlock can occur
when several zombies in a cycle start back-searches. As each comes
upon the others visa's, they will see that the visa's back-search id
is non null and not equal to their id, and suspend. Deadlock.

I haven't figured out yet how to detect this condition. I believe
that the following strategy will be effective at solving it. Assume
N zombies in a cycle start back-searches. Eventually N-1 will suspend.
I believe that the only way that the N'th can deadlock is if there is
a cycle. That is, deadlock implies a cycle. So, back-searches check
for deadlock before suspending. If they would deadlock, there is a
cycle. Therefore, there is no reason to suspend, since a the action
when a cycle is detected is to terminate the back-search, and return
that no locally reachable stub was found.

FUTURE WORK:

I must figure out a deadlock avoidance protocol.


Adapt Back-Searching to the shared memory model of distributed
programming.