[gclist] Distr. GC general discussion for FAQ (was re: Weighted RC)

Marc Shapiro shapiro@prof.inria.fr
Fri, 22 Mar 1996 10:36:55 +0100


 || From: wilson@cs.utexas.edu (Paul R. Wilson)
 || Date: Thu, 21 Mar 1996 09:08:54 -0600
 || 
 || Distributed garbage collection is tricky, because tracing all reachable
 || data in a distributed system can take a very long time.

In a large-scale system (e.g. the Internet) you may safely assume it will
never finish.

 || Many designs use reference counting or reference listing to avoid
 || distributed tracing.  Since reference counting can't reclaim cycles,
 || this can lead to leaks if cycles span nodes.

The big question is: how bad is this going to be?  Nobody has any real
experience with distributed GC.  The only meaningful data points we have come
from distributed file systems, which have no cycles by construction, and the
WWW, which has a lot, but never has any garbage (any URL is a root).

 || Many people who advocate reference counting/listing advocate using it in
 || conjunction with a more complete tracing GC.  A common scheme is:
 || 
 ||    1. Use a local tracing collector
 ||    2. Use distributed reference counting or reference listing
 ||    3. Use distributed tracing of some sort to get back distributed
 ||       cyclic garbage
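To make the first two steps of that scheme concrete, here is a minimal sketch (Python for illustration only; all names are invented, this is nobody's actual implementation) of a local tracing collector whose extra roots come from a reference *list* rather than a bare count:

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.children = []        # ordinary local references

class Process:
    def __init__(self):
        self.roots = []           # local roots (stacks, globals, ...)
        self.entry_table = {}     # obj -> set of remote sites holding a reference

    def remote_ref_added(self, obj, site):
        self.entry_table.setdefault(obj, set()).add(site)

    def remote_ref_dropped(self, obj, site):
        # Listing (rather than counting) records *who* holds the reference,
        # so the entries of a crashed site can simply be purged.
        self.entry_table[obj].discard(site)

    def local_collect(self):
        """Step 1: a plain local trace.  Its extra roots are the objects
        some remote site still lists (step 2).  Returns the live set."""
        stack = list(self.roots)
        stack += [o for o, holders in self.entry_table.items() if holders]
        marked = set()
        while stack:
            o = stack.pop()
            if o in marked:
                continue
            marked.add(o)
            stack.extend(o.children)
        return marked             # anything on the heap but not here is garbage
```

The point of listing over counting shows up in `remote_ref_dropped`: you know which site let go, which is what makes the scheme robust to duplicated or lost messages and to site crashes.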

One alternative solution is to migrate any object that is not locally
reachable to a process from which it is reachable.  A cycle will eventually
migrate to a single process, where the local collector will reclaim it.  This
is harder than it looks at first glance: what if your object is reachable
from two processes?  How do you ensure termination?  (The Liskov group has
some results.)
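A toy version of the migration idea (invented names, Python for concreteness; it deliberately dodges the hard part by only migrating when there is a *single* remote referrer, and moving one object at a time -- with two referrers or concurrent rounds you get exactly the ping-pong/termination problem mentioned above):

```python
class Obj:
    def __init__(self):
        self.children = []    # references, possibly cross-process
        self.owner = None     # the Proc currently holding this object

class Proc:
    def __init__(self):
        self.roots = []
        self.heap = []

    def add(self, obj):
        obj.owner = self
        self.heap.append(obj)

    def reachable(self):
        """Ordinary local trace: follow only same-process references."""
        seen, stack = set(), list(self.roots)
        while stack:
            o = stack.pop()
            if o in seen:
                continue
            seen.add(o)
            stack.extend(c for c in o.children if c.owner is self)
        return seen

def migrate_step(procs):
    """Move one locally-unreachable object to a process that references
    it remotely.  Returns True if something moved."""
    for p in procs:
        live = p.reachable()
        for obj in list(p.heap):
            if obj in live:
                continue
            dest = next((o.owner for q in procs for o in q.heap
                         if obj in o.children and o.owner is not p), None)
            if dest is not None:
                p.heap.remove(obj)
                dest.add(obj)
                return True
    return False
```

Run `migrate_step` to a fixed point on a two-process cycle with no roots: the cycle co-locates on one process, where `reachable()` then finds all of it to be garbage.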

Also, I have seen proposals where you actively look for a cycle when some
heuristic decides that a cross-process reference is likely to be garbage.
However, I've never read a paper that I both (i) understand and (ii) believe.

I think Schelvis' algorithm is in this category, but, for the life of me, I
don't understand it.  Can someone explain it to me?

    @InProceedings{gc:rep:974,
      author = 	 "Marcel Schelvis",
      title = 	 "Incremental Distribution of Timestamp Packets: a New
                      Approach to Distributed Garbage Collection",
      editor = 	 "Norman Meyrowitz",
      number = 	 "24(10)",
      series = 	 "SIGPLAN Notices",
      pages = 	 "37--48",
      booktitle = "OOPSLA'89 Conf.\ Proc.",
      year = 	 1989,
      organization = "ACM Sigplan",
      publisher = "ACM",
      address = 	 "New Orleans, LA (USA)",
      month = 	 oct
    }

 || It is fairly common for objects in distributed systems to have cyclic
 || connections, because objects that communicate with each other remotely are 
 || likely to have pointers to each other, and often this communication is
 || bidirectional.  For example, in client/server-style computing, an object
 || representing a server may have pointers to its clients, and the clients
 || of course have pointers to the server.  More generally, directory
 || structures used to manage resources are likely to have pointers
 || to things on lots of nodes.  So it's nice to use reference counting, but
 || you may still need a distributed tracing GC to get back the "hard stuff."

One could argue that these are special cases.  They are easy to recognize and
programmers can break the cycles explicitly.  (Yeah, I know, GC is supposed to
be automatic; unfortunately no 100% solution can exist, so I'm trying to find
acceptable 90% solutions).
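For the client/server case quoted above, "breaking the cycle explicitly" can be as simple as making one direction of the pointers weak.  A sketch in Python (used here purely for illustration; the same idiom exists in any language with weak pointers):

```python
import weakref

class Server:
    def __init__(self):
        # Weak references: the server does not keep its clients alive,
        # so the client -> server -> client cycle is broken by construction.
        self._clients = weakref.WeakSet()

    def register(self, client):
        self._clients.add(client)

    def broadcast(self, msg):
        return [c.receive(msg) for c in self._clients]

class Client:
    def __init__(self, server):
        self.server = server      # strong: a live client keeps its server alive
        server.register(self)

    def receive(self, msg):
        return f"got {msg}"
```

Once the last strong reference to a client disappears, it drops out of the server's set automatically, with no distributed cycle detector needed.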

 || For these reasons, conventional distributed GC's may have serious scalability
 || problems

Absolutely.  That is the number one problem.

Add another interesting fact: most data resides on disk and not in memory.  If
you want to trace the world, you must pay a high I/O price.

 || A different approach is to use an incremental local tracing GC, which
 || is always going in the background, and to propagate information for the
 || distributed trace asynchronously from local GC cycles.

That is the only way to go; but it does not solve the fundamental problem:
that a distributed trace will never terminate in a large-scale system.  That
problem HAS NO SOLUTION.

 || Another way to do it is to send
 || the data to a central service to be traced.

Easier said than done.  Ladin and Liskov tried this path, and got it wrong!
They did publish a fix later, which falls back onto Hughes' algorithm.
Liskov's group has now abandoned this route because it is too complicated and
does not scale.  Now they do migration.

 || A related alternative is to use a data-shipping approach to GC, like
 || a distributed virtual memory, rather than a function-shipping approach.

That is what we did in our Larchant DSM.  But this still does not (CANNOT)
solve the problem of a global trace in a large-scale system.

We instead approximate the global trace by a series of local traces.  This
does not guarantee that all cyclic garbage is reclaimed.  The scope of a local
trace is guided by a heuristic that minimizes cost.  We are crossing our
fingers and hoping that our heuristic is a good 90% solution.  But we will not
have any hard figures for a while.
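The shape of such a partial trace is easy to state, even if choosing good groups is the hard part.  A sketch (invented, not Larchant's actual code): trace within a chosen group of objects, conservatively treating every reference entering the group from outside as a root.  Garbage wholly inside the group is reclaimed; garbage cycles crossing the group boundary are (safely) missed.

```python
def collect_group(group, all_refs, roots):
    """group: set of object ids covered by this partial trace.
    all_refs: dict obj -> list of objects it references.
    roots: globally known roots.  Returns garbage found inside the group."""
    # Conservative roots: real roots inside the group, plus anything in the
    # group referenced from outside it (we cannot tell whether the outside
    # referrer is itself live, so we must assume it is).
    seeds = [o for o in roots if o in group]
    seeds += [t for o, targets in all_refs.items() if o not in group
              for t in targets if t in group]
    marked, stack = set(), seeds
    while stack:
        o = stack.pop()
        if o in marked or o not in group:
            continue
        marked.add(o)
        stack.extend(all_refs.get(o, []))
    return group - marked
```

The heuristic's whole job is to pick `group` so that garbage cycles rarely straddle its boundary, because those are exactly the ones this trace cannot see.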

 || The problems outlined above make me think that distributed GC is
 || far from a solved problem;

It is not solved because it is not solvable!

 || for the time being, I don't think you
 || can "garbage collect the world", just small or maybe medium-sized
 || networks, preferably without large volumes of data, and preferably
 || fairly reliable networks of fairly reliable machines.  (These
 || problems are exacerbated by having to handle node failures.)

Unfortunately the real world is asking for solutions, now.

                                                Marc