[gclist] FAQ: distributed objects

Marc Shapiro shapiro@prof.inria.fr
Fri, 8 Mar 1996 16:40:06 +0100
Previous message: [gclist] write barrier
Next message: [gclist] FAQ: distributed objects
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I started working on the "distributed objects" part of the FAQ.  It's pretty
long but it's hard to explain a difficult problem in few words.  I can try to
simplify it in a second pass.

                                        Marc

DISTRIBUTED OBJECTS 

A reference (e.g. ports, so>ckets, URLs, etc.) is the distributed equivalent of
a pointer. Few existing reference mechanisms support GC. If you consider
objects that are widely shared, in a large distributed system, manual
collection is a nightmare, and automatic garbage collection is essential.

Lots of people think it's a very hard problem. In fact, complete GC is
unfeasible, but simple solutions can collect most garbage.

The problem is that the property "object X is garbage" is a global property,
i.e. conceptually you need to examine the whole system (in a consistent state)
just to know the status of X. Examining the whole system is expensive, and
actually unfeasible in the presence of failures; getting a consistent state is
hard. However it is always safe not to collect; so any conservative
approximation is OK.

What is different about distributed systems?

    To know the status of a remote site you must send and receive
    messages; messages are costly.  Distributed systems are inherently
    asynchronous. A message only tells you about the past of the sender.
    There is no single global order of events. 

    Failures do occur. Machines crash, processes get into infinite
    loops. Messages get lost and delivered out-of-order (even with TCP/IP,
    in the presence of crashes). In the Internet remote sites often appear
    unreachable. The killer is: you can't reliably detect failures, so
    some questions about remote state are undecidable (in distributed
    systems jargon: "consensus is impossible in an asynchronous system
    with failures").

    As a consequence of the above: data tends to be inconsistent. If the
    data you are copying is the GC status, then you may be deciding to
    reclaim an object based on incorrect information.

You will find general algorithms in the distributed systems literature that
appear to solve the above problems, called distributed snapshots, distributed
transactions, etc. They will work but they are very costly and they don't
scale.  Specialized solutions to the specific problem of GC can be much
simpler and easier to implement.

Distributed reference counting and reference listing 

The easiest GC algorithm to distribute is reference counting. A simple-minded
approach is to attach a counter to each object; duplicating or killing a
reference sends an increment or decrement message to the target object. This
is very expensive (one message per pointer operation) but more importantly
doesn't work. Since there is no guaranteed delivery order for messages, an
increment might be overrun by a decrement, and the target object unsafely
reclaimed.

A general fix is to provide "Causal delivery" but that's overkill. A simpler
fix is Weighted Reference Count (or some variation thereof). When an object X
is created, it is allocated an initial weight. The first reference to X
carries a weight equal to the object's. Duplicating any reference to X divides
the weight between the two duplicates. Killing a reference sends a decrement
message to the object's weight. At all times, the sum of the reference weights
is equal to the object's weight.

In "reference listing" the target object keeps a "scion set", i.e. a list of
backpointers to the processes that hold pointers to it. This is made possible
by the following observation: if process P owns object X, process Q may
acquire a reference to X only if P sends a message to Q containing the
reference. At message sending time, record the name Q in the reference list of
X. When Q kills its (last) reference to X, it sends a decrement message which
removes Q from X's scion list. When the scion list is empty, X can be
reclaimed.

Reference listing is less sensitive to failures than counting. If process Q
crashes, P can unilaterally remove its scion from all its reference lists (but
beware: it is impossible to reliably detect crashes!). If a decrement message
is lost, P can query Q. Reference listing is heavily used in Partitioned GC.

Reference counting and listing scale well. They cannot collect cycles of
garbage. Another drawback in practice is that they do not provide "root
identity", i.e. if there are multiple roots (e.g. process roots and persistent
roots), there is no way to know from which a specific object is reachable.

Distributed Mark and Sweep

For concreteness, I will concentrate on Mark and Sweep; but all tracing
methods (whether mark and sweep, copying, or other) have a synchronization
point and hence will not scale well. On the other hand, tracing does collect
cycles and supports root identity. It is also fault tolerant (in a perverse
way): if anything goes wrong, you can abort a trace and start again from
scratch.

The simple-minded approach, where each process marks and sweeps in parallel,
sending "mark" messages down remote references, doesn't work. Suppose P has a
reference to object X in process Q, and Q has no local reference to X. Then Q
shouldn't start sweeping until it is certain that it will receive no more mark
messages from P. Of course, P must do the same with respect to Q. Thus, every
process must wait for all others between the mark and the sweep phases (this
is called a "termination protocol").

The Hughes algorithm overcomes this problem somewhat by using timestamps to
mark reachable objects (rather than a single mark bit). The idea is that the
timestamp of a reachable object will always increase; after an object becomes
unreachable its timestamp ceases to increase; objects whose timestamp is below
some global minimum are garbage. The catch is that computing the global
minimum is a termination protocol. The good thing is that it can be done in
the background.

Ladin and Liskov take an alternative approach. The idea here is that each
process sends (an approximation of) its local graph to some central
service. The central service combines the partial graphs into a global graph,
which it marks and sweeps. There are a few problems with this approach. The
first is that the central service is a bottleneck and a single point of
failure (Ladin and Liskov address this problem by replication). The second is
that the partial graphs are mutually inconsistent (Ladin and Liskov address
this by timestamping and being conservative). The third is that getting the
partial graph information is harder than it seems (Ladin and Liskov got it
wrong the first time around).

Partitioned or Hybrid GC

Partitioned GC combines the advantages of Distributed Reference Counting or
Listing and of Distributed Mark and Sweep. The global memory is partitioned;
GC is hybrid: tracing within a subset, counting (or listing) across subsets.

The idea is that when some subset of the global memory (e.g. a process) sends
a pointer to another subset, that pointer is recorded in a "scion set" (akin
to a remembered set). You keep a count of such inter-subset references.  Thus,
each subset can independently and locally perform tracing of any
kind. Inter-subset references are reference-counted.

The difference between a remembered set and a scion set is that in the former
case the memory subsets are ordered, so no cycles of garbage can form. In the
latter case there is no natural ordering and inter-subset garbage cycles
cannot be collected.

Variations on this theme have to do with doing the whole job more efficiently
and tolerating faults ("SSP Chains": Shapiro, Dickman and Plainfossé),
collecting cycles of garbage that span subsets (Hughes, or "Garbage Collecting
the World": Lang Queinnec and Piquer), avoiding expensive memory
synchronization and I/O operations ("Larchant": Ferreira and Shapiro), using
the counting information to choose the best subset to trace (Cook Wolf and
Zorn, or "Sprite LFS": Rosenblum and Ousterhout), or avoiding problems with
transaction aborts (Amsaleg Gruber and Franklin).
Previous message: [gclist] write barrier
Next message: [gclist] FAQ: distributed objects
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]