[gclist] weak pointer and gc
Richard A. O'Keefe
ok@atlas.otago.ac.nz
Wed, 18 Apr 2001 16:17:36 +1200 (NZST)
"Ji-Yong D. Chung" <virtualcyber@erols.com> wrote:
I have heard that, to implement a symbol
table properly for [Sc]heme interpreters,
one has to use weak pointers.
That way, it can be garbage collected.
Is this important for garbage collection
optimization?
Should the symbol table be a weak table?
It all depends. Quintus Prolog managed for years with the answer NO.
That actually worked, because in QP there wasn't *any* information at
all associated with an atom other than its name. It even helped,
making passing atoms between C and Prolog a lot easier.
If "symbol" objects can contain pointers to other objects (as they can
in Lisp, with top level values and property lists and function cells
and so on), then hanging onto a "dead" symbol can cost you a lot of space.
If "symbol" objects are just unique immutable strings, then you might not
need to worry. In pure RnRS/IEEE Scheme, that's what symbols are. In
some Scheme implementations, there is additional information associated
with symbols.
You need to consider
- other information retained because a symbol is retained
- the space required for the symbols and their table
- other effects of retaining symbols, such as increasing time to
search the symbol table.
If you decide that symbols should be collectable, which SWI Prolog does,
you do NOT need a full weak-table implementation. With a classic stop-
the-world collector, it is enough to ensure that the collector does not
treat the symbol table as one of the "roots" and that it repairs the
table after a collection.
I know that most local symbols (unless they are
quoted) in scheme-expressions are converted, within
a Scheme interpreter, into objects that represent
lexical addresses. Thus, unless the symbols are
used globally, the symbols are unlikely to be needed
when the interpreter is running.
Symbols are used for many other purposes. In Prolog, the deciding factor
for several implementors was wanting to access external databases, and
wanting to represent some of the information coming back as symbols, so
as to have fast (in)equality tests. In a single run, several millions
of different symbols might come into existence, be used, and then be
flushed from memory.
SWI Prolog represents character data in SGML or XML documents as symbols.
It seems that such a weak-pointer symbol table
is expensive to maintain.
NOT doing it can also be expensive. Quintus Prolog had an architectural
limit of 2 million symbols in the symbol table. That turned out not to
matter, because on the 4MB machines of the day, once the symbol table
grew beyond about 50000 you got thrashing. I had to completely rewrite
the symbol table to scale well.
At the garbage
collection time, the collector probably has zero
out the weak pointer references. Later when
the table is accessed, via an accessor method,
the table entries and overflow
chains probably need to be cleaned up.
What makes you think this is an expensive problem?
If the symbol table is small enough that you can afford *not* to
collect it, repairing the symbol table is cheap enough not to worry about.
If the symbol table is large enough that you *have* to collect it, there
is little you can do that would be worse than not collecting it.
P.S. Some time ago, there was a discussion
thread on using hashtables for XML. If
XML file was read into a GC collected
environment. Would it be advisable to use
weak pointered hashtable here (to
store strings)?
Note that the keys in a hash table should not be changed in a way that
alters their hash code. In practice, this means no tampering with strings.
If you have a hash table ensuring the uniqueness of a collection of
immutable strings, you have something pretty much identical to Scheme
symbols. Over a fairly wide range of XML files, my measurements show
that this really pays off VERY well, even for "content" strings. Large
strings tend to be unique anyway, but small strings tend to be used a lot,
often enough for uniqueness to pay.
Once again, you don't need a general purpose weak pointer mechanism here.
If you have one, you can use it, but you can also build special handling
for "symbol tables" into the garbage collector.