[gclist] weak pointer and gc

Richard A. O'Keefe ok@atlas.otago.ac.nz
Wed, 18 Apr 2001 16:17:36 +1200 (NZST)


"Ji-Yong D. Chung" <virtualcyber@erols.com> wrote:
	    I have heard that, to implement a symbol
	table properly for [Sc]heme interpreters, 
	one has to use weak pointers.
	That way, it can be garbage collected.
	
	    Is this important for garbage collection
	optimization?
	
Should the symbol table be a weak table?

It all depends.  Quintus Prolog managed for years with the answer NO.
That actually worked, because in QP there wasn't *any* information at
all associated with an atom other than its name.  It even helped,
making passing atoms between C and Prolog a lot easier.

If "symbol" objects can contain pointers to other objects (as they can
in Lisp, with top level values and property lists and function cells
and so on), then hanging onto a "dead" symbol can cost you a lot of space.

If "symbol" objects are just unique immutable strings, then you might not
need to worry.  In pure RnRS/IEEE Scheme, that's what symbols are.  In
some Scheme implementations, there is additional information associated
with symbols.

You need to consider
 - other information retained because a symbol is retained
 - the space required for the symbols and their table
 - other effects of retaining symbols, such as increasing time to
   search the symbol table.

If you decide that symbols should be collectable, which SWI Prolog does,
you do NOT need a full weak-table implementation.  With a classic stop-
the-world collector, it is enough to ensure that the collector does not
treat the symbol table as one of the "roots" and that it repairs the
table after a collection.

	    I know that most local symbols (unless they are
	quoted) in scheme-expressions are converted, within
	a Scheme interpreter, into objects that represent
	lexical addresses.  Thus, unless the symbols are
	used  globally, the symbols are unlikely to be needed
	when the interpreter is running.
	
Symbols are used for many other purposes.  In Prolog, the deciding factor
for several implementors was wanting to access external databases, and
wanting to represent some of the information coming back as symbols, so
as to have fast (in)equality tests.  In a single run, several millions
of different symbols might come into existence, be used, and then be
flushed from memory.

SWI Prolog represents character data in SGML or XML documents as symbols.
	
	    It seems that such a weak-pointer symbol table
	is expensive to maintain.

NOT doing it can also be expensive.  Quintus Prolog had an architectural
limit of 2 million symbols in the symbol table.  That turned out not to
matter, because on the 4MB machines of the day, once the symbol table
grew beyond about 50000 you got thrashing.  I had to completely rewrite
the symbol table to scale well.

	At the garbage
	collection time, the collector probably has zero
	out the weak pointer references.  Later when
	the table is accessed, via an accessor method,
	the table entries and overflow
	chains probably need to be cleaned up.

What makes you think this is an expensive problem?
If the symbol table is small enough that you can afford *not* to
collect it, repairing the symbol table is cheap enough not to worry about.
If the symbol table is large enough that you *have* to collect it, there
is little you can do that would be worse than not collecting it.

	P.S.  Some time ago, there was a discussion
	thread on using hashtables for XML.  If
	XML file was read into a GC collected
	environment.  Would it be advisable to use
	weak pointered hashtable here (to
	store strings)?
	
Note that the keys in a hash table should not be changed in a way that
alters their hash code.  In practice, this means no tampering with strings.
If you have a hash table ensuring the uniqueness of a collection of
immutable strings, you have something pretty much identical to Scheme
symbols.  Over a fairly wide range of XML files, my measurements show
that this really pays off VERY well, even for "content" strings.  Large
strings tend to be unique anyway, but small strings tend to be used a lot,
often enough for uniqueness to pay.

Once again, you don't need a general purpose weak pointer mechanism here.
If you have one, you can use it, but you can also build special handling
for "symbol tables" into the garbage collector.