[gclist] Garbage collection and XML

Richard A. O'Keefe ok@atlas.otago.ac.nz
Mon, 5 Mar 2001 12:05:12 +1300 (NZDT)


I wrote (concerning strings in XML):
    > The program puts all of these things in a hash table,
    > so all strings are unique.

"Ji-Yong D. Chung" <virtualcyber@erols.com> asked:
    This is a strong argument in favor of hashtables indeed -- but
    how do you determine the dimension of the hashtable?

Why should I do any such thing?  A hash table is as big as it is.
I wouldn't *dream* of "determining the dimension of" a hash table.
Dynamic hashing is by now a fairly old technique; the code I use
is based on Per-Ake Larson's April 1988 CACM paper.  (Actually, one
ejp@ausmelb.oz wrote the code, and I rewrote it.)

Perl and TCL both have dynamically resizing hash table code that you
can rip out and use in other programs.  I haven't tried the TCL code,
but the Perl code is about as fast as the Larson/ejp code.

    One last detail -- from looking at your previous email messages, I'd
    guess that you used reference counting, right?  Also, from what Ken
    Anderson said, I'd guess that one could obtain similar results
    (provided the code is of the similar quality) from a parser that
    uses garbage collector.

In fact my code was written for a group of applications that load a
document into memory, walk over it a couple of times in order to
extract and format information, and then exit.  My "garbage collector"
was thus the C exit() function.

When you have stuff in a hash table like this, you have to be careful to
ensure that the hash table structure itself does not deceive the garbage
collector into believing strings/nodes are still live.