[gclist] Garbage collection and XML

Jerrold Leichter jerrold.leichter@smarts.com
Fri, 2 Mar 2001 10:36:35 -0500 (EST)

| > The program puts all of these things in a hash table, so all strings are
| > unique.  When I wrote this I was concerned with the effectiveness of the
| > hash table, and whether it would be a good idea to do my own mallocking
| > out of blocks instead of allocating each string separately. (Yes it was.)
|     This is a strong argument in favor of hashtables indeed -- but how do
| you determine the dimension of the hashtable?  My guess here would be that
| you chose the size based on the XML file size.  I'd think that a static
| hashtable with a predetermined size would not work, because file sizes
| can really vary....

Dynamically adjustable hash tables have been around for 25 years.  They never
seem to have made it into the standard textbooks, and most programmers are
unaware of them.  To this day, textbooks continue to list fixed size as a
minus for hash tables.  (The contents of textbooks gets frozen in place:  The
textbooks define the course requirements, which in turn define the constraints
on newer textbooks for the same courses.  Compare a data structures textbook
published today to one published in 1980 and what you'll find is that the new
one uses object-oriented concepts and C++ or Java, while the older one uses
Pascal or C - but the actual algorithms presented are pretty much the same,
and generally all those algorithms were familiar by the mid-70's.  Tying this
back to GC:  You can see the same phenomenon in compiler texts, or any other
texts that might be expected to discuss garbage collection techniques:  They
say little - and much of what they *do* say is long obsolete - because they've
*always* said little.)

I've implemented and used with very good results the "linear hashing" algo-
rithm, described by Witold Litwin in Proc. 6th International Conf. on Very
Large Databases, 1980.  A more easily accessible reference is Per-\AA ke Larson
in CACM 31 (1988), pg 446--457.  Linear hashing gives you expected constant
time access using a table whose size is (expected) bounded as a constant
multiple of the number of elements in the table.  It grows and shrinks
dynamically in small increments - the (expected) cost per growth/shrinkage
again a constant.  All these constants are linearly related to a user-settable
parameter which is basically the target load factor for the table.

I can't distribute my code because it's part of a product.  However, to give
you an idea of what's involved, the basic implemention is about 500 lines of
extensively commented C++ code.  This is for a templated class with your
usual insert, lookup, and delete operations, plus an embedded class that
implements a "cursor" abstraction for building iterators, and even a function
to compute and print in a nice format a number of statistics about instances
of the class.  Since it doesn't come up in my application, I never implemented
the code to shrink a hash table.  (Then again, the function to grow the table
is all of about 35 commented lines; the code to shrink would be about the
same.)  The individual hash buckets use a class somewhat like an STL deque,
but simpler and much more space-efficient.  The externally-visible classes
that use this as a base are container classes analogous to, but rather
different from, the STL.
							-- Jerry