[gclist] Garbage collection and XML

Boehm, Hans hans_boehm@hp.com
Fri, 2 Mar 2001 09:23:15 -0800


Dynamically resizable hash tables exist in other places, too.

They're discussed in "The Design and Analysis of Computer Algorithms", Aho,
Hopcroft, and Ullman, 1974.

They're used in SGI's STL implementation. (See
http://www.sgi.com/tech/stl/stl_hashtable.h for the implementation, which is
unfortunately uglified to be namespace-correct.)

They're also used to keep track of things like finalizable objects in our
collector. (See
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/boehm-gc/finalize.c?rev=1.5&content-type=text/x-cvsweb-markup
for the gory details.)

The above implementations use chained hash buckets, but I think that's
pretty much an orthogonal issue, unless you want clever implementations to
bring down the constant in the running time of the resize operation.
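
For readers who haven't run into them, here is a minimal sketch (not the
SGI STL or boehm-gc code) of the usual chained-bucket table that doubles
its bucket array when the load factor gets too high.  The class name, the
initial size of 8, and the 0.75 threshold are just illustrative choices.
(A sketch of the linear hashing variant Jerry describes appears after his
message below.)

#include <cstddef>
#include <functional>
#include <vector>

template <typename Key>
class ResizableHashSet {
    struct Node {
        Key key;
        Node* next;
    };
    std::vector<Node*> buckets_;
    std::size_t size_ = 0;

    std::size_t index(const Key& k, std::size_t nbuckets) const {
        return std::hash<Key>{}(k) % nbuckets;
    }

    // Double the bucket array and re-link every node into it.  This is
    // the O(n) step whose constant factor I mentioned above.
    void grow() {
        std::vector<Node*> bigger(buckets_.size() * 2, nullptr);
        for (Node* head : buckets_) {
            while (head) {
                Node* next = head->next;
                std::size_t i = index(head->key, bigger.size());
                head->next = bigger[i];
                bigger[i] = head;
                head = next;
            }
        }
        buckets_.swap(bigger);
    }

public:
    ResizableHashSet() : buckets_(8, nullptr) {}
    ResizableHashSet(const ResizableHashSet&) = delete;
    ResizableHashSet& operator=(const ResizableHashSet&) = delete;
    ~ResizableHashSet() {
        for (Node* head : buckets_)
            while (head) { Node* next = head->next; delete head; head = next; }
    }

    bool contains(const Key& k) const {
        for (Node* p = buckets_[index(k, buckets_.size())]; p; p = p->next)
            if (p->key == k) return true;
        return false;
    }

    // Insert if absent; grow first when the load factor would exceed 0.75.
    bool insert(const Key& k) {
        if (contains(k)) return false;
        if (size_ + 1 > buckets_.size() * 3 / 4) grow();
        std::size_t i = index(k, buckets_.size());
        buckets_[i] = new Node{k, buckets_[i]};
        ++size_;
        return true;
    }
};

The grow() step is linear in the table size, but it runs only each time
the table doubles, so insertion is still amortized constant time.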

Hans

> -----Original Message-----
> From: Jerrold Leichter [mailto:jerrold.leichter@smarts.com]
> Sent: Friday, March 02, 2001 7:37 AM
> To: Ji-Yong D. Chung
> Cc: Richard A. O'Keefe; gclist@iecc.com
> Subject: Re: [gclist] Garbage collection and XML
> 
> 
> | > The program puts all of these things in a hash table, so all
> | > strings are unique.  When I wrote this I was concerned with the
> | > effectiveness of the hash table, and whether it would be a good
> | > idea to do my own mallocking out of blocks instead of allocating
> | > each string separately. (Yes it was.)
> |
> |     This is a strong argument in favor of hashtables indeed -- but
> | how do you determine the dimension of the hashtable?  My guess here
> | would be that you chose the size based on the XML file size.  I'd
> | think that a static hashtable with a predetermined size would not
> | work, because file sizes can really vary....
> 
> Dynamically adjustable hash tables have been around for 25 
> years.  They never
> seem to have made it into the standard textbooks, and most 
> programmers are
> unaware of them.  To this day, textbooks continue to list 
> fixed size as a
> minus for hash tables.  (The contents of textbooks get
> frozen in place:  The
> textbooks define the course requirements, which in turn 
> define the constraints
> on newer textbooks for the same courses.  Compare a data 
> structures textbook
> published today to one published in 1980 and what you'll find 
> is that the new
> one uses object-oriented concepts and C++ or Java, while the 
> older one uses
> Pascal or C - but the actual algorithms presented are pretty 
> much the same,
> and generally all those algorithms were familiar by the 
> mid-70's.  Tying this
> back to GC:  You can see the same phenomenon in compiler 
> texts, or any other
> texts that might be expected to discuss garbage collection 
> techniques:  They
> say little - and much of what they *do* say is long obsolete 
> - because they've
> *always* said little.)
> 
> I've implemented and used with very good results the "linear hashing"
> algorithm, described by Witold Litwin in Proc. 6th International Conf.
> on Very Large Databases, 1980.  A more easily accessible reference is
> Per-Åke Larson in CACM 31 (1988), pp. 446-457.  Linear hashing gives
> you expected constant
> expected constant
> time access using a table whose size is (expected) bounded as 
> a constant
> multiple of the number of elements in the table.  It grows and shrinks
> dynamically in small increments - the (expected) cost per 
> growth/shrinkage
> again a constant.  All these constants are linearly related 
> to a user-settable
> parameter which is basically the target load factor for the table.
> 
> I can't distribute my code because it's part of a product.  
> However, to give
> you an idea of what's involved, the basic implementation is 
> about 500 lines of
> extensively commented C++ code.  This is for a templated 
> class with your
> usual insert, lookup, and delete operations, plus an embedded 
> class that
> implements a "cursor" abstraction for building iterators, and 
> even a function
> to compute and print in a nice format a number of statistics 
> about instances
> of the class.  Since it doesn't come up in my application, I 
> never implemented
> the code to shrink a hash table.  (Then again, the function 
> to grow the table
> is all of about 35 commented lines; the code to shrink would 
> be about the
> same.)  The individual hash buckets use a class somewhat like 
> an STL deque,
> but simpler and much more space-efficient.  The 
> externally-visible classes
> that use this as a base are container classes analogous to, but rather
> different from, the STL.
> 							-- Jerry
>
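
To make the description above concrete, here is a minimal sketch of
linear hashing in C++.  This is not Jerry's code (which he can't
distribute); the class name, the chained std::list buckets, and the 0.75
target load factor are just illustrative assumptions.  Each insertion
that pushes the load factor over the target splits exactly one bucket,
and shrinking is omitted, as in his implementation.

#include <cstddef>
#include <functional>
#include <list>
#include <vector>

// Linear hashing in the Litwin/Larson style: the table grows one bucket
// at a time by splitting the bucket at "split_", so no single insertion
// ever pays for rehashing the whole table.
template <typename Key>
class LinearHashSet {
    std::size_t base_ = 4;     // bucket count at level 0
    std::size_t level_ = 0;    // how many times the address space has doubled
    std::size_t split_ = 0;    // next bucket to be split
    std::size_t size_ = 0;
    double max_load_ = 0.75;   // target load factor (the user-settable knob)
    std::vector<std::list<Key>> buckets_;

    std::size_t addr(const Key& k) const {
        std::size_t h = std::hash<Key>{}(k);
        std::size_t n = base_ << level_;     // buckets addressed at this level
        std::size_t a = h % n;
        if (a < split_) a = h % (n << 1);    // already split: use the finer hash
        return a;
    }

    // Split one bucket: append a new bucket and move over the keys that
    // hash to it under the finer function.  (Roughly the "35 commented
    // lines" of growth code Jerry mentions.)
    void split() {
        std::size_t n = base_ << level_;
        buckets_.emplace_back();
        std::list<Key>& old_b = buckets_[split_];
        std::list<Key>& new_b = buckets_.back();
        for (auto it = old_b.begin(); it != old_b.end(); ) {
            if (std::hash<Key>{}(*it) % (n << 1) != split_) {
                new_b.push_back(*it);
                it = old_b.erase(it);
            } else {
                ++it;
            }
        }
        if (++split_ == n) { split_ = 0; ++level_; }  // address space doubled
    }

public:
    LinearHashSet() : buckets_(base_) {}

    bool contains(const Key& k) const {
        for (const Key& x : buckets_[addr(k)])
            if (x == k) return true;
        return false;
    }

    // Insert if absent; split one bucket whenever the load factor
    // exceeds the target.  Shrinking (merging the last bucket back in)
    // would be the mirror image and is omitted here.
    bool insert(const Key& k) {
        if (contains(k)) return false;
        buckets_[addr(k)].push_back(k);
        ++size_;
        if (static_cast<double>(size_) / buckets_.size() > max_load_) split();
        return true;
    }
};

The difference from wholesale doubling is that split() touches a single
bucket, so the cost of growth is spread evenly over insertions rather
than concentrated in occasional full rehashes.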