[gclist] object identity & copying collection

Mark Tillotson markt@harlequin.co.uk
Wed, 10 Dec 1997 13:45:59 GMT

Fergus Henderson <fjh@cs.mu.oz.au> wrote:
| Hi,
| In Appel's "Modern compiler implementation in ML", exercise 13.4 says
| "... Explain the problem in implementing hashCode() ... in a Java system
| with copying garbage collection, and propose a solution."
| Explaining the problem is easy enough (you can't just use the
| object address, because it will change), and proposing a solution
| is also easy, but proposing a *good* solution is a bit harder.
| What is the "textbook" solution for this problem?
Java gets the protocol "wrong" by separating hashCode() from the act of
hash probing, without providing any means of determining whether a hash
value depends on the addresses of objects.  Various other languages
avoid this problem either by keeping the hash function hidden inside
the hash-table implementation (Common Lisp and others), or by providing
an enhanced protocol in which address changes are spotted and the
affected entries are rehashed (Dylan).

| Some less-than-ideal solutions:
| 	- hash values can be computed lazily in hashCode()
| 	  using a random number generator; they can then
| 	  be stored by either
| 	  - having a hash value field in each object (ugh!)
Memory is so cheap these days that this is, to my mind, the clear
winner: it is fast and it is easy to get right.  The average object in
large real systems using heap storage is often found to be around 10
words, so one more word is not a heavy penalty.

A PRNG is probably not needed; allocating hash values consecutively is
usually good enough.  (The case where this fails is when every Nth
object allocated is stuck in a table and  gcd(N,P)!=1  where P is the
(prime) size of the hash table, and that case can be avoided by
collision detection and rehashing into the next larger prime.)
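In Java terms, the field-plus-counter scheme might be sketched like this
(HashedObject and NEXT_HASH are hypothetical names; a real JVM would keep
the hash word in the object header and assign it from runtime code rather
than from a library class):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a lazily assigned, consecutively allocated identity hash
// stored in a per-object field.  Single-threaded sketch: a real
// implementation would publish the hash word with a CAS.
class HashedObject {
    private static final AtomicInteger NEXT_HASH = new AtomicInteger(1);
    private int hash;  // 0 means "not yet assigned"

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {                        // lazily allocate on first use
            h = NEXT_HASH.getAndIncrement(); // consecutive, not random
            hash = h;                        // stored in the object, so it
        }                                    // survives copying collections
        return h;
    }
}
```

Because the value lives in the object itself, the collector can copy the
object freely and the hash never changes.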

The laziness also provides caching for the more complicated hashCode()
methods, obviating the need to hand-optimize them to speed up hash-table
access.
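That caching idiom is exactly what java.lang.String does with its own
hash field; a minimal sketch with a hypothetical Point class:

```java
// Caching a content-based hashCode() in a field, in the style of
// java.lang.String.  Point is an illustrative class, not library code.
final class Point {
    final int x, y;
    private int hash;  // 0 = not yet computed (a genuine hash of 0 is
                       // merely recomputed each call, which is harmless)

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            h = 31 * x + y;  // the "complicated" part, done at most once
            hash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
    }
}
```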

| 	  - keeping a global table (binary tree) mapping
| 	    from objects to hash values.  This requires
| 	    that the collector preserve relative ordering
| 	    (otherwise you'd have to use a linked list, double-ugh!).
| 	    You also want the table keys to be weak pointers.
Why use a binary tree when you can use a hash table?  Such an
implementation-private hash table _can_ use object addresses, because it
is free to implement its own hash invalidation protocol.  This places
few constraints on the GC, except that it must provide enough
information to determine whether an object's address may have changed
since the object was last placed in a particular table.
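A runtime-level sketch of such an invalidation protocol, simulating
object addresses and the collector with a MockObject.address field and a
gcEpoch counter (all names are hypothetical; a real runtime would read
the true address and a real collection count):

```java
import java.util.ArrayList;
import java.util.List;

// An implementation-private table mapping objects to stable identity
// hashes, with buckets keyed on the object's current (movable) address.
class MockObject {
    long address;                         // rewritten when the "GC" copies the object
    MockObject(long a) { address = a; }
}

class IdentityHashTable {
    static long gcEpoch = 0;              // bumped by the collector after each copy phase

    private static class Entry {
        final MockObject obj;
        final int hash;
        Entry(MockObject o, int h) { obj = o; hash = h; }
    }

    private static final int NBUCKETS = 16;
    private final List<List<Entry>> buckets = new ArrayList<>();
    private long epochSeen = 0;           // epoch for which the bucket layout is valid
    private int nextHash = 1;

    IdentityHashTable() {
        for (int i = 0; i < NBUCKETS; i++) buckets.add(new ArrayList<>());
    }

    private int bucketOf(long addr) { return (int) (addr % NBUCKETS); }

    // The invalidation protocol: if a collection has happened since we
    // last probed, every address may be stale, so redistribute entries
    // by their objects' current addresses.
    private void maybeRehash() {
        if (epochSeen == gcEpoch) return;
        List<Entry> all = new ArrayList<>();
        for (List<Entry> b : buckets) { all.addAll(b); b.clear(); }
        for (Entry e : all) buckets.get(bucketOf(e.obj.address)).add(e);
        epochSeen = gcEpoch;
    }

    int hashFor(MockObject o) {
        maybeRehash();
        List<Entry> b = buckets.get(bucketOf(o.address));
        for (Entry e : b)
            if (e.obj == o) return e.hash;  // found: hash is stable
        Entry e = new Entry(o, nextHash++);
        b.add(e);
        return e.hash;
    }
}
```

If the collector failed to bump gcEpoch after moving objects, a lookup
would probe the wrong bucket, miss, and hand out a fresh hash; that is
precisely the failure the invalidation protocol exists to prevent.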

Note that such an address-keyed mapping could be useful for other
purposes as well, such as persistent storage and object monitors.

This approach, while more complex and slower, does have better space
requirements: if you don't use hash tables, you don't pay the price in
space.  However, this has to be balanced against the time penalty for
hash-table access, which is at least doubled (there are two hash-table
accesses rather than one).  In a system where most objects are really
small and rarely hashed, this approach has definite advantages, but I
believe that is rarely the case in real systems these days.

[ markt@harlequin.co.uk | http://www.harlequin.co.uk/ | +44(0)1954 785433 ]