[gclist] Garbage collection and XML

Richard A. O'Keefe ok@atlas.otago.ac.nz
Wed, 7 Mar 2001 11:46:03 +1300 (NZDT)


I wrote:
	>If you follow the letter of the DOM specification (the CORBA IDL, not the
	>Java and Javascript bindings) that is not *allowed*.
	
David Chase <chase@world.std.com> asked:
	Pardon my potential ignorance here, but who would care if there
	were sharing, especially if:
	
	1 - the binding was done to a GC'd language, where last-owner-of-a
	    string is less of an issue.
	
The two bindings in the DOM specifications are to Java and Javascript,
where strings are immutable.  It's really difficult to figure out *what*
the DOM specifies, because
 - the primary specification is in CORBA IDL, in which every time you
   ask for a string the remote system sends you back a new copy
 - the object chosen to represent strings in the CORBA IDL for the DOM
   is a *mutable* array of 16-bit characters
 - the object chosen to represent strings in the Java and Javascript
   bindings is an *immutable* String of 16-bit characters.

I don't know about Javascript, but in Java it is perfectly possible to
have two String objects with the same (immutable!) state which must act
the same for all future time, but have distinct identities.  A Java
program which tried to keep track of which nodes strings came from by
using String identities as keys could be confused if strings were shared.
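
A minimal sketch of the kind of confusion I mean.  TextNode and
distinctOwners are hypothetical names made up for this illustration,
not part of any DOM binding; the only library class involved is
java.util.IdentityHashMap, which compares keys with == rather than
equals().

```java
import java.util.IdentityHashMap;
import java.util.Map;

class StringIdentityDemo {
    // Hypothetical stand-in for a DOM text node; not a real DOM interface.
    static final class TextNode {
        private final String data;
        TextNode(String data) { this.data = data; }
        String getData() { return data; }
    }

    // Records which node each string came from, keyed by String identity,
    // and reports how many distinct owners were remembered.
    static int distinctOwners(TextNode a, TextNode b) {
        Map<String, TextNode> owner = new IdentityHashMap<>();
        owner.put(a.getData(), a);
        owner.put(b.getData(), b);
        return owner.size();
    }

    public static void main(String[] args) {
        // Two String objects with the same state but distinct identities.
        String s1 = new String("hello");
        String s2 = new String("hello");
        System.out.println(s1.equals(s2)); // true
        System.out.println(s1 == s2);      // false

        // Unshared strings: both owners are remembered.
        System.out.println(distinctOwners(new TextNode(s1), new TextNode(s2))); // 2

        // Shared string: the second put() overwrites the first owner.
        System.out.println(distinctOwners(new TextNode(s1), new TextNode(s1))); // 1
    }
}
```

Whether any real program does this is another question; the sketch only
shows that sharing is observable from Java.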

I came to hate the DOM when I tried to implement it in Smalltalk.  Since
Smalltalk strings are *mutable*, it was important to know whether I was
allowed to return the string object already inside a text node, or whether
I had to copy it.  I couldn't figure out *what* to do, and amongst other
things discovered the contradiction above, that according to the IDL you
get a new mutable array whenever you ask about a string in the model, but
according to the Java and Javascript bindings you get an immutable object.

An explicit statement about sharing in the DOM specification would help a
LOT, as would explicit advice about what to do in languages like Eiffel,
Lisp, and Smalltalk, where strings are mutable.

However, it *is* absolutely clear that no sharing of non-string objects is
allowed at all, even though the figures I have show that you would save a
useful amount of space by sharing attribute=value bindings.

	2 - the resulting implementation were much smaller/faster.
	
It is clear that there is a substantial space saving from sharing strings.
It's not just "structural" strings like element names and attribute names
either.  There are a lot of repetitions of "content" strings like attribute
values and #PCDATA nodes.

Since the DOM absolutely requires the use of UTF-16-encoded strings
(UTF-8 is *NOT ALLOWED*, still less anything more compact than that),
even with sharing its strings still take at least twice the space of
what you can get in C or Smalltalk.  Why does the DOM require UTF-16?
Because it's *really* an attempt to pretty up something the browser
vendors bodged together to be accessed from Javascript, which has
UTF-16-encoded strings these days.
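
To make the factor of two concrete, here is a small sketch that counts
the bytes needed for the same text in UTF-8 and in UTF-16.  It assumes
the text is plain ASCII, as element names and much markup are; the class
and method names are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    // Bytes needed to store s as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to store s as UTF-16; the BE variant is used so that
    // no byte-order mark is added to the count.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String name = "chapter";               // assumed sample: ASCII only
        System.out.println(utf8Bytes(name));   // 7
        System.out.println(utf16Bytes(name));  // 14
    }
}
```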

	Is this possibly just some sort of pin-headed overspecification
	that may safely be ignored, or do people actually write programs
	(in particular, Java programs) that depend on the lack of sharing?
	
As noted above, I have to admit that the specification is actually
inconsistent on this point.  However, I also note that there is nothing
in the Javascript reference material I recently downloaded from Netscape
or in the ECMA-262 standard for ECMAScript that would make sharing
particularly easy to implement in Javascript.  I thought there _was_ a
UniqueString class in Java, but when I looked for it, I couldn't find one.
Perhaps someone can correct me about that.  It is far easier to implement
the DOM *without* string sharing in Java and Javascript.
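
For what it is worth, java.lang.String does have an intern() method,
which returns one canonical object for all equal strings; that may be
what I was half remembering.  The sketch below shows it next to a
hand-rolled pool.  StringPool and canonical() are hypothetical names,
not taken from any DOM implementation, and this only illustrates what
sharing would involve, not how anyone actually does it.

```java
import java.util.HashMap;
import java.util.Map;

class StringSharingDemo {
    // Hypothetical per-document pool: maps each distinct string value to
    // the first String object seen with that value.
    static final class StringPool {
        private final Map<String, String> pool = new HashMap<>();

        String canonical(String s) {
            String existing = pool.putIfAbsent(s, s);
            return existing != null ? existing : s;
        }

        int size() { return pool.size(); }
    }

    public static void main(String[] args) {
        String a = new String("para");
        String b = new String("para");
        System.out.println(a == b);                   // false: two objects
        System.out.println(a.intern() == b.intern()); // true: one canonical object

        StringPool pool = new StringPool();
        System.out.println(pool.canonical(a) == pool.canonical(b)); // true
        System.out.println(pool.size());                            // 1
    }
}
```

Either way the implementation has to do a table lookup for every string
it stores, which is extra work compared with not sharing at all.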

And as noted before, any other kind of sharing is *explicitly* forbidden,
and could not be provided without comprehensively wrecking the entire design.