[gclist] Garbage collection and XML

David Bakin davidbak@microsoft.com
Tue, 6 Mar 2001 15:56:17 -0800

Since the DOM presumably defines an interface (esp. as you're talking
about a CORBA IDL) I don't see what the requirements for what strings
(mutable/immutable/16-bit/8-bit/whatever) look like on the outside have
to do with how the implementation stores nodes.  What's to keep an
implementation from doing whatever sharing and compression it wishes and
just satisfying the semantics whenever a caller traverses the DOM and
executes getters or setters for string-valued attributes?
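
A minimal Java sketch of that idea follows. Everything in it is
hypothetical (the class CompactTextNode, its pool, and the method names
are not from the DOM or any real implementation): the node stores pooled
UTF-8 bytes internally and hands out a fresh String on every read, so
callers never observe the sharing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical node type, for illustration only.
class CompactTextNode {
    // Pool of byte arrays keyed by content, so equal text is stored once.
    // The arrays are never modified after they enter the pool.
    private static final Map<ByteBuffer, byte[]> POOL = new HashMap<>();

    private byte[] utf8; // internal storage: UTF-8, possibly shared

    CompactTextNode(String data) {
        setData(data);
    }

    // Getter: decodes into a brand-new String on every call.
    // Assumes the text has no unpaired surrogates, which UTF-8 cannot hold.
    String getData() {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Setter: replaces this node's reference, never edits shared bytes.
    void setData(String data) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        utf8 = POOL.computeIfAbsent(ByteBuffer.wrap(bytes), k -> bytes);
    }

    // Test hook: true when both nodes point at the same stored array.
    boolean sharesStorageWith(CompactTextNode other) {
        return utf8 == other.utf8;
    }
}
```

Two nodes built from equal text share one byte array, yet setting the
data of one leaves the other unchanged, which is all the interface
promises.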

-- Dave

-----Original Message-----
From: Richard A. O'Keefe [mailto:ok@atlas.otago.ac.nz]
Sent: Tuesday, March 06, 2001 2:46 PM
To: chase@world.std.com; gclist@iecc.com
Cc: icis-developers@bbn.com
Subject: Re: [gclist] Garbage collection and XML

I wrote:
	>If you follow the letter of the DOM specification (the CORBA IDL,
	>not the Java and Javascript bindings) that is not *allowed*.
From: David Chase <chase@world.std.com> asked
	Pardon my potential ignorance here, but who would care if there
	were sharing, especially if:
	1 - the binding was done to a GC'd language, where
	    string is less of an issue.
The two bindings in the DOM specifications are to Java and Javascript,
where strings are immutable.  It's really difficult to figure out *what*
the DOM specifies, because
 - the primary specification is in CORBA IDL, in which every time you
   ask for a string the remote system sends you back a new copy
 - the object chosen to represent strings in the CORBA IDL for the DOM
   is a *mutable* array of 16-bit characters
 - the object chosen to represent strings in the Java and Javascript
   bindings is an *immutable* String of 16-bit characters.

I don't know about Javascript, but in Java it is perfectly possible to
have two String objects with the same (immutable!) state which must act
the same for all future time, but have distinct identities.  A Java
program which tried to keep track of which nodes strings came from by
using String identities as keys could be confused if strings were shared.
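
The identity point can be sketched in Java as follows. The class and
method names are made up for illustration; only java.util.IdentityHashMap
and the String constructor are real library calls. The map keys strings
by identity (==) rather than by equals(), which is what a program
tracking "which node did this string come from" would rely on.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical demo class, for illustration only.
class StringIdentityDemo {
    // Records the origin of two equal strings, keyed by object identity,
    // and returns how many distinct keys the map ends up with.
    static int distinctKeys(boolean shared) {
        String fromNodeA = new String("hello"); // always a new object
        // Either hand out the same object again, or a separate copy.
        String fromNodeB = shared ? fromNodeA : new String("hello");

        Map<String, String> origin = new IdentityHashMap<>();
        origin.put(fromNodeA, "node A");
        origin.put(fromNodeB, "node B"); // overwrites "node A" if shared
        return origin.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctKeys(false)); // prints 2
        System.out.println(distinctKeys(true));  // prints 1
    }
}
```

With separate copies the two origins stay distinguishable; with a shared
object the second entry silently replaces the first.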

I came to hate the DOM when I tried to implement it in Smalltalk.  Since
Smalltalk strings are *mutable*, it was important to know whether I was
allowed to return the string object already inside a text node, or whether
I had to copy it.  I couldn't figure out *what* to do, and amongst other
things discovered the contradiction above, that according to the IDL you
get a new mutable array whenever you ask about a string in the model,
according to the Java and Javascript bindings you get an immutable String.

An explicit statement about sharing in the DOM specification would help a
LOT, as would explicit advice about what to do in languages like Eiffel,
Lisp, and Smalltalk, where strings are mutable.

However, it *is* absolutely clear that no sharing of non-string objects is
allowed at all.  The figures I have show that you save a useful amount of
space by sharing attribute=value bindings.

	2 - the resulting implementation were much smaller/faster.
It is clear that there is a substantial space saving from sharing strings.
It's not just "structural" strings like element names and attribute names
either.  There are a lot of repetitions of "content" strings like attribute
values and #PCDATA nodes.

Since the DOM absolutely requires the use of UTF-16-encoded strings
(UTF-8 is *NOT ALLOWED*, still less anything more compact than that),
there is still at least a factor of two compared with what you can get
in C or Smalltalk.  Why does the DOM require UTF-16?  Because it's
an attempt to pretty up something the browser vendors bodged together to be
accessed from Javascript, which has UTF-16-encoded strings these days.

	Is this possibly just some sort of pin-headed overspecification
	that may safely be ignored, or do people actually write programs
	(in particular, Java programs) that depend on the lack of sharing?
As noted above, I have to admit that the specification is actually
inconsistent on this point.  However, I also note that there is nothing
in the Javascript reference material I recently downloaded from Netscape
or the ECMA 262 standard for ECMAscript that would make sharing
particularly easy to implement in Javascript.  I thought there _was_ a
UniqueString class in Java, but when I looked for it, I couldn't find it.
Perhaps someone can correct me about that.  It is far easier to implement
the DOM *without* string sharing in Java and Javascript.
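
On the Java side, there is no UniqueString class that I know of either,
but java.lang.String does have an intern() method that returns one
canonical object per distinct string value. The sketch below (class and
method names are hypothetical) shows two separately created copies
collapsing to a single object once interned. Whether a DOM implementation
could usefully build on this is a separate question.

```java
// Hypothetical demo class, for illustration only.
class InternDemo {
    // Creates two distinct String objects with the same contents and
    // reports whether interning maps both to one canonical object.
    static boolean internedCopiesShareIdentity(String text) {
        String a = new String(text);
        String b = new String(text);
        boolean distinctBefore = (a != b);           // two separate objects
        boolean sameAfter = (a.intern() == b.intern());
        return distinctBefore && sameAfter;
    }

    public static void main(String[] args) {
        System.out.println(internedCopiesShareIdentity("xml:lang")); // prints true
    }
}
```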

And as noted before, any other kind of sharing is *explicitly* forbidden
and could not be provided without comprehensively wrecking the entire