[gclist] Garbage collection and XML

Richard A. O'Keefe ok@atlas.otago.ac.nz
Wed, 7 Mar 2001 15:32:05 +1300 (NZDT)

	From: "David Bakin" <davidbak@microsoft.com>

	Since the DOM presumably defines an interface (esp. as you're talking
	about a CORBA IDL) I don't see what the requirement for what strings
	(mutable/immutable/16-bit/8-bit/whatever) look like on the outside have
	to do with how the implementation stores nodes.

*Strings* and *Nodes* are indeed different things, with different sharing
payoffs and possibilities.

But it is precisely the job of an interface to state the properties of
the objects it is an interface *to*.

The CORBA IDL in the DOM specifications (which I keep on-line and check
when I make claims about the DOM) state that when you ask a document
model for a string, what you get back is a sequence of characters.
With respect to encoding, the DOM explicitly says
    "APPLICATIONS must encode DOMString using UTF-16."

Never mind the DOM:  for all alphabetic and some syllabic scripts you can
do a *lot* better space-wise than using UTF-16, which is what you get in
Java or Javascript.

Thing is, if you *don't* implement things pretty close to the "natural"
image of the DOM, you are going to do unbelievable amounts of copying.

I discussed sharing at four levels:
    - "structural" strings			HUGE payoff for sharing
    - "content" strings				SERIOUS payoff for sharing
    - **XML** attribute=value nodes		SERIOUS payoff for sharing
    - element nodes				MODERATE payoff for sharing

If anyone will take the trouble to read the DOM, they will discover that
sharing the last two of those are unambiguously ruled out, in the sense
that if you have
    <foo><bar>Ho</bar> <bar>Ho</bar> <bar>Ho</bar></foo>
then you *must* have three distinct <bar>Ho</bar> Element nodes with
three distinct identities and distinct relationships to other nodes.

If a DOM implementation implements the DocumentType at all well, it is
likely that it will share "structural" strings.  

A DOM implementation cannot share attribute=value or Element nodes
without some very fancy and rather expensive behind-the-scenes calculation,
which would more than offset the space savings.

The one open area is sharing "content" strings (values of attributes and
PCData nodes).  The DOM actually envisages that these will be too big for
your programming language so that you have to get the information out in
pieces (CharacterData::substringData()).  A DOM implementor who hasn't
done the measurements might think (mistakenly) that it didn't pay to share
content strings.  Does anyone actually *know* what actual DOM implementations

The reported high space cost of DOM use (I have heard 8 to 10 times as
many bytes in core as XML on disc) is most simply explained on the hypothesis
that popular DOM implementations *don't* share content strings.

	What's to keep an implementation from doing whatever sharing and
	compression it wishes and just satisfying the semantics whenever
	a caller traverses the DOM and executes getters or setters for string
	valued attributes?
A four-letter word:  T I M E.

Amongst other things, the Document Object Model is an *Object* model and
is totally oriented to manipulating documents by mutating data structures.
When you select a collection of nodes, a NodeList is stated to be "live";
mutations to the original document show up in the list.  Until you've
actually read the DOM2, you wouldn't believe how complex things can get
when someone makes a serious attempt to define iteration over mutating
data structures.

This is the garbage collection list, not the DOM list, so I'd like to
close with a number of general storage management observations.

1.  Garbage collection or no garbage collection,
    garbage *avoidance* in the design is a good thing if you can do
    it without compromising other goals.

2.  Hash consing is astonishingly effective,
    but only with data you aren't mutating a lot, and which is not
    required to have its own unique identity.

3.  There is no substitute for measurements on real data.

4.  Seemingly unimportant design decisions can have huge effects on
    memory requirements.

5.  There is a tradeoff between memory and time, but it doesn't pay
    to assume you could afford both ends of the tradeoff.

6.  Sturgeon's Law applies to standards, even W3C recommendations.