[gclist] Garbage collection and XML

Richard A. O'Keefe ok@atlas.otago.ac.nz
Thu, 8 Mar 2001 11:54:47 +1300 (NZDT)


Bill Foote <Bill.Foote@eng.sun.com> wrote:
	So I take back what I said.  With enough work, once actually could word a normative,
	testable requirement of no String object sharing in a Java API.  It's hard to do
	and doesn't makes sense, but it can be done.
	
	Anyway, this is moot as it sounds like DOM didn't go that far.
	
The DOM goes rather further than you might suppose.
Recall that I divided strings into two kinds:

 - structural strings (element names and attribute names)
 - content strings (#PCDATA and attribute values).

If you know a bit of SGML or HTML you might think an attribute value
is just a string, e.g.
	<p class="acknowledgement">
But SGML allows a string to contain macro references, e.g.
	<p class="&ack; for support">
No worries, the only macros allowed in HTML are the predefined character
names, and in SGML the macros are supposed to expand out and just be
strings (roughly speaking).

XML is different.  XML allows a document to be processed "half way",
where macros are not expanded.  So
	<p class="&ack; for support">
would, in the DOM, be something like
	Element (name = "p", attributes =
	    Attr (name = "class", first child =
	        EntityReference (name = "ack", next sibling =
	        Text (value = " for support", next sibling = NIL ))))

So in the DOM, there are two ways to get at the value of an attribute:
    Attr.value	- a string
    Attr.childNodes  - a sequence of Text and/or EntityReference nodes.
(By the way, if an attribute's .childNodes include an EntityReference node,
it is by no means clear what the .value should be.  I cannot find any clear
explanation of this.)   "On setting" the .value of an Attr "this creates a
Text node".

Now the smallest I can get a Text node down to is 6 words.

The Document Value Model that I've been talking about is intended only for
valid SGML/XML documents, wherein all macros must have been expanded, so
that there is no need for the Document Object Model's partially digested
attribute values.

So *every* content string in the DOM pays at least a 6-word overhead
that is not paid in the DVM.  While it is arguable that the strings themselves
might be shared, it is unarguable that these Text nodes must NOT be shared.

It would be possible to use lazy initialisation for the children of an Attr
node.

What we see here that has general application is
1.  The DOM is an API for HTML/XML *editors*, designed to support
    frequent tiny changes to an incompletely processed document.

    The DVM is a data structure for SGML/XML *processors*, designed
    to support efficient storage and traversal of completely parsed
    and validated documents and efficent creation of transformed
    documents.

2.  A data structure designed for one use need not be expected to be
    good for other uses.  The DOM is horribly clumsy and inefficient
    for processing documents; the DVM cannot represent incompletely
    parsed documents at all.

3.  The rather large storage costs of the DOM compared with the DVM
    (even *with* shared strings) can be traced to its requirements
    combined with the decision to use mutation as a primary programming
    tool.

The general point therefore is that supporting functionality you don't
need (in my case, in-place edits and half-parsed documents) can cost you
a lot; garbage avoidance starts with data structures that are no more
capable than they need to be.