[gclist] Garbage collection and XML

Richard A. O'Keefe ok@atlas.otago.ac.nz
Tue, 6 Mar 2001 10:48:19 +1300 (NZDT)


	Really interesting results.  What you have shown here, though,
	does not seem to be that DOM API is terribly bad for memory savings.
	Rather, it seems to show that DOM implementations should use shared
	strings/attributes.

If you follow the letter of the DOM specification (the CORBA IDL, not the
Java and Javascript bindings) that is not *allowed*.

I think you may have overlooked one point I made, which is that the
DOM flatly and unconditionally *requires* that strings be sequences of
16-bit characters.  That's a factor of two overhead in space, even if
a DOM implementation *were* to use shared strings.

I think you may also have overlooked that for some kinds of documents,
there is a substantial saving to be made from sharing attribute
triples (Name,Type,Value), which the DOM forbids.  Not "the DOM does not
require" or "the DOM does not discuss", but "the DOM *forbids*".  You
cannot have shared attribute nodes and still call your interface "DOM".

	(1) If you look at the first result, and assume that you were
	using shared strings for DOM -- it would save nearly 1 meg.  In
	fact, if Java DOM used shared strings, it would be far more
	memory efficient than the non-shared case.. This holds for the
	rest of the examples as well.

In fact the space overheads for Java are considerably worse than that.
A Java object typically has 2 words of overhead.
A Java String object has 4 words of local data.
One of those words is a pointer to an array of Unicode characters.
(The idea is if you chop a string into substrings, the space cost per
substring is constant.)
An array of n Unicode characters will be (3 + ceiling(n/2)) words long.
The total is thus 36+4*ceiling(n/2) bytes for a string, where I assumed
4+4*ceiling(n/2) bytes.

To implement unique strings in Java would add to the space overhead
per string.

Having read the ECMAscript standard and Netscape's Javascript Reference
manual, it is clear that the space overheads for Javascript strings must
be comparable.

Note that in Java, string operations are guaranteed to return new objects,
so it is possible for a Java program to *tell* whether strings are shared
or not.  So it really isn't clear whether/how much sharing is allowed.

	(2) I noticed that DOM's memory use for attributes is bad.  But it uses
	always about 3 times as much as the unshared model.  Again, this
	seems to show that it is sharing which has more effect than the fact
	one is following the DOM API spec.
	
Yes, but the DOM treats attributes (Name,Value) like any other kind of
node.  Every attribute node knows which Element node owns it; no kind
of sharing is allowed AT ALL.  To follow the DOM API spec *is* to refuse
to share attribute nodes.  I suggest reading the DOM specification.

	These observations lead to the following conclusion:
	While DOM API is not designed for writing memory efficient apps,
	the biggest problems in Java DOM stem not from the
	API spec, but the underlying implementation.

Three kinds of item:
1) strings
   Requiring that strings be sequences of 16-bit characters rather than
   UTF-8 or TR-8 encoded guarantees AT LEAST A FACTOR OF TWO compared
   with a more compact representation, even if strings are shared.

2) attributes
   AT LEAST A FACTOR OF TWO space cost would be incurred by all the
   cross-links you need to store, even if the DOM allowed sharing,
   which it doesn't.

3) elements
   AT LEAST A FACTOR OF TWO space cost would be incurred by all the
   cross-links you need to store, even if the DOM allowed charing,
   which it doesn't.

So AT LEAST A FACTOR OF TWO space cost is due simply to the requirements
of the DOM, however clever the implementor might be.  You *can't* match
the results I quoted and still have something even close to the DOM.

	    You might ask me, how do you know if Java does not use
	shared strings?

I don't think you noticed the word "estimate".  I was careful to say that
the "DOM" numbers, unlike the others, were NOT measured, but estimated.
(As it happens, I have a DOM in C which I wrote to be sure I understood it,
but once I had, the measured time overheads convinced me not to use it.)

In fact I did not account for all the overheads in Java.  The figures for
a Java implementation of the DOM would be considerably worse.