[gclist] Garbage collection and XML

Richard A. O'Keefe ok@atlas.otago.ac.nz
Mon, 5 Mar 2001 18:11:17 +1300 (NZDT)

Ken Anderson <kanderson@bbn.com> provided these numbers
about the DOM:
	Here's a summary of loading the CIM_Schema23.xml file from
	http://www.dmtf.org/spec/cim_schema_v23.html into a Java DOM.
	This XML file describe a class hierarchy of 732 classes of the
	DMTF/CIM standard.  Some Statistics:
	File sizes:
	MBytes of What
	 3.76      XML
	 0.27      zipped
	40.2       DOM (IBM)
	Sharing strings can save a significant amount, and some XML parsers
	will do that.  Vector's are given a default size of 10, while 74% have
	size 1, and 12% have size 3.  Trimming Vectors to the right size saves
	I wrote a SAX parser that read the same file but produced an
	s-expression with structure sharing.  It required only 4.0MB, slightly
	more than the ASCII XML size.  
I have added more code to my program to report where the space is going.
Measurements were made on four moderately large files:
 - my collection of CS examination papers
 - SigmodRecord.xml; an example I picked up off the net somewhere
   and "cleaned", reducing it to 72% of its original size.
 - the collected plays of Shakespeare (XML version).  I must say that
   the markup here is not very idiomatic; ID/IDREF(S) attributes could
   have been used to excellent effect but weren't.
 - DocBook; the SGML source of the O'Reilly DocBook book (it's on the
   CD-ROM) that comes with the book.

The space required *with* sharing is measured.  It does NOT include
space for the hash tables themselves, but does include hash links.

The space required *without* sharing is partly measured and partly
calculated (the main calculation is subtracting off 4 bytes for a hash
link from each node).

the DOM estimate assumes that strings are arrays of 16-bit characters
(required) + 4 byte length field, padded to multiple of 4 bytes; other
nodes are all 10 4-byte words (the smallest I could get that would do
everything DOM2).

The file size is as reported by ls -l or wc (which agree).

EXAMS: a collection of CS examination papers (my DTD)

		    strings +     attrs +  elements =     total
Without sharing:  3,755,464 +   312,936 + 1,288,364 = 5,356,764
With sharing:        65,216 +     1,328 +   751,248 =   817,792
DOM estimate:     5,778,556 + 1,043,120 + 2,460,680 = 9,282,356
Source exams.xml:		 	 	    = 2,634,021
DOM/shared = 11.3

SIGMOD: the SigMod Record catalogue

		   strings +   attrs + elements =     total
Without sharing:   648,824 +  82,548 +  221,100 =   952,472
With sharing:      131,784 +  16,752 +  217,748 =   366,284
DOM estimate:    1,189,308 + 275,160 +  332,680 = 1,797,148
Source SigmodRecord.xml:                        =   360,329
DOM/shared = 4.9

PLAYS: collected plays of Shakespeare (as found on net)

		    strings + attrs +  elements =      total
Without sharing: 10,677,460 +     0 + 4,305,996 = 14,983,456
With sharing:     4,901,656 +     0 + 3,751,320 =  8,652,976
DOM estimate:    19,214,762 +     0 + 7,187,600 = 26,402,362
Source plays.xml:                               =  7,648,502
DOM/shared = 3.1

% DOCBOOK: source of DocBook book, parsed by nsgmls

		    strings +     attrs +  elements =      total
Without sharing: 30,266,564 +   497,196 + 1,458,096 = 32,221,856
With sharing:       641,168 +    40,272 + 1,108,608 =  1,790,048
DOM estimate:    60,857,232 + 1,657,320 + 2,665,160 = 65,179,712 
Source docbook.sgm:                                 =  2,896,326
DOM/shared = 36.4

One thing that makes the DOM look bad is that it specifies that
strings are arrays of 16-bit UCS-16-encoded Unicode characters.
It happens that all of these files use ASCII, or possibly just a
few Latin-1 or Unicode symbols, so that using UTF-8 would shrink
the string space by a factor of 2.  In fact it is possible to devise
a Unicode encoding which is guaranteed to take no more than 3 bytes
for any character in planes 0,1,2,14 and to take 1 byte for any
ISO Latin 1 character, an encoding which is if anything easier than
UTF-8, as long as you don't require special treatment for all NUL bytes
(which Java, Javascript, and my C code do not.)

If you poke around in the statistics a bit, you find that long strings
are unique anyway, but short strings have very many references indeed,
so that shared strings really pay off.

Note that with the worst ratio here, I could go through three successive
versions of a document without bothering to garbage collection, and still
take a wee bit less space than the DOM would require for one copy, so that
for several reasonable filtering/rearranging applications, *not* garbage
collecting is a perfectly workable strategy.

I think the first reference to hash consing was in a paper of Ershov's
in the early 60's; the method was used in a compiler.  This is not a
new idea (except to the people who invented the DOM, seemingly.)