[gclist] Garbage collection and XML
Richard A. O'Keefe
ok@atlas.otago.ac.nz
Mon, 5 Mar 2001 18:11:17 +1300 (NZDT)
Ken Anderson <kanderson@bbn.com> provided these numbers
about the DOM:
Here's a summary of loading the CIM_Schema23.xml file from
http://www.dmtf.org/spec/cim_schema_v23.html into a Java DOM.
This XML file describes a class hierarchy of 732 classes from the
DMTF/CIM standard. Some statistics:
File sizes:
    MBytes  What
      3.76  XML
      0.27  zipped
      40.2  DOM (IBM)
Sharing strings can save a significant amount, and some XML parsers
will do that. Vectors are given a default capacity of 10, while 74% have
size 1, and 12% have size 3. Trimming Vectors to the right size saves
5.6MB.
I wrote a SAX parser that read the same file but produced an
s-expression with structure sharing. It required only 4.0MB, slightly
more than the ASCII XML size.
I have added more code to my program to report where the space is going.
Measurements were made on four moderately large files:
- my collection of CS examination papers
- SigmodRecord.xml; an example I picked up off the net somewhere
and "cleaned", reducing it to 72% of its original size.
- the collected plays of Shakespeare (XML version). I must say that
the markup here is not very idiomatic; ID/IDREF(S) attributes could
have been used to excellent effect but weren't.
- DocBook; the SGML source of the O'Reilly DocBook book (it's on the
CD-ROM that comes with the book).
The space required *with* sharing is measured. It does NOT include
space for the hash tables themselves, but does include hash links.
The space required *without* sharing is partly measured and partly
calculated (the main calculation is subtracting off 4 bytes for a hash
link from each node).
The DOM estimate assumes that strings are arrays of 16-bit characters
(as the DOM requires) plus a 4-byte length field, padded to a multiple
of 4 bytes; other nodes are all 10 4-byte words (the smallest layout I
could devise that would support everything in DOM2).
The file size is as reported by ls -l or wc (which agree).
EXAMS: a collection of CS examination papers (my DTD)
                    strings +     attrs +  elements =     total
Without sharing:  3,755,464 +   312,936 + 1,288,364 = 5,356,764
With sharing:        65,216 +     1,328 +   751,248 =   817,792
DOM estimate:     5,778,556 + 1,043,120 + 2,460,680 = 9,282,356
Source exams.xml:                                   = 2,634,021
DOM/shared = 11.3
SIGMOD: the SIGMOD Record catalogue
                           strings +   attrs + elements =     total
Without sharing:           648,824 +  82,548 +  221,100 =   952,472
With sharing:              131,784 +  16,752 +  217,748 =   366,284
DOM estimate:            1,189,308 + 275,160 +  332,680 = 1,797,148
Source SigmodRecord.xml:                                =   360,329
DOM/shared = 4.9
PLAYS: collected plays of Shakespeare (as found on net)
                     strings + attrs +  elements =      total
Without sharing:  10,677,460 +     0 + 4,305,996 = 14,983,456
With sharing:      4,901,656 +     0 + 3,751,320 =  8,652,976
DOM estimate:     19,214,762 +     0 + 7,187,600 = 26,402,362
Source plays.xml:                                =  7,648,502
DOM/shared = 3.1
DOCBOOK: source of the DocBook book, parsed by nsgmls
                       strings +     attrs +  elements =      total
Without sharing:    30,266,564 +   497,196 + 1,458,096 = 32,221,856
With sharing:          641,168 +    40,272 + 1,108,608 =  1,790,048
DOM estimate:       60,857,232 + 1,657,320 + 2,665,160 = 65,179,712
Source docbook.sgm:                                    =  2,896,326
DOM/shared = 36.4
One thing that makes the DOM look bad is that it specifies that
strings are arrays of 16-bit UTF-16 code units.  It happens that all
of these files use ASCII, or at most a few Latin-1 or Unicode symbols,
so using UTF-8 would shrink the string space by roughly a factor of 2.
In fact it is possible to devise a Unicode encoding which is
guaranteed to take no more than 3 bytes for any character in planes
0, 1, 2, and 14 and to take 1 byte for any ISO Latin 1 character, an
encoding which is if anything easier to handle than UTF-8, as long as
you don't require NUL bytes to be treated specially (which Java,
JavaScript, and my C code do not).
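One possible shape for such an encoding (an illustrative sketch of my
own; the byte assignments below are choices made for this note, not
taken from any specification): byte values 0x00-0x03 never occur as
ordinary text in an XML document, so they can serve as lead bytes that
name a plane (0, 1, 2, or 14), followed by two plain bytes giving the
offset within that plane; every other Latin-1 character is stored as
itself in one byte.  Continuation bytes may be 0x00, which is why the
strings must be length-counted rather than NUL-terminated.

    #include <stddef.h>

    static const unsigned plane_of_lead[4] = { 0u, 1u, 2u, 14u };

    /* Encode one code point c; returns the number of bytes written
       (1 or 3), or 0 if c lies outside planes 0, 1, 2, and 14.
       Characters 0x00-0x03 fall into the 3-byte case, but they cannot
       occur in XML text anyway. */
    size_t encode_char(unsigned long c, unsigned char *buf)
    {
        unsigned plane = (unsigned)(c >> 16);
        if (c >= 0x04 && c <= 0xFF) {       /* Latin-1: 1 byte */
            buf[0] = (unsigned char)c;
            return 1;
        }
        if (plane == 0 || plane == 1 || plane == 2 || plane == 14) {
            buf[0] = (unsigned char)(plane == 14 ? 3 : plane);
            buf[1] = (unsigned char)((c >> 8) & 0xFF);
            buf[2] = (unsigned char)(c & 0xFF);
            return 3;
        }
        return 0;                           /* not representable here */
    }

    /* Decode one character starting at buf; stores it in *c and
       returns the number of bytes consumed. */
    size_t decode_char(const unsigned char *buf, unsigned long *c)
    {
        if (buf[0] >= 0x04) {               /* 1-byte Latin-1 */
            *c = buf[0];
            return 1;
        }
        *c = ((unsigned long)plane_of_lead[buf[0]] << 16)
           | ((unsigned long)buf[1] << 8)
           |  (unsigned long)buf[2];
        return 3;
    }

No shifting and masking of lead bytes is needed, which is what makes a
scheme like this simpler to handle than UTF-8.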
If you poke around in the statistics a bit, you find that long strings
are unique anyway, but short strings have very many references indeed,
so that shared strings really pay off.
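The machinery needed for that sharing is nothing exotic; a minimal
sketch of string sharing through a hash table (illustrative only, and
cruder than real code: fixed table size, no error handling) looks like
this:

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 4096

    struct interned {
        struct interned *next;      /* the per-node hash link */
        char text[1];               /* the characters themselves */
    };

    static struct interned *bucket[NBUCKETS];

    static unsigned hash(const char *s)
    {
        unsigned h = 0;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Return a pointer to the single stored copy of s, adding it on
       the first occurrence; later occurrences share that copy. */
    const char *intern(const char *s)
    {
        unsigned h = hash(s);
        struct interned *p;

        for (p = bucket[h]; p != NULL; p = p->next)
            if (strcmp(p->text, s) == 0)
                return p->text;     /* seen before: share it */

        p = malloc(sizeof *p + strlen(s));
        strcpy(p->text, s);
        p->next = bucket[h];
        bucket[h] = p;
        return p->text;             /* first occurrence: store it */
    }
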
Note that with the worst ratio here (3.1, for the plays), I could go
through three successive versions of a document without bothering to
garbage collect, and still take a wee bit less space than the DOM
would require for one copy (3 x 8,652,976 = 25,958,928 bytes, against
26,402,362 for the DOM), so that for several reasonable
filtering/rearranging applications, *not* garbage collecting is a
perfectly workable strategy.
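"Not collecting" needs no machinery beyond an arena: carve nodes out
of large blocks with a bump pointer and free the whole lot at once, or
let process exit do it.  A sketch (the block size and names here are
arbitrary choices of mine):

    #include <stdlib.h>

    #define BLOCK_SIZE (64 * 1024)  /* arbitrary; this sketch assumes
                                       no single request exceeds it */

    struct block {
        struct block *next;
        size_t used;
        char data[BLOCK_SIZE];
    };

    static struct block *blocks = NULL;

    /* Allocate n bytes; nothing is ever freed individually. */
    void *arena_alloc(size_t n)
    {
        void *p;
        n = (n + 7) & ~(size_t)7;           /* 8-byte alignment */
        if (blocks == NULL || blocks->used + n > BLOCK_SIZE) {
            struct block *b = malloc(sizeof *b);
            b->next = blocks;
            b->used = 0;
            blocks = b;
        }
        p = blocks->data + blocks->used;
        blocks->used += n;
        return p;
    }

    /* Discard every version of the document in one sweep. */
    void arena_free_all(void)
    {
        while (blocks != NULL) {
            struct block *b = blocks;
            blocks = b->next;
            free(b);
        }
    }
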
I think the first reference to hash consing was in a paper of
Ershov's in the early 60s; the method was used in a compiler.  This is
not a new idea (except, it seems, to the people who invented the DOM).