[gclist] Garbage collection and XML

Richard A. O'Keefe ok@atlas.otago.ac.nz
Fri, 2 Mar 2001 16:48:35 +1300 (NZDT)

Ken Anderson <kanderson@bbn.com>
backed up my recommendation against the DOM with some figures.

I have some of my own.  This is from a collection of Computer Science
exam papers marked up as XML (well, actually, SGML, automatically
transcoded to XML).

f% wc exams.xml
   26731  279088 2634021 exams.xml
f% sgml -xml exams.xml | a.out
There are 251215 references to 1341 strings, 187.334 references each.
69915 bytes were used, with 6029 bytes wasted,
for an average of 52.1365 bytes used and 4.4959 bytes wasted per string,
or 0.278307 bytes used and 0.0239994 bytes wasted per copy.

Here a "string" is any of
    - an element name
    - an attribute name
    - an attribute value
    - character content

The program puts all of these things in a hash table, so all strings are
unique.  When I wrote this I was concerned with the effectiveness of the
hash table, and whether it would be a good idea to do my own mallocking
out of blocks instead of allocating each string separately. (Yes it was.)

The program does NOT report the amount of space needed for the tree,
so the rather impressive total of 75,944 bytes required to hold the
string content of 2,634,021 bytes of XML (about 3%) is misleading.

Assuming that there's no sharing in the tree structure (which is unlikely,
because things like <name kind=pl>Java</name> occur quite often), the
data structure I use would charge
	1 word per string reference +
	2 words per element.
There are 61517 elements, so there'd be 374249 * 4 = 1,496,996 bytes for
the tree structure (AT MOST).

    1,496,996 bytes for tree structure
       75,944 bytes for strings
    1,572,940 bytes total storage.
    2,634,021 bytes for the XML sources.