[gclist] Garbage collection and XML

Ken Anderson kanderson@bbn.com
Thu, 01 Mar 2001 20:30:43 -0500


At 04:30 PM 3/1/2001 , Richard A. O'Keefe wrote:
>"Ji-Yong D. Chung" <virtualcyber@erols.com> wrote:
>	    I have just begun my effort, and I am curious
>	what garbage collector/memory management
>	forum had to say about XML DOM and SAX 
>	specification and/or parser implementations.
>	
>The first thing is to say that the DOM specification is one of the
>worst specifications I've seen in a long time.  If your goal is to
>process XML, there are other approaches which are *far* better,
>especially in terms of memory consumption.  The only known reasons
>for using the DOM interface for *anything* are
> - market requirement
> - required interface to an existing DOM implementation.
>
>When I say that a naive implementation of an XML tree in C (using
>either UTF-8 or TR-6 for strings) is likely to take *half* the memory of
>any plausible DOM implementation, you must understand that I am not
>joking and not exaggerating.
>
>When I say that a clever (but not by any means innovative) implementation
>based on hash consing ideas can shrink memory requirements still further,
>you must again understand that I am not joking and not exaggerating.
>
>So the first step in memory management for XML is "Don't use the DOM".
>
>One of the issues with the DOM is that every node is rigidly locked in place
>by cross-links pointing every which way.  This makes naive reference counting
>useless.  Oddly enough, a more space-efficient approach could make very
>effective use of reference counting.  On the other hand, it does mean that
>if any node in a document is garbage, they are all garbage.  So if you were
>doing the DOM in C++, "pointers" into a document should be pairs consisting
>of a pointer to a root node and a pointer to a subnode.  Reference counting
>operations would affect the root node only, and when its count reached zero
>the whole tree could go.
>
>The traversal features in Level 2 DOM are downright nasty.  They are
>complex to implement, and complex to understand.  Implementing traversal
>by simply constructing a list object of some kind would be easier to use
>correctly, and when you consider the implementation overheads of the DOM,
>almost certainly rather more efficient.  I found that supporting traversal
>slowed a test DOM implementation I wrote down by about 20%.

I thought i'd support you with some numbers, though in Java.  Here's a
summary of loading the CIM_Schema23.xml file from
http://www.dmtf.org/spec/cim_schema_v23.html into a Java DOM.

This XML file describe a class hierarchy of 732 classes of the
DMTF/CIM standard.  Some Statistics:

File sizes:

MBytes of What
 3.76      XML
 0.27      zipped
40.2       DOM (IBM)

Sharing strings can save a significant amount, and some XML parsers
will do that.  Vector's are given a default size of 10, while 74% have
size 1, and 12% have size 3.  Trimming Vectors to the right size saves
5.6MB.

I wrote a SAX parser that read the same file but produced an
s-expression with structure sharing.  It required only 4.0MB, slightly
more than the ASCII XML size.  

Its not so much an issue of GC as watching were the bytes go.
k