[gclist] Garbage collection and XML
Ken Anderson
kanderson@bbn.com
Thu, 01 Mar 2001 20:30:43 -0500
At 04:30 PM 3/1/2001 , Richard A. O'Keefe wrote:
>"Ji-Yong D. Chung" <virtualcyber@erols.com> wrote:
> I have just begun my effort, and I am curious
> what garbage collector/memory management
> forum had to say about XML DOM and SAX
> specification and/or parser implementations.
>
>The first thing is to say that the DOM specification is one of the
>worst specifications I've seen in a long time. If your goal is to
>process XML, there are other approaches which are *far* better,
>especially in terms of memory consumption. The only known reasons
>for using the DOM interface for *anything* are
> - market requirement
> - required interface to an existing DOM implementation.
>
>When I say that a naive implementation of an XML tree in C (using
>either UTF-8 or TR-6 for strings) is likely to take *half* the memory of
>any plausible DOM implementation, you must understand that I am not
>joking and not exaggerating.
>
>When I say that a clever (but not by any means innovative) implementation
>based on hash consing ideas can shrink memory requirements still further,
>you must again understand that I am not joking and not exaggerating.
>
>So the first step in memory management for XML is "Don't use the DOM".
>
>One of the issues with the DOM is that every node is rigidly locked in place
>by cross-links pointing every which way. This makes naive reference counting
>useless. Oddly enough, a more space-efficient approach could make very
>effective use of reference counting. On the other hand, it does mean that
>if any node in a document is garbage, they are all garbage. So if you were
>doing the DOM in C++, "pointers" into a document should be pairs consisting
>of a pointer to a root node and a pointer to a subnode. Reference counting
>operations would affect the root node only, and when its count reached zero
>the whole tree could go.
>
>The traversal features in Level 2 DOM are downright nasty. They are
>complex to implement, and complex to understand. Implementing traversal
>by simply constructing a list object of some kind would be easier to use
>correctly, and when you consider the implementation overheads of the DOM,
>almost certainly rather more efficient. I found that supporting traversal
>slowed a test DOM implementation I wrote down by about 20%.
I thought i'd support you with some numbers, though in Java. Here's a
summary of loading the CIM_Schema23.xml file from
http://www.dmtf.org/spec/cim_schema_v23.html into a Java DOM.
This XML file describe a class hierarchy of 732 classes of the
DMTF/CIM standard. Some Statistics:
File sizes:
MBytes of What
3.76 XML
0.27 zipped
40.2 DOM (IBM)
Sharing strings can save a significant amount, and some XML parsers
will do that. Vector's are given a default size of 10, while 74% have
size 1, and 12% have size 3. Trimming Vectors to the right size saves
5.6MB.
I wrote a SAX parser that read the same file but produced an
s-expression with structure sharing. It required only 4.0MB, slightly
more than the ASCII XML size.
Its not so much an issue of GC as watching were the bytes go.
k