[gclist] Garbage collection and XML
Richard A. O'Keefe
ok@atlas.otago.ac.nz
Fri, 2 Mar 2001 10:30:26 +1300 (NZDT)
"Ji-Yong D. Chung" <virtualcyber@erols.com> wrote:
I have just begun my effort, and I am curious
what garbage collector/memory management
forum had to say about XML DOM and SAX
specification and/or parser implementations.
The first thing is to say that the DOM specification is one of the
worst specifications I've seen in a long time. If your goal is to
process XML, there are other approaches which are *far* better,
especially in terms of memory consumption. The only known reasons
for using the DOM interface for *anything* are
- market requirement
- required interface to an existing DOM implementation.
When I say that a naive implementation of an XML tree in C (using
either UTF-8 or TR-6 for strings) is likely to take *half* the memory of
any plausible DOM implementation, you must understand that I am not
joking and not exaggerating.
When I say that a clever (but not by any means innovative) implementation
based on hash consing ideas can shrink memory requirements still further,
you must again understand that I am not joking and not exaggerating.
So the first step in memory management for XML is "Don't use the DOM".
One of the issues with the DOM is that every node is rigidly locked in place
by cross-links pointing every which way. This makes naive reference counting
useless. Oddly enough, a more space-efficient approach could make very
effective use of reference counting. On the other hand, it does mean that
if any node in a document is garbage, they are all garbage. So if you were
doing the DOM in C++, "pointers" into a document should be pairs consisting
of a pointer to a root node and a pointer to a subnode. Reference counting
operations would affect the root node only, and when its count reached zero
the whole tree could go.
The traversal features in Level 2 DOM are downright nasty. They are
complex to implement, and complex to understand. Implementing traversal
by simply constructing a list object of some kind would be easier to use
correctly, and when you consider the implementation overheads of the DOM,
almost certainly rather more efficient. I found that supporting traversal
slowed a test DOM implementation I wrote down by about 20%.