From virtualcyber@erols.com Thu, 1 Mar 2001 00:36:37 -0500
Date: Thu, 1 Mar 2001 00:36:37 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Boehm's collector and "dynamic roots"

Hi,

    I am having a problem with getting
Boehm's collector to work for me.  I was wondering
if anyone could help.

    I have just incorporated Boehm's collector for my
scheme interpreter.  It is implemented in C++.
The interpreter keeps its own stack (rather than using 
C++ stack).  To get the collector
to scan "my" stack, I registered it as a
root segment using one of the
interfaces provided by Boehm's collector.

    My problem is that I have allocated approximately
200k for the stack, and on each collection cycle,
Boehm's collector is likely to scan the whole thing
as one of its roots.

    If I were using a C++ stack, Boehm's collector would just
scan the stack until it reached the top.  Because my stack 
is not a C++ stack, and it is registered as a root
segment, Boehm's collector
is treating it as just that -- a root segment
rather than as stack.

    I am wondering if there is any way to cause
the collector to scan only to the top of MY OWN
stack, rather than the whole 200k.

    For this, it seems that I need to implement
additional interfaces that could add
what one may call "dynamic" roots -- roots
whose size will change when I push
my stack.
    
    Has anyone encountered a similar problem
before, and know of a relatively simple solution
to this problem?  (Perhaps a few pointers
to how I might need to modify Boehm's 
collector?)

    I would appreciate any comments/enlightenment.


From fjh@cs.mu.oz.au Thu, 1 Mar 2001 17:32:07 +1100
Date: Thu, 1 Mar 2001 17:32:07 +1100
From: Fergus Henderson fjh@cs.mu.oz.au
Subject: [gclist] Boehm's collector and "dynamic roots"

On 01-Mar-2001, Ji-Yong D. Chung <virtualcyber@erols.com> wrote:
>     I am wondering if there is any way to cause
> the collector to scan only to the top of MY OWN
> stack, rather than the whole 200k.
> 
>     For this, it seems that I need to implement
> additional interfaces that could add
> what one may call "dynamic" roots -- roots
> whose size will change when I push
> my stack.
>
>     Has anyone encountered a similar problem
> before, and know of a relatively simple solution
> to this problem?

Use the GC_push_other_roots() hook, which is declared in gc_priv.h:

extern void (*GC_push_other_roots)();
                        /* Push system or application specific roots    */
                        /* onto the mark stack.  In some environments   */
                        /* (e.g. threads environments) this is          */
                        /* predefined to be non-zero.  A client supplied */
                        /* replacement should also call the original    */
                        /* function.                                    */

This will get called whenever a collection occurs.

In your function, you can call GC_push_all() or GC_push_all_stack()
to register your stack.
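
For readers following along, here is a rough sketch of how such a hook
could be wired up.  It does not link against the collector: push_range
and push_other_roots are stand-ins defined in the sketch that only mirror
the shape of GC_push_all() and GC_push_other_roots, and InterpStack is a
hypothetical interpreter stack.  The point is that the hook reports only
the live part of the stack and still calls the previously installed hook,
as the header comment asks.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-ins for the collector side (NOT the real API): the real hook and
// push functions are declared in the collector's own headers.
using Range = std::pair<void**, void**>;
static std::vector<Range> pushed_ranges;       // ranges a "collection" was told to scan
static void push_range(void** bottom, void** top) { pushed_ranges.emplace_back(bottom, top); }
static void (*push_other_roots)() = nullptr;   // mirrors the hook variable

// Hypothetical interpreter stack: about 200k of slots, growing upward.
constexpr std::size_t kStackSlots = 200 * 1024 / sizeof(void*);
struct InterpStack {
    void* slots[kStackSlots] = {};
    void** sp = slots;                         // one past the last live slot
    void push(void* v) { assert(sp < slots + kStackSlots); *sp++ = v; }
    void pop() { assert(sp > slots); *--sp = nullptr; }
};

static InterpStack interp_stack;
static void (*previous_hook)() = nullptr;

// Client hook: chain to the original hook, then report only the live slots.
static void push_interp_roots() {
    if (previous_hook) previous_hook();
    push_range(interp_stack.slots, interp_stack.sp);
}

static void install_hook() {
    if (push_other_roots == push_interp_roots) return;   // already installed
    previous_hook = push_other_roots;
    push_other_roots = push_interp_roots;
}

// Simulates the moment a collection asks for the extra roots and
// returns how many stack slots it would have to scan.
static std::size_t slots_scanned_by_collection() {
    pushed_ranges.clear();
    if (push_other_roots) push_other_roots();
    std::size_t n = 0;
    for (const Range& r : pushed_ranges) n += static_cast<std::size_t>(r.second - r.first);
    return n;
}
```

With three values pushed, the simulated collection is handed three slots
rather than the whole 200k region, and the range shrinks again after a
pop.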

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
                                    |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.


From virtualcyber@erols.com Thu, 1 Mar 2001 02:44:23 -0500
Date: Thu, 1 Mar 2001 02:44:23 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Garbage collection and XML

    Hi,

    I am trying to modify a c++ XML parser library,
so that it uses a GC.

    I have just begun my effort, and I am curious
what garbage collector/memory management
forum had to say about XML DOM and SAX 
specification and/or parser implementations.

    In particular, I was wondering how well XML
parsers (DOM and SAX) might get along with
 today's garbage collection/memory management 
techniques.  (Perhaps my question is ridiculously too
general).  Does the fact that DOM interface is a tree
structure manipulation tool make any difference?

    I will be grateful and satisfied with first impressions, 
general comments, in fact anything about garbage
collection and memory management of XML parsers, 
spec, etc.  


From ken@bitsko.slc.ut.us 01 Mar 2001 09:09:36 -0600
Date: 01 Mar 2001 09:09:36 -0600
From: Ken MacLeod ken@bitsko.slc.ut.us
Subject: [gclist] Garbage collection and XML

"Ji-Yong D. Chung" <virtualcyber@erols.com> writes:

>     I am trying to modify a c++ XML parser library, so that it uses
> a GC.
> 
>     I have just begun my effort, and I am curious what garbage
> collector/memory management forum had to say about XML DOM and SAX
> specification and/or parser implementations.
> 
>     In particular, I was wondering how well XML parsers (DOM and
> SAX) might get along with today's garbage collection/memory
> management techniques.  (Perhaps my question is ridiculously too
> general).  Does the fact that DOM interface is a tree structure
> manipulation tool make any difference?

I specifically selected garbage collection (and Boehm GC in
particular) for implementing the Orchard/Mostly-C XML library (which
has a fledgling C++ interface, by the way).

The performance is excellent (which most GCers would likely have
expected :-).  We've variously stress and performance tested it using
large documents and thousands of small documents per second.  Using
the Expat C XML parser we're running about 1/3 the speed of the raw
parser, and that due mostly to creating objects for each parse event
and doing string copies of XML text (the latter may have a memmgmt
hook to prevent, haven't checked deeper yet).

One of the particular reasons for using GC is that most people want
their XML trees with "parent" references in them, which creates cycles.

From your earlier postings, you may be interested in creating a Scheme
binding to Orchard[1] (rather than using the C++ interface I mentioned
above).  Orchard implements a very lightweight SAX and DOM that would
work well in Scheme.

  -- Ken

[1] <http://casbah.org/~kmacleod/orchard/>


From ok@atlas.otago.ac.nz Fri, 2 Mar 2001 10:30:26 +1300 (NZDT)
Date: Fri, 2 Mar 2001 10:30:26 +1300 (NZDT)
From: Richard A. O'Keefe ok@atlas.otago.ac.nz
Subject: [gclist] Garbage collection and XML

"Ji-Yong D. Chung" <virtualcyber@erols.com> wrote:
	    I have just begun my effort, and I am curious
	what garbage collector/memory management
	forum had to say about XML DOM and SAX 
	specification and/or parser implementations.
	
The first thing is to say that the DOM specification is one of the
worst specifications I've seen in a long time.  If your goal is to
process XML, there are other approaches which are *far* better,
especially in terms of memory consumption.  The only known reasons
for using the DOM interface for *anything* are
 - market requirement
 - required interface to an existing DOM implementation.

When I say that a naive implementation of an XML tree in C (using
either UTF-8 or TR-6 for strings) is likely to take *half* the memory of
any plausible DOM implementation, you must understand that I am not
joking and not exaggerating.

When I say that a clever (but not by any means innovative) implementation
based on hash consing ideas can shrink memory requirements still further,
you must again understand that I am not joking and not exaggerating.
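
To make "hash consing" concrete, here is a minimal sketch (not the
implementation referred to above).  NodeTable and Node are hypothetical
names, and an ordered std::map stands in for the hash table to keep the
code short.  Every node is built through make(), which returns an
existing node when one with the same name and the same children is
already in the table, so repeated subtrees are stored once.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::size_t id;                  // unique per distinct node
    std::string name;                // element name, or text for a leaf
    std::vector<const Node*> kids;   // children are themselves shared nodes
};

class NodeTable {
public:
    // Returns the unique node with this name and these children,
    // creating it only if no structurally equal node exists yet.
    const Node* make(const std::string& name,
                     const std::vector<const Node*>& kids = {}) {
        Key key;
        key.first = name;
        for (const Node* k : kids) key.second.push_back(k->id);
        auto it = table_.find(key);
        if (it != table_.end()) return it->second.get();
        auto node = std::make_unique<Node>(Node{table_.size(), name, kids});
        const Node* raw = node.get();
        table_.emplace(std::move(key), std::move(node));
        return raw;
    }

    std::size_t size() const { return table_.size(); }

private:
    // Children are identified by id, so equal keys mean equal structure.
    using Key = std::pair<std::string, std::vector<std::size_t>>;
    std::map<Key, std::unique_ptr<Node>> table_;
};
```

Building the same small subtree twice yields the same pointer both times,
and the table grows only when a genuinely new subtree appears.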

So the first step in memory management for XML is "Don't use the DOM".

One of the issues with the DOM is that every node is rigidly locked in place
by cross-links pointing every which way.  This makes naive reference counting
useless.  Oddly enough, a more space-efficient approach could make very
effective use of reference counting.  On the other hand, it does mean that
if any node in a document is garbage, they are all garbage.  So if you were
doing the DOM in C++, "pointers" into a document should be pairs consisting
of a pointer to a root node and a pointer to a subnode.  Reference counting
operations would affect the root node only, and when its count reached zero
the whole tree could go.
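
The pair-of-pointers idea can be sketched roughly as follows.  The names
DomNode, Document and NodeRef are hypothetical, and std::shared_ptr
stands in for a hand-written reference count on the root.  Links inside
the tree are plain pointers, so the parent/child cycles never touch a
count; only handles held from outside do.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct DomNode {
    DomNode* parent = nullptr;          // back link: this is what creates cycles
    std::vector<DomNode*> children;     // uncounted links, owned by the document
};

// The document owns every node; only the document is reference counted.
struct Document {
    std::vector<std::unique_ptr<DomNode>> nodes;
    DomNode* root = nullptr;

    DomNode* add(DomNode* parent) {
        nodes.push_back(std::make_unique<DomNode>());
        DomNode* n = nodes.back().get();
        n->parent = parent;
        if (parent) parent->children.push_back(n); else root = n;
        return n;
    }
};

// A "pointer" into a document: a counted reference to the whole
// plus an uncounted pointer to the part.
struct NodeRef {
    std::shared_ptr<Document> doc;
    DomNode* node = nullptr;

    NodeRef parent() const { return NodeRef{doc, node->parent}; }
    NodeRef child(std::size_t i) const { return NodeRef{doc, node->children[i]}; }
};
```

Holding a NodeRef to any leaf keeps the whole tree alive; dropping the
last NodeRef frees every node in one go.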

The traversal features in Level 2 DOM are downright nasty.  They are
complex to implement, and complex to understand.  Implementing traversal
by simply constructing a list object of some kind would be easier to use
correctly, and when you consider the implementation overheads of the DOM,
almost certainly rather more efficient.  I found that supporting traversal
slowed a test DOM implementation I wrote down by about 20%.


From kanderson@bbn.com Thu, 01 Mar 2001 20:30:43 -0500
Date: Thu, 01 Mar 2001 20:30:43 -0500
From: Ken Anderson kanderson@bbn.com
Subject: [gclist] Garbage collection and XML

At 04:30 PM 3/1/2001 , Richard A. O'Keefe wrote:
>"Ji-Yong D. Chung" <virtualcyber@erols.com> wrote:
>	    I have just begun my effort, and I am curious
>	what garbage collector/memory management
>	forum had to say about XML DOM and SAX 
>	specification and/or parser implementations.
>	
>The first thing is to say that the DOM specification is one of the
>worst specifications I've seen in a long time.  If your goal is to
>process XML, there are other approaches which are *far* better,
>especially in terms of memory consumption.  The only known reasons
>for using the DOM interface for *anything* are
> - market requirement
> - required interface to an existing DOM implementation.
>
>When I say that a naive implementation of an XML tree in C (using
>either UTF-8 or TR-6 for strings) is likely to take *half* the memory of
>any plausible DOM implementation, you must understand that I am not
>joking and not exaggerating.
>
>When I say that a clever (but not by any means innovative) implementation
>based on hash consing ideas can shrink memory requirements still further,
>you must again understand that I am not joking and not exaggerating.
>
>So the first step in memory management for XML is "Don't use the DOM".
>
>One of the issues with the DOM is that every node is rigidly locked in place
>by cross-links pointing every which way.  This makes naive reference counting
>useless.  Oddly enough, a more space-efficient approach could make very
>effective use of reference counting.  On the other hand, it does mean that
>if any node in a document is garbage, they are all garbage.  So if you were
>doing the DOM in C++, "pointers" into a document should be pairs consisting
>of a pointer to a root node and a pointer to a subnode.  Reference counting
>operations would affect the root node only, and when its count reached zero
>the whole tree could go.
>
>The traversal features in Level 2 DOM are downright nasty.  They are
>complex to implement, and complex to understand.  Implementing traversal
>by simply constructing a list object of some kind would be easier to use
>correctly, and when you consider the implementation overheads of the DOM,
>almost certainly rather more efficient.  I found that supporting traversal
>slowed a test DOM implementation I wrote down by about 20%.

I thought I'd support you with some numbers, though in Java.  Here's a
summary of loading the CIM_Schema23.xml file from
http://www.dmtf.org/spec/cim_schema_v23.html into a Java DOM.

This XML file describes a class hierarchy of 732 classes of the
DMTF/CIM standard.  Some Statistics:

File sizes:

MBytes     What
 3.76      XML
 0.27      zipped
40.2       DOM (IBM)

Sharing strings can save a significant amount, and some XML parsers
will do that.  Vectors are given a default size of 10, while 74% have
size 1, and 12% have size 3.  Trimming Vectors to the right size saves
5.6MB.

I wrote a SAX parser that read the same file but produced an
s-expression with structure sharing.  It required only 4.0MB, slightly
more than the ASCII XML size.  

It's not so much an issue of GC as watching where the bytes go.
k


From virtualcyber@erols.com Thu, 1 Mar 2001 20:34:32 -0500
Date: Thu, 1 Mar 2001 20:34:32 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Garbage collection and XML

    Thank you for your replies.

    I will have to re-evaluate my goals and what I am
attempting to do, in light of your email messages.

    I had suspected a few problems with XML parsers
 (having heard of complaints) but I did not know that experts
would be this harsh toward DOM.


Take Care
Ji-Yong D. Chung


From ok@atlas.otago.ac.nz Fri, 2 Mar 2001 16:48:35 +1300 (NZDT)
Date: Fri, 2 Mar 2001 16:48:35 +1300 (NZDT)
From: Richard A. O'Keefe ok@atlas.otago.ac.nz
Subject: [gclist] Garbage collection and XML

Ken Anderson <kanderson@bbn.com>
backed up my recommendation against the DOM with some figures.

I have some of my own.  This is from a collection of Computer Science
exam papers marked up as XML (well, actually, SGML, automatically
transcoded to XML).

f% wc exams.xml
   26731  279088 2634021 exams.xml
f% sgml -xml exams.xml | a.out
There are 251215 references to 1341 strings, 187.334 references each.
69915 bytes were used, with 6029 bytes wasted,
for an average of 52.1365 bytes used and 4.4959 bytes wasted per string,
or 0.278307 bytes used and 0.0239994 bytes wasted per copy.

Here a "string" is any of
    - an element name
    - an attribute name
    - an attribute value
    - character content

The program puts all of these things in a hash table, so all strings are
unique.  When I wrote this I was concerned with the effectiveness of the
hash table, and whether it would be a good idea to do my own mallocking
out of blocks instead of allocating each string separately. (Yes it was.)
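
A minimal sketch of this kind of string table, under assumptions of my
own: StringPool and kBlockBytes are hypothetical, std::unordered_map is
the hash table, and the index keeps its own std::string keys for brevity
(a tighter version would key on the pooled bytes).  Each distinct string
is copied once into a large block instead of getting its own allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

constexpr std::size_t kBlockBytes = 4096;   // assumed block size

class StringPool {
public:
    // Returns a stable pointer to the single shared copy of s.
    const char* intern(const std::string& s) {
        ++references_;
        auto it = index_.find(s);
        if (it != index_.end()) return it->second;
        char* copy = allocate(s.size() + 1);
        std::memcpy(copy, s.c_str(), s.size() + 1);
        index_.emplace(s, copy);
        return copy;
    }

    std::size_t unique_strings() const { return index_.size(); }
    std::size_t references() const { return references_; }

private:
    // Carves n bytes out of the current block, starting a new block when
    // the current one is full.  Oversized strings get a block of their own.
    char* allocate(std::size_t n) {
        if (blocks_.empty() || used_ + n > block_size_) {
            block_size_ = n > kBlockBytes ? n : kBlockBytes;
            blocks_.push_back(std::make_unique<char[]>(block_size_));
            used_ = 0;
        }
        char* p = blocks_.back().get() + used_;
        used_ += n;
        return p;
    }

    std::unordered_map<std::string, const char*> index_;
    std::vector<std::unique_ptr<char[]>> blocks_;
    std::size_t block_size_ = 0;
    std::size_t used_ = 0;
    std::size_t references_ = 0;
};
```

Interning the same text twice returns the same pointer, so a tree can
hold one word per string reference, and the ratio of references() to
unique_strings() corresponds to the "references each" figure above.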

The program does NOT report the amount of space needed for the tree,
so the rather impressive total of 75,944 bytes required to hold the
string content of 2,634,021 bytes of XML (about 3%) is misleading.

Assuming that there's no sharing in the tree structure (which is unlikely,
because things like <name kind=pl>Java</name> occur quite often), the
data structure I use would charge
	1 word per string reference +
	2 words per element.
There are 61517 elements, so there'd be 374249 * 4 = 1,496,996 bytes for
the tree structure (AT MOST).

    1,496,996 bytes for tree structure
       75,944 bytes for strings
-------------------
    1,572,940 bytes total storage.
BUT
    2,634,021 bytes for the XML sources.
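
The arithmetic above can be rechecked with a few lines of code.  The
figures are the ones reported in this message; the 374249 words are the
251215 string references plus 2 words for each of the 61517 elements,
and the 4-byte word matches the multiplication by 4 above.  The constant
and function names are only for illustration.

```cpp
#include <cassert>

constexpr long kStringRefs  = 251215;         // references to strings
constexpr long kElements    = 61517;          // elements in the document
constexpr long kStringBytes = 69915 + 6029;   // bytes used + bytes wasted
constexpr long kXmlBytes    = 2634021;        // size of exams.xml
constexpr long kWordBytes   = 4;              // word size used in the estimate

// 1 word per string reference + 2 words per element.
constexpr long tree_words()  { return kStringRefs + 2 * kElements; }
constexpr long tree_bytes()  { return tree_words() * kWordBytes; }
constexpr long total_bytes() { return tree_bytes() + kStringBytes; }

static_assert(kStringBytes == 75944, "string storage");
static_assert(tree_words() == 374249, "words in the tree");
static_assert(tree_bytes() == 1496996, "bytes in the tree");
static_assert(total_bytes() == 1572940, "total storage");
```

On these numbers the in-memory form comes to roughly 60% of the size of
the XML source.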


From dave@cs.adelaide.edu.au Fri, 2 Mar 2001 14:23:44 +1030
Date: Fri, 2 Mar 2001 14:23:44 +1030
From: Dave Munro dave@cs.adelaide.edu.au
Subject: [gclist] Real-Time GC for high-level languages


At 3:20 PM +0000 30/1/01, Andrew Cheadle wrote:
>I seem to remember a theoretical paper:
>  Guy E. Blelloch, Perry Cheng: On Bounding Time and Space for
>Multiprocessor Garbage Collection. PLDI 1999: 104-117
>
>which makes claims of bounded pause times. I believe, but I'm not sure,
>that Perry Cheng was looking at implementing the techniques mentioned in
>the above paper in the TILT ML compiler:

Just a note to say that William Brodie-Tyrrell, one of my students, 
implemented a version of the Blelloch and Cheng bounded GC which we 
reported in

Vaughan, Francis A., Brodie-Tyrrell, William F., Falkner, Katrina E. 
and Munro, David S., "Bounded Parallel Garbage 
Collection: Implementation and Adaptation", To appear in Proceedings 
of 7th Australian Parallel and Real Time PART'2000 Sydney.

This can be picked up from 
http://www.cs.adelaide.edu.au/users/jacaranda/publications.html


Cheers

Dave


-- 

--------------------------------------------------------------------------

David Munro                           _--_|\    phone:  +61 8 8303 6173
Department of Computer Science,      /      \   fax:    +61 8 8303 4366
University of Adelaide,              \_.--*_/
South Australia 5005                       v
Australia 
http://www.cs.adelaide.edu.au/~dave


From plakal@cs.wisc.edu Thu, 1 Mar 2001 21:57:22 -0600
Date: Thu, 1 Mar 2001 21:57:22 -0600
From: Manoj Plakal plakal@cs.wisc.edu
Subject: [gclist] Real-Time GC for high-level languages

Dave Munro wrote (Fri, Mar 02, 2001 at 02:23:44PM +1030) :
> At 3:20 PM +0000 30/1/01, Andrew Cheadle wrote:
> >I seem to remember a theoretical paper:
> >  Guy E. Blelloch, Perry Cheng: On Bounding Time and Space for
> >Multiprocessor Garbage Collection. PLDI 1999: 104-117
> >
> >which makes claims of bounded pause times. I believe, but I'm not sure,
> >that Perry Cheng was looking at implementing the techniques mentioned in
> >the above paper in the TILT ML compiler:
> 
> Just a note to say that William Brodie-Tyrrell, one of my students, 
> implemented a version of the Blelloch and Cheng bounded GC which we 
> reported in
> 
> Vaughan, Francis A., Brodie-Tyrrell, William F., Falkner, Katrina E. 
> and Munro, David S., "Bounded Parallel Garbage 
> Collection:Implementation and Adaptation", To appear in Proceedings 
> of 7th Australian Parallel and Real Time PART'2000 Sydney.
> 
> This can be picked up from 
> http://www.cs.adelaide.edu.au/users/jacaranda/publications.html


	It seems that Cheng & Blelloch have also implemented 
	their idea since they have a paper in the upcoming PLDI-2001 
	titled "A Parallel, Real-Time Garbage Collector".

	See http://www.cs.pitt.edu/~soffa/pldi01/pldi_program.html

	Manoj


From virtualcyber@erols.com Thu, 1 Mar 2001 23:20:52 -0500
Date: Thu, 1 Mar 2001 23:20:52 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Garbage collection and XML

Hi,

> [Ken Anderson wrote]   

> File sizes:
> 
> MBytes of What
>  3.76      XML
>  0.27      zipped
> 40.2       DOM (IBM)

    This is bad -- I need to break up with
Xerces (IBM DOM), and real soon.

P.S. By the way, this is a bit off the topic,
but it seems, from Richard O'Keefe's 
email and yours, implementing an XML parser does not
seem to be that difficult -- am I right? (or perhaps
you two are just super coders ...)


From danwang@CS.Princeton.EDU 01 Mar 2001 23:03:58 -0500
Date: 01 Mar 2001 23:03:58 -0500
From: Daniel Wang danwang@CS.Princeton.EDU
Subject: [gclist] Canonical citation for "memory pools"

Does anyone have a canonical citation for "memory pools/arenas"? i.e. the
memory management scheme where you allocate several objects in one big
chunk of space and deallocate all the objects in one go. I need it for a
related work section of a paper. Ideally, someone can claim credit for being
the first to publish this idea. Larger works that include this kind of
scheme as a part of a whole  are just as good.
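
For concreteness, the scheme being asked about looks roughly like this.
It is only a sketch: Arena and Cell are hypothetical names, and the
64k default chunk size is an arbitrary choice.  Objects are carved
out of big chunks by bumping an offset, and release() frees all of them
at once.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

struct Cell { int tag; int value; };   // example payload type

class Arena {
public:
    explicit Arena(std::size_t chunk_bytes = 64 * 1024) : chunk_bytes_(chunk_bytes) {}

    // Bump-allocates storage for a T and constructs it in place.
    // Only trivially destructible types are accepted, because release()
    // frees the memory without running any destructors.
    template <typename T, typename... Args>
    T* make(Args&&... args) {
        static_assert(std::is_trivially_destructible<T>::value,
                      "the arena never runs destructors");
        static_assert(alignof(T) <= alignof(std::max_align_t),
                      "over-aligned types are not supported");
        void* p = allocate(sizeof(T), alignof(T));
        return new (p) T{std::forward<Args>(args)...};
    }

    // Deallocates every object in one go.
    void release() { chunks_.clear(); used_ = 0; capacity_ = 0; }

    std::size_t chunk_count() const { return chunks_.size(); }

private:
    void* allocate(std::size_t size, std::size_t align) {
        std::size_t start = (used_ + align - 1) / align * align;
        if (chunks_.empty() || start + size > capacity_) {
            capacity_ = size > chunk_bytes_ ? size : chunk_bytes_;
            chunks_.push_back(std::make_unique<unsigned char[]>(capacity_));
            start = 0;
        }
        used_ = start + size;
        return chunks_.back().get() + start;
    }

    std::size_t chunk_bytes_;
    std::vector<std::unique_ptr<unsigned char[]>> chunks_;
    std::size_t used_ = 0;       // bytes used in the newest chunk
    std::size_t capacity_ = 0;   // size of the newest chunk
};
```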

TIA.


From virtualcyber@erols.com Thu, 1 Mar 2001 23:45:05 -0500
Date: Thu, 1 Mar 2001 23:45:05 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Garbage collection and XML

Hi, you wrote

> The program puts all of these things in a hash table, so all strings are
> unique.  When I wrote this I was concerned with the effectiveness of the
> hash table, and whether it would be a good idea to do my own mallocking
> out of blocks instead of allocating each string separately. (Yes it was.)

    This is a strong argument in favor of hashtables indeed -- but how do
you determine the dimension of the hashtable?  My guess here would be that
you chose the size based on the XML file size.  I'd think that a static
hashtable with a predetermined size would not work, because file sizes
can really vary.

>     1,496,996 bytes for tree structure
>        75,944 bytes for strings
> -------------------
>     1,572,940 bytes total storage.
> BUT
>     2,634,021 bytes for the XML sources.

    This is what I call efficient use of memory!  Actually, this makes
sense, because XML contains many built-in inefficiencies (such as the
matching tags).  The preceding figures give a pretty good idea of what
to shoot for in designing a good parser.

    One last detail -- from looking at your previous email messages,
I'd guess that you used reference counting, right?  Also, from what
Ken Anderson said, I'd guess that one could obtain similar results
(provided the code is of similar quality) from a parser that uses a
garbage collector.


Take Care,
Ji-Yong D. Chung


From emery@cs.utexas.edu Thu, 1 Mar 2001 23:43:43 -0600
Date: Thu, 1 Mar 2001 23:43:43 -0600
From: Emery Berger emery@cs.utexas.edu
Subject: [gclist] Canonical citation for "memory pools"

> Does anyone have a canonical citation for "memory pools/arenas"? i.e. the
> memory management scheme where you allocate several objects in one big
> chunk of space and deallocate all the objects in one go. I need it for a
> related work section of a paper. Ideally, someone can claim credit for being
> the first to publish this idea. Larger works that include this kind of
> scheme as a part of a whole  are just as good.

Gay & Aiken's PLDI 1998 paper "Memory Management with Explicit Regions"
cites a number of authors for (regions|zones|groups|arenas).

http://www.acm.org/pubs/articles/proceedings/pldi/277650/p313-gay/p313-gay.pdf


The earliest citation is for "zones" [D. T. Ross. The AED free storage
package. Communications of the ACM, 10(8):481-492, August 1967]. See also
Paul Wilson's DSA survey -- page 48 describes the Ross paper and mentions
(with a citation) that others had earlier used similar schemes.

ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps

Regards,
-- Emery

--
Emery Berger
emery@cs.utexas.edu
http://www.cs.utexas.edu/users/emery


From jerrold.leichter@smarts.com Fri, 2 Mar 2001 10:36:35 -0500 (EST)
Date: Fri, 2 Mar 2001 10:36:35 -0500 (EST)
From: Jerrold Leichter jerrold.leichter@smarts.com
Subject: [gclist] Garbage collection and XML

| > The program puts all of these things in a hash table, so all strings are
| > unique.  When I wrote this I was concerned with the effectiveness of the
| > hash table, and whether it would be a good idea to do my own mallocking
| > out of blocks instead of allocating each string separately. (Yes it was.)
| 
|     This is a strong argument in favor of hashtables indeed -- but how do
| you determine the dimension of the hashtable?  My guess here would be that
| you chose the size based on the XML file size.  I'd think that a static
| hashtable with a predetermined size would not work, because file sizes
| can really vary....

Dynamically adjustable hash tables have been around for 25 years.  They never
seem to have made it into the standard textbooks, and most programmers are
unaware of them.  To this day, textbooks continue to list fixed size as a
minus for hash tables.  (The contents of textbooks gets frozen in place:  The
textbooks define the course requirements, which in turn define the constraints
on newer textbooks for the same courses.  Compare a data structures textbook
published today to one published in 1980 and what you'll find is that the new
one uses object-oriented concepts and C++ or Java, while the older one uses
Pascal or C - but the actual algorithms presented are pretty much the same,
and generally all those algorithms were familiar by the mid-70's.  Tying this
back to GC:  You can see the same phenomenon in compiler texts, or any other
texts that might be expected to discuss garbage collection techniques:  They
say little - and much of what they *do* say is long obsolete - because they've
*always* said little.)

I've implemented and used with very good results the "linear hashing"
algorithm, described by Witold Litwin in Proc. 6th International Conf. on Very
Large Databases, 1980.  A more easily accessible reference is Per-Åke Larson
in CACM 31 (1988), pg 446--457.  Linear hashing gives you expected constant
time access using a table whose size is (expected) bounded as a constant
multiple of the number of elements in the table.  It grows and shrinks
dynamically in small increments - the (expected) cost per growth/shrinkage
again a constant.  All these constants are linearly related to a user-settable
parameter which is basically the target load factor for the table.
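
To give a feel for the algorithm, here is a much smaller sketch than the
500-line class described in this message, and not derived from it.
LinearHashSet is a hypothetical name, buckets are plain vectors of
strings, shrinking is left out, and max_load plays the role of the
target load factor.  The table grows by splitting one bucket at a time,
in a fixed order, rather than rehashing everything at once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t kInitialBuckets = 4;

class LinearHashSet {
public:
    explicit LinearHashSet(double max_load = 2.0)
        : max_load_(max_load), buckets_(kInitialBuckets) {}

    bool insert(const std::string& key) {
        std::vector<std::string>& bucket = buckets_[index_of(key)];
        for (const std::string& k : bucket) if (k == key) return false;
        bucket.push_back(key);
        ++size_;
        // Grow in a small increment: at most one bucket split per insert.
        if (static_cast<double>(size_) > max_load_ * buckets_.size()) split_one();
        return true;
    }

    bool contains(const std::string& key) const {
        for (const std::string& k : buckets_[index_of(key)]) if (k == key) return true;
        return false;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }

private:
    std::size_t index_of(const std::string& key) const {
        std::size_t h = std::hash<std::string>{}(key);
        std::size_t i = h % round_size_;
        // Buckets below the split pointer were already split this round,
        // so their keys are addressed with the doubled modulus.
        if (i < next_split_) i = h % (2 * round_size_);
        return i;
    }

    // Splits the bucket at the split pointer into itself and one new
    // bucket appended at index next_split_ + round_size_.
    void split_one() {
        buckets_.emplace_back();
        std::vector<std::string> old = std::move(buckets_[next_split_]);
        buckets_[next_split_].clear();
        for (std::string& k : old) {
            std::size_t i = std::hash<std::string>{}(k) % (2 * round_size_);
            buckets_[i].push_back(std::move(k));
        }
        if (++next_split_ == round_size_) {   // round finished: table has doubled
            round_size_ *= 2;
            next_split_ = 0;
        }
    }

    double max_load_;
    std::vector<std::vector<std::string>> buckets_;
    std::size_t round_size_ = kInitialBuckets;   // buckets at the start of this round
    std::size_t next_split_ = 0;                 // next bucket to split
    std::size_t size_ = 0;
};
```

The bucket count always equals round_size_ + next_split_, which is what
lets index_of() find a key without knowing when its bucket was split.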

I can't distribute my code because it's part of a product.  However, to give
you an idea of what's involved, the basic implementation is about 500 lines of
extensively commented C++ code.  This is for a templated class with your
usual insert, lookup, and delete operations, plus an embedded class that
implements a "cursor" abstraction for building iterators, and even a function
to compute and print in a nice format a number of statistics about instances
of the class.  Since it doesn't come up in my application, I never implemented
the code to shrink a hash table.  (Then again, the function to grow the table
is all of about 35 commented lines; the code to shrink would be about the
same.)  The individual hash buckets use a class somewhat like an STL deque,
but simpler and much more space-efficient.  The externally-visible classes
that use this as a base are container classes analogous to, but rather
different from, the STL.
							-- Jerry


From johnl@iecc.com Fri, 2 Mar 2001 11:21:46 -0500 (EST)
Date: Fri, 2 Mar 2001 11:21:46 -0500 (EST)
From: John R Levine johnl@iecc.com
Subject: [gclist] Canonical citation for "memory pools"

> Does anyone have a canonical citation for "memory pools/arenas"? i.e. The
> memory management scheme where you allocate several objects in one big
> chunk of space and deallocate all the objects in one go.

PL/I has AREAs, storage pools in which you can allocate variables, and use
either POINTERs or OFFSETs from the base of the area to reference them. The
IBM documentation implies that the main reason to have them is so you can
write an area out to disk or tape, read it back later, and have the
offsets still valid, but it makes it quite clear that when you free the
area, all of its contents are freed as well.
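
As an illustration of why offsets survive a round trip to disk, here is
a small sketch in C++ rather than PL/I.  Area, Cell and Offset are
hypothetical names and this is not how any PL/I implementation works
internally.  Links between cells are stored as byte offsets from the
base of the area, so a byte-for-byte copy of the area is still a valid
data structure, and destroying the area frees everything in it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

using Offset = std::uint32_t;
constexpr Offset kNull = 0xFFFFFFFFu;      // "no cell"

struct Cell { int value; Offset next; };   // the link is an offset, not a pointer

class Area {
public:
    // Allocates a cell at the end of the area and returns its offset.
    Offset alloc_cell(int value, Offset next) {
        Offset at = static_cast<Offset>(bytes_.size());
        bytes_.resize(bytes_.size() + sizeof(Cell));
        Cell c{value, next};
        std::memcpy(&bytes_[at], &c, sizeof c);
        return at;
    }

    Cell read(Offset at) const {
        Cell c;
        std::memcpy(&c, &bytes_[at], sizeof c);
        return c;
    }

    // The raw bytes are what would be written out and read back later.
    const std::vector<unsigned char>& image() const { return bytes_; }

    static Area from_image(std::vector<unsigned char> img) {
        Area a;
        a.bytes_ = std::move(img);
        return a;
    }

private:
    std::vector<unsigned char> bytes_;
};

// Walks a linked list of cells by following offsets.
int sum_list(const Area& area, Offset head) {
    int total = 0;
    for (Offset at = head; at != kNull; ) {
        Cell c = area.read(at);
        total += c.value;
        at = c.next;
    }
    return total;
}
```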

I believe this was in the original version of PL/I which was written in
about 1963.  It's such an obvious and useful technique that I'd be
surprised if it wasn't used in the 1950s.

Regards,
John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Information Superhighwayman wanna-be, http://iecc.com/johnl, Sewer Commissioner
Finger for PGP key, f'print = 3A 5B D0 3F D9 A0 6A A4  2D AC 1E 9E A6 36 A3 47 


From hans_boehm@hp.com Fri, 2 Mar 2001 09:23:15 -0800
Date: Fri, 2 Mar 2001 09:23:15 -0800
From: Boehm, Hans hans_boehm@hp.com
Subject: [gclist] Garbage collection and XML

Dynamically resizable hash tables exist in other places, too.

They're discussed in "The Design and Analysis of Computer Algorithms", Aho,
Hopcroft, and Ullman, 1974.

They're used in SGI's STL implementation. (See
http://www.sgi.com/tech/stl/stl_hashtable.h for the implementation, which is
unfortunately uglified to be namespace-correct.)

They're also used to keep track of things like finalizable objects in our
collector. (See
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/boehm-gc/finalize.c?rev=1.5&content-type=text/x-cvsweb-markup
for the gory details.)

The above implementations use chained hash buckets, but I think that's
pretty much an orthogonal issue, unless you want clever implementations to
bring down the constant in the running time of the resize operation.

Hans


From eliot@parcplace.com Fri, 2 Mar 2001 11:22:28 -0800
Date: Fri, 2 Mar 2001 11:22:28 -0800
From: eliot@parcplace.com eliot@parcplace.com
Subject: [2] [gclist] Garbage collection and XML

-covering message-

+-----------------------------
| Date:	Fri, 02 Mar 2001 10:36:35 -0500 (EST)
| From:	Jerrold Leichter <jerrold.leichter@smarts.com>
| Subject:	Re: [gclist] Garbage collection and XML

| Dynamically adjustable hash tables have been around for 25 years.  They never
| seem to have made it into the standard textbooks, and most programmers are
| unaware of them.  To this day, textbooks continue to list fixed size as a
| minus for hash tables.  (The contents of textbooks gets frozen in place:  The
| textbooks define the course requirements, which in turn define the constraints
| on newer textbooks for the same courses.  Compare a data structures textbook
| published today to one published in 1980 and what you'll find is that the new
| one uses object-oriented concepts and C++ or Java, while the older one uses
| Pascal or C - but the actual algorithms presented are pretty much the same,
| and generally all those algorithms were familiar by the mid-70's.  Tying this
| back to GC:  You can see the same phenomenon in compiler texts, or any other
| texts that might be expected to discuss garbage collection techniques:  They
| say little - and much of what they *do* say is long obsolete - because they've
| *always* said little.)

| I've implemented and used with very good results the "linear hashing" alg=
o-
| rithm, described by Witold Litwin in Proc. 6th International Conf. on Very
| Large Databases, 1980.  A more easily accessible reference is Per-\AA ke =
Larson
| in CACM 31 (1988), pg 446--457.  Linear hashing gives you expected consta=
nt
| time access using a table whose size is (expected) bounded as a constant
| multiple of the number of elements in the table.  It grows and shrinks
| dynamically in small increments - the (expected) cost per growth/shrinkage
| again a constant.  All these constants are linearly related to a user-set=
table
| parameter which is basically the target load factor for the table.

You can find an implementation of dynamic hash tables in Smalltalk, which has
had them from the mid 70's.  There are a number of free Smalltalk
implementations, and they come with all source code.  See e.g. www.squeak.org
or www.cincom.com/smalltalk/.

---
Eliot Miranda                ,,,^..^,,,               mailto:eliot@parcplace.com
ParcPlace division, Cincom   Smalltalk: scene not herd       Tel +1 408 216 4581
3350 Scott Boulevard, Building 36, Santa Clara, CA 95054 USA Fax +1 408 216 4500


From fw@deneb.enyo.de 02 Mar 2001 22:58:24 +0100
Date: 02 Mar 2001 22:58:24 +0100
From: Florian Weimer fw@deneb.enyo.de
Subject: [gclist] Garbage collection and XML

"Boehm, Hans" <hans_boehm@hp.com> writes:

> Dynamically resizable hash tables exist in other places, too.
> 
> They're discussed in "The Design and Analysis of Computer Algorithms", Aho,
> Hopcroft, and Ullman, 1974.
> 
> They're used in SGI's STL implementation. (See
> http://www.sgi.com/tech/stl/stl_hashtable.h for the implementation, which is
> unfortunately uglified to be namespace-correct.)

That seems to be a rather brute force approach: when the hash table is
resized, the entries are newly distributed to a different number of
hash buckets.


From jerrold.leichter@smarts.com Fri, 2 Mar 2001 18:15:04 -0500 (EST)
Date: Fri, 2 Mar 2001 18:15:04 -0500 (EST)
From: Jerrold Leichter jerrold.leichter@smarts.com
Subject: [gclist] Garbage collection and XML

| Dynamically resizable hash tables exist in other places, too.
| 
| They're discussed in "The Design and Analysis of Computer Algorithms", Aho,
| Hopcroft, and Ullman, 1974.
| 
| They're used in SGI's STL implementation. (See
| http://www.sgi.com/tech/stl/stl_hashtable.h for the implementation, which is
| unfortunately uglified to be namespace-correct.)
| 
| They're also used to keep track of things like finalizable objects in our
| collector. (See
| http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/boehm-gc/finalize.c?rev=1.5&content-type=text/x-cvsweb-markup
| for the gory details.)

The design in AHU - also used in your GC - is the basic "when the table gets
too big, make a table of double the size and copy everything".  (I've looked
at the STL implementation, but I can't find the appropriate code.  However, it
looks as if it uses another classical variant:  There's a table of primes, so
I'm guessing that the hash value is taken modulo some prime, and when the
total number of elements exceeds some threshold, we move to the next prime,
make that many buckets, and rehash.)

While this approach works in some appropriate senses, it has many problems.
Most immediately noticeable is the cost of doubling when the size gets large.
Growing something like a vector by copying everything is relatively cheap.
Here, you have to rehash everything.  That can be quite expensive.  Linked-
list buckets significantly increase the space overhead for tables of small
objects.  Splitting a very large hash table will require you to walk through
all the linked lists - which will have poor memory locality.

The neat thing about linear hashing - and similar algorithms - is that they
can grow in small increments.  (Linear hashing works by splitting one bucket
at a time.  When all buckets have been split, you double the number of buckets,
split one, and leave the rest empty for later.)  This gives much more uniform
behavior.
							-- Jerry
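
For readers who have not met linear hashing, the one-bucket-at-a-time growth
described above can be sketched roughly as follows.  This is only an
illustration in Java, not the code discussed in this thread: the class name
LinearHashSet, the initial size of 4 buckets and the load limit of 2.0 are
all assumptions, shrinking is left out, and the bucket list grows by one slot
per split instead of being doubled in advance.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of linear hashing: the table grows by splitting ONE bucket at a time.
// INITIAL and MAX_LOAD are illustrative choices, not values from the thread.
final class LinearHashSet {
    private static final int INITIAL = 4;        // buckets at level 0
    private static final double MAX_LOAD = 2.0;  // target elements per bucket

    private final List<List<String>> buckets = new ArrayList<>();
    private int level = 0;  // this round started with INITIAL << level buckets
    private int split = 0;  // next bucket to be split in this round
    private int size = 0;

    LinearHashSet() {
        for (int i = 0; i < INITIAL; i++) buckets.add(new ArrayList<>());
    }

    private static int hash(String key) {
        return key.hashCode() & 0x7fffffff;  // force a non-negative value
    }

    private int indexFor(String key) {
        int n = INITIAL << level;
        int i = hash(key) % n;
        // Buckets below 'split' were already split this round, so their keys
        // are addressed with the finer hash (modulo 2n).
        return i < split ? hash(key) % (2 * n) : i;
    }

    boolean contains(String key) {
        return buckets.get(indexFor(key)).contains(key);
    }

    boolean add(String key) {
        List<String> bucket = buckets.get(indexFor(key));
        if (bucket.contains(key)) return false;
        bucket.add(key);
        size++;
        if ((double) size / buckets.size() > MAX_LOAD) splitOne();
        return true;
    }

    // Rehash only the keys of one bucket; every other bucket is untouched.
    private void splitOne() {
        int n = INITIAL << level;
        List<String> stay = new ArrayList<>();
        List<String> move = new ArrayList<>();
        for (String key : buckets.get(split)) {
            (hash(key) % (2 * n) == split ? stay : move).add(key);
        }
        buckets.set(split, stay);
        buckets.add(move);       // becomes bucket number split + n
        if (++split == n) {      // round finished: every bucket was split
            level++;
            split = 0;
        }
    }

    int size() { return size; }
    int bucketCount() { return buckets.size(); }

    public static void main(String[] args) {
        LinearHashSet set = new LinearHashSet();
        for (int i = 0; i < 1000; i++) set.add("key" + i);
        System.out.println(set.size() + " keys in " + set.bucketCount() + " buckets");
    }
}
```

Each insertion triggers at most one split, and a split touches the keys of a
single bucket, which is where the uniform behaviour comes from.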


From virtualcyber@erols.com Fri, 2 Mar 2001 20:20:24 -0500
Date: Fri, 2 Mar 2001 20:20:24 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Synchronization of finalization tables

    Hi,
    I was thinking about two ways to improve the use of dynamic
hashtable for storing and accessing finalization methods.

    I just wanted to hear what others thought of them.

    If a garbage collector uses a dynamic hashtable for finalization,
it *might* suffer from lock contention if it is run in multi-threaded mode
with lots of objects that require finalization.  For static tables,
one can lock just a handful of buckets, so that other buckets are free
to be accessed.  For dynamic tables this is not possible,
because, without the locks, the buckets may be
re-malloced during the reads.

    Here are my "improvements" for reducing lock
contention in finalization tables.

(1) If the collector uses object class *type*
as key to finalization methods, finalizing each object
need not remove its finalization method from the table.
This means finalization method lookup is basically a read
operation (no deletes unless the host program is terminating
and the table is being destroyed).

    In such cases, one can reduce the lock contention by
using shared semaphores that distinguish read-locks and
write-locks (I am thinking of databases here).
Reading the table with one thread will not prevent
other threads from reading the table.

    Bad idea?
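
A rough Java sketch of idea (1), assuming the table is keyed by class and
guarded by java.util.concurrent's ReentrantReadWriteLock.  The names
FinalizerTable, register and lookup are invented for this illustration; a
collector written in C or C++ would use its platform's reader/writer lock
instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

// Finalization methods keyed by type; lookups share a read lock, so readers
// do not block each other.  Only registration takes the exclusive write lock.
final class FinalizerTable {
    private final Map<Class<?>, Consumer<Object>> byType = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void register(Class<?> type, Consumer<Object> finalizer) {
        lock.writeLock().lock();
        try {
            byType.put(type, finalizer);  // may resize the table: exclusive access
        } finally {
            lock.writeLock().unlock();
        }
    }

    Consumer<Object> lookup(Class<?> type) {
        lock.readLock().lock();
        try {
            return byType.get(type);      // null if no finalizer is registered
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        FinalizerTable table = new FinalizerTable();
        table.register(String.class, o -> System.out.println("finalizing " + o));
        table.lookup(String.class).accept("a string");
    }
}
```

Whether this beats a single mutex depends on how often registration happens
compared with lookup; the sketch makes no claim about that.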

    (2) If one uses the "type" of object (not the object reference itself)
to hash the finalization method, one can register all finalization
methods at the beginning of the host application.
At runtime, one would then generally not need to perform locked inserts and
deletes, because one just needs read access to the methods.

    Look ma! No locks!
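
Idea (2) might look like this in Java, assuming every finalization method is
known before the worker threads start.  FrozenFinalizers is a made-up name;
the sketch relies on the table never changing after construction and on the
final field for safe publication to other threads.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// All finalizers are registered up front; afterwards the table is read-only,
// so lookups need no lock at all.
final class FrozenFinalizers {
    private final Map<Class<?>, Consumer<Object>> byType;

    FrozenFinalizers(Map<Class<?>, Consumer<Object>> registered) {
        // Private copy, so later changes to 'registered' cannot leak in.
        this.byType = Collections.unmodifiableMap(new HashMap<>(registered));
    }

    Consumer<Object> lookup(Class<?> type) {
        return byType.get(type);  // plain read, no synchronization
    }

    public static void main(String[] args) {
        Map<Class<?>, Consumer<Object>> setup = new HashMap<>();
        setup.put(String.class, o -> System.out.println("finalizing " + o));
        FrozenFinalizers table = new FrozenFinalizers(setup);
        table.lookup(String.class).accept("a string");
    }
}
```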

Take Care
Ji-Yong D. Chung


From ok@atlas.otago.ac.nz Mon, 5 Mar 2001 12:05:12 +1300 (NZDT)
Date: Mon, 5 Mar 2001 12:05:12 +1300 (NZDT)
From: Richard A. O'Keefe ok@atlas.otago.ac.nz
Subject: [gclist] Garbage collection and XML

I wrote (concerning strings in XML):
    > The program puts all of these things in a hash table,
    > so all strings are unique.

"Ji-Yong D. Chung" <virtualcyber@erols.com> asked:
    This is a strong argument in favor of hashtables indeed -- but
    how do you determine the dimension of the hashtable?

Why should I do any such thing?  A hash table is as big as it is.
I wouldn't *dream* of "determining the dimension of" a hash table.
Dynamic hashing is by now a fairly old technique; the code I use
is based on Per-Ake Larson's April 1988 CACM paper.  (Actually, one
ejp@ausmelb.oz wrote the code, and I rewrote it.)

Perl and TCL both have dynamically resizing hash table code that you
can rip out and use in other programs.  I haven't tried the TCL code,
but the Perl code is about as fast as the Larson/ejp code.

    One last detail -- from looking at your previous email messages, I'd
    guess that you used reference counting, right?  Also, from what Ken
    Anderson said, I'd guess that one could obtain similar results
    (provided the code is of the similar quality) from a parser that
    uses garbage collector.

In fact my code was written for a group of applications that load a
document into memory, walk over it a couple of times in order to
extract and format information, and then exit.  My "garbage collector"
was thus the C exit() function.

When you have stuff in a hash table like this, you have to be careful to
ensure that the hash table structure itself does not deceive the garbage
collector into believing strings/nodes are still live.


From ok@atlas.otago.ac.nz Mon, 5 Mar 2001 18:11:17 +1300 (NZDT)
Date: Mon, 5 Mar 2001 18:11:17 +1300 (NZDT)
From: Richard A. O'Keefe ok@atlas.otago.ac.nz
Subject: [gclist] Garbage collection and XML

Ken Anderson <kanderson@bbn.com> provided these numbers
about the DOM:
	Here's a summary of loading the CIM_Schema23.xml file from
	http://www.dmtf.org/spec/cim_schema_v23.html into a Java DOM.
	
	This XML file describes a class hierarchy of 732 classes of the
	DMTF/CIM standard.  Some Statistics:
	
	File sizes:
	
	MBytes of What
	 3.76      XML
	 0.27      zipped
	40.2       DOM (IBM)
	
	Sharing strings can save a significant amount, and some XML parsers
	will do that.  Vectors are given a default size of 10, while 74% have
	size 1, and 12% have size 3.  Trimming Vectors to the right size saves
	5.6MB.
	
	I wrote a SAX parser that read the same file but produced an
	s-expression with structure sharing.  It required only 4.0MB, slightly
	more than the ASCII XML size.  
	
I have added more code to my program to report where the space is going.
Measurements were made on four moderately large files:
 - my collection of CS examination papers
 - SigmodRecord.xml; an example I picked up off the net somewhere
   and "cleaned", reducing it to 72% of its original size.
 - the collected plays of Shakespeare (XML version).  I must say that
   the markup here is not very idiomatic; ID/IDREF(S) attributes could
   have been used to excellent effect but weren't.
 - DocBook; the SGML source of the O'Reilly DocBook book (it's on the
   CD-ROM) that comes with the book.

The space required *with* sharing is measured.  It does NOT include
space for the hash tables themselves, but does include hash links.

The space required *without* sharing is partly measured and partly
calculated (the main calculation is subtracting off 4 bytes for a hash
link from each node).

The DOM estimate assumes that strings are arrays of 16-bit characters
(required) + 4 byte length field, padded to multiple of 4 bytes; other
nodes are all 10 4-byte words (the smallest I could get that would do
everything DOM2).

The file size is as reported by ls -l or wc (which agree).

EXAMS: a collection of CS examination papers (my DTD)

		    strings +     attrs +  elements =     total
Without sharing:  3,755,464 +   312,936 + 1,288,364 = 5,356,764
With sharing:        65,216 +     1,328 +   751,248 =   817,792
DOM estimate:     5,778,556 + 1,043,120 + 2,460,680 = 9,282,356
Source exams.xml:		 	 	    = 2,634,021
DOM/shared = 11.3

SIGMOD: the SigMod Record catalogue

		   strings +   attrs + elements =     total
Without sharing:   648,824 +  82,548 +  221,100 =   952,472
With sharing:      131,784 +  16,752 +  217,748 =   366,284
DOM estimate:    1,189,308 + 275,160 +  332,680 = 1,797,148
Source SigmodRecord.xml:                        =   360,329
DOM/shared = 4.9

PLAYS: collected plays of Shakespeare (as found on net)

		    strings + attrs +  elements =      total
Without sharing: 10,677,460 +     0 + 4,305,996 = 14,983,456
With sharing:     4,901,656 +     0 + 3,751,320 =  8,652,976
DOM estimate:    19,214,762 +     0 + 7,187,600 = 26,402,362
Source plays.xml:                               =  7,648,502
DOM/shared = 3.1

% DOCBOOK: source of DocBook book, parsed by nsgmls

		    strings +     attrs +  elements =      total
Without sharing: 30,266,564 +   497,196 + 1,458,096 = 32,221,856
With sharing:       641,168 +    40,272 + 1,108,608 =  1,790,048
DOM estimate:    60,857,232 + 1,657,320 + 2,665,160 = 65,179,712 
Source docbook.sgm:                                 =  2,896,326
DOM/shared = 36.4

One thing that makes the DOM look bad is that it specifies that
strings are arrays of 16-bit UTF-16-encoded Unicode characters.
It happens that all of these files use ASCII, or possibly just a
few Latin-1 or Unicode symbols, so that using UTF-8 would shrink
the string space by a factor of 2.  In fact it is possible to devise
a Unicode encoding which is guaranteed to take no more than 3 bytes
for any character in planes 0,1,2,14 and to take 1 byte for any
ISO Latin 1 character, an encoding which is if anything easier than
UTF-8, as long as you don't require special treatment for all NUL bytes
(which Java, Javascript, and my C code do not.)
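
The factor of two for ASCII text is easy to check.  A small Java sketch using
only the standard charsets; the class and method names are made up for
illustration, and the 3-byte encoding mentioned above is not shown:

```java
import java.nio.charset.StandardCharsets;

// Compares the storage needed for the characters of a string when held as
// 16-bit code units (what the DOM specifies) and as UTF-8.
final class EncodingSizes {
    static int utf16Bytes(String s) {
        return 2 * s.length();  // length() counts 16-bit code units
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "To be, or not to be";
        System.out.println(utf16Bytes(ascii) + " bytes as 16-bit units");
        System.out.println(utf8Bytes(ascii) + " bytes as UTF-8");
    }
}
```

For pure ASCII the UTF-8 form is exactly half the size; a Latin-1 character
such as e-acute takes 2 bytes in either form.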

If you poke around in the statistics a bit, you find that long strings
are unique anyway, but short strings have very many references indeed,
so that shared strings really pay off.
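
The sharing itself needs very little code.  A hedged sketch in Java (my
measurements came from C code, which is not shown here; StringPool and share
are names invented for this example):

```java
import java.util.HashMap;
import java.util.Map;

// A pool that maps every string to one canonical copy, so repeated short
// strings (element names, common attribute values) are stored only once.
final class StringPool {
    private final Map<String, String> pool = new HashMap<>();
    private long requests = 0;

    String share(String s) {
        requests++;
        String canonical = pool.putIfAbsent(s, s);  // null if s is new
        return canonical != null ? canonical : s;
    }

    int distinct() { return pool.size(); }
    long requests() { return requests; }

    public static void main(String[] args) {
        StringPool names = new StringPool();
        for (String tag : new String[] {"SPEECH", "LINE", "LINE", "LINE", "SPEECH"}) {
            names.share(tag);
        }
        System.out.println(names.requests() + " requests, " + names.distinct() + " stored");
    }
}
```

A pool like this keeps every string alive for as long as the pool exists, so
under a garbage collector it would need weak references or an explicit purge.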

Note that with the worst ratio here, I could go through three successive
versions of a document without bothering to garbage collect, and still
take a wee bit less space than the DOM would require for one copy, so that
for several reasonable filtering/rearranging applications, *not* garbage
collecting is a perfectly workable strategy.
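
The arithmetic behind that remark, using the PLAYS totals from the table
above (the worst DOM/shared ratio of the four).  The class name is made up;
the two constants are copied from the table:

```java
// Checks how many shared-representation copies of the PLAYS document fit in
// the space the DOM estimate needs for a single copy.
final class PlaysFootprint {
    static final long SHARED = 8_652_976L;         // "With sharing" total
    static final long DOM_ESTIMATE = 26_402_362L;  // "DOM estimate" total

    static long copiesThatFit() {
        return DOM_ESTIMATE / SHARED;  // integer division: whole copies only
    }

    static long spareBytes() {
        return DOM_ESTIMATE - copiesThatFit() * SHARED;
    }

    public static void main(String[] args) {
        System.out.println(copiesThatFit() + " copies fit, with "
                + spareBytes() + " bytes to spare");
    }
}
```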

I think the first reference to hash consing was in a paper of Ershov's
in the early 60's; the method was used in a compiler.  This is not a
new idea (except to the people who invented the DOM, seemingly.)


From virtualcyber@erols.com Mon, 5 Mar 2001 03:55:27 -0500
Date: Mon, 5 Mar 2001 03:55:27 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Garbage collection and XML

    Hi,

> [ your results were ...]
>
> EXAMS: a collection of CS examination papers (my DTD)
>
>                     strings +     attrs +  elements =     total
> Without sharing:  3,755,464 +   312,936 + 1,288,364 = 5,356,764
> With sharing:        65,216 +     1,328 +   751,248 =   817,792
> DOM estimate:     5,778,556 + 1,043,120 + 2,460,680 = 9,282,356
> Source exams.xml:                                   = 2,634,021
> DOM/shared = 11.3
>
> SIGMOD: the SigMod Record catalogue
>
>                    strings +   attrs + elements =     total
> Without sharing:   648,824 +  82,548 +  221,100 =   952,472
> With sharing:      131,784 +  16,752 +  217,748 =   366,284
> DOM estimate:    1,189,308 + 275,160 +  332,680 = 1,797,148
> Source SigmodRecord.xml:                        =   360,329
> DOM/shared = 4.9
>
> PLAYS: collected plays of Shakespeare (as found on net)
>
>                     strings + attrs +  elements =      total
> Without sharing: 10,677,460 +     0 + 4,305,996 = 14,983,456
> With sharing:     4,901,656 +     0 + 3,751,320 =  8,652,976
> DOM estimate:    19,214,762 +     0 + 7,187,600 = 26,402,362
> Source plays.xml:                               =  7,648,502
> DOM/shared = 3.1
>
> % DOCBOOK: source of DocBook book, parsed by nsgmls
>
>                     strings +     attrs +  elements =      total
> Without sharing: 30,266,564 +   497,196 + 1,458,096 = 32,221,856
> With sharing:       641,168 +    40,272 + 1,108,608 =  1,790,048
> DOM estimate:    60,857,232 + 1,657,320 + 2,665,160 = 65,179,712
> Source docbook.sgm:                                 =  2,896,326
> DOM/shared = 36.4
>

    Really interesting results.  What you have shown here, though, does not
seem to be that DOM API is terribly bad for memory savings.  Rather, it seems to
show that DOM implementations should use shared strings/attributes.

    My observations are as follows:

    (1) If you look at the first result, and assume that you were using
shared strings for DOM -- it would save nearly 1 meg.  In fact, if Java DOM used
shared strings, it would be far more memory efficient than the
non-shared case.  This holds for the rest of the examples
as well.

    (2) I noticed that DOM's memory use for attributes is bad.  But it
always uses about 3 times as much as the unshared model.  Again, this
seems to show that it is sharing which has more effect than the fact
one is following the DOM API spec.

    These observations lead to the following conclusion:
While DOM API is not designed for writing memory efficient apps,
the biggest problems in Java DOM stem not from the
API spec, but the underlying implementation.
Your parser saves memory because it is just well implemented.
The main theme here is: use shared objects as much as possible.

    You might ask me, how do you know if Java does not use
shared strings?  I think your numbers clearly indicate this -- DOM's
string memory consumption is always approximately 2 x the amount
for the non-shared case.  That they are proportionate shows DOM
is probably using an algorithm similar to the unshared case.  A similar
observation holds for attributes.


From ok@atlas.otago.ac.nz Tue, 6 Mar 2001 10:48:19 +1300 (NZDT)
Date: Tue, 6 Mar 2001 10:48:19 +1300 (NZDT)
From: Richard A. O'Keefe ok@atlas.otago.ac.nz
Subject: [gclist] Garbage collection and XML

	Really interesting results.  What you have shown here, though,
	does not seem to be that DOM API is terribly bad for memory savings.
	Rather, it seems to show that DOM implementations should use shared
	strings/attributes.

If you follow the letter of the DOM specification (the CORBA IDL, not the
Java and Javascript bindings) that is not *allowed*.

I think you may have overlooked one point I made, which is that the
DOM flatly and unconditionally *requires* that strings be sequences of
16-bit characters.  That's a factor of two overhead in space, even if
a DOM implementation *were* to use shared strings.

I think you may also have overlooked that for some kinds of documents,
there is a substantial saving to be made from sharing attribute
triples (Name,Type,Value), which the DOM forbids.  Not "the DOM does not
require" or "the DOM does not discuss", but "the DOM *forbids*".  You
cannot have shared attribute nodes and still call your interface "DOM".

	(1) If you look at the first result, and assume that you were
	using shared strings for DOM -- it would save nearly 1 meg.  In
	fact, if Java DOM used shared strings, it would be far more
	memory efficient than the non-shared case.  This holds for the
	rest of the examples as well.

In fact the space overheads for Java are considerably worse than that.
A Java object typically has 2 words of overhead.
A Java String object has 4 words of local data.
One of those words is a pointer to an array of Unicode characters.
(The idea is if you chop a string into substrings, the space cost per
substring is constant.)
An array of n Unicode characters will be (3 + ceiling(n/2)) words long.
The total is thus 36+4*ceiling(n/2) bytes for a string, where I assumed
4+4*ceiling(n/2) bytes.
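
Spelled out as code, with 4-byte words and the word counts given above (a
sketch of this accounting only; real JVM layouts vary, and the names are
made up):

```java
// Byte cost of an n-character string under the two accountings above.
final class StringFootprint {
    static int halfWords(int n) {
        return (n + 1) / 2;  // ceiling(n/2): two 16-bit characters per word
    }

    // Java String: 2 words object overhead + 4 words of fields
    // + character array of (3 + ceiling(n/2)) words.
    static int javaStringBytes(int n) {
        int words = 2 + 4 + 3 + halfWords(n);
        return 4 * words;
    }

    // What the DOM estimate assumed: 4-byte length field + padded characters.
    static int estimateBytes(int n) {
        return 4 + 4 * halfWords(n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 1, 7, 40}) {
            System.out.println(n + " chars: Java " + javaStringBytes(n)
                    + " bytes, estimate " + estimateBytes(n) + " bytes");
        }
    }
}
```

The difference is a constant 32 bytes per string, which matters most for the
short strings that dominate these documents.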

To implement unique strings in Java would add to the space overhead
per string.

Having read the ECMAscript standard and Netscape's Javascript Reference
manual, it is clear that the space overheads for Javascript strings must
be comparable.

Note that in Java, string operations are guaranteed to return new objects,
so it is possible for a Java program to *tell* whether strings are shared
or not.  So it really isn't clear whether/how much sharing is allowed.

	(2) I noticed that DOM's memory use for attributes is bad.  But it uses
	always about 3 times as much as the unshared model.  Again, this
	seems to show that it is sharing which has more effect than the fact
	one is following the DOM API spec.
	
Yes, but the DOM treats attributes (Name,Value) like any other kind of
node.  Every attribute node knows which Element node owns it; no kind
of sharing is allowed AT ALL.  To follow the DOM API spec *is* to refuse
to share attribute nodes.  I suggest reading the DOM specification.

	These observations lead to the following conclusion:
	While DOM API is not designed for writing memory efficient apps,
	the biggest problems in Java DOM stem not from the
	API spec, but the underlying implementation.

Three kinds of item:
1) strings
   Requiring that strings be sequences of 16-bit characters rather than
   UTF-8 or TR-8 encoded guarantees AT LEAST A FACTOR OF TWO compared
   with a more compact representation, even if strings are shared.

2) attributes
   AT LEAST A FACTOR OF TWO space cost would be incurred by all the
   cross-links you need to store, even if the DOM allowed sharing,
   which it doesn't.

3) elements
   AT LEAST A FACTOR OF TWO space cost would be incurred by all the
   cross-links you need to store, even if the DOM allowed sharing,
   which it doesn't.

So AT LEAST A FACTOR OF TWO space cost is due simply to the requirements
of the DOM, however clever the implementor might be.  You *can't* match
the results I quoted and still have something even close to the DOM.

	    You might ask me, how do you know if Java does not use
	shared strings?

I don't think you noticed the word "estimate".  I was careful to say that
the "DOM" numbers, unlike the others, were NOT measured, but estimated.
(As it happens, I have a DOM in C which I wrote to be sure I understood it,
but once I had, the measured time overheads convinced me not to use it.)

In fact I did not account for all the overheads in Java.  The figures for
a Java implementation of the DOM would be considerably worse.
	

From Bill.Foote@eng.sun.com  Tue Mar  6 13:56:44 2001
From: Bill.Foote@eng.sun.com (Bill Foote)
Date: Tue, 06 Mar 2001 14:56:44 +0100
Subject: [gclist] Garbage collection and XML
References: <200103052148.KAA31951@atlas.otago.ac.nz>
Message-ID: <3AA4EC9C.16D1260E@eng.sun.com>


"Richard A. O'Keefe" wrote:
> 
>         Really interesting results.  What you have shown here, though,
>         does not seem to be that DOM API is terribly bad for memory savings.
>         Rather, it seems to show that DOM implementations should use shared
>         strings/attributes.
> 
> If you follow the letter of the DOM specification (the CORBA IDL, not the
> Java and Javascript bindings) that is not *allowed*.
> 
> I think you may have overlooked one point I made, which is that the
> DOM flatly and unconditionally *requires* that strings be sequences of
> 16-bit characters.  That's a factor of two overhead in space, even if
> a DOM implementation *were* to use shared strings.


How on Earth did they manage to word a normative requirement that does
that?

Surely, if I store strings UTF-8 encoded in an immutable string type, there's
no way for an application to tell I'm sharing strings behind the scenes.  I
have trouble imagining wording that could place a testable normative requirement
like this on an API.  I'm genuinely curious; what wording in the DOM spec says
this?

Confused,

Bill
-- 
Bill Foote                                         bill.foote @ sun.com
Java TV Standards Engineer          http://java.sun.com/products/javatv


From ok@atlas.otago.ac.nz  Tue Mar  6 22:46:03 2001
From: ok@atlas.otago.ac.nz (Richard A. O'Keefe)
Date: Wed, 7 Mar 2001 11:46:03 +1300 (NZDT)
Subject: [gclist] Garbage collection and XML
Message-ID: <200103062246.LAA04709@atlas.otago.ac.nz>

I wrote:
	>If you follow the letter of the DOM specification (the CORBA IDL, not the
	>Java and Javascript bindings) that is not *allowed*.
	
From: David Chase <chase@world.std.com> asked
	Pardon my potential ignorance here, but who would care if there
	were sharing, especially if:
	
	1 - the binding was done to a GC'd language, where last-owner-of-a
	    string is less of an issue.
	
The two bindings in the DOM specifications are to Java and Javascript,
where strings are immutable.  It's really difficult to figure out *what*
the DOM specifies, because
 - the primary specification is in CORBA IDL, in which every time you
   ask for a string the remote system sends you back a new copy
 - the object chosen to represent strings in the CORBA IDL for the DOM
   is a *mutable* array of 16-bit characters
 - the object chosen to represent strings in the Java and Javascript
   bindings is an *immutable* String of 16-bit characters.

I don't know about Javascript, but in Java it is perfectly possible to
have two String objects with the same (immutable!) state which must act
the same for all future time, but have distinct identities.  A Java
program which tried to keep track of which nodes strings came from by
using String identities as keys could be confused if strings were shared.
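
A small Java illustration of that hazard, using only java.util's
IdentityHashMap and String.intern().  The class and method names are invented
for the example:

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Shows that equal Strings can have distinct identities, and what happens to
// an identity-keyed table once the strings are shared.
final class StringIdentityDemo {
    // Number of entries an identity-keyed map holds after inserting the keys.
    static int distinctIdentities(String... keys) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String key : keys) seen.put(key, Boolean.TRUE);
        return seen.size();
    }

    public static void main(String[] args) {
        String a = new String("chapter");  // two objects, same state
        String b = new String("chapter");
        System.out.println("equals: " + a.equals(b));
        System.out.println("same object: " + (a == b));
        System.out.println("unshared keys: " + distinctIdentities(a, b));
        System.out.println("shared keys: " + distinctIdentities(a.intern(), b.intern()));
    }
}
```

With unshared strings the identity map holds two entries; once both are
replaced by their interned copy it holds one, so a program relying on the
first behaviour would see two nodes collapse into one key.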

I came to hate the DOM when I tried to implement it in Smalltalk.  Since
Smalltalk strings are *mutable*, it was important to know whether I was
allowed to return the string object already inside a text node, or whether
I had to copy it.  I couldn't figure out *what* to do, and amongst other
things discovered the contradiction above, that according to the IDL you
get a new mutable array whenever you ask about a string in the model, but
according to the Java and Javascript bindings you get an immutable object.

An explicit statement about sharing in the DOM specification would help a
LOT, as would explicit advice about what to do in languages like Eiffel,
Lisp, and Smalltalk, where strings are mutable.

However, it *is* absolutely clear that no sharing of non-string objects is
allowed at all.  The figures I have show that you save a useful amount of
space by sharing attribute=value bindings.

	2 - the resulting implementation were much smaller/faster.
	
It is clear that there is a substantial space saving from sharing strings.
It's not just "structural" strings like element names and attribute names
either.  There are a lot of repetitions of "content" strings like attribute
values and #PCDATA nodes.

Since the DOM absolutely requires the use of UTF-16-encoded strings
(UTF-8 is *NOT ALLOWED*, still less anything more compact than that),
there is still at least a factor of two compared with what you can get
in C or Smalltalk.  Why does the DOM require UTF-16?  Because it's *really*
an attempt to pretty up something the browser vendors bodged together to be
accessed from Javascript, which has UTF-16-encoded strings these days.

	Is this possibly just some sort of pin-headed overspecification
	that may safely be ignored, or do people actually write programs
	(in particular, Java programs) that depend on the lack of sharing?
	
As noted above, I have to admit that the specification is actually
inconsistent on this point.  However, I also note that there is nothing
in the Javascript reference material I recently downloaded from Netscape
or the ECMA 262 standard for ECMAscript that would make sharing
particularly easy to implement in Javascript.  I thought there _was_ a
UniqueString class in Java, but when I looked for it, I couldn't find one.
Perhaps someone can correct me about that.  It is far easier to implement
the DOM *without* string sharing in Java and Javascript.

And as noted before, any other kind of sharing is *explicitly* forbidden,
and could not be provided without comprehensively wrecking the entire design.


From hans_boehm@hp.com  Tue Mar  6 23:18:51 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Tue, 6 Mar 2001 15:18:51 -0800
Subject: [gclist] Garbage collection and XML
Message-ID: <140D21516EC2D3119EE7009027876644049B5C74@hplex1.hpl.hp.com>

>  I thought there _was_ a
> UniqueString class in Java, but when I looked for it, I 
> couldn't find one.
Isn't java.lang.String.intern() what you want?

Hans


From ok@atlas.otago.ac.nz  Wed Mar  7 00:59:41 2001
From: ok@atlas.otago.ac.nz (Richard A. O'Keefe)
Date: Wed, 7 Mar 2001 13:59:41 +1300 (NZDT)
Subject: [gclist] Garbage collection and XML
Message-ID: <200103070059.NAA06646@atlas.otago.ac.nz>

I wrote:
    I thought there _was_ a UniqueString class in Java, but when I
    looked for it, I couldn't find one.

Hans Boehm replied:
    Isn't java.lang.String.intern() what you want?
	
It is indeed the thing that I had vaguely remembered.

What I *wanted* was a UniqueString class, with a less space-hungry
representation than Java's String class.  A Java String requires
1 or 2 words of per-object overhead
+
4 instance variables (one word each)
+
2 or 3 words of per-array overhead
+
storage for the string proper.

The overheads are pretty substantial.


From gcolvin@us.oracle.com  Wed Mar  7 01:04:03 2001
From: gcolvin@us.oracle.com (Greg Colvin)
Date: Tue, 6 Mar 2001 18:04:03 -0700
Subject: [gclist] Garbage collection and XML
References: <200103070059.NAA06646@atlas.otago.ac.nz>
Message-ID: <011e01c0a6a2$81f1e3c0$37781990@us.oracle.com>

From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
> I wrote:
>     I thought there _was_ a UniqueString class in Java, but when I
>     looked for it, I couldn't find one.
> 
> Hans Boehm replied:
>     Isn't java.lang.String.intern() what you want?
> 
> It is indeed the thing that I had vaguely remembered.
> 
> What I *wanted* was a UniqueString class, with a less space-hungry
> representation than Java's String class.  A Java String requires
> 1 or 2 words of per-object overhead
> +
> 4 instance variables (one word each)
> +
> 2 or 3 words of per-array overhead
> +
> storage for the string proper.
> 
> The overheads are pretty substantial.

If you care about overheads then you shouldn't be using java ;->


From bos@serpentine.com  Wed Mar  7 01:07:57 2001
From: bos@serpentine.com (Bryan O'Sullivan)
Date: Tue, 6 Mar 2001 17:07:57 -0800 (PST)
Subject: [gclist] Garbage collection and XML
In-Reply-To: <200103070059.NAA06646@atlas.otago.ac.nz>
References: <200103070059.NAA06646@atlas.otago.ac.nz>
Message-ID: <15013.35309.439591.995984@pelerin.serpentine.com>

r> What I *wanted* was a UniqueString class, with a less space-hungry
r> representation than Java's String class.

Since java.lang.String can't be subclassed and Java's notion of type
equivalence is based on name, not structure, I fear that a
UniqueString would be something of an annoyance to use in practice.

If only Cardelli and company had glommed C-like syntax over Modula-3's
semantics, we might inhabit a slightly happier world.

	<b


From davidbak@microsoft.com  Tue Mar  6 23:56:17 2001
From: davidbak@microsoft.com (David Bakin)
Date: Tue, 6 Mar 2001 15:56:17 -0800
Subject: [gclist] Garbage collection and XML
Message-ID: <6605DE4621E5934593474F620A14D70001EDE9B1@red-msg-07.redmond.corp.microsoft.com>

Since the DOM presumably defines an interface (esp. as you're talking
about a CORBA IDL) I don't see what the requirement for what strings
(mutable/immutable/16-bit/8-bit/whatever) look like on the outside have
to do with how the implementation stores nodes.  What's to keep an
implementation from doing whatever sharing and compression it wishes and
just satisfying the semantics whenever a caller traverses the DOM and
executes getters or setters for string valued attributes?
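
To make the question concrete, here is roughly what I have in mind, sketched
in Java.  CompactText is a made-up class, not part of any DOM binding, the
pool is not thread-safe, and whether the DOM specification permits this is
exactly what is being asked:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Stores text as shared UTF-8 bytes internally, but hands every caller a
// fresh 16-bit String, so the sharing is invisible through the getter.
final class CompactText {
    // ByteBuffer supplies content-based equals/hashCode for the byte arrays.
    private static final Map<ByteBuffer, byte[]> POOL = new HashMap<>();

    private final byte[] utf8;  // shared between nodes, never modified

    CompactText(String data) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        byte[] pooled = POOL.putIfAbsent(ByteBuffer.wrap(bytes), bytes);
        this.utf8 = pooled != null ? pooled : bytes;
    }

    String getData() {
        return new String(utf8, StandardCharsets.UTF_8);  // new object per call
    }

    boolean sharesStorageWith(CompactText other) {
        return utf8 == other.utf8;
    }

    public static void main(String[] args) {
        CompactText a = new CompactText("Enter HAMLET");
        CompactText b = new CompactText("Enter HAMLET");
        System.out.println("shared storage: " + a.sharesStorageWith(b));
        System.out.println("getter result: " + a.getData());
    }
}
```

The cost is a decode and an allocation on every call to the getter, so this
trades time for space.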

-- Dave

-----Original Message-----
From: Richard A. O'Keefe [mailto:ok@atlas.otago.ac.nz]
Sent: Tuesday, March 06, 2001 2:46 PM
To: chase@world.std.com; gclist@iecc.com
Cc: icis-developers@bbn.com
Subject: Re: [gclist] Garbage collection and XML


I wrote:
	>If you follow the letter of the DOM specification (the CORBA
IDL, not the
	>Java and Javascript bindings) that is not *allowed*.
=09
From: David Chase <chase@world.std.com> asked
	Pardon my potential ignorance here, but who would care if there
	were sharing, especially if:
=09
	1 - the binding was done to a GC'd language, where
last-owner-of-a
	    string is less of an issue.
=09
The two bindings in the DOM specifications are to Java and Javascript,
where strings are immutable.  It's really difficult to figure out *what*
the DOM specifies, because
 - the primary specification is in CORBA IDL, in which every time you
   ask for a string the remote system sends you back a new copy
 - the object chosen to represent strings in the CORBA IDL for the DOM
   is a *mutable* array of 16-bit characters
 - the object chosen to represent strings in the Java and Javascript
   bindings is an *immutable* String of 16-bit characters.

I don't know about Javascript, but in Java it is perfectly possible to
have two String objects with the same (immutable!) state which must act
the same for all future time, but have distinct identities.  A Java
program which tried to keep track of which nodes strings came from by
using String identities as keys could be confused if strings were
shared.

I came to hate the DOM when I tried to implement it in Smalltalk.  Since
Smalltalk strings are *mutable*, it was important to know whether I was
allowed to return the string object already inside a text node, or
whether
I had to copy it.  I couldn't figure out *what* to do, and amongst other
things discovered the contradiction above, that according to the IDL you
get a new mutable array whenever you ask about a string in the model,
but
according to the Java and Javascript bindings you get an immutable
object.

An explicit statement about sharing in the DOM specification would help a
LOT, as would explicit advice about what to do in languages like Eiffel,
Lisp, and Smalltalk, where strings are mutable.

However, it *is* absolutely clear that no sharing of non-string objects is
allowed at all.  The figures I have show that you save a useful amount of
space by sharing attribute=value bindings.

	2 - the resulting implementation were much smaller/faster.

It is clear that there is a substantial space saving from sharing strings.
It's not just "structural" strings like element names and attribute names
either.  There are a lot of repetitions of "content" strings like attribute
values and #PCDATA nodes.

Since the DOM absolutely requires the use of UTF-16-encoded strings
(UTF-8 is *NOT ALLOWED*, still less anything more compact than that),
there is still at least a factor of two compared with what you can get
in C or Smalltalk.  Why does the DOM require UTF-16?  Because it's *really*
an attempt to pretty up something the browser vendors bodged together to be
accessed from Javascript, which has UTF-16-encoded strings these days.

	Is this possibly just some sort of pin-headed overspecification
	that may safely be ignored, or do people actually write programs
	(in particular, Java programs) that depend on the lack of sharing?

As noted above, I have to admit that the specification is actually
inconsistent on this point.  However, I also note that there is nothing
in the Javascript reference material I recently downloaded from Netscape
or the ECMA 262 standard for ECMAscript that would make sharing
particularly easy to implement in Javascript.  I thought there _was_ a
UniqueString class in Java, but when I looked for it, I couldn't find one.
Perhaps someone can correct me about that.  It is far easier to implement
the DOM *without* string sharing in Java and Javascript.

And as noted before, any other kind of sharing is *explicitly* forbidden,
and could not be provided without comprehensively wrecking the entire
design.


From ok@atlas.otago.ac.nz  Wed Mar  7 02:32:05 2001
From: ok@atlas.otago.ac.nz (Richard A. O'Keefe)
Date: Wed, 7 Mar 2001 15:32:05 +1300 (NZDT)
Subject: [gclist] Garbage collection and XML
Message-ID: <200103070232.PAA04727@atlas.otago.ac.nz>

	From: "David Bakin" <davidbak@microsoft.com>

	Since the DOM presumably defines an interface (esp. as you're talking
	about a CORBA IDL) I don't see what the requirement for what strings
	(mutable/immutable/16-bit/8-bit/whatever) look like on the outside have
	to do with how the implementation stores nodes.

*Strings* and *Nodes* are indeed different things, with different sharing
payoffs and possibilities.

But it is precisely the job of an interface to state the properties of
the objects it is an interface *to*.

The CORBA IDL in the DOM specifications (which I keep on-line and check
when I make claims about the DOM) states that when you ask a document
model for a string, what you get back is a sequence of characters.
With respect to encoding, the DOM explicitly says
    "APPLICATIONS must encode DOMString using UTF-16."

Never mind the DOM:  for all alphabetic and some syllabic scripts you can
do a *lot* better space-wise than using UTF-16, which is what you get in
Java or Javascript.

Thing is, if you *don't* implement things pretty close to the "natural"
image of the DOM, you are going to do unbelievable amounts of copying.

I discussed sharing at four levels:
    - "structural" strings			HUGE payoff for sharing
    - "content" strings				SERIOUS payoff for sharing
    - **XML** attribute=value nodes		SERIOUS payoff for sharing
    - element nodes				MODERATE payoff for sharing

If anyone will take the trouble to read the DOM, they will discover that
sharing the last two of those is unambiguously ruled out, in the sense
that if you have
    <foo><bar>Ho</bar> <bar>Ho</bar> <bar>Ho</bar></foo>
then you *must* have three distinct <bar>Ho</bar> Element nodes with
three distinct identities and distinct relationships to other nodes.
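
A small check of the distinct-identity requirement, sketched against the
org.w3c.dom Java binding as shipped with the JDK's javax.xml.parsers
(an assumption about the reader's environment; the class name DistinctBars
is made up). It parses the fragment above and hands back the <bar> elements
so their identities can be compared.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DistinctBars {
    static final String XML = "<foo><bar>Ho</bar> <bar>Ho</bar> <bar>Ho</bar></foo>";

    // Parses XML and returns its <bar> elements in document order.
    static NodeList bars() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagName("bar");
        } catch (Exception e) {
            throw new RuntimeException(e);   // keep the sketch free of checked exceptions
        }
    }

    public static void main(String[] args) {
        NodeList bars = bars();
        Node first = bars.item(0);
        Node second = bars.item(1);
        System.out.println("elements found:   " + bars.getLength());
        System.out.println("same node:        " + first.isSameNode(second));
        System.out.println("same text inside: "
                + first.getTextContent().equals(second.getTextContent()));
    }
}
```

The elements carry equal content but each has its own siblings and parent
link, which is what makes sharing them behind the interface so awkward.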

If a DOM implementation implements the DocumentType at all well, it is
likely that it will share "structural" strings.  

A DOM implementation cannot share attribute=value or Element nodes
without some very fancy and rather expensive behind-the-scenes calculation,
which would more than offset the space savings.

The one open area is sharing "content" strings (values of attributes and
PCData nodes).  The DOM actually envisages that these will be too big for
your programming language so that you have to get the information out in
pieces (CharacterData::substringData()).  A DOM implementor who hasn't
done the measurements might think (mistakenly) that it didn't pay to share
content strings.  Does anyone actually *know* what actual DOM implementations
do?

The reported high space cost of DOM use (I have heard 8 to 10 times as
many bytes in core as XML on disc) is most simply explained on the hypothesis
that popular DOM implementations *don't* share content strings.

	What's to keep an implementation from doing whatever sharing and
	compression it wishes and just satisfying the semantics whenever
	a caller traverses the DOM and executes getters or setters for string
	valued attributes?
	
A four-letter word:  T I M E.

Amongst other things, the Document Object Model is an *Object* model and
is totally oriented to manipulating documents by mutating data structures.
When you select a collection of nodes, a NodeList is stated to be "live";
mutations to the original document show up in the list.  Until you've
actually read the DOM2, you wouldn't believe how complex things can get
when someone makes a serious attempt to define iteration over mutating
data structures.
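
To make "live" concrete, here is a sketch using the org.w3c.dom Java
binding bundled with the JDK (an assumption; the class and method names
are made up). The list is selected once, the document is mutated
afterwards, and the same list object is asked for its length again.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LiveNodeList {
    // Returns { length before, length after } appending a fourth <bar>.
    static int[] lengthsAroundAppend() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<foo><bar/><bar/><bar/></foo>")));
            NodeList bars = doc.getElementsByTagName("bar");   // selected once
            int before = bars.getLength();
            doc.getDocumentElement().appendChild(doc.createElement("bar"));
            int after = bars.getLength();                      // same list, no re-query
            return new int[] { before, after };
        } catch (Exception e) {
            throw new RuntimeException(e);   // keep the sketch free of checked exceptions
        }
    }

    public static void main(String[] args) {
        int[] lengths = lengthsAroundAppend();
        System.out.println("before append: " + lengths[0]);
        System.out.println("after append:  " + lengths[1]);
    }
}
```

An implementation that stores nodes in some compressed or shared form still
has to make every outstanding list behave this way after every mutation.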

This is the garbage collection list, not the DOM list, so I'd like to
close with a number of general storage management observations.

1.  Garbage collection or no garbage collection,
    garbage *avoidance* in the design is a good thing if you can do
    it without compromising other goals.

2.  Hash consing is astonishingly effective,
    but only with data you aren't mutating a lot, and which is not
    required to have its own unique identity.

3.  There is no substitute for measurements on real data.

4.  Seemingly unimportant design decisions can have huge effects on
    memory requirements.

5.  There is a tradeoff between memory and time, but it doesn't pay
    to assume you could afford both ends of the tradeoff.

6.  Sturgeon's Law applies to standards, even W3C recommendations.
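
Observation 2 can be sketched in a few lines. This is a generic
hash-consing table for immutable attribute=value pairs, not code from any
DOM implementation; AttrPair and its methods are hypothetical names. Note
that it only works because the pairs are immutable and callers agree not
to rely on per-occurrence identity, which is precisely what the DOM rules
out for Attr nodes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable attribute=value pair with a canonicalising factory.
final class AttrPair {
    final String name;
    final String value;

    private AttrPair(String name, String value) {
        this.name = name;
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AttrPair)) return false;
        AttrPair p = (AttrPair) o;
        return name.equals(p.name) && value.equals(p.value);
    }

    @Override
    public int hashCode() {
        return 31 * name.hashCode() + value.hashCode();
    }

    // The hash-consing table: maps each pair to its one canonical instance.
    // It holds strong references, so entries are never reclaimed.
    private static final Map<AttrPair, AttrPair> TABLE = new HashMap<>();

    // Returns the canonical pair, keeping a new one only on first sight.
    static synchronized AttrPair of(String name, String value) {
        AttrPair candidate = new AttrPair(name, value);
        AttrPair canonical = TABLE.get(candidate);
        if (canonical == null) {
            TABLE.put(candidate, candidate);
            canonical = candidate;
        }
        return canonical;
    }

    public static void main(String[] args) {
        AttrPair a = AttrPair.of("class", "note");
        AttrPair b = AttrPair.of("class", "note");
        System.out.println("same instance for equal pairs: " + (a == b));
    }
}
```

A table with weak entries would let the collector reclaim pairs that are
no longer used anywhere; the strong map here keeps the sketch short.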


From chase@world.std.com  Wed Mar  7 04:57:06 2001
From: chase@world.std.com (David Chase)
Date: Tue, 06 Mar 2001 23:57:06 -0500
Subject: [gclist] Garbage collection and XML
In-Reply-To: <15013.35309.439591.995984@pelerin.serpentine.com>
References: <200103070059.NAA06646@atlas.otago.ac.nz>
 <200103070059.NAA06646@atlas.otago.ac.nz>
Message-ID: <4.3.2.7.0.20010306215829.02038008@pop.std.com>

At 05:07 PM 3/6/2001 -0800, Bryan O'Sullivan wrote:

actually, Richard O'Keefe wrote:
>r> What I *wanted* was a UniqueString class, with a less space-hungry
>r> representation than Java's String class.

The best you're likely to get out of most Java implementations
for any type is 2 words of header, plus one or two for data,
depending on how they deal with possible alignment of doubles
and longs.

Java strings are also not necessarily quite as costly
as you make them out to be.  The basic object is
header + array pointer + offset + count (5 or 6 words, depending
on padding) but it is entirely possible to share the array
portion of equal strings.  You could, for instance, say

  new String(s.intern())

to ensure that you get a string that is mostly shared,
yet not equal to any other string.  That's perhaps only
5 words per string after the first one is created, versus
maybe 3 words per whatever object you might come up with
for your unique-string type.

However, you could also play the game of indexing your
entities, and indexing instances of entities.  That
is, map objects to integers.

That way, you can store any object in 2 words, one identifying
the value, the other identifying the instance ID, with full
sharing under the covers (where "under the covers" is in
a sort of a hash table, where the "value" associated with
each object stored in the table is the integer for the
object.  When the instance id wraps, you grab another
slot for the same object in the table.)

This is probably too horrible to contemplate for most
people, given that you've got untyped integers
instead of typed objects, and no garbage collection at
all under the covers.  A loon might even push it to the
bit level, and reserve 8 bits for the instance ID, and
24 bits for the value index.  (If Fortran is outlawed,
only outlaws will use Fortran.)
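
For the curious, the bit-level variant might look like the sketch below.
The 8/24 split comes from the paragraph above; the class and method names
are invented, and the table that maps a value index back to a shared
string is left out.

```java
// Hypothetical packing of a (value, instance) pair into one int:
// the high 8 bits hold the instance ID, the low 24 bits the value index.
final class Handle {
    static final int VALUE_BITS = 24;
    static final int VALUE_MASK = (1 << VALUE_BITS) - 1;   // 0x00FFFFFF
    static final int MAX_INSTANCE = 0xFF;

    static int pack(int instanceId, int valueIndex) {
        if (instanceId < 0 || instanceId > MAX_INSTANCE) {
            throw new IllegalArgumentException("instance ID does not fit in 8 bits");
        }
        if (valueIndex < 0 || valueIndex > VALUE_MASK) {
            throw new IllegalArgumentException("value index does not fit in 24 bits");
        }
        return (instanceId << VALUE_BITS) | valueIndex;
    }

    // Unsigned shift, so instance IDs of 128 and up come back positive.
    static int instanceId(int handle) { return handle >>> VALUE_BITS; }

    static int valueIndex(int handle) { return handle & VALUE_MASK; }

    public static void main(String[] args) {
        int first = pack(0, 42);
        int second = pack(1, 42);    // same value, different instance
        System.out.println("same value index: " + (valueIndex(first) == valueIndex(second)));
        System.out.println("same handle:      " + (first == second));
    }
}
```

Two handles with the same value index share their string under the covers
yet compare unequal as integers, which gives the "equal but not identical"
behaviour without a per-instance object.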

>Since java.lang.String can't be subclassed and Java's notion of type
>equivalence is based on name, not structure, I fear that a
>UniqueString would be something of an annoyance to use in practice.

final public class U {
   public static String u(String s) { return new String(s.intern()); }
}

.... U.u("a string") ...

Less typing than those clunky old Modula-3 keywords :-).

>If only Cardelli and company had glommed C-like syntax over Modula-3's
>semantics, we might inhabit a slightly happier world.

I'm not sure, but I think some of us in the "peanut gallery"
raised the issue at the time.  I may have my old email
still from back then; maybe someday I'll see what I
can find.

Java's got one other thing that Modula-3 didn't, which is an
answer to the multiple inheritance question.  The problem, for
the M-3 definers, was that most of the people who wanted MI were
unable or unwilling to explain what it was that they wanted in any
sort of a sound semantic framework (William Cook was an exception,
I think), and given their attitude toward inheritance in general,
the M-3 people would probably have come up with something different.
Not clear if better or worse, but different.  Unsurprisingly,
Java's type system is at its flakiest where it deals with
multiple inheritance, but to an engineering approximation
nobody cares, and nobody I know has figured out how to turn
the theoretical glitch into a security hole.

David Chase


From bos@serpentine.com  Wed Mar  7 05:19:37 2001
From: bos@serpentine.com (Bryan O'Sullivan)
Date: Tue, 6 Mar 2001 21:19:37 -0800 (PST)
Subject: [gclist] Garbage collection and XML
In-Reply-To: <4.3.2.7.0.20010306215829.02038008@pop.std.com>
References: <200103070059.NAA06646@atlas.otago.ac.nz>
 <4.3.2.7.0.20010306215829.02038008@pop.std.com>
Message-ID: <15013.50409.380653.862521@pelerin.serpentine.com>

[Claims that this thread is relevant to garbage collection are
starting to feel a little weak.  Oh well.]

d> However, you could also play the game of indexing your entities,
d> and indexing instances of entities.  That is, map objects to
d> integers. [...]  This is probably too horrible to contemplate for
d> most people, given that you've got untyped integers instead of
d> typed objects, and no garbage collection at all under the covers.

I actually went off and did this for an indexing and searching app
fairly recently.  Provided your API doesn't reveal the integer-ness of
the underlying representation to its users, and can overcome the cost
of converting back and forth at method entry and exit points, it is
possible to surprise people with the kind of performance and memory
overhead you can sustain with this kind of Java application.

The usual fearsome memory requirements lose some teeth as integers
aren't as heavily-boxed as objects in Java.  Granted, you now have
huge tables of interned strings sitting around that won't shrink or go
away until you drop all references to the entire tables, but for this
kind of application, it's easy to sidle around the problem with
references to "necessary time/space tradeoffs".

What is less fun is writing and maintaining the code behind the
pristine-looking API.  Unless you want to rebox all the integers you
had carefully unboxed earlier so you can use the java.util.Map
interface, you're condemned to CS201 rebuild-silly-data-structures
hell.

Makes one yearn for parametric classes and interfaces, à la GJ.

	<b


From crawley@dstc.edu.au  Wed Mar  7 05:50:06 2001
From: crawley@dstc.edu.au (Stephen Crawley)
Date: Wed, 07 Mar 2001 15:50:06 +1000
Subject: [gclist] Garbage collection and XML
In-Reply-To: Message from "Richard A. O'Keefe" <ok@atlas.otago.ac.nz>
 of "Wed, 07 Mar 2001 11:46:03 +1300." <200103062246.LAA04709@atlas.otago.ac.nz>
Message-ID: <200103070549.f275nW627336@piglet.dstc.edu.au>

Richard,

I believe that some of your assertions about CORBA IDL are incorrect.

You wrote:
> The two bindings in the DOM specifications are to Java and Javascript,
> where strings are immutable.  It's really difficult to figure out *what*
> the DOM specifies, because
>  - the primary specification is in CORBA IDL, in which every time you
>    ask for a string the remote system sends you back a new copy

If you treat IDL as an abstract interface specification language ... as
DOM does ... it doesn't say how data type values are passed across an
interface.  Such details only need to be considered when the IDL is
mapped to some target language(s) in some implementation context.

Even when IDL is used to describe a client / server interaction; e.g. using
a conventional ORB and standard language mappings, data types (like
strings) are not always passed by copying.  If the client and object are
colocated, the ORB may pass values by reference ... even in the case of
C++ where C++ types that represent the IDL data types are mutable.  [One
of the rules that you must obey to write portable CORBA C++ code is that
a 'server' must not change values that have been passed as 'in' args.]
In the case of Java, it hardly matters which way 'string' values are
passed in the colocated case unless your code compares java.lang.String
values using '==' ... which is dodgy at the best of times!

>  - the object chosen to represent strings in the CORBA IDL for the DOM
>    is a *mutable* array of 16-bit characters

CORBA IDL doesn't define strings (or any other data type) as mutable or
immutable.  Such issues belong in the CORBA language mappings, and different
mappings make different decisions.  Furthermore, if you hand-map the IDL 
to native APIs, it is entirely up to you how you address the issue.

>  - the object chosen to represent strings in the Java and Javascript
>    bindings is an *immutable* String of 16-bit characters.

The standard CORBA IDL -> Java mapping ALSO maps CORBA 'string' to Java's
'java.lang.String'; i.e. immutable arrays of UTF-16 characters.

You might argue that DOM should use a non-standard mapping for 'string';
e.g. to some object wrapper for an immutable array of bytes.  This would
be more space efficient, but it would be a right royal pain to write
Java applications that used such a DOM API.

-- Steve


From virtualcyber@erols.com  Wed Mar  7 07:05:05 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Wed, 7 Mar 2001 02:05:05 -0500
Subject: [gclist] CORBA C++ bindings and garbage collection
References: <200103070549.f275nW627336@piglet.dstc.edu.au>
Message-ID: <002b01c0a6d4$f0d89fe0$0100007f@cradle>

Hi,

    A number of people in this forum seem to be
well acquainted with CORBA. 

    Having just written a small app using it,
I have two questions about C++ CORBA and garbage
collection:

    (1) Does the IDL to C++ language mapping
rule out the use of a garbage collector in
designing and implementing an ORB?

    (2) If the mapping does rule it out,
then wasn't it a mistake for the original
mapping designers not to consider
garbage collection?

      If the language mapping does not
rule out the use of a garbage collector,
does anyone know of a C++ ORB implementation
which does use a garbage collector?

    When I was programming a bit with Java RMI and
C++ CORBA, I kept thinking how simple the
memory cleanup task for Java RMI is, compared to
the C++ CORBA ORB I was using.

    I will appreciate any comments, answers
and perspectives, unless this topic has
been beaten on before -- in which case
I will be happy and grateful just to be directed to 
an existing mail archive dungeon.


Take Care,
Ji-Yong D. Chung


From bos@serpentine.com  Wed Mar  7 08:28:09 2001
From: bos@serpentine.com (Bryan O'Sullivan)
Date: Wed, 7 Mar 2001 00:28:09 -0800 (PST)
Subject: [gclist] CORBA C++ bindings and garbage collection
In-Reply-To: <002b01c0a6d4$f0d89fe0$0100007f@cradle>
References: <200103070549.f275nW627336@piglet.dstc.edu.au>
 <002b01c0a6d4$f0d89fe0$0100007f@cradle>
Message-ID: <15013.61721.784591.670436@pelerin.serpentine.com>

j> Does the IDL to C++ language mapping rule out the use of a garbage
j> collector in designing and implementing an ORB?

It's not the language mapping that rules out GC, it's the programming
model and the wire protocol.  IIOP has no facility for tracking the
number of clients talking to a server.  In order for a client to be able
to talk to an object over the wire, it has to be explicitly exported
on the server side.

j> If the mapping does rule it out, then wasn't it a mistake for the
j> original mapping designers not to consider garbage collection?

I don't think so.  Distributed garbage collection is a nice idea in
the abstract, but it comes with far too many problems to be something
you really want in a commercial setting.  Also, at the time CORBA was
being agglutinated, there were no commercially-noticeable languages
that supported GC in existence.

As to my claims of problems with distributed GC:

1. See figure 1.

2. No matter what algorithm you choose, it's really, really hard to
   get right.  I have never seen a distributed GC (and I've worked
   with, and near, several) that didn't have serious bugs, no matter
   how bright the people were who worked on it.  Of course, the bugs
   only show up in situations where you can't even log the problems,
   much less reproduce them later.  See figure 1.

3. Beyond simple reference counting (and its problems with messaging
   overhead and cyclic structures), any of the DGC algorithms I used
   to look at back when I stayed abreast of the literature were
   really, really hard to even understand.  None of the professors or
   grad students I knew around 1993 or so claimed to understand the
   most widely-respected DGC algorithm of the day (written, as I
   recall, by John Hughes).  See figure 1.

4. Since #2 means that everyone (to a first approximation) uses
   reference counting with heartbeats or lease renewals and a few
   other frills, you end up with a lot of cross-chatter in a large
   network that is nominally idle.  If your network gets busy, GC
   traffic starts to add appreciable overhead.  Oh, and now you have
   to think about debugging GC bugs in a 16-node cluster of live stock
   market trading servers.  See figure 1.

Perhaps you're getting the picture.  DGC is still a fruitful source of
research papers.  This should scare you.

The last project I worked on that used DGC was BEA's WebLogic Server,
a very profitable web application server.  When I left BEA, we had
been talking seriously for several months about turning off DGC
altogether.  The Enterprise JavaBeans programming model didn't require
DGC, even though it was nominally implemented on top of RMI, and EJB
has almost entirely displaced RMI as the distributed programming model
of choice for large Java apps.  The popularity of EJB made it very
tempting to kill off all of our DGC infrastructure and its horrible
Heisenbugs.

Prior to WLS, I worked on Jini (remember that?), where we effectively
handwaved away the intractable problems of DGC in large, semi-coherent
systems by requiring that clients explicitly maintain leases to server
objects.
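
The lease idea is simple enough to sketch. What follows is a toy,
single-process lease table, not the Jini API; every name in it is
hypothetical, and time is passed in explicitly so the behaviour is easy to
follow. The server keeps an object only while some client keeps renewing.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical server-side lease table: object id -> time the lease lapses.
class LeaseTable {
    private final long durationMillis;
    private final Map<String, Long> expiry = new HashMap<>();

    LeaseTable(long durationMillis) {
        this.durationMillis = durationMillis;
    }

    // Grants a lease on first use and extends it on every later call.
    synchronized void renew(String objectId, long nowMillis) {
        expiry.put(objectId, nowMillis + durationMillis);
    }

    // Drops every object whose lease has lapsed; returns how many were dropped.
    synchronized int reap(long nowMillis) {
        int dropped = 0;
        for (Iterator<Long> it = expiry.values().iterator(); it.hasNext(); ) {
            if (it.next() <= nowMillis) {
                it.remove();
                dropped++;
            }
        }
        return dropped;
    }

    synchronized boolean isLive(String objectId) {
        return expiry.containsKey(objectId);
    }

    public static void main(String[] args) {
        LeaseTable table = new LeaseTable(1000);
        table.renew("a", 0);
        table.renew("b", 0);
        table.renew("a", 900);               // only "a" has a client still renewing
        System.out.println("dropped: " + table.reap(1000));
        System.out.println("a live:  " + table.isLive("a"));
        System.out.println("b live:  " + table.isLive("b"));
    }
}
```

The cost is visible even here: a client that is alive but slow to renew
loses its object, and every live reference costs periodic traffic.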

j> When I was programming a bit with Java RMI and C++ CORBA, I
j> kept thinking how simple the memory cleanup task for Java RMI is,
j> compared to the C++ CORBA ORB I was using.

There's no doubt that DGC makes programming seem nicer.  Right up
until it breaks irreproducibly in deployment or doesn't scale beyond a
handful of participants, at which point you can take your app out back
and shoot it.

j> I will appreciate any comments, answers and perspectives, unless
j> this topic has been beaten on before -- in which case I will be
j> happy and grateful just to be directed to an existing mail archive
j> dungeon.

Actually, I haven't seen much written about the theory-vs-practice
dichotomy in the DGC world.  Am I merely scaring the kids, or have
others also found it to be a tremendous headache in non-trivial cases?

	<b


(P.S.  Wondering where figure 1 was?
http://www.parc.xerox.com/csl/members/dourish/goodies/see-figure-1.html)


From Bill.Foote@eng.sun.com  Wed Mar  7 09:31:50 2001
From: Bill.Foote@eng.sun.com (Bill Foote)
Date: Wed, 07 Mar 2001 10:31:50 +0100
Subject: [gclist] Garbage collection and XML
References: <200103062246.LAA04709@atlas.otago.ac.nz>
Message-ID: <3AA60006.9F153370@eng.sun.com>

"Richard A. O'Keefe" wrote:


> I don't know about Javascript, but in Java it is perfectly possible to
> have two String objects with the same (immutable!) state which must act
> the same for all future time, but have distinct identities.  A Java
> program which tried to keep track of which nodes strings came from by
> using String identities as keys could be confused if strings were shared.

Ah, good point.  It is theoretically possible to use string identities to
discriminate String objects.  You'd have to be mad to try, and it's not
easy, but you can:

    public class IdentityStringKey {
        private String value;

        ...

        public int hashCode() {
            return value.hashCode();
        }

        public boolean equals(Object other) {
            if  (other instanceof IdentityStringKey) {
                return ((IdentityStringKey) other).value == value;
            } else {
                return false;
            }
        }
    }

So I take back what I said.  With enough work, one actually could word a
normative, testable requirement of no String object sharing in a Java API.
It's hard to do and doesn't make sense, but it can be done.

Anyway, this is moot as it sounds like DOM didn't go that far.

Cheers,

Bill
-- 
Bill Foote                                         bill.foote @ sun.com
Java TV Standards Engineer          http://java.sun.com/products/javatv


From kanderson@bbn.com  Wed Mar  7 14:39:57 2001
From: kanderson@bbn.com (Ken Anderson)
Date: Wed, 07 Mar 2001 09:39:57 -0500
Subject: [gclist] Garbage collection and XML
In-Reply-To: <4.3.2.7.0.20010306215829.02038008@pop.std.com>
References: <15013.35309.439591.995984@pelerin.serpentine.com>
 <200103070059.NAA06646@atlas.otago.ac.nz>
 <200103070059.NAA06646@atlas.otago.ac.nz>
Message-ID: <4.1.20010307093704.00a67100@zima.bbn.com>

At 11:57 PM 3/6/2001 , David Chase wrote:
>At 05:07 PM 3/6/2001 -0800, Bryan O'Sullivan wrote:
>
>final public class U {
>   public static String u(String s) { return new String(s.intern()); }
>}
>
>.... U.u("a string") ...
>
>Less typing than those clunky old Modula-3 keywords :-).
>

Unfortunately, the String() constructor copies the underlying char[].
I think this will work the way you intended.

final public class U {
  public static String u(String s) { return s.intern(); }
}


From hans_boehm@hp.com  Wed Mar  7 17:16:33 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Wed, 7 Mar 2001 09:16:33 -0800
Subject: [gclist] Garbage collection and XML
Message-ID: <140D21516EC2D3119EE7009027876644049B5C7A@hplex1.hpl.hp.com>

> -----Original Message-----
> From: Bill Foote [mailto:Bill.Foote@eng.sun.com]
> "Richard A. O'Keefe" wrote:
> > I don't know about Javascript, but in Java it is perfectly 
> possible to
> > have two String objects with the same (immutable!) state 
> which must act
> > the same for all future time, but have distinct identities.  A Java
> > program which tried to keep track of which nodes strings 
> came from by
> > using String identities as keys could be confused if 
> strings were shared.
> 
> Ah, good point.  It is theoretically possible to use string 
> identities to
> discriminate String objects.  You'd have to be mad to try, 
> and it's not
> easy, but you can: ...

You could presumably also synchronize on the strings, effectively turning
them into locks.  In that case sharing might result in unexpected lock
contention or deadlock.  The fact that every object can be used for
synchronization means that in some sense nothing is immutable.

Hans


From hans_boehm@hp.com  Wed Mar  7 17:30:50 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Wed, 7 Mar 2001 09:30:50 -0800
Subject: [gclist] Garbage collection and XML
Message-ID: <140D21516EC2D3119EE7009027876644049B5C7B@hplex1.hpl.hp.com>

> -----Original Message-----
> From: David Chase [mailto:chase@world.std.com]
> The best you're likely to get out of most Java implementations
> for any type is 2 words of header, plus one or two for data,
> depending on how they deal with possible alignment of doubles
> and longs.
> 
> Java strings are also not necessarily quite as costly
> as you make them out to be.  The basic object is
> header + array pointer + offset + count (5 or 6 words, depending
> on padding) but it is entirely possible to share the array
> portion of equal strings. ...

A lot of this clearly varies greatly with the implementation.  I believe
that gcj (with a patch that hasn't yet made it into the official tree) will
in the best case represent a String as a single chunk of memory containing:

1 word object header (vtable pointer only, objects are not moved,
synchronization is handled with a separate table)
1 word pointer to array (in the best case points to the string object
itself)
1 "int" byte offset to start of string.
1 "int" length
Sequence of 16 bit characters

Thus strings up to 4 characters are 4 words on a 64 bit machine, and 6 on a
32 bit machine.  (Object sizes are even numbers of words for alignment
reasons.)
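
The arithmetic behind those two figures, written out as a sketch. The
layout constants are taken from the description in this message, not from
the gcj sources, and the class name is made up.

```java
// Worked version of the size calculation for the best-case layout above.
class GcjStringSize {
    // Words occupied by a String of 'chars' characters when a word is
    // 'wordBytes' bytes (4 on a 32 bit machine, 8 on a 64 bit machine).
    static int words(int chars, int wordBytes) {
        int bytes = wordBytes      // object header (vtable pointer only)
                  + wordBytes      // pointer to the character array
                  + 4              // "int" byte offset to start of string
                  + 4              // "int" length
                  + 2 * chars;     // 16 bit characters
        int words = (bytes + wordBytes - 1) / wordBytes;   // round up to whole words
        return (words + 1) & ~1;                           // round up to an even count
    }

    public static void main(String[] args) {
        System.out.println("4 chars, 64 bit: " + words(4, 8) + " words");
        System.out.println("4 chars, 32 bit: " + words(4, 4) + " words");
    }
}
```

On a 64 bit machine the fixed part is 24 bytes and four characters add 8,
giving 32 bytes or 4 words; on a 32 bit machine the fixed part is 16 bytes,
so the same string is 24 bytes or 6 words.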

Disclaimer:  I didn't write the String implementation.  This is based on my
reading of the code.

Hans


From fjh@cs.mu.oz.au  Wed Mar  7 17:35:34 2001
From: fjh@cs.mu.oz.au (Fergus Henderson)
Date: Thu, 8 Mar 2001 04:35:34 +1100
Subject: [gclist] Garbage collection and XML
In-Reply-To: <140D21516EC2D3119EE7009027876644049B5C7A@hplex1.hpl.hp.com>
References: <140D21516EC2D3119EE7009027876644049B5C7A@hplex1.hpl.hp.com>
Message-ID: <20010308043534.A13328@hg.cs.mu.oz.au>

On 07-Mar-2001, Boehm, Hans <hans_boehm@hp.com> wrote:
> You could presumably also synchronize on the strings, effectively turning
> them into locks.  In that case sharing might result in unexpected lock
> contention or deadlock.  The fact that every object can be used for
> synchronization means that in some sense nothing is immutable.

How would you synchronize on the strings?
The java.lang.String class is declared `final', so you can't
inherit from it, and AFAIK there are no synchronized methods
in java.lang.String.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
                                    |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.


From Bob.Kerns@brightware.com  Wed Mar  7 17:42:07 2001
From: Bob.Kerns@brightware.com (Bob Kerns)
Date: Wed, 7 Mar 2001 09:42:07 -0800
Subject: [gclist] Garbage collection and XML
Message-ID: <4B946AD84FD3D2119696009027463F9D019C0E2F@bwnvfs16.brightware.com>

String foo = dosomething();
synchronized (foo) {
  ....
}

As he said, every object can be used for synchronization. Synchronized
methods effectively wrap

synchronized (this) {
  ...
}

around the body of the method, but you can synchronize on any object at any
time.

-----Original Message-----
From: Fergus Henderson [mailto:fjh@cs.mu.oz.au]
Sent: Wednesday, March 07, 2001 9:36 AM
To: Boehm, Hans
Cc: 'Bill Foote'; Richard A. O'Keefe; chase@world.std.com;
gclist@iecc.com; icis-developers@bbn.com
Subject: Re: [gclist] Garbage collection and XML


On 07-Mar-2001, Boehm, Hans <hans_boehm@hp.com> wrote:
> You could presumably also synchronize on the strings, effectively turning
> them into locks.  In that case sharing might result in unexpected lock
> contention or deadlock.  The fact that every object can be used for
> synchronization means that in some sense nothing is immutable.

How would you synchronize on the strings?
The java.lang.String class is declared `final', so you can't
inherit from it, and AFAIK there are no synchronized methods
in java.lang.String.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
                                    |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.


From chase@world.std.com  Wed Mar  7 17:48:44 2001
From: chase@world.std.com (David Chase)
Date: Wed, 07 Mar 2001 12:48:44 -0500
Subject: [gclist] Garbage collection and XML
In-Reply-To: <4.1.20010307093704.00a67100@zima.bbn.com>
References: <4.3.2.7.0.20010306215829.02038008@pop.std.com>
 <15013.35309.439591.995984@pelerin.serpentine.com>
 <200103070059.NAA06646@atlas.otago.ac.nz>
 <200103070059.NAA06646@atlas.otago.ac.nz>
Message-ID: <4.3.2.7.0.20010307121108.01f4d7a0@pop.std.com>

This is not about garbage collection at all, unless we want
to grumble about the design decisions underlying some of these
things.

At 09:39 AM 3/7/2001 -0500, Ken Anderson wrote:
>At 11:57 PM 3/6/2001 , David Chase wrote:
>>final public class U {
>>   public static String u(String s) { return new String(s.intern()); }
>
>Unfortunately, the String() constructor copies the underlying char[].
>I think this will work the way you intended.
>
>final public class U {
>  public static String u(String s) { return s.intern(); }}

Nope, we're both wrong.  Richard O'Keefe was looking for a
way to get space-saving, but not-eq, equal strings.  You're
right about the String constructor, but it turns out that there
IS a way:

  s.intern().substring(0)

That will first intern s, to ensure sharing, then create
a new String object that shares storage with the interned
String, but is not == to it.
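
Both tricks are easy to check on a given VM, and it is worth doing so
before relying on either. The sketch below (class and method names are
made up) relies only on the language guarantee that 'new' creates a
distinct object. Whether the character storage is shared, and whether
substring(0) returns a fresh object at all, depends on the library: some
String implementations return the receiver itself when the requested range
covers the whole string, in which case the substring trick yields the
interned object and not a distinct one.

```java
class NotEqButEqual {
    // A fresh String object with the same contents as the interned one.
    // Storage sharing with the interned copy is implementation-dependent.
    static String distinctCopy(String s) {
        return new String(s.intern());
    }

    // Reports whether substring(0) handed back the interned object itself.
    static boolean substringZeroIsSame(String s) {
        String interned = s.intern();
        return interned.substring(0) == interned;
    }

    public static void main(String[] args) {
        String interned = "a string".intern();
        String copy = distinctCopy("a string");
        System.out.println("equal contents:        " + copy.equals(interned));
        System.out.println("same object:           " + (copy == interned));
        System.out.println("substring(0) is same:  " + substringZeroIsSame("a string"));
    }
}
```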

Regarding Hans's observations about gcj -- if they want to
roll their own String class, that's fine, but if they intend
to interoperate with native code (JNI code), they'll need
to use the same String data structures, field names, and types
as Sun uses for their classes.  I learned this the hard
way. 

Hans is regrettably correct about the effects of "every-object-
is-a-lock".  Though this has led to some really impressive
innovation in lock implementation technology, it mucks up sharing,
and anyone who actually wants to make a system that is robust
in the face of denial-of-service attacks (there are some cute
ones involving locking) has to create their own private Objects
for locking anyhow.  If you DO want to take advantage of sharing,
you can't lock on those objects either, since you can never
keep track of who's got what lock when/where, so again there's
no use for EOiaL.  It's kind of a useless feature, but there
it is.
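
A small sketch of why shared strings make bad locks and private Objects
make good ones. The class and field names are hypothetical. Equal string
literals are interned, so two components that each believe they own a lock
named "config" are in fact contending for one monitor.

```java
class LockSharing {
    // Two unrelated components, each with "its own" lock named by a string.
    // Equal literals are interned, so both fields refer to one object.
    static final String LOCK_IN_A = "config";
    static final String LOCK_IN_B = "config";

    // A private lock object that no other code can reach.
    private static final Object PRIVATE_LOCK = new Object();

    // True if holding A's string lock also means holding B's.
    static boolean stringLocksCollide() {
        synchronized (LOCK_IN_A) {
            return Thread.holdsLock(LOCK_IN_B);
        }
    }

    // True if holding the private lock also means holding B's string lock.
    static boolean privateLockCollides() {
        synchronized (PRIVATE_LOCK) {
            return Thread.holdsLock(LOCK_IN_B);
        }
    }

    public static void main(String[] args) {
        System.out.println("string locks collide: " + stringLocksCollide());
        System.out.println("private lock collides: " + privateLockCollides());
    }
}
```

The same collision happens with any object that an implementation decides
to share behind the scenes, which is the interaction with string sharing
raised earlier in the thread.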

David Chase


From hans_boehm@hp.com  Wed Mar  7 18:17:31 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Wed, 7 Mar 2001 10:17:31 -0800
Subject: [gclist] Garbage collection and XML
Message-ID: <140D21516EC2D3119EE7009027876644049B5C7E@hplex1.hpl.hp.com>

> From: David Chase [mailto:chase@world.std.com]
> 
> This is not about garbage collection at all, unless we want
> to grumble about the design decisions underlying some of these
> things.
Ditto, though my impression is that there is a lot of interaction between
strings and GC performance.  Strings may be a big part of the heap, and
depending on how you answer some of these questions here, the GC may or may
not have to look at them very much.
> 
> Regarding Hans's observations about gcj -- if they want to
> roll their own String class, that's fine, but if they intend
> to interoperate with native code (JNI code), they'll need
> to use the same String data structures, field names, and types
> as Sun uses for their classes.  I learned this the hard
> way. 
> 
My impression is that this is only an issue for programs that rely on
features not documented in the Java or JNI spec?  If so, it seems to me the
other answer is "fix the client code", which seems to be an easier thing to
say with open source code than with commercial code, though it's never easy.

Hans


From ok@atlas.otago.ac.nz  Wed Mar  7 22:54:47 2001
From: ok@atlas.otago.ac.nz (Richard A. O'Keefe)
Date: Thu, 8 Mar 2001 11:54:47 +1300 (NZDT)
Subject: [gclist] Garbage collection and XML
Message-ID: <200103072254.LAA08975@atlas.otago.ac.nz>

Bill Foote <Bill.Foote@eng.sun.com> wrote:
	So I take back what I said.  With enough work, one actually could word a normative,
	testable requirement of no String object sharing in a Java API.  It's hard to do
	and doesn't make sense, but it can be done.
	
	Anyway, this is moot as it sounds like DOM didn't go that far.
	
The DOM goes rather further than you might suppose.
Recall that I divided strings into two kinds:

 - structural strings (element names and attribute names)
 - content strings (#PCDATA and attribute values).

If you know a bit of SGML or HTML you might think an attribute value
is just a string, e.g.
	<p class="acknowledgement">
But SGML allows a string to contain macro references, e.g.
	<p class="&ack; for support">
No worries, the only macros allowed in HTML are the predefined character
names, and in SGML the macros are supposed to expand out and just be
strings (roughly speaking).

XML is different.  XML allows a document to be processed "half way",
where macros are not expanded.  So
	<p class="&ack; for support">
would, in the DOM, be something like
	Element (name = "p", attributes =
	    Attr (name = "class", first child =
	        EntityReference (name = "ack", next sibling =
	        Text (value = " for support", next sibling = NIL ))))

So in the DOM, there are two ways to get at the value of an attribute:
    Attr.value	- a string
    Attr.childNodes  - a sequence of Text and/or EntityReference nodes.
(By the way, if an attribute's .childNodes include an EntityReference node,
it is by no means clear what the .value should be.  I cannot find any clear
explanation of this.)   "On setting" the .value of an Attr "this creates a
Text node".

Now the smallest I can get a Text node down to is 6 words.

The Document Value Model that I've been talking about is intended only for
valid SGML/XML documents, wherein all macros must have been expanded, so
that there is no need for the Document Object Model's partially digested
attribute values.
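
To make the contrast concrete, here is a rough C sketch of the two
attribute representations.  The struct layouts and names (dom_node,
dom_attr, dvm_attr) are hypothetical, chosen only for illustration; they
are not taken from any real DOM binding or from the DVM code, and the
fields shown are not meant to reproduce the 6-word figure.

```c
#include <stddef.h>

/* Hypothetical node kinds for the children of a DOM-style attribute. */
typedef enum { NODE_TEXT, NODE_ENTITY_REF } node_kind;

/* One child node: a separate object per piece of attribute content. */
typedef struct dom_node {
    node_kind kind;
    const char *text;               /* Text: the value; EntityReference: the name */
    struct dom_node *next_sibling;  /* NULL at the end of the chain */
} dom_node;

/* DOM style: the attribute value is a chain of child nodes. */
typedef struct {
    const char *name;
    dom_node *first_child;
} dom_attr;

/* DVM style: macros are already expanded, so a plain string suffices. */
typedef struct {
    const char *name;
    const char *value;
} dvm_attr;

/* Number of node objects hanging off one DOM-style attribute. */
static size_t dom_attr_child_count(const dom_attr *a)
{
    size_t n = 0;
    for (const dom_node *c = a->first_child; c != NULL; c = c->next_sibling)
        ++n;
    return n;
}

/* 1 if every child is a Text node, i.e. no unexpanded macro remains. */
static int dom_attr_is_expanded(const dom_attr *a)
{
    for (const dom_node *c = a->first_child; c != NULL; c = c->next_sibling)
        if (c->kind != NODE_TEXT)
            return 0;
    return 1;
}
```

For <p class="&ack; for support"> the DOM-style attribute carries two
child nodes (an EntityReference followed by a Text), whereas the
DVM-style attribute is a single name/value pair once the macro has been
expanded.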

So *every* content string in the DOM pays at least a 6-word overhead
that is not paid in the DVM.  While it is arguable that the strings themselves
might be shared, it is unarguable that these Text nodes must NOT be shared.

It would be possible to use lazy initialisation for the children of an Attr
node.

What we see here that has general application is
1.  The DOM is an API for HTML/XML *editors*, designed to support
    frequent tiny changes to an incompletely processed document.

    The DVM is a data structure for SGML/XML *processors*, designed
    to support efficient storage and traversal of completely parsed
    and validated documents and efficient creation of transformed
    documents.

2.  A data structure designed for one use need not be expected to be
    good for other uses.  The DOM is horribly clumsy and inefficient
    for processing documents; the DVM cannot represent incompletely
    parsed documents at all.

3.  The rather large storage costs of the DOM compared with the DVM
    (even *with* shared strings) can be traced to its requirements
    combined with the decision to use mutation as a primary programming
    tool.

The general point therefore is that supporting functionality you don't
need (in my case, in-place edits and half-parsed documents) can cost you
a lot; garbage avoidance starts with data structures that are no more
capable than they need to be.


From chase@world.std.com  Thu Mar  8 01:23:16 2001
From: chase@world.std.com (David Chase)
Date: Wed, 07 Mar 2001 20:23:16 -0500
Subject: [gclist] Garbage collection and XML
In-Reply-To: <200103072254.LAA08975@atlas.otago.ac.nz>
Message-ID: <4.3.2.7.0.20010307193523.0263f230@pop.std.com>

At 11:54 AM 3/8/2001 +1300, Richard A. O'Keefe wrote:
>3.  The rather large storage costs of the DOM compared with the DVM
>    (even *with* shared strings) can be traced to its requirements
>    combined with the decision to use mutation as a primary programming
>    tool.

How does it go if you play the simulated mutation game?

For instance, supposing you work with applicative data
structures (e.g., splay trees, or red-black trees) so that
you never modify anything, instead only reallocating
along the spine?  Yes, it does generate garbage (so there
is some fractional relevance to gc-list :-) but only
expected-O(log N) garbage per update operation.  One
advantage of applicative data structures is that if the
assignment of the root is atomic, then you only have to
lock for modification.

It is more frustrating than amusing to watch people learn
lessons already well-understood over a decade ago.  As soon
as you start working on a big project in a garbage-collected
language, any data that "escapes" your little sandbox really
has to be regarded as immutable, and (in the case of Java)
unlockable.  In a multi-threaded world, mutability also has
the annoying overhead of synchronization (for a competently
designed memory allocator, synchronization on a multiprocessor
can be 10-20 times as expensive as allocation (*)).  Sure,
mutable data structures are great for programming in the
small, but build something big, and they become a major
pain, and given the overheads they're not necessarily any
faster.

(*) non-recursive synchronization costs two bus locks,
or (my machine, with cpu clocked 6x memory bus) 120 cycles.
Heap memory allocation is load, add, compare, conditional
branch (predicted not taken), store, store, followed by
field initialization.
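
As a rough illustration of the allocation sequence in the footnote, here
is a minimal bump-pointer allocator in C.  The names (alloc_region,
obj_header, region_init, bump_alloc) are hypothetical, and this is only a
sketch of the general technique, not the allocator of any particular
system; a real one would call into the collector on the slow path instead
of returning NULL.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread allocation region. */
typedef struct {
    char *free_ptr;   /* next unused byte */
    char *limit;      /* one past the end of the region */
} alloc_region;

/* Hypothetical one-word object header. */
typedef struct {
    const void *type_info;
} obj_header;

/* Obtain a fresh region from malloc; returns 0 on failure. */
static int region_init(alloc_region *r, size_t bytes)
{
    r->free_ptr = malloc(bytes);
    if (r->free_ptr == NULL) {
        r->limit = NULL;
        return 0;
    }
    r->limit = r->free_ptr + bytes;
    return 1;
}

/* Fast path only.  'size' is assumed to be a multiple of the pointer
 * alignment and at least sizeof(obj_header).  No lock is taken, which
 * is only safe because the region is assumed to be thread-private. */
static void *bump_alloc(alloc_region *r, size_t size, const void *type_info)
{
    char *p = r->free_ptr;                      /* load */
    if (size > (size_t)(r->limit - p))          /* compare, branch */
        return NULL;                            /* slow path elided */
    r->free_ptr = p + size;                     /* add, store */
    ((obj_header *)p)->type_info = type_info;   /* store (header) */
    return p;                                   /* caller initializes fields */
}
```

The point of comparison is that nothing here needs a locked instruction,
so a single lock/unlock pair can easily cost more than the allocation.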

It's also possible, again if you are a loon, and not in
Java, to create applicative speculatively updated data
structures -- there, you reallocate a spine to "modify"
(say for a red-black tree) and attempt to compare-and-swap
in the new root.  If you fail, simply retry the entire
operation (you can, optionally, attempt to reuse some of
the previous operation if you saved the old spine and
compare with the new as you recompute -- as long as threads
are modifying in different places, this should cut the cost
of a retry).  If contention is low, the synchronization
costs are only half what you pay in the conventional
lock-while-modifying data structure (and if contention
isn't low, then your synchronization costs get grim anyway).

David Chase


From virtualcyber@erols.com  Thu Mar  8 04:10:30 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Wed, 7 Mar 2001 23:10:30 -0500
Subject: [gclist] Garbage collection and XML
References: <4.3.2.7.0.20010307193523.0263f230@pop.std.com>
Message-ID: <00a101c0a785$b7598480$0100007f@cradle>

    Hi,

> [David Chase wrote]

> It's also possible, again if you are a loon, and not in
> Java, to create applicative speculatively updated data
> structures -- there, you reallocate a spine to "modify"
> (say for a red-black tree) and attempt to compare-and-swap
> in the new root.

    This is a side issue, but is it possible to apply locks on subnodes
rather than on the root?  If one locks a node, then
sibling nodes should still be accessible to multiple threads.

    Also, is it possible to apply shared locks (read locks, intent-read
locks), that is, the locking concepts available from databases?
Or are these types of locks too expensive?

    Just being curious.


From virtualcyber@erols.com  Thu Mar  8 05:20:27 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Thu, 8 Mar 2001 00:20:27 -0500
Subject: [gclist] CORBA C++ bindings and garbage collection
References: <200103070549.f275nW627336@piglet.dstc.edu.au><002b01c0a6d4$f0d89fe0$0100007f@cradle> <15013.61721.784591.670436@pelerin.serpentine.com>
Message-ID: <00ad01c0a78f$7d376c40$0100007f@cradle>

    Hi

> > [I asked] Does the IDL to C++ language mapping rule out the use of a garbage
> > collector in designing and implementing an ORB?
>
> [you replied] It's not the language mapping that rules out GC, it's the
> programming model and the wire protocol.  IIOP has no facility for tracking
> the number of clients talking to a server.  In order for a client to be able
> to talk to an object over the wire, it has to be explicitly exported
> on the server side.

    Here, what you are saying is that the collector has no
easy way of knowing when there are no more live remote references
to servants, right?

> j> [I asked] If the mapping does rule it out, then wasn't it a mistake for
> j> the original mapping designers not to consider garbage collection?
>
> [you replied] I don't think so.  Distributed garbage collection is a nice
> idea in the abstract, but it comes with far too many problems to be something
> you really want in a commercial setting.  Also, at the time CORBA was
> being agglutinated, there were no commercially-noticeable languages
> that supported GC in existence.

    I was not thinking of applying DGC -- rather
I was thinking of using GC locally only, and treating (1) references
and (2) servants in special ways.

    If an object is a servant, we can just save it from being gc'ed
and use specialized threads to perform eviction (or whatever).
If an object is a reference to a remote object, then we simply
check whether the host of the reference's target object is in
the list of hosts that are reachable
(the list is pre-computed prior to GCing, in another thread)
and remove or finalize these references accordingly.

    This approach basically treats all local objects (other than
servants) as vanilla garbage collectible (including references to
remote objects).

    The hardest problem (how do you GC a servant) is obviously not
solved by the preceding method.  I was not thinking that GC would be
a way to fix that.  The problem of not knowing when clients have
dropped references seems to be an inherent problem of most distributed
systems (I am not sure if it is realistic to have a protocol that would keep
track of all client connections on a per-servant basis).

    I was just hoping, though, that local GCs would be good enough
to simplify the semantics of memory allocation/deallocation
for C++ CORBA systems.  For example, take a simple CORBA string.  Even
managing this requires one to use string_dup() and string_free().
If you deallocate a string that an ORB has given you, you can easily
get a core dump.  If you could locally GC, then you would just receive
a "reference" to that object -- no string_dup, no string_free.

    With local GC, I was wondering whether all this local bookkeeping which comes
with C++ CORBA might be eliminated.  In Michi Henning and Steve Vinoski's
book, the authors devote large chapters to explaining client- and server-side
C++ mappings for memory management.  All this seems just too complicated to me.

> DGC is still a fruitful source of
> research papers.  This should scare you.

    I am scared of doing DGC -- you bet.

> The last project I worked on that used DGC was BEA's WebLogic Server,
> a very profitable web application server.  When I left BEA, we had
> been talking seriously for several months about turning off DGC
> altogether.  The Enterprise JavaBeans programming model didn't require
> DGC, even though it was nominally implemented on top of RMI, and EJB
> has almost entirely displaced RMI as the distributed programming model
> of choice for large Java apps.  The popularity of EJB made it very
> tempting to kill off all of our DGC infrastructure and its horrible
> Heisenbugs.
>
> Prior to WLS, I worked on Jini (remember that?), where we effectively
> handwaved away the intractable problems of DGC in large, semi-coherent
> systems by requiring that clients explicitly maintain leases to server
> objects.

    What you say above makes a lot of sense to me (unless I am totally
confused).

> There's no doubt that DGC makes programming seem nicer.  Right up
> until it breaks irreproducibly in deployment or doesn't scale beyond a
> handful of participants, at which point you can take your app out back
> and shoot it.

    Does Java RMI suffer from this problem?  While I thought that
Java's RMI looked good, I also heard a little voice in my head saying "this
is too good to be real."  Given that many people have spent much energy over
the years to tackle distributed computing problems, I wondered whether
it was realistic to believe that Java RMI simply made these problems vanish.
(I suppose I could have looked at the Java source code ... but that is a huge
source and I was scared off by its size).


Ji-Yong D. Chung


From bos@serpentine.com  Thu Mar  8 05:56:24 2001
From: bos@serpentine.com (Bryan O'Sullivan)
Date: Wed, 7 Mar 2001 21:56:24 -0800 (PST)
Subject: [gclist] CORBA C++ bindings and garbage collection
In-Reply-To: <00ad01c0a78f$7d376c40$0100007f@cradle>
References: <200103070549.f275nW627336@piglet.dstc.edu.au>
 <002b01c0a6d4$f0d89fe0$0100007f@cradle>
 <15013.61721.784591.670436@pelerin.serpentine.com>
 <00ad01c0a78f$7d376c40$0100007f@cradle>
Message-ID: <15015.7944.42569.334661@pelerin.serpentine.com>

j> Here, what you are saying is that the collector has no easy way of
j> knowing when there are no more live remote references to servants,
j> right?

Yes.  If one client passes an IOR (CORBA-speak for a reference to a
server-side object) to another, but the second doesn't actually open a
connection to the server, then the server has no way of knowing that
the second has a reference to it.

j> I was just hoping though, that local GCs would be good enough to
j> simplify the semantics of memory allocation/deallocation for C++
j> CORBA systems.

For client-side code, you could simply try linking in something like
Boehm-Weiser and seeing if it worked.  I would be surprised if it
didn't work for purely client-side memory management.

b> There's no doubt that DGC makes programming seem nicer.  Right up
b> until it breaks irreproducibly in deployment or doesn't scale
b> beyond a handful of participants, at which point you can take your
b> app out back and shoot it.

j> Does Java RMI suffer from this problem?

Yes.  Jini (which I mentioned in my earlier article) uses leasing of
activatable objects to sidestep the problems of DGC in large RMI
systems.

j> Given that many people have spent much energy over years to tackle
j> distributed computing problems, I wondered whether it was realistic
j> to believe that Java RMI simply made these problems vanish.

Many aspects of RMI are fantastically useful, including DGC.  You just
have to assume that DGC will cause your scalability curve to assume
unexpected and catastrophic shapes after a point, and that said point
will always occur earlier in the curve than you'd like.  If your app
never reaches that point, then all is peachy.
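
The leasing idea mentioned above can be sketched in a few lines of C.
This is only an illustration of the general technique under made-up
names (lease, lease_renew, lease_sweep) with an abstract integer clock;
it is not the Jini API.  The server never tries to discover who holds a
reference.  It simply withdraws any object whose lease a client has
failed to renew in time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical server-side lease record for one exported object. */
typedef struct {
    long expires_at;   /* absolute time at which the lease lapses */
    bool exported;     /* still available to remote clients */
} lease;

/* A client (re)claims the object for 'duration' ticks from 'now'. */
static void lease_renew(lease *l, long now, long duration)
{
    l->expires_at = now + duration;
    l->exported = true;
}

/* Withdraw every object whose lease has lapsed; returns how many were
 * withdrawn.  After this, the local collector can reclaim them once no
 * local references remain. */
static size_t lease_sweep(lease *table, size_t n, long now)
{
    size_t withdrawn = 0;
    for (size_t i = 0; i < n; ++i) {
        if (table[i].exported && now >= table[i].expires_at) {
            table[i].exported = false;
            ++withdrawn;
        }
    }
    return withdrawn;
}
```

The cost is pushed onto clients, which must renew explicitly, but a
crashed or partitioned client can no longer keep a server object alive
forever.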

	<b


From hans_boehm@hp.com  Thu Mar  8 22:31:41 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Thu, 8 Mar 2001 14:31:41 -0800
Subject: [gclist] Garbage collection and XML
Message-ID: <140D21516EC2D3119EE7009027876644049B5C8B@hplex1.hpl.hp.com>

> In a multi-threaded world, mutability also has
> the annoying overhead of synchronization (for a competently
> designed memory allocator, synchronization on a multiprocessor
> can be 10-20 times as expensive as allocation (*)).  Sure,
> mutable data structures are great for programming in the
> small, but build something big, and they become a major
> pain, and given the overheads they're not necessarily any
> faster.
I agree with the general conclusion, but you need to be careful about the
costs.  Reallocating objects along a path in a large applicative data
structure tends to involve allocating and dropping relatively long-lived
objects.  I suspect the garbage collection costs will outweigh the
allocation costs by a fair margin, no matter what garbage collector you use.
(This might not hold if you can keep the heap very sparsely occupied.  But
even then the allocation cost will be dominated by the cache miss on the first
write to the object.)

But sharing can still be a huge advantage of the applicative data structure.

> 
> (*) non-recursive synchronization costs two bus locks,
> or (my machine, with cpu clocked 6x memory bus) 120 cycles.
> Heap memory allocation is load, add, compare, conditional
> branch (predicted not taken), store, store, followed by
> field initialization.
Is that an X86 machine?  I just timed a Pentium III/500/100 machine at
something near 25 cycles per
"lock; cmpxchgl".  I'm interested because I've sometimes heard the claim
that X86 is particularly bad at this, but that hasn't really been consistent
with my experience.  Is this chipset dependent, perhaps?

Hans


From chase@world.std.com  Fri Mar  9 00:43:10 2001
From: chase@world.std.com (David Chase)
Date: Thu, 08 Mar 2001 19:43:10 -0500
Subject: [gclist] Garbage collection and XML
In-Reply-To: <140D21516EC2D3119EE7009027876644049B5C8B@hplex1.hpl.hp.com
 >
Message-ID: <4.3.2.7.0.20010308184655.027e7ea0@pop.std.com>

At 02:31 PM 3/8/2001 -0800, Boehm, Hans wrote:
>> (*) non-recursive synchronization costs two bus locks,
>> or (my machine, with cpu clocked 6x memory bus) 120 cycles.
>> Heap memory allocation is load, add, compare, conditional
>> branch (predicted not taken), store, store, followed by
>> field initialization.
>Is that an X86 machine?

An x86 machine.  I was given to understand that the cost
was 10 memory bus cycles per lock-cmpxchgl, and I thought
that was what I measured on my old 200Mhz PPro.  Not sure
if it was clocking at 2x or 3x bus speed.  New machine is
(two) 800Mhz P-II, 133Mhz bus.

It's also somewhat chip(set) dependent; I've been told
that the Xeon chips lock only a subset of memory (a "line",
for some definition of the word) versus the entire bus.

>I just timed a Pentium III/500/100 machine at
>something near 25 cycles per "lock; cmpxchgl".
>I'm interested because I've sometimes heard the claim
>that X86 is particularly bad at this, but that
>hasn't really been consistent with my experience.

Maybe all the other chips are bad, too, but 25
cycles seems pretty horrible to me.  We made some
benchmarks to measure two versions of a tight
little loop (one always doing the cmpxchgl, the
other always branching around it) and it seemed
like a pair of locked instructions cost two or
three times as much as the entire rest of the loop,
which was not completely empty.

You're probably right about the cost of abandoning
that long-lived spine.  What wins depends on the
relative mix of probes versus updates, since the
applicative data structure lets you avoid locking
on probes.  Don't forget, in some systems, if you
are modifying data structures in place, you are
probably doing some card-marking and creating
old-to-young pointers.  Those carry their own
extra costs.  (Avoiding the card-marking on newly
allocated memory is a cute trick, if you can
manage to do it.)

David Chase


From emery@cs.utexas.edu  Fri Mar  9 02:51:10 2001
From: emery@cs.utexas.edu (Emery Berger)
Date: Thu, 8 Mar 2001 20:51:10 -0600
Subject: [gclist] Garbage collection and XML
In-Reply-To: <140D21516EC2D3119EE7009027876644049B5C8B@hplex1.hpl.hp.com>
Message-ID: <JIEFKPOIJLFODHLICCIKGEKPCHAA.emery@cs.utexas.edu>

> Is that an X86 machine?  I just timed a Pentium III/500/100 machine at
> something near 25 cycles per
> "lock; cmpxchgl".  I'm interested because I've sometimes heard the claim
> that X86 is particularly bad at this, but that hasn't really been
> consistent
> with my experience.  Is this chipset dependent, perhaps?

Timing just the "lock; cmpxchgl" doesn't give you the whole picture. The
problem is that the Pentium flushes the pipeline when it encounters a locked
instruction. The performance penalty is pretty spectacular. I'm told the P4
has a 24-stage pipeline, so locked instructions will become effectively even
more expensive.

Regards,
-- Emery

--
Emery Berger
emery@cs.utexas.edu
http://www.cs.utexas.edu/users/emery


From hans_boehm@hp.com  Fri Mar  9 18:07:08 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Fri, 9 Mar 2001 10:07:08 -0800
Subject: [gclist] synchronization cost (was: Garbage collection and XML)
Message-ID: <140D21516EC2D3119EE7009027876644049B5C92@hplex1.hpl.hp.com>

Does anyone know if this is documented somewhere?

My experience with using "lock; cmpxchgl" to atomically set a mark bit in a
bit vector was that it didn't seem to have that much of an impact.  But the
only measurement I made was a comparison to using mark bytes instead, which
appeared to be slower on a Pentium III, presumably as a result of the larger
data structure and hence added cache misses.
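
For readers who want to see what setting a mark bit with compare-and-swap
looks like, here is a minimal sketch using C11 atomics rather than inline
assembly.  The function name set_mark_bit and the layout of the bit
vector are assumptions for illustration, not the collector's actual
code.  On x86 a compiler will typically turn the compare-exchange into a
"lock; cmpxchgl".

```c
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned) * CHAR_BIT)

/* Atomically set bit 'index' in the bit vector 'bits'.
 * Returns 1 if this call set the bit, 0 if it was already set. */
static int set_mark_bit(_Atomic unsigned *bits, size_t index)
{
    _Atomic unsigned *word = &bits[index / WORD_BITS];
    unsigned mask = 1u << (index % WORD_BITS);
    unsigned old = atomic_load_explicit(word, memory_order_relaxed);

    do {
        if (old & mask)
            return 0;          /* someone else marked it first */
        /* On failure 'old' is reloaded with the current word. */
    } while (!atomic_compare_exchange_weak(word, &old, old | mask));
    return 1;
}
```

The early exit avoids the locked instruction entirely when the object is
already marked.  The mark-byte alternative needs no read-modify-write at
all, at the price of a table CHAR_BIT times larger.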

Since these are out-of-order machines, the other question is whether
subsequent instructions that don't depend on later memory references will
continue to execute during the wait.  If so, this might explain some of the
differences in measurements.

Hans



From emery@cs.utexas.edu  Fri Mar  9 20:09:41 2001
From: emery@cs.utexas.edu (Emery Berger)
Date: Fri, 9 Mar 2001 14:09:41 -0600
Subject: [gclist] synchronization cost (was: Garbage collection and XML)
In-Reply-To: <140D21516EC2D3119EE7009027876644049B5C92@hplex1.hpl.hp.com>
Message-ID: <OCECIOCFJAAMEMNABABIGEDMEFAA.emery@cs.utexas.edu>

> -----Original Message-----
> From: Boehm, Hans [mailto:hans_boehm@hp.com]
> Sent: Friday, March 09, 2001 12:07 PM
> To: 'Emery Berger'; Boehm, Hans; 'David Chase'; gclist@iecc.com
> Cc: icis-developers@bbn.com
> Subject: RE: [gclist] synchronization cost (was: Garbage collection and
> XML)
>
>
> Does anyone know if this is documented somewhere?
>

http://developer.intel.com/design/pentium4/manuals/24547203.pdf

See Chapter 7.1. "For the P6 family processors, locked operations serialize
all outstanding load and store operations (that is, wait for them to
complete). This rule is also true for the Pentium 4 processor, with one
exception: load operations that reference weakly ordered memory types (such
as the WC memory type) may not be serialized. "

-- Emery




From hans_boehm@hp.com  Fri Mar  9 22:31:23 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Fri, 9 Mar 2001 14:31:23 -0800
Subject: [gclist] synchronization cost (was: Garbage collection and XML)
Message-ID: <140D21516EC2D3119EE7009027876644049B5C96@hplex1.hpl.hp.com>

> -----Original Message-----
> From: Emery Berger [mailto:emery@cs.utexas.edu]
> 
> http://developer.intel.com/design/pentium4/manuals/24547203.pdf
> 
> See Chapter 7.1. "For the P6 family processors, locked 
> operations serialize
> all outstanding load and store operations (that is, wait for them to
> complete). This rule is also true for the Pentium 4 
> processor, with one
> exception: load operations that reference weakly ordered 
> memory types (such
> as the WC memory type) may not be serialized. "
> 

Thanks for the pointer.  This is very interesting.

I read the above statement as dealing more with the memory model than the
implementation.  The processor is normally allowed to move reads to before
logically earlier writes, assuming this is locally consistent.  It may not
do this if the read is part of the atomic operation or a later read.  Thus I
assume it basically has to wait for any store buffers to drain to the cache
before beginning the read.  That seems like an unavoidable cost given the
way the operation is defined.  It doesn't imply to me that the rest of the
processor necessarily has to be idle.

The following statement is also enlightening:

"Because frequently used memory locations are often cached in
a processor's L1 or L2 caches, atomic operations can often be carried out
inside a processor's
caches without asserting the bus lock. Here the processor's cache coherency
protocols insure
that other processors that are caching the same memory locations are managed
properly while
atomic operations are performed on cached memory locations."

Later text is explicit that for P6 and later, the bus is NOT locked for
atomic operations if the processor already has exclusive access to the cache
line.  I believe this is similar to most other recent processors.

Hans


From virtualcyber@erols.com  Sat Mar 10 00:28:24 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Fri, 9 Mar 2001 19:28:24 -0500
Subject: [gclist] synchronization cost (was: Garbage collection and XML)
References: <JIEFKPOIJLFODHLICCIKGEKPCHAA.emery@cs.utexas.edu>
Message-ID: <024701c0a8f9$0593cfb0$0100007f@cradle>

    Hi,


> > Is that an X86 machine?  I just timed a Pentium III/500/100 machine at
> > something near 25 cycles per
> > "lock; cmpxchgl".  I'm interested because I've sometimes heard the claim
> > that X86 is particularly bad at this, but that hasn't really been
> > consistent
> > with my experience.  Is this chipset dependent, perhaps?

(1)    A few years ago, I had the opportunity to do some measurements
on CMPXCHG and, if I remember correctly, the preceding figure is
pretty close to what I got -- I was reading about 30-40 instructions per
cmpxchg, and more on cmpxchg8b, on a Pentium II, 200 MHz (Windows NT 4.0).

(2)   If one is trying to use a faster locking mechanism
for the garbage collector on Windows NT (single process,
multithreaded), one might consider EnterCriticalSection.
For many cases, it is MUCH faster than
using mutexes and other synchronization mechanisms
(it is likely to be based on CMPXCHG).

    However, see

http://www.cs.wustl.edu/~schmidt/win32-cv-1.html


(3) Does anyone know how EnterCriticalSection is implemented?

I tried writing semaphores, mutexes, and shared semaphores
based on CMPXCHG and CMPXCHG8B, but my implementations
were always much slower than EnterCriticalSection.  I had
a suspicion that it was not using CMPXCHG, and
that this was the reason why it could be so
fast.  But I could never be sure.


From virtualcyber@erols.com  Sat Mar 10 00:50:37 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Fri, 9 Mar 2001 19:50:37 -0500
Subject: [gclist] synchronization cost (was: Garbage collection and XML)
References: <JIEFKPOIJLFODHLICCIKGEKPCHAA.emery@cs.utexas.edu> <024701c0a8f9$0593cfb0$0100007f@cradle>
Message-ID: <000a01c0a8fc$1fa3b020$0100007f@cradle>

    And I forgot to mention that EnterCriticalSection takes (I have read)
about 6 CPU cycles in the optimal case.

    I have seen implementations of non-reentrant spin locks
that take 5 cycles per lock (implemented using
LOCK and XCHG and MOV.  LOCK takes 1 CPU cycle, XCHG takes
3 CPU cycles and MOV takes 1 cycle).
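
    For reference, a non-reentrant spin lock of the kind described
(an atomic exchange to acquire, a plain store to release) might look
like this in portable C11.  The names (spinlock, spin_trylock, spin_lock,
spin_unlock) are made up for this sketch, and it makes no claim about
cycle counts or about how EnterCriticalSection is implemented.

```c
#include <stdatomic.h>

/* Hypothetical non-reentrant spin lock: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} spinlock;

/* One atomic exchange (XCHG on x86).  Returns 1 if we got the lock. */
static int spin_trylock(spinlock *l)
{
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

/* Busy-wait until the exchange observes the lock as free. */
static void spin_lock(spinlock *l)
{
    while (!spin_trylock(l)) {
        /* spin; a production lock would back off or yield here */
    }
}

/* Release is an ordinary store (a MOV on x86), no locked instruction. */
static void spin_unlock(spinlock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Because the lock is non-reentrant, a thread that calls spin_lock twice
without an intervening spin_unlock will spin forever.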


From hans_boehm@hp.com  Sat Mar 10 01:36:32 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Fri, 9 Mar 2001 17:36:32 -0800
Subject: [gclist] synchronization cost (was: Garbage collection and XML)
Message-ID: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com>

I just tried this on a 500 MHz Pentium III.  I get about 23 cycles for

lock; cmpxchg

and about 19 or 20 cycles for xchg (which has an implicit lock prefix).

I got consistent results by timing a loop and by looking at an instruction
level profile.  Putting other stuff in the loop didn't seem to affect the
time taken by xchg much.  Here's the code in case someone else wants to try.
(This requires Linux/gcc)

(Compile with gcc -static -O -DPROF swap.c prof.c to get profile.)

swap.c:
--------------------------------------------
#include <stdio.h>

#ifdef PROF
/* Defined in prof.c. */
void init_profiling(void);
void dump_profile(void);
#endif

typedef int GC_bool;

         inline static GC_bool GC_test_and_set(volatile unsigned *addr)
         {
	   int oldval;
	  /* Note: the "xchg" instruction does not need a "lock" prefix */
	  __asm__ __volatile__("xchgl %0, %1"
		: "=r"(oldval), "=m"(*(addr))
		: "0"(1), "m"(*(addr)) : "memory");
	  return oldval;
         }

volatile unsigned lock;

int main()
{
    int i;
#   ifdef PROF
	init_profiling();
#   endif
    for (i = 0; i < 10000000; ++i) {
        int result;
	result = GC_test_and_set(&lock);
	lock = 0;
	if (result) printf("Failed\n");
    }
#   ifdef PROF
	dump_profile();
#   endif
    return 0;
}
----------------------------------------------------

prof.c:
----------------------------------------------------

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>	/* u_short */

/* A very simple profiler.  Note that it should be possible to	*/
/* get function level information by concatenating this with nm	*/
/* output and running the result through the sort utility.	*/
/* This assumes that all interesting parts of the executable	*/
/* are statically linked.					*/

static size_t buf_size;
static u_short *profil_buf;

# ifdef __i386__
#   ifndef COMPRESSION
#     define COMPRESSION 1
#   endif
#   define TEXT_START 0x08000000
#   define PTR_DIGS 8
# endif
# ifdef __ia64__
#   ifndef COMPRESSION
#     define COMPRESSION 8
#   endif
#   define TEXT_START 0x4000000000000000
#   define PTR_DIGS 16 
# endif

extern int etext;

/*
 * Note that the ith entry in the profile buffer corresponds to
 * a PC value of TEXT_START + i * COMPRESSION * 2.
 * The extra factor of 2 is not apparent from the documentation,
 * but it is explicit in the glibc source.
 */

void init_profiling()
{
    buf_size = ((size_t)(&etext) - TEXT_START + 0x10)/COMPRESSION/2;
    profil_buf = calloc(buf_size, sizeof(u_short));
    if (profil_buf == 0) {
	fprintf(stderr, "Could not allocate profile buffer\n");
	exit(1);	/* cannot profile without a buffer */
    }
    profil(profil_buf, buf_size * sizeof(u_short),
	   TEXT_START, 65536/COMPRESSION);
}

void dump_profile()
{
    size_t i;
    size_t sum = 0;
    for (i = 0; i < buf_size; ++i) {
	if (profil_buf[i] != 0) {
	    fprintf(stderr, "%0*lx\t%d !PROF!\n",
		    PTR_DIGS,
		    (unsigned long)(TEXT_START + i * COMPRESSION * 2),
		    profil_buf[i]);
	    sum += profil_buf[i];
	}
    }
    fprintf(stderr, "Total number of samples was %lu !PROF!\n",
	    (unsigned long)sum);
}

-------------------------------------------------
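
As a small worked illustration of the mapping described in the comment
above init_profiling(), the following sketch converts between profile
buffer indices and PC values.  The names bucket_to_pc and pc_to_bucket
and the SKETCH_ constants are made up for illustration; the constant
values simply mirror the i386 settings in prof.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants; the values mirror the i386 settings above. */
#define SKETCH_TEXT_START  0x08000000UL
#define SKETCH_COMPRESSION 1UL

/* PC value represented by the ith profile buffer entry. */
static uintptr_t bucket_to_pc(size_t i)
{
    return SKETCH_TEXT_START + i * SKETCH_COMPRESSION * 2;
}

/* Buffer index that a sampled PC falls into (inverse of the above). */
static size_t pc_to_bucket(uintptr_t pc)
{
    return (pc - SKETCH_TEXT_START) / (SKETCH_COMPRESSION * 2);
}
```

With these values, entry 3 corresponds to PC 0x08000006, and both
0x08000006 and 0x08000007 land in entry 3.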


> -----Original Message-----
> From: Ji-Yong D. Chung [mailto:virtualcyber@erols.com]
> Sent: Friday, March 09, 2001 4:51 PM
> To: gclist@iecc.com
> Subject: Re: [gclist] synchronization cost (was: Garbage collection and XML)
> 
> 
>     And I forgot to mention that EnterCriticalSection takes (I have
> read) about 6 CPU cycles in the optimal case.
> 
>     I have seen implementations of non-reentrant spin locks
> that take 5 cycles per lock (implemented using LOCK, XCHG,
> and MOV: LOCK takes 1 CPU cycle, XCHG takes 3 CPU cycles,
> and MOV takes 1 cycle).
> 
> 


From chrisd@reservoir.com  Sat Mar 10 08:43:24 2001
From: chrisd@reservoir.com (Chris Dodd)
Date: Sat, 10 Mar 2001 00:43:24 -0800 (Pacific Standard Time)
Subject: [gclist] synchronization cost (was: Garbage collection and XML)
Message-ID: <Pine.WNT.3.96.1010310004158.360C-100000@zipper.reservoir.com>

> (1)    A few years ago, I had the opportunity to do some measurements
> on CMPXCHG and, if I remember correctly, the preceding figure is
> pretty close to what I got -- I was reading about 30-40 instructions
> per cmpxchg, and more on cmpxchg8b, on a Pentium II, 200 MHz
> (Windows NT 4.0).
Was this a simple cmpxchg or a lock+cmpxchg?  The former is much faster,
but of course isn't atomic on a multiprocessor machine.

> (2)   If one is trying to use a faster locking mechanism
> for the garbage collector on Windows NT (single process,
> multithreaded), one might consider EnterCriticalSection.
> For many cases, it is MUCH faster than
> using mutexes and other synchronization mechanisms.
> (likely to be based on CMPXCHG).
> 
>     However, see
> 
> http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
> 
> 
> (3) Does anyone know how EnterCriticalSection is implemented?
> 
> I tried writing semaphores, mutexes, and shared semaphores
> based on CMPXCHG and CMPXCHG8B, but my implementations
> were always much slower than EnterCriticalSection.  I had
> a suspicion that it was not using CMPXCHG, and
> that was the reason why it could be so
> fast.  But I could never be sure.

Well, one important "optimization" that WinNT does is to have TWO
DIFFERENT versions of EnterCriticalSection -- one for uniprocessor
machines and one for multiprocessors.  The UP version is considerably
faster.  I'm pretty sure the way they do that is to not use lock
prefixes in the UP version, since on a UP x86 machine, individual
instructions are (mostly) atomic.  On an MP machine, you need a
lock prefix to make them atomic.  I certainly had no difficulty
writing a lock+cmpxchg based mutex that was faster than the MP
version of EnterCriticalSection.  The exact same code without the
lock prefix was faster than the UP version of EnterCriticalSection.
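
For readers who want something concrete, here is a minimal sketch of a
compare-and-swap based mutex of the kind described above.  It is not the
code that was benchmarked: it uses C11 <stdatomic.h>, which postdates
this thread, and the sketch_ names are hypothetical.  C11 atomics always
request the multiprocessor-safe form of the operation (on x86 a compiler
will typically emit a locked cmpxchg), so the unlocked uniprocessor
variant cannot be expressed this way.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mutex: state is 0 when free, 1 when held. */
typedef struct { atomic_int state; } sketch_mutex;

/* Try once to move the state from 0 to 1; true means we now hold it. */
static bool sketch_try_lock(sketch_mutex *m)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &m->state, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Spin until the compare-and-swap succeeds. */
static void sketch_lock(sketch_mutex *m)
{
    while (!sketch_try_lock(m)) {
        /* busy-wait */
    }
}

/* Release by storing 0; no read-modify-write is needed here. */
static void sketch_unlock(sketch_mutex *m)
{
    atomic_store_explicit(&m->state, 0, memory_order_release);
}
```

A real mutex would block or yield after a bounded number of failed
attempts rather than spinning indefinitely.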

Chris Dodd
chrisd@reservoir.com


From virtualcyber@erols.com  Sat Mar 10 21:55:16 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Sat, 10 Mar 2001 16:55:16 -0500
Subject: [gclist] synchronization cost (was: Garbage collection and XML)
References: <Pine.LNX.4.21.0103091757170.24837-100000@rock.reactivenetwork.com>
Message-ID: <000b01c0a9ac$e1e9d140$0100007f@cradle>

    Hi,

    I am embarrassed to admit that, in my previous
posting, I should have checked my facts before commenting
on spin locks and XCHG.

    After seeing Boehm's email, I ran tests on XCHG,
and I am getting slightly worse results, at about
22 instruction cycles for each XCHG.

    I ran the tests on a Pentium II (200 MHz),
Windows NT 4.0, SP4.  The test was compiled using VC++ 6.0.

    The test has to be run multiple times, either
with the VC++ 6.0 profiler or with a timer,
and with a minimum number of other processes on the
host machine.

||======================================
||  Here is my code for running the test

#include <iostream.h>
#include <time.h>
#include <sys/timeb.h>

int main()
{
    struct _timeb    begin,    end;
    int locker = 0;
    int* lock_addr = &locker;

    _ftime(&begin);

    __asm
    {
        mov ebx, [lock_addr]
        mov ecx, 10000000

RETRY:

        mov edx, ebx
        mov eax, 1
        xchg eax, [edx]        // Here is the XCHG
        dec ecx
        jnz RETRY
    };

    _ftime(&end);

    cout << (end.time - begin.time) * 1000 + end.millitm - begin.millitm
         << endl;
}


From virtualcyber@erols.com  Sat Mar 10 23:12:10 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Sat, 10 Mar 2001 18:12:10 -0500
Subject: [gclist] collector optimization
References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com>
Message-ID: <000501c0a9b7$895b1150$0100007f@cradle>

    Hi,

    I just finished replacing my copying collector with 
Boehm's collector.  (I used the included C++ interface 
on VC++6.0, NT platform).  

    Eventually, I would like to try optimizing it for
speed.

    Does anyone know if there are application specific 
optimizations I can try with Boehm's collector? 
More specifically, I am wondering if there are parts of Boehm's 
code that are known to be hackable for application
specific optimization  -- I mean no disrespect to 
Boehm or to Boehm's collector, here  :)

    I do not mean just changing values of the tuning 
hooks that are provided, as I have done much of that.

    Thanks in advance, for any information related
to the collector optimization.


Take Care
Ji-Yong D. Chung


From chase@world.std.com  Sat Mar 10 23:35:56 2001
From: chase@world.std.com (David Chase)
Date: Sat, 10 Mar 2001 18:35:56 -0500
Subject: [gclist] collector optimization
In-Reply-To: <000501c0a9b7$895b1150$0100007f@cradle>
References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com>
Message-ID: <4.3.2.7.0.20010310182211.02510990@pop.std.com>

At 06:12 PM 3/10/2001 -0500, Ji-Yong D. Chung wrote:
>   Does anyone know if there are application specific 
>optimizations I can try with Boehm's collector? 
>More specifically, I am wondering if there are parts of Boehm's 
>code that are known to be hackable for application
>specific optimization  -- I mean no disrespect to 
>Boehm or to Boehm's collector, here  :)
>
>    I do not mean just changing values of the tuning 
>hooks that are provided, as I have done much of that.

Depends upon what you mean by application-specific.
Long ago, when I used the BW collector for a Modula-3
implementation, I took care (in compiler-generated
code) to use "gc_malloc_atomic" for pointer-free
data structures.  We got bit once when someone
loopholed pointers into an array of integers; the
collector recycled the memory, and it was quite
confusing.

Another possibility is to open-code the free list
selection code, for allocations of constant size.
Not a gigantic win, but every little bit helps.

Another "hack" you can apply, again through a
compiler, is to use the predict-free call (if it
still exists).  The reason for this is that if
the collector is reliably informed about how much
free space it might expect to reclaim, it can more
sensibly choose between collecting and simply
growing the heap.  If you do this, you must do it
pretty well, else you just waste memory, but if
you do it right, you avoid the thrashing that
some systems will give you when you grow big
data structures -- before they are willing to
expand the heap, they do an expensive and useless
collection (everything is still live), and repeat
that until the heap is large enough for the data
structure being built.
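
The trade-off can be made concrete with a toy policy.  This is only a
sketch of the idea, not the collector's actual heuristic or interface;
sketch_on_heap_full, its parameters, and the percentage threshold are
all assumptions made for illustration.

```c
#include <stddef.h>

typedef enum { SKETCH_COLLECT, SKETCH_GROW } sketch_action;

/*
 * Hypothetical decision made when the heap is full.  The client supplies
 * predicted_free_bytes, its estimate of how much a collection would
 * reclaim.  Collect only if that estimate is at least min_free_percent
 * of the current heap; otherwise skip the (probably useless) collection
 * and grow the heap instead.
 */
static sketch_action sketch_on_heap_full(size_t heap_bytes,
                                         size_t predicted_free_bytes,
                                         size_t min_free_percent)
{
    if (predicted_free_bytes >= heap_bytes / 100 * min_free_percent)
        return SKETCH_COLLECT;
    return SKETCH_GROW;
}
```

While a big data structure is being built the prediction stays near
zero, so a policy like this grows the heap directly instead of
collecting repeatedly with everything still live.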

David Chase


From virtualcyber@erols.com  Mon Mar 12 01:50:15 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Sun, 11 Mar 2001 20:50:15 -0500
Subject: [gclist] collector optimization
References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com> <000501c0a9b7$895b1150$0100007f@cradle> <20010311100242.A19346@goop.org>
Message-ID: <001a01c0aa96$c9adc9c0$0100007f@cradle>

    Hi,

> [you wrote] I used BW as part of a JIT-compiling Java runtime.
> Apart from the things David mentioned I also:
>
> - hacked the codegen to use the delay slots (SPARC target) to
>   zero known-dead pointers in registers.  If you know enough about the
>   instruction scheduling a compiler can probably often find a good spot
>   to stomp a dead reference.  I never quantified the improvement, but it
>   was basically free (the delay slots would have been nops otherwise),
>   and it's hard to see how it couldn't help.

    I don't deal with code generators for SPARC, so my knowledge here
is limited -- but if I understand you right, you are basically replacing
dead code with instructions for zeroing pointers that will no longer be
referenced?  That seems to make sense.

    I will try to zero out all useless pointers/values in local variables
in function calls.  (Does this gain you much, though?  I still wonder.)

> - used the typed-allocation interface.  The class-layout routines would
>   always clump pointer and non-pointer class members, so it was
>   reasonably easy to generate a descriptor with a number of clumps of
>   pointers (sub-class pointers could not be clumped with the super-class
>   pointers).

    This seems ... just a bit painful, as you have mentioned that
the performance gain may not be much :)  To use the typed allocator
interface, I would need to invoke a bit-map factory, in a static
function, for every class I have created to be used with GC.  That's
a significant amount of work for a gain that may be marginal.
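
To give a feel for the per-class work involved, here is a minimal sketch
of such a "bit-map factory" for one type, written in plain C.  The
collector's typed allocation interface (gc_typed.h) consumes a per-type
bitmap with one bit per word; the sketch stops short of calling the
collector so that it stays self-contained.  struct sketch_node and the
sketch_ functions are hypothetical names.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Example type: two pointer fields followed by pointer-free data. */
struct sketch_node {
    struct sketch_node *left;
    struct sketch_node *right;
    long key;
    double weight;
};

/* Set the bit for the pointer-sized word at the given byte offset. */
static void sketch_set_pointer_word(unsigned long *bitmap, size_t byte_offset)
{
    size_t word = byte_offset / sizeof(void *);
    bitmap[word / SKETCH_WORD_BITS] |= 1UL << (word % SKETCH_WORD_BITS);
}

/* Build the layout bitmap for struct sketch_node: bit i set means
   word i of the object may hold a pointer. */
static void sketch_describe_node(unsigned long *bitmap, size_t bitmap_words)
{
    memset(bitmap, 0, bitmap_words * sizeof *bitmap);
    sketch_set_pointer_word(bitmap, offsetof(struct sketch_node, left));
    sketch_set_pointer_word(bitmap, offsetof(struct sketch_node, right));
}
```

For this type the resulting bitmap has only bits 0 and 1 set, so a
collector given this layout would not need to examine key or weight.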


From David.Chase@naturalbridge.com  Mon Mar 12 02:06:44 2001
From: David.Chase@naturalbridge.com (David Chase)
Date: 11 Mar 2001 21:06:44 -0500
Subject: [gclist] collector optimization
In-Reply-To: <001a01c0aa96$c9adc9c0$0100007f@cradle>
References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com>
 <000501c0a9b7$895b1150$0100007f@cradle>
 <20010311100242.A19346@goop.org>
Message-ID: <4.3.2.7.0.20010311210337.0254e828@pop.std.com>

At 08:50 PM 3/11/2001 -0500, Ji-Yong D. Chung wrote:
>   I don't deal with code generators for SPARC, so my knowledge here
>is limited -- but if I understand you right, you are basically replacing
>dead code with instructions for zeroing pointers that will no longer be
>referenced?  That seems to make sense.
>
>    I will try to zero out all useless pointers/values in local variables
>in function calls.  (Does this gain you much, though?  I still wonder.)

One thing to watch out for here -- if you are generating
code at the C level, and you insert assignments to zero
out dead pointers, and you feed it to a decent C compiler,
it will turn right around and remove those nulling assignments.

After all, you're assigning values to a DEAD VARIABLE, right?
Most times, you'd want the compiler to get rid of those
assignments :-).
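
One common way to keep such a store alive, at least with GCC-compatible
compilers, is to perform it through a volatile-qualified lvalue.  The
sketch below assumes the GNU __typeof__ extension; SKETCH_CLEAR_DEAD and
sketch_sum are made-up names.  Note that this only preserves the store
to the variable itself; copies the compiler keeps in registers or spill
slots are not affected.

```c
#include <stddef.h>

/*
 * Store NULL into pointer variable p through a volatile-qualified
 * lvalue, so the compiler has to treat the write as observable and
 * cannot drop it as a dead store.  __typeof__ is a GNU extension.
 */
#define SKETCH_CLEAR_DEAD(p) (*(__typeof__(p) volatile *)&(p) = NULL)

static long sketch_sum(const long *data, size_t n)
{
    const long *p = data;   /* local copy that could keep data reachable */
    long sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += p[i];

    /* p is dead from here on; a plain "p = NULL;" could be optimized away. */
    SKETCH_CLEAR_DEAD(p);
    return sum;
}
```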

David Chase


--
David.Chase@NaturalBridge.com


From hans_boehm@hp.com  Mon Mar 12 17:56:26 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Mon, 12 Mar 2001 09:56:26 -0800
Subject: [gclist] collector optimization
Message-ID: <140D21516EC2D3119EE7009027876644049B5C9C@hplex1.hpl.hp.com>

As David pointed out, it's very helpful to tell the collector which objects
are completely pointer-free.  Besides reducing the potential for false
pointers, this reduces the number of cache lines and pages touched during
GC, sometimes by a large fraction.

There are several ways to pass more detailed layout information to the
collector.  The C typed allocation interface is one.  Gcj uses another
that's more geared towards a world in which every object has a "vtable"
pointer anyway.  They help primarily in reducing the potential for false
pointers.  The impact on typical GC time is usually minimal, since this
often doesn't change the set of cache lines that need to be touched during
GC by much.  (If you end up using primarily one of these facilities, it may
be worth restructuring the mark loop to check for the most common case
first.  Currently it assumes that it will most often be asked to scan
sequential ranges of memory, as opposed to ranges described by bit maps.)

As David also points out, there are hooks for controlling the triggering of
garbage collections, if you notice that you are spending significant amounts
of time in badly timed collections, i.e. when nothing gets reclaimed.

In your environment, you could probably get a significant win (10% of GC
time for X86?) if you are willing to tie the GC object code to a specific
machine.  Enabling prefetching in the marker often results in a significant
reduction of GC time. (See my ISMM 2000 paper.)  Code to do this currently
exists for Linux/X86, but not NT.  It should be easy to add, assuming there
is a way to tell the compiler to generate a prefetch instruction.  The
problem is that you need either a Pentium II+ or a recent AMD processor, and
the Intel and AMD prefetch instructions are incompatible.  I've been
considering optionally including all versions, and switching them based on a
dynamic test for the processor type.  But that's not yet there.
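
As a rough illustration of what prefetching in a mark loop looks like,
the sketch below uses __builtin_prefetch, which GCC and Clang provide
(so it answers the "way to tell the compiler" question for those
compilers only).  It is not the collector's marker; struct sketch_obj,
sketch_mark_all, and the prefetch distance of 4 are assumptions chosen
for illustration.

```c
#include <stddef.h>

#define SKETCH_PREFETCH_DISTANCE 4

struct sketch_obj { int marked; };

/*
 * Walk an array of object pointers (a stand-in for a mark stack), set
 * the mark on each object, and return how many were newly marked.  A
 * few entries ahead of the one being processed are prefetched so that
 * the object is more likely to be in cache when we reach it.
 */
static size_t sketch_mark_all(struct sketch_obj **stack, size_t n)
{
    size_t newly_marked = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i + SKETCH_PREFETCH_DISTANCE < n)
            __builtin_prefetch(stack[i + SKETCH_PREFETCH_DISTANCE]);
        if (!stack[i]->marked) {
            stack[i]->marked = 1;
            newly_marked++;
        }
    }
    return newly_marked;
}
```

The prefetch is only a hint, so the results are the same with or without
it; only the timing changes.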

I would expect that for something like Scheme implementation, versions 6.x
will outperform the 5.x versions of the collector, due to a more refined GC
triggering heuristic.

I've found that under Linux the collector is now occasionally faster in
incremental/generational mode.  That's application dependent.  I'm not sure
whether that's true under NT, since (based on obsolete anecdotal evidence
only) I believe the signal/exception handling overhead for the VM write
barrier is higher under NT.

Hans

> -----Original Message-----
> From: Ji-Yong D. Chung [mailto:virtualcyber@erols.com]
> Sent: Saturday, March 10, 2001 3:12 PM
> To: gclist@iecc.com
> Subject: [gclist] collector optimization
> 
> 
>     Hi,
> 
>     I just finished replacing my copying collector with 
> Boehm's collector.  (I used the included C++ interface 
> on VC++6.0, NT platform).  
> 
>     Eventually, I would like to try optimizing it for
> speed.
> 
>     Does anyone know if there are application specific 
> optimizations I can try with Boehm's collector? 
> More specifically, I am wondering if there are parts of Boehm's 
> code that are known to be hackable for application
> specific optimization  -- I mean no disrespect to 
> Boehm or to Boehm's collector, here  :)
> 
>     I do not mean just changing values of the tuning 
> hooks that are provided, as I have done much of that.
> 
>     Thanks in advance, for any information related
> to the collector optimization.
> 
> 
> Take Care
> Ji-Yong D. Chung


From virtualcyber@erols.com  Tue Mar 13 01:33:25 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Mon, 12 Mar 2001 20:33:25 -0500
Subject: [gclist] collector optimization
References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com><000501c0a9b7$895b1150$0100007f@cradle><20010311100242.A19346@goop.org> <4.3.2.7.0.20010311210337.0254e828@pop.std.com>
Message-ID: <006301c0ab5d$9bcd86e0$0100007f@cradle>

    Hi

> One thing to watch out for here -- if you are generating
> code at the C level, and you insert assignments to zero
> out dead pointers, and you feed it to a decent C compiler,
> it will turn right around and remove those nulling assignments.      
    
    Thank you for pointing that out.  

    Doing coding work only to have a compiler rub it out --
that would have been not only an example of highly inefficient coding,
but also embarrassing.  :)

    -- More embarrassing than making a mistaken claim about
the execution cost of the assembly instruction XCHG, and then
emailing your claim to 10000000000000 people.


From virtualcyber@erols.com  Tue Mar 13 03:16:31 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Mon, 12 Mar 2001 22:16:31 -0500
Subject: [gclist] collector optimization
References: <140D21516EC2D3119EE7009027876644049B5C9C@hplex1.hpl.hp.com>
Message-ID: <007e01c0ab6c$00eb6b60$0100007f@cradle>

    Hi

    Thanks for your reply -- I will pay careful attention to your
suggestions.

> I've found that under Linux the collector is now occasionally faster in
> incremental/generational mode.  That's application dependent.

    Though, I wonder if it is a sign that Linux has some personal
issues to deal with.  Linux, I have read, does not
manage threads too well, at least as a real-time OS.  See

http://www.ittc.ukans.edu/~zvishal/courses/800/RT_ORB/


From jmaessen@mit.edu  Wed Mar 14 16:15:08 2001
From: jmaessen@mit.edu (Jan-Willem Maessen)
Date: Wed, 14 Mar 2001 11:15:08 -0500
Subject: [gclist] RE: synchronization cost (was: Garbage collection and XML)
Message-ID: <200103141615.LAA00703@lauzeta.mit.edu>

Much discussion of CMPXCHG on Intels has gone by.  However, it's worth
pointing out that Pentium-class processors allow you to do all sorts
of locked operations.  From the IA-64 manual, Pentium ISA section [p
5-261, LOCK, if you care; the older Pentium manuals agree on this
information]: 

  The LOCK prefix can be prepended only to the following instructions
  and to those forms of the instructions that use a memory operand:
  ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB,
  SUB, XOR, XADD, and XCHG. ...

That's a pretty long list, and one of the above instructions is
usually what you actually want.  For example (getting back to GC
here), BTC/BTR/BTS allow you to atomically update a shared
allocation/mark bitmap efficiently.
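
A portable sketch of that kind of update, using C11 <stdatomic.h>
(which postdates this message) rather than inline assembly, might look
like the following.  The function name and the choice of unsigned long
words are assumptions; on x86 a compiler will typically implement the
fetch-or with one of the locked instructions listed above.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Atomically set bit number "bit" in a shared mark bitmap and report
 * whether it was already set.  Relaxed ordering is used because only
 * the bit itself is being communicated here.
 */
static bool sketch_mark_bit_test_and_set(atomic_ulong *bitmap, size_t bit)
{
    unsigned long mask = 1UL << (bit % SKETCH_BITS_PER_WORD);
    unsigned long old = atomic_fetch_or_explicit(
        &bitmap[bit / SKETCH_BITS_PER_WORD], mask, memory_order_relaxed);
    return (old & mask) != 0;
}
```

A marker thread that gets false back knows it was the first to mark the
object and is therefore the one responsible for scanning it.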

As a datapoint, when I was first designing the synchronization for
Eager Haskell (lots of shared updates and a global GC'd heap on an
SMP), I took a look at the synchronization in the Linux kernel.  The
result was enlightening:

$ cat /proc/version
Linux version 2.2.12-20 (root@lauzeta.mit.edu) [...]
$ cd /usr/src/linux
$ find . -name '*.[ch]' -print | xargs fgrep 'cmpxchg'
./drivers/usb/uhci.c:		asm volatile("lock ; cmpxchg %4,%2 ; sete %0"
./drivers/usb/uhci.c:	asm volatile("lock ; cmpxchg %0,%1"

Only one file with a cmpxchg---in the [I believe] then-experimental
usb code.  There are tons of calls to xchg, and many atomic
increments, bit set/clear operations, and the like (you can find these
using a similar command, though it's a bit more work separating wheat
from chaff).

For me, the result of this observation was simple.  I wrote interfaces
which provide the atomic operations I actually require for my system:
exchange, compare and swap, bit vector operations, and so forth.  On a
Pentium, these each have their own inline assembly.  On another
architecture, such as SPARC, I roll them using compare and swap.

I'm in the odd situation of doing some synchronization directly in
compiler-generated code (I'm generating C); I suspect most similar
systems restrict synchronization to run-time routines.  In my case gcc
seems to do a noticeably better job of optimization if I avoid "asm
volatile" entirely in favor of "asm" and a correct set of instruction
effects.  This isn't surprising, really; what is surprising is how
infrequently it seems to be done in others' code.

I'll close with a question I have not yet managed to answer.  Our GC
uses an unshared nursery.  Right now, I test for nursery-ness on some
paths in order to determine whether to perform a write barrier.  The
same test allows me to perform synchronization (in this case a
Store/Store fence) only on shared objects.  The open question: is it
worth checking for locality before _every_ such synchronization?  If
so, it is not worth eliminating the test from my write barrier code,
and I should add a similar test to other code.  Otherwise, I should
use a test-free write barrier (blind card marking of some sort),
perform synchronization all over the place and rely on the fact that
the local stuff will happen in cache.  Has anyone done an experiment
like this on a multiprocessor?

-Jan-Willem Maessen
Eager Haskell project
jmaessen@mit.edu


From hans_boehm@hp.com  Wed Mar 14 17:21:02 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Wed, 14 Mar 2001 09:21:02 -0800
Subject: [gclist] RE: synchronization cost (was: Garbage collection and XML)
Message-ID: <140D21516EC2D3119EE7009027876644049B5CB0@hplex1.hpl.hp.com>

> Much discussion of CMPXCHG on Intels has gone by.  However, it's worth
> pointing out that Pentium-class processors allow you to do all sorts
> of locked operations.  From the IA-64 manual, Pentium ISA section [p
> 5-261, LOCK, if you care; the older Pentium manuals agree on this
> information]: 
> 
>   The LOCK prefix can be prepended only to the following instructions
>   and to those forms of the instructions that use a memory operand:
>   ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB,
>   SUB, XOR, XADD, and XCHG. ...
> 
> ...
>
> For me, the result of this observation was simple.  I wrote interfaces
> which provide the atomic operations I actually require for my system:
> exchange, compare and swap, bit vector operations, and so forth.  On a
> Pentium, these each have their own inline assembly.  On another
> architecture, such as SPARC, I roll them using compare and swap.

Extrapolating from the posted results, I would guess that, assuming all
cache hits, implementing say fetch-and-add using CMPXCHG costs somewhere
in the 25-27 cycle range, where LOCK; ADD is probably around 20.  I would
guess that on architectures using LL-SC, the difference is vaguely
comparable.  And it's
tiny compared to the difference between either of these and wrapping the
operation in a lock.  It's probably still worthwhile in many cases, and I
will probably go back and do it in my collector code, but it currently seems
to be an engineering tradeoff between less machine-dependent code and saving
a few cycles.  Now if there were a standard library that implemented all the
variants efficiently so that everyone didn't have to reimplement them ...
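
The two implementation strategies being compared can be sketched with
C11 <stdatomic.h>, which postdates this message and is roughly the kind
of standard library wished for here.  The sketch_ function names are
made up; how each call is lowered (a CMPXCHG loop versus a single locked
add on x86, LL-SC elsewhere) is up to the compiler.

```c
#include <stdatomic.h>

/* Fetch-and-add built from compare-and-swap: retry until no other
   thread has changed the value between our read and our update. */
static long sketch_fetch_add_cas(atomic_long *target, long delta)
{
    long old = atomic_load_explicit(target, memory_order_relaxed);
    while (!atomic_compare_exchange_weak(target, &old, old + delta)) {
        /* on failure, old has been reloaded with the current value */
    }
    return old;
}

/* Fetch-and-add using the dedicated primitive. */
static long sketch_fetch_add_direct(atomic_long *target, long delta)
{
    return atomic_fetch_add(target, delta);
}
```

Both return the value held before the addition, so callers can switch
between them without other changes.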

> I'm in the odd situation of doing some synchronization directly in
> compiler-generated code (I'm generating C); I suspect most similar
> systems restrict synchronization to run-time routines.  In my case gcc
> seems to do a noticeably better job of optimization if I avoid "asm
> volatile" entirely in favor of "asm" and a correct set of instruction
> effects.  This isn't surprising, really; what is surprising is how
> infrequently it seems to be done in others' code.

Does it really make much difference if one of the effects is "memory"?
For updating a mark bit it wouldn't need to include that.  In many other
cases, you do want the compiler to treat it as a memory barrier.

I have recently tended to specify both "memory" and "volatile", largely
out of paranoia.  I think linuxthreads is similar.  Is there a cheaper
way to ensure that gcc preserves the memory barrier semantics?  (I agree
that it's not needed when you don't need the barrier.)

Hans


From plakal@cs.wisc.edu  Wed Mar 14 17:26:45 2001
From: plakal@cs.wisc.edu (Manoj Plakal)
Date: Wed, 14 Mar 2001 11:26:45 -0600
Subject: [gclist] RE: synchronization cost (was: Garbage collection and XML)
In-Reply-To: <200103141615.LAA00703@lauzeta.mit.edu>; from Jan-Willem Maessen on Wed, Mar 14, 2001 at 11:15:08AM -0500
References: <200103141615.LAA00703@lauzeta.mit.edu>
Message-ID: <20010314112645.B12837@cs.wisc.edu>

Jan-Willem Maessen wrote (Wed, Mar 14, 2001 at 11:15:08AM -0500) :
> Much discussion of CMPXCHG on Intels has gone by.  However, it's worth
> pointing out that Pentium-class processors allow you to do all sorts
> of locked operations.  From the IA-64 manual, Pentium ISA section [p
> 5-261, LOCK, if you care; the older Pentium manuals agree on this
> information]: 
> 
>   The LOCK prefix can be prepended only to the following instructions
>   and to those forms of the instructions that use a memory operand:
>   ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB,
>   SUB, XOR, XADD, and XCHG. ...
> 
> That's a pretty long list, and one of the above instructions is
> usually what you actually want.  For example (getting back to GC
> here), BTC/BTR/BTS allow you to atomically update a shared
> allocation/mark bitmap efficiently.

	If you look at the part of the Intel manuals describing
	optimizations for the Pentium-II/III/IV, I think you'll
	find that they deprecate the use of prefixes like this.

	Manoj


From chase@world.std.com  Wed Mar 14 18:32:03 2001
From: chase@world.std.com (David Chase)
Date: Wed, 14 Mar 2001 13:32:03 -0500
Subject: [gclist] RE: synchronization cost (was: Garbage collection
 and XML)
In-Reply-To: <200103141615.LAA00703@lauzeta.mit.edu>
Message-ID: <4.3.2.7.0.20010314104030.02731da8@pop.std.com>

At 11:15 AM 3/14/2001 -0500, Jan-Willem Maessen wrote:
>Much discussion of CMPXCHG on Intels has gone by.  However, it's worth
>pointing out that Pentium-class processors allow you to do all sorts
>of locked operations.

>Only one file with a cmpxchg---in the [I believe] then-experimental
>usb code.  There are tons of calls to xchg, and many atomic
>increments, bit set/clear operations, and the like (you can find these
>using a similar command, though it's a bit more work separating wheat
>from chaff).

I can only speak for myself, but designing a VM for Java,
I worked with the instructions that I knew were available
on most architectures (Sparc V9, Pentium, PowerPC, M68040,
MIPS, notably NOT Sparc V8) and not all of those have a
general class of locked instructions.  It's possible
to simulate everything else with CAS or LL/SC, but in those
cases you probably would have been better off in the first
place designing algorithms that worked directly with CAS
or LL/SC.

I was also interested in building a system that was
starvation-free, and also that would scale efficiently.
Given limited intellectual bandwidth, and these constraints,
I decided to use wait-free internal data structures
designed around CAS or LL/SC.  I'd read the Herlihy papers
on this, and given the tricky world I was working on, this
seemed like the most practical choice.  It's possible I could
have come up with something faster that used (for instance)
exchange or atomic add, but what we've got now is quite
fast (we benchmark against the competition, obviously)
and quite scalable (synchronization never, ever, requires
allocation of additional anything, a huge simplification,
and a decoupling of two pieces of the VM that I really
want to be decoupled).

The use of CAS also allows us to maintain always-globally-
consistent data structures, which is a very good thing --
for instance, our VM runs with deadlock-detection always
enabled for Java locks.  (In fact, this sped things
up a little on most benchmarks, which indicates that we
ought to do just a hair more busy-waiting before putting
a thread properly to sleep.)  In addition, using CAS
we were able to design synchronization that is slow only
in the following cases:

1) contended first acquisition
2) contended release

Anything else goes fast (single CAS) or faster (no CAS).

The internals of a parallel garbage collector are another
animal of course, as is the card-marking code.  That, we
generate directly from a compiler.  On Pentium, we simply
store bytes with no additional synchronization.  On a
machine with much weaker memory consistency, we might do
something else altogether.  My assumption is that the
conditionals and comparisons necessary to determine that
a card mark could be avoided are more expensive than simply
doing the mark blind.  I could be wrong, but modern
processors aren't very happy about conditional branches in
general.  This decision might be different on processors that
didn't provide consistency of byte stores across processors
(that is, if one cache line could overwrite another).
I don't care about apparent ordering, as long as a write
never disappears (because card-marking is monotonic
until a GC -- only the GC unmarks cards).
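
A blind card mark of the kind described here can be sketched in a few
lines of C.  The 512-byte card size, struct sketch_heap, and
sketch_store_with_card_mark are assumptions for illustration, not the
VM's actual layout or generated code.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CARD_SHIFT 9   /* assumed card size: 512 bytes */

/* Hypothetical heap descriptor: base address plus one byte per card. */
struct sketch_heap {
    uintptr_t base;
    unsigned char *cards;
};

/*
 * Store a pointer into a heap field and unconditionally mark the card
 * covering that field.  There is no test of the old or new value, so
 * the barrier is a shift, an add, and a byte store with no branches.
 */
static void sketch_store_with_card_mark(struct sketch_heap *h,
                                        void **field, void *value)
{
    *field = value;
    h->cards[((uintptr_t)field - h->base) >> SKETCH_CARD_SHIFT] = 1;
}
```

Because mutators only ever write 1 and only the collector clears cards,
two threads racing to mark the same card write the same value, which is
why no ordering beyond "the byte store is not lost" is required.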

BUT, if you can avoid a memory bus lock, at least on
Pentium, you can afford the conditionals.  This I 
know from benchmarks, though I also know that the
branches are all extremely well-predicted.  
The conditional branch seems to add 5-10% to the cost
of the locked CAS, but subtracts 90-95% if the lock
can be avoided.  We detect when we are running on a
uniprocessor, and simply avoid using the locked
version of CAS in that case.  A JIT could compile
this in-line; we impose what is essentially a
small tax on the MP case.

>[ telegraphically described experiment elided]

>Has anyone done an experiment
>like this on a multiprocessor?

Not yet :-).  One experiment we could run, but have not yet,
is to determine the overhead of (optimized in various ways)
card marking.  We support two garbage collectors; we could
take a set of benchmarks, run them under full-copying with
card marks enabled, then recompile them w/o card marks and
rerun and measure the difference.

David Chase


From fjh@cs.mu.oz.au  Wed Mar 14 20:37:03 2001
From: fjh@cs.mu.oz.au (Fergus Henderson)
Date: Thu, 15 Mar 2001 07:37:03 +1100
Subject: [gclist] Boehm GC & Linux/SPARC
In-Reply-To: <140D21516EC2D3119EE700902787664401E3A7F7@hplex1.hpl.hp.com>
Message-ID: <20010315073703.A26598@hg.cs.mu.oz.au>

Q1: is there a boehm-gc-developers mailing list?
Should there be?

Q2: Is anyone using the Boehm et al collector with Linux/SPARC?
There seems to be some code in it to handle that combination, but it
doesn't work.  I tried it (using cf.sourceforge.net) and found that it
crashes very early.  The definition of DATASTART using LINUX_DATA_START
doesn't work, because __data_start is not defined (both __data_start
and data_start are zero).

I had a look at the linker script (output by `ld -v'),
and based on that, I tried using __etext for DATASTART.
However, that didn't work, because there are some unmapped pages
between the rodata (which follows __etext) and the other data.
There didn't seem to be any linker-defined symbol I could use
to find the end of rodata or the start of the remaining data.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
                                    |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.


From hans_boehm@hp.com  Wed Mar 14 23:12:40 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Wed, 14 Mar 2001 15:12:40 -0800
Subject: [gclist] Boehm GC & Linux/SPARC
Message-ID: <140D21516EC2D3119EE7009027876644049B5CB7@hplex1.hpl.hp.com>

> Q2: Is anyone using the Boehm et al collector with Linux/SPARC?
> There seems to be some code in it to handle that combination, but it
> doesn't work.  I tried it (using cf.sourceforge.net) and found that it
> crashes very early.  The definition of DATASTART using 
> LINUX_DATA_START
> doesn't work, because __data_start is not defined (both __data_start
> and data_start are zero).
> 
> I had a look at the linker script (output by `ld -v'),
> and based on that, I tried using __etext for DATASTART.
> However, that didn't work, because there are some unmapped pages
> between the rodata (which follows __etext) and the other data.
> There didn't seem to be any linker-defined symbol I could use
> to find the end of rodata or the start of the remaining data.
> 
This seems to be somewhat distribution dependent.  (Dependent on both glibc
and the linker script, to be more precise.)  I believe there is now a
consensus that __data_start should be defined, and it is defined in glibc on
most of the architectures, I believe.  If it still isn't defined on SPARC,
there's a good chance the glibc maintainer (drepper@redhat.com) would
appreciate a patch.

In the latest 6.0alpha versions, if you define SEARCH_FOR_DATA_START the
collector will look for a nonzero definition of first __data_start, then
data_start, and then search backward from _end if they both fail.  That
should work OK if someone also submits the glibc patch.

My version actually uses GC_SysVGetDataStart on Linux/SPARC.  Does that not
work?

Hans


From fjh@cs.mu.oz.au  Thu Mar 15 04:40:04 2001
From: fjh@cs.mu.oz.au (Fergus Henderson)
Date: Thu, 15 Mar 2001 15:40:04 +1100
Subject: [gclist] Boehm GC & Linux/SPARC
In-Reply-To: <140D21516EC2D3119EE7009027876644049B5CB7@hplex1.hpl.hp.com>
References: <140D21516EC2D3119EE7009027876644049B5CB7@hplex1.hpl.hp.com>
Message-ID: <20010315154004.A26458@hg.cs.mu.oz.au>

On 14-Mar-2001, Boehm, Hans <hans_boehm@hp.com> wrote:
> 
> My version actually uses GC_SysVGetDataStart on Linux/SPARC.  Does that not
> work?

Before sending my mail, I checked that gc6.0alpha6 fails too.  It too
segfaults in the same place (at `deferred = *limit', line 654 in mark.c).
But I didn't notice that it was using GC_SysVGetDataStart();
it looks like it is failing for a different reason.

I found a work-around: compile with `-O0'.

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
                                    |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.


From cef@geodesic.com  Thu Mar 15 21:35:20 2001
From: cef@geodesic.com (Charles Fiterman)
Date: Thu, 15 Mar 2001 15:35:20 -0600
Subject: [gclist] Java Phantom references.
Message-ID: <3.0.1.32.20010315153520.014b67b0@pop3.geodesic.com>

First, is there a Java manual other than the Sun website new enough to
discuss phantom references?

What are they supposed to be used for?

How do collectors handle them? My guess is when the object is toast the
phantom reference is put on a queue.


From DICK@watson.ibm.com  Thu Mar 15 21:49:06 2001
From: DICK@watson.ibm.com (DICK@watson.ibm.com)
Date: Thu, 15 Mar 01 16:49:06 EST
Subject: [gclist] Java Phantom references.
Message-ID: <200103152155.QAA38986@sp1n189at0.watson.ibm.com>

***** Reply to your note of: Thu, 15 Mar 2001 15:35:20 -0600 *************
The Addison-Wesley book "The Java Class Libraries Second Edition,
Volume 1 Supplement for the Java 2 Platform Standard Edition, v1.2"
ISBN 0 201 48552 4 discusses phantom references - p 698.

I think they're supposed to be a generalization of finalizable
objects (with wrinkles - see soft/weak references also) - rather
than just a finalize() method of a finalizable object, _any_
object can be notified when some object of interest becomes
garbage.

C.R. Attanasio


From virtualcyber@erols.com  Fri Mar 16 08:07:08 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Fri, 16 Mar 2001 03:07:08 -0500
Subject: [gclist] My copying collector or Boehm's?
References: <20010315073703.A26598@hg.cs.mu.oz.au>
Message-ID: <000901c0adf0$19af2980$0100007f@cradle>

    Hi,

    This question is regarding a design decision,
whether to use Boehm's collector or my copying
collector.

    I have written a scheme interpreter in C++, 
solely to manipulate XML parsing.  I have not implemented 
the XML parser yet, but I am planning to write one, 
as a C++ extension to the scheme interpreter.
Its memory management will be done by the garbage collector for 
the interpreter itself.

    To get ready to write this extension, I have just
replaced my super simple collector 
(which uses Cheney's algorithm)
with Boehm's collector.  After a number of test runs 
comparing both collectors, I cannot decide whether to 
keep my original collector or use Boehm's.
    
    With Boehm's collector, my interpreter runs about 2.5
times slower than before.  This is not a knock on Boehm's
collector.  The drop-off in performance was expected,
because (1) my original collector is turned on and off
at precise points in my C++ code to minimize
collection, (2) my collector uses type information all the time,
(3) it uses no locks for allocation, because it has a
separate heap for each thread, and (4) heap residency
was low for the test cases, which favors a copying
collector over mark-sweep.
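
    Point (3) can be illustrated with a minimal sketch of a per-thread
bump allocator.  This is not the interpreter's actual allocator: ThreadHeap,
t_heap and AllocateNoLock are made-up names, the heap size is arbitrary,
and a real heap would trigger a collection instead of returning null.
The point is only that a thread-local bump pointer needs no lock on the
allocation path.

```cpp
#include <cstddef>

// Hypothetical per-thread heap: each thread bumps its own offset, so the
// allocation fast path needs neither a lock nor an atomic instruction.
struct ThreadHeap {
    static constexpr std::size_t kSize = 1 << 16;  // arbitrary size
    alignas(std::max_align_t) char space[kSize];
    std::size_t used = 0;

    void* allocate(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) / a * a;               // round up to keep alignment
        if (n > kSize - used) return nullptr;  // a real heap would collect here
        void* p = space + used;
        used += n;
        return p;
    }
};

// One heap per thread; objects in it must not be shared across threads.
thread_local ThreadHeap t_heap;

void* AllocateNoLock(std::size_t n) { return t_heap.allocate(n); }
```

The price is that objects allocated this way must stay private to the
thread that owns the heap.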

    Of course, Boehm's collector offers things
other than raw performance.  First of all, it is written
for C/C++, so that if I write XML extensions in C++,
Boehm's collector fits in nicely with my C++ implementations
of Scheme.

    My collector, on the other hand, is a copying collector,
and it is not generalized for C/C++.  If I write any Scheme
extensions in C++, I must "protect" local C++
variables that might otherwise lose their pointers to Scheme objects
(again, implemented in C++) when the garbage collector runs.
This makes writing C++ code
more painful than it would be if I were using
Boehm's collector.
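
    A minimal sketch of what such "protection" can look like with a moving
collector, assuming a shadow stack of registered locals.  RootGuard,
g_roots, Obj and relocate are hypothetical names chosen for illustration;
they do not come from the interpreter described here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical object type; a real interpreter has a class hierarchy.
struct Obj { int value; };

// Hypothetical shadow stack: addresses of C++ locals that hold pointers
// into the collected heap.  A moving collector rewrites these slots.
static std::vector<Obj**> g_roots;

// RAII guard: registers a local on construction, unregisters on scope exit.
class RootGuard {
public:
    explicit RootGuard(Obj*& slot) : slot_(&slot) { g_roots.push_back(slot_); }
    ~RootGuard() {
        assert(!g_roots.empty() && g_roots.back() == slot_);  // LIFO order
        g_roots.pop_back();
    }
    RootGuard(const RootGuard&) = delete;
    RootGuard& operator=(const RootGuard&) = delete;
private:
    Obj** slot_;
};

// Stand-in for a collection: moves one object and fixes up every
// registered root that pointed at the old copy.
void relocate(Obj* from, Obj* to) {
    *to = *from;
    for (Obj** slot : g_roots)
        if (*slot == from) *slot = to;
}
```

Every local that is live across an allocation needs its own guard; forgetting
one leaves a stale pointer, which is the pain described above.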

    Secondly, for each class, I must implement
a static "Move" function (for copying one
object from FromSpace to ToSpace).  Writing
support structures to dispatch this at high speed makes
my code much more complex (and ugly).

    I am having a difficult time deciding whether to
use Boehm's collector or not because (1) on one
hand, I have read that I am generally supposed to
sacrifice performance for design improvement;
(2) on the other hand, with Boehm's collector,
my interpreter runs 2.5 times as slow, which
may be too much to sacrifice.

    In summary, using Boehm's collector
would simplify C++ code writing, but it would make
my code regrettably slower.

    Which one should I use?  I have been banging
my head against walls for a few days.  I would
appreciate any comments or insights that would
alleviate my headache.


P.S.    In case anyone asks what my
"user requirements" are -- because I am writing a "new"
type of application, I am not sure at this point if there is
such a thing.  There is some chance that performance
will become an issue, so I have always been trying not
to be ad hoc in my design decisions.


From pekka@harlequin.co.uk  Fri Mar 16 13:58:40 2001
From: pekka@harlequin.co.uk (Pekka P. Pirinen)
Date: Fri, 16 Mar 2001 13:58:40 GMT
Subject: [gclist] Java Phantom references.
In-Reply-To: <3.0.1.32.20010315153520.014b67b0@pop3.geodesic.com> (message
 from Charles Fiterman on Thu, 15 Mar 2001 15:35:20 -0600)
Message-ID: <200103161358.NAA20538@zaphod.cam.harlequin.co.uk>

> First is there a Java Manual other than the Sun Website new enough to
> discuss phantom references?

That's canonical.  I suppose there must be other JDK 1.2 manuals by
now.  There's a nice article on the technicalities of weak refs at
<http://developer.java.sun.com/developer/technicalArticles/ALT/RefObj/index.html>.
I wrote precise definitions of the Java weakness concepts for the MM
Reference, see
<http://www.xanalys.com/software_tools/mm/glossary/p.html#phantom.reference>.

> What are they supposed to be used for?

Cleanup after the finalize method (which might well be inherited) has
run.  Also, I think it's nice for doing finalization (in the generic
sense) without opening the door for resurrection: you just subclass
PhantomReference to hold the info you need for finalization.

> How do collectors handle them? My guess is when the object is toast the
> phantom reference is put on a queue.

That's pretty much given, since one has to implement the
ReferenceQueue anyway.
-- 
Pekka P. Pirinen
Harlequin Limited


From hans_boehm@hp.com  Fri Mar 16 17:19:22 2001
From: hans_boehm@hp.com (Boehm, Hans)
Date: Fri, 16 Mar 2001 09:19:22 -0800
Subject: [gclist] My copying collector or Boehm's?
Message-ID: <140D21516EC2D3119EE7009027876644049B5CC0@hplex1.hpl.hp.com>

A few observations:
>     With Boehm's collector, my interpreter runs about 2.5
> times slower than before.  This is not a knock on Boehm's 
> collector.
I still find that surprising, since it suggests that your interpreter is
spending >60% of its time allocating/collecting with our collector,
something I don't normally see.  A profile and GC log would be interesting
to me.  In particular, it would be nice to know:
a) How much time is spent in the marker?
b) How much time is spent locking around allocation?  (The win32 threads
port uses EnterCriticalSection() and LeaveCriticalSection().  On many other
platforms a custom locking scheme is used instead, since the standard one
exhibited serious performance problems, some form of convoying being the
most common.  I've heard from several sources that at least some
implementations of the win32 primitives have similar problems, so a custom
solution should be used as well.  That would be fairly easy, but I haven't
gotten around to it yet.  A confirmation that it really is a problem would
help.)
c) What fraction of the time is spent context switching (see (b)).
d) Does the amount of live data in the GC log look right?
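
The >60% figure above can be checked with simple arithmetic, under the
assumption that everything except allocation and collection takes as long as
it did before.  If the run time grows from T to 2.5T, the added 1.5T is 60%
of the new total, and any time the old collector already spent only pushes
the fraction higher.  A minimal sketch (AddedFraction is a made-up helper
name, not part of any collector):

```cpp
#include <cassert>

// Fraction of the new run time taken by the added cost, assuming the rest
// of the program takes exactly as long as it did before the change.
double AddedFraction(double slowdown) {
    assert(slowdown >= 1.0);
    return (slowdown - 1.0) / slowdown;
}
```

By the same formula a slowdown factor of 2 corresponds to 50%.
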
> The drop off in performance was expected, 
> because (1) my original collector is turned on and off 
> at precise points in my C++ code to minimize
> collection
That's potentially a big win, clearly.  You can
also do that with our collector, though it may be much
less effective with a global heap than per-thread heaps.

> (2) my collector uses type information all the time, 
My experience has been that for small objects this matters in the expected
case only to the extent that it reduces the overhead of checking a real
pointer, i.e. only if you can actually reduce checking on a "pointer" field
to a comparison against null.  That requires that you disallow pointers to
statically allocated data, which typically requires some copying for
constants.  And even the significance of that has been decreasing as GC
costs become more dominated by the costs of cache misses on the data being
traced/copied.

> (3) it uses no locks for allocation, because it has a
> separate heap for each thread.and
Potentially a large win, but very restrictive on the client code.

> (4) heap residency 
> was low for the test cases. -- which favors copying 
> collector over mark-sweep.
Somewhat.  But it also helps our collector a lot, provided you similarly
increase the heap size.  (This should work better in 6.0 than in the 5.x
releases.)  If it doesn't help, I would definitely suspect the locking code.

Hans


From virtualcyber@erols.com  Fri Mar 16 23:20:59 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Fri, 16 Mar 2001 18:20:59 -0500
Subject: [gclist] My copying collector or Boehm's?
References: <140D21516EC2D3119EE7009027876644049B5CC0@hplex1.hpl.hp.com>
Message-ID: <001101c0ae6f$c3237e60$0100007f@cradle>

Hi,

    There is always a good chance, given my relative inexperience,
that I really could be screwing up with my use of your collector.  At least
it is not crashing :)

> [I wrote]
> >     With Boehm's collector, my interpreter runs about 2.5
> > times slower than before.  This is not a knock on Boehm's
> > collector.

> [you replied] I still find that surprising, since it suggests that your
> interpreter is spending >60% of its time allocating/collecting with our
> collector, something I don't normally see.  A profile and GC log would be
> interesting to me.  In particular, it would be nice to know:
> a) How much time is spent in the marker?
> b) How much time is spent locking around allocation?  (The win32 threads
> port uses EnterCriticalSection() and LeaveCriticalSection().  On many other
> platforms a custom locking scheme is used instead, since the standard one
> exhibited serious performance problems, some form of convoying being the
> most common.

    Only 1 thread was really active in my test runs, so I'd guess that
a convoy did not have a chance to form.

    In any case, I did two profile runs on VC++6.0: one for my Scheme
interpreter's main evaluation loop, and another one for GC_malloc.  (While
it would have been preferable to profile them in the same run, I had slight
difficulty getting my profiler to run correctly.)

Both profiles were taken with my interpreter running the same Scheme source
code.


=====================================================================
||     This profile run is on the Evaluation loop of my Scheme interpreter.
|| This is difficult to read, but there are supposed to be 6 columns.
|| They are:
|| (1) function time in milliseconds,
|| (2) function time as percent of overall measured time
|| (3) function+child time,
|| (4) function+child time as percent of overall measured time
|| (5) hit count
|| (6) name of the function.


Profile: Function timing, sorted by time
Date:    Fri Mar 16 17:37:57 2001

Program Statistics
------------------
    Command line at 2001 Mar 16 17:34: tpscript.exe
    Start function: _GC_malloc
    Total time: 18652.854 millisecond
    Time outside of functions: 7747.360 millisecond
    Call depth: 13
    Total functions: 3854
    Total hits: 5216935
    Function coverage: 2.9%
    Overhead Calculated 6
    Overhead Average 6

Module Statistics for tpscript.exe
----------------------------------
    Time in module: 6488.312 millisecond
    Percent of time in module: 59.5%
    Functions in module: 3589
    Hits in module: 2337349
    Module function coverage: 0.1%

        Func          Func+Child                 Hit
        Time      %         Time      %      Count  Function
---------------------------------------------------------
    6475.395  59.4    10726.211  98.4  2336757 _GC_malloc (gc.dll)
      11.505   0.1       28.241   0.3      148 _GC_default_push_other_roots (gc.dll)
       0.838   0.0        1.082   0.0      148 _GC_push_all_stack (gc.dll)
       0.573   0.0        1.086   0.0      148 _GC_push_finalizer_structures (gc.dll)
       0.000   0.0        0.000   0.0      148 Script::SchemeVM::PushRoots(void)

Note* -- As you indicated in your email, my program spends most of
the time in _GC_malloc.  I think this includes the time spent on
garbage collection as well.

============================================================
|| Here is the measurement on your GC.dll

Module Statistics for gc.dll
----------------------------
    Time in module: 4417.182 millisecond
    Percent of time in module: 40.5%
    Functions in module: 265
    Hits in module: 2879586
    Module function coverage: 39.6%

        Func          Func+Child                 Hit
        Time      %         Time      %      Count  Function
---------------------------------------------------------
    2097.812  19.2     2112.566  19.4    10188 _GC_mark_from (mark.obj)
     559.504   5.1      559.504   5.1    13777 _GC_reclaim_clear (reclaim.obj)
     125.837   1.2      171.310   1.6  2380112 _GC_malloc (malloc.obj)
     111.131   1.0      133.912   1.2     2093 _GC_build_fl (new_hblk.obj)
     105.330   1.0      384.709   3.5      296 _GC_apply_to_all_blocks (headers.obj)
      99.517   0.9     4245.413  38.9    18596 _GC_generic_malloc (malloc.obj)
      75.057   0.7      863.804   7.9    18744 _GC_continue_reclaim (reclaim.obj)
      74.471   0.7       74.471   0.7     1794 _GC_reclaim_clear2 (reclaim.obj)
      66.135   0.6       66.135   0.6    25577 _GC_clear_hdr_marks (mark.obj)
      65.022   0.6     4058.541  37.2    18596 _GC_allocobj (alloc.obj)
      56.553   0.5      188.229   1.7    24667 _GC_reclaim_block (reclaim.obj)
      53.679   0.5      750.716   6.9    16503 _GC_reclaim_generic (reclaim.obj)
      49.125   0.5     4107.725  37.7    18596 _GC_generic_malloc_inner (malloc.obj)
      48.739   0.4       73.931   0.7    22284 _GC_block_nearly_full (reclaim.obj)
      45.451   0.4      300.788   2.8      148 _GC_finish_collection (alloc.obj)
      44.964   0.4      111.610   1.0    14328 _GC_allochblk_nth (allchblk.obj)
      42.626   0.4     2286.872  21.0    11816 _GC_mark_some (mark.obj)
      38.236   0.4       38.236   0.4    24815 _GC_next_used_block (headers.obj)
      38.181   0.4       38.181   0.4    18596 _GC_write_hint (os_dep.obj)
      38.171   0.4       38.171   0.4    18596 _GC_invoke_finalizers (finalize.obj)
      38.030   0.3      788.746   7.2    16503 _GC_reclaim_small_nonempty_block (reclaim.obj)
      36.689   0.3       36.689   0.3      148 _GC_read_changed (stubborn.obj)
      35.890   0.3       91.150   0.8    24667 _clear_marks_for_block (mark.obj)
      32.599   0.3     2382.769  21.8      148 _GC_stopped_mark (alloc.obj)
      31.552   0.3       50.877   0.5    18596 _GC_clear_stack (misc.obj)
      26.944   0.2       26.944   0.2      780 _GC_reclaim_clear4 (reclaim.obj)
      26.059   0.2       81.437   0.7     1332 _GC_push_next_marked_uncollectable (mark.obj)
      25.969   0.2       25.969   0.2    25406 _GC_block_empty (reclaim.obj)
      21.913   0.2       21.913   0.2      445 _GC_build_fl_clear4 (new_hblk.obj)
      21.005   0.2      135.505   1.2     2241 _GC_allochblk (allchblk.obj)
      19.420   0.2       19.420   0.2    18744 _GC_approx_sp (mark_rts.obj)
      17.333   0.2       17.333   0.2     9792 _GC_add_to_black_list_stack (blacklst.obj)
      16.263   0.1       17.142   0.2     1184 _GC_push_marked (mark.obj)
      15.211   0.1      252.143   2.3      148 _GC_start_reclaim (reclaim.obj)
      14.620   0.1       14.620   0.1      296 _GC_get_lo_stack_addr (win32_threads.obj)
      13.995   0.1       13.995   0.1      148 _GC_clear_bl (blacklst.obj)
      12.799   0.1       12.799   0.1    11073 _GC_block_nearly_full3 (reclaim.obj)
      12.394   0.1       12.394   0.1    10472 _GC_block_nearly_full1 (reclaim.obj)
      12.031   0.1       15.668   0.1     2118 _GC_add_to_fl (allchblk.obj)
      11.344   0.1       11.344   0.1      148 _GC_stop_world (win32_threads.obj)
      10.281   0.1       16.618   0.2     2093 _GC_get_first_part (allchblk.obj)
       8.093   0.1       32.655   0.3     1939 _GC_freehblk (allchblk.obj)
       7.453   0.1        7.453   0.1    11816 _GC_default_oom_fn (misc.obj)
       7.453   0.1        7.453   0.1    11816 _GC_never_stop_func (misc.obj)
       7.368   0.1      276.784   2.5     2241 _GC_new_hblk (new_hblk.obj)
       7.141   0.1        7.141   0.1      148 _GC_start_world (win32_threads.obj)
       6.898   0.1       22.951   0.2     2094 _setup_header (allchblk.obj)
       6.653   0.1        6.653   0.1     2102 _GC_is_black_listed (blacklst.obj)
       6.282   0.1       11.214   0.1     2093 _GC_install_counts (headers.obj)
       6.064   0.1        6.981   0.1     1939 _GC_free_block_ending_at (allchblk.obj)


============================================================

> c) What fraction of the time is spent context switching (see (b)).

    I have just two threads, one of which is taking a nice nap while
the other is running.  (So there is not much overhead on that).

> d) Does the amount of live data in the GC log look right?

    I have not yet looked into this -- I have to dig into your package to
find how to use GC_log.

> > The drop off in performance was expected,
> > because (1) my original collector is turned on and off
> > at precise points in my C++ code to minimize
> > collection
>
> That's potentially a big win, clearly.  You can
> also do that with our collector, though it may be much
> less effective with a global heap than per-thread heaps.

    Do you mean that, provided I can make your collector use per-thread
heaps, I can stop GC_malloc from checking whether
it should garbage collect?

> > (2) my collector uses type information all the time,
> My experience has been that for small objects this matters in the expected
> case only to the extent that it reduces the overhead of checking a real
> pointer, i.e. only if you can actually reduce checking on a "pointer"
> field to a comparison against null.  [snipped, as the rest is a bit over
> my head]

    And indeed, there is no need for pointer checking, as my collector
does use the type information.

> > (4) heap residency
> > was low for the test cases. -- which favors copying
> > collector over mark-sweep.
> Somewhat.  But it also helps our collector a lot, provided you similarly
> increase the heap size.  (This should work better in 6.0 than in the 5.x
> releases.)  If it doesn't help, I would definitely suspect the locking
> code.

    I have increased the starting heap size for your collector to about
1 Meg, but I have not seen a substantial speedup yet.  I will try
2 Megs.


From jmaessen@mit.edu  Sat Mar 17 01:37:10 2001
From: jmaessen@mit.edu (Jan-Willem Maessen)
Date: Fri, 16 Mar 2001 20:37:10 -0500
Subject: [gclist] Re: RE: synchronization cost (was: Garbage collection and XML)
Message-ID: <200103170137.UAA01053@lauzeta.mit.edu>

Manoj Plakal <plakal@cs.wisc.edu> replied to my post on Intel synchronization:
> >   The LOCK prefix can be prepended only to the following instructions
> >   and to those forms of the instructions that use a memory operand:
> >   ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB,
> >   SUB, XOR, XADD, and XCHG. ...
> > 
> > That's a pretty long list, and one of the above instructions is
> > usually what you actually want.  For example (getting back to GC
> > here), BTC/BTR/BTS allow you to atomically update a shared
> > allocation/mark bitmap efficiently.
> 
>         If you look at the part of the Intel manuals describing
>         optimizations for the Pentium-II/III/IV, I think you'll
>         find that they deprecate the use of prefixes like this.

After a bit of back-and-forth with Manoj, my conclusion is that this
was probably a misreading of (paraphrasing) "Don't use prefixes except
for 0F".  This seems to refer specifically to 0F introducing
multi-byte instructions such as MMX, SIMD, CMPXCHG, and the like.  My
impression is that prefixes bottleneck instruction fetch/decode.
There were various other warnings about avoiding the LOCK prefix
wherever possible, but I was unable to find anything specifically
blessing LOCK; CMPXCHG or deprecating other uses.
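
For concreteness, the atomic mark-bitmap update mentioned in the quoted text
can be sketched in portable C++ with std::atomic rather than hand-written
BTS.  This is only an illustration: MarkBitmap and TestAndSet are made-up
names, the bitmap size is arbitrary, and which instruction the compiler
emits for fetch_or (commonly a LOCK-prefixed one on x86) is its own choice.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical shared mark bitmap: one bit per heap object.
struct MarkBitmap {
    static constexpr std::size_t kWords = 64;  // arbitrary size
    std::atomic<std::uint32_t> words[kWords];

    MarkBitmap() {
        for (auto& w : words) w.store(0, std::memory_order_relaxed);
    }

    // Atomically sets the bit and reports whether this call was the one
    // that set it, so exactly one marker thread "wins" each object.
    bool TestAndSet(std::size_t index) {
        assert(index < kWords * 32);
        const std::uint32_t mask = std::uint32_t(1) << (index % 32);
        const std::uint32_t old =
            words[index / 32].fetch_or(mask, std::memory_order_acq_rel);
        return (old & mask) == 0;
    }
};
```

A marker would call TestAndSet(i) and trace object i only when it returns
true.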

That being said, kabbalistic readings of Intel documents are a blood
sport in multiprocessor memory model circles.  Manoj helpfully
provided a link to the appropriate documentation:
>         I was looking at the Pentium III Architecture Optimization
>         Manual at this URL:
>            http://developer.intel.com/design/pentiumii/manuals/245127.htm

You may be best off reading it and drawing your own conclusions!

-Jan-Willem Maessen
Eager Haskell project
jmaessen@mit.edu


From virtualcyber@erols.com  Sat Mar 17 05:36:38 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Sat, 17 Mar 2001 00:36:38 -0500
Subject: [gclist] Ooops, sent the wrong profile result
References: <140D21516EC2D3119EE7009027876644049B5CC0@hplex1.hpl.hp.com>
Message-ID: <000001c0aea4$4f689f20$0100007f@cradle>

    Hi,

    Sorry about this -- I included the wrong profile result for my
Evaluator in my earlier email.  The correct profiling result
is included below.  I also included the profile of a run of
my Evaluator when it was using my copying collector.
If you look at the results, it looks as though the version using my copying
collector is slower than the one using the Boehm collector.  That is
not the case -- I remeasured relative speeds,
and the inclusion of the Boehm collector slows my
program by a factor very close to 2
(not 2.5 as I said in my earlier email).

    The fact that the Boehm collector uses only 32% (GC_malloc +
GC_malloc_atomic) of overall execution time does not convey
the whole picture.  After I started using it, my whole program
slowed down, even in sections that were not calling GC_malloc.
I have no idea why it should be so -- this is where doing computer
work feels like voodoo.

   Perhaps part of my problem is that I am using the C++ interface
rather than the C interface, rather than the
Boehm collector itself.  When I ported to the Boehm
collector, I included more C++ syntax -- but
most statements I changed are equivalent to their C versions or even faster
in terms of speed (??).

    I must be missing something.


============================================================
||  This profile run is the correct one.
||

Module Statistics for tpscript.exe
----------------------------------
    Time in module: 26528.648 millisecond
    Percent of time in module: 100.0%
    Functions in module: 3589
    Hits in module: 6629027
    Module function coverage: 3.7%

        Func          Func+Child           Hit
        Time   %         Time      %      Count  Function
---------------------------------------------------------
   11369.062  42.9    26528.578 100.0      187 Script::Evaluator::Evaluate(class Script::Scheme *,class Script::Environment *) (eval1.obj)
    7582.324  28.6     7608.891  28.7  2329467 _GC_malloc (gc.dll)
    2987.185  11.3    26528.166 100.0  1578097 Script::Primitive::Invoke(class Script::Scheme * *,int,class Script::Environment *,class Script::SchemeVM *) (primitives.obj)
     878.060   3.3      878.783   3.3   341136 _GC_malloc_atomic (gc.dll)
     859.823   3.2     1400.314   5.3   139338 Script::PrimitiveBase<class Script::Addition>::Fun(class Script::Scheme * *,int,class Script::Environment *,class Script::SchemeVM *) (primitivesnumber.obj)
     462.704   1.7      462.704   1.7   379151 Script::PrimitiveBase<class Script::VectorRef>::Fun(class Script::Scheme * *,int,class Script::Environment *,class Script::SchemeVM *) (primitivesvector.obj)
     279.420   1.1      279.420   1.1   338680 Script::Integer::Integer(int) (objnumbers.obj)

============================================================
||  This profile run is one for my program using copying collector.


Module Statistics for tpscript.exe
----------------------------------
    Time in module: 33731.923 millisecond
    Percent of time in module: 100.0%
    Functions in module: 4067
    Hits in module: 7616101
    Module function coverage: 5.6%

        Func          Func+Child           Hit
        Time   %         Time      %      Count  Function
---------------------------------------------------------
   13619.320  40.4    33731.807 100.0      227 Script::Evaluator::Evaluate(struct Script::Scheme *,struct Script::Environment *) (eval1.obj)
    6265.637  18.6    33730.767 100.0  1578140 Script::PrimitiveFT::Invoke(struct Script::Primitive *,struct Script::Scheme * *,int,class Script::SchemeVM *) (primitives.obj)
    3255.299   9.7     3255.299   9.7  1748575 Script::Recycler::AllocateNOGC(int,int &) (recycler.obj)
    1564.422   4.6     2526.844   7.5   379151 Script::PrimitiveBase<class Script::VectorRef>::Fun(struct Script::Primitive *,struct Script::Scheme * *,int,class Script::SchemeVM *) (primitivesvector.obj)
    1059.304   3.1     1328.214   3.9   139338 Script::PrimitiveBase<class Script::Addition>::Fun(struct Script::Primitive *,struct Script::Scheme * *,int,class Script::SchemeVM *) (primitivesnumber.obj)
     818.749   2.4      818.749   2.4   558899 Script::VectorFT::Dimension(struct Script::Scheme *) (objvector.obj)
     633.502   1.9      808.520   2.4   344876 Script::Recycler::Copy(struct Script::Scheme *,void * &) (recycler.obj)
     549.137   1.6      549.137   1.6   499194 Script::VectorFT::ElementAt(struct Script::Scheme *,int) (objvector.obj)


From dfb@watson.ibm.com  Mon Mar 19 23:19:39 2001
From: dfb@watson.ibm.com (David F. Bacon)
Date: Mon, 19 Mar 2001 18:19:39 -0500
Subject: [gclist] quantitative comparison of garbage collectors
Message-ID: <3AB6940B.6C1EF068@watson.ibm.com>

does anyone know of work that has been done quantitatively comparing
different GCs?

whether or not you know of such work, if (just for example) someone were
writing a paper on that topic, what is the parameter space that you
think ought to be explored?  application speed?  memory utilization?
others??

thanks in advance,

david


From mwh@dsl.cis.upenn.edu  Sun Mar 25 02:48:21 2001
From: mwh@dsl.cis.upenn.edu (Michael Hicks)
Date: Sat, 24 Mar 2001 21:48:21 -0500 (EST)
Subject: [gclist] Re: Daily gclist MIME digest V3 #150
In-Reply-To: <200103231020.f2NAKLp04362@gradient.cis.upenn.edu> from "owner-gclist@lists.iecc.com" at Mar 23, 2001 05:20:21 am
Message-ID: <200103250248.f2P2mL506553@codex.cis.upenn.edu>

> does anyone know of work that has been done quantitatively comparing
> different GCs?

We wrote some papers comparing different GC *mechanisms*, but not different
GC's.  The metric we were interested in was GC speed, and we hypothesized
about effects on the client (mutator).  See http://www.cis.upenn.edu/~oscar
for papers and details.

Mike

-- 
Michael Hicks
Ph.D. Candidate, the University of Pennsylvania
http://www.cis.upenn.edu/~mwh            mailto://mwh@dsl.cis.upenn.edu
"In essential things, unity; in doubtful things, liberty; in all things,
 charity." --Pope John XXIII, Ad Petri Cathedram, and popularly 
 attributed to St. Augustine


From will@ccs.neu.edu  Wed Mar 28 16:58:18 2001
From: will@ccs.neu.edu (William D Clinger)
Date: Wed, 28 Mar 2001 11:58:18 -0500 (EST)
Subject: [gclist] older-first garbage collection
Message-ID: <200103281658.f2SGwIg13274@electra.ccs.neu.edu>

[I sent this last week, but it didn't show up so I'm trying a different
email address.]

Lars Hansen has completed his PhD thesis, titled "Older-First Garbage
Collection in Practice".  This thesis describes and compares the
performance of several interchangeable garbage collectors in Larceny,
our implementation of Scheme for the SPARC.  The abstract and gzip'ed
Postscript for this thesis are online at

    http://www.ccs.neu.edu/home/will/GC/lth-thesis/index.html

This URL also contains a link to the development version of Larceny
that Lars used to collect the data in his thesis.  This development
version of Larceny is newer than version 1.0a1 but contains several
compiler bugs that we elected not to fix until Lars had completed
his benchmarking.  When (most of) these bugs are fixed, we will
release Larceny 2.0a1.

Will Clinger
Northeastern University
(on sabbatical at Sun Labs East)


From moss@cs.umass.edu  Wed Mar 28 20:53:01 2001
From: moss@cs.umass.edu (Eliot Moss)
Date: Wed, 28 Mar 2001 15:53:01 -0500 (EST)
Subject: [gclist] Post Doc position at UMass/Amherst
Message-ID: <15042.20269.246229.615670@kiwi.cs.umass.edu>

The Architecture and Language Implementation group of the Department of
Computer Science, University of Massachusetts at Amherst seeks a
post-doctoral researcher to join our laboratory. Our current researcher,
Dr. Stephen Blackburn, will be leaving in about a year and we would prefer
a person who could overlap with him, starting sometime between June 2001
and January 2002. We have funding for about 2.5 - 3 years.

We are engaged in work on garbage collection, compiler optimization, and
persistent and distributed systems. We have our own compiler infrastructure
(Scale, written in Java) as well as a source license to IBM's Jalapeno Java
Virtual Machine. Focus for this researcher is to be Java performance, for
garbage collection, persistence, or distributed shared memory -- or,
something else if you can convince me :-). We have a group of about 15,
including 3 professors, a staff programmer, and graduate student research
assistants, a significant portion of whom are engaged in Scale or Jalapeno
work. We are also concerned with developing and simulating new hardware
architectural ideas, simulators to support such work, dynamic management of
energy/power, and dynamic and adaptive performance improvement
(particularly of memory accesses) in general.

We have a good track record in developing and evaluating gc, persistence,
and OO language implementation mechanisms, including the Mature Object
Space (Train) algorithm used in some commercial Java systems.

If you have a serious interest, send me email and I'll forward application
information.

Regards --						Eliot Moss
==============================================================================
J. Eliot B. Moss, Associate Professor     http://www.cs.umass.edu/~moss    www
Department of Computer Science            +1-413-545-4206                voice
140 Governor's Drive, Room 372            +1-413-545-1249                  fax
University of Massachusetts               moss@cs.umass.edu              email
Amherst, MA  01003-4610  USA              +1-413-545-3733 Priscilla Coe  sec'y
==============================================================================