From virtualcyber@erols.com Thu, 1 Mar 2001 00:36:37 -0500
Date: Thu, 1 Mar 2001 00:36:37 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Boehm's collector and "dynamic roots"

Hi,

I am having a problem with getting Boehm's collector to work
for me. I was wondering if anyone could help.

I have just incorporated Boehm's collector for my scheme
interpreter. It is implemented in C++. The interpreter keeps its
own stack (rather than using C++ stack). To get the collector to
scan "my" stack, I registered it as a root segment using one of
the interfaces provided by Boehm's collector.

My problem is that I have allocated approximately 200k for the
stack, and on each collection cycle, Boehm's collector is likely
to scan the whole thing as one of its roots.

If I were using a C++ stack, Boehm's collector would just scan
the stack until it reached the top. Because my stack is not a C++
stack, and it is registered as a root segment, Boehm's collector is
treating it as just that -- a root segment rather than as stack.

I am wondering if there is any way to cause the collector to scan
only to the top of MY OWN stack, rather than the whole 200k.

For this, it seems that I need to implement additional interfaces
that could add what one may call "dynamic" roots -- roots whose
size will change when I push my stack.

Has anyone encountered a similar problem before, and know of a
relatively simple solution to this problem? (Perhaps a few pointers
to how I might need to modify Boehm's collector?)

I would appreciate any comments/enlightenment.

From fjh@cs.mu.oz.au Thu, 1 Mar 2001 17:32:07 +1100
Date: Thu, 1 Mar 2001 17:32:07 +1100
From: Fergus Henderson fjh@cs.mu.oz.au
Subject: [gclist] Boehm's collector and "dynamic roots"

On 01-Mar-2001, Ji-Yong D. Chung wrote:
> I am wondering if there is any way to cause
> the collector to scan only to the top of MY OWN
> stack, rather than the whole 200k.
>
> For this, it seems that I need to implement
> additional interfaces that could add
> what one may call "dynamic" roots -- roots
> whose size will change when I push
> my stack.
>
> Has anyone encountered a similar problem
> before, and know of a relatively simple solution
> to this problem?

Use the GC_push_other_roots() hook, which is declared in gc_priv.h:

    extern void (*GC_push_other_roots)();
            /* Push system or application specific roots     */
            /* onto the mark stack.  In some environments    */
            /* (e.g. threads environments) this is           */
            /* predefined to be non-zero.  A client supplied */
            /* replacement should also call the original     */
            /* function.                                     */

This will get called whenever a collection occurs.
In your function, you can call GC_push_all() or GC_push_all_stack()
to register your stack.

--
Fergus Henderson  |  "I have always known that the pursuit
                  |  of excellence is a lethal habit"
WWW:              |     -- the last words of T. S. Garp.

From virtualcyber@erols.com Thu, 1 Mar 2001 02:44:23 -0500
Date: Thu, 1 Mar 2001 02:44:23 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Garbage collection and XML

Hi,

I am trying to modify a c++ XML parser library, so that it uses
a GC.

I have just begun my effort, and I am curious what garbage
collector/memory management forum had to say about XML DOM and SAX
specification and/or parser implementations.

In particular, I was wondering how well XML parsers (DOM and SAX)
might get along with today's garbage collection/memory management
techniques. (Perhaps my question is ridiculously too general).
Does the fact that DOM interface is a tree structure manipulation
tool make any difference?

I will be grateful and satisfied with first impressions, general
comments, in fact anything about garbage collection and memory
management of XML parsers, spec, etc.

From ken@bitsko.slc.ut.us 01 Mar 2001 09:09:36 -0600
Date: 01 Mar 2001 09:09:36 -0600
From: Ken MacLeod ken@bitsko.slc.ut.us
Subject: [gclist] Garbage collection and XML

"Ji-Yong D. Chung" writes:

> I am trying to modify a c++ XML parser library, so that it uses
> a GC.
>
> I have just begun my effort, and I am curious what garbage
> collector/memory management forum had to say about XML DOM and SAX
> specification and/or parser implementations.
>
> In particular, I was wondering how well XML parsers (DOM and
> SAX) might get along with today's garbage collection/memory
> management techniques. (Perhaps my question is ridiculously too
> general). Does the fact that DOM interface is a tree structure
> manipulation tool make any difference?

I specifically selected garbage collection (and Boehm GC in
particular) for implementing the Orchard/Mostly-C XML library (which
has a fledgling C++ interface, by the way).

The performance is excellent (which most GCers will likely have
expected :-). We've variously stress and performance tested it
using large documents and thousands of small documents per second.
Using the Expat C XML parser we're running about 1/3 the speed of
the raw parser, and that due mostly to creating objects for each
parse event and doing string copies of XML text (the latter may have
a memmgmt hook to prevent, haven't checked deeper yet).

One of the particular reasons for using GC is that most people want
their XML trees with "parent" references in them, which creates
cycles.

From your earlier postings, you may be interested in creating a
scheme binding to Orchard[1] (rather than using the C++ interface I
mentioned above), Orchard implements a very lightweight SAX and DOM
that would work well in Scheme.

-- Ken

[1]

From ok@atlas.otago.ac.nz Fri, 2 Mar 2001 10:30:26 +1300 (NZDT)
Date: Fri, 2 Mar 2001 10:30:26 +1300 (NZDT)
From: Richard A. O'Keefe ok@atlas.otago.ac.nz
Subject: [gclist] Garbage collection and XML

"Ji-Yong D. Chung" wrote:
    I have just begun my effort, and I am curious
    what garbage collector/memory management
    forum had to say about XML DOM and SAX
    specification and/or parser implementations.

The first thing is to say that the DOM specification is one of the
worst specifications I've seen in a long time. If your goal is to
process XML, there are other approaches which are *far* better,
especially in terms of memory consumption. The only known reasons
for using the DOM interface for *anything* are
 - market requirement
 - required interface to an existing DOM implementation.

When I say that a naive implementation of an XML tree in C (using
either UTF-8 or TR-6 for strings) is likely to take *half* the memory
of any plausible DOM implementation, you must understand that I am not
joking and not exaggerating.

When I say that a clever (but not by any means innovative)
implementation based on hash consing ideas can shrink memory
requirements still further, you must again understand that I am not
joking and not exaggerating.

So the first step in memory management for XML is "Don't use the DOM".

One of the issues with the DOM is that every node is rigidly locked in
place by cross-links pointing every which way. This makes naive
reference counting useless. Oddly enough, a more space-efficient
approach could make very effective use of reference counting. On the
other hand, it does mean that if any node in a document is garbage,
they are all garbage. So if you were doing the DOM in C++, "pointers"
into a document should be pairs consisting of a pointer to a root node
and a pointer to a subnode. Reference counting operations would affect
the root node only, and when its count reached zero the whole tree
could go.

The traversal features in Level 2 DOM are downright nasty. They are
complex to implement, and complex to understand. Implementing
traversal by simply constructing a list object of some kind would be
easier to use correctly, and when you consider the implementation
overheads of the DOM, almost certainly rather more efficient. I found
that supporting traversal slowed a test DOM implementation I wrote
down by about 20%.
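
The root-plus-subnode "pointer" described above can be sketched in C++
roughly as follows. This is only an illustration and not code from the
thread: Document, Node, NodeRef, make_example and g_live_nodes are
hypothetical names, and a Document object stands in for the counted root
node. Links inside the tree are plain pointers and carry no counts, so
parent and child cross-links cost nothing to maintain.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

static int g_live_nodes = 0;  // instrumentation only, used by the example

// Hypothetical tree node: links within the tree are raw and uncounted.
struct Node {
    std::string name;
    Node* parent;
    std::vector<Node*> children;
    Node(std::string n, Node* p) : name(std::move(n)), parent(p) { ++g_live_nodes; }
    ~Node() { --g_live_nodes; }
};

// Stands in for the root: owns every node and holds the only count.
// Must be heap-allocated, because release() ends with "delete this".
class Document {
public:
    Document() { root_ = add("#document", nullptr); }
    ~Document() { for (Node* n : nodes_) delete n; }
    Document(const Document&) = delete;
    Document& operator=(const Document&) = delete;

    Node* root() const { return root_; }
    Node* add(const std::string& name, Node* parent) {
        Node* n = new Node(name, parent);
        nodes_.push_back(n);
        if (parent) parent->children.push_back(n);
        return n;
    }
    void retain() { ++refs_; }
    void release() { if (--refs_ == 0) delete this; }

private:
    int refs_ = 0;
    std::vector<Node*> nodes_;
    Node* root_ = nullptr;
};

// The "pointer into a document": a (document, subnode) pair.
// Copying or destroying it touches the document's count only.
class NodeRef {
public:
    NodeRef() = default;
    NodeRef(Document* doc, Node* node) : doc_(doc), node_(node) {
        if (doc_) doc_->retain();
    }
    NodeRef(const NodeRef& o) : doc_(o.doc_), node_(o.node_) {
        if (doc_) doc_->retain();
    }
    NodeRef& operator=(NodeRef o) {  // copy-and-swap; o releases the old document
        std::swap(doc_, o.doc_);
        std::swap(node_, o.node_);
        return *this;
    }
    ~NodeRef() { if (doc_) doc_->release(); }

    Node* operator->() const { return node_; }
    NodeRef parent() const {
        assert(node_ != nullptr);
        return NodeRef(doc_, node_->parent);
    }
    NodeRef child(std::size_t i) const {
        assert(node_ != nullptr);
        return NodeRef(doc_, node_->children.at(i));
    }

private:
    Document* doc_ = nullptr;
    Node* node_ = nullptr;
};

// Builds <exam><question/></exam> and returns a handle to the inner
// element only; that single handle keeps the whole tree alive.
NodeRef make_example() {
    Document* doc = new Document;
    NodeRef root(doc, doc->root());
    Node* exam = doc->add("exam", doc->root());
    Node* question = doc->add("question", exam);
    return NodeRef(doc, question);
}
```

Holding the one NodeRef returned by make_example() keeps all three nodes
alive, walking to q.parent() needs no per-node bookkeeping, and the whole
tree is freed when the last handle goes away. The standard library's
std::shared_ptr aliasing constructor offers a similar pairing of an
owner with a pointer to a sub-object.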

From kanderson@bbn.com Thu, 01 Mar 2001 20:30:43 -0500
Date: Thu, 01 Mar 2001 20:30:43 -0500
From: Ken Anderson kanderson@bbn.com
Subject: [gclist] Garbage collection and XML

At 04:30 PM 3/1/2001, Richard A. O'Keefe wrote:
>"Ji-Yong D. Chung" wrote:
>    I have just begun my effort, and I am curious
>    what garbage collector/memory management
>    forum had to say about XML DOM and SAX
>    specification and/or parser implementations.
>
>The first thing is to say that the DOM specification is one of the
>worst specifications I've seen in a long time. If your goal is to
>process XML, there are other approaches which are *far* better,
>especially in terms of memory consumption. The only known reasons
>for using the DOM interface for *anything* are
> - market requirement
> - required interface to an existing DOM implementation.
>
>When I say that a naive implementation of an XML tree in C (using
>either UTF-8 or TR-6 for strings) is likely to take *half* the memory of
>any plausible DOM implementation, you must understand that I am not
>joking and not exaggerating.
>
>When I say that a clever (but not by any means innovative) implementation
>based on hash consing ideas can shrink memory requirements still further,
>you must again understand that I am not joking and not exaggerating.
>
>So the first step in memory management for XML is "Don't use the DOM".
>
>One of the issues with the DOM is that every node is rigidly locked in place
>by cross-links pointing every which way. This makes naive reference counting
>useless. Oddly enough, a more space-efficient approach could make very
>effective use of reference counting. On the other hand, it does mean that
>if any node in a document is garbage, they are all garbage. So if you were
>doing the DOM in C++, "pointers" into a document should be pairs consisting
>of a pointer to a root node and a pointer to a subnode.
>Reference counting operations would affect the root node only, and when
>its count reached zero the whole tree could go.
>
>The traversal features in Level 2 DOM are downright nasty. They are
>complex to implement, and complex to understand. Implementing traversal
>by simply constructing a list object of some kind would be easier to use
>correctly, and when you consider the implementation overheads of the DOM,
>almost certainly rather more efficient. I found that supporting traversal
>slowed a test DOM implementation I wrote down by about 20%.

I thought I'd support you with some numbers, though in Java.

Here's a summary of loading the CIM_Schema23.xml file from
http://www.dmtf.org/spec/cim_schema_v23.html into a Java DOM. This
XML file describes a class hierarchy of 732 classes of the DMTF/CIM
standard.

Some Statistics:

File sizes:

    MBytes of   What
    3.76        XML
    0.27        zipped
    40.2        DOM (IBM)

Sharing strings can save a significant amount, and some XML parsers
will do that.

Vectors are given a default size of 10, while 74% have size 1, and
12% have size 3. Trimming Vectors to the right size saves 5.6MB.

I wrote a SAX parser that read the same file but produced an
s-expression with structure sharing. It required only 4.0MB, slightly
more than the ASCII XML size.

It's not so much an issue of GC as watching where the bytes go.

k

From virtualcyber@erols.com Thu, 1 Mar 2001 20:34:32 -0500
Date: Thu, 1 Mar 2001 20:34:32 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Garbage collection and XML

Thank you for your replies.

I will have to re-evaluate my goals and what I am attempting to do,
in light of your email messages.

I had suspected a few problems with XML parsers (having heard of
complaints) but I did not know that experts would be this harsh
toward DOM.

Take Care
Ji-Yong D. Chung

From ok@atlas.otago.ac.nz Fri, 2 Mar 2001 16:48:35 +1300 (NZDT)
Date: Fri, 2 Mar 2001 16:48:35 +1300 (NZDT)
From: Richard A. O'Keefe ok@atlas.otago.ac.nz
Subject: [gclist] Garbage collection and XML

Ken Anderson backed up my recommendation against the DOM with some
figures. I have some of my own. This is from a collection of Computer
Science exam papers marked up as XML (well, actually, SGML,
automatically transcoded to XML).

    f% wc exams.xml
       26731  279088 2634021 exams.xml
    f% sgml -xml exams.xml | a.out
    There are 251215 references to 1341 strings, 187.334 references each.
    69915 bytes were used, with 6029 bytes wasted, for an average of
    52.1365 bytes used and 4.4959 bytes wasted per string, or
    0.278307 bytes used and 0.0239994 bytes wasted per copy.

Here a "string" is any of
 - an element name
 - an attribute name
 - an attribute value
 - character content

The program puts all of these things in a hash table, so all strings
are unique. When I wrote this I was concerned with the effectiveness
of the hash table, and whether it would be a good idea to do my own
mallocking out of blocks instead of allocating each string separately.
(Yes it was.)

The program does NOT report the amount of space needed for the tree,
so the rather impressive total of 75,944 bytes required to hold the
string content of 2,634,021 bytes of XML (about 3%) is misleading.

Assuming that there's no sharing in the tree structure (which is
unlikely, because things like Java occur quite often), the data
structure I use would charge 1 word per string reference + 2 words per
element. There are 61517 elements, so there'd be
374249 * 4 = 1,496,996 bytes for the tree structure (AT MOST).

    1,496,996 bytes for tree structure
       75,944 bytes for strings
    -------------------
    1,572,940 bytes total storage.
    BUT
    2,634,021 bytes for the XML sources.

From dave@cs.adelaide.edu.au Fri, 2 Mar 2001 14:23:44 +1030
Date: Fri, 2 Mar 2001 14:23:44 +1030
From: Dave Munro dave@cs.adelaide.edu.au
Subject: [gclist] Real-Time GC for high-level languages

At 3:20 PM +0000 30/1/01, Andrew Cheadle wrote:
>I seem to remember a theoretical paper:
>    Guy E. Blelloch, Perry Cheng: On Bounding Time and Space for
>Multiprocessor Garbage Collection. PLDI 1999: 104-117
>
>which makes claims of bounded pause times. I believe, but I'm not sure,
>that Perry Cheng was looking at implementing the techniques mentioned in
>the above paper in the TILT ML compiler:

Just a note to say that William Brodie-Tyrrell, one of my students,
implemented a version of the Blelloch and Cheng bounded GC which we
reported in

Vaughan, Francis A., Brodie-Tyrrell, William F., Falkner, Katrina E.
and Munro, David S., "Bounded Parallel Garbage Collection:
Implementation and Adaptation", To appear in Proceedings of 7th
Australian Parallel and Real Time PART'2000 Sydney.

This can be picked up from
http://www.cs.adelaide.edu.au/users/jacaranda/publications.html

Cheers

Dave
--
--------------------------------------------------------------------------
David Munro                        _--_|\    phone: +61 8 8303 6173
Department of Computer Science,   /      \   fax:   +61 8 8303 4366
University of Adelaide,           \_.--*_/
South Australia 5005                    v
Australia                  http://www.cs.adelaide.edu.au/~dave

From plakal@cs.wisc.edu Thu, 1 Mar 2001 21:57:22 -0600
Date: Thu, 1 Mar 2001 21:57:22 -0600
From: Manoj Plakal plakal@cs.wisc.edu
Subject: [gclist] Real-Time GC for high-level languages

Dave Munro wrote (Fri, Mar 02, 2001 at 02:23:44PM +1030) :
> At 3:20 PM +0000 30/1/01, Andrew Cheadle wrote:
> >I seem to remember a theoretical paper:
> >    Guy E. Blelloch, Perry Cheng: On Bounding Time and Space for
> >Multiprocessor Garbage Collection. PLDI 1999: 104-117
> >
> >which makes claims of bounded pause times. I believe, but I'm not sure,
> >that Perry Cheng was looking at implementing the techniques mentioned in
> >the above paper in the TILT ML compiler:
>
> Just a note to say that William Brodie-Tyrrell, one of my students,
> implemented a version of the Blelloch and Cheng bounded GC which we
> reported in
>
> Vaughan, Francis A., Brodie-Tyrrell, William F., Falkner, Katrina E.
> and Munro, David S., "Bounded Parallel Garbage
> Collection:Implementation and Adaptation", To appear in Proceedings
> of 7th Australian Parallel and Real Time PART'2000 Sydney.
>
> This can be picked up from
> http://www.cs.adelaide.edu.au/users/jacaranda/publications.html

It seems that Cheng & Blelloch have also implemented their idea,
since they have a paper in the upcoming PLDI-2001 titled
"A Parallel, Real-Time Garbage Collector". See
http://www.cs.pitt.edu/~soffa/pldi01/pldi_program.html

Manoj

From virtualcyber@erols.com Thu, 1 Mar 2001 23:20:52 -0500
Date: Thu, 1 Mar 2001 23:20:52 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Garbage collection and XML

Hi,

> [Ken Anderson wrote]
> File sizes:
>
> MBytes of   What
> 3.76        XML
> 0.27        zipped
> 40.2        DOM (IBM)

This is bad -- I need to break up with Xerces (IBM DOM), and real
soon.

P.S. By the way, this is a bit off the topic, but it seems, from
Richard O'Keefe's email and yours, that implementing an XML parser is
not that difficult -- am I right? (or perhaps you two are just super
coders ...)

From danwang@CS.Princeton.EDU 01 Mar 2001 23:03:58 -0500
Date: 01 Mar 2001 23:03:58 -0500
From: Daniel Wang danwang@CS.Princeton.EDU
Subject: [gclist] Canonical citation for "memory pools"

Does anyone have a canonical citation for "memory pools/arenas"? i.e.
The memory management scheme where you allocate several objects in one
big chunk of space and deallocate all the objects in one go. I need it
for a related work section of a paper. Ideally, someone can claim
credit for being the first to publish this idea. Larger works that
include this kind of scheme as a part of a whole are just as good.

TIA.

From virtualcyber@erols.com Thu, 1 Mar 2001 23:45:05 -0500
Date: Thu, 1 Mar 2001 23:45:05 -0500
From: Ji-Yong D. Chung virtualcyber@erols.com
Subject: [gclist] Garbage collection and XML

Hi, you wrote

> The program puts all of these things in a hash table, so all strings are
> unique.
> When I wrote this I was concerned with the effectiveness of the
> hash table, and whether it would be a good idea to do my own mallocking
> out of blocks instead of allocating each string separately. (Yes it was.)

This is a strong argument in favor of hashtables indeed -- but how do
you determine the dimension of the hashtable? My guess here would be
that you chose the size based on the XML file size. I'd think that a
static hashtable with a predetermined size would not work, because
file sizes can really vary.

> 1,496,996 bytes for tree structure
>    75,944 bytes for strings
> -------------------
> 1,572,940 bytes total storage.
> BUT
> 2,634,021 bytes for the XML sources.

This is what I call efficient use of memory! Actually, this makes
sense, because XML contains many built-in inefficiencies (such as the
matching tags). The preceding figures you provide give a pretty good
idea of what to shoot for in designing a good parser.

One last detail -- from looking at your previous email messages, I'd
guess that you used reference counting, right? Also, from what Ken
Anderson said, I'd guess that one could obtain similar results
(provided the code is of similar quality) from a parser that uses a
garbage collector.

Take Care,
Ji-Yong D. Chung

From emery@cs.utexas.edu Thu, 1 Mar 2001 23:43:43 -0600
Date: Thu, 1 Mar 2001 23:43:43 -0600
From: Emery Berger emery@cs.utexas.edu
Subject: [gclist] Canonical citation for "memory pools"

> Does anyone have a canonical citation for "memory pools/arenas"? i.e. The
> memory management scheme where you allocate several objects in one big
> chunk of space and deallocate all the objects in one go. I need it for a
> related work section of a paper. Ideally, someone can claim credit for
> being the first to publish this idea. Larger works that include this kind
> of scheme as a part of a whole are just as good.

Gay & Aiken's PLDI 1998 paper "Memory Management with Explicit
Regions" cites a number of authors for (regions|zones|groups|arenas).
http://www.acm.org/pubs/articles/proceedings/pldi/277650/p313-gay/p313-gay.pdf

The earliest citation is for "zones" [D. T. Ross. The AED free storage
package. Communications of the ACM, 10(8):481-492, August 1967].

See also Paul Wilson's DSA survey -- page 48 describes the Ross paper
and mentions (with a citation) that others had earlier used similar
schemes.
ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps

Regards,
-- Emery

--
Emery Berger
emery@cs.utexas.edu
http://www.cs.utexas.edu/users/emery

From jerrold.leichter@smarts.com Fri, 2 Mar 2001 10:36:35 -0500 (EST)
Date: Fri, 2 Mar 2001 10:36:35 -0500 (EST)
From: Jerrold Leichter jerrold.leichter@smarts.com
Subject: [gclist] Garbage collection and XML

| > The program puts all of these things in a hash table, so all strings are
| > unique. When I wrote this I was concerned with the effectiveness of the
| > hash table, and whether it would be a good idea to do my own mallocking
| > out of blocks instead of allocating each string separately. (Yes it was.)
|
| This is a strong argument in favor of hashtables indeed -- but how do
| you determine the dimension of the hashtable? My guess here would be that
| you chose the size based on the XML file size. I'd think that a static
| hashtable with a predetermined size would not work, because file sizes
| can really vary....

Dynamically adjustable hash tables have been around for 25 years. They
never seem to have made it into the standard textbooks, and most
programmers are unaware of them. To this day, textbooks continue to
list fixed size as a minus for hash tables. (The contents of textbooks
gets frozen in place: The textbooks define the course requirements,
which in turn define the constraints on newer textbooks for the same
courses.
Compare a data structures textbook published today to one published in
1980 and what you'll find is that the new one uses object-oriented
concepts and C++ or Java, while the older one uses Pascal or C - but
the actual algorithms presented are pretty much the same, and generally
all those algorithms were familiar by the mid-70's. Tying this back to
GC: You can see the same phenomenon in compiler texts, or any other
texts that might be expected to discuss garbage collection techniques:
They say little - and much of what they *do* say is long obsolete -
because they've *always* said little.)

I've implemented and used with very good results the "linear hashing"
algorithm, described by Witold Litwin in Proc. 6th International Conf.
on Very Large Databases, 1980. A more easily accessible reference is
Per-Åke Larson in CACM 31 (1988), pg 446--457. Linear hashing gives
you expected constant time access using a table whose size is
(expected) bounded as a constant multiple of the number of elements in
the table. It grows and shrinks dynamically in small increments - the
(expected) cost per growth/shrinkage again a constant. All these
constants are linearly related to a user-settable parameter which is
basically the target load factor for the table.

I can't distribute my code because it's part of a product. However, to
give you an idea of what's involved, the basic implementation is about
500 lines of extensively commented C++ code. This is for a templated
class with your usual insert, lookup, and delete operations, plus an
embedded class that implements a "cursor" abstraction for building
iterators, and even a function to compute and print in a nice format a
number of statistics about instances of the class. Since it doesn't
come up in my application, I never implemented the code to shrink a
hash table. (Then again, the function to grow the table is all of
about 35 commented lines; the code to shrink would be about the same.)
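
The incremental growth described above can be illustrated with a short
C++ sketch. It is not the product code mentioned in this message, which
was not posted: LinearHashSet, split_one and the other names are
hypothetical, std::hash stands in for whatever hash function a real
table would use, and only insert and lookup are shown. Buckets are plain
vectors, and the table grows by splitting a single bucket whenever the
load factor exceeds a settable target.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal linear-hashing set of strings (illustrative sketch only).
class LinearHashSet {
public:
    explicit LinearHashSet(double max_load = 2.0)
        : max_load_(max_load), buckets_(round_size_) {
        assert(max_load_ > 0.0);
    }

    bool contains(const std::string& key) const {
        for (const std::string& k : buckets_[index_of(key)])
            if (k == key) return true;
        return false;
    }

    // Returns true if the key was newly inserted.
    bool insert(const std::string& key) {
        if (contains(key)) return false;
        buckets_[index_of(key)].push_back(key);
        ++size_;
        // Grow by one bucket at a time, never by rehashing everything.
        if (static_cast<double>(size_) > max_load_ * buckets_.size())
            split_one();
        return true;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }

private:
    // Buckets below next_split_ have already been split in this round,
    // so their keys are addressed with the doubled modulus.
    std::size_t index_of(const std::string& key) const {
        std::size_t h = std::hash<std::string>()(key);
        std::size_t i = h % round_size_;
        if (i < next_split_) i = h % (2 * round_size_);
        return i;
    }

    // Split bucket next_split_ into itself and one new bucket appended
    // at index round_size_ + next_split_.
    void split_one() {
        std::vector<std::string> old;
        old.swap(buckets_[next_split_]);
        buckets_.emplace_back();
        for (const std::string& k : old) {
            std::size_t i = std::hash<std::string>()(k) % (2 * round_size_);
            buckets_[i].push_back(k);
        }
        if (++next_split_ == round_size_) {  // every bucket split: new round
            round_size_ *= 2;
            next_split_ = 0;
        }
    }

    std::size_t round_size_ = 4;  // bucket count at the start of this round
    std::size_t next_split_ = 0;  // next bucket due to be split
    std::size_t size_ = 0;        // number of stored keys
    double max_load_;             // target keys per bucket
    std::vector<std::vector<std::string>> buckets_;
};
```

Each insert moves at most the keys of one bucket, so growth cost stays
small and uniform. One caveat of this sketch: the bucket directory is a
std::vector, which still reallocates occasionally as it grows; a fuller
implementation would probably keep the directory in fixed-size segments
to avoid even that.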
The individual hash buckets use a class somewhat like an STL deque,
but simpler and much more space-efficient. The externally-visible
classes that use this as a base are container classes analogous to,
but rather different from, the STL.

-- Jerry

From johnl@iecc.com Fri, 2 Mar 2001 11:21:46 -0500 (EST)
Date: Fri, 2 Mar 2001 11:21:46 -0500 (EST)
From: John R Levine johnl@iecc.com
Subject: [gclist] Canonical citation for "memory pools"

> Does anyone have a canonical citation for "memory pools/arenas"? i.e. The
> memory management scheme where you allocate several objects in one big
> chunk of space and deallocate all the objects in one go.

PL/I has AREAs, storage pools in which you can allocate variables, and
use either POINTERs or OFFSETs from the base of the area to reference
them. The IBM documentation implies that the main reason to have them
is so you can write an area out to disk or tape, read it back later,
and have the offsets still valid, but it makes it quite clear that
when you free the area, all of its contents are freed as well.

I believe this was in the original version of PL/I which was written
in about 1963. It's such an obvious and useful technique that I'd be
surprised if it wasn't used in the 1950s.

Regards,
John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Information Superhighwayman wanna-be, http://iecc.com/johnl, Sewer Commissioner
Finger for PGP key, f'print = 3A 5B D0 3F D9 A0 6A A4 2D AC 1E 9E A6 36 A3 47

From hans_boehm@hp.com Fri, 2 Mar 2001 09:23:15 -0800
Date: Fri, 2 Mar 2001 09:23:15 -0800
From: Boehm, Hans hans_boehm@hp.com
Subject: [gclist] Garbage collection and XML

Dynamically resizable hash tables exist in other places, too.

They're discussed in "The Design and Analysis of Computer Algorithms",
Aho, Hopcroft, and Ullman, 1974.

They're used in SGI's STL implementation. (See
http://www.sgi.com/tech/stl/stl_hashtable.h for the implementation,
which is unfortunately uglified to be namespace-correct.)

They're also used to keep track of things like finalizable objects in
our collector. (See
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/boehm-gc/finalize.c?rev=1.5&content-type=text/x-cvsweb-markup
for the gory details.)

The above implementations use chained hash buckets, but I think that's
pretty much an orthogonal issue, unless you want clever implementations
to bring down the constant in the running time of the resize operation.

Hans

From eliot@parcplace.com Fri, 2 Mar 2001 11:22:28 -0800
Date: Fri, 2 Mar 2001 11:22:28 -0800
From: eliot@parcplace.com eliot@parcplace.com
Subject: [2] [gclist] Garbage collection and XML

-covering message-
+-----------------------------
| Date: Fri, 02 Mar 2001 10:36:35 -0500 (EST)
| From: Jerrold Leichter
| Subject: Re: [gclist] Garbage collection and XML

| Dynamically adjustable hash tables have been around for 25 years. They never
| seem to have made it into the standard textbooks, and most programmers are
| unaware of them. To this day, textbooks continue to list fixed size as a
| minus for hash tables. (The contents of textbooks gets frozen in place: The
| textbooks define the course requirements, which in turn define the constraints
| on newer textbooks for the same courses. Compare a data structures textbook
| published today to one published in 1980 and what you'll find is that the new
| one uses object-oriented concepts and C++ or Java, while the older one uses
| Pascal or C - but the actual algorithms presented are pretty much the same,
| and generally all those algorithms were familiar by the mid-70's. Tying this
| back to GC: You can see the same phenomenon in compiler texts, or any other
| texts that might be expected to discuss garbage collection techniques: They
| say little - and much of what they *do* say is long obsolete - because they've
| *always* said little.)

| I've implemented and used with very good results the "linear hashing"
| algorithm, described by Witold Litwin in Proc. 6th International Conf.
| on Very Large Databases, 1980. A more easily accessible reference is
| Per-Åke Larson in CACM 31 (1988), pg 446--457. Linear hashing gives you
| expected constant time access using a table whose size is (expected)
| bounded as a constant multiple of the number of elements in the table.
| It grows and shrinks dynamically in small increments - the (expected)
| cost per growth/shrinkage again a constant. All these constants are
| linearly related to a user-settable parameter which is basically the
| target load factor for the table.

You can find an implementation of dynamic hash tables in Smalltalk,
which has had them from the mid 70's. There are a number of free
Smalltalk implementations, and they come with all source code. See
e.g. www.squeak.org or www.cincom.com/smalltalk/.

---
Eliot Miranda ,,,^..^,,, mailto:eliot@parcplace.com
ParcPlace division, Cincom
Smalltalk: scene not herd
Tel +1 408 216 4581
3350 Scott Boulevard, Building 36, Santa Clara, CA 95054 USA
Fax +1 408 216 4500

From fw@deneb.enyo.de 02 Mar 2001 22:58:24 +0100
Date: 02 Mar 2001 22:58:24 +0100
From: Florian Weimer fw@deneb.enyo.de
Subject: [gclist] Garbage collection and XML

"Boehm, Hans" writes:

> Dynamically resizable hash tables exist in other places, too.
>
> They're discussed in "The Design and Analysis of Computer Algorithms", Aho,
> Hopcroft, and Ullman, 1974.
>
> They're used in SGI's STL implementation. (See
> http://www.sgi.com/tech/stl/stl_hashtable.h for the implementation, which is
> unfortunately uglified to be namespace-correct.)

That seems to be a rather brute force approach: when the hash table is
resized, the entries are newly distributed to a different number of
hash buckets.

From jerrold.leichter@smarts.com Fri, 2 Mar 2001 18:15:04 -0500 (EST)
Date: Fri, 2 Mar 2001 18:15:04 -0500 (EST)
From: Jerrold Leichter jerrold.leichter@smarts.com
Subject: [gclist] Garbage collection and XML

| Dynamically resizable hash tables exist in other places, too.
|
| They're discussed in "The Design and Analysis of Computer Algorithms", Aho,
| Hopcroft, and Ullman, 1974.
|
| They're used in SGI's STL implementation. (See
| http://www.sgi.com/tech/stl/stl_hashtable.h for the implementation, which is
| unfortunately uglified to be namespace-correct.)
|
| They're also used to keep track of things like finalizable objects in our
| collector. (See
| http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/boehm-gc/finalize.c?rev=1.5&content-type=text/x-cvsweb-markup
| for the gory details.)

The design in AHU - also used in your GC - is the basic "when the table gets too big, make a table of double the size and copy everything". (I've looked at the STL implementation, but I can't find the appropriate code. However, it looks as if it uses another classical variant: There's a table of primes, so I'm guessing that the hash value is taken modulo some prime, and when the total number of elements exceeds some threshold, we move to the next prime, make that many buckets, and rehash.)

While this approach works in some appropriate senses, it has many problems. Most immediately noticeable is the cost of doubling when the size gets large. Growing something like a vector by copying everything is relatively cheap. Here, you have to rehash everything. That can be quite expensive. Linked-list buckets significantly increase the space overhead for tables of small objects. Splitting a very large hash table will require you to walk through all the linked lists - which will have poor memory locality.

The neat thing about linear hashing - and similar algorithms - is that they can grow in small increments. (Linear hashing works by splitting one bucket at a time. When all buckets have been split, you double the number of buckets, split one, and leave the rest empty for later.) This gives much more uniform behavior.

-- Jerry

From virtualcyber@erols.com Fri, 2 Mar 2001 20:20:24 -0500
Date: Fri, 2 Mar 2001 20:20:24 -0500
From: Ji-Yong D.
Chung virtualcyber@erols.com
Subject: [gclist] Synchronization of finalization tables

Hi,

I was thinking about two ways to improve the use of dynamic hashtables for storing and accessing finalization methods. I just wanted to hear what others thought of them.

If a garbage collector uses dynamic hashtables for finalization, it *might* suffer from lock contention if it is run in multi-threaded mode, with lots of objects that require finalization. For static tables, one can lock just a handful of buckets, so that other buckets are free to be accessed. For dynamic tables this is not possible, because, without the locks, the buckets may be re-malloced during the reads.

Here are my "improvements" for reducing lock contention in finalization tables.

(1) If the collector uses object class *type* as key to finalization methods, finalizing each object need not remove its finalization method from the table. This means finalization method lookup is basically a read operation (no deletes unless the host program is terminating and the table is being destroyed). In such cases, one can reduce the lock contention by using shared semaphores that distinguish read-locks and write-locks (I am thinking of databases here). Reading the table with one thread will not prevent other threads from reading the table. Bad idea?

(2) If one uses the "type" of object (not the object reference itself) to hash the finalization method, one can register all finalization methods at the beginning of the host application. Generally, one would not need to perform locked inserts and deletes, because at runtime one just needs to read-access the methods. Look ma! No locks!

Take Care
Ji-Yong D. Chung

From ok@atlas.otago.ac.nz Mon, 5 Mar 2001 12:05:12 +1300 (NZDT)
Date: Mon, 5 Mar 2001 12:05:12 +1300 (NZDT)
From: Richard A.
O'Keefe ok@atlas.otago.ac.nz Subject: [gclist] Garbage collection and XML I wrote (concerning strings in XML): > The program puts all of these things in a hash table, > so all strings are unique. "Ji-Yong D. Chung" asked: This is a strong argument in favor of hashtables indeed -- but how do you determine the dimension of the hashtable? Why should I do any such thing? A hash table is as big as it is. I wouldn't *dream* of "determining the dimension of" a hash table. Dynamic hashing is by now a fairly old technique; the code I use is based on Per-Ake Larson's April 1988 CACM paper. (Actually, one ejp@ausmelb.oz wrote the code, and I rewrote it.) Perl and TCL both have dynamically resizing hash table code that you can rip out and use in other programs. I haven't tried the TCL code, but the Perl code is about as fast as the Larson/ejp code. One last detail -- from looking at your previous email messages, I'd guess that you used reference counting, right? Also, from what Ken Anderson said, I'd guess that one could obtain similar results (provided the code is of the similar quality) from a parser that uses garbage collector. In fact my code was written for a group of applications that load a document into memory, walk over it a couple of times in order to extract and format information, and then exit. My "garbage collector" was thus the C exit() function. When you have stuff in a hash table like this, you have to be careful to ensure that the hash table structure itself does not deceive the garbage collector into believing strings/nodes are still live. From ok@atlas.otago.ac.nz Mon, 5 Mar 2001 18:11:17 +1300 (NZDT) Date: Mon, 5 Mar 2001 18:11:17 +1300 (NZDT) From: Richard A. O'Keefe ok@atlas.otago.ac.nz Subject: [gclist] Garbage collection and XML Ken Anderson provided these numbers about the DOM: Here's a summary of loading the CIM_Schema23.xml file from http://www.dmtf.org/spec/cim_schema_v23.html into a Java DOM. 
This XML file describes a class hierarchy of 732 classes of the DMTF/CIM standard.

Some Statistics:

File sizes:
MBytes  of What
3.76    XML
0.27    zipped
40.2    DOM (IBM)

Sharing strings can save a significant amount, and some XML parsers will do that. Vectors are given a default size of 10, while 74% have size 1, and 12% have size 3. Trimming Vectors to the right size saves 5.6MB. I wrote a SAX parser that read the same file but produced an s-expression with structure sharing. It required only 4.0MB, slightly more than the ASCII XML size.

I have added more code to my program to report where the space is going. Measurements were made on four moderately large files:
- my collection of CS examination papers
- SigmodRecord.xml; an example I picked up off the net somewhere and "cleaned", reducing it to 72% of its original size.
- the collected plays of Shakespeare (XML version). I must say that the markup here is not very idiomatic; ID/IDREF(S) attributes could have been used to excellent effect but weren't.
- DocBook; the SGML source of the O'Reilly DocBook book (it's on the CD-ROM that comes with the book).

The space required *with* sharing is measured. It does NOT include space for the hash tables themselves, but does include hash links. The space required *without* sharing is partly measured and partly calculated (the main calculation is subtracting off 4 bytes for a hash link from each node). The DOM estimate assumes that strings are arrays of 16-bit characters (required) + 4 byte length field, padded to multiple of 4 bytes; other nodes are all 10 4-byte words (the smallest I could get that would do everything DOM2). The file size is as reported by ls -l or wc (which agree).
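The DOM estimate described above reduces to a little arithmetic. A minimal sketch in Java, assuming 4-byte words as stated in the post; the class and method names are hypothetical and are not taken from the program that produced the figures:

```java
// Hypothetical helper; names are illustrative, not from the original program.
final class DomSizeEstimate {
    // Every non-string DOM node is assumed to occupy ten 4-byte words.
    static final long NODE_BYTES = 10 * 4;

    // A string is a 4-byte length field plus 2 bytes per 16-bit character,
    // padded up to a multiple of 4 bytes.
    static long stringBytes(long chars) {
        long raw = 4 + 2 * chars;
        return (raw + 3) / 4 * 4;
    }

    // Space for a given number of non-string nodes.
    static long nodeBytes(long nodes) {
        return nodes * NODE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("5-character string: " + stringBytes(5) + " bytes");
        System.out.println("3 nodes: " + nodeBytes(3) + " bytes");
    }
}
```

Under these assumptions, the EXAMS attribute estimate of 1,043,120 bytes is consistent with 26,078 attribute nodes at 40 bytes each.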
EXAMS: a collection of CS examination papers (my DTD)

                        strings +     attrs +  elements =      total
Without sharing:      3,755,464 +   312,936 + 1,288,364 =  5,356,764
With sharing:            65,216 +     1,328 +   751,248 =    817,792
DOM estimate:         5,778,556 + 1,043,120 + 2,460,680 =  9,282,356
Source exams.xml:                                       =  2,634,021
DOM/shared = 11.3

SIGMOD: the SigMod Record catalogue

                        strings +     attrs +  elements =      total
Without sharing:        648,824 +    82,548 +   221,100 =    952,472
With sharing:           131,784 +    16,752 +   217,748 =    366,284
DOM estimate:         1,189,308 +   275,160 +   332,680 =  1,797,148
Source SigmodRecord.xml:                                =    360,329
DOM/shared = 4.9

PLAYS: collected plays of Shakespeare (as found on net)

                        strings +     attrs +  elements =      total
Without sharing:     10,677,460 +         0 + 4,305,996 = 14,983,456
With sharing:         4,901,656 +         0 + 3,751,320 =  8,652,976
DOM estimate:        19,214,762 +         0 + 7,187,600 = 26,402,362
Source plays.xml:                                       =  7,648,502
DOM/shared = 3.1

% DOCBOOK: source of DocBook book, parsed by nsgmls

                        strings +     attrs +  elements =      total
Without sharing:     30,266,564 +   497,196 + 1,458,096 = 32,221,856
With sharing:           641,168 +    40,272 + 1,108,608 =  1,790,048
DOM estimate:        60,857,232 + 1,657,320 + 2,665,160 = 65,179,712
Source docbook.sgm:                                     =  2,896,326
DOM/shared = 36.4

One thing that makes the DOM look bad is that it specifies that strings are arrays of 16-bit UTF-16-encoded Unicode characters. It happens that all of these files use ASCII, or possibly just a few Latin-1 or Unicode symbols, so that using UTF-8 would shrink the string space by a factor of 2. In fact it is possible to devise a Unicode encoding which is guaranteed to take no more than 3 bytes for any character in planes 0,1,2,14 and to take 1 byte for any ISO Latin 1 character, an encoding which is if anything easier than UTF-8, as long as you don't require special treatment for all NUL bytes (which Java, Javascript, and my C code do not.)
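The factor of two for ASCII text can be checked directly. The following is only an illustrative sketch using the standard Java charset API; the class name is hypothetical, and it measures encoded payload bytes only, not any per-object overhead:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares encoded sizes of the same text.
final class EncodingSize {
    // Number of bytes needed to store s as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store s as UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String latin1 = "\u00e9"; // e with acute accent, a Latin-1 character
        System.out.println(utf8Bytes(ascii) + " vs " + utf16Bytes(ascii));
        System.out.println(utf8Bytes(latin1) + " vs " + utf16Bytes(latin1));
    }
}
```

For pure ASCII the UTF-8 form is half the size; for non-ASCII Latin-1 characters UTF-8 takes two bytes each, so the saving disappears for those characters.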
If you poke around in the statistics a bit, you find that long strings are unique anyway, but short strings have very many references indeed, so that shared strings really pay off. Note that with the worst ratio here, I could go through three successive versions of a document without bothering to garbage collection, and still take a wee bit less space than the DOM would require for one copy, so that for several reasonable filtering/rearranging applications, *not* garbage collecting is a perfectly workable strategy. I think the first reference to hash consing was in a paper of Ershov's in the early 60's; the method was used in a compiler. This is not a new idea (except to the people who invented the DOM, seemingly.) From virtualcyber@erols.com Mon, 5 Mar 2001 03:55:27 -0500 Date: Mon, 5 Mar 2001 03:55:27 -0500 From: Ji-Yong D. Chung virtualcyber@erols.com Subject: [gclist] Garbage collection and XML Hi, > [ your results were ...] > > EXAMS: a collection of CS examination papers (my DTD) > > strings + attrs + elements = total > Without sharing: 3,755,464 + 312,936 + 1,288,364 = 5,356,764 > With sharing: 65,216 + 1,328 + 751,248 = 817,792 > DOM estimate: 5,778,556 + 1,043,120 + 2,460,680 = 9,282,356 > Source exams.xml: = 2,634,021 > DOM/shared = 11.3 > > SIGMOD: the SigMod Record catalogue > > strings + attrs + elements = total > Without sharing: 648,824 + 82,548 + 221,100 = 952,472 > With sharing: 131,784 + 16,752 + 217,748 = 366,284 > DOM estimate: 1,189,308 + 275,160 + 332,680 = 1,797,148 > Source SigmodRecord.xml: = 360,329 > DOM/shared = 4.9 > > PLAYS: collected plays of Shakespeare (as found on net) > > strings + attrs + elements = total > Without sharing: 10,677,460 + 0 + 4,305,996 = 14,983,456 > With sharing: 4,901,656 + 0 + 3,751,320 = 8,652,976 > DOM estimate: 19,214,762 + 0 + 7,187,600 = 26,402,362 > Source plays.xml: = 7,648,502 > DOM/shared = 3.1 > > % DOCBOOK: source of DocBook book, parsed by nsgmls > > strings + attrs + elements = total > Without 
sharing: 30,266,564 + 497,196 + 1,458,096 = 32,221,856
> With sharing: 641,168 + 40,272 + 1,108,608 = 1,790,048
> DOM estimate: 60,857,232 + 1,657,320 + 2,665,160 = 65,179,712
> Source docbook.sgm: = 2,896,326
> DOM/shared = 36.4
>

Really interesting results. What you have shown here, though, does not seem to be that DOM API is terribly bad for memory savings. Rather, it seems to show that DOM implementations should use shared strings/attributes.

My observations are as follows:

(1) If you look at the first result, and assume that you were using shared strings for DOM -- it would save nearly 1 meg. In fact, if Java DOM used shared strings, it would be far more memory efficient than the non-shared case. This holds for the rest of the examples as well.

(2) I noticed that DOM's memory use for attributes is bad. But it always uses about 3 times as much as the unshared model. Again, this seems to show that it is sharing which has more effect than the fact one is following the DOM API spec.

These observations lead to the following conclusion: While DOM API is not designed for writing memory efficient apps, the biggest problems in Java DOM stem not from the API spec, but the underlying implementation. Your parser saves memory because it is just well implemented. The main theme here is: use shared objects as much as possible.

You might ask me, how do you know if Java does not use shared strings? I think your numbers clearly indicate this -- DOM's string memory consumption is always approximately 2 x the amount for non-shared case. That they are proportionate shows DOM is probably using a similar algorithm to the unshared case. A similar observation holds for attributes.

From ok@atlas.otago.ac.nz Tue, 6 Mar 2001 10:48:19 +1300 (NZDT)
Date: Tue, 6 Mar 2001 10:48:19 +1300 (NZDT)
From: Richard A. O'Keefe ok@atlas.otago.ac.nz
Subject: [gclist] Garbage collection and XML

Really interesting results.
What you have shown here, though, does not seem to be that DOM API is terribly bad for memory savings. Rather, it seems to show that DOM implementations should use shared strings/attributes. If you follow the letter of the DOM specification (the CORBA IDL, not the Java and Javascript bindings) that is not *allowed*. I think you may have overlooked one point I made, which is that the DOM flatly and unconditionally *requires* that strings be sequences of 16-bit characters. That's a factor of two overhead in space, even if a DOM implementation *were* to use shared strings. I think you may also have overlooked that for some kinds of documents, there is a substantial saving to be made from sharing attribute triples (Name,Type,Value), which the DOM forbids. Not "the DOM does not require" or "the DOM does not discuss", but "the DOM *forbids*". You cannot have shared attribute nodes and still call your interface "DOM". (1) If you look at the first result, and assume that you were using shared strings for DOM -- it would save nearly 1 meg. In fact, if Java DOM used shared strings, it would be far more memory efficient than the non-shared case.. This holds for the rest of the examples as well. In fact the space overheads for Java are considerably worse than that. A Java object typically has 2 words of overhead. A Java String object has 4 words of local data. One of those words is a pointer to an array of Unicode characters. (The idea is if you chop a string into substrings, the space cost per substring is constant.) An array of n Unicode characters will be (3 + ceiling(n/2)) words long. The total is thus 36+4*ceiling(n/2) bytes for a string, where I assumed 4+4*ceiling(n/2) bytes. To implement unique strings in Java would add to the space overhead per string. Having read the ECMAscript standard and Netscape's Javascript Reference manual, it is clear that the space overheads for Javascript strings must be comparable. 
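The per-string arithmetic above can be written out as follows. This is a sketch of the figures quoted in the message (4-byte words, 2 words of object overhead, 4 words of String fields, 3 words of array overhead), not a measurement of any particular JVM; the class and method names are made up for illustration:

```java
// Hypothetical calculator for the per-string costs quoted in the message.
final class JavaStringCost {
    static final long WORD = 4; // bytes per word, as assumed in the post

    // ceiling(n/2): words needed to hold n 16-bit characters.
    static long halfCeil(long n) {
        return (n + 1) / 2;
    }

    // 2 words of object overhead + 4 words of String fields
    // + an array of (3 + ceiling(n/2)) words.
    static long javaStringBytes(long n) {
        long stringObjectWords = 2 + 4;
        long arrayWords = 3 + halfCeil(n);
        return WORD * (stringObjectWords + arrayWords);
    }

    // The figure assumed in the DOM estimate: 4 + 4*ceiling(n/2) bytes.
    static long assumedBytes(long n) {
        return 4 + 4 * halfCeil(n);
    }

    public static void main(String[] args) {
        for (long n : new long[] {0, 5, 20}) {
            System.out.println(n + " chars: " + javaStringBytes(n)
                    + " bytes vs assumed " + assumedBytes(n));
        }
    }
}
```

With these figures the Java representation costs a constant 32 bytes more per string than the estimate assumed, which matters most for the short, frequently repeated strings.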
Note that in Java, string operations are guaranteed to return new objects, so it is possible for a Java program to *tell* whether strings are shared or not. So it really isn't clear whether/how much sharing is allowed.

(2) I noticed that DOM's memory use for attributes is bad. But it uses always about 3 times as much as the unshared model. Again, this seems to show that it is sharing which has more effect than the fact one is following the DOM API spec.

Yes, but the DOM treats attributes (Name,Value) like any other kind of node. Every attribute node knows which Element node owns it; no kind of sharing is allowed AT ALL. To follow the DOM API spec *is* to refuse to share attribute nodes. I suggest reading the DOM specification.

These observations lead to the following conclusion: While DOM API is not designed for writing memory efficient apps, the biggest problems in Java DOM stem not from the API spec, but the underlying implementation.

Three kinds of item:

1) strings
Requiring that strings be sequences of 16-bit characters rather than UTF-8 or TR-8 encoded guarantees AT LEAST A FACTOR OF TWO compared with a more compact representation, even if strings are shared.

2) attributes
AT LEAST A FACTOR OF TWO space cost would be incurred by all the cross-links you need to store, even if the DOM allowed sharing, which it doesn't.

3) elements
AT LEAST A FACTOR OF TWO space cost would be incurred by all the cross-links you need to store, even if the DOM allowed sharing, which it doesn't.

So AT LEAST A FACTOR OF TWO space cost is due simply to the requirements of the DOM, however clever the implementor might be. You *can't* match the results I quoted and still have something even close to the DOM.

You might ask me, how do you know if Java does not use shared strings?

I don't think you noticed the word "estimate". I was careful to say that the "DOM" numbers, unlike the others, were NOT measured, but estimated.
(As it happens, I have a DOM in C which I wrote to be sure I understood it, but once I had, the measured time overheads convinced me not to use it.) In fact I did not account for all the overheads in Java. The figures for a Java implementation of the DOM would be considerably worse. From Bill.Foote@eng.sun.com Tue Mar 6 13:56:44 2001 From: Bill.Foote@eng.sun.com (Bill Foote) Date: Tue, 06 Mar 2001 14:56:44 +0100 Subject: [gclist] Garbage collection and XML References: <200103052148.KAA31951@atlas.otago.ac.nz> Message-ID: <3AA4EC9C.16D1260E@eng.sun.com> "Richard A. O'Keefe" wrote: > > Really interesting results. What you have shown here, though, > does not seem to be that DOM API is terribly bad for memory savings. > Rather, it seems to show that DOM implementations should use shared > strings/attributes. > > If you follow the letter of the DOM specification (the CORBA IDL, not the > Java and Javascript bindings) that is not *allowed*. > > I think you may have overlooked one point I made, which is that the > DOM flatly and unconditionally *requires* that strings be sequences of > 16-bit characters. That's a factor of two overhead in space, even if > a DOM implementation *were* to use shared strings. How on Earth did they manage to word a normative requirement that does that? Surely, if I store strings UTF-8 encoded in an immutable string type, there's no way for an application to tell I'm sharing strings behind the scenes. I have trouble imagining wording that could place a testable normative requirement like this on an API. I'm genuinely curious; what wording in the DOM spec says this? Confused, Bill -- Bill Foote bill.foote @ sun.com Java TV Standards Engineer http://java.sun.com/products/javatv From ok@atlas.otago.ac.nz Tue Mar 6 22:46:03 2001 From: ok@atlas.otago.ac.nz (Richard A. 
O'Keefe) Date: Wed, 7 Mar 2001 11:46:03 +1300 (NZDT) Subject: [gclist] Garbage collection and XML Message-ID: <200103062246.LAA04709@atlas.otago.ac.nz> I wrote: >If you follow the letter of the DOM specification (the CORBA IDL, not the >Java and Javascript bindings) that is not *allowed*. From: David Chase asked Pardon my potential ignorance here, but who would care if there were sharing, especially if: 1 - the binding was done to a GC'd language, where last-owner-of-a string is less of an issue. The two bindings in the DOM specifications are to Java and Javascript, where strings are immutable. It's really difficult to figure out *what* the DOM specifies, because - the primary specification is in CORBA IDL, in which every time you ask for a string the remote system sends you back a new copy - the object chosen to represent strings in the CORBA IDL for the DOM is a *mutable* array of 16-bit characters - the object chosen to represent strings in the Java and Javascript bindings is an *immutable* String of 16-bit characters. I don't know about Javascript, but in Java it is perfectly possible to have two String objects with the same (immutable!) state which must act the same for all future time, but have distinct identities. A Java program which tried to keep track of which nodes strings came from by using String identities as keys could be confused if strings were shared. I came to hate the DOM when I tried to implement it in Smalltalk. Since Smalltalk strings are *mutable*, it was important to know whether I was allowed to return the string object already inside a text node, or whether I had to copy it. I couldn't figure out *what* to do, and amongst other things discovered the contradiction above, that according to the IDL you get a new mutable array whenever you ask about a string in the model, but according to the Java and Javascript bindings you get an immutable object. 
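The point about String identities can be made concrete with a small sketch. It assumes a client that uses java.util.IdentityHashMap to remember which node a string came from; that client, the class name, and the node labels are hypothetical, chosen only to illustrate the concern:

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical client code that keys on String identity rather than equality.
final class StringIdentityDemo {
    // Records an owner for each string, keyed by identity, and
    // returns how many distinct keys the map ends up holding.
    static int distinctKeys(String a, String b) {
        Map<String, String> owner = new IdentityHashMap<>();
        owner.put(a, "node 1");
        owner.put(b, "node 2");
        return owner.size();
    }

    public static void main(String[] args) {
        String a = new String("Ho");
        String b = new String("Ho");
        // Equal state, distinct identities: the map keeps two entries.
        System.out.println(distinctKeys(a, b));
        // If an implementation shared the strings, the entries would collapse.
        System.out.println(distinctKeys(a.intern(), b.intern()));
    }
}
```

A program written this way would behave differently depending on whether the implementation shares strings, which is why the specification's silence on the matter is awkward.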
An explicit statement about sharing in the DOM specification would help a LOT, as would explicit advice about what to do in languages like Eiffel, Lisp, and Smalltalk, where strings are mutable. However, it *is* absolutely clear that no sharing of non-string objects is allowed at all. The figures I have show that you save a useful amount of space by sharing attribute=value bindings. 2 - the resulting implementation were much smaller/faster. It is clear that there is a substantial space saving from sharing strings. It's not just "structural" strings like element names and attribute names either. There are a lot of repetitions of "content" strings like attribute values and #PCDATA nodes. Since the DOM absolutely requires the use of UTF-16-encoded strings (UTF-8 is *NOT ALLOWED*, still less anything more compact than that), there is still at least a factor of two compared with what you can get in C or Smalltalk. Why does the DOM require UTF-16? Because it's *really* an attempt to pretty up something the browser vendors bodged together to be accessed from Javascript, which has UTF-16-encoded strings these days. Is this possibly just some sort of pin-headed overspecification that may safely be ignored, or do people actually write programs (in particular, Java programs) that depend on the lack of sharing? As noted above, I have to admit that the specification is actually inconsistent on this point. However, I also note that there is nothing in the Javascript reference material I recently downloaded from Netscape or the ECMA 262 standard for ECMAscript that would make sharing particularly easy to implement in Javascript. I thought there _was_ a UniqueString class in Java, but when I looked for it, I couldn't find one. Perhaps someone can correct me about that. It is far easier to implement the DOM *without* string sharing in Java and Javascript. 
And as noted before, any other kind of sharing is *explicitly* forbidden, and could not be provided without comprehensively wrecking the entire design. From hans_boehm@hp.com Tue Mar 6 23:18:51 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Tue, 6 Mar 2001 15:18:51 -0800 Subject: [gclist] Garbage collection and XML Message-ID: <140D21516EC2D3119EE7009027876644049B5C74@hplex1.hpl.hp.com> > I thought there _was_ a > UniqueString class in Java, but when I looked for it, I > couldn't find one. Isn't java.lang.String.intern() what you want? Hans From ok@atlas.otago.ac.nz Wed Mar 7 00:59:41 2001 From: ok@atlas.otago.ac.nz (Richard A. O'Keefe) Date: Wed, 7 Mar 2001 13:59:41 +1300 (NZDT) Subject: [gclist] Garbage collection and XML Message-ID: <200103070059.NAA06646@atlas.otago.ac.nz> I wrote: I thought there _was_ a UniqueString class in Java, but when I looked for it, I couldn't find one. Hans Boehm replied: Isn't java.lang.String.intern() what you want? It is indeed the thing that I had vaguely remembered. What I *wanted* was a UniqueString class, with a less space-hungry representation than Java's String class. A Java String requires 1 or 2 words of per-object overhead + 4 instance variables (one word each) + 2 or 3 words of per-array overhead + storage for the string proper. The overheads are pretty substantial. From gcolvin@us.oracle.com Wed Mar 7 01:04:03 2001 From: gcolvin@us.oracle.com (Greg Colvin) Date: Tue, 6 Mar 2001 18:04:03 -0700 Subject: [gclist] Garbage collection and XML References: <200103070059.NAA06646@atlas.otago.ac.nz> Message-ID: <011e01c0a6a2$81f1e3c0$37781990@us.oracle.com> From: Richard A. O'Keefe > I wrote: > I thought there _was_ a UniqueString class in Java, but when I > looked for it, I couldn't find one. > > Hans Boehm replied: > Isn't java.lang.String.intern() what you want? > > It is indeed the thing that I had vaguely remembered. 
> > What I *wanted* was a UniqueString class, with a less space-hungry > representation than Java's String class. A Java String requires > 1 or 2 words of per-object overhead > + > 4 instance variables (one word each) > + > 2 or 3 words of per-array overhead > + > storage for the string proper. > > The overheads are pretty substantial. If you care about overheads then you shouldn't be using java ;-> From bos@serpentine.com Wed Mar 7 01:07:57 2001 From: bos@serpentine.com (Bryan O'Sullivan) Date: Tue, 6 Mar 2001 17:07:57 -0800 (PST) Subject: [gclist] Garbage collection and XML In-Reply-To: <200103070059.NAA06646@atlas.otago.ac.nz> References: <200103070059.NAA06646@atlas.otago.ac.nz> Message-ID: <15013.35309.439591.995984@pelerin.serpentine.com> r> What I *wanted* was a UniqueString class, with a less space-hungry r> representation than Java's String class. Since java.lang.String can't be subclassed and Java's notion of type equivalence is based on name, not structure, I fear that a UniqueString would be something of an annoyance to use in practice. If only Cardelli and company had glommed C-like syntax over Modula-3's semantics, we might inhabit a slightly happier world. Since the DOM presumably defines an interface (esp. as you're talking about a CORBA IDL) I don't see what the requirement for what strings (mutable/immutable/16-bit/8-bit/whatever) look like on the outside have to do with how the implementation stores nodes. What's to keep an implementation from doing whatever sharing and compression it wishes and just satisfying the semantics whenever a caller traverses the DOM and executes getters or setters for string valued attributes? -- Dave
From ok@atlas.otago.ac.nz Wed Mar 7 02:32:05 2001 From: ok@atlas.otago.ac.nz (Richard A. O'Keefe) Date: Wed, 7 Mar 2001 15:32:05 +1300 (NZDT) Subject: [gclist] Garbage collection and XML Message-ID: <200103070232.PAA04727@atlas.otago.ac.nz> From: "David Bakin" Since the DOM presumably defines an interface (esp. as you're talking about a CORBA IDL) I don't see what the requirement for what strings (mutable/immutable/16-bit/8-bit/whatever) look like on the outside have to do with how the implementation stores nodes. *Strings* and *Nodes* are indeed different things, with different sharing payoffs and possibilities. But it is precisely the job of an interface to state the properties of the objects it is an interface *to*. The CORBA IDL in the DOM specifications (which I keep on-line and check when I make claims about the DOM) state that when you ask a document model for a string, what you get back is a sequence of characters. With respect to encoding, the DOM explicitly says "APPLICATIONS must encode DOMString using UTF-16." Never mind the DOM: for all alphabetic and some syllabic scripts you can do a *lot* better space-wise than using UTF-16, which is what you get in Java or Javascript. Thing is, if you *don't* implement things pretty close to the "natural" image of the DOM, you are going to do unbelievable amounts of copying.
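To illustrate the copying concern, here is a minimal sketch of a text node that stores its content compactly and converts to a UTF-16 Java String on demand. The class is hypothetical and is not part of any DOM implementation; it only shows that every read through such an interface allocates and copies:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical compact text holder; not taken from any real DOM implementation.
final class CompactText {
    private final byte[] utf8; // compact internal representation

    CompactText(String s) {
        this.utf8 = s.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes actually stored for the content.
    int storedBytes() {
        return utf8.length;
    }

    // Each call decodes the bytes into a brand new UTF-16 String,
    // so the saving in space is paid for in time and garbage.
    String getData() {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        CompactText t = new CompactText("hello");
        System.out.println(t.storedBytes() + " bytes stored");
        System.out.println(t.getData() == t.getData()); // distinct copies
    }
}
```

An implementation could cache the decoded String, but then it holds both representations at once and the space advantage shrinks accordingly.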
I discussed sharing at four levels: - "structural" strings HUGE payoff for sharing - "content" strings SERIOUS payoff for sharing - **XML** attribute=value nodes SERIOUS payoff for sharing - element nodes MODERATE payoff for sharing If anyone will take the trouble to read the DOM, they will discover that sharing the last two of those are unambiguously ruled out, in the sense that if you have Ho Ho Ho then you *must* have three distinct Ho Element nodes with three distinct identities and distinct relationships to other nodes. If a DOM implementation implements the DocumentType at all well, it is likely that it will share "structural" strings. A DOM implementation cannot share attribute=value or Element nodes without some very fancy and rather expensive behind-the-scenes calculation, which would more than offset the space savings. The one open area is sharing "content" strings (values of attributes and PCData nodes). The DOM actually envisages that these will be too big for your programming language so that you have to get the information out in pieces (CharacterData::substringData()). A DOM implementor who hasn't done the measurements might think (mistakenly) that it didn't pay to share content strings. Does anyone actually *know* what actual DOM implementations do? The reported high space cost of DOM use (I have heard 8 to 10 times as many bytes in core as XML on disc) is most simply explained on the hypothesis that popular DOM implementations *don't* share content strings. What's to keep an implementation from doing whatever sharing and compression it wishes and just satisfying the semantics whenever a caller traverses the DOM and executes getters or setters for string valued attributes? A four-letter word: T I M E. Amongst other things, the Document Object Model is an *Object* model and is totally oriented to manipulating documents by mutating data structures. 
When you select a collection of nodes, a NodeList is stated to be "live"; mutations to the original document show up in the list. Until you've actually read the DOM2, you wouldn't believe how complex things can get when someone makes a serious attempt to define iteration over mutating data structures. This is the garbage collection list, not the DOM list, so I'd like to close with a number of general storage management observations. 1. Garbage collection or no garbage collection, garbage *avoidance* in the design is a good thing if you can do it without compromising other goals. 2. Hash consing is astonishingly effective, but only with data you aren't mutating a lot, and which is not required to have its own unique identity. 3. There is no substitute for measurements on real data. 4. Seemingly unimportant design decisions can have huge effects on memory requirements. 5. There is a tradeoff between memory and time, but it doesn't pay to assume you could afford both ends of the tradeoff. 6. Sturgeon's Law applies to standards, even W3C recommendations. From chase@world.std.com Wed Mar 7 04:57:06 2001 From: chase@world.std.com (David Chase) Date: Tue, 06 Mar 2001 23:57:06 -0500 Subject: [gclist] Garbage collection and XML In-Reply-To: <15013.35309.439591.995984@pelerin.serpentine.com> References: <200103070059.NAA06646@atlas.otago.ac.nz> <200103070059.NAA06646@atlas.otago.ac.nz> Message-ID: <4.3.2.7.0.20010306215829.02038008@pop.std.com> At 05:07 PM 3/6/2001 -0800, Bryan O'Sullivan wrote: actually, Richard O'Keefe wrote: >r> What I *wanted* was a UniqueString class, with a less space-hungry >r> representation than Java's String class. The best you're likely to get out of most Java implementations for any type is 2 words of header, plus one or two for data, depending on how they deal with possible alignment of doubles and longs. Java strings are also not necessarily quite as costly as you make them out to be. 
The basic object is header + array pointer + offset + count (5 or 6 words, depending on padding) but it is entirely possible to share the array portion of equal strings. You could, for instance, say new String(s.intern()) to ensure that you get a string that is mostly shared, yet not equal to any other string. That's perhaps only 5 words per string after the first one is created, versus maybe 3 words per whatever object you might come up with for your unique-string type. However, you could also play the game of indexing your entities, and indexing instances of entities. That is, map objects to integers. That way, you can store any object in 2 words, one identifying the value, the other identifying the instance ID, with full sharing under the covers (where "under the covers" is in a sort of a hash table, where the "value" associated with each object stored in the table is the integer for the object. When the instance id wraps, you grab another slot for the same object in the table.) This is probably too horrible to contemplate for most people, given that you've got untyped integers instead of typed objects, and no garbage collection at all under the covers. A loon might even push it to the bit level, and reserve 8 bits for the instance ID, and 24 bits for the value index. (If Fortran is outlawed, only outlaws will use Fortran.) >Since java.lang.String can't be subclassed and Java's notion of type >equivalence is based on name, not structure, I fear that a >UniqueString would be something of an annoyance to use in practice. final public class U { public static String u(String s) { return new String(s.intern()); } } .... U.u("a string") ... Less typing than those clunky old Modula-3 keywords :-). >If only Cardelli and company had glommed C-like syntax over Modula-3's >semantics, we might inhabit a slightly happier world. I'm not sure, but I think some of us in the "peanut gallery" raised the issue at the time. 
I may have my old email still from back then; maybe someday I'll see what I can find. Java's got one other thing that Modula-3 didn't, which is an answer to the multiple inheritance question. The problem, for the M-3 definers, was that most of the people who wanted MI were unable or unwilling to explain what it was that they wanted in any sort of a sound semantic framework (William Cook was an exception, I think) and the attitude of the M-3 people toward inheritance in general would probably have come up with something different. Not clear if better or worse, but different. Unsurprisingly, Java's type system is at its flakiest where it deals with multiple inheritance, but to an engineering approximation nobody cares, and nobody I know has figured out how to turn the theoretical glitch into a security hole. David Chase From bos@serpentine.com Wed Mar 7 05:19:37 2001 From: bos@serpentine.com (Bryan O'Sullivan) Date: Tue, 6 Mar 2001 21:19:37 -0800 (PST) Subject: [gclist] Garbage collection and XML In-Reply-To: <4.3.2.7.0.20010306215829.02038008@pop.std.com> References: <200103070059.NAA06646@atlas.otago.ac.nz> <4.3.2.7.0.20010306215829.02038008@pop.std.com> Message-ID: <15013.50409.380653.862521@pelerin.serpentine.com> [Claims that this thread is relevant to garbage collection are starting to feel a little weak. Oh well.] d> However, you could also play the game of indexing your entities, d> and indexing instances of entities. That is, map objects to d> integers. [...] This is probably too horrible to contemplate for d> most people, given that you've got untyped integers instead of d> typed objects, and no garbage collection at all under the covers. I actually went off and did this for an indexing and searching app fairly recently. 
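The integer-indexing scheme quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from either poster's application: the names StringIndex, idOf, valueOf, pack, instanceOf and valueIndexOf are invented here, and the 8-bit/24-bit split is the one David Chase mentions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: maps each distinct string to a small integer id,
// so that every occurrence can be stored as a plain int instead of an object.
final class StringIndex {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    /** Returns the id for s, assigning the next free id the first time s is seen. */
    int idOf(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            id = values.size();
            values.add(s);
            ids.put(s, id);
        }
        return id;
    }

    /** Converts an id back into its string, e.g. at an API boundary. */
    String valueOf(int id) {
        return values.get(id);
    }

    /** Number of distinct strings stored so far. */
    int size() {
        return values.size();
    }

    /** Packs an 8-bit instance id and a 24-bit value index into one int. */
    static int pack(int instance, int valueIndex) {
        return (instance << 24) | (valueIndex & 0xFFFFFF);
    }

    static int instanceOf(int packed) {
        return packed >>> 24;
    }

    static int valueIndexOf(int packed) {
        return packed & 0xFFFFFF;
    }
}

final class StringIndexDemo {
    public static void main(String[] args) {
        StringIndex index = new StringIndex();
        // Three occurrences stored as unboxed ints; equal strings share one id.
        int[] doc = { index.idOf("Ho"), index.idOf("Ho"), index.idOf("Hum") };
        System.out.println(doc[0] == doc[1]);       // true
        System.out.println(index.valueOf(doc[2]));  // Hum
        System.out.println(index.size());           // 2
    }
}
```

Note that the table itself still boxes its ids inside a java.util.HashMap; only the per-occurrence storage (the int[] here) is unboxed. Avoiding the boxed map as well would mean writing a specialised hash table by hand.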
Provided your API doesn't reveal the integer-ness of the underlying representation to its users, and can overcome the cost of converting back and forth at method entry and exit points, it is possible to surprise people with the kind of performance and memory overhead you can sustain with this kind of Java application. The usual fearsome memory requirements lose some teeth as integers aren't as heavily-boxed as objects in Java. Granted, you now have huge tables of interned strings sitting around that won't shrink or go away until you drop all references to the entire tables, but for this kind of application, it's easy to sidle around the problem with references to "necessary time/space tradeoffs". What is less fun is writing and maintaining the code behind the pristine-looking API. Unless you want to rebox all the integers you had carefully unboxed earlier so you can use the java.util.Map interface, you're condemned to CS201 rebuild-silly-data-structures hell. Makes one yearn for parametric classes and interfaces, à la GJ. of "Wed, 07 Mar 2001 11:46:03 +1300." <200103062246.LAA04709@atlas.otago.ac.nz> Message-ID: <200103070549.f275nW627336@piglet.dstc.edu.au> Richard, I believe that some of your assertions about CORBA IDL are incorrect. You wrote: > The two bindings in the DOM specifications are to Java and Javascript, > where strings are immutable. It's really difficult to figure out *what* > the DOM specifies, because > - the primary specification is in CORBA IDL, in which every time you > ask for a string the remote system sends you back a new copy If you treat IDL as an abstract interface specification language ... as DOM does ... it doesn't say how data type values are passed across an interface. Such details only need to be considered when the IDL is mapped to some target language(s) in some implementation context. Even when IDL is used to describe a client / server interaction; e.g.
using a conventional ORB and standard language mappings, data types (like strings) are not always passed by copying. If the client and object are colocated, the ORB may pass values by reference ... even in the case of C++ where C++ types that represent the IDL data types are mutable. [One of the rules that you must obey to write portable CORBA C++ code is that a 'server' must not change values that have been passed as 'in' args.] In the case of Java, it hardly matters which way 'string' values are passed in the colocated case unless your code compares java.lang.String values using '==' ... which is dodgy at the best of times! > - the object chosen to represent strings in the CORBA IDL for the DOM > is a *mutable* array of 16-bit characters CORBA IDL doesn't define strings (or any other data type) as mutable or immutable. Such issues belong in the CORBA language mappings, and different mappings make different decisions. Furthermore, if you hand-map the IDL to native APIs, it is entirely up to you how you address the issue. > - the object chosen to represent strings in the Java and Javascript > bindings is an *immutable* String of 16-bit characters. The standard CORBA IDL -> Java mapping ALSO maps CORBA 'string' to Java's 'java.lang.String'; i.e. immutable arrays of UTF-16 characters. You might argue that DOM should use a non-standard mapping for 'string'; e.g. to some object wrapper for an immutable array of bytes. This would be more space efficient, but it would be a right royal pain to write Java applications that used such a DOM API. -- Steve From virtualcyber@erols.com Wed Mar 7 07:05:05 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Wed, 7 Mar 2001 02:05:05 -0500 Subject: [gclist] CORBA C++ bindings and garbage collection References: <200103070549.f275nW627336@piglet.dstc.edu.au> Message-ID: <002b01c0a6d4$f0d89fe0$0100007f@cradle> Hi, A number of people in this forum seem to be well acquainted with CORBA. 
Having just written a small app using it, I have two questions about C++ CORBA and garbage collection: (1) Does the IDL to C++ language mapping rule out the use of a garbage collector in designing and implementing an ORB? (2) If the mapping does rule it out, then wasn't it a mistake for the original mapping designers not to consider garbage collection? If the language mapping does not rule out the use of a garbage collector, does anyone know of a C++ ORB implementation which does use a garbage collector? When I was programming a bit with Java RMI and C++ CORBA, I kept thinking how simple the memory cleanup task for Java RMI is, compared to the C++ CORBA ORB I was using. I will appreciate any comments, answers and perspectives, unless this topic has been beaten on before -- in which case I will be happy and grateful just to be directed to an existing mail archive dungeon. Take Care, Ji-Yong D. Chung From bos@serpentine.com Wed Mar 7 08:28:09 2001 From: bos@serpentine.com (Bryan O'Sullivan) Date: Wed, 7 Mar 2001 00:28:09 -0800 (PST) Subject: [gclist] CORBA C++ bindings and garbage collection In-Reply-To: <002b01c0a6d4$f0d89fe0$0100007f@cradle> References: <200103070549.f275nW627336@piglet.dstc.edu.au> <002b01c0a6d4$f0d89fe0$0100007f@cradle> Message-ID: <15013.61721.784591.670436@pelerin.serpentine.com> j> Does IDL to C++ language mapping rule out the use of garbage j> collector in designing and implementing an ORB? It's not the language mapping that rules out GC, it's the programming model and the wire protocol. IIOP has no facility for tracking the number of clients talking to a server. In order for a client to be able to talk to an object over the wire, it has to be explicitly exported on the server side. j> If the mapping does rule it out, then wasn't it a mistake for the j> original mapping designers to not to consider garbage collection? I don't think so.
Distributed garbage collection is a nice idea in the abstract, but it comes with far too many problems to be something you really want in a commercial setting. Also, at the time CORBA was being agglutinated, there were no commercially-noticeable languages that supported GC in existence. As to my claims of problems with distributed GC: 1. See figure 1. 2. No matter what algorithm you choose, it's really, really hard to get right. I have never seen a distributed GC (and I've worked with, and near, several) that didn't have serious bugs, no matter how bright the people were who worked on it. Of course, the bugs only show up in situations where you can't even log the problems, much less reproduce them later. See figure 1. 3. Beyond simple reference counting (and its problems with messaging overhead and cyclic structures), any of the DGC algorithms I used to look at back when I stayed abreast of the literature were really, really hard to even understand. None of the professors or grad students I knew around 1993 or so claimed to understand the most widely-respected DGC algorithm of the day (written, as I recall, by John Hughes). See figure 1. 4. Since #2 means that everyone (to a first approximation) uses reference counting with heartbeats or lease renewals and a few other frills, you end up with a lot of cross-chatter in a large network that is nominally idle. If your network gets busy, GC traffic starts to add appreciable overhead. Oh, and now you have to think about debugging GC bugs in a 16-node cluster of live stock market trading servers. See figure 1. Perhaps you're getting the picture. DGC is still a fruitful source of research papers. This should scare you. The last project I worked on that used DGC was BEA's WebLogic Server, a very profitable web application server. When I left BEA, we had been talking seriously for several months about turning off DGC altogether. 
The Enterprise JavaBeans programming model didn't require DGC, even though it was nominally implemented on top of RMI, and EJB has almost entirely displaced RMI as the distributed programming model of choice for large Java apps. The popularity of EJB made it very tempting to kill off all of our DGC infrastructure and its horrible Heisenbugs. Prior to WLS, I worked on Jini (remember that?), where we effectively handwaved away the intractable problems of DGC in large, semi-coherent systems by requiring that clients explicitly maintain leases to server objects. j> When I was programming a bit with Java RMI and C++ CORBA, and I j> kept thinking how simple memory cleanup task for Java RMI is, j> compared to C++ CORBA ORB I was using. There's no doubt that DGC makes programming seem nicer. Right up until it breaks irreproducibly in deployment or doesn't scale beyond a handful of participants, at which point you can take your app out back and shoot it. j> I will appreciate any comments, answers and perspectives, unless j> this topic has been beaten on before -- in which case I will be j> happy and grateful just to be directed to an existing mail archive j> dungeon. Actually, I haven't seen much written about the theory-vs-practice dichotomy in the DGC world. Am I merely scaring the kids, or have others also found it to be a tremendous headache in non-trivial cases? Message-ID: <3AA60006.9F153370@eng.sun.com> "Richard A. O'Keefe" wrote: > I don't know about Javascript, but in Java it is perfectly possible to > have two String objects with the same (immutable!) state which must act > the same for all future time, but have distinct identities. A Java > program which tried to keep track of which nodes strings came from by > using String identities as keys could be confused if strings were shared. Ah, good point. It is theoretically possible to use string identities to discriminate String objects. 
You'd have to be mad to try, and it's not easy, but you can: public class IdentityStringKey { private String value; ... public int hashCode() { return value.hashCode(); } public boolean equals(Object other) { if (other instanceof IdentityStringKey) { return ((IdentityStringKey) other).value == value; } else { return false; } } } So I take back what I said. With enough work, one actually could word a normative, testable requirement of no String object sharing in a Java API. It's hard to do and doesn't make sense, but it can be done. Anyway, this is moot as it sounds like DOM didn't go that far. Cheers, Bill -- Bill Foote bill.foote @ sun.com Java TV Standards Engineer http://java.sun.com/products/javatv From kanderson@bbn.com Wed Mar 7 14:39:57 2001 From: kanderson@bbn.com (Ken Anderson) Date: Wed, 07 Mar 2001 09:39:57 -0500 Subject: [gclist] Garbage collection and XML In-Reply-To: <4.3.2.7.0.20010306215829.02038008@pop.std.com> References: <15013.35309.439591.995984@pelerin.serpentine.com> <200103070059.NAA06646@atlas.otago.ac.nz> <200103070059.NAA06646@atlas.otago.ac.nz> Message-ID: <4.1.20010307093704.00a67100@zima.bbn.com> At 11:57 PM 3/6/2001 , David Chase wrote: >At 05:07 PM 3/6/2001 -0800, Bryan O'Sullivan wrote: > >final public class U { > public static String u(String s) { return new String(s.intern()); } >} > >.... U.u("a string") ... > >Less typing than those clunky old Modula-3 keywords :-). > Unfortunately, the String() constructor copies the underlying char[]. I think this will work the way you intended. final public class U { public static String u(String s) { return s.intern(); }} From hans_boehm@hp.com Wed Mar 7 17:16:33 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Wed, 7 Mar 2001 09:16:33 -0800 Subject: [gclist] Garbage collection and XML Message-ID: <140D21516EC2D3119EE7009027876644049B5C7A@hplex1.hpl.hp.com> > -----Original Message----- > From: Bill Foote [mailto:Bill.Foote@eng.sun.com] > "Richard A.
O'Keefe" wrote: > > I don't know about Javascript, but in Java it is perfectly > possible to > > have two String objects with the same (immutable!) state > which must act > > the same for all future time, but have distinct identities. A Java > > program which tried to keep track of which nodes strings > came from by > > using String identities as keys could be confused if > strings were shared. > > Ah, good point. It is theoretically possible to use string > identities to > discriminate String objects. You'd have to be mad to try, > and it's not > easy, but you can: ... You could presumably also synchronize on the strings, effectively turning them into locks. In that case sharing might result in unexpected lock contention or deadlock. The fact that every object can be used for synchronization means that in some sense nothing is immutable. Hans From hans_boehm@hp.com Wed Mar 7 17:30:50 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Wed, 7 Mar 2001 09:30:50 -0800 Subject: [gclist] Garbage collection and XML Message-ID: <140D21516EC2D3119EE7009027876644049B5C7B@hplex1.hpl.hp.com> > -----Original Message----- > From: David Chase [mailto:chase@world.std.com] > The best you're likely to get out of most Java implementations > for any type is 2 words of header, plus one or two for data, > depending on how they deal with possible alignment of doubles > and longs. > > Java strings are also not necessarily quite as costly > as you make them out to be. The basic object is > header + array pointer + offset + count (5 or 6 words, depending > on padding) but it is entirely possible to share the array > portion of equal strings. ... A lot of this clearly varies greatly with the implementation. 
I believe that gcj (with a patch that hasn't yet made it into the official tree) will in the best case represent a String as a single chunk of memory containing: 1 word object header (vtable pointer only, objects are not moved, synchronization is handled with a separate table) 1 word pointer to array (in the best case points to the string object itself) 1 "int" byte offset to start of string. 1 "int" length Sequence of 16 bit characters Thus strings up to 4 characters are 4 words on a 64 bit machine, and 6 on a 32 bit machine. (Object sizes are even numbers of words for alignment reasons.) Disclaimer: I didn't write the String implementation. This is based on my reading of the code. Hans From fjh@cs.mu.oz.au Wed Mar 7 17:35:34 2001 From: fjh@cs.mu.oz.au (Fergus Henderson) Date: Thu, 8 Mar 2001 04:35:34 +1100 Subject: [gclist] Garbage collection and XML In-Reply-To: <140D21516EC2D3119EE7009027876644049B5C7A@hplex1.hpl.hp.com> References: <140D21516EC2D3119EE7009027876644049B5C7A@hplex1.hpl.hp.com> Message-ID: <20010308043534.A13328@hg.cs.mu.oz.au> On 07-Mar-2001, Boehm, Hans wrote: > You could presumably also synchronize on the strings, effectively turning > them into locks. In that case sharing might result in unexpected lock > contention or deadlock. The fact that every object can be used for > synchronization means that in some sense nothing is immutable. How would you synchronize on the strings? The java.lang.String class is declared `final', so you can't inherit from it, and AFAIK there are no synchronized methods in java.lang.String. -- Fergus Henderson | "I have always known that the pursuit | of excellence is a lethal habit" WWW: | -- the last words of T. S. Garp. 
From Bob.Kerns@brightware.com Wed Mar 7 17:42:07 2001 From: Bob.Kerns@brightware.com (Bob Kerns) Date: Wed, 7 Mar 2001 09:42:07 -0800 Subject: [gclist] Garbage collection and XML Message-ID: <4B946AD84FD3D2119696009027463F9D019C0E2F@bwnvfs16.brightware.com> String foo = dosomething(); synchronized (foo) { .... } As he said, every object can be used for synchronization. Synchronized methods effectively wrap synchronized (this) { ... } around the body of the method, but you can synchronize on any object at any time.
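The point that sharing strings also shares their locks can be checked with a few lines of Java. A minimal sketch, assuming nothing beyond the standard library; the class and method names (StringLockDemo, lockingAlsoLocks) are made up for this illustration.

```java
// Hypothetical illustration: shows that two String references which denote the
// same shared object also share one monitor, while distinct copies do not.
final class StringLockDemo {
    /** Returns true if holding the monitor of a means the thread also holds b's. */
    static boolean lockingAlsoLocks(String a, String b) {
        synchronized (a) {
            return Thread.holdsLock(b);
        }
    }

    public static void main(String[] args) {
        String first = new String("node");
        String second = new String("node");

        // Equal contents, distinct objects: two independent locks.
        System.out.println(lockingAlsoLocks(first, second));                   // false

        // After interning, both references denote one shared object, one lock.
        System.out.println(lockingAlsoLocks(first.intern(), second.intern())); // true
    }
}
```

So if an implementation quietly shared the two strings, code that had synchronized on them separately would start contending for a single lock.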
From chase@world.std.com Wed Mar 7 17:48:44 2001 From: chase@world.std.com (David Chase) Date: Wed, 07 Mar 2001 12:48:44 -0500 Subject: [gclist] Garbage collection and XML In-Reply-To: <4.1.20010307093704.00a67100@zima.bbn.com> References: <4.3.2.7.0.20010306215829.02038008@pop.std.com> <15013.35309.439591.995984@pelerin.serpentine.com> <200103070059.NAA06646@atlas.otago.ac.nz> <200103070059.NAA06646@atlas.otago.ac.nz> Message-ID: <4.3.2.7.0.20010307121108.01f4d7a0@pop.std.com> This is not about garbage collection at all, unless we want to grumble about the design decisions underlying some of these things. At 09:39 AM 3/7/2001 -0500, Ken Anderson wrote: >At 11:57 PM 3/6/2001 , David Chase wrote: >>final public class U { >> public static String u(String s) { return new String(s.intern()); } > >Unfortunately, the String() constructor copies the underlying char[]. >I think this will work the way you intended. > >final public class U { > public static String u(String s) { return s.intern(); }} Nope, we're both wrong. Richard O'Keefe was looking for a way to get space-saving, but not-eq, equal strings. You're right about the String constructor, but it turns out that there IS a way: s.intern().substring(0) That will first intern s, to ensure sharing, then create a new String object that shares storage with the interned String, but is not == to it. Regarding Hans's observations about gcj -- if they want to roll their own String class, that's fine, but if they intend to interoperate with native code (JNI code), they'll need to use the same String data structures, field names, and types as Sun uses for their classes. I learned this the hard way. Hans is regrettably correct about the effects of "every-object- is-a-lock". 
Though this has led to some really impressive innovation in lock implementation technology, it mucks up sharing, and anyone who actually wants to make a system that is robust in the face of denial-of-service attacks (there are some cute ones involving locking) has to create their own private Objects for locking anyhow. If you DO want to take advantage of sharing, you can't lock on those objects either, since you can never keep track of who's got what lock when/where, so again there's no use for EOiaL. It's kind of a useless feature, but there it is. David Chase From hans_boehm@hp.com Wed Mar 7 18:17:31 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Wed, 7 Mar 2001 10:17:31 -0800 Subject: [gclist] Garbage collection and XML Message-ID: <140D21516EC2D3119EE7009027876644049B5C7E@hplex1.hpl.hp.com> > From: David Chase [mailto:chase@world.std.com] > > This is not about garbage collection at all, unless we want > to grumble about the design decisions underlying some of these > things. Ditto, though my impression is that there is a lot of interaction between strings and GC performance. Strings may be a big part of the heap, and depending on how you answer some of these questions here, the GC may or may not have to look at them very much. > > Regarding Hans's observations about gcj -- if they want to > roll their own String class, that's fine, but if they intend > to interoperate with native code (JNI code), they'll need > to use the same String data structures, field names, and types > as Sun uses for their classes. I learned this the hard > way. > My impression is that this is only an issue for programs that rely on features not documented in the Java or JNI spec? If so, it seems to me the other answer is "fix the client code", which seems to be an easier thing to say with open source code than with commercial code, though it's never easy. Hans From ok@atlas.otago.ac.nz Wed Mar 7 22:54:47 2001 From: ok@atlas.otago.ac.nz (Richard A. 
O'Keefe) Date: Thu, 8 Mar 2001 11:54:47 +1300 (NZDT) Subject: [gclist] Garbage collection and XML Message-ID: <200103072254.LAA08975@atlas.otago.ac.nz> Bill Foote wrote: So I take back what I said. With enough work, once actually could word a normative, testable requirement of no String object sharing in a Java API. It's hard to do and doesn't makes sense, but it can be done. Anyway, this is moot as it sounds like DOM didn't go that far. The DOM goes rather further than you might suppose. Recall that I divided strings into two kinds: - structural strings (element names and attribute names) - content strings (#PCDATA and attribute values). If you know a bit of SGML or HTML you might think an attribute value is just a string, e.g.

But SGML allows a string to contain macro references, e.g.

No worries, the only macros allowed in HTML are the predefined character names, and in SGML the macros are supposed to expand out and just be strings (roughly speaking). XML is different. XML allows a document to be processed "half way", where macros are not expanded. So

would, in the DOM, be something like Element (name = "p", attributes = Attr (name = "class", first child = EntityReference (name = "ack", next sibling = Text (value = " for support", next sibling = NIL )))) So in the DOM, there are two ways to get at the value of an attribute: Attr.value - a string Attr.childNodes - a sequence of Text and/or EntityReference nodes. (By the way, if an attribute's .childNodes include an EntityReference node, it is by no means clear what the .value should be. I cannot find any clear explanation of this.) "On setting" the .value of an Attr "this creates a Text node". Now the smallest I can get a Text node down to is 6 words. The Document Value Model that I've been talking about is intended only for valid SGML/XML documents, wherein all macros must have been expanded, so that there is no need for the Document Object Model's partially digested attribute values. So *every* content string in the DOM pays at least a 6-word overhead that is not paid in the DVM. While it is arguable that the strings themselves might be shared, it is unarguable that these Text nodes must NOT be shared. It would be possible to use lazy initialisation for the children of an Attr node. What we see here that has general application is 1. The DOM is an API for HTML/XML *editors*, designed to support frequent tiny changes to an incompletely processed document. The DVM is a data structure for SGML/XML *processors*, designed to support efficient storage and traversal of completely parsed and validated documents and efficient creation of transformed documents. 2. A data structure designed for one use need not be expected to be good for other uses. The DOM is horribly clumsy and inefficient for processing documents; the DVM cannot represent incompletely parsed documents at all. 3.
The rather large storage costs of the DOM compared with the DVM (even *with* shared strings) can be traced to its requirements combined with the decision to use mutation as a primary programming tool. The general point therefore is that supporting functionality you don't need (in my case, in-place edits and half-parsed documents) can cost you a lot; garbage avoidance starts with data structures that are no more capable than they need to be. From chase@world.std.com Thu Mar 8 01:23:16 2001 From: chase@world.std.com (David Chase) Date: Wed, 07 Mar 2001 20:23:16 -0500 Subject: [gclist] Garbage collection and XML In-Reply-To: <200103072254.LAA08975@atlas.otago.ac.nz> Message-ID: <4.3.2.7.0.20010307193523.0263f230@pop.std.com> At 11:54 AM 3/8/2001 +1300, Richard A. O'Keefe wrote: >3. The rather large storage costs of the DOM compared with the DVM > (even *with* shared strings) can be traced to its requirements > combined with the decision to use mutation as a primary programming > tool. How does it go if you play the simulated mutation game? For instance, supposing you work with applicative data structures (e.g., splay trees, or red-black trees) so that you never modify anything, instead only reallocating along the spine? Yes, it does generate garbage (so there is some fractional relevance to gc-list :-) but only expected-O(log N) garbage per update operation. One advantage of applicative data structures is that if the assignment of the root is atomic, then you only have to lock for modification. It is more frustrating than amusing to watch people learn lessons already well-understood over a decade ago. As soon as you start working on a big project in a garbage-collected language, any data that "escapes" your little sandbox really has to be regarded as immutable, and (in the case of Java) unlockable. 
In a multi-threaded world, mutability also has the annoying overhead of synchronization (for a competently designed memory allocator, synchronization on a multiprocessor can be 10-20 times as expensive as allocation (*)). Sure, mutable data structures are great for programming in the small, but build something big, and they become a major pain, and given the overheads they're not necessarily any faster. (*) non-recursive synchronization costs two bus locks, or (my machine, with cpu clocked 6x memory bus) 120 cycles. Heap memory allocation is load, add, compare, conditional branch (predicted not taken), store, store, followed by field initialization. It's also possible, again if you are a loon, and not in Java, to create applicative speculatively updated data structures -- there, you reallocate a spine to "modify" (say for a red-black tree) and attempt to compare-and-swap in the new root. If you fail, simply retry the entire operation (you can, optionally, attempt to reuse some of the previous operation if you saved the old spine and compare with the new as you recompute -- as long as threads are modifying in different places, this should cut the cost of a retry). If contention is low, the synchronization costs are only half what you pay in the conventional lock-while-modifying data structure (and if contention isn't low, then your synchronization costs get grim anyway). David Chase From virtualcyber@erols.com Thu Mar 8 04:10:30 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Wed, 7 Mar 2001 23:10:30 -0500 Subject: [gclist] Garbage collection and XML References: <4.3.2.7.0.20010307193523.0263f230@pop.std.com> Message-ID: <00a101c0a785$b7598480$0100007f@cradle> Hi, > [David Chase wrote] > It's also possible, again if you are a loon, and not in > Java, to create applicative speculatively updated data > structures -- there, you reallocate a spine to "modify" > (say for a red-black tree) and attempt to compare-and-swap > in the new root. 
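The speculative update described in the quoted text can be sketched in present-day Java, which gained a compare-and-swap primitive (java.util.concurrent.atomic.AtomicReference) after this thread was written. This is a minimal sketch under stated assumptions, not anyone's actual implementation: it uses a plain unbalanced binary search tree in place of the red-black tree mentioned, it does not attempt the "reuse the old spine on retry" optimisation, and PersistentSet and its methods are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration: an applicative (never mutated) search tree whose
// root is replaced with compare-and-swap instead of being guarded by a lock.
final class PersistentSet {
    /** Immutable node; an update copies only the path from the root to the change. */
    private static final class Node {
        final int key;
        final Node left;
        final Node right;

        Node(int key, Node left, Node right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }
    }

    private final AtomicReference<Node> root = new AtomicReference<>(null);

    /** Returns a tree containing key that shares every untouched subtree with t. */
    private static Node insert(Node t, int key) {
        if (t == null) {
            return new Node(key, null, null);
        }
        if (key < t.key) {
            Node l = insert(t.left, key);
            return l == t.left ? t : new Node(t.key, l, t.right);
        }
        if (key > t.key) {
            Node r = insert(t.right, key);
            return r == t.right ? t : new Node(t.key, t.left, r);
        }
        return t; // key already present: reuse the existing subtree unchanged
    }

    /** Builds a new spine, then tries to swing the root; retries if another thread won. */
    void add(int key) {
        while (true) {
            Node old = root.get();
            Node updated = insert(old, key);
            if (updated == old || root.compareAndSet(old, updated)) {
                return;
            }
        }
    }

    /** Readers take no lock; they walk whatever snapshot the root pointed to. */
    boolean contains(int key) {
        Node t = root.get();
        while (t != null) {
            if (key == t.key) {
                return true;
            }
            t = key < t.key ? t.left : t.right;
        }
        return false;
    }

    int size() {
        return size(root.get());
    }

    private static int size(Node t) {
        return t == null ? 0 : 1 + size(t.left) + size(t.right);
    }

    public static void main(String[] args) throws InterruptedException {
        PersistentSet set = new PersistentSet();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int offset = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 250; i++) {
                    set.add(i * 4 + offset); // each worker adds its own distinct keys
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println(set.size()); // 1000: no update is lost despite the absence of locks
    }
}
```

The garbage produced is the discarded spine of each update (plus any spines thrown away on a failed compare-and-swap), which is the cost the scheme trades against locking.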
This is a side issue, but is it possible to apply locks to subnodes rather than to the root? If one locks a node, its sibling nodes should remain accessible to other threads. Also, is it possible to apply shared locks (read locks, intent-read locks), that is, the locking concepts available from databases? Or are these types of locks too expensive? Just curious.

From virtualcyber@erols.com Thu Mar 8 05:20:27 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Thu, 8 Mar 2001 00:20:27 -0500 Subject: [gclist] CORBA C++ bindings and garbage collection References: <200103070549.f275nW627336@piglet.dstc.edu.au><002b01c0a6d4$f0d89fe0$0100007f@cradle> <15013.61721.784591.670436@pelerin.serpentine.com> Message-ID: <00ad01c0a78f$7d376c40$0100007f@cradle>

Hi

> > [I asked] Does IDL to C++ language mapping rule out the use of garbage
> > collector in designing and implementing an ORB?
>
> [you replied] It's not the language mapping that rules out GC, it's the programming
> model and the wire protocol. IIOP has no facility for tracking the
> number of clients talking to a server. In order for a client be able
> to talk to an object over the wire, it has to be explicitly exported
> on the server side.

Here, what you are saying is that the collector has no easy way of knowing when there are no more live remote references to servants, right?

> j> [I asked] If the mapping does rule it out, then wasn't it a mistake for the
> j> original mapping designers to not to consider garbage collection?
>
> [you replied] I don't think so. Distributed garbage collection is a nice idea in
> the abstract, but it comes with far too many problems to be something
> you really want in a commercial setting. Also, at the time CORBA was
> being agglutinated, there were no commercially-noticeable languages
> that supported GC in existence.

I was not thinking of applying DGC -- rather I was thinking of using GC locally only, and treating (1) references and (2) servants in special ways.
If an object is a servant, we can just save it from being gc'ed and use a specialized thread to perform eviction (or whatever). If an object is a reference to a remote object, then we simply check whether the host of the reference's target object is in the list of reachable hosts (the list is pre-computed prior to GCing, in another thread) and remove or finalize the references accordingly. This approach basically treats all local objects (other than servants) as vanilla garbage collectible (including references to remote objects).

The hardest problem (how do you GC a servant) is obviously not solved by the preceding method. I was not thinking that GC would be a way to fix that. The problem of not knowing when clients have dropped references seems to be inherent to most distributed systems (I am not sure if it is realistic to have a protocol that would keep track of all client connections on a per-servant basis).

I was just hoping, though, that local GCs would be good enough to simplify the semantics of memory allocation/deallocation for C++ CORBA systems. For example, take a simple CORBA string. Even managing this requires one to use CORBA::string_dup() and CORBA::string_free(). If you deallocate a string that an ORB has given you, you can easily get a core dump. If you could GC locally, then you would just receive a "reference" to that object -- no string_dup, no string_free. With local GC, I was wondering, all this local bookkeeping which comes with C++ CORBA might be eliminated. In Michi Henning and Steve Vinoski's book, the authors devote large chapters to explaining the client- and server-side C++ mappings for memory management. All this seems just too complicated to me.

> DGC is still a fruitful source of
> research papers. This should scare you.

I am scared of doing DGC -- you bet.

> The last project I worked on that used DGC was BEA's WebLogic Server,
> a very profitable web application server.
When I left BEA, we had
> been talking seriously for several months about turning off DGC
> altogether. The Enterprise JavaBeans programming model didn't require
> DGC, even though it was nominally implemented on top of RMI, and EJB
> has almost entirely displaced RMI as the distributed programming model
> of choice for large Java apps. The popularity of EJB made it very
> tempting to kill off all of our DGC infrastructure and its horrible
> Heisenbugs.
>
> Prior to WLS, I worked on Jini (remember that?), where we effectively
> handwaved away the intractable problems of DGC in large, semi-coherent
> systems by requiring that clients explicitly maintain leases to server
> objects.

What you say above makes a lot of sense to me (unless I am totally confused).

> There's no doubt that DGC makes programming seem nicer. Right up
> until it breaks irreproducibly in deployment or doesn't scale beyond a
> handful of participants, at which point you can take your app out back
> and shoot it.

Does Java RMI suffer from this problem? While I thought that Java's RMI looked good, I also heard a little voice in my head saying "this is too good to be real." Given that many people have spent much energy over the years tackling distributed computing problems, I wondered whether it was realistic to believe that Java RMI simply made these problems vanish. (I suppose I could have looked at the Java source code ... but that is a huge source and I was scared off by its size.)

Ji-Yong D.
Chung From bos@serpentine.com Thu Mar 8 05:56:24 2001 From: bos@serpentine.com (Bryan O'Sullivan) Date: Wed, 7 Mar 2001 21:56:24 -0800 (PST) Subject: [gclist] CORBA C++ bindings and garbage collection In-Reply-To: <00ad01c0a78f$7d376c40$0100007f@cradle> References: <200103070549.f275nW627336@piglet.dstc.edu.au> <002b01c0a6d4$f0d89fe0$0100007f@cradle> <15013.61721.784591.670436@pelerin.serpentine.com> <00ad01c0a78f$7d376c40$0100007f@cradle> Message-ID: <15015.7944.42569.334661@pelerin.serpentine.com> j> Here, what you are saying is that the collector has no easy way of j> knowing when there are no more live remote references to servants, j> right? Yes. If one client passes an IOR (CORBA-speak for a reference to a server-side object) to another, but the second doesn't actually open a connection to the server, then the server has no way of knowing that the second has a reference to it. j> I was just hoping though, that local GCs would be good enough to j> simplify the semantics of memory allocation/deallocation for C++ j> CORBA systems. For client-side code, you could simply try linking in something like Boehm-Weiser and seeing if it worked. I would be surprised if it didn't work for purely client-side memory management. b> There's no doubt that DGC makes programming seem nicer. Right up b> until it breaks irreproducibly in deployment or doesn't scale b> beyond a handful of participants, at which point you can take your b> app out back and shoot it. j> Does Java RMI suffer from this problem? Yes. Jini (which I mentioned in my earlier article) uses leasing of activatable objects to sidestep the problems of DGC in large RMI systems. j> Given that many people have spent much energy over years to tackle j> distributed computing problems, I wondered whether it was realistic j> to believe that Java RMI simply made these problems vanish. Many aspects of RMI are fantastically useful, including DGC. 
You just have to assume that DGC will cause your scalability curve to assume unexpected and catastrophic shapes after a point, and that said point will always occur earlier in the curve than you'd like. If your app never reaches that point, then all is peachy.

> In a multi-threaded world, mutability also has
> the annoying overhead of synchronization (for a competently
> designed memory allocator, synchronization on a multiprocessor
> can be 10-20 times as expensive as allocation (*)). Sure,
> mutable data structures are great for programming in the
> small, but build something big, and they become a major
> pain, and given the overheads they're not necessarily any
> faster.

I agree with the general conclusion, but you need to be careful about the costs. Reallocating objects along a path in a large applicative data structure tends to involve allocating and dropping relatively long-lived objects. I suspect the garbage collection costs will outweigh the allocation costs by a fair margin, no matter what garbage collector you use. (This might not hold if you can keep the heap very sparsely occupied. But even then the allocation cost will be dominated by the cache miss on the first write to the object.) But sharing can still be a huge advantage of the applicative data structure.

>
> (*) non-recursive synchronization costs two bus locks,
> or (my machine, with cpu clocked 6x memory bus) 120 cycles.
> Heap memory allocation is load, add, compare, conditional
> branch (predicted not taken), store, store, followed by
> field initialization.

Is that an X86 machine? I just timed a Pentium III/500/100 machine at something near 25 cycles per "lock; cmpxchgl". I'm interested because I've sometimes heard the claim that X86 is particularly bad at this, but that hasn't really been consistent with my experience. Is this chipset dependent, perhaps?
Hans From chase@world.std.com Fri Mar 9 00:43:10 2001 From: chase@world.std.com (David Chase) Date: Thu, 08 Mar 2001 19:43:10 -0500 Subject: [gclist] Garbage collection and XML In-Reply-To: <140D21516EC2D3119EE7009027876644049B5C8B@hplex1.hpl.hp.com > Message-ID: <4.3.2.7.0.20010308184655.027e7ea0@pop.std.com> At 02:31 PM 3/8/2001 -0800, Boehm, Hans wrote: >> (*) non-recursive synchronization costs two bus locks, >> or (my machine, with cpu clocked 6x memory bus) 120 cycles. >> Heap memory allocation is load, add, compare, conditional >> branch (predicted not taken), store, store, followed by >> field initialization. >Is that an X86 machine? An x86 machine. I was given to understand that the cost was 10 memory bus cycles per lock-cmpxchgl, and I thought that was what I measured on my old 200Mhz PPro. Not sure if it was clocking at 2x or 3x bus speed. New machine is (two) 800Mhz P-II, 133Mhz bus. It's also somewhat chip(set) dependent; I've been told that the Xeon chips lock only a subset of memory (a "line", for some definition of the word) versus the entire bus. >I just timed a Pentium III/500/100 machine at >something near 25 cycles per "lock; cmpxchgl". >I'm interested because I've sometimes heard the claim >that X86 is particularly bad at this, but that >hasn't really been consistent with my experience. Maybe all the other chips are bad, too, but 25 cycles seems pretty horrible to me. We made some benchmarks to measure two versions of a tight little loop (one always doing the cmpxchgl, the other always branching around it) and it seemed like a pair of locked instructions cost two or three times as much as the entire rest of the loop, which was not completely empty. You're probably right about the cost of abandoning that long-lived spine. What wins depends on the relative mix of probes versus updates, since the applicative data structure let you avoid locking on probes. 
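[A sketch of the speculative-update scheme described earlier in the thread (build the new version privately, compare-and-swap the root, retry on failure), written with C11 atomics and shown on a persistent list rather than a red-black tree to keep it short. The names cell, root, push and contains are hypothetical. Cells are never freed, on the assumption that a collector reclaims unreachable versions.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* List cell; once published through root it is never modified. */
typedef struct cell {
    int value;
    const struct cell *next;
} cell;

/* The only shared mutable location: the root of the current version. */
static _Atomic(const cell *) root = NULL;

/* Writer: build the new version privately, then try to swing the root.
   If the compare-and-swap fails, another writer got in first; "seen" is
   refreshed with the current root and the update is recomputed. */
static void push(int value) {
    cell *fresh = malloc(sizeof *fresh);
    if (fresh == NULL) abort();
    fresh->value = value;
    const cell *seen = atomic_load(&root);
    do {
        fresh->next = seen;
    } while (!atomic_compare_exchange_weak(&root, &seen, fresh));
}

/* Probe: a single atomic load yields a consistent snapshot, so the
   traversal itself takes no lock. */
static bool contains(int value) {
    for (const cell *c = atomic_load(&root); c != NULL; c = c->next)
        if (c->value == value) return true;
    return false;
}
```

[For a tree, the body of the retry loop would be the spine reallocation; the cost of a failed attempt is the discarded spine plus the recomputation.]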
Don't forget, in some systems, if you are modifying data structures in place, you are probably doing some card-marking and creating old-to-young pointers. Those carry their own extra costs. (Avoiding the card-marking on newly allocated memory is a cute trick, if you can manage to do it.) David Chase From emery@cs.utexas.edu Fri Mar 9 02:51:10 2001 From: emery@cs.utexas.edu (Emery Berger) Date: Thu, 8 Mar 2001 20:51:10 -0600 Subject: [gclist] Garbage collection and XML In-Reply-To: <140D21516EC2D3119EE7009027876644049B5C8B@hplex1.hpl.hp.com> Message-ID: > Is that an X86 machine? I just timed a Pentium III/500/100 machine at > something near 25 cycles per > "lock; cmpxchgl". I'm interested because I've sometimes heard the claim > that X86 is particularly bad at this, but that hasn't really been > consistent > with my experience. Is this chipset dependent, perhaps? Timing just the "lock; cmpxchgl" doesn't give you the whole picture. The problem is that the Pentium flushes the pipeline when it encounters a locked instruction. The performance penalty is pretty spectacular. I'm told the P4 has a 24-stage pipeline, so locked instructions will become effectively even more expensive. Regards, -- Emery -- Emery Berger emery@cs.utexas.edu http://www.cs.utexas.edu/users/emery From hans_boehm@hp.com Fri Mar 9 18:07:08 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Fri, 9 Mar 2001 10:07:08 -0800 Subject: [gclist] synchronization cost (was: Garbage collection and XM L) Message-ID: <140D21516EC2D3119EE7009027876644049B5C92@hplex1.hpl.hp.com> Does anyone know if this is documented somewhere? My experience with using "lock; cmpxchgl" to atomically set a mark bit in a bit vector was that it didn't seem to have that much of an impact. But the only measurement I made was a comparison to using mark bytes instead, which appeared to be slower on a Pentium III, presumably as a result of the larger data structure and hence added cache misses. 
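[For comparison, a sketch of the two mark-table layouts mentioned above, written with C11 atomics instead of inline "lock; cmpxchgl". The table size and all names are hypothetical, and this is not the collector's actual code. The bit vector needs an atomic read-modify-write because neighbouring objects share a word; the byte table only needs a plain store but is CHAR_BIT times larger.]

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define N_OBJECTS 1024
#define WORD_BITS (sizeof(unsigned) * CHAR_BIT)

/* Layout 1: one mark bit per object, packed into shared words. */
static atomic_uint mark_bits[(N_OBJECTS + WORD_BITS - 1) / WORD_BITS];

/* Layout 2: one mark byte per object. */
static atomic_uchar mark_bytes[N_OBJECTS];

/* Atomically set the mark bit of object i; returns true if it was already
   set. On x86 a compiler will typically emit a locked instruction here. */
static bool set_mark_bit(size_t i) {
    unsigned mask = 1u << (i % WORD_BITS);
    unsigned old = atomic_fetch_or(&mark_bits[i / WORD_BITS], mask);
    return (old & mask) != 0;
}

/* Set the mark byte of object i; returns true if it was already set.
   Relaxed load and store: every marker writes the same value, so no
   read-modify-write is required, only a larger table. */
static bool set_mark_byte(size_t i) {
    bool was = atomic_load_explicit(&mark_bytes[i], memory_order_relaxed) != 0;
    atomic_store_explicit(&mark_bytes[i], 1, memory_order_relaxed);
    return was;
}
```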
Since these are out-of-order machines, the other question is whether subsequent instructions that don't depend on later memory references will continue to execute during the wait. If so, this might explain some of the diffferences in measurements. Hans > -----Original Message----- > From: Emery Berger [mailto:emery@cs.utexas.edu] > Sent: Thursday, March 08, 2001 6:51 PM > To: Boehm, Hans; 'David Chase'; gclist@iecc.com > Cc: icis-developers@bbn.com > Subject: RE: [gclist] Garbage collection and XML > > > > Is that an X86 machine? I just timed a Pentium III/500/100 > machine at > > something near 25 cycles per > > "lock; cmpxchgl". I'm interested because I've sometimes > heard the claim > > that X86 is particularly bad at this, but that hasn't really been > > consistent > > with my experience. Is this chipset dependent, perhaps? > > Timing just the "lock; cmpxchgl" doesn't give you the whole > picture. The > problem is that the Pentium flushes the pipeline when it > encounters a locked > instruction. The performance penalty is pretty spectacular. > I'm told the P4 > has a 24-stage pipeline, so locked instructions will become > effectively even > more expensive. > > Regards, > -- Emery > > -- > Emery Berger > emery@cs.utexas.edu > http://www.cs.utexas.edu/users/emery > From emery@cs.utexas.edu Fri Mar 9 20:09:41 2001 From: emery@cs.utexas.edu (Emery Berger) Date: Fri, 9 Mar 2001 14:09:41 -0600 Subject: [gclist] synchronization cost (was: Garbage collection and XML) In-Reply-To: <140D21516EC2D3119EE7009027876644049B5C92@hplex1.hpl.hp.com> Message-ID: > -----Original Message----- > From: Boehm, Hans [mailto:hans_boehm@hp.com] > Sent: Friday, March 09, 2001 12:07 PM > To: 'Emery Berger'; Boehm, Hans; 'David Chase'; gclist@iecc.com > Cc: icis-developers@bbn.com > Subject: RE: [gclist] synchronization cost (was: Garbage collection and > XML) > > > Does anyone know if this is documented somewhere? 
> http://developer.intel.com/design/pentium4/manuals/24547203.pdf See Chapter 7.1. "For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 processor, with one exception: load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized. " -- Emery > Since these are out-of-order machines, the other question is whether > subsequent instructions that don't depend on later memory references will > continue to execute during the wait. If so, this might explain > some of the > diffferences in measurements. > > Hans > > > -----Original Message----- > > From: Emery Berger [mailto:emery@cs.utexas.edu] > > Sent: Thursday, March 08, 2001 6:51 PM > > To: Boehm, Hans; 'David Chase'; gclist@iecc.com > > Cc: icis-developers@bbn.com > > Subject: RE: [gclist] Garbage collection and XML > > > > > > > Is that an X86 machine? I just timed a Pentium III/500/100 > > machine at > > > something near 25 cycles per > > > "lock; cmpxchgl". I'm interested because I've sometimes > > heard the claim > > > that X86 is particularly bad at this, but that hasn't really been > > > consistent > > > with my experience. Is this chipset dependent, perhaps? > > > > Timing just the "lock; cmpxchgl" doesn't give you the whole > > picture. The > > problem is that the Pentium flushes the pipeline when it > > encounters a locked > > instruction. The performance penalty is pretty spectacular. > > I'm told the P4 > > has a 24-stage pipeline, so locked instructions will become > > effectively even > > more expensive. 
> > > > Regards, > > -- Emery > > > > -- > > Emery Berger > > emery@cs.utexas.edu > > http://www.cs.utexas.edu/users/emery > > > From hans_boehm@hp.com Fri Mar 9 22:31:23 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Fri, 9 Mar 2001 14:31:23 -0800 Subject: [gclist] synchronization cost (was: Garbage collection and XM L) Message-ID: <140D21516EC2D3119EE7009027876644049B5C96@hplex1.hpl.hp.com> > -----Original Message----- > From: Emery Berger [mailto:emery@cs.utexas.edu] > > http://developer.intel.com/design/pentium4/manuals/24547203.pdf > > See Chapter 7.1. "For the P6 family processors, locked > operations serialize > all outstanding load and store operations (that is, wait for them to > complete). This rule is also true for the Pentium 4 > processor, with one > exception: load operations that reference weakly ordered > memory types (such > as the WC memory type) may not be serialized. " > Thanks for the pointer. This is very interesting. I read the above statement as dealing more with the memory model than the implementation. The processor is normally allowed to move reads to before logically earlier writes, assuming this is locally consistent. It may not do this if the read is part of the atomic operation or a later read. Thus I assume it basically has to wait for any store buffers to drain to the cache before beginning the read. That seems like an unavoidable cost given the way the operation is defined. It doesn't imply to me that the rest of the processor necessarily has to be idle. The following statement is also enlightening: "Because frequently used memory locations are often cached in a processor's L1 or L2 caches, atomic operations can often be carried out inside a processor's caches without asserting the bus lock. Here the processor's cache coherency protocols insure that other processors that are caching the same memory locations are managed properly while atomic operations are performed on cached memory locations." 
Later text is explicit that for P6 and later, the bus is NOT locked for atomic operations if the processor already has exclusive access to the cache line. I believe this is similar to most other recent processors. Hans From virtualcyber@erols.com Sat Mar 10 00:28:24 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Fri, 9 Mar 2001 19:28:24 -0500 Subject: [gclist] synchronization cost (was: Garbage collection and XML) References: Message-ID: <024701c0a8f9$0593cfb0$0100007f@cradle> Hi, > > Is that an X86 machine? I just timed a Pentium III/500/100 machine at > > something near 25 cycles per > > "lock; cmpxchgl". I'm interested because I've sometimes heard the claim > > that X86 is particularly bad at this, but that hasn't really been > > consistent > > with my experience. Is this chipset dependent, perhaps? (1) A few years ago, I had opportunity to do some measurements on CMPXCHG and if I remember correctly, the preceding figure is pretty close to what I got -- I was reading about 30-40 instructions per cmpxchg and more on cmpxchg8b on pentium II, 200 Mhz. (Windows NT4.0). (2) If one is trying to use a faster locking mechanism for the garbage collector on Windows NT (single process, multithreaded), one might consider EnterCriticalSection. For many cases, it is MUCH faster than using mutexes and other synchronization mechanisms. (likely to be based on CMPXCHG). However, see http://www.cs.wustl.edu/~schmidt/win32-cv-1.html (3) Does anyone know how EnterCriticalSeciton is implemented? I tried writing semaphores, mutexses, shared semaphores, based on CMPXCHG, CMPXCHG8, but my implementations were always much slower than EnterCriticalSection. I had suspicion that it was not using CMPXCHG, and that was the reason why it could be so fast. But I could never be sure. From virtualcyber@erols.com Sat Mar 10 00:50:37 2001 From: virtualcyber@erols.com (Ji-Yong D. 
Chung) Date: Fri, 9 Mar 2001 19:50:37 -0500 Subject: [gclist] synchronization cost (was: Garbage collection and XML) References: <024701c0a8f9$0593cfb0$0100007f@cradle> Message-ID: <000a01c0a8fc$1fa3b020$0100007f@cradle> And I forgot to mention that EnterCriticalSection takes (I have read) about 6 CPU cycles in optimal case. I have seen implementations of non-reentrant spin locks that take a 5 cycles per lock (implemented using LOCK and XCHG and MOV. LOCK takes 1 CPU cycle, XCHG takes 3 CPU cycles and MOV takes 1 cycle). From hans_boehm@hp.com Sat Mar 10 01:36:32 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Fri, 9 Mar 2001 17:36:32 -0800 Subject: [gclist] synchronization cost (was: Garbage collection and XM L) Message-ID: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com> I just tried this on a 500 MHz Pentium III. I get about 23 cycles for lock; cmpxchg and about 19 or 20 cycles for xchg (which has an implicit lock prefix). I got consistent results by timing a loop and by looking at an instruction level profile. Putting other stuff in the loop didn't seem to affect the time taken by xchg much. Here's the code in case someone else wants to try. (This requires Linux/gcc) (Compile with gcc -static -O -DPROF swap.c prof.c to get profile.) 
swap.c:
--------------------------------------------
#include <stdio.h>

typedef int GC_bool;

# ifdef PROF
    /* Defined in prof.c. */
    void init_profiling(void);
    void dump_profile(void);
# endif

inline static GC_bool GC_test_and_set(volatile unsigned *addr) {
  int oldval;
  /* Note: the "xchg" instruction does not need a "lock" prefix */
  __asm__ __volatile__("xchgl %0, %1"
                : "=r"(oldval), "=m"(*(addr))
                : "0"(1), "m"(*(addr)) : "memory");
  return oldval;
}

volatile unsigned lock;

int main()
{
  int i;

# ifdef PROF
    init_profiling();
# endif
  for (i = 0; i < 10000000; ++i) {
    int result;
    result = GC_test_and_set(&lock);
    lock = 0;
    if (result) printf("Failed\n");
  }
# ifdef PROF
    dump_profile();
# endif
  return 0;
}
----------------------------------------------------
prof.c:
----------------------------------------------------
/* The header names were lost in archiving; these are the ones the code needs. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* A very simple profiler.  Note that it should be possible to  */
/* get function level information by concatenating this with nm */
/* output and running the result through the sort utility.      */
/* This assumes that all interesting parts of the executable    */
/* are statically linked.                                       */

static size_t buf_size;
static u_short *profil_buf;

# ifdef __i386__
#   ifndef COMPRESSION
#     define COMPRESSION 1
#   endif
#   define TEXT_START 0x08000000
#   define PTR_DIGS 8
# endif
# ifdef __ia64__
#   ifndef COMPRESSION
#     define COMPRESSION 8
#   endif
#   define TEXT_START 0x4000000000000000
#   define PTR_DIGS 16
# endif

extern int etext;

/*
 * Note that the ith entry in the profile buffer corresponds to
 * a PC value of TEXT_START + i * COMPRESSION * 2.
 * The extra factor of 2 is not apparent from the documentation,
 * but it is explicit in the glibc source.
 */
void init_profiling()
{
    buf_size = ((size_t)(&etext) - TEXT_START + 0x10)/COMPRESSION/2;
    profil_buf = calloc(buf_size, sizeof(u_short));
    if (profil_buf == 0) {
        fprintf(stderr, "Could not allocate profile buffer\n");
        buf_size = 0;   /* nothing to dump later */
        return;
    }
    profil(profil_buf, buf_size * sizeof(u_short),
           TEXT_START, 65536/COMPRESSION);
}

void dump_profile()
{
    size_t i;
    size_t sum = 0;

    for (i = 0; i < buf_size; ++i) {
        if (profil_buf[i] != 0) {
            fprintf(stderr, "%0*lx\t%d !PROF!\n", PTR_DIGS,
                    (unsigned long)(TEXT_START + i * COMPRESSION * 2),
                    profil_buf[i]);
            sum += profil_buf[i];
        }
    }
    fprintf(stderr, "Total number of samples was %lu !PROF!\n",
            (unsigned long)sum);
}
-------------------------------------------------

> -----Original Message-----
> From: Ji-Yong D. Chung [mailto:virtualcyber@erols.com]
> Sent: Friday, March 09, 2001 4:51 PM
> To: gclist@iecc.com
> Subject: Re: [gclist] synchronization cost (was: Garbage
> collection and
> XML)
>
>
> And I forgot to mention that EnterCriticalSection takes
> (I have read)
> about 6 CPU cycles in optimal case.
>
> I have seen implementations of non-reentrant spin locks
> that take a 5 cycles per lock (implemented using
> LOCK and XCHG and MOV. LOCK takes 1 CPU cycle, XCHG takes
> 3 CPU cycles and MOV takes 1 cycle).
>
>

From chrisd@reservoir.com Sat Mar 10 08:43:24 2001 From: chrisd@reservoir.com (Chris Dodd) Date: Sat, 10 Mar 2001 00:43:24 -0800 (Pacific Standard Time) Subject: [gclist] synchronization cost (was: Garbage collection and XML) Message-ID:

> (1) A few years ago, I had opportunity to do some measurements
> on CMPXCHG and if I remember correctly, the preceding figure is
> pretty close to what I got -- I was reading about 30-40 instructions per
> cmpxchg
> and more on cmpxchg8b on pentium II, 200 Mhz. (Windows NT4.0).

Was this a simple cmpxchg or a lock+cmpxchg? The former is much faster, but of course isn't atomic on a multiprocessor machine.
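[A portable sketch of the kind of compare-and-swap mutex being discussed, written with C11 atomics rather than inline assembly. On x86 a compiler would typically emit a locked cmpxchg for the compare-exchange below. The type and function names are hypothetical, and this makes no claim about how EnterCriticalSection is actually implemented.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Non-reentrant spin lock: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} spinlock;

/* One atomic compare-and-swap; succeeds only if the lock was free. */
static bool spin_trylock(spinlock *l) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->held, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void spin_lock(spinlock *l) {
    while (!spin_trylock(l)) {
        /* Wait with plain loads so that the expensive atomic
           operation is only retried once the lock looks free. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0) {
        }
    }
}

/* Releasing needs only an ordinary store with release ordering. */
static void spin_unlock(spinlock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

[In the uncontended case the cost of spin_lock is one atomic compare-and-swap, which is the operation whose cycle count is being measured in this thread.]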
> (2) If one is trying to use a faster locking mechanism
> for the garbage collector on Windows NT (single process,
> multithreaded), one might consider EnterCriticalSection.
> For many cases, it is MUCH faster than
> using mutexes and other synchronization mechanisms.
> (likely to be based on CMPXCHG).
>
> However, see
>
> http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
>
>
> (3) Does anyone know how EnterCriticalSeciton is implemented?
>
> I tried writing semaphores, mutexses, shared semaphores,
> based on CMPXCHG, CMPXCHG8, but my implementations
> were always much slower than EnterCriticalSection. I had
> suspicion that it was not using CMPXCHG, and
> that was the reason why it could be so
> fast. But I could never be sure.

Well, one important "optimization" that WinNT does is to have TWO DIFFERENT versions of EnterCriticalSection -- one for uniprocessor machines and one for multiprocessors. The UP version is considerably faster. I'm pretty sure the way they do that is to not use lock prefixes in the UP version, since on a UP x86 machine, individual instructions are (mostly) atomic. On an MP machine, you need a lock prefix to make them atomic.

I certainly had no difficulty writing a lock+cmpxchg based mutex that was faster than the MP version of EnterCriticalSection. The exact same code without the lock prefix was faster than the UP version of EnterCriticalSection.

Chris Dodd chrisd@reservoir.com

From virtualcyber@erols.com Sat Mar 10 21:55:16 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Sat, 10 Mar 2001 16:55:16 -0500 Subject: [gclist] synchronization cost (was: Garbage collection and XML) References: Message-ID: <000b01c0a9ac$e1e9d140$0100007f@cradle>

Hi,

I am embarrassed to admit that, in my previous posting, I should have checked my facts before commenting on spin locks and XCHG. After seeing Boehm's email, I ran tests on XCHG, and I am getting slightly worse results, at about 22 instruction cycles for each XCHG.
I have run the tests on a Pentium II (200 MHz), Windows NT 4.0, SP4. The test is compiled using VC++ 6. The test has to be run multiple times, either with the VC++ 6.0 profiler or with a timer, and with a minimum number of processes on the host machine.

||======================================
|| Here is my code for running the test

// The header names were lost in archiving; these are the ones the code needs.
#include <iostream>
#include <sys/types.h>
#include <sys/timeb.h>

int main()
{
    struct _timeb begin, end;
    int locker = 0;
    int* lock_addr = &locker;

    _ftime(&begin);
    __asm {
        mov ebx, [lock_addr]
        mov ecx, 10000000
    RETRY:
        mov edx, ebx
        mov eax, 1
        xchg eax, [edx]     // Here is the XCHG
        dec ecx
        jnz RETRY
    }
    _ftime(&end);

    // Elapsed wall-clock time in milliseconds for 10,000,000 exchanges.
    std::cout << (end.time - begin.time) * 1000 + end.millitm - begin.millitm << std::endl;
    return 0;
}

From virtualcyber@erols.com Sat Mar 10 23:12:10 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Sat, 10 Mar 2001 18:12:10 -0500 Subject: [gclist] collector optimization References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com> Message-ID: <000501c0a9b7$895b1150$0100007f@cradle>

Hi,

I just finished replacing my copying collector with Boehm's collector. (I used the included C++ interface on VC++6.0, NT platform). Eventually, I would like to try optimizing it for speed.

Does anyone know if there are application specific optimizations I can try with Boehm's collector? More specifically, I am wondering if there are parts of Boehm's code that are known to be hackable for application specific optimization -- I mean no disrespect to Boehm or to Boehm's collector, here :)

I do not mean just changing values of the tuning hooks that are provided, as I have done much of that.

Thanks in advance, for any information related to the collector optimization.

Take Care

Ji-Yong D.
Chung From chase@world.std.com Sat Mar 10 23:35:56 2001 From: chase@world.std.com (David Chase) Date: Sat, 10 Mar 2001 18:35:56 -0500 Subject: [gclist] collector optimization In-Reply-To: <000501c0a9b7$895b1150$0100007f@cradle> References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com> Message-ID: <4.3.2.7.0.20010310182211.02510990@pop.std.com> At 06:12 PM 3/10/2001 -0500, Ji-Yong D. Chung wrote: > Does anyone know if there are application specific >optimizations I can try with Boehm's collector? >More specifically, I am wondering if there are parts of Boehm's >code that are known to be hackable for application >specific optimization -- I mean no disrespect to >Boehm or to Boehm's collector, here :) > > I do not mean just changing values of the tuning >hooks that are provided, as I have done much of that. Depends upon what you mean by application-specific. Long ago, when I used the BW collector for a Modula-3 implementation, I took care (in compiler-generated code) to use "gc_malloc_atomic" for pointer-free data structures. We got bit once when someone loopholed pointers into an array of integers; the collector recycled the memory, and it was quite confusing. Another possibility is to open-code the free list selection code, for allocations of constant size. Not a gigantic win, but every little bit helps. Another "hack" you can apply, again through a compiler, is to use the predict-free call (if it still exists). The reason for this is that if the collector is reliably informed about how much free space it might expect to reclaim, it can more sensibly choose between collecting and simply growing the heap. 
If you do this, you must do it pretty well, else you just waste memory, but if you do it right, you avoid the thrashing that some systems will give you when you grow big data structures -- before they are willing to expand the heap, they do an expensive and useless collection (everything is still live), and repeat that until the heap is large enough for the data structure being built.

David Chase

From virtualcyber@erols.com Mon Mar 12 01:50:15 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Sun, 11 Mar 2001 20:50:15 -0500 Subject: [gclist] collector optimization References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com> <000501c0a9b7$895b1150$0100007f@cradle> <20010311100242.A19346@goop.org> Message-ID: <001a01c0aa96$c9adc9c0$0100007f@cradle>

Hi,

> [you wrote] I used BW as part of a JIT-compiling Java runtime.
> Apart from the things David mentioned I also:
>
> - hacked to the codegen to use the delay slots (SPARC target) to
> zero known-dead pointers in registers. If you know enough about the
> instruction scheduling a compiler can probably often find a good spot
> to stomp a dead reference. I never quantified the improvement, but it
> was basically free (the delay slots would have been nops otherwise),
> and its hard to see how it couldn't help.

I don't deal with code generators for SPARC, so my knowledge here is limited -- but if I understand you right, you are basically replacing dead code with instructions for zeroing pointers that will no longer be referenced? That seems to make sense.

I will try to zero out all useless pointers/values from local variables in function calls. (Does this gain you much, though, I still wonder)

> - used the typed-allocation interface. The class-layout routines would
> always clump pointer and non-pointer class members, so it was
> reasonably easy to generate a descriptor with a number of clumps of
> pointers (sub-class pointers could not be clumped with the super-class
> pointers).

This seems ...
just a bit painful, as you have mentioned that the performance gain may not be much :) To use the typed allocator interface, I would need to invoke a bit-map factory, in a static function, for every class I have created to be used with GC. That's a significant amount of work for a gain that may be marginal. From David.Chase@naturalbridge.com Mon Mar 12 02:06:44 2001 From: David.Chase@naturalbridge.com (David Chase) Date: 11 Mar 2001 21:06:44 -0500 Subject: [gclist] collector optimization In-Reply-To: <001a01c0aa96$c9adc9c0$0100007f@cradle> References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com> <000501c0a9b7$895b1150$0100007f@cradle> <20010311100242.A19346@goop.org> Message-ID: <4.3.2.7.0.20010311210337.0254e828@pop.std.com> At 08:50 PM 3/11/2001 -0500, Ji-Yong D. Chung wrote: > I don't deal with code generator for SPARC, so my knowledge here >is limited -- but if I understand you right, you are basically replacing >dead code with instructions for zeroing pointers that will no longer be >referenced? That seems to make sense. > > I will try to zero out all useless pointers/values from local variables. >in function calls. (Does this gain you much, though, I still wonder) One thing to watch out for here -- if you are generating code at the C level, and you insert assignments to zero out dead pointers, and you feed it to a decent C compiler, it will turn right around and remove those nulling assignments. After all, you're assigning values to a DEAD VARIABLE, right? Most times, you'd want the compiler to get rid of those assignments :-). David Chase -- David.Chase@NaturalBridge.com From hans_boehm@hp.com Mon Mar 12 17:56:26 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Mon, 12 Mar 2001 09:56:26 -0800 Subject: [gclist] collector optimization Message-ID: <140D21516EC2D3119EE7009027876644049B5C9C@hplex1.hpl.hp.com> As David pointed out, it's very helpful to tell the collector which objects are completely pointer-free.
Besides reducing the potential for false pointers, this reduces the number of cache lines and pages touched during GC, sometimes by a large fraction. There are several ways to pass more detailed layout information to the collector. The C typed allocation interface is one. Gcj uses another that's more geared towards a world in which every object has a "vtable" pointer anyway. They help primarily in reducing the potential for false pointers. The impact on typical GC time is usually minimal, since this often doesn't change the set of cache lines that need to be touched during GC by much. (If you end up using primarily one of these facilities, it may be worth restructuring the mark loop to check for the most common case first. Currently it assumes that it will most often be asked to scan sequential ranges of memory, as opposed to ranges described by bit maps.) As David also points out, there are hooks for controlling the triggering of garbage collections, if you notice that you are spending significant amounts of time in badly timed collections, i.e. when nothing gets reclaimed. In your environment, you could probably get a significant win (10% of GC time for X86?) if you are willing to tie the GC object code to a specific machine. Enabling prefetching in the marker often results in a significant reduction of GC time. (See my ISMM 2000 paper.) Code to do this currently exists for Linux/X86, but not NT. It should be easy to add, assuming there is a way to tell the compiler to generate a prefetch instruction. The problem is that you need either a Pentium II+ or a recent AMD processor, and the Intel and AMD prefetch instructions are incompatible. I've been considering optionally including all versions, and switching them based on a dynamic test for the processor type. But that's not yet there. I would expect that for something like Scheme implementation, versions 6.x will outperform the 5.x versions of the collector, due to a more refined GC triggering heuristic. 
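The marker prefetching mentioned above can be illustrated with a toy mark loop. This is a minimal sketch under stated assumptions, not the collector's actual code: the `Obj` layout, the `heap` and `marks` arrays, and `mark_from` are all hypothetical, and `__builtin_prefetch` assumes GCC or Clang (other compilers get a no-op). The idea shown is to keep mark bits in a side table, so testing a mark does not touch the object, and to issue the prefetch when an object is pushed on the mark stack, so that its cache line may already be present when it is later popped and scanned.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_OBJS 8
#define MAX_KIDS  2

/* Hypothetical object layout: child pointers plus some payload. */
typedef struct Obj {
    struct Obj *kids[MAX_KIDS];
    long payload;
} Obj;

static Obj heap[HEAP_OBJS];
static unsigned char marks[HEAP_OBJS];   /* side table: one mark per object */

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)   /* read, keep in cache */
#else
#define PREFETCH(p) ((void)(p))                     /* no-op fallback */
#endif

/* Mark everything reachable from root (marks[] assumed cleared beforehand).
 * Returns the number of objects scanned. */
static size_t mark_from(Obj *root)
{
    Obj *stack[HEAP_OBJS];   /* each object is pushed at most once */
    size_t sp = 0, scanned = 0;

    if (root == NULL) return 0;
    marks[root - heap] = 1;
    stack[sp++] = root;

    while (sp > 0) {
        Obj *o = stack[--sp];            /* scanning o touches its memory */
        scanned++;
        for (int i = 0; i < MAX_KIDS; i++) {
            Obj *k = o->kids[i];
            if (k == NULL || marks[k - heap]) continue;  /* side-table test */
            marks[k - heap] = 1;
            PREFETCH(k);                 /* start the fetch now, scan later */
            assert(sp < HEAP_OBJS);
            stack[sp++] = k;
        }
    }
    return scanned;
}
```

Whether this helps in practice depends on the processor and on how long objects sit on the mark stack before being scanned; the sketch only shows where the prefetch goes.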
I've found that under Linux the collector is now occasionally faster in incremental/generational mode. That's application dependent. I'm not sure whether that's true under NT, since (based on obsolete anecdotal evidence only) I believe the signal/exception handling overhead for the VM write barrier is higher under NT. Hans > -----Original Message----- > From: Ji-Yong D. Chung [mailto:virtualcyber@erols.com] > Sent: Saturday, March 10, 2001 3:12 PM > To: gclist@iecc.com > Subject: [gclist] collector optimization > > > Hi, > > I just finished replacing my copying collector with > Boehm's collector. (I used the included C++ interface > on VC++6.0, NT platform). > > Eventually, I would like to try optimizing it for > speed. > > Does anyone know if there are application specific > optimizations I can try with Boehm's collector? > More specifically, I am wondering if there are parts of Boehm's > code that are known to be hackable for application > specific optimization -- I mean no disrespect to > Boehm or to Boehm's collector, here :) > > I do not mean just changing values of the tuning > hooks that are provided, as I have done much of that. > > Thanks in advance, for any information related > to the collector optimization. > > > Take Care > Ji-Yong D. Chung > > > > > From virtualcyber@erols.com Tue Mar 13 01:33:25 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Mon, 12 Mar 2001 20:33:25 -0500 Subject: [gclist] collector optimization References: <140D21516EC2D3119EE7009027876644049B5C9B@hplex1.hpl.hp.com><000501c0a9b7$895b1150$0100007f@cradle><20010311100242.A19346@goop.org> <4.3.2.7.0.20010311210337.0254e828@pop.std.com> Message-ID: <006301c0ab5d$9bcd86e0$0100007f@cradle> Hi > One thing to watch out for here -- if you are generating > code at the C level, and you insert assignments to zero > out dead pointers, and you feed it to a decent C compiler, > it will turn right around and remove those nulling assignments. Thank you for pointing that out. 
Doing the coding work only to have a compiler rub it out -- that would have been not only an example of highly inefficient coding, but also embarrassing. :) -- More embarrassing than making a mistaken claim about the execution cost of the assembly instruction XCHG, and then emailing your claim to 10000000000000 people. From virtualcyber@erols.com Tue Mar 13 03:16:31 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Mon, 12 Mar 2001 22:16:31 -0500 Subject: [gclist] collector optimization References: <140D21516EC2D3119EE7009027876644049B5C9C@hplex1.hpl.hp.com> Message-ID: <007e01c0ab6c$00eb6b60$0100007f@cradle> Hi Thanks for your reply -- I will pay careful attention to your suggestions. > I've found that under Linux the collector is now occasionally faster in > incremental/generational mode. That's application dependent. Though, I wonder if it is a sign that Linux has some personal issues to deal with. Linux, I have read, does not manage threads too well, at least as a real-time OS. See http://www.ittc.ukans.edu/~zvishal/courses/800/RT_ORB/ From jmaessen@mit.edu Wed Mar 14 16:15:08 2001 From: jmaessen@mit.edu (Jan-Willem Maessen) Date: Wed, 14 Mar 2001 11:15:08 -0500 Subject: [gclist] RE: synchronization cost (was: Garbage collection and XML) Message-ID: <200103141615.LAA00703@lauzeta.mit.edu> Much discussion of CMPXCHG on Intels has gone by. However, it's worth pointing out that Pentium-class processors allow you to do all sorts of locked operations. From the IA-64 manual, Pentium ISA section [p 5-261, LOCK, if you care; the older Pentium manuals agree on this information]: The LOCK prefix can be prepended only to the following instructions and to those forms of the instructions that use a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. ... That's a pretty long list, and one of the above instructions is usually what you actually want.
For example (getting back to GC here), BTC/BTR/BTS allow you to atomically update a shared allocation/mark bitmap efficiently. As a datapoint, when I was first designing the synchronization for Eager Haskell (lots of shared updates and a global GC'd heap on an SMP), I took a look at the synchronization in the Linux kernel. The result was enlightening: $ cat /proc/version Linux version 2.2.12-20 (root@lauzeta.mit.edu) [...] $ cd /usr/src/linux $ find . -name '*.[ch]' -print | xargs fgrep 'cmpxchg' ./drivers/usb/uhci.c: asm volatile("lock ; cmpxchg %4,%2 ; sete %0" ./drivers/usb/uhci.c: asm volatile("lock ; cmpxchg %0,%1" Only one file with a cmpxchg---in the [I believe] then-experimental usb code. There are tons of calls to xchg, and many atomic increments, bit set/clear operations, and the like (you can find these using a similar command, though it's a bit more work separating wheat from chaff). For me, the result of this observation was simple. I wrote interfaces which provide the atomic operations I actually require for my system: exchange, compare and swap, bit vector operations, and so forth. On a Pentium, these each have their own inline assembly. On another architecture, such as SPARC, I roll them using compare and swap. I'm in the odd situation of doing some synchronization directly in compiler-generated code (I'm generating C); I suspect most similar systems restrict synchronization to run-time routines. In my case gcc seems to do a noticeably better job of optimization if I avoid "asm volatile" entirely in favor of "asm" and a correct set of instruction effects. This isn't surprising, really; what is surprising is how infrequently it seems to be done in others' code. I'll close with a question I have not yet managed to answer. Our GC uses an unshared nursery. Right now, I test for nusery-ness on some paths in order to determine whether to perform a write barrier. 
The same test allows me to perform synchronization (in this case a Store/Store fence) only on shared objects. The open question: is it worth checking for locality before _every_ such synchronization? If so, it is not worth eliminating the test from my write barrier code, and I should add a similar test to other code. Otherwise, I should use a test-free write barrier (blind card marking of some sort), perform synchronization all over the place and rely on the fact that the local stuff will happen in cache. Has anyone done an experiment like this on a multiprocessor? -Jan-Willem Maessen Eager Haskell project jmaessen@mit.edu From hans_boehm@hp.com Wed Mar 14 17:21:02 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Wed, 14 Mar 2001 09:21:02 -0800 Subject: [gclist] RE: synchronization cost (was: Garbage collection an d XML) Message-ID: <140D21516EC2D3119EE7009027876644049B5CB0@hplex1.hpl.hp.com> > Much discussion of CMPXCHG on Intels has gone by. However, it's worth > pointing out that Pentium-class processors allow you to do all sorts > of locked operations. From the IA-64 manual, Pentium ISA section [p > 5-261, LOCK, if you care; the older Pentium manuals agree on this > information]: > > The LOCK prefix can be prepended only to the following instructions > and to those forms of the instructions that use a memory operand: > ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB, > SUB, XOR, XADD, and XCHG. ... > > ... > > For me, the result of this observation was simple. I wrote interfaces > which provide the atomic operations I actually require for my system: > exchange, compare and swap, bit vector operations, and so forth. On a > Pentium, these each have their own inline assembly. On another > architecture, such as SPARC, I roll them using compare and swap. 
Extrapolating from the posted results, I would guess that, assuming all cache hits, implementing say fetch-and-add using CMPXCHG costs somewhere in the 25-27 cycle range, where LOCK; ADD is probably around 20. I would guess that on architectures using LL-SC, the difference is vaguely comparable. And it's tiny compared to the difference between either of these and wrapping the operation in a lock. It's probably still worthwhile in many cases, and I will probably go back and do it in my collector code, but it currently seems to be an engineering tradeoff between less machine-dependent code and saving a few cycles. Now if there were a standard library that implemented all the variants efficiently so that everyone didn't have to reimplement them ... > I'm in the odd situation of doing some synchronization directly in > compiler-generated code (I'm generating C); I suspect most similar > systems restrict synchronization to run-time routines. In my case gcc > seems to do a noticeably better job of optimization if I avoid "asm > volatile" entirely in favor of "asm" and a correct set of instruction > effects. This isn't surprising, really; what is surprising is how > infrequently it seems to be done in others' code. Does it really make much difference if one of the effects is "memory"? For updating a mark bit it wouldn't need to include that. In many other cases, you do want the compiler to treat it as a memory barrier. I have recently tended to specify both "memory" and "volatile", largely out of paranoia. I think linuxthreads is similar. Is there a cheaper way to ensure that gcc preserves the memory barrier semantics? (I agree that it's not needed when you don't need the barrier.) 
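The fetch-and-add comparison above can be sketched in portable C. This is a minimal illustration, assuming a C11 compiler with <stdatomic.h> (which postdates this thread); the function names are hypothetical. The first function rolls fetch-and-add out of compare-and-swap, the second asks for the operation directly, and the third shows the atomic mark-bit update mentioned earlier in the thread. How each one is lowered (for example to a LOCK-prefixed instruction or a CMPXCHG retry loop on x86) is up to the compiler.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Fetch-and-add built from compare-and-swap; returns the old value. */
static unsigned long faa_via_cas(_Atomic unsigned long *p, unsigned long delta)
{
    unsigned long old = atomic_load_explicit(p, memory_order_relaxed);
    /* On failure, the call reloads `old` with the current value of *p,
     * so the loop simply retries with fresh data. */
    while (!atomic_compare_exchange_weak(p, &old, old + delta)) {
        /* retry */
    }
    return old;
}

/* The same operation requested directly. */
static unsigned long faa_direct(_Atomic unsigned long *p, unsigned long delta)
{
    return atomic_fetch_add(p, delta);
}

/* Atomically set one bit in a shared mark-bitmap word.
 * Returns 1 if the bit was already set, 0 otherwise. */
static int mark_bit_test_and_set(_Atomic unsigned long *word, unsigned bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    unsigned long mask = 1UL << bit;
    return (atomic_fetch_or(word, mask) & mask) != 0;
}
```

These default to sequentially consistent ordering, which also acts as a compiler barrier; a mark-bit update that needs no ordering could use the `_explicit` variants with `memory_order_relaxed` instead.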
Hans From plakal@cs.wisc.edu Wed Mar 14 17:26:45 2001 From: plakal@cs.wisc.edu (Manoj Plakal) Date: Wed, 14 Mar 2001 11:26:45 -0600 Subject: [gclist] RE: synchronization cost (was: Garbage collection and XML) In-Reply-To: <200103141615.LAA00703@lauzeta.mit.edu>; from Jan-Willem Maessen on Wed, Mar 14, 2001 at 11:15:08AM -0500 References: <200103141615.LAA00703@lauzeta.mit.edu> Message-ID: <20010314112645.B12837@cs.wisc.edu> Jan-Willem Maessen wrote (Wed, Mar 14, 2001 at 11:15:08AM -0500) : > Much discussion of CMPXCHG on Intels has gone by. However, it's worth > pointing out that Pentium-class processors allow you to do all sorts > of locked operations. From the IA-64 manual, Pentium ISA section [p > 5-261, LOCK, if you care; the older Pentium manuals agree on this > information]: > > The LOCK prefix can be prepended only to the following instructions > and to those forms of the instructions that use a memory operand: > ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB, > SUB, XOR, XADD, and XCHG. ... > > That's a pretty long list, and one of the above instructions is > usually what you actually want. For example (getting back to GC > here), BTC/BTR/BTS allow you to atomically update a shared > allocation/mark bitmap efficiently. If you look at the part of the Intel manuals describing optimizations for the Pentium-II/III/IV, I think you'll find that they deprecate the use of prefixes like this. Manoj From chase@world.std.com Wed Mar 14 18:32:03 2001 From: chase@world.std.com (David Chase) Date: Wed, 14 Mar 2001 13:32:03 -0500 Subject: [gclist] RE: synchronization cost (was: Garbage collection and XML) In-Reply-To: <200103141615.LAA00703@lauzeta.mit.edu> Message-ID: <4.3.2.7.0.20010314104030.02731da8@pop.std.com> At 11:15 AM 3/14/2001 -0500, Jan-Willem Maessen wrote: >Much discussion of CMPXCHG on Intels has gone by. However, it's worth >pointing out that Pentium-class processors allow you to do all sorts >of locked operations. 
>Only one file with a cmpxchg---in the [I believe] then-experimental >usb code. There are tons of calls to xchg, and many atomic >increments, bit set/clear operations, and the like (you can find these >using a similar command, though it's a bit more work separating wheat >from chaff). I can only speak for myself, but designing a VM for Java, I worked with the instructions that I knew were available on most architectures (Sparc V9, Pentium, PowerPC, M68040, MIPS, notably NOT Sparc V8) and not all of those have a general class of locked instructions. It's possible to simulate everything else with CAS or LL/SC, but in those cases you probably would have been better off in the first place designing algorithms that worked directly with CAS or LL/SC. I was also interested in building a system that was starvation-free, and also that would scale efficiently. Given limited intellectual bandwidth, and these constraints, I decided to use wait-free internal data structures designed around CAS or LL/SC. I'd read the Herlihy papers on this, and given the tricky world I was working on, this seemed like the most practical choice. It's possible I could have come up with something faster that used (for instance) exchange or atomic add, but what we've got now is quite fast (we benchmark against the competition, obviously) and quite scalable (synchronization never, ever, requires allocation of additional anything, a huge simplification, and a decoupling of two pieces of the VM that I am really want to be decoupled). The use of CAS also allows us to maintain always-globally- consistent data structures, which is a very good thing -- for instance, our VM runs with deadlock-detection always enabled for Java locks. (In fact, this sped things up a little on most benchmarks, which indicates that we ought to do just a hair more busy-waiting before putting a thread properly to sleep.) 
In addition, using CAS we were able to design synchronization that is slow only in the following cases: 1) contended first acquisition 2) contended release Anything else goes fast (single CAS) or faster (no CAS). The internals of a parallel garbage collector are another animal of course, as is the card-marking code. That, we generate directly from a compiler. On Pentium, we simply store bytes with no additional synchronization. On a machine with much weaker memory consistency, we might do something else altogether. My assumption is that the conditionals and comparisons necessary to determine that a card mark could be avoided are more expensive than simply doing the mark blind. I could be wrong, but modern processors aren't very happy about conditional branches in general. This decision might be different on processors that didn't provide consistency of byte stores across processors (that is, if one cache line could overwrite another). I don't care about apparent ordering, as long as a write never disappears (because card-marking is monotonic until a GC -- only the GC unmarks cards). BUT, if you can avoid a memory bus lock, at least on Pentium, you can afford the conditionals. This I know from benchmarks, though I also know that the branches are all extremely well-predicted. The conditional branch seems to add 5-10% to the cost of the locked CAS, but subtracts 90-95% if the lock can be avoided. We detect when we are running on a uniprocessor, and simply avoid using the locked version of CAS in that case. A JIT could compile this in-line; we impose what is essentially a small tax on the MP case. >[ telegraphically described experiment elided] >Has anyone done an experiment >like this on a multiprocessor? Not yet :-). One experiment we could run, but have not yet, is to determine the overhead of (optimized in various ways) card marking. 
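The two write-barrier styles under discussion, a blind card mark and one guarded by a locality test, can be sketched as follows. This is a single-threaded illustration under assumptions, not code from either system: the heap layout, the 512-byte card size, the nursery boundary, and all names are hypothetical. It shows only the shape of the two barriers and makes no claim about which is faster.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_SLOTS    4096
#define NURSERY_SLOTS 1024   /* hypothetical: first slots form the nursery */
#define CARD_SHIFT    9      /* hypothetical 512-byte cards */
#define NUM_CARDS     (((HEAP_SLOTS * sizeof(void *)) >> CARD_SHIFT) + 1)

static void *heap[HEAP_SLOTS];
static unsigned char card_table[NUM_CARDS];  /* 1 = dirty; only a GC clears */

/* Map a heap slot to the index of the card that covers it. */
static size_t card_index(void **slot)
{
    size_t off = (size_t)((char *)slot - (char *)heap);
    assert(off < sizeof heap);
    return off >> CARD_SHIFT;
}

/* Blind barrier: store, then always dirty the card with a byte store. */
static void store_blind(void **slot, void *value)
{
    *slot = value;
    card_table[card_index(slot)] = 1;
}

/* Filtered barrier: skip the card mark when the slot is in the nursery. */
static void store_filtered(void **slot, void *value)
{
    *slot = value;
    if (slot >= heap + NURSERY_SLOTS)        /* the extra conditional */
        card_table[card_index(slot)] = 1;
}
```

Because marking is monotonic between collections, the blind version stays correct even if the same card is dirtied repeatedly; the filtered version trades a branch for fewer stores.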
We support two garbage collectors; we could take a set of benchmarks, run them under full-copying with card marks enabled, then recompile them w/o card marks and rerun and measure the difference. David Chase From fjh@cs.mu.oz.au Wed Mar 14 20:37:03 2001 From: fjh@cs.mu.oz.au (Fergus Henderson) Date: Thu, 15 Mar 2001 07:37:03 +1100 Subject: [gclist] Boehm GC & Linux/SPARC In-Reply-To: <140D21516EC2D3119EE700902787664401E3A7F7@hplex1.hpl.hp.com> Message-ID: <20010315073703.A26598@hg.cs.mu.oz.au> Q1: is there a boehm-gc-developers mailing list? Should there be? Q2: Is anyone using the Boehm et al collector with Linux/SPARC? There seems to be some code in it to handle that combination, but it doesn't work. I tried it (using cf.sourceforge.net) and found that it crashes very early. The definition of DATASTART using LINUX_DATA_START doesn't work, because __data_start is not defined (both __data_start and data_start are zero). I had a look at the linker script (output by `ld -v'), and based on that, I tried using __etext for DATASTART. However, that didn't work, because there are some unmapped pages between the rodata (which follows __etext) and the other data. There didn't seem to be any linker-defined symbol I could use to find the end of rodata or the start of the remaining data. -- Fergus Henderson | "I have always known that the pursuit | of excellence is a lethal habit" WWW: | -- the last words of T. S. Garp. From hans_boehm@hp.com Wed Mar 14 23:12:40 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Wed, 14 Mar 2001 15:12:40 -0800 Subject: [gclist] Boehm GC & Linux/SPARC Message-ID: <140D21516EC2D3119EE7009027876644049B5CB7@hplex1.hpl.hp.com> > Q2: Is anyone using the Boehm et al collector with Linux/SPARC? > There seems to be some code in it to handle that combination, but it > doesn't work. I tried it (using cf.sourceforge.net) and found that it > crashes very early. 
The definition of DATASTART using > LINUX_DATA_START > doesn't work, because __data_start is not defined (both __data_start > and data_start are zero). > > I had a look at the linker script (output by `ld -v'), > and based on that, I tried using __etext for DATASTART. > However, that didn't work, because there are some unmapped pages > between the rodata (which follows __etext) and the other data. > There didn't seem to be any linker-defined symbol I could use > to find the end of rodata or the start of the remaining data. > This seems to be somewhat distribution dependent. (Dependent on both glibc and the linker script, to be more precise.) I believe there is now a consensus that __data_start should be defined, and it is defined in glibc on most of the architectures, I believe. If it still isn't defined on SPARC, there's a good chance the glibc maintainer (drepper@redhat.com) would appreciate a patch. In the latest 6.0alpha versions, if you define SEARCH_FOR_DATA_START the collector will look for a nonzero definition of first __data_start, then data_start, and then search backward from _end if they both fail. That should work OK if someone also submits the glibc patch. My version actually uses GC_SysVGetDataStart on Linux/SPARC. Does that not work? Hans From fjh@cs.mu.oz.au Thu Mar 15 04:40:04 2001 From: fjh@cs.mu.oz.au (Fergus Henderson) Date: Thu, 15 Mar 2001 15:40:04 +1100 Subject: [gclist] Boehm GC & Linux/SPARC In-Reply-To: <140D21516EC2D3119EE7009027876644049B5CB7@hplex1.hpl.hp.com> References: <140D21516EC2D3119EE7009027876644049B5CB7@hplex1.hpl.hp.com> Message-ID: <20010315154004.A26458@hg.cs.mu.oz.au> On 14-Mar-2001, Boehm, Hans wrote: > > My version actually uses GC_SysVGetDataStart on Linux/SPARC. Does that not > work? Before sending my mail, I checked that gc6.0alpha6 fails too. It too seg faults in the same place (at `deferred = *limit', line 654 in mark.c). 
But I didn't notice that it was using GC_SysVGetDataStart(); it looks like it is failing for a different reason. I found a work-around: compile with `-O0'. -- Fergus Henderson | "I have always known that the pursuit | of excellence is a lethal habit" WWW: | -- the last words of T. S. Garp. From cef@geodesic.com Thu Mar 15 21:35:20 2001 From: cef@geodesic.com (Charles Fiterman) Date: Thu, 15 Mar 2001 15:35:20 -0600 Subject: [gclist] Java Phantom references. Message-ID: <3.0.1.32.20010315153520.014b67b0@pop3.geodesic.com> First is there a Java Manual other than the Sun Website new enough to discuss phantom references? What are they supposed to be used for? How do collectors handle them? My guess is when the object is toast the phantom reference is put on a queue. From DICK@watson.ibm.com Thu Mar 15 21:49:06 2001 From: DICK@watson.ibm.com (DICK@watson.ibm.com) Date: Thu, 15 Mar 01 16:49:06 EST Subject: [gclist] Java Phantom references. Message-ID: <200103152155.QAA38986@sp1n189at0.watson.ibm.com> ***** Reply to your note of: Thu, 15 Mar 2001 15:35:20 -0600 ************* The Addison-Wesley book "The Java Class Libraries Second Edition, Volume 1 Supplement for the Java 2 Platform Standard Edition, v1.2" ISBN 0 201 48552 4 discusses phantom references - p 698. I think they're supposed to be a generalization of finalizable objects (with wrinkles - see soft/weak references also) - rather than just a finalize() method of a finalizable object, _any_ object can be notified when some object of interest becomes garbage. C.R. Attanasio From virtualcyber@erols.com Fri Mar 16 08:07:08 2001 From: virtualcyber@erols.com (Ji-Yong D. Chung) Date: Fri, 16 Mar 2001 03:07:08 -0500 Subject: [gclist] My copying collector or Boehm's? References: <20010315073703.A26598@hg.cs.mu.oz.au> Message-ID: <000901c0adf0$19af2980$0100007f@cradle> Hi, This question is regarding a design decision, whether to use Boehm's collector or my copying collector.
I have written a Scheme interpreter in C++, solely to manipulate XML parsing. I have not implemented the XML parser yet, but I am planning to write one, as a C++ extension to the Scheme interpreter. Its memory management will be done by the garbage collector for the interpreter itself. To get ready to write this extension, I have just replaced my super simple collector (which uses Cheney's algorithm) with Boehm's collector. After a number of test runs comparing both collectors, I cannot decide whether to keep my original collector or use Boehm's. With Boehm's collector, my interpreter runs about 2.5 times slower than before. This is not a knock on Boehm's collector. The drop off in performance was expected, because (1) my original collector is turned on and off at precise points in my C++ code to minimize collection, (2) my collector uses type information all the time, (3) it uses no locks for allocation, because it has a separate heap for each thread, and (4) heap residency was low for the test cases -- which favors a copying collector over mark-sweep. Of course, Boehm's collector offers things other than raw performance. First of all, it is written for C/C++, so that if I write XML extensions in C++, Boehm's collector fits in nicely with my C++ implementations of Scheme. My collector, on the other hand, is a copying collector, and it is not generalized for C/C++. If I write any Scheme extensions in C++, I must "protect" local C++ variables that might lose their pointers to Scheme objects (again, implemented in C++) due to the garbage collector. This makes C++ code writing more painful than it would be if I were using Boehm's collector. Secondly, for each class, I must implement a static "Move" function (for copying one object from FromSpace to ToSpace). Writing support structures to dispatch this at high speed makes my code much more complex (and ugly).
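Cheney's algorithm and the "Move" function mentioned above can be sketched as follows. This is a minimal sketch under simplifying assumptions, not the interpreter's actual collector: there is a single hypothetical `Cell` type (so one `move` routine stands in for the per-class ones), two fixed-size semispaces, and the roots are passed in as an explicit array, which is the "protect local variables" burden described above. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define SPACE_CELLS 16

/* Hypothetical object: two child pointers, a value, a forwarding slot. */
typedef struct Cell {
    struct Cell *forward;     /* non-NULL once copied: the new address */
    struct Cell *car, *cdr;   /* children, or NULL */
    int value;
} Cell;

static Cell space_a[SPACE_CELLS], space_b[SPACE_CELLS];
static Cell *from_space = space_a, *to_space = space_b;
static size_t alloc_top;      /* next free index in from_space */

static Cell *alloc_cell(int value, Cell *car, Cell *cdr)
{
    assert(alloc_top < SPACE_CELLS);   /* a real allocator would collect */
    Cell *c = &from_space[alloc_top++];
    c->forward = NULL; c->car = car; c->cdr = cdr; c->value = value;
    return c;
}

/* "Move": copy one object to to-space and leave a forwarding pointer,
 * or return the existing copy if it was already moved. */
static Cell *move(Cell *old, size_t *free_idx)
{
    if (old == NULL) return NULL;
    if (old->forward != NULL) return old->forward;
    Cell *copy = &to_space[(*free_idx)++];
    *copy = *old;             /* copy->forward is still NULL here */
    old->forward = copy;
    return copy;
}

/* Cheney scan: move the roots, then walk to-space breadth-first,
 * moving children as they are encountered; finally flip the spaces. */
static void collect(Cell **roots, size_t nroots)
{
    size_t scan = 0, free_idx = 0;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = move(roots[i], &free_idx);
    while (scan < free_idx) {
        Cell *c = &to_space[scan++];
        c->car = move(c->car, &free_idx);
        c->cdr = move(c->cdr, &free_idx);
    }
    Cell *tmp = from_space; from_space = to_space; to_space = tmp;
    alloc_top = free_idx;
}
```

Any pointer that is not in the `roots` array when `collect` runs is left pointing into the old space, which is why a copying collector needs local variables to be registered while a conservative collector does not.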
I have a difficult time deciding whether to use Boehm's collector or not because (1) on one hand, I have read that I am generally supposed to sacrifice performance for design improvement; (2) on the other hand, with Boehm's collector, my interpreter runs 2.5 times as slow -- which may be too much to sacrifice. In summary, using Boehm's collector would simplify C++ code writing, but it would make my code regrettably slower. Which one should I use? I have been banging my head against walls for a few days. I would appreciate any comments or insights that would alleviate my headache. P.S. In case anyone asks me what my "user requirements" are -- Because I am writing a "new" type of application, I am not sure at this point if there is such a thing. There is some chance that performance will become an issue, so I have always been trying not to be ad-hoc in my design decisions. From pekka@harlequin.co.uk Fri Mar 16 13:58:40 2001 From: pekka@harlequin.co.uk (Pekka P. Pirinen) Date: Fri, 16 Mar 2001 13:58:40 GMT Subject: [gclist] Java Phantom references. In-Reply-To: <3.0.1.32.20010315153520.014b67b0@pop3.geodesic.com> (message from Charles Fiterman on Thu, 15 Mar 2001 15:35:20 -0600) Message-ID: <200103161358.NAA20538@zaphod.cam.harlequin.co.uk> > First is there a Java Manual other than the Sun Website new enough to > discuss phantom references? That's canonical. I suppose there must be other JDK 1.2 manuals by now. There's a nice article on the technicalities of weak refs at . I wrote precise definitions of the Java weakness concepts for the MM Reference, see . > What are they supposed to be used for? Cleanup after the finalize method (which might well be inherited) has run. Also, I think it's nice for doing finalization (in the generic sense) without opening the door for resurrection: you just subclass PhantomReference to hold the info you need for finalization. > How do collectors handle them?
My guess is when the object is toast the > phantom reference is put on a queue. That's pretty much given, since one has to implement the ReferenceQueue anyway. -- Pekka P. Pirinen Harlequin Limited From hans_boehm@hp.com Fri Mar 16 17:19:22 2001 From: hans_boehm@hp.com (Boehm, Hans) Date: Fri, 16 Mar 2001 09:19:22 -0800 Subject: [gclist] My copying collector or Boehm's? Message-ID: <140D21516EC2D3119EE7009027876644049B5CC0@hplex1.hpl.hp.com> A few observations: > With Boehm's collector, my interpreter runs about 2.5 > times slower than before. This is not a knock on Boehm's > collector. I still find that surprising, since it suggests that your interpreter is spending >60% of its time allocating/collecting with our collector, something I don't normally see. A profile and GC log would be interesting to me. In particular, it would be nice to know: a) How much time is spent in the marker? b) How much time is spent locking around allocation? (The win32 threads port uses EnterCriticalSection() and LeaveCriticalSection(). On many other platforms a custom locking scheme is used instead, since the standard one exhibited serious performance problems, some form of convoying being the most common. I've heard from several sources that at least some implementations off the win32 primitives have similar problems, so a custom solution should be used as well. That would be fairly easy, but I haven't gotten around to it yet. A confirmation that it really is a problem would help.) c) What fraction of the time is spent context switching (see (b)). d) Does the amount of live data in the GC log look right? > The drop off in performance was expected, > because (1) my original collector is turned on and off > at precise points in my C++ code to minimize > collection That's potentially a big win, clearly. You can also do that with our collector, though it may be much less effective with a global heap than per-thread heaps. 
> (2) my collector uses type information all the time,

My experience has been that for small objects this matters in the expected case only to the extent that it reduces the overhead of checking a real pointer, i.e. only if you can actually reduce checking on a "pointer" field to a comparison against null. That requires that you disallow pointers to statically allocated data, which typically requires some copying for constants. And even the significance of that has been decreasing as GC costs become more dominated by the costs of cache misses on the data being traced/copied.

> (3) it uses no locks for allocation, because it has a
> separate heap for each thread, and

Potentially a large win, but very restrictive on the client code.

> (4) heap residency
> was low for the test cases. -- which favors copying
> collector over mark-sweep.

Somewhat. But it also helps our collector a lot, provided you similarly increase the heap size. (This should work better in 6.0 than in the 5.x releases.) If it doesn't help, I would definitely suspect the locking code.

Hans

From virtualcyber@erols.com Fri Mar 16 23:20:59 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Fri, 16 Mar 2001 18:20:59 -0500
Subject: [gclist] My copying collector or Boehm's?
References: <140D21516EC2D3119EE7009027876644049B5CC0@hplex1.hpl.hp.com>
Message-ID: <001101c0ae6f$c3237e60$0100007f@cradle>

Hi,

There is always a good chance, given my relative inexperience, that I really could be screwing up with my use of your collector. At least it is not crashing :)

> [I wrote]
> > With Boehm's collector, my interpreter runs about 2.5
> > times slower than before. This is not a knock on Boehm's
> > collector.
> [you replied] I still find that surprising, since it suggests that your interpreter is
> spending >60% of its time allocating/collecting with our collector,
> something I don't normally see. A profile and GC log would be interesting
> to me.
> In particular, it would be nice to know:
> a) How much time is spent in the marker?
> b) How much time is spent locking around allocation? (The win32 threads
> port uses EnterCriticalSection() and LeaveCriticalSection(). On many other
> platforms a custom locking scheme is used instead, since the standard one
> exhibited serious performance problems, some form of convoying being the
> most common.

Only 1 thread was really active in my test runs, so I'd guess that a convoy did not have a chance to form.

In any case, I did two profile runs on VC++6.0: one for my Scheme interpreter's main evaluation loop, and another one for GC_malloc. (While it would have been preferable to profile them in the same run, I had slight difficulty with getting my profiler to run correctly). Both profiles were taken with my interpreter running the same Scheme source code.

=====================================================================
|| This profile run is on the Evaluation loop of my Scheme interpreter.
|| This is difficult to read, but there are supposed to be 6 columns.
|| They are:
|| (1) function time in milliseconds,
|| (2) function time as percent of overall measured time
|| (3) function+child time,
|| (4) function+child time as percent of overall measured time
|| (5) hit count
|| (6) name of the function.

Profile: Function timing, sorted by time
Date:    Fri Mar 16 17:37:57 2001

Program Statistics
------------------
    Command line at 2001 Mar 16 17:34: tpscript.exe
    Start function: _GC_malloc
    Total time: 18652.854 millisecond
    Time outside of functions: 7747.360 millisecond
    Call depth: 13
    Total functions: 3854
    Total hits: 5216935
    Function coverage: 2.9%
    Overhead Calculated 6
    Overhead Average 6

Module Statistics for tpscript.exe
----------------------------------
    Time in module: 6488.312 millisecond
    Percent of time in module: 59.5%
    Functions in module: 3589
    Hits in module: 2337349
    Module function coverage: 0.1%

     Func       Func+Child            Hit
     Time     %       Time     %    Count  Function
---------------------------------------------------------
 6475.395  59.4  10726.211  98.4  2336757  _GC_malloc (gc.dll)
   11.505   0.1     28.241   0.3      148  _GC_default_push_other_roots (gc.dll)
    0.838   0.0      1.082   0.0      148  _GC_push_all_stack (gc.dll)
    0.573   0.0      1.086   0.0      148  _GC_push_finalizer_structures (gc.dll)
    0.000   0.0      0.000   0.0      148  Script::SchemeVM::PushRoots(void)

Note* -- As you indicated in your email, my program spends most of the time in _GC_malloc. I think this includes the time spent on garbage collection as well.
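
For reference, the link between an observed slowdown and the share of run time spent allocating/collecting, which the ">60%" remark quoted earlier relies on, can be sketched as follows. This assumes that the rest of the interpreter runs at its original speed and that the previous collector's cost was negligible; neither is established in this thread. The function name implied_gc_fraction is hypothetical.

```cpp
#include <cassert>

// Fraction of total run time attributable to the added allocation/collection
// work, given an overall slowdown factor s = new_time / old_time.
// Assumption: all other code takes exactly as long as before, so the extra
// (s - 1) * old_time is entirely allocator/collector time.
double implied_gc_fraction(double slowdown) {
    assert(slowdown >= 1.0);               // a speedup has no such reading
    return (slowdown - 1.0) / slowdown;    // equals 1 - 1/s
}
```

Under these assumptions a slowdown of 2.5 corresponds to 60% of the time, and a slowdown of 2 to 50%. Any cost the old collector already had would push the true share higher.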

============================================================
|| Here is the measurement on your GC.dll

Module Statistics for gc.dll
----------------------------
    Time in module: 4417.182 millisecond
    Percent of time in module: 40.5%
    Functions in module: 265
    Hits in module: 2879586
    Module function coverage: 39.6%

     Func       Func+Child            Hit
     Time     %       Time     %    Count  Function
---------------------------------------------------------
 2097.812  19.2   2112.566  19.4    10188  _GC_mark_from (mark.obj)
  559.504   5.1    559.504   5.1    13777  _GC_reclaim_clear (reclaim.obj)
  125.837   1.2    171.310   1.6  2380112  _GC_malloc (malloc.obj)
  111.131   1.0    133.912   1.2     2093  _GC_build_fl (new_hblk.obj)
  105.330   1.0    384.709   3.5      296  _GC_apply_to_all_blocks (headers.obj)
   99.517   0.9   4245.413  38.9    18596  _GC_generic_malloc (malloc.obj)
   75.057   0.7    863.804   7.9    18744  _GC_continue_reclaim (reclaim.obj)
   74.471   0.7     74.471   0.7     1794  _GC_reclaim_clear2 (reclaim.obj)
   66.135   0.6     66.135   0.6    25577  _GC_clear_hdr_marks (mark.obj)
   65.022   0.6   4058.541  37.2    18596  _GC_allocobj (alloc.obj)
   56.553   0.5    188.229   1.7    24667  _GC_reclaim_block (reclaim.obj)
   53.679   0.5    750.716   6.9    16503  _GC_reclaim_generic (reclaim.obj)
   49.125   0.5   4107.725  37.7    18596  _GC_generic_malloc_inner (malloc.obj)
   48.739   0.4     73.931   0.7    22284  _GC_block_nearly_full (reclaim.obj)
   45.451   0.4    300.788   2.8      148  _GC_finish_collection (alloc.obj)
   44.964   0.4    111.610   1.0    14328  _GC_allochblk_nth (allchblk.obj)
   42.626   0.4   2286.872  21.0    11816  _GC_mark_some (mark.obj)
   38.236   0.4     38.236   0.4    24815  _GC_next_used_block (headers.obj)
   38.181   0.4     38.181   0.4    18596  _GC_write_hint (os_dep.obj)
   38.171   0.4     38.171   0.4    18596  _GC_invoke_finalizers (finalize.obj)
   38.030   0.3    788.746   7.2    16503  _GC_reclaim_small_nonempty_block (reclaim.obj)
   36.689   0.3     36.689   0.3      148  _GC_read_changed (stubborn.obj)
   35.890   0.3     91.150   0.8    24667  _clear_marks_for_block (mark.obj)
   32.599   0.3   2382.769  21.8      148  _GC_stopped_mark (alloc.obj)
   31.552   0.3     50.877   0.5    18596  _GC_clear_stack (misc.obj)
   26.944   0.2     26.944   0.2      780  _GC_reclaim_clear4 (reclaim.obj)
   26.059   0.2     81.437   0.7     1332  _GC_push_next_marked_uncollectable (mark.obj)
   25.969   0.2     25.969   0.2    25406  _GC_block_empty (reclaim.obj)
   21.913   0.2     21.913   0.2      445  _GC_build_fl_clear4 (new_hblk.obj)
   21.005   0.2    135.505   1.2     2241  _GC_allochblk (allchblk.obj)
   19.420   0.2     19.420   0.2    18744  _GC_approx_sp (mark_rts.obj)
   17.333   0.2     17.333   0.2     9792  _GC_add_to_black_list_stack (blacklst.obj)
   16.263   0.1     17.142   0.2     1184  _GC_push_marked (mark.obj)
   15.211   0.1    252.143   2.3      148  _GC_start_reclaim (reclaim.obj)
   14.620   0.1     14.620   0.1      296  _GC_get_lo_stack_addr (win32_threads.obj)
   13.995   0.1     13.995   0.1      148  _GC_clear_bl (blacklst.obj)
   12.799   0.1     12.799   0.1    11073  _GC_block_nearly_full3 (reclaim.obj)
   12.394   0.1     12.394   0.1    10472  _GC_block_nearly_full1 (reclaim.obj)
   12.031   0.1     15.668   0.1     2118  _GC_add_to_fl (allchblk.obj)
   11.344   0.1     11.344   0.1      148  _GC_stop_world (win32_threads.obj)
   10.281   0.1     16.618   0.2     2093  _GC_get_first_part (allchblk.obj)
    8.093   0.1     32.655   0.3     1939  _GC_freehblk (allchblk.obj)
    7.453   0.1      7.453   0.1    11816  _GC_default_oom_fn (misc.obj)
    7.453   0.1      7.453   0.1    11816  _GC_never_stop_func (misc.obj)
    7.368   0.1    276.784   2.5     2241  _GC_new_hblk (new_hblk.obj)
    7.141   0.1      7.141   0.1      148  _GC_start_world (win32_threads.obj)
    6.898   0.1     22.951   0.2     2094  _setup_header (allchblk.obj)
    6.653   0.1      6.653   0.1     2102  _GC_is_black_listed (blacklst.obj)
    6.282   0.1     11.214   0.1     2093  _GC_install_counts (headers.obj)
    6.064   0.1      6.981   0.1     1939  _GC_free_block_ending_at (allchblk.obj)
============================================================

> c) What fraction of the time is spent context switching (see (b)).

I have just two threads, one of which is taking a nice nap while the other is running. (So there is not much overhead on that).

> d) Does the amount of live data in the GC log look right?

I have not yet looked into this -- I have to dig into your package to find how to use GC_log.
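
Question (a) can be estimated from the gc.dll table above by dividing a function's Func+Child time by the module time. A minimal sketch, assuming that the inclusive time of _GC_stopped_mark is a fair proxy for "time in the marker" (an assumption, not something confirmed in this thread); share_of_module and the constant names are hypothetical, and the figures are copied from the table.

```cpp
#include <cassert>

// Share of one function's inclusive ("Func+Child") time within a module.
double share_of_module(double func_plus_child_ms, double module_ms) {
    assert(module_ms > 0.0 && func_plus_child_ms >= 0.0);
    return func_plus_child_ms / module_ms;
}

// Figures copied from the gc.dll profile above.
const double kGcDllMs           = 4417.182;  // "Time in module"
const double kStoppedMarkMs     = 2382.769;  // _GC_stopped_mark, Func+Child
const double kContinueReclaimMs =  863.804;  // _GC_continue_reclaim, Func+Child

// Marking share of the time spent inside gc.dll.
double marker_share() { return share_of_module(kStoppedMarkMs, kGcDllMs); }

// Share of the lazy sweeping done from the allocation path.
double reclaim_share() { return share_of_module(kContinueReclaimMs, kGcDllMs); }
```

With the posted figures, marker_share() comes to roughly 54% of the time in gc.dll. Inclusive times overlap between callers and callees, so shares computed this way should not be summed across functions.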

> > The drop off in performance was expected,
> > because (1) my original collector is turned on and off
> > at precise points in my C++ code to minimize
> > collection
>
> That's potentially a big win, clearly. You can
> also do that with our collector, though it may be much
> less effective with a global heap than per-thread heaps.

Do you mean that, provided I can make your collector use per-thread heaps, I can keep GC_malloc from checking whether it should garbage collect?

> > (2) my collector uses type information all the time,
> My experience has been that for small objects this matters in the expected
> case only to the extent that it reduces the overhead of checking a real
> pointer, i.e. only if you can actually reduce checking on a "pointer" field
> to a comparison against null. [snipped, as the rest is a bit over my
> head]

And indeed, there is no need for pointer checking, as it does use the type information.

> > (4) heap residency
> > was low for the test cases. -- which favors copying
> > collector over mark-sweep.
> Somewhat. But it also helps our collector a lot, provided you similarly
> increase the heap size. (This should work better in 6.0 than in the 5.x
> releases.) If it doesn't help, I would definitely suspect the locking code.

I have increased the starting heap size for your collector to about 1 Meg, but I have not yet seen a substantial speedup. I will try 2 Megs.

From jmaessen@mit.edu Sat Mar 17 01:37:10 2001
From: jmaessen@mit.edu (Jan-Willem Maessen)
Date: Fri, 16 Mar 2001 20:37:10 -0500
Subject: [gclist] Re: RE: synchronization cost (was: Garbage collection and XML)
Message-ID: <200103170137.UAA01053@lauzeta.mit.edu>

Manoj Plakal replied to my post on Intel synchronization:

> > The LOCK prefix can be prepended only to the following instructions
> > and to those forms of the instructions that use a memory operand:
> > ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB,
> > SUB, XOR, XADD, and XCHG. ...
> >
> > That's a pretty long list, and one of the above instructions is
> > usually what you actually want. For example (getting back to GC
> > here), BTC/BTR/BTS allow you to atomically update a shared
> > allocation/mark bitmap efficiently.
>
> If you look at the part of the Intel manuals describing
> optimizations for the Pentium-II/III/IV, I think you'll
> find that they deprecate the use of prefixes like this.

After a bit of back-and-forth with Manoj, my conclusion is that this was probably a misreading of (paraphrasing) "Don't use prefixes except for 0F". This seems to refer specifically to 0F introducing multi-byte instructions such as MMX, SIMD, CMPXCHG, and the like. My impression is that prefixes bottleneck instruction fetch/decode. There were various other warnings about avoiding the LOCK prefix wherever possible, but I was unable to find anything specifically blessing LOCK; CMPXCHG or deprecating other uses.

That being said, kabbalistic readings of Intel documents are a blood sport in multiprocessor memory model circles. Manoj helpfully provided a link to the appropriate documentation:

> I was looking at the Pentium III Architecture Optimization
> Manual at this URL:
> http://developer.intel.com/design/pentiumii/manuals/245127.htm

You may be best off reading it and drawing your own conclusions!

-Jan-Willem Maessen
Eager Haskell project
jmaessen@mit.edu

From virtualcyber@erols.com Sat Mar 17 05:36:38 2001
From: virtualcyber@erols.com (Ji-Yong D. Chung)
Date: Sat, 17 Mar 2001 00:36:38 -0500
Subject: [gclist] Ooops, sent the wrong profile result
References: <140D21516EC2D3119EE7009027876644049B5CC0@hplex1.hpl.hp.com>
Message-ID: <000001c0aea4$4f689f20$0100007f@cradle>

Hi,

Sorry about this -- I included the wrong profile result for my Evaluator in my earlier email. The correct profiling result is included below. I also included the profile of a run of my Evaluator when it was using my copying collector.

If you look at the results, it looks as though the version using my copying collector is slower than the one using the Boehm collector. That is not the case -- I remeasured relative speeds, and the inclusion of the Boehm collector slows my program by a factor very close to 2. (Not 2.5 as I said in my earlier email).

The fact that the Boehm collector uses only 32% (GC_malloc + GC_malloc_atomic) of overall execution time does not convey the whole picture. After I started using it, my whole program slowed down, even in sections that were not calling GC_malloc. I have no idea why it should be so -- this is where doing computer work feels like voodoo.

Perhaps part of my problem is that I am using the C++ interface rather than the C interface, and not the Boehm collector itself. When I ported to the Boehm collector, I included more C++ syntax -- but most statements I changed are equivalent to their C counterparts or even faster in terms of speed (??). I must be missing something.

============================================================
|| This profile run is the correct one.
||

Module Statistics for tpscript.exe
----------------------------------
    Time in module: 26528.648 millisecond
    Percent of time in module: 100.0%
    Functions in module: 3589
    Hits in module: 6629027
    Module function coverage: 3.7%

     Func       Func+Child            Hit
     Time     %       Time     %    Count  Function
---------------------------------------------------------
11369.062  42.9  26528.578 100.0      187  Script::Evaluator::Evaluate(class Script::Scheme *,class Script::Environment *) (eval1.obj)
 7582.324  28.6   7608.891  28.7  2329467  _GC_malloc (gc.dll)
 2987.185  11.3  26528.166 100.0  1578097  Script::Primitive::Invoke(class Script::Scheme * *,int,class Script::Environment *,class Script::SchemeVM *) (primitives.obj)
  878.060   3.3    878.783   3.3   341136  _GC_malloc_atomic (gc.dll)
  859.823   3.2   1400.314   5.3   139338  Script::PrimitiveBase::Fun(class Script::Scheme * *,int,class Script::Environment *,class Script::SchemeVM *) (primitivesnumber.obj)
  462.704   1.7    462.704   1.7   379151  Script::PrimitiveBase::Fun(class Script::Scheme * *,int,class Script::Environment *,class Script::SchemeVM *) (primitivesvector.obj)
  279.420   1.1    279.420   1.1   338680  Script::Integer::Integer(int) (objnumbers.obj)

============================================================
|| This profile run is one for my program using copying collector.

Module Statistics for tpscript.exe
----------------------------------
    Time in module: 33731.923 millisecond
    Percent of time in module: 100.0%
    Functions in module: 4067
    Hits in module: 7616101
    Module function coverage: 5.6%

     Func       Func+Child            Hit
     Time     %       Time     %    Count  Function
---------------------------------------------------------
13619.320  40.4  33731.807 100.0      227  Script::Evaluator::Evaluate(struct Script::Scheme *,struct Script::Environment *) (eval1.obj)
 6265.637  18.6  33730.767 100.0  1578140  Script::PrimitiveFT::Invoke(struct Script::Primitive *,struct Script::Scheme * *,int,class Script::SchemeVM *) (primitives.obj)
 3255.299   9.7   3255.299   9.7  1748575  Script::Recycler::AllocateNOGC(int,int &) (recycler.obj)
 1564.422   4.6   2526.844   7.5   379151  Script::PrimitiveBase::Fun(struct Script::Primitive *,struct Script::Scheme * *,int,class Script::SchemeVM *) (primitivesvector.obj)
 1059.304   3.1   1328.214   3.9   139338  Script::PrimitiveBase::Fun(struct Script::Primitive *,struct Script::Scheme * *,int,class Script::SchemeVM *) (primitivesnumber.obj)
  818.749   2.4    818.749   2.4   558899  Script::VectorFT::Dimension(struct Script::Scheme *) (objvector.obj)
  633.502   1.9    808.520   2.4   344876  Script::Recycler::Copy(struct Script::Scheme *,void * &) (recycler.obj)
  549.137   1.6    549.137   1.6   499194  Script::VectorFT::ElementAt(struct Script::Scheme *,int) (objvector.obj)

From dfb@watson.ibm.com Mon Mar 19 23:19:39 2001
From: dfb@watson.ibm.com (David F. Bacon)
Date: Mon, 19 Mar 2001 18:19:39 -0500
Subject: [gclist] quantitative comparison of garbage collectors
Message-ID: <3AB6940B.6C1EF068@watson.ibm.com>

does anyone know of work that has been done quantitatively comparing different GCs?

whether or not you know of such work, if (just for example) someone were writing a paper on that topic, what is the parameter space that you think ought to be explored? application speed? memory utilization? others??

thanks in advance,

david

From mwh@dsl.cis.upenn.edu Sun Mar 25 02:48:21 2001
From: mwh@dsl.cis.upenn.edu (Michael Hicks)
Date: Sat, 24 Mar 2001 21:48:21 -0500 (EST)
Subject: [gclist] Re: Daily gclist MIME digest V3 #150
In-Reply-To: <200103231020.f2NAKLp04362@gradient.cis.upenn.edu> from "owner-gclist@lists.iecc.com" at Mar 23, 2001 05:20:21 am
Message-ID: <200103250248.f2P2mL506553@codex.cis.upenn.edu>

> does anyone know of work that has been done quantitatively comparing
> different GCs?

We wrote some papers comparing different GC *mechanisms*, but not different GCs. The metric we were interested in was GC speed, and we hypothesized about effects on the client (mutator). See http://www.cis.upenn.edu/~oscar for papers and details.

Mike

--
Michael Hicks
Ph.D. Candidate, the University of Pennsylvania
http://www.cis.upenn.edu/~mwh
mailto://mwh@dsl.cis.upenn.edu
"In essential things, unity; in doubtful things, liberty; in all things, charity." --Pope John XXIII, Ad Petri Cathedram, and popularly attributed to St. Augustine

From will@ccs.neu.edu Wed Mar 28 16:58:18 2001
From: will@ccs.neu.edu (William D Clinger)
Date: Wed, 28 Mar 2001 11:58:18 -0500 (EST)
Subject: [gclist] older-first garbage collection
Message-ID: <200103281658.f2SGwIg13274@electra.ccs.neu.edu>

[I sent this last week, but it didn't show up so I'm trying a different email address.]

Lars Hansen has completed his PhD thesis, titled "Older-First Garbage Collection in Practice". This thesis describes and compares the performance of several interchangeable garbage collectors in Larceny, our implementation of Scheme for the SPARC.

The abstract and gzip'ed Postscript for this thesis are online at

    http://www.ccs.neu.edu/home/will/GC/lth-thesis/index.html

This URL also contains a link to the development version of Larceny that Lars used to collect the data in his thesis.
This development version of Larceny is newer than version 1.0a1 but contains several compiler bugs that we elected not to fix until Lars had completed his benchmarking. When (most of) these bugs are fixed, we will release Larceny 2.0a1.

Will Clinger
Northeastern University
(on sabbatical at Sun Labs East)

From moss@cs.umass.edu Wed Mar 28 20:53:01 2001
From: moss@cs.umass.edu (Eliot Moss)
Date: Wed, 28 Mar 2001 15:53:01 -0500 (EST)
Subject: [gclist] Post Doc position at UMass/Amherst
Message-ID: <15042.20269.246229.615670@kiwi.cs.umass.edu>

The Architecture and Language Implementation group of the Department of Computer Science, University of Massachusetts at Amherst seeks a post-doctoral researcher to join our laboratory. Our current researcher, Dr. Stephen Blackburn, will be leaving in about a year and we would prefer a person who could overlap with him, starting sometime between June 2001 and January 2002. We have funding for about 2.5 - 3 years.

We are engaged in work on garbage collection, compiler optimization, and persistent and distributed systems. We have our own compiler infrastructure (Scale, written in Java) as well as a source license to IBM's Jalapeno Java Virtual Machine. Focus for this researcher is to be Java performance, for garbage collection, persistence, or distributed shared memory -- or, something else if you can convince me :-).

We have a group of about 15, including 3 professors, a staff programmer, and graduate student research assistants, a significant portion of which are engaged in Scale or Jalapeno work. We are also concerned with developing and simulating new hardware architectural ideas, simulators to support such work, dynamic management of energy/power, and dynamic and adaptive performance improvement (particularly of memory accesses) in general.
We have a good track record in developing and evaluating gc, persistence, and OO language implementation mechanisms, including the Mature Object Space (Train) algorithm used in some commercial Java systems.

If you have a serious interest, send me email and I'll forward application information.

Regards -- Eliot Moss

==============================================================================
J. Eliot B. Moss, Associate Professor    http://www.cs.umass.edu/~moss  www
Department of Computer Science           +1-413-545-4206                voice
140 Governor's Drive, Room 372           +1-413-545-1249                fax
University of Massachusetts              moss@cs.umass.edu              email
Amherst, MA 01003-4610 USA               +1-413-545-3733 Priscilla Coe  sec'y
==============================================================================