From ogoh@asu.edu Fri Feb 7 16:33:45 2003
From: ogoh@asu.edu (Okehee Goh)
Date: Fri, 07 Feb 2003 09:33:45 -0700
Subject: [gclist] GC points and GC map
Message-ID:

Hello,

I'm not sure whether this list is active enough for posting questions. If not, please accept my apologies.

I have questions regarding GC maps and GC points for implementing exact (not conservative) GC.

From my reading of [2], there seem to be GC points, i.e. points that can tolerate GC. Even without considering exact GC, are there still points that can't tolerate GC on a multi-threaded system?

According to the paper [1] (or [2]), in order to implement exact GC, it seems necessary to place GC points at some instructions and to generate a GC map for them (a GC map, or stack map, contains the set of registers and stack locations that refer to live objects in the heap). So are GC points and GC maps necessary only to support exact GC, or are they also relevant to implementing incremental GC?

Actually I'm trying to design an incremental GC that shows deterministic behavior on a multi-threaded, single-processor system. That means only one thread runs at a time. Are there still points that can't tolerate GC when the GC thread tries to run by preempting other worker threads?

I appreciate any opinion. (Forgive me if this question is too basic.)

Regards,
Okehee

[1] Ole Agesen. GC Points in a Threaded Environment. SMLI TR-98-70. Sun Microsystems, Palo Alto, CA, December 1998.
[2] http://wwws.sun.com/software/communitysource/j2me/cdc/
[3] J. M. Stichnoth, G.-Y. Lueh, and M. Cierniak. Support for Garbage Collection at Every Instruction in a Java Compiler. Proceedings of the ACM Conference on Programming Language Design and Implementation, May 1999, pp. 118--127

---------------------------------------
Real-Time System lab of CSE of ASU
CSE Dept, College of EAS, ASU
P.O.Box 875406 Tempe AZ 85287
480-727-7765

From mwh@cs.umd.edu Mon Feb 17 18:46:51 2003
From: mwh@cs.umd.edu (Michael Hicks)
Date: Mon, 17 Feb 2003 13:46:51 -0500
Subject: [gclist] why malloc/free instead of GC?
Message-ID: <1045507612.1760.58.camel@mwhlaptop> A number of performance studies (starting with Zorn in `92, but perhaps before?) and anecdotal evidence now suggests that there is little reason to use malloc/free over GC. Zorn states that the main shot against conservative GC is that it requires a larger memory footprint (and actually uses much of it). Another shot might be the unexpected latency resulting from GC, foiling soft real-time guarantees. My question: are there any studies that indicate under what conditions would benefit from avoiding GC for these reasons (putting aside the safety benefits of GC)? For example, what program characteristics would imply that using BDW would require significantly more memory than would using malloc/free? What sorts of programs would incur excessively long latencies during collection? I can certainly speculate about the answers to these questions (and would welcome list readers to do so), but I am curious if any published or informal studies have been done. While Zorn's study points out which benchmark programs require more memory than others when using BDW, it doesn't go into why that is the case (as far as I could see on skimming it). Thanks, Mike From lassi.tuura@cern.ch Mon Feb 17 19:12:08 2003 From: lassi.tuura@cern.ch (Lassi A. Tuura) Date: Mon, 17 Feb 2003 20:12:08 +0100 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <1045507612.1760.58.camel@mwhlaptop> References: <1045507612.1760.58.camel@mwhlaptop> Message-ID: <3E513408.9080105@cern.ch> You might want to refer to some of the recent discussions on the GCC (GNU compiler collection) mailing list. One hotly discussed topic is the changes in the memory access patterns and other subtle hidden costs. Memory access pattern costs are difficult to measure. Some applications greatly benefit from reusing freed memory quickly because of CPU cache issues; GC may work against that. Yet many allocation patterns are well suited to GC. 
Some apps are very sensitive to clustering of the objects due to the access patterns, and depending on how you allocate objects you can do well or horribly. Depending on your memory allocator logic it may have little or lot to do with GC costs. On the other hand, memory allocation assumptions build into designs which makes it hard to compare a system with and without GC. For an unbiased comparison you might have to rewrite the whole system. BTW, GCC doesn't use the BDW collector but its own scheme, which is another factor to fold into the impact calculations. //lat -- prototype, n.: First stage in the life cycle of a computer product, followed by pre-alpha, alpha, beta, release version, corrected release version, upgrade, corrected upgrade, etc. Unlike its successors, the prototype is not expected to work. From basile@starynkevitch.net Mon Feb 17 19:49:45 2003 From: basile@starynkevitch.net (Basile STARYNKEVITCH) Date: Mon, 17 Feb 2003 20:49:45 +0100 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <1045507612.1760.58.camel@mwhlaptop> References: <1045507612.1760.58.camel@mwhlaptop> Message-ID: <15953.15577.512872.84744@hector.lesours> >>>>> "Michael" == Michael Hicks writes: Michael> A number of performance studies (starting with Zorn in Michael> `92, but perhaps before?) and anecdotal evidence now Michael> suggests that there is little reason to use malloc/free Michael> over GC. Zorn states that the main shot against Michael> conservative GC is that it requires a larger memory Michael> footprint (and actually uses much of it). Another shot Michael> might be the unexpected latency resulting from GC, Michael> foiling soft real-time guarantees. I tend to believe that people prefer malloc&free to Boehm's GC without real technical reasons, but mostly for social reasons. 
[The almost only soft realtime guarantee people want -outside the embedded market- is compatibility with graphical user interfaces time requirements] Most people I know that are proficient C coders did not even heard of GC techniques (in particular Boehm's GC) before I talked them about it. FWIW, I did coded a small (unrealistic & simplistic) C example, and found that malloc & free (on a 1.2 Athlon or 2 GHz P4) under Linux is significantly faster than Boehm's GC. (IIRC, a typical malloc is < 1 microsecond, while a GC_malloc is < 40 microseconds). Apparently, simple GC techniques are no more taught in CS classes... (this was not true 20 years ago, at least not in France - I learnt about GC in a lecture on Lisp in License, about the equivalent of Bachelor?). I find funny that Java rehabilitated the whole GC idea, while Java's GC (because of the synchrony & finalization properties of the language specification) is necessarily complex & slow. The perception of GC by old (even technical) managers is the GC of Lisp machines or systems, at the time when RAM was extremely expensive.. [so garbage collection used the disk swap, and was heard at that time because the disk made lot of noises] This is no more the case. Actually, I'm surprised that today's major opensource projects (like Apache, GNOME, KDE...) don't use GC [with the exception of Emacs, which used to explicitly show GC periods to user - this was a wrong decision, because it made users complain against GC]. On a related note: people usually don't believe that modern ML (eg Ocaml) or Lisp (eg CMUCL) implementations can perform about as quickly as C (ie less than 2 times slower than C). 
-- Basile STARYNKEVITCH http://starynkevitch.net/Basile/ email: basilestarynkevitchnet alias: basiletunesorg 8, rue de la Faïencerie, 92340 Bourg La Reine, France From basile@starynkevitch.net Mon Feb 17 20:44:55 2003 From: basile@starynkevitch.net (Basile STARYNKEVITCH) Date: Mon, 17 Feb 2003 21:44:55 +0100 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <011f01c2d6c0$14e63010$1c02a8c0@watson.ibm.com> References: <1045507612.1760.58.camel@mwhlaptop> <15953.15577.512872.84744@hector.lesours> <011f01c2d6c0$14e63010$1c02a8c0@watson.ibm.com> Message-ID: <15953.18887.922393.270023@hector.lesours> >>>>> "David" == David F Bacon writes: Citing me, Basile: Basile>> FWIW, I did coded a small (unrealistic & simplistic) C example, Basile>> and found that malloc & free (on a 1.2 Athlon or 2 GHz P4) Basile>> under Linux is significantly faster than Boehm's GC. (IIRC, a Basile>> typical malloc is < 1 microsecond, while a GC_malloc is < 40 Basile>> microseconds). David> umm... 1 microsecond on a 2 GHz machine makes 2000 cycles, David> yes? let's conservatively say that you achieve only 0.1 David> instructions per cycle. a tuned allocation sequence, David> inlined by the compiler, is between 10 and 20 instructions David> for the Jikes RVM (Java VM from IBM Research). so let's David> call it 200 cycles in the absolute worst case. how is David> GC_malloc spending 40,000 cycles per alloc? 
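The cycle arithmetic in David's quoted estimate can be reproduced with a minimal C sketch. The function names below are hypothetical and chosen only for illustration; the inputs (1 microsecond, 2 GHz, 20 instructions, 0.1 instructions per cycle) are the figures from the quote. Note that at 2 GHz, 40 microseconds works out to 80,000 cycles; the 40,000 figure in the quote corresponds to a clock of about 1 GHz.

```c
#include <assert.h>

/* Hypothetical helper: convert a measured time per operation (seconds)
   into CPU cycles at the given clock rate (Hz). */
double cycles_per_op(double seconds_per_op, double clock_hz)
{
    assert(seconds_per_op >= 0.0 && clock_hz > 0.0);
    return seconds_per_op * clock_hz;
}

/* Hypothetical helper: cycles needed to retire `instructions`
   instructions at a sustained rate of `ipc` instructions per cycle. */
double cycles_for_instructions(double instructions, double ipc)
{
    assert(instructions >= 0.0 && ipc > 0.0);
    return instructions / ipc;
}
```

For example, cycles_per_op(1e-6, 2e9) gives about 2000 cycles, and cycles_for_instructions(20, 0.1) gives about 200 cycles, which is the "absolute worst case" in the quote.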
Here is my test program:

################################################################
// file essm.c
/* header names were lost in the archived copy; these are the ones
   the code below needs */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/times.h>
#include <unistd.h>

#ifdef USEGC
#include <gc.h>
#define malloc(S) GC_malloc(S)
#ifdef USEGCFREE
#define free(P) GC_free(P)
#else
#define free(P) {}
#endif
#endif

void* tabptr[16];

int main (int argc, char **argv)
{
  long long maxcnt = 1000000;
  long long i = 0;
  int r=0;
  int s=0;
  double usert, syst;
  struct tms t;
  struct st *p = 0;
  if (argc > 1) maxcnt = atol (argv[1])*1000;
  memset(&t, 0, sizeof(t));
  if (maxcnt<100000) maxcnt=100000;
  printf ("begin maxcnt=%lld=%e\n", maxcnt, (double)maxcnt);
  for (i = 0; i < maxcnt; i++) {
    if ((i & 0x1fffff) == 0) printf ("i=%lld\n", i);
    /* pick one of 16 slots, free what it holds, and refill it */
    r = lrand48() & 0xf;
    if (tabptr[r]) free(tabptr[r]);
    /* request between 100 and 100099 bytes */
    s = (lrand48() % 100000) + 100;
    tabptr[r] = malloc(s);
  };
  times(&t);
  usert = ((double)t.tms_utime) / sysconf(_SC_CLK_TCK);
  syst = ((double)t.tms_stime) / sysconf(_SC_CLK_TCK);
  printf ("end maxcnt=%lld=%e\n", maxcnt, (double)maxcnt);
  printf ("cputime user=%g system=%g total == per iteration user=%g system=%g\n",
          usert, syst, usert/(double)maxcnt, syst/(double)maxcnt);
  return 0;
}
################################################################

My machine runs Debian/Sid (unstable; I ran apt-get update & dist-upgrade today), glibc is 2.3.1, gcc is 3.2.3 20030210 (Debian prerelease), the processor is an AthlonXP2000 (a tiny bit overclocked), and RAM is 512 Mbytes (as 2*256 DDRAM 2700 memory banks):

% cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(TM) XP 2000+
stepping        : 0
cpu MHz         : 1733.438
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3460.30

% free
             total       used       free     shared    buffers     cached
Mem:        514988     466708      48280          0      91364     209880
-/+ buffers/cache:     165464     349524
Swap:      1025000      29716     995284

================ compilation with malloc&free from glibc2.3.1
gcc -O3 essm.c -o essm
./essm gives:
begin maxcnt=1000000=1.000000e+06
i=0
end maxcnt=1000000=1.000000e+06
cputime user=0.61 system=0.8 total == per iteration user=6.1e-07 system=8e-07

================ compilation with Boehm's GC
gcc -O3 -DUSEGC essm.c -o essm_gc -lgc
(the -lgc is /usr/lib/libgc.so.6 from Debian)
./essm_gc gives:
begin maxcnt=1000000=1.000000e+06
i=0
end maxcnt=1000000=1.000000e+06
cputime user=310.64 system=0.42 total == per iteration user=0.00031064 system=4.2e-07

================ compilation with Boehm's GC using explicit free
gcc -O3 -DUSEGC -DUSEGCFREE essm.c -o essm_gcfr -lgc
./essm_gcfr
begin maxcnt=1000000=1.000000e+06
i=0
end maxcnt=1000000=1.000000e+06
cputime user=116.45 system=0.15 total == per iteration user=0.00011645 system=1.5e-07
################################################################

Actually I am surprised by the resulting times. I suppose that lrand48 happens to produce a valid pointer at inappropriate times ....

So it seems that on this example a glibc malloc+free takes about 0.6 microseconds, while a Boehm GC_malloc+GC_free takes about 116 microseconds.

I'm interested if someone could reproduce the test and confirm or refute the (approximate) timing.

Regards
--
Basile STARYNKEVITCH http://starynkevitch.net/Basile/
email: basilestarynkevitchnet
alias: basiletunesorg
8, rue de la Faïencerie, 92340 Bourg La Reine, France

From ghudson@MIT.EDU Mon Feb 17 23:47:11 2003
From: ghudson@MIT.EDU (Greg Hudson)
Date: 17 Feb 2003 18:47:11 -0500
Subject: [gclist] why malloc/free instead of GC?
In-Reply-To: <15953.15577.512872.84744@hector.lesours>
References: <1045507612.1760.58.camel@mwhlaptop> <15953.15577.512872.84744@hector.lesours>
Message-ID: <1045525631.1303.65.camel@error-messages.mit.edu>

Here are some reasons not to use GC in a new project written in C:

* It's a very deep requirement which is not provided by the native system.
The Boehm conservative GC may be very good and portable, but it's unlikely to be 100% perfect. And if anything goes wrong with memory, now you have to suspect this piece of arcane magic in addition to your own code. * There are network effects. If you're writing a library, you may not want to require everyone who uses your library to use GC. If you're writing an application which uses libraries, you'll have to memory-manage those libraries' data objects anyway. * There's this finalization problem. Finalizers aren't run on a guaranteed schedule, so expensive objects like file descriptors can't be trusted to finalizers. But if you're not explicitly managing your memory, you may lose track of when to deconstruct objects containing expensive resources (e.g. if there is a "file-as-string" type which acts like a string type, now you have to explicitly manage all strings in order to explicitly manage your file descriptors). The first two arguments don't apply to a language like Java with built-in GC services. The third argument does, but maybe it doesn't come up very often in practice. Of course, even if these are good arguments in opposition to GC, they don't necessarily have much to do with the reasons programmers tend not to use GC in the real world. Most likely they just don't know much about it. From hans_boehm@hp.com Tue Feb 18 04:24:27 2003 From: hans_boehm@hp.com (Boehm, Hans) Date: Mon, 17 Feb 2003 20:24:27 -0800 Subject: [gclist] why malloc/free instead of GC? Message-ID: <75A9FEBA25015040A761C1F74975667DA136A4@hplex4.hpl.hp.com> I would add: GC object roundtrip times are pretty much unavoidably proportional to the object size, where malloc + free times can be nearly constant. If you allocate primarily large objects, malloc+free will be cheaper. (For sufficiently small objects, it usually isn't, at least based on my measurements. Conservative collectors like large objects even less.) On the other hand: I think finalization isn't an argument against tracing GCs. 
You run into fundamentally the same issues with, say, user-implemented reference counting in C++. The problems are inherent in abstracting away or hiding precise deallocation times. And the problems aren't anywhere near unsolvable. See my 2003 POPL paper (also at http://www.hpl.hp.com/techreports/2002/HPL-2002-335.html) for details. Finalization also usually provides an easy mechanism for dealing with libraries requiring explicit deallocation calls. Thus I don't think that's a major problem. Hans From hans_boehm@hp.com Tue Feb 18 04:37:00 2003 From: hans_boehm@hp.com (Boehm, Hans) Date: Mon, 17 Feb 2003 20:37:00 -0800 Subject: [gclist] why malloc/free instead of GC? Message-ID: <75A9FEBA25015040A761C1F74975667DA136A5@hplex4.hpl.hp.com> Note that this gives you an average object size of a little over 50KB, and that the malloc/free version never touches the allocated memory. Any garbage collector will lose against malloc/free under those conditions, due to both the huge average object size, and the fact that collectors generally like to initialize at least possible pointer fields within objects. I don't think any real applications behave quite like this. There are some that are close enough that they shouldn't use a GC. Hans -----Original Message----- From: Basile STARYNKEVITCH To: David F. Bacon Cc: gclist@iecc.com Sent: 2/17/03 12:44 PM Subject: Re: [gclist] why malloc/free instead of GC? >>>>> "David" == David F Bacon writes: Citing me, Basile: Basile>> FWIW, I did coded a small (unrealistic & simplistic) C example, Basile>> and found that malloc & free (on a 1.2 Athlon or 2 GHz P4) Basile>> under Linux is significantly faster than Boehm's GC. (IIRC, a Basile>> typical malloc is < 1 microsecond, while a GC_malloc is < 40 Basile>> microseconds). David> umm... 1 microsecond on a 2 GHz machine makes 2000 cycles, David> yes? let's conservatively say that you achieve only 0.1 David> instructions per cycle. 
a tuned allocation sequence, David> inlined by the compiler, is between 10 and 20 instructions David> for the Jikes RVM (Java VM from IBM Research). so let's David> call it 200 cycles in the absolute worst case. how is David> GC_malloc spending 40,000 cycles per alloc? Here is my test program: ################################################################ // file essm.c #include #include #include #include #ifdef USEGC #include #define malloc(S) GC_malloc(S) #ifdef USEGCFREE #define free(P) GC_free(P) #else #define free(P) {} #endif #endif void* tabptr[16]; int main (int argc, char **argv) { long long maxcnt = 1000000; long long i = 0; int r=0; int s=0; double usert, syst; struct tms t; struct st *p = 0; if (argc > 1) maxcnt = atol (argv[1])*1000; memset(&t, 0, sizeof(t)); if (maxcnt<100000) maxcnt=100000; printf ("begin maxcnt=%lld=%e\n", maxcnt, (double)maxcnt); for (i = 0; i < maxcnt; i++) { if ((i & 0x1fffff) == 0) printf ("i=%lld\n", i); r = lrand48() & 0xf; if (tabptr[r]) free(tabptr[r]); s = (lrand48() % 100000) + 100; tabptr[r] = malloc(s); }; times(&t); usert = ((double)t.tms_utime) / sysconf(_SC_CLK_TCK); syst = ((double)t.tms_stime) / sysconf(_SC_CLK_TCK); printf ("end maxcnt=%lld=%e\n", maxcnt, (double)maxcnt); printf ("cputime user=%g system=%g total == per iteration user=%g system=%g\n", usert, syst, usert/(double)maxcnt, syst/(double)maxcnt); return 0; } ################################################################ My machine is a Debian/Sid (the unstable, I made apt-get update & dist-upgrade today), glibc is 2.3.1, gcc is 3.2.3 20030210 (Debian prerelease), processor is an AthlonXP2000 (a tinybit overclocked), RAM is 512Mbytes (as 2*256 DDRAM 2700 memory banks): % cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 8 model name : AMD Athlon(TM) XP 2000+ stepping : 0 cpu MHz : 1733.438 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags 
: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow bogomips : 3460.30 % free total used free shared buffers cached Mem: 514988 466708 48280 0 91364 209880 -/+ buffers/cache: 165464 349524 Swap: 1025000 29716 995284 ================ compilation with malloc&free from glibc2.3.1 gcc -O3 essm.c -o essm ./essm gives: begin maxcnt=1000000=1.000000e+06 i=0 end maxcnt=1000000=1.000000e+06 cputime user=0.61 system=0.8 total == per iteration user=6.1e-07 system=8e-07 ================ compilation with Boehm's GC gcc -O3 -DUSEGC essm.c -o essm_gc -lgc (the -lgc is /usr/lib/libgc.so.6 from Debian) ./essm_gc gives: begin maxcnt=1000000=1.000000e+06 i=0 end maxcnt=1000000=1.000000e+06 cputime user=310.64 system=0.42 total == per iteration user=0.00031064 system=4.2e-07 ================ compilation with Boehm's GC using explicit free gcc -O3 -DUSEGC -DUSEGCFREE essm.c -o essm_gcfr -lgc ./essm_gcfr begin maxcnt=1000000=1.000000e+06 i=0 end maxcnt=1000000=1.000000e+06 cputime user=116.45 system=0.15 total == per iteration user=0.00011645 system=1.5e-07 ################################################################ actually I am surprised by the resulting times. I suppose than lrand48 happens to produce a valid pointer at inappropriate times .... So it seems that on this example a glibc malloc+free last about 0.6microsecond, while a Boehm GC_malloc+GC_free last about 116 microseconds. I'm interested if someone could reproduce the test and confirm or infirm the (approximate) timing. Regards -- Basile STARYNKEVITCH http://starynkevitch.net/Basile/ email: basilestarynkevitchnet alias: basiletunesorg 8, rue de la Faïencerie, 92340 Bourg La Reine, France
From johnsson@crt.se Tue Feb 18 08:42:47 2003 From: johnsson@crt.se (Thomas Johnsson) Date: Tue, 18 Feb 2003 09:42:47 +0100 Subject: [gclist] Games in C++ with GC? (why malloc/free instead of GC?) In-Reply-To: <15953.15577.512872.84744@hector.lesours> References: <1045507612.1760.58.camel@mwhlaptop> <15953.15577.512872.84744@hector.lesours> Message-ID: <15953.61959.985000.886450@gargle.gargle.HOWL> Further to this disscussion about the use of GC: Games, espcially 3D games, are often written in C++, using DirectX (or OpenGL).
Is there any experience in using GC ()conservative or not) in such applications? Being a sort of real time program, I can imagine GC latencies is a potential problem ... ??? -- Thomas Johnsson Basile STARYNKEVITCH writes: > >>>>> "Michael" == Michael Hicks writes: > > Michael> A number of performance studies (starting with Zorn in > Michael> `92, but perhaps before?) and anecdotal evidence now > Michael> suggests that there is little reason to use malloc/free > Michael> over GC. Zorn states that the main shot against > Michael> conservative GC is that it requires a larger memory > Michael> footprint (and actually uses much of it). Another shot > Michael> might be the unexpected latency resulting from GC, > Michael> foiling soft real-time guarantees. > > I tend to believe that people prefer malloc&free to Boehm's GC without > real technical reasons, but mostly for social reasons. [The almost > only soft realtime guarantee people want -outside the embedded market- > is compatibility with graphical user interfaces time requirements] > > Most people I know that are proficient C coders did not even heard of > GC techniques (in particular Boehm's GC) before I talked them about > it. > [etc] From Nick.Barnes@pobox.com Tue Feb 18 10:57:09 2003 From: Nick.Barnes@pobox.com (Nick Barnes) Date: Tue, 18 Feb 2003 10:57:09 +0000 Subject: [gclist] Games in C++ with GC? (why malloc/free instead of In-Reply-To: Message from Thomas Johnsson of "Tue, 18 Feb 2003 09:42:47 +0100." <15953.61959.985000.886450@gargle.gargle.HOWL> Message-ID: <22350.1045565829@thrush.ravenbrook.com> At 2003-02-18 08:42:47+0000, Thomas Johnsson writes: > Games, espcially 3D games, are often written in C++, using DirectX > (or OpenGL). Is there any experience in using GC ()conservative or > not) in such applications? Being a sort of real time program, I can > imagine GC latencies is a potential problem ... ??? 
Many games companies these days are using Lua (a very simple embedded language), which has a very simple stop-and-copy collector. Some games use other GCed languages or sub-systems. Some of these games have latency problems, and some do not. Some of the latency problems may be due to the GC, and some may not. Crash Bandicoot, a PS1 platformer a few years back, was written in a Lisp dialect. However, this is all somewhat theoretical because most games do not allocate during game play. The X-Box snowboarding game "Amped" has very bad frame rate problems, especially in the GUI. It also uses Lua, especially in the GUI. Whether these facts are related is unclear. Other games apparently use Lua more than Amped does, and yet do not have frame rate problems. Nick Barnes Ravenbrook Limited From weigelt@metux.de Tue Feb 18 13:04:21 2003 From: weigelt@metux.de (Enrico Weigelt) Date: Tue, 18 Feb 2003 14:04:21 +0100 Subject: [gclist] glib/gtk w/ GC Message-ID: <20030218130421.GA30555@metux.de> hi folks, i'm gonna start working on an gc based derivate of the glib/gtk. anyone interested in helping ? cu -- --------------------------------------------------------------------- Enrico Weigelt == metux ITS Webhosting ab 5 EUR/Monat. UUCP, rawIP und vieles mehr. phone: +49 36207 519931 www: http://www.metux.de/ fax: +49 36207 519932 email: contact@metux.de cellphone: +49 174 7066481 smsgate: sms.weigelt@metux.de --------------------------------------------------------------------- Diese Mail wurde mit UUCP versandt. http://www.metux.de/uucp/ From weigelt@metux.de Tue Feb 18 13:31:50 2003 From: weigelt@metux.de (Enrico Weigelt) Date: Tue, 18 Feb 2003 14:31:50 +0100 Subject: [gclist] why malloc/free instead of GC? 
In-Reply-To: <75A9FEBA25015040A761C1F74975667DA136A5@hplex4.hpl.hp.com> References: <75A9FEBA25015040A761C1F74975667DA136A5@hplex4.hpl.hp.com> Message-ID: <20030218133150.GC30555@metux.de> On Mon, Feb 17, 2003 at 08:37:00PM -0800, Boehm, Hans wrote: > > Note that this gives you an average object size of a little over 50KB, > and that the malloc/free version never touches the allocated memory. > Any garbage collector will lose against malloc/free under those conditions, > due to both the huge average object size, and the fact that collectors > generally like to initialize at least possible pointer fields within objects. GCs can cause problems if you use really huge memory. If it has no type information, it must scan through all memory chunks and look for pointers. With type info (i.e. in oberon) it could become much faster. But there's still another problem: if your application holds many pages, which aren't accessed for quite a long time, they're possibly swapped out, but each time the GC runs over the heap, they have to be swapped in again. hmm, is there any way to avoid scanning over the whole heap each time ? cu -- --------------------------------------------------------------------- Enrico Weigelt == metux ITS Webhosting ab 5 EUR/Monat. UUCP, rawIP und vieles mehr. phone: +49 36207 519931 www: http://www.metux.de/ fax: +49 36207 519932 email: contact@metux.de cellphone: +49 174 7066481 smsgate: sms.weigelt@metux.de --------------------------------------------------------------------- Diese Mail wurde mit UUCP versandt. http://www.metux.de/uucp/ From dfb@watson.ibm.com Tue Feb 18 13:56:32 2003 From: dfb@watson.ibm.com (David F. Bacon) Date: Tue, 18 Feb 2003 08:56:32 -0500 Subject: [gclist] why malloc/free instead of GC? References: <75A9FEBA25015040A761C1F74975667DA136A4@hplex4.hpl.hp.com> Message-ID: <01e301c2d755$8d1b6510$1c02a8c0@watson.ibm.com> hans, by "roundtrip" do you mean malloc+free? 
i don't understand your statement about proportionality to object size in GC. also, why do conservative collectors dislike large objects? is it because a floating point number could cause a dead large object to be retained? david ----- Original Message ----- From: "Boehm, Hans" To: "'Greg Hudson '" ; "'Basile STARYNKEVITCH '" Cc: Sent: Monday, February 17, 2003 11:24 PM Subject: Re: [gclist] why malloc/free instead of GC? > I would add: > > GC object roundtrip times are pretty much unavoidably proportional to the object size, where malloc + free times can be nearly constant. If you allocate primarily large objects, malloc+free will be cheaper. (For sufficiently small objects, it usually isn't, at least based on my measurements. Conservative collectors like large objects even less.) From jc@port25.com Tue Feb 18 14:11:07 2003 From: jc@port25.com (Juergen Christoffel) Date: Tue, 18 Feb 2003 15:11:07 +0100 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <20030218133150.GC30555@metux.de> References: <75A9FEBA25015040A761C1F74975667DA136A5@hplex4.hpl.hp.com> <20030218133150.GC30555@metux.de> Message-ID: <20030218141107.GB14818@port25.com> On Tue, Feb 18, 2003 at 02:31:50PM +0100, Enrico Weigelt wrote: > hmm, is there any way to avoid scanning over the whole heap each time ? Yes, generational GC for example. Back in the eighties and early nineties, when Symbolics introduced generational GC on their Lisp Machines, large programs did run faster with GC turned on than without GC when the "Ephemeral GC" (that was their name for Generational GC, IIRC) was turned on because this reduced working sets. --jc -- Non cogitant. Ergo non sunt. -- Georg Christoph Lichtenberg From cef@geodesic.com Tue Feb 18 14:27:55 2003 From: cef@geodesic.com (Charles Fiterman) Date: Tue, 18 Feb 2003 08:27:55 -0600 Subject: [gclist] why malloc/free instead of GC? 
In-Reply-To: <01e301c2d755$8d1b6510$1c02a8c0@watson.ibm.com> References: <75A9FEBA25015040A761C1F74975667DA136A4@hplex4.hpl.hp.com> Message-ID: <5.1.1.6.0.20030218080757.03080618@pop3.geodesic.com>

Consider a large online application with the following common requirement: 90% of all requests will be filled in one second, and all requests will be filled in ten seconds.

If you don't want to crash you must have type safety, and that implies garbage collection of some sort. Large applications are written by pools of programmers, some of whom are very bad.

If you have mark and sweep or moving collection, at some point your application will become so large that collection time causes you to violate that requirement no matter how many CPUs you add. You must have a way to distribute free operations and not run them all at once.

If you make some restrictions on class structures and control collections centrally, you can have reference counting and related methods that distribute frees. These are inefficient, but we are supposing you can always add more CPUs. The great advantage of reference counting is that it is scalable to very large sizes.

Reference counting also has the advantage that the destruction of objects can have rational finalizers. Finalizers must be safe, general, sure, prompt and ordered. Safe means they don't violate the type system. General means finalizers can run any code in the language and have that code produce normal results; for example, exceptions can't just get discarded. Sure means if you build an object it gets to destroy itself. Prompt means finalizers aren't indefinitely postponed. Ordered means finalizers run in a determined order; if you ship the application you don't change the order, creating portability bugs.

From basile@starynkevitch.net Tue Feb 18 14:49:47 2003 From: basile@starynkevitch.net (Basile STARYNKEVITCH) Date: Tue, 18 Feb 2003 15:49:47 +0100 Subject: [gclist] why malloc/free instead of GC?
In-Reply-To: <5.1.1.6.0.20030218080757.03080618@pop3.geodesic.com> References: <75A9FEBA25015040A761C1F74975667DA136A4@hplex4.hpl.hp.com> <5.1.1.6.0.20030218080757.03080618@pop3.geodesic.com> Message-ID: <15954.18443.77701.827155@hector.lesours>

>>>>> "Charles" == Charles Fiterman writes:

Charles> Consider a large online application with the following
Charles> common requirement. 90% of all requests will be filled
Charles> in one second. All requests will be filled in ten
Charles> seconds.

There are two meanings of large here:

1. application with a big memory requirement at runtime
2. application with a big amount of code

These two meanings are not related. Some applications (e.g. numerical engineering) may be a small amount of code requiring a huge amount of memory for data. And some programs have a huge amount of code (e.g. lots of special case processing), but need only a small amount of data memory to run.

Charles> If you don't want to crash you must have type safety and
Charles> that implies garbage collection of some sort. Large
Charles> applications are written by pools of programmers some of
Charles> whom are very bad.

Yes. The programmer's time is an increasingly expensive resource.

Charles> If you have mark and sweep or moving collection at some
Charles> point your application will become so large that
Charles> collection time causes you to violate it no matter how
Charles> many CPU's you add. You must have a way to distribute
Charles> free operations and not run them all at once.

It seems to me that the large (at least in meaning 2) applications that exist and are coded in a GC-ed language (like Lisp, Smalltalk, Java, Ocaml, ....) never spend more than a few consecutive seconds in garbage collection (just because a few seconds on today's machines is a lot of CPU time). Of course I would suppose that the largest software is still coded in (decades old) Cobol (or perhaps Fortran). I'm not sure it is easy to maintain.
Since copying a hundred megabytes per second is realistic on today's machines, I would believe that a full major garbage collection of a gigabyte heap (which for me is a big heap) should require less than 10 seconds. -- Basile STARYNKEVITCH http://starynkevitch.net/Basile/ email: basilestarynkevitchnet alias: basiletunesorg 8, rue de la Faïencerie, 92340 Bourg La Reine, France From cef@geodesic.com Tue Feb 18 15:13:36 2003 From: cef@geodesic.com (Charles Fiterman) Date: Tue, 18 Feb 2003 09:13:36 -0600 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <15954.18443.77701.827155@hector.lesours> References: <5.1.1.6.0.20030218080757.03080618@pop3.geodesic.com> <75A9FEBA25015040A761C1F74975667DA136A4@hplex4.hpl.hp.com> <5.1.1.6.0.20030218080757.03080618@pop3.geodesic.com> Message-ID: <5.1.1.6.0.20030218090549.02faf830@pop3.geodesic.com> At 03:49 PM 2/18/2003 +0100, Basile STARYNKEVITCH wrote: > >>>>> "Charles" == Charles Fiterman writes: > > Charles> Consider a large online application with the following > Charles> common requirement. 90% of all requests will be filled > Charles> in one second. All requests will be filled in ten > Charles> seconds. > >There are two meanings of large here : > > 1. application with a big memory requirement at runtime > > 2. application with a big amount of code Both. >Since copying a hundred megabytes per second is realistic on today's >machines, I would believe that a full major garbage collection of a >gigabyte heap (which for me is a big heap) should require less than 10 >seconds. The commercial world is approaching 10 gigabyte heaps. This means trouble. Programmers in such environments are starting to manage their own heaps to avoid garbage collection. This only makes storage requirements expand even faster. Languages gain power more from their restrictions than their capabilities. Functional languages gain referential transparency and composition from the loss of side effects. 
Type safe languages can be used in places where people fear viruses. Giving up circular data structures buys finalizers and large applications. From kanderson@bbn.com Tue Feb 18 16:19:04 2003 From: kanderson@bbn.com (Ken Anderson) Date: Tue, 18 Feb 2003 11:19:04 -0500 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <15953.15577.512872.84744@hector.lesours> References: <1045507612.1760.58.camel@mwhlaptop> <1045507612.1760.58.camel@mwhlaptop> Message-ID: <5.0.2.1.2.20030218111756.01f69278@zima.bbn.com> At 02:49 PM 2/17/2003, Basile STARYNKEVITCH wrote: >On a related note: people usually don't believe that modern ML (eg >Ocaml) or Lisp (eg CMUCL) implementations can perform about as quickly >as C (ie less than 2 times slower than C). They can even be faster http://www.ai.mit.edu/~gregs/ll1-discuss-archive-html/msg01817.html From pechtcha@cs.nyu.edu Tue Feb 18 17:28:51 2003 From: pechtcha@cs.nyu.edu (Igor Pechtchanski) Date: Tue, 18 Feb 2003 12:28:51 -0500 (EST) Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <5.1.1.6.0.20030218090549.02faf830@pop3.geodesic.com> Message-ID: On Tue, 18 Feb 2003, Charles Fiterman wrote: > At 03:49 PM 2/18/2003 +0100, Basile STARYNKEVITCH wrote: > > >>>>> "Charles" == Charles Fiterman writes: > > > > Charles> Consider a large online application with the following > > Charles> common requirement. 90% of all requests will be filled > > Charles> in one second. All requests will be filled in ten > > Charles> seconds. > > > >There are two meanings of large here : > > > > 1. application with a big memory requirement at runtime > > > > 2. application with a big amount of code > > Both. > > >Since copying a hundred megabytes per second is realistic on today's > >machines, I would believe that a full major garbage collection of a > >gigabyte heap (which for me is a big heap) should require less than 10 > >seconds. > > The commercial world is approaching 10 gigabyte heaps. This means trouble. 
> Programmers in such environments are starting to manage their own heaps to > avoid garbage collection. This only makes storage requirements expand even > faster. I would think that there are three major classes of applications with huge heap sizes: 1) those where the large heap size comes not from the amount of data handled in any one transaction, but rather from the number of concurrent transactions. Since each transaction is (usually) a separate entity, approaches like region-based GC or transaction-specific heaps might work well there. 2) those actually handling massive amounts of data for each transaction (such as search engines). Such applications mostly do not have a lot of simultaneous live data, just a large data stream. Since the performance of, say, copying GC is proportionate to the amount of live data, this shouldn't affect performance. 3) those with a lot of shared data. This data would probably be long-lived, and, once promoted to the oldest generation, rarely collected anyway. And there are techniques, like pretenuring, that allow the relevant data to be promoted sooner. The above are all susceptible to existing non-reference-counting GC techniques, albeit with some tuning. It'd be interesting to know if there is a fourth class of applications that actually maintain large amounts of simultaneous short-lived *live* data throughout the execution. > Languages gain power more from their restrictions than their capabilities. > Functional languages gain referential transparency and composition from > the loss of side effects. Type safe languages can be used in places where > people fear viruses. Giving up circular data structures buys finalizers and > large applications. AFAICS, the only thing non-RC GC doesn't scale to is applications with a large dynamic (i.e., fluid) working set (the fourth class above). I'm not at all sure any real applications fall into that category, although it would be interesting to be proven wrong. 
Igor -- http://cs.nyu.edu/~pechtcha/ |\ _,,,---,,_ pechtcha@cs.nyu.edu ZZZzz /,`.-'`' -. ;-;;,_ igor@watson.ibm.com |,4- ) )-,_. ,\ ( `'-' Igor Pechtchanski '---''(_/--' `-'\_) fL a.k.a JaguaR-R-R-r-r-r-.-.-. Meow! Oh, boy, virtual memory! Now I'm gonna make myself a really *big* RAMdisk! -- /usr/games/fortune From arlie@sublinear.org Tue Feb 18 17:32:03 2003 From: arlie@sublinear.org (Arlie Davis) Date: Tue, 18 Feb 2003 12:32:03 -0500 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <20030218133150.GC30555@metux.de> Message-ID: <000201c2d773$a6f685a0$5bd1dc0c@sulaco> Microsoft's CLR (.Net Framework) does a good job on "large" objects. Objects above a certain threshold (20k, I believe) are allocated in a traditional malloc/free heap, but their lifetime is still tracked through GC. Also, since the CLR has access to all type information, it only scans memory locations that are known to be pointers. This is an elegant solution, because it shows that object lifetime (explicit free vs. GC) can be separated from allocation mechanism (contiguous heap w/compaction vs. fragmentable heap). The same approach could easily be adopted by other GC implementations. Also, consider the example (brought up here) of an application which must process a high volume of transactions, with a high degree of consistency of time required per transaction. Just because you use a GC, doesn't mean you *always* allocate fresh objects for every transaction. You can still -- selectively -- use object pooling. Many large applications that are based on explicit-free gain performance by pooling instances of objects. The same can be applied to environments that use GCs. Actually, since a single reference to a small connected graph of objects will retain that entire graph, it's easy to pool entire graphs, by just holding a single reference. The only risk you run is that some piece of code retains a reference to one of the objects you build. 
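A minimal sketch of the graph-pooling idea described above, written in Python purely for illustration. The names `Request`, `GraphPool` and `max_idle` are hypothetical and do not come from the thread or from any particular runtime; the point is only that holding one reference to the root keeps the whole small graph alive for reuse.

```python
class Request:
    """Hypothetical root of a small object graph reused across transactions."""

    def __init__(self):
        self.headers = {}
        self.body = bytearray()

    def reset(self):
        # Clear per-transaction state but keep the allocated containers.
        self.headers.clear()
        del self.body[:]


class GraphPool:
    """Keeps idle graphs alive by holding a single reference to each root."""

    def __init__(self, factory, max_idle=16):
        self._factory = factory
        self._idle = []
        self._max_idle = max_idle

    def acquire(self):
        # Reuse an idle graph if one exists, otherwise build a fresh one.
        return self._idle.pop() if self._idle else self._factory()

    def release(self, root):
        root.reset()
        if len(self._idle) < self._max_idle:
            self._idle.append(root)
        # Otherwise the root is simply dropped and the GC reclaims the graph.


pool = GraphPool(Request)
first = pool.acquire()
first.headers["id"] = "1"
first.body.extend(b"payload")
pool.release(first)
second = pool.acquire()  # same graph as `first`, with its state cleared
```

The risk mentioned above shows up directly here: `first` and `second` are the same object, so any code that keeps using `first` after `release` will observe the next transaction's data.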
Of course, you'll have to design your application around this. But doing so may be much easier and safer than abandoning the use of managed/GC heaps. -- arlie

From weigelt@metux.de Tue Feb 18 17:20:56 2003 From: weigelt@metux.de (Enrico Weigelt) Date: Tue, 18 Feb 2003 18:20:56 +0100 Subject: [gclist] why malloc/free instead of GC?
In-Reply-To: <20030218141107.GB14818@port25.com> References: <75A9FEBA25015040A761C1F74975667DA136A5@hplex4.hpl.hp.com> <20030218133150.GC30555@metux.de> <20030218141107.GB14818@port25.com> Message-ID: <20030218172056.GA29805@metux.de>

On Tue, Feb 18, 2003 at 03:11:07PM +0100, Juergen Christoffel wrote:
> Yes, generational GC for example.
>
> Back in the eighties and early nineties, when Symbolics introduced
> generational GC on their Lisp Machines, large programs did run faster with
> GC turned on than without GC when the "Ephemeral GC" (that was their name
> for Generational GC, IIRC) was turned on because this reduced working sets.

How does this one work?

cu

From hans_boehm@hp.com Tue Feb 18 17:49:12 2003 From: hans_boehm@hp.com (Boehm, Hans) Date: Tue, 18 Feb 2003 09:49:12 -0800 Subject: [gclist] why malloc/free instead of GC? Message-ID: <75A9FEBA25015040A761C1F74975667DA136A7@hplex4.hpl.hp.com>

By "roundtrip" I meant "malloc+free" or "malloc+".

I should really have said "tracing GC" in the following, but that was the topic of discussion, I think. Fundamentally, a tracing collector needs to do the same amount of tracing work whether a client allocates, say, 10 objects containing 100 bytes each, or a single 1000-byte object. Large objects cost proportionately more tracing work. This is not true for malloc/free allocation, where the 1000-byte allocation+deallocation often doesn't cost more than a single 100-byte allocation. (A pure reference count collector without cycle detection behaves more or less like malloc/free here.)
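The proportionality argument above can be made concrete with a deliberately simplified model, sketched here in Python. It assumes a non-generational tracing collector that runs once the free part of the heap has been used up and traces all live data on each run. The function name, its parameters and the 20 MB / 40 MB figures are illustrative assumptions, not measurements of any real collector.

```python
def traced_bytes_per_allocation(obj_size, live_bytes, heap_bytes):
    """Amortized tracing work charged to one allocation (simplified model).

    Assumption: one collection happens per `heap_bytes - live_bytes` bytes
    allocated, and each collection traces `live_bytes` bytes.
    """
    free_bytes = heap_bytes - live_bytes
    if free_bytes <= 0:
        raise ValueError("heap must be larger than the live data")
    # Work per collection, divided by the number of allocations that fit
    # between two collections (free_bytes / obj_size).
    return live_bytes * obj_size / free_bytes


MB = 1 << 20
small = traced_bytes_per_allocation(100, 20 * MB, 40 * MB)
large = traced_bytes_per_allocation(1000, 20 * MB, 40 * MB)
print(small, large, large / small)
```

Under these assumptions the share of tracing work charged to an allocation grows linearly with its size, so ten 100-byte objects and one 1000-byte object cost the same, while a malloc/free pair for the large object can stay close to constant.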
Of course, this almost never results in an asymptotic difference in the running time of the client program, since initializing the object will cost time proportional to its size anyway. And most programs tend to initialize at least a constant fraction of the objects they allocate. However, the posted "unrealistic and simplistic" test program which started this discussion did not. And in my experience, with normal collector tuning, the initialization time is usually a small constant factor (e.g. 2-10?) less than the object round trip time. Thus it does seem to matter in real life.

This becomes even more true for a fully conservative collector for C, which really has to initialize objects itself, in order to avoid preserving stale pointers. In that case the allocation time includes initialization time. (In real life, I doubt this makes a huge difference, since the initialization time tends to be dominated by cache miss time. If the client initializes the object later, as it normally would, it thus avoids the cache miss time. But the time cost has effectively been moved from the client into the allocator.)

A conservative GC for C usually worsens matters as follows:

- If it needs to accommodate existing libraries or compilers, it will probably have to recognize "interior pointers", at least if they're stored on the stack or in registers. (This could be avoided with some compiler cooperation, which is really needed anyway, but is rarely implemented, since its induced failure rate tends to be less than that of other compiler bugs.)
- This means that for any known non-pointer N on the stack, we can't safely allocate a large array A such that N is an address within A.
- As the number of such non-pointers and/or the size of A increases, eventually we get to the point at which we can't find room in the right section of the address space to safely allocate A.

Empirically, this is generally a non-issue on 64-bit hardware.
On 32-bit hardware, with interior pointers recognized everywhere (the default for our collector with C code), and an otherwise favorable application, allocations larger than about 100K seem to be problematic. With interior pointer recognition only on the stack (default for gcj, for example), the threshold seems to be about a MB. As a result of both of these effects, I usually recommend that with our collector and C code, users at least consider explicitly managing very large objects. Fortunately, in most cases I've heard about, this tends to be fairly easy. Often the large objects tend to be things like IO buffers with well-defined lifetimes that are in fact easy to manage explicitly. GC pays off for complex linked structures which tend to be composed of small objects. I think the same advice applies to Java, although it's no doubt politically incorrect there. Keeping explicit pools for large, easily-managed objects will mostly get the GC out of the picture once the pool is sufficiently large. You pay a bit of space for type-safety, in that you can't reuse a given large object for a different type. (If the large objects contain pointers, the GC also still needs to trace them. But large objects seem to often be pointer-free, e.g. bitmaps.) But I would guess that so long as you use this technique only in the few cases where it's really needed, that's not a large cost. Hans > -----Original Message----- > From: David F. Bacon [mailto:dfb@watson.ibm.com] > Sent: Tuesday, February 18, 2003 5:57 AM > To: Boehm, Hans; 'Greg Hudson '; 'Basile STARYNKEVITCH ' > Cc: gclist@iecc.com > Subject: Re: [gclist] why malloc/free instead of GC? > > > hans, > > by "roundtrip" do you mean malloc+free? i don't understand > your statement > about proportionality to object size in GC. also, why do conservative > collectors dislike large objects? is it because a floating > point number > could cause a dead large object to be retained? 
> > david > ----- Original Message ----- > From: "Boehm, Hans" > To: "'Greg Hudson '" ; "'Basile STARYNKEVITCH '" > > Cc: > Sent: Monday, February 17, 2003 11:24 PM > Subject: Re: [gclist] why malloc/free instead of GC? > > > > I would add: > > > > GC object roundtrip times are pretty much unavoidably > proportional to the > object size, where malloc + free times can be nearly constant. If you > allocate primarily large objects, malloc+free will be cheaper. (For > sufficiently small objects, it usually isn't, at least based on my > measurements. Conservative collectors like large objects even less.) > > >

From weigelt@metux.de Tue Feb 18 17:40:18 2003 From: weigelt@metux.de (Enrico Weigelt) Date: Tue, 18 Feb 2003 18:40:18 +0100 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <5.1.1.6.0.20030218080757.03080618@pop3.geodesic.com> References: <75A9FEBA25015040A761C1F74975667DA136A4@hplex4.hpl.hp.com> <5.1.1.6.0.20030218080757.03080618@pop3.geodesic.com> Message-ID: <20030218174018.GC29805@metux.de>

On Tue, Feb 18, 2003 at 08:27:55AM -0600, Charles Fiterman wrote:
> Reference counting also has the advantage that the destruction of objects
> can have rational finalizers. Finalizers must be safe, general, sure,
> prompt and ordered. Safe means they don't violate the type system. General
> means finalizers can run any code in the language and have that code
> produce normal results, for example exceptions can't just get discarded.
> Sure means if you build an object it gets to destroy itself. Prompt means
> finalizers aren't indefinitely postponed. Ordered means finalizers run in a
> determined order, if you ship the application you don't change the order
> creating portability bugs.

Reference counting is tricky if you're using ring structures. If you use 'clean' refcounting (_each_ time you take a reference, increase the counter, and always decrease it on dereference), you'll get dead chunks that will never be freed when using a ring structure.
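The ring problem described above can be observed directly in CPython, which uses reference counting plus a separate cycle collector. The sketch below is an illustration only: the `Node` class and helper names are made up, and the behaviour shown is specific to CPython (other Python implementations manage memory differently). The cycle collector is switched off temporarily so that only pure reference counting is at work.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self.next = None


def make_chain():
    # a -> b, no cycle: both counts reach zero when the function returns.
    a, b = Node("a"), Node("b")
    a.next = b
    return weakref.ref(a)  # a weak reference does not keep `a` alive


def make_ring():
    # a -> b -> a: each node keeps the other's count at one forever.
    a, b = Node("a"), Node("b")
    a.next = b
    b.next = a
    return weakref.ref(a)


gc.disable()  # leave only reference counting active
try:
    chain_probe = make_chain()
    ring_probe = make_ring()
    chain_freed_by_refcount = chain_probe() is None
    ring_freed_by_refcount = ring_probe() is None
    gc.collect()  # explicit tracing pass over container objects
    ring_freed_by_tracing = ring_probe() is None
finally:
    gc.enable()

print(chain_freed_by_refcount, ring_freed_by_refcount, ring_freed_by_tracing)
```

The chain is reclaimed as soon as its last reference goes away, while the ring stays allocated until a tracing pass runs, which is the "dead chunks" situation described in the message.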
cu -- --------------------------------------------------------------------- Enrico Weigelt == metux ITS Webhosting ab 5 EUR/Monat. UUCP, rawIP und vieles mehr. phone: +49 36207 519931 www: http://www.metux.de/ fax: +49 36207 519932 email: contact@metux.de cellphone: +49 174 7066481 smsgate: sms.weigelt@metux.de --------------------------------------------------------------------- Diese Mail wurde mit UUCP versandt. http://www.metux.de/uucp/ From hans_boehm@hp.com Tue Feb 18 18:11:42 2003 From: hans_boehm@hp.com (Boehm, Hans) Date: Tue, 18 Feb 2003 10:11:42 -0800 Subject: [gclist] why malloc/free instead of GC? Message-ID: <75A9FEBA25015040A761C1F74975667DA136A8@hplex4.hpl.hp.com> > -----Original Message----- > From: Arlie Davis [mailto:arlie@sublinear.org] > > Microsoft's CLR (.Net Framework) does a good job on "large" objects. > Objects above a certain threshold (20k, I believe) are allocated in a > traditional malloc/free heap, but their lifetime is still tracked > through GC. Also, since the CLR has access to all type > information, it > only scans memory locations that are known to be pointers. > > This is an elegant solution, because it shows that object lifetime > (explicit free vs. GC) can be separated from allocation mechanism > (contiguous heap w/compaction vs. fragmentable heap). Agreed. But it doesn't solve the fundamental (and unsolvable) problem. If you allocate a 1MB object, that will still need to be considered by the GC triggering heuristic, and thus move you much closer to the next GC. If it didn't, allocating many such objects in a row would cause unacceptable heap growth. > > The same approach could easily be adopted by other GC implementations. Since ours doesn't move objects, there's no real performance distinction between allocating something in the GC heap and the malloc/free heap, and it doesn't matter. The real benefit of such a technique is that you avoid copying/moving large objects, if the collector otherwise does so. 
I think many copying collectors use a similar technique. Hans From pekka.p.pirinen@globalgraphics.com Tue Feb 18 18:17:42 2003 From: pekka.p.pirinen@globalgraphics.com (Pekka P. Pirinen) Date: Tue, 18 Feb 2003 18:17:42 GMT Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <20030218172056.GA29805@metux.de> (message from Enrico Weigelt on Tue, 18 Feb 2003 18:20:56 +0100) Message-ID: <200302181817.h1IIHgt01139@anor.cam.harlequin.co.uk> [just you] > On Tue, Feb 18, 2003 at 03:11:07PM +0100, Juergen Christoffel wrote: >> Yes, generational GC for example. [...] > > How does this one work ? Or did you want to know about Symbolics Ephemeral GC specifically? -- Pekka P. Pirinen From dfb@watson.ibm.com Tue Feb 18 18:19:46 2003 From: dfb@watson.ibm.com (David F. Bacon) Date: Tue, 18 Feb 2003 13:19:46 -0500 Subject: [gclist] why malloc/free instead of GC? References: <75A9FEBA25015040A761C1F74975667DA136A7@hplex4.hpl.hp.com> Message-ID: <00e801c2d77a$511dbac0$a2590209@watson.ibm.com> > By "roundtrip" I meant "malloc+free" or "malloc+". ok. > I should really have said "tracing GC" in the following, but that was the topic of discussion, I think. Fundamentally, a tracing collector needs to do the same amount of tracing work whether a client allocates, say, 10 objects containing 100 bytes each, or a single 1000 byte objects. Large objects cost proportionately more tracing work. This is not true for malloc/free allocation, where the 1000 byte allocation+deallocation often doesn't cost more than a single 100 byte allocation. (A pure reference count collector without cycle detection behaves more or less like malloc/free here.) you're assuming the collector has to scan the whole object to find the pointers, right? for a type-accurate collector, the tracing work is proportional to the pointer density of the program times the memory size, which is usually much smaller. 
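To illustrate the difference between scanning whole objects and following only known pointer fields, here is a toy mark phase in Python. Everything in it is a hypothetical model: the heap is a dictionary from made-up object ids to a list of words plus a tuple of pointer slot indices, and "conservative" simply means treating every word as a possible pointer.

```python
def mark(heap, roots, precise):
    """Mark from roots; return (set of live object ids, words inspected)."""
    marked, stack, examined = set(), list(roots), 0
    while stack:
        obj_id = stack.pop()
        if obj_id in marked:
            continue
        marked.add(obj_id)
        words, pointer_slots = heap[obj_id]
        # Precise: look only at slots the type map declares as pointers.
        # Conservative: look at every word of the object.
        slots = pointer_slots if precise else range(len(words))
        for i in slots:
            examined += 1
            if words[i] in heap:  # word is (or merely looks like) an object id
                stack.append(words[i])
    return marked, examined


# Toy heap: object id -> (words, indices of the pointer slots).
heap = {
    0x1000: ([0x2000] + [7] * 7, (0,)),   # 1 pointer, 7 data words
    0x2000: ([None] + [0] * 15, (0,)),    # 1 null pointer, 15 data words
    0x3000: ([1, 2, 3, 4], ()),           # unreachable, pointer-free
}

precise_live, precise_words = mark(heap, [0x1000], precise=True)
conservative_live, conservative_words = mark(heap, [0x1000], precise=False)
print(precise_words, conservative_words)
```

In this model both modes find the same live set, but the precise scan inspects one word per object while the conservative scan inspects all of them. In the conservative mode a data word that happened to equal an object id would also keep that object alive.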
> Of course, this almost never results in an asymptotic difference in the running time of the client program, since initializing the object will cost time proportional to its size anyway. And most programs tend to initialize at least a constant fraction of the objects they allocate. However, the posted "unrealistic and simplistic" test program which started this discussion did not. And in my experience, with normal collector tuning, the initialization time is usually a small constant factor (e.g. 2-10?) less than the object round trip time. Thus it does seem to matter in real life. > > This becomes even more true for a fully conservative collector for C, which really has to initialize objects itself, in order to avoid preserving stale pointers. In that case the allocation time includes initialization time. (In real life, I doubt this makes a huge difference, since the initialization time tends to be dominated by cache miss time. If the client initializes the object later, as it normally would, it thus avoids the cache miss time. But the time cost has effectively been moved from the client into the allocator.)

i keep thinking that we should be able to fix this problem, at least for objects larger than a cache line, by using the "cache line clear" operations that now exist in many cpus. has anyone explored this?

> As a result of both of these effects, I usually recommend that with our collector and C code, users at least consider explicitly managing very large objects. Fortunately, in most cases I've heard about, this tends to be fairly easy. Often the large objects tend to be things like IO buffers with well-defined lifetimes that are in fact easy to manage explicitly. GC pays off for complex linked structures which tend to be composed of small objects. > > I think the same advice applies to Java, although it's no doubt politically incorrect there.
Keeping explicit pools for large, easily-managed objects will mostly get the GC out of the picture once the pool is sufficiently large. You pay a bit of space for type-safety, in that you can't reuse a given large object for a different type. (If the large objects contain pointers, the GC also still needs to trace them. But large objects seem to often be pointer-free, e.g. bitmaps.) But I would guess that so long as you use this technique only in the few cases where it's really needed, that's not a large cost. object pooling is common in java systems. the problem is that it brings back all of the headaches of malloc/free. i don't know about it being "politically incorrect". there is absolutely no reason why large objects should be inefficient under gc in java. but if you create some state information, and the creation operation is expensive, then object pooling is an attractive and useful feature regardless of the physical size of the object. david From hans_boehm@hp.com Tue Feb 18 18:45:13 2003 From: hans_boehm@hp.com (Boehm, Hans) Date: Tue, 18 Feb 2003 10:45:13 -0800 Subject: [gclist] why malloc/free instead of GC? Message-ID: <75A9FEBA25015040A761C1F74975667DA136A9@hplex4.hpl.hp.com> > From: Charles Fiterman [mailto:cef@geodesic.com] > If you have mark and sweep or moving collection at some point your > application will become so large that collection time causes > you to violate > it no matter how many CPU's you add. You must have a way to > distribute free > operations and not run them all at once. Why? All evidence I've seen suggests that a) Heap sizes grow roughly with memory access speed. The amount of time it takes to touch or trace a "large" heap on a "fast" seems to stay roughly constant over the years, even though the meaning of "large" and "fast" changes. b) Tracing collection scales quite well with processor count. 
I haven't done the measurement, but I strongly suspect that you can buy a machine that will collect a 10GB heap with 50% pointer-full-object-occupancy in under a second. (It won't be a cheap machine, but ...) > Reference counting also has the advantage that the > destruction of objects > can have rational finalizers. Finalizers must be safe, general, sure, > prompt and ordered. Safe means they don't violate the type > system. General > means finalizers can run any code in the language and have that code > produce normal results, for example exceptions can't just get > discarded. > Sure means if you build an object it gets to destroy itself. > Prompt means > finalizers aren't indefinitely postponed. Ordered means > finalizers run in a > determined order, if you ship the application you don't > change the order > creating portability bugs. > I don't know how to get "prompt"ness and "sure"ness guarantees without running finalizers synchronously in the thread dropping the reference. That's "safe" in your sense, but it's unsafe in that it can result in spurious deadlocks and/or similar problems. Hence it's highly undesirable. For details, see my 2003 POPL paper (http://portal.acm.org/citation.cfm?doid=604131.604153 or http://www.hpl.hp.com/techreports/2002/HPL-2002-335.html ). Hans From weigelt@metux.de Tue Feb 18 18:25:17 2003 From: weigelt@metux.de (Enrico Weigelt) Date: Tue, 18 Feb 2003 19:25:17 +0100 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: References: <5.1.1.6.0.20030218090549.02faf830@pop3.geodesic.com> Message-ID: <20030218182516.GD29805@metux.de> On Tue, Feb 18, 2003 at 12:28:51PM -0500, Igor Pechtchanski wrote: > 1) those where the large heap size comes not from the amount of data > handled in any one transaction, but rather from the number of concurrent > transactions. Since each transaction is (usually) a separate entity, > approaches like region-based GC or transaction-specific heaps might work > well there. 
This is tricky if there are references from one region to another. How to cope with this ? > 2) those actually handling massive amounts of data for each transaction > (such as search engines). Such applications mostly do not have a lot of > simultaneous live data, just a large data stream. Since the performance > of, say, copying GC is proportionate to the amount of live data, this > shouldn't affect performance. For applications with limited lifetimes of several entities (i.e. request), there's another quite interesting model. Look at the apache-2, they're using different pools for entities with different lifetimes (i.e. request vs. thread vs. global). This is a little bit like the unix process model, where the whole vm memory gets freed when the process has died. The problem with this model is to take care of not mixing up several pools. -- --------------------------------------------------------------------- Enrico Weigelt == metux ITS Webhosting ab 5 EUR/Monat. UUCP, rawIP und vieles mehr. phone: +49 36207 519931 www: http://www.metux.de/ fax: +49 36207 519932 email: contact@metux.de cellphone: +49 174 7066481 smsgate: sms.weigelt@metux.de --------------------------------------------------------------------- Diese Mail wurde mit UUCP versandt. http://www.metux.de/uucp/ From pechtcha@cs.nyu.edu Tue Feb 18 18:58:07 2003 From: pechtcha@cs.nyu.edu (Igor Pechtchanski) Date: Tue, 18 Feb 2003 13:58:07 -0500 (EST) Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <20030218182516.GD29805@metux.de> Message-ID: On Tue, 18 Feb 2003, Enrico Weigelt wrote: > On Tue, Feb 18, 2003 at 12:28:51PM -0500, Igor Pechtchanski wrote: > > > > 1) those where the large heap size comes not from the amount of data > > handled in any one transaction, but rather from the number of concurrent > > transactions. Since each transaction is (usually) a separate entity, > > approaches like region-based GC or transaction-specific heaps might work > > well there. 
> This is tricky if there are references from one region to another. > How to cope with this ? Same way generational GC copes with it, i.e., through write barriers. However, when I said "separate entities", I meant mostly "independent heaps", in which case there won't be any cross-references. > > 2) those actually handling massive amounts of data for each transaction > > (such as search engines). Such applications mostly do not have a lot of > > simultaneous live data, just a large data stream. Since the performance > > of, say, copying GC is proportionate to the amount of live data, this > > shouldn't affect performance. > For applications with limited lifetimes of several entities (i.e. request), > there's another quite interesting model. Look at the apache-2, they're > using different pools for entities with different lifetimes > (i.e. request vs. thread vs. global). This is a little bit like the > unix process model, where the whole vm memory gets freed when the process > has died. The problem with this model is to take care of not mixing up > several pools. That's largely the idea behind region-based allocation/GC. I believe it was introduced in Trishul Chilimbi's paper, but I'm sure people here will correct me if I'm wrong. Igor -- http://cs.nyu.edu/~pechtcha/ |\ _,,,---,,_ pechtcha@cs.nyu.edu ZZZzz /,`.-'`' -. ;-;;,_ igor@watson.ibm.com |,4- ) )-,_. ,\ ( `'-' Igor Pechtchanski '---''(_/--' `-'\_) fL a.k.a JaguaR-R-R-r-r-r-.-.-. Meow! Oh, boy, virtual memory! Now I'm gonna make myself a really *big* RAMdisk! -- /usr/games/fortune From cef@geodesic.com Tue Feb 18 19:07:09 2003 From: cef@geodesic.com (Charles Fiterman) Date: Tue, 18 Feb 2003 13:07:09 -0600 Subject: [gclist] Replies. Message-ID: <5.1.1.6.0.20030218130550.030f7fc0@pop3.geodesic.com> I prefer not to have replies go to me and gclist. Since I'm obviously on gclist that means I get two copies. I can't imagine anyone who wouldn't have this preference. 
From hans_boehm@hp.com Tue Feb 18 19:33:14 2003 From: hans_boehm@hp.com (Boehm, Hans) Date: Tue, 18 Feb 2003 11:33:14 -0800 Subject: [gclist] why malloc/free instead of GC? Message-ID: <75A9FEBA25015040A761C1F74975667DA136AA@hplex4.hpl.hp.com> > From: David F. Bacon [mailto:dfb@watson.ibm.com] > > > I should really have said "tracing GC" in the following, > but that was the > topic of discussion, I think. Fundamentally, a tracing > collector needs to > do the same amount of tracing work whether a client allocates, say, 10 > objects containing 100 bytes each, or a single 1000 byte > objects. Large > objects cost proportionately more tracing work. This is not true for > malloc/free allocation, where the 1000 byte > allocation+deallocation often > doesn't cost more than a single 100 byte allocation. (A pure > reference > count collector without cycle detection behaves more or less like > malloc/free here.) > > you're assuming the collector has to scan the whole object to find the > pointers, right? for a type-accurate collector, the tracing work is > proportional to the pointer density of the program times the > memory size, > which is usually much smaller. Even for the gcj collector, that's pretty much true. But I think it matters only in that it makes the precise argument harder. Assume you have a nongenerational collector, 20 MB of static 20% pointer-density live data in a 40MB heap, and you start repeatedly allocating and immediately dropping 1MB pointer-free objects. You will still have to trace 4MB of pointers every 20 allocations. Thus they will still be far more expensive than allocating cons cells. (It's not hard to doctor this example to deal with a generational collector, though you probably need to allocate some pointers in that case.) > > This becomes even more true for a fully conservative > collector for C, > which really has to initialize objects itself, in order to > avoid preserving > stale pointers. 
In that case the allocation time includes > initialization > time. (In real life, I doubt this makes a huge difference, since the > initialization time tends to be dominated by cache miss time. > If the client > initializes the object later, as it normally would, it thus > avoids the cache > miss time. But the time cost has effectively been moved from > the client into the allocator.) > > i keep thinking that we should be able to fix this problem, > at least for > objects larger than a cache line, by using the "cache line > clear" operations > that now exist in many cpus. has anyone explored this? In the case of our collector, it would clearly help in the not infrequent case of building a free list in an empty page. The initial write in that case is with bzero or memset, which probably already take advantage of any such possibility. In other cases, the limiting factor seems to be the fact that you don't want to introduce model dependencies by hard-coding the line size, etc. I haven't experimented with it much, since I think none of the Intel architectures currently provide something along these lines. > object pooling is common in java systems. the problem is > that it brings > back all of the headaches of malloc/free. The question in my mind is whether you can confine it to one or two large object types. My guess is that usually you can (and should). I agree that widespread object pooling is generally a bad idea. Its external fragmentation cost is usually too high in large systems, in addition to the malloc/free problems. Hans From weigelt@metux.de Tue Feb 18 19:11:18 2003 From: weigelt@metux.de (Enrico Weigelt) Date: Tue, 18 Feb 2003 20:11:18 +0100 Subject: [gclist] why malloc/free instead of GC?
In-Reply-To: <000201c2d773$a6f685a0$5bd1dc0c@sulaco>
References: <20030218133150.GC30555@metux.de> <000201c2d773$a6f685a0$5bd1dc0c@sulaco>
Message-ID: <20030218191118.GA5030@metux.de>

On Tue, Feb 18, 2003 at 12:32:03PM -0500, Arlie Davis wrote:
> Microsoft's CLR (.Net Framework) does a good job on "large" objects.
> Objects above a certain threshold (20k, I believe) are allocated in a
> traditional malloc/free heap, but their lifetime is still tracked
> through GC. Also, since the CLR has access to all type information, it
> only scans memory locations that are known to be pointers.

This is just what oberon does w/ all objects.

If we've got some type information, we could use three heaps for
different kinds of objects:

a) strings (and other objects which contain no pointers)
b) typed objects (we only have to know where pointers lay around)
c) untyped objects (must be treated as an array of pointers)

For these three types we use different methods for finding pointers.

a) there are no pointers. nothing to do
b) we have to look at each pointer location defined by the pointer map.
c) simply scan the whole chunk as a pointer array (assume aligned ptrs?)

On Unix we can map memory almost anywhere we want to have it, so we
could split up our address space into several huge ranges, in which our
pools lie. So we can decide very fast which object type a pointer
points to.

Now it is the task of upper level functions to decide where to allocate
a new object from. By default we use c) if we don't know more about the
object (malloc() replacement).

> Also, consider the example (brought up here) of an application which
> must process a high volume of transactions, with a high degree of
> consistency of time required per transaction. Just because you use a
> GC, doesn't mean you *always* allocate fresh objects for every
> transaction. You can still -- selectively -- use object pooling. Many
> large applications that are based on explicit-free gain performance by
> pooling instances of objects.

Yes, for example if you have a very large number of the same objects,
you could easily allocate a big array and use a simple allocation map.
You should try not to use pointers to the elements of this array,
because the gc perhaps does not like it (in oberon this is forbidden by
the quite strict type system), but address them by a simple index.
(this could also save memory space)

hmm... i'm currently thinking about using page fault information for
optimizing the GC process (branch cutting?). is there any chance ?

cu

From rog@vitanuova.com Tue Feb 18 19:44:31 2003
From: rog@vitanuova.com (rog@vitanuova.com)
Date: Tue, 18 Feb 2003 19:44:31 +0000
Subject: [gclist] why malloc/free instead of GC?
Message-ID: <1136864263c71d58f046b54884bda3fd@vitanuova.com>

> a) strings (and other objects which contain no pointers)
> b) typed objects (we only have to know where pointers lay around)
> c) untyped objects (must be treated as an array of pointers)

pointers between b) and c) are gonna cause problems.

From weigelt@metux.de Tue Feb 18 19:33:21 2003
From: weigelt@metux.de (Enrico Weigelt)
Date: Tue, 18 Feb 2003 20:33:21 +0100
Subject: [gclist] why malloc/free instead of GC?
In-Reply-To: <1136864263c71d58f046b54884bda3fd@vitanuova.com> References: <1136864263c71d58f046b54884bda3fd@vitanuova.com> Message-ID: <20030218193321.GC5030@metux.de> On Tue, Feb 18, 2003 at 07:44:31PM +0000, rog@vitanuova.com wrote: > > a) strings (and other objects which contain no pointers) > > b) typed objects (we only have to know where pointers lay around) > > c) untyped objects (must be treatened as an array of pointers) > > pointers between b) and c) are gonna cause problems. why should they ? i dont wanna build an allocator with strictly separated heaps. the different heaps are for easy detection of the best scan method. you can simply guess on the address, whether the object is typed, untyped or w/o pointers. cu -- --------------------------------------------------------------------- Enrico Weigelt == metux ITS Webhosting ab 5 EUR/Monat. UUCP, rawIP und vieles mehr. phone: +49 36207 519931 www: http://www.metux.de/ fax: +49 36207 519932 email: contact@metux.de cellphone: +49 174 7066481 smsgate: sms.weigelt@metux.de --------------------------------------------------------------------- Diese Mail wurde mit UUCP versandt. http://www.metux.de/uucp/ From weigelt@metux.de Tue Feb 18 19:28:03 2003 From: weigelt@metux.de (Enrico Weigelt) Date: Tue, 18 Feb 2003 20:28:03 +0100 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: References: <20030218182516.GD29805@metux.de> Message-ID: <20030218192803.GB5030@metux.de> On Tue, Feb 18, 2003 at 01:58:07PM -0500, Igor Pechtchanski wrote: > Same way generational GC copes with it, i.e., through write barriers. > However, when I said "separate entities", I meant mostly "independent > heaps", in which case there won't be any cross-references. Ok, so the programmer has to make sure, that these two pools are strictly separate. hmm. This is almost the same like the pooled allocator of the apache2. 
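The pool model mentioned above (separate pools for request, thread and global lifetimes, each released as a whole) might be sketched roughly as follows. This is only an illustration in C, not Apache 2's actual pool API: the names `pool_t`, `pool_create`, `pool_alloc` and `pool_destroy` are made up for the example, and it assumes that nothing in a longer-lived pool points into a shorter-lived one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical region ("pool") allocator: objects are never freed one by
   one; the whole pool is released when its owner (e.g. a request) ends. */
typedef struct pool_block {
  struct pool_block *next;   /* previously filled block, or NULL */
  size_t used;               /* bytes already handed out from data[] */
  size_t size;               /* capacity of data[] in bytes */
  max_align_t data[];        /* payload, aligned for any object type */
} pool_block;

typedef struct pool {
  pool_block *head;          /* block currently being filled */
  size_t block_size;         /* default payload size of a new block */
} pool_t;

pool_t *pool_create(size_t block_size) {
  pool_t *p = malloc(sizeof *p);
  if (p) {
    p->head = NULL;
    p->block_size = block_size;
  }
  return p;
}

/* Bump-pointer allocation; returns NULL if the system is out of memory. */
void *pool_alloc(pool_t *p, size_t n) {
  const size_t align = _Alignof(max_align_t);
  n = (n + align - 1) & ~(align - 1);        /* keep later objects aligned */
  pool_block *b = p->head;
  if (!b || b->size - b->used < n) {         /* start a fresh block */
    size_t size = n > p->block_size ? n : p->block_size;
    b = malloc(sizeof *b + size);
    if (!b)
      return NULL;
    b->next = p->head;
    b->used = 0;
    b->size = size;
    p->head = b;
  }
  void *obj = (unsigned char *)b->data + b->used;
  b->used += n;
  return obj;
}

/* Releases every object in the pool at once. */
void pool_destroy(pool_t *p) {
  pool_block *b = p->head;
  while (b) {
    pool_block *next = b->next;
    free(b);
    b = next;
  }
  free(p);
}
```

A request handler would create one pool on entry, allocate everything request-scoped from it, and call `pool_destroy` on exit. The danger named above shows up as soon as a pointer into that pool is stored in a thread or global structure: it dangles once the request pool is destroyed.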
cu

From basile@starynkevitch.net Tue Feb 18 22:13:49 2003
From: basile@starynkevitch.net (Basile STARYNKEVITCH)
Date: Tue, 18 Feb 2003 23:13:49 +0100
Subject: [gclist] why malloc/free instead of GC?
In-Reply-To: <01e301c2d755$8d1b6510$1c02a8c0@watson.ibm.com>
References: <75A9FEBA25015040A761C1F74975667DA136A4@hplex4.hpl.hp.com> <01e301c2d755$8d1b6510$1c02a8c0@watson.ibm.com>
Message-ID: <15954.45085.866899.37782@hector.lesours>

For completeness, I changed my tiny test a bit to allocate smaller
objects, to take into account Hans Boehm's remark on typical object
size

################################
// file essm.c
// compile for malloc: gcc -O essm.c -o essm
// compile for Boehm's GC: gcc -O -DUSEGC essm.c -o essm_gc -lgc
// compile for Qish GC:
/// gcc -O -I../Qish/include -DUSEQISH essm.c -o essm_qish -L../Qish/lib -lqish -ldl

/* The header names were lost in the archived copy; the ones below are
   inferred from the functions the program calls. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/times.h>

#ifdef USEGC
#include <gc.h>
#define malloc(S) GC_malloc(S)
#ifdef USEGCFREE
#define free(P) GC_free(P)
#else
#define free(P) {}
#endif
#endif

void *tabptr[16];

#ifdef USEQISH
#include "qish.h"
#define tabptr qish_roots
#define HEADEROF(Ad) (*(unsigned*)(Ad))

/* a Qish object: a slot count followed by that many pointer slots */
struct obj_st
{
  unsigned header;
  void *tab[0];
};

/* copy an object to dst, store its new address, return the end of the copy */
void *
essm_qish_gc_copy (void **padr, void *dst, const void *src)
{
  int i = 0;
  static int cnt;
  struct obj_st *odst = dst;
  const struct obj_st *osrc = src;
  unsigned header = osrc->header;
  cnt++;
  odst->header = osrc->header;
  for (i = 0; i < header; i++)
    odst->tab[i] = osrc->tab[i];
  *padr = odst;
  return (void *) (odst->tab + header);
}

/* update every pointer slot during a minor collection */
void *
essm_qish_minor_scan (void *ptr)
{
  int i = 0;
  static int cnt;
  struct obj_st *ob = ptr;
  unsigned header = ob->header;
  cnt++;
  for (i = 0; i < header; i++)
    QISHGC_MINOR_UPDATE (ob->tab[i]);
  return (void *) (ob->tab + header);
}

/* update every pointer slot during a full collection */
void *
essm_qish_full_scan (void *ptr)
{
  int i = 0;
  static int cnt;
  struct obj_st *ob = ptr;
  unsigned header = ob->header;
  cnt++;
  /* visit slots tab[0] .. tab[header-1], as in the minor scan */
  for (i = 0; i < header; i++)
    QISHGC_FULL_UPDATE (ob->tab[i]);
  return (void *) (ob->tab + header);
}

void
essm_qish_fixed_scan (void *ptr, int sz)
{
  qish_panic ("fixed_scan should not be called ptr=%p sz=%d", ptr, sz);
}
#endif

int
main (int argc, char **argv)
{
  long long maxcnt = 1000000;
#define MAXALLOC 20
  long long taballoc[MAXALLOC + 1];
  long long cumalloc = 0;
  long long i = 0;
  int r = 0, s = 0, n = 0;
  double usert, syst, tick;
  struct tms t;
  struct st *p = 0;
  if (argc > 1)
    maxcnt = atol (argv[1]) * 1000;
#ifdef USEQISH
  qishgc_init ();
  qish_gc_copy_p = essm_qish_gc_copy;
  qish_minor_scan_p = essm_qish_minor_scan;
  qish_fixed_scan_p = essm_qish_fixed_scan;
  qish_full_scan_p = essm_qish_full_scan;
#endif //USEQISH
  memset (&t, 0, sizeof (t));
  memset (taballoc, 0, sizeof (taballoc));
  if (maxcnt < 100000)
    maxcnt = 100000;
  printf ("begin maxcnt=%lld=%e\n", maxcnt, (double) maxcnt);
  for (i = 0; i < maxcnt; i++)
    {
      if ((i & 0x1fffff) == 0)
        printf ("i=%lld [=%.3g %%]\n", i, 100.0*(double)i/maxcnt);
      r = lrand48 () & 0xf;
#ifndef USEQISH
      if (tabptr[r])
        free (tabptr[r]);
#endif
      n = lrand48 () % 131072 + 4;
      // approximate s = integer square root(n)
      s = n / 256 + 2;
      s = (s + n / s) / 2;
      s = (s + n / s) / 2;
      s = (s + n / s) / 2;
      s = (s + n / s) / 2;
      s = (s + n / s) / 2;
      s = (s + n / s) / 2;
      s = (s + n / s) / 2;
      s = (s + n / s) / 2;
      s = (s + 8) & (~3);
      cumalloc += s;
      if (s / 16 < MAXALLOC)
        taballoc[s / 16]++;
#ifndef USEQISH
      tabptr[r] = malloc (s);
      if (s > 4 * sizeof (void *) && lrand48 () % 8 < 3)
        ((void **) (tabptr[r]))[1] = tabptr[lrand48 () % 8];
#else
      n = s / 4 + 1;
      {
        volatile struct { struct obj_st *ptr; } _locals_ = {0};
#define l_ptr _locals_.ptr
        BEGIN_LOCAL_FRAME_WITHOUT_ARGS ();
        assert(n>0 && n<1000);
        QISH_ALLOCATE (l_ptr, sizeof (struct obj_st) + n * sizeof (void *));
        l_ptr->header = n;
        if (n > 3 && lrand48 () % 8 < 3)
          {
            l_ptr->tab[0] = (void *) (tabptr[lrand48 () % 8]);
            QISH_WRITE_NOTIFY (l_ptr);
          }
        tabptr[r] = l_ptr;
        dbgprintf("r=%d n=%d l_ptr=%p", r, n, l_ptr);
        EXIT_FRAME ();
      }
#endif
      if (!tabptr[r])
        fprintf (stderr, "malloc(%d) failed i=%lld\n", s, i);
    };
  times (&t);
  tick = (double) sysconf (_SC_CLK_TCK);
  usert = ((double) t.tms_utime) / tick;
  syst = ((double) t.tms_stime) / tick;
  printf ("end maxcnt=%lld=%e cumulated alloc=%lld=%.3g bytes, mean %.3g bytes\n",
          maxcnt, (double) maxcnt, cumalloc, (double) cumalloc,
          (double) cumalloc / (double) maxcnt);
#if 0
  // don't work as I want...
  for (i = 0; i < MAXALLOC; i++)
    {
      long long ta = taballoc[i];
      printf ("alloc<%d: %lld=%.3g i.e. %.3g %%\n", 16 + i * 16, ta,
              (double) ta, 100.0 * ((double) ta / ((double) maxcnt)));
    };
#endif
#ifdef USEQISH
  printf("done %d minor & %d full garbage collections\n",
         qish_nb_minor_collections, qish_nb_full_collections);
#endif
  printf ("%s cputime user=%g system=%g tick=%g total == per iteration user=%g system=%g\n",
          argv[0], usert, syst, tick,
          usert / (double) maxcnt, syst / (double) maxcnt);
  return 0;
}
// eof essm.c
################################

with gcc -DUSEGC -O essm.c -o essm_gc -lgc
the program ./essm_gc gives (using Boehm's GC)

begin maxcnt=1000000=1.000000e+06
i=0 [=0 %]
end maxcnt=1000000=1.000000e+06 cumulated alloc=248908533=2.49e+08 bytes, mean 249 bytes
./essm_gc cputime user=2.13 system=0.02 tick=100 total == per iteration user=2.13e-06 system=2e-08

with gcc -O -g essm.c -o essm
the program ./essm gives (using malloc/free)

begin maxcnt=1000000=1.000000e+06
i=0 [=0 %]
end maxcnt=1000000=1.000000e+06 cumulated alloc=247413992=2.47e+08 bytes, mean 247 bytes
./essm cputime user=0.72 system=0 tick=100 total == per iteration user=7.2e-07 system=0

For completeness and shameless plug I also hacked the same
(useless) program to use my Qish generational copying GC - see
http://freshmeat.net/projects/qish for details on Qish. I even added a
pair of BEGIN_LOCAL_FRAME_WITHOUT_ARGS + EXIT_FRAME macros, even if on
this particular example they are useless (since the result of
allocation goes into a global root).

./essm_qish
begin maxcnt=1000000=1.000000e+06
i=0 [=0 %]
end maxcnt=1000000=1.000000e+06 cumulated alloc=247407768=2.47e+08 bytes, mean 247 bytes
done 29 minor & 0 full garbage collections
./essm_qish cputime user=0.72 system=0.28 tick=100 total == per iteration user=7.2e-07 system=2.8e-07

To have the full GC executed several times, I ran it with more
allocations:

$PWD/essm_qish 54321
begin maxcnt=54321000=5.432100e+07
i=0 [=0 %]
i=2097152 [=3.86 %]
i=4194304 [=7.72 %]
i=6291456 [=11.6 %]
i=8388608 [=15.4 %]
i=10485760 [=19.3 %]
i=12582912 [=23.2 %]
i=14680064 [=27 %]
i=16777216 [=30.9 %]
i=18874368 [=34.7 %]
i=20971520 [=38.6 %]
i=23068672 [=42.5 %]
i=25165824 [=46.3 %]
i=27262976 [=50.2 %]
i=29360128 [=54 %]
i=31457280 [=57.9 %]
i=33554432 [=61.8 %]
i=35651584 [=65.6 %]
i=37748736 [=69.5 %]
i=39845888 [=73.4 %]
i=41943040 [=77.2 %]
i=44040192 [=81.1 %]
i=46137344 [=84.9 %]
i=48234496 [=88.8 %]
i=50331648 [=92.7 %]
i=52428800 [=96.5 %]
end maxcnt=54321000=5.432100e+07 cumulated alloc=13437122740=1.34e+10 bytes, mean 247 bytes
done 1645 minor & 6 full garbage collections
/home/basile/Misc/essm_qish cputime user=41.45 system=13.37 tick=100 total == per iteration user=7.63057e-07 system=2.46129e-07

With the same allocation count the Boehm's GC test ends with

i=52428800 [=96.5 %]
end maxcnt=54321000=5.432100e+07 cumulated alloc=13437153848=1.34e+10 bytes, mean 247 bytes
/home/basile/Misc/essm_gc cputime user=116.91 system=0.19 tick=100 total == per iteration user=2.15221e-06 system=3.49773e-09

So a malloc/free is about 0.6 microseconds while a GC_malloc is about 2
(or 2.2) microseconds, and a Qish allocation is about 1.1 microseconds
(on average) *including garbage collection time* (of course Qish has
more overhead in practice, because of the mandatory local roots
registration and of the write barrier; and Qish is much less
comfortable to code with, since it requires a particular coding style).

Sorry to Hans Boehm for having provided an unrealistic benchmark.

(If any reader of this list happens to have tried Qish I would be
delighted to get feedback; Qish is opensource, under LGPL)

Regards.
--
Basile STARYNKEVITCH http://starynkevitch.net/Basile/
email: basilestarynkevitchnet
alias: basiletunesorg
8, rue de la Faïencerie, 92340 Bourg La Reine, France

From hans_boehm@hp.com Tue Feb 18 23:52:50 2003
From: hans_boehm@hp.com (Boehm, Hans)
Date: Tue, 18 Feb 2003 15:52:50 -0800
Subject: [gclist] why malloc/free instead of GC?
Message-ID: <75A9FEBA25015040A761C1F74975667DA136B2@hplex4.hpl.hp.com>

It looks to me like much of this difference can still be explained by
the fact that GC_malloc initializes the resulting objects, and hence
takes the cache misses that a real client would otherwise take later.
To make the measurements more comparable, you should initialize the
objects after you allocate them. (I still wouldn't expect GC_malloc to
win. I've normally seen that only for cons-cell sized or slightly
larger objects.)

Are there also differences in the amount of thread support that's
included in the measurements? E.g. the system malloc usually tests a
global to determine at runtime whether it needs to lock.

Hans

> -----Original Message-----
> From: Basile STARYNKEVITCH [mailto:basile@starynkevitch.net]
> Sent: Tuesday, February 18, 2003 2:14 PM
> To: gclist@iecc.com
> Subject: Re: [gclist] why malloc/free instead of GC?
>
>
> For completeness, I changed my tiny test a bit to allocate smaller
> objects, to take into account Hans Boehm's remark on typical object
> size
>
> ...
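The suggestion above, to initialize objects after allocating them so that both allocators pay for the first touch inside the timed loop, could be applied to the test roughly as sketched here. The `ALLOC` macro and the `alloc_and_touch` helper are hypothetical names introduced for the illustration; `GC_malloc` is the Boehm collector entry point already used in the posted program, and the sketch assumes it is selected with the same `-DUSEGC` flag.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef USEGC
#include <gc.h>
#define ALLOC(n) GC_malloc(n)   /* returns memory that is already cleared */
#else
#define ALLOC(n) malloc(n)      /* returns uninitialized memory */
#endif

/* Allocate n bytes and write every byte, so that the cost of first
   touching the object is charged to the timed loop for both allocators
   instead of only for the one that clears memory itself. */
void *alloc_and_touch(size_t n) {
  void *p = ALLOC(n);
  if (p)
    memset(p, 0, n);
  return p;
}
```

In the benchmark loop, the call `malloc (s)` would then become `alloc_and_touch (s)`. This does not make the two allocators equivalent; it only removes one known asymmetry from the comparison.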
> From fjh@cs.mu.OZ.AU Wed Feb 19 04:10:46 2003 From: fjh@cs.mu.OZ.AU (Fergus Henderson) Date: Wed, 19 Feb 2003 15:10:46 +1100 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <15954.45085.866899.37782@hector.lesours> References: <75A9FEBA25015040A761C1F74975667DA136A4@hplex4.hpl.hp.com> <01e301c2d755$8d1b6510$1c02a8c0@watson.ibm.com> <15954.45085.866899.37782@hector.lesours> Message-ID: <20030219041046.GA27015@ceres.cs.mu.oz.au> On 18-Feb-2003, Basile STARYNKEVITCH wrote: > For completeness and shameless plug I also hacked the same (useless) > program to use my Qish generational copying GC - see > http://freshmeat.net/projects/qish for details on Qish. IIRC, qish depends on GCC's `-fvolatile' and `-fvolatile-globals' options, right? Firstly, because of this, it's not really fair to compare just GC times, since qish will have a significant overhead on code which does not do any allocation at all. So benchmarks which do allocation but have little or no computation (referencing global variables, dereferencing pointers, etc.) will unfairly advantage qish. Secondly, you may be interested to know that these options (or at least `-fvolatile-globals' -- I'm not 100% sure about `-fvolatile') have been removed from the CVS sources for GCC, because they were broken in GCC versions 3.0 and beyond. So this may cause trouble for Qish. -- Fergus Henderson | "I have always known that the pursuit The University of Melbourne | of excellence is a lethal habit" WWW: | -- the last words of T. S. Garp. From basile@starynkevitch.net Wed Feb 19 04:44:30 2003 From: basile@starynkevitch.net (Basile STARYNKEVITCH) Date: Wed, 19 Feb 2003 05:44:30 +0100 Subject: [gclist] why malloc/free instead of GC? 
In-Reply-To: <20030219041046.GA27015@ceres.cs.mu.oz.au> References: <75A9FEBA25015040A761C1F74975667DA136A4@hplex4.hpl.hp.com> <01e301c2d755$8d1b6510$1c02a8c0@watson.ibm.com> <15954.45085.866899.37782@hector.lesours> <20030219041046.GA27015@ceres.cs.mu.oz.au> Message-ID: <15955.2990.301487.95133@hector.lesours> >>>>> "Fergus" == Fergus Henderson writes: Fergus> On 18-Feb-2003, Basile STARYNKEVITCH Fergus> wrote: >> For completeness and shameless plug I also hacked the same >> (useless) program to use my Qish generational copying GC - see >> http://freshmeat.net/projects/qish for details on Qish. Fergus> IIRC, qish depends on GCC's `-fvolatile' and Fergus> `-fvolatile-globals' options, right? Not exactly (see below). The posted code was compiled with (assuming Qish is in ../Qish): gcc -O -I../Qish/include -DUSEQISH essm.c -o essm_qish -L../Qish/lib \ -lqish -ldl For information, gcc -O3 also works and gives user=7.352e-07 system=2.272e-07 seconds per iteration, while the binary compiled with -O gives user=7.525e-07 system=2.384e-07 seconds per iteration and the binary compiled with -O0 [no optimisation at all] gives user=8.243e-07 system=2.249e-07 seconds per iteration. Fergus> Firstly, because of this, it's not really fair to compare Fergus> just GC times, since qish will have a significant overhead Fergus> on code which does not do any allocation at all. I agree with the comment, but Qish does not require actually -fvolatile or -fvolatile-globals [even if I wrote that in the documentation; but I checked since the ISO C99 spec about volatile]. It does require that pointer arguments are declared volatile, and that local pointer variables are (like in the example) in a volatile structure initialized to 0: volatile struct { struct obj_st *ptr; } _locals_ = {0}; Fergus> So Fergus> benchmarks which do allocation but have little or no Fergus> computation (referencing global variables, dereferencing Fergus> pointers, etc.) will unfairly advantage qish. 
I agree with the remark above. But since -fvolatile is not required,
there is no advantage to Qish here, and even a disadvantage to Qish
(because it requires some careful coding conventions, and because the
mandatory BEGIN_LOCAL_FRAME*/EXIT_FRAME macros cost a few machine
instructions each in every call involving pointers).

Fergus> Secondly, you may be interested to know that these options
Fergus> (or at least `-fvolatile-globals' -- I'm not 100% sure
Fergus> about `-fvolatile') have been removed from the CVS sources
Fergus> for GCC, because they were broken in GCC versions 3.0 and
Fergus> beyond. So this may cause trouble for Qish.

I don't need them. I just need a compiler respecting the volatile
keyword, and a coder who carefully uses it:

A. in pointer arguments: foo(struct yourstruct_st* volatile p)
B. in local pointers, like above.

BTW I actually tested some qish code with TinyCC (see www.tinycc.org).

Actually, I wrote that Qish needed -fvolatile before understanding
exactly what the volatile keyword means in C99. This was my mistake.
Qish doesn't need -fvolatile, but does need careful use of the volatile
keyword (see points A,B above) and requires some specific coding style
(notably frame entering & exiting macros, and write barrier macros).

And yes, Qish does have an overhead, because even functions which only
pass GC-ed pointers [to allocating functions] need to follow coding
conventions (in particular the BEGIN_LOCAL_FRAME*/EXIT_FRAME macros)
even if they don't do allocation themselves. So if f(p) calls g(p,q)
which calls h(p,r) which allocates pointers [where p,q,r are GC-ed
pointer arguments declared volatile], all the f, g, and h functions
need the BEGIN_LOCAL_FRAME*/EXIT_FRAME macro pairs even if only h
allocates pointers.

Above all, Hans is right to recall that Qish is not multithreaded and
won't run in a multithreaded application.
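Why f, g and h all need the frame macros even when only h allocates can be illustrated with a toy root stack. The sketch below is not Qish's implementation or API: `root_frame`, `ENTER_FRAME`, `LEAVE_FRAME` and `relocate` are hypothetical names, and the "collection" is reduced to moving one object. It assumes, as in any copying collector, that a pointer held in an unregistered local becomes stale when the object moves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shadow stack of root frames (NOT Qish's API). */
typedef struct root_frame {
  struct root_frame *prev;  /* frame of the calling function */
  void *volatile *slots;    /* this function's GC-ed pointer variables */
  size_t nslots;
} root_frame;

root_frame *top_frame = NULL;

#define ENTER_FRAME(fr, arr, n) \
  root_frame fr = { top_frame, (arr), (n) }; top_frame = &fr
#define LEAVE_FRAME(fr) (top_frame = (fr).prev)

/* Toy "moving collection": every registered slot that points at old_obj
   is redirected to new_obj.  Returns the number of slots updated. */
size_t relocate(void *old_obj, void *new_obj) {
  size_t updated = 0;
  for (root_frame *f = top_frame; f; f = f->prev)
    for (size_t i = 0; i < f->nslots; i++)
      if (f->slots[i] == old_obj) {
        f->slots[i] = new_obj;
        updated++;
      }
  return updated;
}

int old_obj, new_obj;  /* stand-ins for an object before and after a move */

/* h is the only function that "allocates", which triggers the move. */
void *h(void *volatile p) {
  void *volatile roots[1] = { p };
  ENTER_FRAME(fr, roots, 1);
  relocate(&old_obj, &new_obj);
  LEAVE_FRAME(fr);
  return roots[0];
}

/* g and f only pass the pointer on, but still register it. */
void *g(void *volatile p) {
  void *volatile roots[1] = { p };
  ENTER_FRAME(fr, roots, 1);
  h(roots[0]);
  LEAVE_FRAME(fr);
  return roots[0];          /* updated by the collector */
}

void *f(void *volatile p) {
  void *volatile roots[1] = { p };
  ENTER_FRAME(fr, roots, 1);
  g(roots[0]);
  LEAVE_FRAME(fr);
  return roots[0];          /* updated by the collector */
}

/* Same call chain, but this caller skips the registration. */
void *f_unregistered(void *p) {
  g(p);
  return p;                 /* stale: still the pre-move address */
}
```

Calling `f(&old_obj)` yields `&new_obj`, while `f_unregistered(&old_obj)` still yields `&old_obj`, which is the failure the coding conventions are there to prevent.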
-- Basile STARYNKEVITCH http://starynkevitch.net/Basile/ email: basilestarynkevitchnet alias: basiletunesorg 8, rue de la Faïencerie, 92340 Bourg La Reine, France From basile@starynkevitch.net Wed Feb 19 05:17:51 2003 From: basile@starynkevitch.net (Basile STARYNKEVITCH) Date: Wed, 19 Feb 2003 06:17:51 +0100 Subject: [gclist] glib/gtk w/ GC In-Reply-To: <20030218130421.GA30555@metux.de> References: <20030218130421.GA30555@metux.de> Message-ID: <15955.4991.443947.347765@hector.lesours> >>>>> "Enrico" == Enrico Weigelt writes: Enrico> hi folks, i'm gonna start working on an gc based derivate Enrico> of the glib/gtk. anyone interested in helping ? It is a huge work. The main problem is that the memory mechanism is deeply rooted in GTK2. The object reference counters goes down into glib/gobject, and widgets are finalized. What kind of GC do you want to use? Your own, or Boehm's? Actually, I thought of doing this, and concluded that writing a toolkit which borrows piece of code from GTK2 is easier than porting GTK2 to a GC. There used to be some (opensource, but not very popular) toolkits above Boehm's GC. You need some finalization for widgets, because they use system resources (eg X11 windows). -- Basile STARYNKEVITCH http://starynkevitch.net/Basile/ email: basilestarynkevitchnet alias: basiletunesorg 8, rue de la Faïencerie, 92340 Bourg La Reine, France
From arlie@sublinear.org Wed Feb 19 22:18:53 2003 From: arlie@sublinear.org (Arlie Davis) Date: Wed, 19 Feb 2003 17:18:53 -0500 Subject: [gclist] why malloc/free instead of GC? In-Reply-To: <75A9FEBA25015040A761C1F74975667DA136B2@hplex4.hpl.hp.com> Message-ID: <000d01c2d864$e2ef54a0$5bd1dc0c@sulaco> Also, note that most apps that use malloc/free for typical "class" objects (small-to-medium size, with significant pointer density) perform some sort of class initialization. It may be a field-by-field initialization of pointers, or (more often) it is a bulk zero fill (modulo vtable setup). The time to do this, and the cache misses, will not show up in traces of malloc/free cost, but do show up in GC allocations. So, there is yet another reason that direct, API-level comparisons of GC vs. malloc are inaccurate, or at least incomplete.
A better (though still incomplete) comparison would be total time spent in, say, C++ new/delete, to GC alloc / GC collect. Also, in environments that mix reference counting with unmanaged heaps, such as COM development on Win32, you must also account for the time spent in AddRef and Release. Most thread-safe implementations use interlocked integer primitives, which are quite costly on SMP machines. I've done a fair amount of profiling of real-world server apps on Win32, and in many implementations, SMP scalability is severely hindered by the very high frequency of interlocked operations. In services that make heavy use of COM interfaces, reference counting is often one of the biggest users of interlocked access. All of this must be taken into account when considering the behavior of real-world, complex applications & services, and how they use memory. -- arlie -----Original Message----- From: owner-gclist@lists.iecc.com [mailto:owner-gclist@lists.iecc.com] On Behalf Of Boehm, Hans Sent: Tuesday, February 18, 2003 6:53 PM To: 'Basile STARYNKEVITCH' Cc: gclist@iecc.com Subject: Re: [gclist] why malloc/free instead of GC? It looks to me like much of this difference can still be explained by the fact that GC_malloc initializes the resulting objects, and hence takes the cache misses that a real client would otherwise take later. To make the measurements more comparable, you should initialize the objects after you allocate them. (I still wouldn't expect GC_malloc to win. I've normally seen that only for cons-cell sized or slightly larger objects.) Are there also differences in the amount of thread support that's included in the measurements? E.g. the system malloc usually tests a global to determine at runtime whether it needs to lock. Hans > -----Original Message----- > From: Basile STARYNKEVITCH [mailto:basile@starynkevitch.net] > Sent: Tuesday, February 18, 2003 2:14 PM > To: gclist@iecc.com > Subject: Re: [gclist] why malloc/free instead of GC? 
>
>
> For completeness, I changed my tiny test a bit to allocate smaller
> objects, to take into account Hans Boehm's remark on typical object
> size
>
> ...
>

From jcampbell3@prodigy.net Thu Feb 20 03:09:14 2003
From: jcampbell3@prodigy.net (Larry Evans)
Date: Wed, 19 Feb 2003 21:09:14 -0600
Subject: [gclist] why malloc/free instead of GC?
In-Reply-To: <000d01c2d864$e2ef54a0$5bd1dc0c@sulaco>
References: <000d01c2d864$e2ef54a0$5bd1dc0c@sulaco>
Message-ID: <3E5446DA.1080903@prodigy.net>

Arlie Davis wrote:
[snip]
>
> Also, in environments that mix reference counting with unmanaged heaps,
> such as COM development on Win32, you must also account for the time
> spent in AddRef and Release. Most thread-safe implementations use
> interlocked integer primitives, which are quite costly on SMP machines.
>
> I've done a fair amount of profiling of real-world server apps on Win32,
> and in many implementations, SMP scalability is severely hindered by the
> very high frequency of interlocked operations. In services that make
> heavy use of COM interfaces, reference counting is often one of the
> biggest users of interlocked access.

If smart pointers were used, wouldn't weighted reference counting
[ Richard E. Jones and Rafael D. Lins. _Cyclic weighted reference
counting without delay_ Technical Report 28-92, Computing Laboratory,
The University of Kent at Canterbury, December 1992 ]
alleviate this at the cost of more memory being used by the smart
pointers?
> [snip]

From arlie@sublinear.org Thu Feb 20 05:12:46 2003
From: arlie@sublinear.org (Arlie Davis)
Date: Thu, 20 Feb 2003 00:12:46 -0500
Subject: [gclist] why malloc/free instead of GC?
In-Reply-To: <3E5446DA.1080903@prodigy.net>
Message-ID: <001f01c2d89e$b4b490c0$5bd1dc0c@sulaco>

I believe the short answer is "no". The paper you refer to deals with
discovering cyclic reference loops and dealing with them, especially in
distributed environments.
What I'm referring to is reference counting under Microsoft's COM (using
the IUnknown interface), and the nearly-mandatory implementation of
using interlocked integer access. The environment I'm referring to is a
common one -- a single process hosting multiple threads, executing on
multiple processors, in which all threads may discover, use, and release
reference-counted interfaces.

Also note that implementations of the "weighted reference counting"
described in the paper would suffer the same performance problem, if you
allow for multiple threads to alter the same weighted reference. The
threads will necessarily need to synchronize access to the weighted
reference field. On most current SMP x86 systems, this can only be
accomplished using some form of interlocked access, or techniques that
boil down to the same.

Basically, they are totally different problems.

-- arlie

-----Original Message-----
From: owner-gclist@lists.iecc.com [mailto:owner-gclist@lists.iecc.com]
On Behalf Of Larry Evans
Sent: Wednesday, February 19, 2003 10:09 PM
To: gclist@iecc.com
Subject: Re: [gclist] why malloc/free instead of GC?

Arlie Davis wrote:
[snip]
>
> Also, in environments that mix reference counting with unmanaged
> heaps, such as COM development on Win32, you must also account for the
> time spent in AddRef and Release. Most thread-safe implementations
> use interlocked integer primitives, which are quite costly on SMP
> machines.
>
> I've done a fair amount of profiling of real-world server apps on
> Win32, and in many implementations, SMP scalability is severely
> hindered by the very high frequency of interlocked operations. In
> services that make heavy use of COM interfaces, reference counting is
> often one of the biggest users of interlocked access.

If smart pointers were used, wouldn't weighted reference counting
[ Richard E. Jones and Rafael D. Lins.
_Cyclic weighted reference counting without delay_ Technical Report
28-92, Computing Laboratory, The University of Kent at Canterbury,
December 1992 ]
alleviate this at the cost of more memory being used by the smart
pointers?
> [snip]

From weigelt@metux.de Thu Feb 20 14:54:56 2003
From: weigelt@metux.de (Enrico Weigelt)
Date: Thu, 20 Feb 2003 15:54:56 +0100
Subject: [gclist] glib/gtk w/ GC
In-Reply-To: <029d01c2d8e3$7c705e70$34128aca@z>
References: <20030218130421.GA30555@metux.de>
 <003d01c2d757$03d56b00$d46a86cb@z>
 <20030218171500.GA28255@metux.de>
 <000901c2d805$a3f677b0$34128aca@z>
 <20030219211516.GA25569@metux.de>
 <029d01c2d8e3$7c705e70$34128aca@z>
Message-ID: <20030220145455.GC1764@metux.de>

On Thu, Feb 20, 2003 at 11:25:06PM +1000, Steven Shaw wrote:
> You wish to convince people to use gc?
>
> What I'm trying to say is that some people who use glib wouldn't want gc
> (perhaps because they couldn't live with the downside).

No, I simply want a lib like glib, but gc-based. I'll then start
porting some applications to this one.

> > > You might find some resistance to what you propose because of that.
> > > I guess you are proposing a fork anyways?
> > Well, I don't care about them. A fork-off will be necessary, because
> > this new lib _will_ break the existing interfaces.
>
> Sure. I guess everyone using original-glib can continue to use that if they
> want. Others can adopt the new gc-glib you propose.

Yes, but that's not the whole point. IMHO it is very important that a
library which is meant to be production stable _must_ provide at least
the same interface (or a derived one) as its earlier version, so it can
_always_ be used as a drop-in replacement for the older versions.

> > btw: at this point we also should start defining _strict_ interfaces,
> > which must be 100% reliable: if a version A supports some interface,
> > the following versions _must_ continue providing it.
>
> Tell me more about _strict_ interfaces.
> Why are you so concerned over it?
> Are you proposing something like MS-COM?

No, I'm speaking of library/module interfaces from several points of
view. Let's take some examples:

* glib-1.2-binary-i386:
    + derived from glib-1.1-binary-i386
    + runs on systems which provide an i386 processor environment
    + links clients against glib.so.1.2-i386
    + exported functions (w/ function signatures, ...)

* glib-1.2-binary-i686:
    + derived from glib-1.2-binary-i686
    + runs on systems which provide an i686 processor environment
    + links clients against glib.so.1.2-i686
    + exported functions ...

* glib-1.2-C-include:
    + derived from glib-1.1-C-include
    + provides functions, types, variables, defines
    + provides rules for interface translation at compile time
    + specifies paths, etc.

So suppose we are doing a translation (loading a binary into a VM, or
dynamic linking, is also a translation process, just like compiling
sources to binaries).

When we compile an application against glib, we import the interface
glib-1.2-C-include. The translator now knows everything about glib's
C binding that is necessary to build a glib-based application.
As a product of this translation we have a package which needs glib as
a dynamic library in some specific binary format (e.g. i686), so it
requires the appropriate interface (glib-1.2-binary-i686).

> I wish there was a programming system where it was easy to have constant
> (inevitable) evolution of the interfaces; where old libraries can be used
> side-by-side with new ones.

Yes, I want to enforce this. It's a kind of design-by-contract.

cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux ITS
 Web hosting from 5 EUR/month. UUCP, raw IP and much more.

 phone:     +49 36207 519931         www:       http://www.metux.de/
 fax:       +49 36207 519932         email:     contact@metux.de
 cellphone: +49 174 7066481          smsgate:   sms.weigelt@metux.de
---------------------------------------------------------------------
 This mail was sent via UUCP.
 http://www.metux.de/uucp/

From weigelt@metux.de Thu Feb 20 15:04:36 2003
From: weigelt@metux.de (Enrico Weigelt)
Date: Thu, 20 Feb 2003 16:04:36 +0100
Subject: [gclist] glib/gtk w/ GC
In-Reply-To: <15955.4991.443947.347765@hector.lesours>
References: <20030218130421.GA30555@metux.de>
 <15955.4991.443947.347765@hector.lesours>
Message-ID: <20030220150436.GE1764@metux.de>

On Wed, Feb 19, 2003 at 06:17:51AM +0100, Basile STARYNKEVITCH wrote:

> What kind of GC do you want to use? Your own, or Boehm's?
I'd start with Boehm's, but then try to do some optimizations,
e.g. several pools for different object classes (strings, etc.)

> There used to be some (opensource, but not very popular) toolkits
> above Boehm's GC.
Examples?

> You need some finalization for widgets, because they use system
> resources (eg X11 windows).
Yes, but finalization should not be such a problem.

cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux ITS
 Web hosting from 5 EUR/month. UUCP, raw IP and much more.

 phone:     +49 36207 519931         www:       http://www.metux.de/
 fax:       +49 36207 519932         email:     contact@metux.de
 cellphone: +49 174 7066481          smsgate:   sms.weigelt@metux.de
---------------------------------------------------------------------
 This mail was sent via UUCP.      http://www.metux.de/uucp/

From mhamburg@adobe.com Thu Feb 20 18:40:21 2003
From: mhamburg@adobe.com (Mark Hamburg)
Date: Thu, 20 Feb 2003 10:40:21 -0800
Subject: [gclist] Daily gclist MIME digest V4 #28
In-Reply-To: <200302201020.h1KAKbkn014113@smtp-relay-1.adobe.com>
Message-ID:

on 2/20/03 2:20 AM, gclist-owner@lists.iecc.com at
gclist-owner@lists.iecc.com wrote:

> Also, in environments that mix reference counting with unmanaged heaps,
> such as COM development on Win32, you must also account for the time
> spent in AddRef and Release. Most thread-safe implementations use
> interlocked integer primitives, which are quite costly on SMP machines.
>
> I've done a fair amount of profiling of real-world server apps on Win32,
> and in many implementations, SMP scalability is severely hindered by the
> very high frequency of interlocked operations. In services that make
> heavy use of COM interfaces, reference counting is often one of the
> biggest users of interlocked access.

I can believe that naïve reference counting is expensive. I've worked on
projects that go through a fair number of moderately unsafe contortions
to avoid needless increments and decrements.

The best scheme I've seen so far for dealing with this -- particularly
in a non-GC friendly environment -- has probably been what Apple (NeXT)
did in Cocoa (NeXTStep) with the autorelease pools. These allow most
references on the stack to be passed around with no need to increment or
decrement reference counts.

I suspect that most naïve implementations of COMPtr or RCPtr templates
in contrast increment and decrement the count on each construction,
destruction, and assignment. If you pass through a lot of subroutines,
that gets very expensive very quickly.

Mark

From jmunsin@iki.fi Thu Feb 20 19:15:31 2003
From: jmunsin@iki.fi (Jonas Munsin)
Date: Thu, 20 Feb 2003 21:15:31 +0200
Subject: [gclist] why malloc/free instead of GC?
In-Reply-To: <15953.15577.512872.84744@hector.lesours>
References: <1045507612.1760.58.camel@mwhlaptop>
 <15953.15577.512872.84744@hector.lesours>
Message-ID: <20030220191531.GA17288@nemo.sby.abo.fi>

On Mon, Feb 17, 2003 at 08:49:45PM +0100, Basile STARYNKEVITCH wrote:
> Actually, I'm surprised that today's major opensource projects (like
> Apache, GNOME, KDE...) don't use GC [with the exception of Emacs,
> which used to explicitly show GC periods to user - this was a wrong
> decision, because it made users complain against GC].

There are a few C opensource projects which use gc; w3m is one that
comes to mind.
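
[Editorial sketch, not part of any posting.] On the AddRef/Release cost
discussed in this thread: the following is a minimal C11 illustration of
a thread-safe reference count. It is not COM's IUnknown and not anyone's
actual implementation; the names rc_obj, rc_new, rc_addref, rc_release
and rc_destroyed are hypothetical, and the C11 atomic_fetch_add /
atomic_fetch_sub calls merely stand in for the Win32 interlocked
primitives. On typical x86 compilers each such call becomes a locked
read-modify-write, which is the per-operation cost the thread is about.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical reference-counted object (illustration only). */
typedef struct rc_obj {
    atomic_long refs;                  /* shared count, updated atomically */
    void (*destroy)(struct rc_obj *);  /* called when the count hits zero  */
} rc_obj;

/* Illustration-only counter so callers can observe destruction. */
int rc_destroyed = 0;

void rc_default_destroy(rc_obj *o)
{
    rc_destroyed++;
    free(o);
}

/* Create an object holding one reference; NULL on allocation failure. */
rc_obj *rc_new(void)
{
    rc_obj *o = malloc(sizeof *o);
    if (o == NULL)
        return NULL;
    atomic_init(&o->refs, 1);
    o->destroy = rc_default_destroy;
    return o;
}

/* AddRef analogue: one atomic read-modify-write; returns the new count. */
long rc_addref(rc_obj *o)
{
    return atomic_fetch_add(&o->refs, 1) + 1;
}

/* Release analogue: one atomic read-modify-write; destroys the object
   when the last reference goes away. Returns the new count. */
long rc_release(rc_obj *o)
{
    long remaining = atomic_fetch_sub(&o->refs, 1) - 1;
    assert(remaining >= 0);            /* catches over-release in debug builds */
    if (remaining == 0)
        o->destroy(o);
    return remaining;
}
```

A smart pointer that calls rc_addref and rc_release on every copy,
assignment and destruction pays two atomic operations per subroutine it
passes through; a callee that merely borrows the caller's reference pays
none, which is roughly the saving described above for autorelease pools.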

From mwh@cs.umd.edu Wed Feb 26 19:50:28 2003
From: mwh@cs.umd.edu (Michael Hicks)
Date: Wed, 26 Feb 2003 14:50:28 -0500
Subject: [gclist] controlling heapsize in BDW collector
Message-ID: <1046289029.1519.64.camel@mwhlaptop>

Hi all.

I wonder if anyone can provide some input on how to correctly set the
heapsize for the BDW collector. I'm trying to do some performance
comparisons between GC and non-GC'ed apps, and in particular I want to
examine the tradeoff between memory footprint and latency in a GC'ed
setting. The idea is that the more memory you're willing to allow, the
less latency impact there will be with GC, since you'll collect less
often. And the converse is also true.

So, I have an application that has about a 128K footprint when using
GC_malloc and GC_free, and about a 348K footprint when removing the
GC_free's so that the collector is used. What I'd like to do is force
the heapsize to be somewhere between 128K and 348K (as close to 128K as
possible) while still using the collector, so that garbage collections
occur more often. Then I can assess the latency impact. However, when
I do this by calling GC_set_max_heap_size(max_heap_size), GC_malloc
returns NULL in basically every case unless I set max_heap_size to be
roughly 348K. I also set the GC_use_entire_heap flag to be true, with
the same result.

Why would this be happening? When using GC_free, the heap usage never
rises above 100K, so it's not that I'm allocating a lot of batched
objects and then freeing them all at once. By the same token, I'd be
really surprised if this was some kind of fragmentation overhead (2/3 of
the heap is fragmentation!!!???). The objects being allocated are
relatively large, ranging from 2K to 15K. Finally, spurious retention
also seems unlikely: to be safe I NULL all of the objects that are
allocated (these are packets being forwarded by a proxy), and the
results are the same.

If this is not some kind of limitation with the collector, can anyone
suggest how I would go about debugging this behavior? Turning off
-DSILENT has not been too helpful. Has anyone had success setting the
maximum heapsize to something below what the collector would naturally
come to?

Thanks in advance,
Mike

From hans_boehm@hp.com Wed Feb 26 20:57:44 2003
From: hans_boehm@hp.com (Boehm, Hans)
Date: Wed, 26 Feb 2003 12:57:44 -0800
Subject: [gclist] controlling heapsize in BDW collector
Message-ID: <75A9FEBA25015040A761C1F74975667DA136E1@hplex4.hpl.hp.com>

[I recently set up the gc@linux.hpl.hp.com mailing list for discussions
specific to this collector. I'm not sure that this question is
completely collector specific, but if it were, that would be an
alternative place to ask.]

Do you clear pointers to objects at the same point at which you would
have explicitly deallocated them? Otherwise, I would expect that the
maximum amount of reachable memory is larger than the maximum amount of
malloc/free allocated memory. A factor of 3 seems unlikely, but not
impossible.

You are really operating the collector at a point it wasn't designed
for. In particular, it sounds like you only have on the order of 10
live objects around. The collector will perform suboptimally here for a
variety of reasons:

1) Garbage collectors are inherently not terribly efficient with an
average object size of 10K or so. See the previous discussion on this
list.

2) A conservative collector, or one with otherwise incomplete liveness
information, will typically follow some small number of pointers on the
stack that were used as compiler temporaries, but are really dead. I
would normally expect this number to be on the order of at most a dozen,
and it usually doesn't matter. But with only a dozen live objects ...

3) The collector needs to scan some amount of static data, e.g. owned by
libc, during each collection cycle. Even a 300K heap is too small to
amortize that cost.
(It will try to grow the heap to compensate, though
GC_set_max_heap_size or the GC_MAXIMUM_HEAP_SIZE environment variable
should inhibit that.)

4) The collector's data structures aren't tuned for heaps this small.
The heap expansion increment and some temporary space areas are too
large by default.

If you want to debug this, try building a debuggable collector and
placing a breakpoint in GC_expand_heap_inner(). Looking at the stack at
the last heap expansion generally gives you a good idea why it decided
it needed to grow the heap. Calling GC_dump() at that point should tell
you something about what the heap looks like. (And with a 340K heap,
the size of the dump will be manageable.)

Hans

> -----Original Message-----
> From: Michael Hicks [mailto:mwh@cs.umd.edu]
> Sent: Wednesday, February 26, 2003 11:50 AM
> To: gclist@iecc.com
> Subject: [gclist] controlling heapsize in BDW collector
>
>
> Hi all.
>
> I wonder if anyone can provide some input on how to correctly set the
> heapsize for the BDW collector. I'm trying to do some performance
> comparisons between GC and non-GC'ed apps, and in particular I want to
> examine the tradeoff between memory footprint and latency in a GC'ed
> setting. The idea is that the more memory you're willing to allow, the
> less latency impact there will be with GC, since you'll collect less
> often. And the converse is also true.
>
> So, I have an application that has about a 128K footprint when using
> GC_malloc and GC_free, and about a 348K footprint when removing the
> GC_free's so that the collector is used. What I'd like to do is force
> the heapsize to be somewhere between 128K and 348K (as close to 128K as
> possible) while still using the collector, so that garbage collections
> occur more often. Then I can assess the latency impact. However, when
> I do this by calling GC_set_max_heap_size(max_heap_size), GC_malloc
> returns NULL in basically every case unless I set max_heap_size to be
> roughly 348K.
> I also set the GC_use_entire_heap flag to be true, with
> the same result.
>
> Why would this be happening? When using GC_free, the heap usage never
> rises above 100K, so it's not that I'm allocating a lot of batched
> objects and then freeing them all at once. By the same token, I'd be
> really surprised if this was some kind of fragmentation overhead (2/3 of
> the heap is fragmentation!!!???). The objects being allocated are
> relatively large, ranging from 2K to 15K. Finally, spurious retention
> also seems unlikely: to be safe I NULL all of the objects that are
> allocated (these are packets being forwarded by a proxy), and the
> results are the same.
>
> If this is not some kind of limitation with the collector, can anyone
> suggest how I would go about debugging this behavior? Turning off
> -DSILENT has not been too helpful. Has anyone had success setting the
> maximum heapsize to something below what the collector would naturally
> come to?
>
> Thanks in advance,
> Mike
>

From tkb@tkb.mpl.com Wed Feb 26 21:09:27 2003
From: tkb@tkb.mpl.com (tkb@tkb.mpl.com)
Date: Wed, 26 Feb 2003 16:09:27 -0500
Subject: [gclist] controlling heapsize in BDW collector
In-Reply-To: <75A9FEBA25015040A761C1F74975667DA136E1@hplex4.hpl.hp.com>
References: <75A9FEBA25015040A761C1F74975667DA136E1@hplex4.hpl.hp.com>
Message-ID: <15965.11527.89770.911730@erekose.mpl.com>

Boehm, Hans writes:
> [I recently set up the gc@linux.hpl.hp.com mailing list for
> discussions specific to this collector. I'm not sure that this
> question is completely collector specific, but if it were, that
> would be an alternative place to ask.]

I'll repeat the information from Hans Boehm's web site
http://www.hpl.hp.com/personal/Hans_Boehm/gc/ about subscribing to that
mailing list for quick reference.

We have recently set up two mailing lists for collector announcements
and discussions:

* gc-announce@linux.hpl.hp.com is used for announcements of new
  versions. Postings are restricted.
  We expect this to always remain a very low volume list.
* gc@linux.hpl.hp.com is used for discussions, bug reports, and the
  like. Subscribers may post.

To subscribe to these lists, send a mail message containing the word
"subscribe" to gc-announce-request@linux.hpl.hp.com or to
gc-request@linux.hpl.hp.com. (Please ignore the instructions about
web-based subscription. The listed web site is behind the HP firewall.)

-- 
T. Kurt Bond, tkb@tkb.mpl.com
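
[Editorial sketch, not part of any posting.] On the footprint/latency
trade-off that opens this thread (a larger heap cap means fewer
collections), a back-of-the-envelope model in C may help make the
numbers concrete. This is an idealized estimate under loud assumptions:
the collector runs only when the capped heap is full, every collection
recovers everything except a constant amount of live data, and there is
no fragmentation or conservative retention. It does not describe the BDW
collector's real triggering heuristics, and the function name
gc_cycles_estimate is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Idealized estimate of how many collections are needed to allocate
   total_alloc bytes under a heap cap of heap_limit bytes while
   live_bytes stay reachable throughout.
   Returns -1 when the cap leaves no room beyond the live data, i.e.
   when an allocation would have to fail. */
long gc_cycles_estimate(size_t total_alloc, size_t heap_limit,
                        size_t live_bytes)
{
    if (heap_limit <= live_bytes)
        return -1;

    size_t headroom = heap_limit - live_bytes;

    /* The first `headroom` bytes fit without collecting at all. */
    if (total_alloc <= headroom)
        return 0;

    /* Each further `headroom` bytes requires one collection first:
       ceil(total_alloc / headroom) - 1. */
    return (long)((total_alloc - 1) / headroom);
}
```

With roughly 100K live (the figure reported for the GC_free version) and
an invented 10 MB of total allocation, the model gives 365 collections
under a 128K cap and 41 under a 348K cap. It also returns -1 as soon as
the cap is not larger than the reachable data, which is one reading of
why a cap near the malloc/free footprint could make GC_malloc return
NULL if the reachable set is in fact larger than expected.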