[gclist] My copying collector or Boehm's?

Ji-Yong D. Chung virtualcyber@erols.com
Fri, 16 Mar 2001 18:20:59 -0500


Hi,

    There is always a good chance, given my relative inexperience,
that I really could be screwing up with my use of your collector.  At least
it is not crashing :)

> [I wrote]
> >     With Boehm's collector, my interpreter runs about 2.5
> > times slower than before.  This is not a knock on Boehm's
> > collector.

> [you replied] I still find that surprising, since it suggests that your
interpreter is
> spending >60% of its time allocating/collecting with our collector,
> something I don't normally see.  A profile and GC log would be interesting
> to me.  In particular, it would be nice to know:
> a) How much time is spent in the marker?
> b) How much time is spent locking around allocation?  (The win32 threads
> port uses EnterCriticalSection() and LeaveCriticalSection().  On many
other
> platforms a custom locking scheme is used instead, since the standard one
> exhibited serious performance problems, some form of convoying being the
> most common.

    Only 1 thread was really active in my test runs, so I'd guess that
 convoy did not have a chance to form.

    In any case, I did two profile runs on VC++6.0.  One for my Scheme
interpreter's
main evaluation loop, and another one for GC_malloc.  (While it would have
been
preferable to profile them in the same run, I had slight difficulty with
getting my profiler
to run correctly).

Both profiles were taken, with my interpreter running the same scheme source
code.


=====================================================================
||     This profile run is on the Evaluation loop of my Scheme interpreter.
|| This is dificult to read, but there as supposed to be 6 columns.
|| They are:
|| (1) function time in milliseconds,
|| (2) function time as percent of overall measured time
|| (3) function+child time,
|| (4) function+child time as percent of overall measured time
|| (5) hit count
|| (6) name of the function.


Profile: Function timing, sorted by time
Date:    Fri Mar 16 17:37:57 2001

Program Statistics
------------------
    Command line at 2001 Mar 16 17:34: tpscript.exe
    Start function: _GC_malloc
    Total time: 18652.854 millisecond
    Time outside of functions: 7747.360 millisecond
    Call depth: 13
    Total functions: 3854
    Total hits: 5216935
    Function coverage: 2.9%
    Overhead Calculated 6
    Overhead Average 6

Module Statistics for tpscript.exe
----------------------------------
    Time in module: 6488.312 millisecond
    Percent of time in module: 59.5%
    Functions in module: 3589
    Hits in module: 2337349
    Module function coverage: 0.1%

        Func          Func+Child                           Hit
        Time   %                 Time                      %      Count
Function
---------------------------------------------------------
    6475.395    59.4    10726.211         98.4          2336757
_GC_malloc (gc.dll)
      11.505       0.1       28.241                 0.3          148
_GC_default_push_other_roots (gc.dll)
       0.838       0.0        1.082                   0.0              148
_GC_push_all_stack (gc.dll)
       0.573       0.0        1.086                   0.0          148
_GC_push_finalizer_structures (gc.dll)
       0.000       0.0        0.000                   0.0              148
Script::SchemeVM::PushRoots(void)

Note* -- As you indicated in your email, my program spends most
the time in _GC_malloc.  I think this included the time spent on
garbage collection as well.

============================================================
|| Here is the measurement on your GC.dll

Module Statistics for gc.dll
----------------------------
    Time in module: 4417.182 millisecond
    Percent of time in module: 40.5%
    Functions in module: 265
    Hits in module: 2879586
    Module function coverage: 39.6%

        Func          Func+Child                 Hit
        Time      %         Time      %      Count  Function
---------------------------------------------------------
    2097.812  19.2     2112.566  19.4    10188 _GC_mark_from (mark.obj)
     559.504   5.1      559.504   5.1    13777 _GC_reclaim_clear
(reclaim.obj)
     125.837   1.2      171.310   1.6  2380112 _GC_malloc (malloc.obj)
     111.131   1.0      133.912   1.2     2093 _GC_build_fl (new_hblk.obj)
     105.330   1.0      384.709   3.5      296 _GC_apply_to_all_blocks
(headers.obj)
      99.517   0.9     4245.413  38.9    18596 _GC_generic_malloc
(malloc.obj)
      75.057   0.7      863.804   7.9    18744 _GC_continue_reclaim
(reclaim.obj)
      74.471   0.7       74.471   0.7     1794 _GC_reclaim_clear2
(reclaim.obj)
      66.135   0.6       66.135   0.6    25577 _GC_clear_hdr_marks
(mark.obj)
      65.022   0.6     4058.541  37.2    18596 _GC_allocobj (alloc.obj)
      56.553   0.5      188.229   1.7    24667 _GC_reclaim_block
(reclaim.obj)
      53.679   0.5      750.716   6.9    16503 _GC_reclaim_generic
(reclaim.obj)
      49.125   0.5     4107.725  37.7    18596 _GC_generic_malloc_inner
(malloc.obj)
      48.739   0.4       73.931   0.7    22284 _GC_block_nearly_full
(reclaim.obj)
      45.451   0.4      300.788   2.8      148 _GC_finish_collection
(alloc.obj)
      44.964   0.4      111.610   1.0    14328 _GC_allochblk_nth
(allchblk.obj)
      42.626   0.4     2286.872  21.0    11816 _GC_mark_some (mark.obj)
      38.236   0.4       38.236   0.4    24815 _GC_next_used_block
(headers.obj)
      38.181   0.4       38.181   0.4    18596 _GC_write_hint (os_dep.obj)
      38.171   0.4       38.171   0.4    18596 _GC_invoke_finalizers
(finalize.obj)
      38.030   0.3      788.746   7.2    16503
_GC_reclaim_small_nonempty_block (reclaim.obj)
      36.689   0.3       36.689   0.3      148 _GC_read_changed
(stubborn.obj)
      35.890   0.3       91.150   0.8    24667 _clear_marks_for_block
(mark.obj)
      32.599   0.3     2382.769  21.8      148 _GC_stopped_mark (alloc.obj)
      31.552   0.3       50.877   0.5    18596 _GC_clear_stack (misc.obj)
      26.944   0.2       26.944   0.2      780 _GC_reclaim_clear4
(reclaim.obj)
      26.059   0.2       81.437   0.7     1332
_GC_push_next_marked_uncollectable (mark.obj)
      25.969   0.2       25.969   0.2    25406 _GC_block_empty (reclaim.obj)
      21.913   0.2       21.913   0.2      445 _GC_build_fl_clear4
(new_hblk.obj)
      21.005   0.2      135.505   1.2     2241 _GC_allochblk (allchblk.obj)
      19.420   0.2       19.420   0.2    18744 _GC_approx_sp (mark_rts.obj)
      17.333   0.2       17.333   0.2     9792 _GC_add_to_black_list_stack
(blacklst.obj)
      16.263   0.1       17.142   0.2     1184 _GC_push_marked (mark.obj)
      15.211   0.1      252.143   2.3      148 _GC_start_reclaim
(reclaim.obj)
      14.620   0.1       14.620   0.1      296 _GC_get_lo_stack_addr
(win32_threads.obj)
      13.995   0.1       13.995   0.1      148 _GC_clear_bl (blacklst.obj)
      12.799   0.1       12.799   0.1    11073 _GC_block_nearly_full3
(reclaim.obj)
      12.394   0.1       12.394   0.1    10472 _GC_block_nearly_full1
(reclaim.obj)
      12.031   0.1       15.668   0.1     2118 _GC_add_to_fl (allchblk.obj)
      11.344   0.1       11.344   0.1      148 _GC_stop_world
(win32_threads.obj)
      10.281   0.1       16.618   0.2     2093 _GC_get_first_part
(allchblk.obj)
       8.093   0.1       32.655   0.3     1939 _GC_freehblk (allchblk.obj)
       7.453   0.1        7.453   0.1    11816 _GC_default_oom_fn (misc.obj)
       7.453   0.1        7.453   0.1    11816 _GC_never_stop_func
(misc.obj)
       7.368   0.1      276.784   2.5     2241 _GC_new_hblk (new_hblk.obj)
       7.141   0.1        7.141   0.1      148 _GC_start_world
(win32_threads.obj)
       6.898   0.1       22.951   0.2     2094 _setup_header (allchblk.obj)
       6.653   0.1        6.653   0.1     2102 _GC_is_black_listed
(blacklst.obj)
       6.282   0.1       11.214   0.1     2093 _GC_install_counts
(headers.obj)
       6.064   0.1        6.981   0.1     1939 _GC_free_block_ending_at
(allchblk.obj)


============================================================

> c) What fraction of the time is spent context switching (see (b)).

    I have just two threads, one of which is taking a nice nap while
the other is running.  (So there is not much overhead on that).

> d) Does the amount of live data in the GC log look right?

    I have not yet looked into this -- I have to dig into your package to
find how to use GC_log.

> > The drop off in performance was expected,
> > because (1) my original collector is turned on and off
> > at precise points in my C++ code to minimize
> > collection
>
> That's potentially a big win, clearly.  You can
> also do that with our collector, though it may be much
> less effective with a global heap than per-thread heaps.

    Do you mean, provided that I can make your collector to use per-thread
heap, I can disallow GC_malloc from checking whether
it should garbage collect?

> > (2) my collector uses type information all the time,
> My experience has been that for small objects this matters in the expected
> case only to the extent that it reduces the overhead of checking a real
> pointer, i.e. only if you can actually reduce checking on a "pointer"
field
> to a comparison against null.  [snipped, as the rest is a bit over my
> head]

    And indeed, there is no need for pointer checking, as it does
use the type information.

> > (4) heap residency
> > was low for the test cases. -- which favors copying
> > collector over mark-sweep.
> Somewhat.  But it also helps our collector a lot, provided you similarly
> increase the heap size.  (This should work better in 6.0 than in the 5.x
> releases.)  If it doesn't help, I would definitely suspect the locking
code.

    I have increased the starting heap size for your collector to about
1 Meg, but I have not yet seen substantial speedup yet.  I will try
2 Megs.