[gclist] Representation of pointer set

Charles Fiterman cef@geodesic.com
Thu, 06 Jul 2000 09:46:57 -0500

At 10:23 AM 7/6/00 -0400, you wrote:
>Thank you for your insightful post.
>Can anyone direct me to a serious study that shows that "most" 
>programs won't require a large pointer set? Ideally one that spans a 
>broad range of languages and types of programs.

What he is saying is that if you keep a mark stack it won't grow very deep.
Since most collectors keep a mark stack and don't crash there should be
easy statistics on that.

>Actually, while I'm at it, the same goes for other things - that most 
>reference counts are one, that most objects are small, that cycles 
>are rare, and so on. I've only seen language specific treatment of 
>these issues, and usually only on a handful of programs.

How common cycles are depends on the data structures represented. Double
linked lists are common except in pure functional languages where you can't
write them. 

>Ideally it would be nice to have a statistical clarification of what 
>"most" means - it would be nice to be able to speak about the 
>pathological cases as being so-many sigmas (standard deviations) away 
>from the mean and stuff like that. This way would could aim for six 
>sigma garbage collection and stuff like that.
>Andrew Chen
>> During the mark phase, a GC will typically keep a set of pointers to
>> objects that have been marked but their children have not yet been
>> marked.  In the worst case, this set can grow very large, but that seems
>> unlikely.
>> Representing the set as a linked list seems inefficient.  An array seems
>> like the right choice.
>> The literature is full of techniques for dealing with overflow of this
>> set.  They seem annoyingly complicated to me.
>> My question is: Why is this necessary, on a machine with a large virtual
>> address space?  Why not just allocate an enormous array in *virtual*
>> memory (1/4 of the address space?)?  Then the array can't overflow.  In
>> most cases, the array will use a reasonably small amount of *physical*
>> memory, so it will be efficient.  If somebody creates an extremely long
>> linked list, then the array might actually eat up large amounts of
>> physical memory, and therefore cause slowness due to paging.  But it
>> seems tolerable to have slowness in unlikely cases.  Pretty much any
>> modern GC has *that* property -- show me an incremental or generational
>> GC, and I can construct a pathological case that will kill its
>> performance.
>> This all assumes a reasonable operating system, which commits physical
>> memory pages only when needed.
>> - Bob