[gclist] reference counting

Boehm, Hans hans_boehm@hp.com
Thu, 14 Sep 2000 16:01:26 -0700

> -----Original Message-----
> From: Manoj Plakal [mailto:plakal@cs.wisc.edu]
> Sent: Thursday, September 14, 2000 2:08 PM
> To: Boehm, Hans
> Cc: 'Jeff Sturm'; gclist@iecc.com
> Subject: Re: RE: [gclist] reference counting
> Boehm, Hans wrote (Thu, Sep 14, 2000 at 10:00:00AM -0700) :
> > If I understand it correctly, this is different from a 
> cache that keeps
> > multiple valid bits per line.  But it should help with our 
> collector.  At
> > least in some simple benchmarks a large fraction of the 
> cache misses on
> > write occur when an entire page is being written.  Thus it 
> should be safe to
> > ignore the previous contents of the page, and hence cache line.
> 	Correct me if I'm wrong, but ...
> 	I think there is some confusion here. Having multiple
> 	valid bits per line is what is called sub-blocking,
> 	if I remember my Hennessy & Patterson correctly. This
> 	reduces the size of data exchanged between the cache
> 	and the memory system, thus reducing conflict misses,
> 	false sharing and bandwidth. This is completely
> 	transparent to software.
> 	What you seem to need is an "overwrite-cache-line"
> 	kind of store instruction which tells the cache that it
> 	does not need to fetch the line in case of a miss
> 	since the data is going to be over-written. Seems kind
> 	of tricky for the software since it has to guarantee
> 	that it will over-write a whole cache line's worth
> 	of data, and make sure that there are no parts of
> 	the cache line that are "left over" and which need
> 	to be fetched (I guess this is where sub-blocking
> 	would be useful).

I agree there's some confusion, and the terminology is also what I remember.
But I think the two ideas are closely related here.  I'll try to summarize
the issues here, though I'd appreciate correction from some of the original
paper authors.

A number of the older GC measurements, including the one Henry Baker
originally referred to, were made on machines with sub-blocking, notably
certain models of MIPS-based DECstations(?).  The net effect of this was
that if you used a bump-the-pointer allocator, and generally wrote things in
subblock increments, you could write, and subsequently read back, previously
uncached memory with only memory traffic needed to write the evicted cache
lines, and essentially no stalls for memory.  In particular, there was never
a need to read the rest of the cache line that you were writing, because it
could simply be marked invalid by the hardware and, in almost all cases,
would quickly be overwritten anyway.

I agree that this is transparent to software unless you look at the delays

Unfortunately, very little (no?) modern hardware supports subblocking with
the right granularity.  Empirically, it is very hard to get most modern
hardware to avoid reading a previously uncached (i.e. cachable, but not
currently in cache) line that's being written.  This often seems to be true
even if you know you are writing large sequential regions of memory, at
least if you want the resulting memory to end up in cache.  I conjectured
that the reason for this is that subblocking is hard to implement on
multiprocessors, but I really don't know whether that's true.

I think it's interesting to ask whether one can get the same effect (no
cache fills if you're only writing) with an instruction like the one the
Alpha apparently has, which I gather declares your intention to overwrite
the entire line.

> 	What might also be useful is the latest addition of
> 	streaming multimedia instructions to various
> 	CPUs: e.g., the SSE2 instructions in Pentium-III
> 	include some cache-bypass loads/stores which might
> 	be useful for not polluting the cache e.g., doing	
> 	the sweep phase of GC. I think you can load 
> 	relatively large chunks into a set of registers
> 	so you don't have to go to memory for each word.
The problem with this is that when you're allocating, you want the result in
the cache.  And things aren't really any different in the sweep phase, since
you shouldn't be sweeping a cache line unless you're about to allocate it,
again for reasons of memory traffic.  I think the only time a garbage
collector might want to bypass the cache is when it's marking or copying,
since that phase exhibits very little temporal locality.