[gclist] Re: finding object prefix from infix pointer

David Chase chase@world.std.com
Fri, 09 Jul 1999 13:56:40 -0400

> || But it looks like we've moved from the topic of implementing read/write
> || barriers to a _completely_ different (and orthogonal, it seems to me)
> || topic of identifying object head from an infix pointer (assuming your
> || implementation uses unguarded infix pointers).

At 02:15 PM 7/9/99 +0200, marc.shapiro@inria.fr wrote:
>I'm interested in a discussion of the alternatives: this is a costly
>Hans, how does the conservative collector do this?

[ Replying semi-redundantly ... ].  The conservative
collector uses BIBOP, or Big Bag of Pages, so the page containing
an object determines its size.  (See earlier messages on this subject).
At one time, power-of-two sizes were used, but I am not sure if that
is still the case.  Note that for the limited case we are considering,
namely offsets within a page, that there are other ways to implement
division besides actually invoking an expensive (or non-existent)
intruction.  In practice, it goes something like this:

  pointer q;
  page p = page_of(q); // e.g. q >> 13 
  index i = index_of(q); // e.g. (q >> 3) & 0x3ff
  int s = size_of(p); // e.g., 1 << logsizetable[p - basepage] or
                     // sizes[ indexsizetable [p-basepage] ]

and now, it depends upon whether you are using powers of two, or actually
doing division.  It may be best to just divide.  However, if that is too
expensive, then you might play this game:

 // s is number of 8-bytes, not number of bytes.

 int s_minus_1_complement = ~(s-1);
 if ((s ^ s_minus_1_complement) - s == 0) {
    // s is a power of two.
    q_rounded = p + offset_from_index(i & s_minus_1_complement); // e.g.
remainder << 8
 } else {
    // s is not a power of two, use iterative subtraction
    // to get remainder, starting small.
    int divisor = first_divisors[s];
    remainder = i;
    while (divisor >= s) {
       int trial = remainder - divisor;
       if (trial >= 0) remainder = trial;
       divisor = divisor >> 1;
    q_rounded = q + offset_from_index(remainder); // e.g. remainder << 8

Because we know that the input is limited to 1023 or less,
first_divisor[i] is equal to the largest number of the form
(2 << k) * i that is less than or equal to 1023 (can actually
be less than, since equal to 1023 only occurs for s = 1 which
is a power of two).  Thus, first_divisor[3] = 768,
first_divisor[5] = 640.

Or, you can play games with multiplication; for a 10-bit input,
you can use

  fraction[i] = ceil((double) (1 << 20) / (double) i)

and thus

  quotient = (i * fraction[i]) >> 20;

but this requires an additional multiply to get you to
the beginning of the object.

(I just wrote the program to test this; one advantage of the small
range of inputs is that you can exhaustively test them all much faster
than you can theorize about what the answer ought to be.
Note that pushing to more than a 10-bit input in a 32-bit word
would require some thought.  My first attempt failed.)

I can also imagine doing something that is a hybrid of BIBOP and some
other scheme.  A page is filled with fixed-size "containers", each of
which contains a small number of objects.   Use BIBOP to map an
address to some object header, not necessarily the one that you
are looking for.  The header will indicate the size, which
allows you to find the next object in the container.
Thus, you might use power-of-two-sized containers for purposes of
fast arithmetic (no division, no multiplication), and possibly chain
one or two objects forward to find the object you're actually after.
The cost of chaining could easily be less than the cost of doing
a division.

It's pretty clear that there's a fair amount of fun to be had hacking
if you are trying to squeeze out the last cycle.

>The plan in our PerDiS system (not yet implemented I think) is simply to
>have a fixed-size bitmap associated with every fixed-sized sub-heap.
>Assuming the allocation granularity is 16 bytes, each bit in the bitmap
>says whether an object starts in this granule or not.  

I would seriously consider using BIBOP.  Right now, the system (a Java
virtual machine) I am working on does not (for various reasons) but BIBOP
has plenty of things in its favor.  Typically, you will pair a BIBOP
heap organization with a multiple-free-list allocator.  If your code
is slightly compiled, you can refer directly to the list for an object,
which makes allocation competitive with what you might get from a
compacting collector.  At the machine language level, on friendly
machines, this means that you can also use compare-and-swap or
load-linked-store-conditional to get (relatively) fast multithreaded
access to your lists, using the idiom:

  the_list = lists + k_size; // C pointer arithmetic
  candidate = *the_list;
  if (candidate == 0) /* do something slow */
  candidate_next = candidate -> next;
  if (! compare_and_swap( the_list, candidate, candidate_next) ) /* do
something slow */

Sadly, on at least one popular architecture (Pentium) compare_and_swap
requires bus locking, which is mighty slow.  Another alternative is to
use one heap (set of sized pages) per thread, perhaps allocating them
lazily to the threads as needed.

Test program for division-into-multiplication stunt.
#include <stdio.h>
#include <math.h>

test (int k) {
  unsigned int i, j;

  for (i = 1; i <= 1024; i++) {
    unsigned int mf = ceil((double) (1 << k) / (double) i);

    for (j = 0; j < 1024; j++) {
      unsigned int trial = (j * mf) >> k;
      unsigned int actual = j / i;
      if (trial != actual) {
	printf("Stunt failed for k = %d, j = %d, i = %d, actual = %d, trial = %d\n",
	       k, j, i, actual, trial);

int main() {
  int i;
  for (i = 22; i > 15; i--) {

David Chase
NaturalBridge LLC