Memory Management, Inlining, and POINTER

iepos@tunes.org
Fri, 31 Mar 2000 13:07:32 -0800 (PST)


> >If one is popping things
> >off, then why were they put there in the first place?
> 
> Um...  Who says they were put there?  You have a reference to a list; you
> want to get the first thing in the list.  Now, that list still may exist
> (someone else may have it); but now you have the first thing as an
> independently considerable item.

This situation could happen in a system that used references, but it
seems like very poor style to me to leave large lists on the stack
(even if you really only leave a reference) when you know you are only
going to use a small piece of the list. It seems like it would be wiser
to arrange things so that, instead of copying a reference to the whole
list earlier, one had just copied the small piece that one really
needed.

The rationale for doing it this way is that it may prevent storage
from being wasted; if one copies the whole list (possibly by reference)
when one really only needs a small piece, then one prevents the rest
of the list from being garbage-collected even though it is not really
needed. Another benefit is that if one only copies the small piece,
it may become reasonable to copy by value instead of by reference.
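
To make the garbage-collection point concrete, here is a rough C sketch
(the list type and all names are mine, not anything from an actual
system): copying just the head by value leaves no pointer into the
list, so the whole list can be reclaimed, whereas holding a reference
to the list just to read its head keeps every cell alive.

    #include <stdlib.h>

    /* hypothetical cons-list of integers */
    struct cell { int value; struct cell *rest; };

    /* Copy only the small piece we need (the head) by value; after
       this, no pointer into the list remains, so the whole list can
       be freed (or garbage-collected). */
    int take_first_by_value(struct cell *list)
    {
        int first = list->value;       /* the small copy          */
        while (list) {                 /* release the rest        */
            struct cell *next = list->rest;
            free(list);
            list = next;
        }
        return first;
    }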

> As for using 'pop': that only works if lists are not 'items' which can be
> manipulated by themselves, OR are item which require special manipulation.
> I wouldn't really enjoy that, but it's up to you.  I do, however, have to
> point out that if this is where you go, then you're not actually avoiding
> 'first' and 'rest'; on the contrary, you're avoiding 'cons'.

Hmm... I don't quite understand. I've never considered avoiding
"cons"; it's one of the more fundamental primitives.

> >There must be new unreleased versions of Dolphin floating around
> >I guess ...
> 
> There's one version, 0.5, which is nearly complete but which few have seen;
> 0.6 is currently being coded (by a different coder), and 0.7 is in
> development, and is planned to incorporate most, if not all, of the features
> involved in 0.7, but take advantage of Forth-based construction for:
> 
>  - faster task switching (there's less state in the virtual processor)
>  - dynamic optimization
> 
> ...and perhaps other things.  Who knows.

This is quite an interesting project...
Hopefully something cool will come out of it.

> >This is wacko... no lists? Or maybe I should have said "array"; you
> >are going to have arrays, right? How are you going to represent
> >strings of characters in the system? Ahh, I see; you're just going
> >to keep the characters in unwrapped form, directly on the stack, right?
> >(:-))
> 
> Grin.  Nope, no lists, and in a general sense no arrays either.  Actually, I
> am supporting _buffers_; if you can grab a chunk of memory you can use it to
> hold anything except the system stack (you aren't allowed to play with
> that).  You could use it to hold an array, and store xts in it; a quotation
> will be an ordered collection of xts.  (xt == Execution Token.)

Okay... so you are going to have a program sort of like malloc() which
can be used to ask the system for a pointer to an available buffer of
some size, right? One question... is the system going to be designed in such
a way that such allocated memory could be physically moved around
(for defragmentation or caching purposes) by the system without disturbing
the program?
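
(For what it's worth, one common way to allow that is double
indirection: the program holds a handle, a pointer to a slot that the
system owns, so the system can relocate the buffer and patch that one
slot. A rough sketch in C, with made-up names; I'm not claiming this is
how your system would do it:)

    #include <stdlib.h>
    #include <string.h>

    /* hypothetical handle: the program keeps the 'handle_t *', the
       system keeps the real pointer inside it and may change it */
    typedef struct { void *ptr; size_t size; } handle_t;

    handle_t *sys_alloc(size_t size)
    {
        handle_t *h = malloc(sizeof *h);
        h->ptr  = malloc(size);
        h->size = size;
        return h;
    }

    /* the system may relocate the buffer at any time (say, to
       defragment); the program is undisturbed because it always
       goes through the handle */
    void sys_move(handle_t *h)
    {
        void *new_place = malloc(h->size);   /* wherever the system
                                                wants the data now */
        memcpy(new_place, h->ptr, h->size);
        free(h->ptr);
        h->ptr = new_place;
    }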

> My current plan is to have three layers of code:
> 
> 1. low-level Forth: written in native x86, implements a simple virtual
> machine.
> 2. quotation system: written in low-level Forth.
> 3. high-level Forth: "outer" interpreter (text), block support (512 byte per
> block, direct hard drive access), optimiser subsystem.
> 
> If all goes well, other languages may be implementable on top of those --
> for example, a Joy-style system.

Sounds like a good plan...

> >> Eventually, everything will be kept in memory.
> 
> >Hrmm... this tells me nothing.
> >Of course everything will be kept in memory.
> >Where else would it be kept?
> 
> On the stack.  You're assuming that the stack is in memory, a false
> assumption even on the x86.  I could keep six whole stack items on the x86,
> which is a virtual infinity of stack space for machine Forth applications.
> Or I could keep the stack in a separate segment, completely inaccessible to
> the data operations of the VM.
> 
> Or I could be using a stack coprocessor.
> 
> Actually, though, I'm sure I'll be keeping the stack in memory.

My misunderstanding comes from the word "memory"; I was thinking
of registers and hard disks as forms of memory... but, you answer
my question below...

> >My question, rephrased:
> 
> >  Will arrays and quoted programs be kept directly on the stack?
> >It looks like the answer is "no", I'm guessing...
> 
> Not unless the user wishes to do so.  I have no objection to them keeping
> stuff whereever they want; I'll have no primitive which cares.  The L1 cache
> will probably throw a bolt if you keep a quotation on the stack, though --
> your 'i' primitive would have to be TERRIBLY sophisticated to handle all of
> the swapping it would require.

Hmm... 'i' (and other unquoting primitives) do pose a bit of a special
problem, in that it does them no good if (part of) the quotation
has been cached in registers; the instructions need to be in memory
if they are going to be executed. Of course, the stack is a fine place
for that to be; I have no problem with jumping onto the stack (:-)).

> >> >The more I think about it, the more I like the idea of just keeping
> >> >everything directly on the stack; of course, a problem that arises
> >> >if one does this literally is that it becomes very expensive to
> >> >shuffle things around on the stack... I'm not sure how serious this
> >> >might be.
> 
> >> Also, you can't cache the top of the stack in a register.  
> 
> >Not so. Of course you can still cache the top few stack items 
> >in registers.
> 
> Not true if the top stack item is a list.

Ah... but it is still true if, say, the list only has two items in it.
In that case, the two items could be kept in two registers, and perhaps
the size (2) would be kept in a third register... If the list was
too big to fit in the registers, then probably only the first part of the
list would be kept in registers.

> >> I have really no idea why you like the idea; it doesn't even
> >> make sense to me.
>
> >Here is why I like the idea: it looks like it could be FAST.
> 
> [...]
>
> Could you clarify that?

Ahh.. I clarify below (or at least explain why I thought it could be fast).

> >For example, when one needs 4KB of scratch space for a sound buffer, 
> >one could just subtract 4K from ESP, and you have
> >allocated memory in a very speedy way (a segfault will be generated
> >and caught if the stack has overflowed). Actually, one would rarely
> >(if ever) just subtract ESP like that; rather, one would just use
> >the program that fills up the buffer, and it would
> >decrement the ESP as it goes along. Anyhow, the point
> >is that one can avoid the overhead of memory allocation (for such
> >a large buffer, this overhead might not be significant; but if
> >one is working with many small lists, then it could be significant).
> 
> Of course.  But one doesn't even have to use the stack for such a thing;
> that's what the Forth dictionary is for.  If you use the stack, you're going
> to spend your time trying to defeat the stack optimiser, as it attempts to
> register-cache the top stack items (which are supposed to be the most
> commonly used), but you wind up using effectively random items.
> 
> In the Forth dictionary, with a single HERE SWAP ALLOT command sequence,
> you've taken an integer stating the size of the buffer you need and returned
> a pointer to the buffer.  The buffer isn't going anywhere; it's not memory
> managed in any reasonable sense.

Okay, I think I see how that works... It looks basically the same
as how one would do it in C... 

The main disadvantage of this that I see is the overhead associated
with using "ALLOT" (and its counterpart, "FREE"?). Typically, I would
guess that it would require a compare of the requested size and
the available size, and then a subtraction from the available size.
It would occasionally also require some stepping through
its linked list of free blocks if there is no more room in the
current one, I suppose.

If one allocates memory directly on the stack, then the
comparison can happen implicitly through a segfault; this
saves a bit of time. Allocating and freeing memory would
take place by subtracting from and adding to ESP, which would
probably be faster than adjusting a separate "available size"
field for a chunk of the heap, since that field would probably
not be in a register.

This saved time could be significant if a lot of small buffers are
being allocated, I would guess.
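
As a rough illustration of the difference (in C, with made-up names;
the real ALLOT in a Forth dictionary may well be simpler than this),
a heap-style allocator pays for an explicit compare and some
bookkeeping on every call, while a stack allocation is just a move of
ESP, with the overflow check left to the fault handler:

    #include <stddef.h>

    /* hypothetical heap-style allocator: explicit bounds check and
       bookkeeping on every allocation */
    static char   heap[1 << 20];
    static size_t heap_avail = sizeof heap;

    void *heap_alloc(size_t n)
    {
        if (n > heap_avail)          /* the explicit compare       */
            return NULL;
        heap_avail -= n;             /* the explicit subtraction   */
        return heap + heap_avail;
    }

    void sound_example(void)
    {
        /* stack allocation: the compiler just moves ESP; the bounds
           check is implicit -- overflowing the stack faults, and the
           fault can be caught and handled */
        char sound_buffer[4096];
        for (size_t i = 0; i < sizeof sound_buffer; i++)
            sound_buffer[i] = 0;
    }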

> >Another reason why I like it is that it avoids the need for pointers
> >(well, probably not entirely)... pointers are bad; I don't like them
> >at all. Whenever you use a pointer, you can't move the thing without
> >updating all the pointers pointing to it. That's the problem with
> >traditional memory management systems (well, my C system, at least):
> >you always end up with holes all over the place because things can't
> >move around. 
> 
> This is, unfortunately, illusory; everything has to work with a pointer
> eventually, and that pointer has to be updated when you move the things it's
> talking about. 

True... eventually, just about everything does have to be manipulated
through a pointer. What I'm suggesting is that it may be useful to
try to minimize the number of pointers needed in the system, by
clustering related data under the same pointer. In particular, if
all data was kept directly on the stack, then you would only need a single
pointer, ESP.

However, as I've admitted before, it would not really work well to keep
_everything_ directly on the stack, because then when you wanted to
insert or delete something deep in the stack you would have too much
stuff in the way. But these kinds of problems, I suspect, can be solved
by careful use of a few more pointers.

> If you're restricting yourself to "linear logic" you don't
> need pointers (in fact, that's the only way to do without pointers in the
> manner you describe); however, if you think about it, you'll realize that if
> you're using linear logic with pointers, you also don't need to worry about
> moving data around, because you KNOW there's only one pointer, and you can
> simply find it and change it.

Yes. If you know you only have one pointer, then it's easier to move things.

> BTW, did you know that in Henry Baker's definition of Linear Logic, he
> requires "exactly one" pointer per data item?

Hmm.. well I guess we can hope that there aren't too many data items, then.

> >Let's face it; pointers are lame. They are slow and must be avoided
> >at all costs.
> 
> You say this, but your evidence is FAR from decisive.

Heh heh...

> >> >Also, it is interesting that you think of using "CALL"... 
> >> >I'm inclined
> >> >to always inline everything, but I guess this might not be practical
> >> >in some cases...
> 
> >> It's impossible (recursive programs) 
> 
> >Surprise. It is not impossible. It is not uncomputable. 
> [...]
> 
> >  foo == [[dup i] cons (foo\ Z)] dup i
> 
> Before we continue, I have to ask whether you realize what the definition of
> 'i' is.  CALL, right?  Right.  You haven't eliminated recursive CALLs.

Wait. 'i' need not be implemented with CALL. One could move the code
from the data stack to the top of the instruction stack. If done
carelessly, this could be very slow (the straightforward way to implement
it would be to JMP somewhere safe, memcpy from the data stack to the
code stack in just the right place, and then JMP back).
This technique, inserting the code into the proper place to begin
with, would eliminate the need for a return address pointer.
But, I'm not claiming it has use beyond that. CALL might not really
be too bad a way to implement 'i', at least in many situations.
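
Here is a minimal sketch of that idea in C (all names are mine; this
is not a claim about how your system, or any real Forth, implements
it): an interpreter whose 'i' splices the quoted program into the
remaining instruction stream rather than CALLing it, so no return
address is ever saved. (The copied nodes are simply leaked here for
brevity.)

    #include <stdio.h>
    #include <stdlib.h>

    enum op { OP_LIT, OP_QUOT, OP_DUP, OP_ADD, OP_PRINT, OP_I, OP_END };

    struct ins {
        enum op           op;
        long              lit;    /* for OP_LIT                     */
        const struct ins *quot;   /* for OP_QUOT: a quoted program  */
        const struct ins *next;
    };

    struct value { long n; const struct ins *q; };  /* data-stack slot */

    /* 'i' without CALL: copy the quoted program so that its tail
       runs straight into the continuation 'cont'; nothing needs to
       be remembered, so there is no return address. */
    static const struct ins *splice(const struct ins *q,
                                    const struct ins *cont)
    {
        if (q == NULL || q->op == OP_END)
            return cont;
        struct ins *c = malloc(sizeof *c);
        *c = *q;
        c->next = splice(q->next, cont);
        return c;
    }

    static void run(const struct ins *ip)
    {
        struct value st[64];
        int sp = 0;
        while (ip && ip->op != OP_END) {
            switch (ip->op) {
            case OP_LIT:   st[sp++].n = ip->lit;                 break;
            case OP_QUOT:  st[sp++].q = ip->quot;                break;
            case OP_DUP:   st[sp] = st[sp - 1]; sp++;            break;
            case OP_ADD:   st[sp - 2].n += st[sp - 1].n; sp--;   break;
            case OP_PRINT: printf("%ld\n", st[--sp].n);          break;
            case OP_I:     ip = splice(st[--sp].q, ip->next);    continue;
            case OP_END:   break;
            }
            ip = ip->next;
        }
    }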

> >Usually, copying-by-reference causes slowness, for these reasons:
> 
> >  1) the slowness of indirection
> >  2) the slowness of reference-counting
> >  3) the slowness of seeking all the way to the other end of the
> >     hard drive (or to another computer, in a distributed system)
> >     to get something because a local copy was not made,
> >     but copying-by-reference was used instead.
> 
> >But in some situations I can see how copying-by-reference would still
> >be appropriate, in situations where one is dealing with many copies
> >of very large things; but these are fairly rare situations, I suspect.
> 
> Then try to avoid using references when you actually need copies.  It seems
> pretty simple to me -- use references when you can, copy when you must.  The
> entire purpose of linear logic is to provide a framework for deciding
> exactly that.  The goal is to _avoid_ as many copies as possible.

I'm going to go out on a limb and use an analogy of questionable
relevance: the human body, in which every cell has a copy of a huge
data structure, the DNA. As well as this setup works, perhaps there
actually is something good about really making copies of things instead
of just copying references.

Back to reality... the issue of when to copy by reference and when
to copy by value does seem like quite a tricky one. An advantage
of copy-by-value that I think may often be overlooked is "locality":
the copy can then be kept right where it needs to be, whereas this
may not be possible with copy-by-reference.

As an example, suppose that you have two data structures that are
stored on opposite ends of your hard disk; there is a large
fragment that both data structures contain, and the system is
considering "factoring" the fragment, eliminating the fragment
in one of the structures and replacing it with a reference to
the fragment in the other data structure. A disadvantage of
doing this would be that when reading this data structure, it
will be necessary to seek all the way to the other end of the
hard disk to get the fragment that was factored (this could be
partially solved by moving one structure so it was closer to the
other). Another disadvantage is that complications will arise if
the fragment is modified or deleted in one of the data structures.
Of course, the big advantage is that you have saved space, which
might also translate into a speed increase in some situations.

> >> Or you could just depend on the fact that the list/array of 
> >> pixels is in
> >> memory, so a 'dup' only duplicates an address.
> 
> >No! That would involve a pointer! Slow slow. 
> >There's no need to bring pointers into this.
> 
> But surely you see that in this case, the pointer-based solution is NOT slow
> at all, but on the contrary, is FAST?

Not really. If the pixel buffer is stored directly on the stack, then
an extra register may be available, since there is no longer a need
to store a pointer to it (or rather, ESP plays the role of the
pointer). This could result in a speed increase. Also, if the pixel
buffer is stored directly on the stack, then its allocation was
probably faster and its deallocation will probably be faster too.

Of course, it may turn out that a pointer would be the best technique
after all, if the pixel buffer was going to be swapped around
a lot or copied, but from the information given there is no indication that
this is the case.

> >> That's the whole purpose of keeping quotations in "array of 
> >> pointers to
> >> function structs" form.  An array of function structs is 
> >> less efficient to
> >> execute than a stream of native code (although not 
> >> disasterously less so,
> >> since the appropriate interpreter is simply a loop 
> >> containing an indirect
> >> call), but it contains (or points to) information which can 
> >> be used to count
> >> executions, profile execution times, and so on.
> 
> >Whoa... a whole array of POINTERS? indirect calls? slow slow slow.
> >seriously, keeping the code in uncompiled form and interpreting
> >from that, as you say, may not be a disastrously bad choice.
> 
> Utter and obvious prejudice.  You clearly have NO clue what you're talking
> about; there have been many benchmarks studying this stuff.

Heh heh... Okay, okay. Guilty as charged.
I'm really not an expert at memory management.

But, I'm a bit confused, as I was mainly agreeing with you that your
technique of interpreting (the first time) would not be that bad.

> My code is flexible; I haven't rejected or accepted
> pointers.  Your code will be bigger than mine, AND less flexible.  It _may_
> be faster, since you're claiming that it is, although I doubt it.  It'll
> require so much more space, though, that all advantage of your speed will be
> utterly lost.

8-(

> >But, some pointers will need to be used in the system.
> >If nothing else, there's ESP and EIP. But, other pointers
> >will be needed too, if we want to be practical. But,
> >it must be controlled. No more pointers than we need.
> 
> Of course.  And no more copies than we need; references are cheaper by far
> than copies, so use references whenever possible.

Hmm... References are probably usually cheaper, yes. Some notable
exceptions are:

  1) if you're dealing with really small pieces of data
  2) if your two copies are being used in quite unrelated parts of the system.

An extreme case of #2 can happen with distributed systems. In this case,
unless a data structure is _really_ big, it would be ridiculous
for two machines to share a copy. But, this can happen on a smaller
scale on disks with significant seek time, as mentioned earlier.
A system might be mostly working in one area of the hard disk when it
suddenly needs to seek way off to another area because a
copy-by-reference was used between two unrelated parts of the system.

> >All pointers that are used must be carefully kept track of,
> >so that things can still move around. 
> 
> This is true.  Each item should have a single 'owner' slot, which refers
> back to its (single) owner; nobody else may EVER have a pointer to it.  When
> the item is moved, the owner is updated. 

This would not support copying-by-reference, but I guess it could easily
be extended to do so by allowing multiple owners.

But I could see how a system could use references extensively while
never using copying-by-reference. In particular, single-owner references
might be used so that swapping around large structures becomes
reasonably cheap, although copying them would remain expensive.
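
A rough C sketch of the single-owner idea (names are mine): each item
carries a back-pointer to the one slot that refers to it, so moving
the item costs a copy plus a single store to patch that slot.

    #include <stddef.h>
    #include <string.h>

    /* hypothetical single-owner item: 'owner' points back at the
       one and only slot in the system that holds a pointer to it */
    struct item {
        struct item **owner;
        size_t        size;      /* total size, including payload  */
        /* ...payload follows... */
    };

    /* move the item; since exactly one pointer to it exists, the
       fix-up is a single store through the back-pointer */
    struct item *move_item(struct item *it, void *new_place)
    {
        struct item *moved = memcpy(new_place, it, it->size);
        *moved->owner = moved;
        return moved;
    }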

> >- "iepos" (Brent Kerby)
> 
> -Billy

There are several approaches to implementing a system that is careful
with its pointers that I've been thinking about; here are some,
listed in order of increasing complexity and dependence on pointers...

  1) Use mainly just two pointers in the system, the top of the data stack,
     and the top of the instruction stack. Keep everything on these
     stacks, and keep them exactly in order without copying-by-reference.

     This is quite a ridiculous approach; it has been recently implemented.
     See http://www.tunes.org/~iepos/prog.c and the "remarks" file
     in the same directory :-)

  2) Use a data and instruction stack, and keep everything on them,
     and in order, and without copying-by-reference, but keep the
     stacks padded in places where insertions are likely to take place
     or where deletions have taken place. Padding may be removed,
     by moving things around, after a region has cooled off.

     Doing it this way, with padding, makes addressing things on the stack a
     little bit tricky; one would probably want to have an addressing
     table somewhere (probably in the form of a binary tree)
     that tells where things are; there would probably need to be
     a pointer for each padded region. Entries in such a table could
     be cached (in registers) in many situations...

     This approach I think may be quite decent for many purposes.
     But, one prominent shortcoming is that it would go very slow if
     one tried to move around the same region over and over again.

  3) Use a similar approach to #2, except that when moving a large
     region, do not physically move it, but instead only modify
     the addressing table so that it appears to have moved.
     After a while, if it has not moved anymore, then do a physical
     move so that the addressing table may be simplified (for, a simple
     addressing table is a good thing to have; the more complex the
     addressing table is, the longer it takes to do translations).

  4) Use a similar approach to #3, except that when copying a large
     region, do not physically copy it, but instead only modify
     the addressing table so that it appears to have been copied.
     
Anyhow, there are surely many variations one could apply to these.
Also, I'm currently looking for a really good way to implement
an addressing table of the kind that these approaches would require.

One format that comes to mind is a binary tree... the table could
begin with a tag indicating whether it is a leaf or a node; then,
if it was a leaf (i.e., it describes an unfragmented chunk of memory),
the tag would be followed by a pointer to the memory. If it was
a node, then it would be followed by two pointers (one to the top
region and one to the bottom region of the stack, which is split),
and an integer indicating the size of the top region.

The very top leaf of the stack would probably be treated specially;
a pointer to it would probably always be kept in ESP. One would
probably arrange things so that a segfault occurs (and is handled)
if one accessed something beneath the top leaf of the stack,
so that the addressing table is only consulted if necessary.
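
To make the format concrete, here is a rough C sketch of such a table
and the translation it implies (the names and layout are mine; a real
implementation would presumably be hand-written in assembly):

    #include <stddef.h>

    enum tag { LEAF, NODE };

    /* a region of the (logically single) stack: either a leaf,
       pointing to an unfragmented chunk of memory, or a node that
       splits the stack into a top and a bottom region */
    struct region {
        enum tag tag;
        union {
            char *memory;                 /* LEAF                   */
            struct {
                struct region *top;       /* NODE: upper part       */
                struct region *bottom;    /* NODE: lower part       */
                size_t         top_size;  /* bytes in the top part  */
            } split;
        } u;
    };

    /* translate a logical offset from the top of the stack into a
       real address; only needed when an access misses the top leaf
       (which ESP addresses directly) */
    char *translate(struct region *r, size_t offset)
    {
        while (r->tag == NODE) {
            if (offset < r->u.split.top_size) {
                r = r->u.split.top;
            } else {
                offset -= r->u.split.top_size;
                r = r->u.split.bottom;
            }
        }
        return r->u.memory + offset;
    }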

But, I feel like I'm missing something.
I'd probably have to try to actually implement this to tell.

I think I may try to implement this sometime.
Maybe in NASM in the form of a little operating system.
Who knows...

As always, thanks for your comments...

- "iepos" (Brent Kerby)