Some notes concerning the VM :)

Brian Rice water at tunes.org
Sun Apr 29 10:53:35 PDT 2007


On Apr 28, 2007, at 7:25 PM, sig wrote:

> Hello, I'm new here and have a particular interest in developing a
> Slate VM.
> I haven't decided yet whether I will participate in this project in
> the near future, so for now I just want to discuss some of my ideas
> with you.
>
> Concerning the OMM (object memory model) and stack organization.
> I'm referring to the tables which you can see on this page:
> http://slate.tunes.org/doc/mobius/
>
> 5.1.1 stack format.
>
> Table 3. Block format.
>
> I see little reason why selectors must be kept apart from literals.
> The only place I see them handled differently is at the
> interpretation stage, which uses the distinct opcode 'Load Selector'.
> If there's no extra processing beyond that, we can merge selectors
> into the literals slot and remove the 'Load Selector' opcode.
>
> This is the code showing that they are handled in the same manner:
>
> static INLINE void PSInterpreter_loadLiteral_(struct Interpreter *i,
>                                               unsigned long int n)
> {
>   PSInterpreter_stackPush_(i, i->method->literals->elements[n]);
> }
>
> static INLINE void PSInterpreter_loadSelector_(struct Interpreter *i,
>                                                unsigned long int n)
> {
>   PSInterpreter_stackPush_(i, i->method->selectors->elements[n]);
> }
>
> Maybe I don't see potential optimization patterns which might utilise
> such a distinction? Please let me know.

Well, I can say that for introspection it is easier to tell what
selectors a method sends if they are in a separate array, so it's
easier to track down a call graph without referring to source code.
Granted, in the general case plenty of those literal Symbols are also
used by sendTo: and variants, or Symbols can be generated at run-time
by the method, but this makes the base case easier. Also granted,
just doing "method literals select: [| :each | each is: Symbol]"
isn't that much harder (although is: is kind of expensive in
aggregate). So I wouldn't fight much for this one.

Perhaps the only other defense is how well the method format scales  
up, that is, how does method size grow with source length. The truly  
anomalous case is the bootstrap method... in an image bootstrap, what  
happens is that the Slate system compiles all the core source files'  
contents into a *single* bytecode method which is then marked as the  
starting point for kernel image execution, at which point it is run  
in order to build the system and then discarded.

> The same goes for the 'Push Integer' opcode. Encoding an integer
> constant in the bytecode may be somewhat faster than an extra lookup
> in the literals array, but only when the constant is between 0 and
> 14; otherwise it requires extra processing: loading the next byte
> from the opcode array, adding 15 to its value, and also checking that
> its high bit is 0... this extra processing may be even more costly
> than simply loading the literal from the literals array.

It's also a question of what the common case of integer literal is.  
Perhaps FF could indicate that the next number is an index into the  
literals array, so that tiny integers can be read directly and the  
rest is just an array lookup.

> You may say that merging all these opcodes to use a single literal
> array will lead to a higher number of literals in methods, which in
> turn leads to a higher probability of a literal index greater than 14
> in opcodes, and thus will increase overall instruction-processing
> time as well as bytecode size.

Yes, that would be true, but there are alternative approaches such as
the one I suggest above.

> But we have 2 unused bytecodes, yes? Then why can't we reuse them to
> load literals with an index higher than 14, which already does not
> fit in a single-byte opcode?
> So, for high literal indexes we can introduce Load Literal2, and even
> Load Literal3
> - the first takes its index from the next byte of the instruction
> stream, the second takes the next two bytes for insanely big indexes
> higher than 255.
> Both of these instructions can be placed under a single opcode index
> but with different immediate values, which tell the interpreter how
> many bytes of the instruction stream are reserved for the index.
> Honestly, I don't care much about a compact bytecode representation;
> I care about speed and uniform data organization.
> Using an immediate value in the opcode which, if > 14, must be added
> to the next byte requires additional processing - adding 15 to the
> next byte instead of just loading the next byte(s) and using them
> directly as the final index value.
> I'd rather use the 4 bits of the 'Immediate value' to indicate the
> exact number of bytes in the bytecode stream reserved for the index
> value, instead of using the immediate value directly and also
> checking the following bytes for the high bit (0x80) and so on.
>
> Why must the interpreter spend its time calculating the index value
> (see the PSInterpreter_decodeImmediate function) when the value is
> already known at compilation time, and we can already say whether it
> fits in 1 byte or 2 or 3?

These are good points, and worth considering, but I think it'd be  
worth some profiling time to know what the real costs/benefits are.

> The next thing concerns stack organization.
>
> Those of you who know the needs of Slate better than me may find
> this unsuitable, less effective, or more complex. But first, read to
> the end. :)
>
> Let's review the Frame Format (in Table 2).
> The need for a stack frame is the need to track the current context,
> the return point, the current instruction pointer, and the previous
> stack frame to which we must return on a local return.
> Other values on the stack, like arguments and locals, are pushed by
> opcode instructions, and their number depends on the current
> execution point and context, etc.
> But frames always have a fixed number of variables.
> At some moment I asked myself: what if we organized a separate stack
> for frames? Could it improve execution speed as well as reduce
> complexity of some sort?
>
> So, let's divide our stack into two parts: in the first part we
> store all variables used by methods/blocks, while in the second we
> store only stack frames, which have fixed sizes.
> First, the evident gain: we get much simpler methods for traversing
> between stack frames - since they have a fixed size, we can traverse
> the frame stack by simply adding/subtracting a constant value from
> the frame stack pointer.
> Second: we don't need a Pop instruction in the bytecode. Pop is used
> to remove values from the stack when returning from a method/block
> closure, but our stack frame already knows its return point, so on
> returning from a method we discard all values up to the point stored
> in the frame's return point. Simple.
> Third: accessing the lexical scope for nested blocks and/or checking
> for an expired context requires much less processing.
> Fourth: most of the variables responsible for the current
> interpretation state can be moved from the Interpreter struct into
> the Frame struct, leaving only a pointer to the current Frame; then
> we don't need to reload these variables each time we enter/return
> from a method. We simply initialize them for the new Frame on
> entering and decrement the Frame pointer on exiting a method. No
> extra processing.

Yes, these things are all true and your analysis seems sound.

I think the balancing factor which Lee had in mind was the ability to  
share the stack format between the bytecode system and the more  
complex binary compiler run-time. I don't recall his reasoning on  
this, or how well he documented it, however. Perhaps simply by  
knowing this, you might see where his mind was going when he made  
these decisions.

> --- there are so many things left to discuss, but I must go sleep :)
>
> I hope my ideas were worth reading :)

Yep, thanks for sharing and for being interested in the first place.

--
-Brian
http://briantrice.com
