>
> > A split between local, marginal, and global registers would be an
> > interesting thing to do, and I can see it making the code more elegant. I
> > worry about it making things more complex, though, especially with us
> > already having multiple register types. (We'd double or triple the number
> > of register types essentially, and to some extent blow cache even more
> > than we do now. Might be a win in other ways, though. I'll have to ponder
> > a bit)
>
> Yeah, I didn't like the idea of proliferating that more either. I still
> sometimes dream about a single register file of N regs into which we can
> put whatever we want. Each block of registers has room for the reg
> contents and the type info too. Seems you've got some of the support for
> that figured out in the stack already. Just declare that either (a) it
> is illegal (or behavior undefined) to do
>
>   set $2, 5
>   set $3, "foo bar"
>   add $1, $2, $3
>
> [just because we have higher-level data types than a "real" machine
> doesn't mean we can't still have general-purpose registers, I think]
>
> or (b) that if you do something numeric with a register that is
> non-numeric type mucking happens behind the scenes and throws an
> exception if there is a problem. Certainly this wouldn't be surprising
> to anyone who had been looking at what we do with PMCs and arithmetic
> ops.
>
> If we ever did move to such a single-register-file model, I'd support
> looking seriously at the calling conventions of MMIX to see if we can
> get the appropriate performance characteristics. And, BTW, we have
> 4*32 = 128 regs now. We could even match the logical register count
> of MMIX (256) with only a doubling of total register count. And, if
> we ever determined we needed another kind of register (such as one
> that can be used for address arithmetic, since INTVAL doesn't cut it),
> we wouldn't have to add a fifth file, we'd just add another type
> (thinking again about the stack implementation).
>

After reading the entire MMIX chapter, my mind went back and forth.  First of 
all, the only reason that 256 registers were used was because of the byte 
aligned register arguments to the op-codes (plus modern intuition is 
that more registers = faster execution).  Currently we have 4B arguments and 
don't perform boundry-condition checks, so this is neither here nor 
there.  Currently we translate P1 to:

interpreter->num_reg->registers[ cur_opcode[1] ]

Which requires 4 indirections.  (multiple Px instances should be optimized by 
gcc so that subsequent accesses only require 2 indirections) To avoid 
core-dumping on invalid arguments, we could up the reg-set to 256  
and convert the above to:

interpreter->num_reg->registers[ (unsigned char)cur_opcode[1] ]

and adjust the assembler accordingly.  Alternatively to work with 32 regs, 
we'd have:

interpreter->num_reg->registers[ cur_opcode[1] & 0x001F ]

As for MMIX.  I don't see a need for globals, since we're going to have 
various global symbol stashes available to us.  Further, I don't see a value 
in providing special trapping code to return zero when reading from a 
Marginal or extending the local variable space when writing.  For writing, an 
explicit "reserve" (once (or less)per function call) shouldn't be too much 
bother.  And if the code is silly enough to write or read from this 
"marginal" region, then we'll pretend that they're using uninitialized 
values.  Further, the "reserve"er must require that there is enough space to 
fully utilize the n-register set (currently 32), so that set $r31, 5 doesn't 
spill into the tail of the register set (since that was previously handled by 
the trapping code).  This modifies the MMIX spec such that we potentially 
waste up to 31 register slots in the register window (which is trivial when 
we have a window of size >= 1024).

The rolling register set can be accomplished via three methods.  First, 
realloc the register stack every time an extend exceeds the size.  (This 
sucks for recursive functions).  Second use paged register sets and copy 
values durring partial spillover (very complex).  Lastly utilize a [pow2] 
modulous on a fixed size register stack.  This has a very interesting 
implication; that we completely do away with the (push|pop)[ipsn] and their 
associated data-structures.  Thus the P1 translation becomes:

//(for a 1K rolling stack)
#define STACK_MASK 0x03FF 
interpreter->num_reg[ ( interp->num_offset + cur_opcode[1] ) & STACK_MASK ]

This has 4 indirections and two integer ops.  As above, for multiple uses of 
Px in an op-code, this should be optimized to 2 indirections.  Since 
indirections are significantly slower than bitwise logical operations, this 
should be roughly equivalent in speed to our current interpreter.  If we were 
hell-bent on speed, we could utilize STACK_SIZE alligned memory regions and 
perform direct memory arithmetic as with:

interp->x_reg_base = [ ... ] 1 K chunk of memory aligned to 1K boundry
interp->x_reg = interp->x_reg_base + interp->x_offset

#define P1  *(int*)(((int)( interp->x_reg + cur_opcode[1] )) & STACK_MASK)

which only requires 2 indirections + 1 logical operator for the first access 
(and 1 indirection for subsequent accesses).

This is compatible with most of the existing current core-logic.

Another interesting aspect is that push/pop only occurs at subroutine 
call/return time (as with MMIX).  This enhances performance since fewer ops 
need execute for a subroutine call, while at the same time reduces the 
likelihood of stack-corruption (since only two ops modify the register stack).

If we unify the register sets, then there is less slack space and more 
importantly better cache utilization.  Additionally, the order of all named 
parameters becomes implicit.  There is no confusion as to what register 
parameters go onto the stack and in what order. Additionally we could make 
NUM_REGISTERS really large like 256 or even 1024, since we only allocate as 
many as a subroutine requires.  On the flip side, a garbage collector is 
going to have fits figuring out PMC references (unless additional data-type 
info is stored next to each register.  Only two things need change, first, 
strings and PMCs become pointers to string and PMC structures instead of 
inlining the data (so that everything is a unioned 64bits), and any op-codes 
or vtable ops need to work with the extra indirection.

If we maintained 4 separate register sets, we could simply produce the call / 
return op-codes with multiple reg-set save-values as:
call_I_i_i_i_i
return_i_i_i_i

since we're not constrained to 4 Bytes for our RISC-like op-code that Knuth 
was.

They would be somewhat complex, but more efficient than manual pushing onto 
the stack.

We'd also need the explicit reserve_i_i_i_i opcode at the beginning of a 
subroutine it needed to use additional local variables.

According to Knuth, the call routine should be trivial, since we're just 
incrementing the base register address (not allocating any new space).  The 
reserve routine (equiv to Knuth's writing into marginal registers) might need 
to save old registers (at the tail of the reg-window) to the hidden stack.  
There's obviously more work involved for 4 separate registers than with just 
one big one.  Lastly the return routine might need to restore values from the 
stack to the register set.

I'm in the process of writing up some proto-type code to see how well this 
benchmarks.

Since this avoids the issue of saving local variables to the stack, this 
competes against the aforementioned standard of using R0..4 for IO 
parameters, but I think it successfully avoids the use of an external 
parameter stack, since I doubt that more than 32 perl6 explicitly prototyped 
parameters will be required.. (256 if we unify the register sets).  Perl5 
just uses a single list-PMC to my understanding.


> [snip]
>
> > >Yeah. I'm trying very hard not to put anything really sophisticated
> > >into jakoc (at least not yet). Right now I can still tweak things
> > >reasonably well. If I add much more complexity, I'm going to have
> > >to actually write a real compiler, and if I write a real compiler I
> > >probably won't be able to resist the temptation to turn Jako into the
> > >language I *really* wish I had, and that would be a bigger project.
> >

This simplifies the compiler since all it needs to do is add reserve_i_i_i_i 
at the begining of each routine which should include the max number of 
parameters it'll use in a function call.  Around each function call, it'll 
just use top-num_params as the save-value to call, thus being a little 
wasteful since it's not reusing registers, but this is much simpler than 
writing stack management code from the compiler.

-Michael

Reply via email to