> > > A split between local, marginal, and global registers would be an > > interesting thing to do, and I can see it making the code more elegant. I > > worry about it making things more complex, though, especially with us > > already having multiple register types. (We'd double or triple the number > > of register types essentially, and to some extent blow cache even more > > than we do now. Might be a win in other ways, though. I'll have to ponder > > a bit) > > Yeah, I didn't like the idea of proliferating that more either. I still > sometimes dream about a single register file of N regs into which we can > put whatever we want. Each block of registers has room for the reg > contents and the type info too. Seems you've got some of the support for > that figured out in the stack already. Just declare that either (a) it > is illegal (or behavior undefined) to do > > set $2, 5 > set $3, "foo bar" > add $1, $2, $3 > > [just because we have higher-level data types than a "real" machine > doesn't mean we can't still have general-purpose registers, I think] > > or (b) that if you do something numeric with a register that is > non-numeric type mucking happens behind the scenes and throws an > exception if there is a problem. Certainly this wouldn't be surprising > to anyone who had been looking at what we do with PMCs and arithmetic > ops. > > If we ever did move to such a single-register-file model, I'd support > looking seriously at the calling conventions of MMIX to see if we can > get the appropriate performance characteristics. And, BTW, we have > 4*32 = 128 regs now. We could even match the logical register count > of MMIX (256) with only a doubling of total register count. And, if > we ever determined we needed another kind of register (such as one > that can be used for address arithmetic, since INTVAL doesn't cut it), > we wouldn't have to add a fifth file, we'd just add another type > (thinking again about the stack implementation). >
After reading the entire MMIX chapter, my mind went back and forth. First of all, the only reason that 256 registers were used was because of the byte aligned register arguments to the op-codes (plus modern intuition is that more registers = faster execution). Currently we have 4B arguments and don't perform boundry-condition checks, so this is neither here nor there. Currently we translate P1 to: interpreter->num_reg->registers[ cur_opcode[1] ] Which requires 4 indirections. (multiple Px instances should be optimized by gcc so that subsequent accesses only require 2 indirections) To avoid core-dumping on invalid arguments, we could up the reg-set to 256 and convert the above to: interpreter->num_reg->registers[ (unsigned char)cur_opcode[1] ] and adjust the assembler accordingly. Alternatively to work with 32 regs, we'd have: interpreter->num_reg->registers[ cur_opcode[1] & 0x001F ] As for MMIX. I don't see a need for globals, since we're going to have various global symbol stashes available to us. Further, I don't see a value in providing special trapping code to return zero when reading from a Marginal or extending the local variable space when writing. For writing, an explicit "reserve" (once (or less)per function call) shouldn't be too much bother. And if the code is silly enough to write or read from this "marginal" region, then we'll pretend that they're using uninitialized values. Further, the "reserve"er must require that there is enough space to fully utilize the n-register set (currently 32), so that set $r31, 5 doesn't spill into the tail of the register set (since that was previously handled by the trapping code). This modifies the MMIX spec such that we potentially waste up to 31 register slots in the register window (which is trivial when we have a window of size >= 1024). The rolling register set can be accomplished via three methods. First, realloc the register stack every time an extend exceeds the size. (This sucks for recursive functions). Second use paged register sets and copy values durring partial spillover (very complex). Lastly utilize a [pow2] modulous on a fixed size register stack. This has a very interesting implication; that we completely do away with the (push|pop)[ipsn] and their associated data-structures. Thus the P1 translation becomes: //(for a 1K rolling stack) #define STACK_MASK 0x03FF interpreter->num_reg[ ( interp->num_offset + cur_opcode[1] ) & STACK_MASK ] This has 4 indirections and two integer ops. As above, for multiple uses of Px in an op-code, this should be optimized to 2 indirections. Since indirections are significantly slower than bitwise logical operations, this should be roughly equivalent in speed to our current interpreter. If we were hell-bent on speed, we could utilize STACK_SIZE alligned memory regions and perform direct memory arithmetic as with: interp->x_reg_base = [ ... ] 1 K chunk of memory aligned to 1K boundry interp->x_reg = interp->x_reg_base + interp->x_offset #define P1 *(int*)(((int)( interp->x_reg + cur_opcode[1] )) & STACK_MASK) which only requires 2 indirections + 1 logical operator for the first access (and 1 indirection for subsequent accesses). This is compatible with most of the existing current core-logic. Another interesting aspect is that push/pop only occurs at subroutine call/return time (as with MMIX). This enhances performance since fewer ops need execute for a subroutine call, while at the same time reduces the likelihood of stack-corruption (since only two ops modify the register stack). If we unify the register sets, then there is less slack space and more importantly better cache utilization. Additionally, the order of all named parameters becomes implicit. There is no confusion as to what register parameters go onto the stack and in what order. Additionally we could make NUM_REGISTERS really large like 256 or even 1024, since we only allocate as many as a subroutine requires. On the flip side, a garbage collector is going to have fits figuring out PMC references (unless additional data-type info is stored next to each register. Only two things need change, first, strings and PMCs become pointers to string and PMC structures instead of inlining the data (so that everything is a unioned 64bits), and any op-codes or vtable ops need to work with the extra indirection. If we maintained 4 separate register sets, we could simply produce the call / return op-codes with multiple reg-set save-values as: call_I_i_i_i_i return_i_i_i_i since we're not constrained to 4 Bytes for our RISC-like op-code that Knuth was. They would be somewhat complex, but more efficient than manual pushing onto the stack. We'd also need the explicit reserve_i_i_i_i opcode at the beginning of a subroutine it needed to use additional local variables. According to Knuth, the call routine should be trivial, since we're just incrementing the base register address (not allocating any new space). The reserve routine (equiv to Knuth's writing into marginal registers) might need to save old registers (at the tail of the reg-window) to the hidden stack. There's obviously more work involved for 4 separate registers than with just one big one. Lastly the return routine might need to restore values from the stack to the register set. I'm in the process of writing up some proto-type code to see how well this benchmarks. Since this avoids the issue of saving local variables to the stack, this competes against the aforementioned standard of using R0..4 for IO parameters, but I think it successfully avoids the use of an external parameter stack, since I doubt that more than 32 perl6 explicitly prototyped parameters will be required.. (256 if we unify the register sets). Perl5 just uses a single list-PMC to my understanding. > [snip] > > > >Yeah. I'm trying very hard not to put anything really sophisticated > > >into jakoc (at least not yet). Right now I can still tweak things > > >reasonably well. If I add much more complexity, I'm going to have > > >to actually write a real compiler, and if I write a real compiler I > > >probably won't be able to resist the temptation to turn Jako into the > > >language I *really* wish I had, and that would be a bigger project. > > This simplifies the compiler since all it needs to do is add reserve_i_i_i_i at the begining of each routine which should include the max number of parameters it'll use in a function call. Around each function call, it'll just use top-num_params as the save-value to call, thus being a little wasteful since it's not reusing registers, but this is much simpler than writing stack management code from the compiler. -Michael