Re: ARM JIT (just about)

Daniel Grunblatt Sat, 02 Feb 2002 12:22:49 -0800


On Fri, 1 Feb 2002, Nicholas Clark wrote:


> On Fri, Feb 01, 2002 at 01:32:13AM +0000, Nicholas Clark wrote:
> > This just about implements a jit for ARM. It doesn't actually do any ops in
> > assembler yet, except for end. It's named on the basis that it's for v3 or
>
> This is where I give up on the current format.
> Others are welcome to carry on either based on what I did, or starting
> afresh. And if we have a fresh format, I'm interested.
> What I've written will call parrot ops.
>
> > Problems that I remember that I encountered. (Comments in the code may
> > indicate more). Part of these were understanding things - it doesn't mean
> > that the current way is wrong, just that it wasn't obvious to me :-(
> >
> > 1: '}' is a necessary character in ARM assembler syntax, so jit2h.pl needs
> >    to be a bit smarter about deciding when to chop the end of a function
> >
> > 2: There is no terse way to load arbitrary 32 bit constants into a register
> >    with ARM instructions. There are 2 usual methods
> >    1: Put the constant in a constant pool within +- 4092 or so bytes of the
> >       PC, and load it with an offset from the PC.
We should probably go this way.

> >    2: Make it with 1, 2 or 3 instructions. I believe that currently it is
> >       conjectured that it is possible to make any 32 bit value with 3 ARM
> >       instructions, and so far no-one has found any value that they couldn't
> >       make, but no-one has proved it possible and thereby made an algorithm
> >       that lets a program generate instructions to build a constant
> >
> >    Either way, I found I was fighting the current jit which expects (at worst)
> >    to be able to split a 32 bit constant into 2 (possibly unequal) halves
> >    stored in two machine instructions. To be more flexible jit would need to

Wrong, that's not possible on the ALPHA, or maybe it is (I don't know much
about ALPHA assembly), but I don't use it.

I think the main problem with ARM is that you don't have instructions like
the alpha "ldah" or the sparc "sethi" which do something like:

 "ldah $1,2($2)"

 $1 = (2 * 65536 + $2)

That's why we should use a constant pool.
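
To make the ARM restriction concrete, here is a minimal sketch in C (not
code from jit.c; the function names are made up for illustration). It relies
on the fact that an ARM data-processing immediate is an 8-bit value rotated
right by an even amount, and it reports whether a constant fits in one
instruction or has to come from a constant pool:

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n in 0..31). */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    assert(n < 32);
    return n ? (v << n) | (v >> (32 - n)) : v;
}

/*
 * Hypothetical helper: try to express `value` as an ARM data-processing
 * immediate, i.e. an 8-bit number rotated right by 2 * rot (rot in 0..15).
 * Returns the 12-bit operand field (rot << 8 | imm8) on success, or -1
 * when the value needs a constant pool entry or several instructions.
 */
static int arm_imm_field(uint32_t value)
{
    unsigned rot;

    for (rot = 0; rot < 16; rot++) {
        /* Undo the rotate-right by rotating left by the same amount. */
        uint32_t imm8 = rotl32(value, 2 * rot);
        if (imm8 <= 0xFF)
            return (int)((rot << 8) | imm8);
    }
    return -1;
}
```

With this, 0 and 0xFF000000 fit in a single mov, while 0xdeadbeef does not
and would have to be loaded from the pool.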

> >    know what some CPU registers contain (ie things like the current
> >    interpreter pointer), and be able to choose whether to get a value or
> >    pointer by arithmetic from a CPU register, by dereferencing a CPU register
> >    (possibly with offset) or by giving up and loading a constant
> >
> >    This will make more sense to anyone who gets hold of an ARM machine and
> >    then tries to write ops :-)
> >
> > 3: I wanted to put the pointer to the current interpreter in r7. This made
> >    the default precompiled "call" function have its branch somewhere wonky.
> >    It seems to me that Parrot::Jit->call should be returning a 2 item list
> >    the  bytecode, and the offset of the branching instruction in there.
>
> Actually, I'd like to do arbitrary call like this:
>
>         mov     r1, r7                        ; say arg 2 is *interpreter
>         adr     r14,  .L1             ; pseudocode for pc relative calc.
>         ldmia   r14!,  {r0, r2, r3, pc} ; register list built by jit
> .L1:    r0 data
>         r2 data
>         r3 data
>         <where ever>          ; address of function.
> .L2:                          ; next instruction - return point from func.
>
> Which to me doesn't look much like the way the current system expects to
> prime the registers in order.
>
> ARM SPECIFIC BIT:
>
> I'm taking advantage of the way that a branch to subroutine instruction
> (bl) stores the return address in r14 (lr, the Link register), and loads
> pc (r15) with the subroutine address.
> The above (untested) code takes advantage of r14 (and r15) being regular
> registers, by folding all the "load registers with function parameters,
> call function" steps into 1 (yay!) instruction which (effectively) treats
> r14 as a stack pointer.
>
> 1 instruction to prime r14 with the address of label .L1 (1 clock cycle)
> 1 instruction to:
>
>  load all the registers with parameters
>  load the program counter with the subroutine address (so branch into it)
>  write back new pointer value to r14 (which will be pointing at .L2)
>    which has effectively set the return address for the function.
>
> admittedly that load takes >1 clock cycle. But it just seems a cool way to do
> it.
>
>
> LESS ARM SPECIFIC BIT:
>
> However, building the ldmia instruction means setting the bitmap of registers
> to load based on which are values in the hitlist between .L1 and .L2. And if
> some are already in CPU registers, or are actually to be loaded from Parrot
> registers, then they don't need to be pulled from the hitlist, because they
> are being evaluated some other way.
> (eg I think it is a good idea to keep the current interpreter pointer in a
> CPU register (eg r7), hence if that is needed as a function parameter it's
> a mov, rather than a memory load)
>
> And if arguments need dereferencing first, then I need to load the pointer, then
> dereference, and hence they don't want to be in the hitlist.
>
> CONCLUSION:
>
> So I start needing to build simple, parameterisable code, but more complex
> than the current system allows.

The current system handles object code at build_asm time. If I understand
what you mean, you need it to handle assembler?
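
For reference, here is a sketch of how the ldmia word in the call sequence
quoted above could be assembled at build time. The fixed bits are the
standard ARM encoding of "ldmia rN!, {...}"; the helper names and the idea
of passing a mask of argument registers are assumptions for illustration,
not anything jit.c has today:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build "ldmia rN!, {register list}" with condition AL.
 * Fixed bits 0xE8B00000: cond = 1110, 100, P=0, U=1, S=0, W=1, L=1.
 * `reglist` has bit i set when register i is to be loaded.
 */
static uint32_t arm_ldmia_wb(unsigned base, uint16_t reglist)
{
    assert(base < 16);
    return 0xE8B00000u | ((uint32_t)base << 16) | reglist;
}

/*
 * Hypothetical helper: register list for a call. `from_data` has bit i
 * set when argument register i (r0-r3) is taken from the inline data
 * block; pc (r15) is always added so the load also branches.
 */
static uint16_t call_reglist(unsigned from_data)
{
    assert(from_data < 16);
    return (uint16_t)(from_data | (1u << 15));
}
```

For the quoted example (r0, r2 and r3 from the data block, base r14) this
gives the register list 0x800D and the word 0xE8BE800D. ldm always loads the
lowest numbered register from the lowest address, which matches the order of
the data words after .L1.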


>
> > 4: I think in a RISC way, so expect the offset to be of the start of the
> >    instruction that needs butchering, not the byte within it. (How the sparc
> >    position was expressed confused me for a while).
>
> To be very undiplomatic:
>
> 5: The current way the jit is done turns into madness on ARM.
>
> To be specific
>
> The current system seems to be well suited to how x86 wants to work.
> It's really cool to have x86 going much faster.
>
> On ARM I think the best way to get the constants of parrot registers into/
> out of CPU registers is to put the address of I1 into an ARM register, and
> load parrot registers into/out of CPU registers with memory load/store offset,
> which seems to be radically different from how x86 is working. This appears
> to be how Sparc is working.
>
> I also guess I need to have a global register for integer constants, as they
> can't go inline. Actually, a global register for a merged constant pool is
> a better idea. This appears to differ from Sparc.
>
> So for set_i_i what I actually want the jit to translate that to is
>
>         ldr     ip, [r4, #8]
>         str     ip, [r4, #4]
>
> if I have the address of I1 in r4, and I'm doing set_i I2, I3
>
>
> As far as I can tell, currently I have to write this in core.ops:
>
>
> Parrot_set_i_i {
>     ldr ip, &INT_REG[2]
>     str ip, &INT_REG[1]
> }
>
> The current jit2h and module code
>
> 1: reads that
> 2: mangles &INT_REG[1] to something that is syntactically legal
> 3: calls out to as
> 4: calls objdump to disassemble it
> 5: looks for a pattern to spot where the special bit is.
>    AARGH. The "special bit" is the last 12 bits of the instruction.
>    If I have to convert the disassembled instruction back to binary, and
>    then match /^....010...(.)....................$/ to find out if it's
>    LDR or STR (with $1 determining which) I feel I might as well write my
>    own ARM assembler in perl
No, you can leave everything as it is at this stage and then overwrite
the necessary bits at build_asm time.

Anyway, steps 3 to 4/5 may disappear, as we should write our own assembler
because of the objdump problems.
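
For the LDR/STR case, overwriting the bits at build_asm time could be as
small as the following sketch. It assumes the assembled template keeps a
zero offset and a positive (U bit set) immediate; the names are
hypothetical, not existing jit.c functions:

```c
#include <assert.h>
#include <stdint.h>

#define ARM_OFFSET12_MASK 0x00000FFFu  /* low 12 bits: immediate offset */
#define ARM_LOAD_BIT      (1u << 20)   /* L bit: 1 = LDR, 0 = STR */

/* Return `insn` with its 12-bit immediate offset replaced by `offset`. */
static uint32_t arm_patch_offset12(uint32_t insn, uint32_t offset)
{
    assert(offset <= ARM_OFFSET12_MASK);
    return (insn & ~ARM_OFFSET12_MASK) | offset;
}

/* Non-zero when a single data transfer instruction is a load. */
static int arm_is_load(uint32_t insn)
{
    return (insn & ARM_LOAD_BIT) != 0;
}
```

Patching 0xE594C000 (ldr ip, [r4, #0]) with offset 8 gives 0xE594C008, which
is the "ldr ip, [r4, #8]" from your example. The load/store distinction is a
single bit test on the binary word, so no disassembly is needed for it.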

>
> and then
>
> 6: I need to write more specialised C code in jit.c to mangle the LDR or STR
>    instruction at jit building time, and that too needs to be taught the
>    instruction format.
> 7: I need to teach jit.c that on this architecture it is doing INT_REG
>    loads in this way. (There is a forest of #ifdef rapidly growing there, as
>    it seems every architecture is not as "simple" as x86)
There are some ways to get rid of the #ifdef forest, but all of them will
make build_asm slower; I still need to find the fastest.
And the forest is there because every architecture is different.

>
>
> And HELL. I'm going to need to do the same hoop jumping for NUM_REG, string
> REG, INT_CONST, NUM_CONST, current opcode, aargh.
Did I ever say it was easy?

>
> This is why I feel it's getting futile.
>
> I'd like to be able to write a subroutine that the jit calls. Parameters are
> the parrot opcode and parameters, the address to assemble at (so I know what
> the program counter will be) and probably some other stuff about what the
> CPU registers contain. Output is the section of assembler code.
>
> (Like C signal handlers I may be able to merge several opcodes into 1
> generator function, hence I'd like the opcode number (or name?) as a parameter)
>
> So for set_i_i I'd be passed something like (0xdeadbeef, "set_i_i", 2, 3)
> and I'd return 8 bytes of ARM assembler that do it.
Maybe you can explain all that in a little more detail.
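
If I read the idea right, a generator for set_i_i might look something like
this sketch. The signature, the names and the register layout (address of I1
in r4, 4 bytes per integer register) are assumptions taken from your
example, not an existing interface:

```c
#include <assert.h>
#include <stdint.h>

#define ARM_LDR_IP_R4 0xE594C000u   /* ldr ip, [r4, #0] */
#define ARM_STR_IP_R4 0xE584C000u   /* str ip, [r4, #0] */

/*
 * Byte offset of parrot register I<n> from I1, whose address is assumed
 * to be held in r4. Assumes 4-byte integer registers numbered from 1.
 */
static uint32_t int_reg_offset(unsigned n)
{
    assert(n >= 1 && (n - 1) * 4 <= 0xFFF);
    return (n - 1) * 4;
}

/*
 * Hypothetical generator for "set I<dst>, I<src>". Writes two ARM
 * instruction words to `out` and returns the number of bytes emitted.
 * `pc` is the address being assembled at; this op does not need it.
 */
static unsigned gen_set_i_i(uint32_t pc, unsigned dst, unsigned src,
                            uint32_t out[2])
{
    (void)pc;
    out[0] = ARM_LDR_IP_R4 | int_reg_offset(src);  /* ldr ip, [r4, #src] */
    out[1] = ARM_STR_IP_R4 | int_reg_offset(dst);  /* str ip, [r4, #dst] */
    return (unsigned)(2 * sizeof out[0]);
}
```

Called as gen_set_i_i(0xdeadbeef, 2, 3, out) it returns 8 and leaves
"ldr ip, [r4, #8]" and "str ip, [r4, #4]" in out.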

>
> And all the knowledge about how to "do" ARM instructions is in exactly one
> place. **Not mixed across core.ops, arm*Generic and jit.c**
>
>
> This would actually let me micro-code the parrot ops. I could implement
> set_i_i as set_cpu_i, set_i_cpu, and in turn call 2 functions to generate
> code to load a parrot reg to a CPU register, and store the CPU register back
> to parrot code.
> Whilst that seems a lot of effort for a 2 instruction job such as set_i_i,
> the load from RAM to CPU is going to be needed at (or near) the front of
> every parrot op, and the store from CPU to RAM at the end, so on a RISC CPU
> being able to subdivide the parrot ops seems to make sense.
>
> It would also mean that something like set I2, 0 doesn't need to use the
> arbitrary 32 bit constant pool, as I could make my generator encode that as
>
>     mov ip, #0
>     str ip, [r4, #4]
>
> Hmm. Going to need to pass state about constant pools in and out of generators.
> Also, knowing which CPU registers contain fixed values such as 0 could be
> useful. Maybe that's getting to optimiser stage.
>
> Nicholas Clark
> --
> EMCFT http://www.ccl4.org/~nick/CV.html
>
