[PROPOSAL] for a new calling scheme

Leopold Toetsch Tue, 16 Nov 2004 02:44:44 -0800

Below inline/attached is a proposal for new calling conventions - for the archive as Dan doesn't like changes now, but I haven't to backup it, when its out ;)

A proposal for new calling conventions - for the archive as Dan
doesn't like changes now, but I haven't to backup it, when its out ;)


Central keywords:

* ~64 KB sized register chunks
* variable sized frames carved out of chunks
* sliding register window
* distinct non-volatile and volatile register usage


Rational

The current calling scheme has to major drawbacks: too few allocatable
registers and suboptimal performance.

Register allocation issues

The current scheme has no notion of non-volatile and volatile
registers. Declaring just the upper half (or any other part) of
registers as non-volatile doesn't take any actual register usage of
functions into account. It's too unflexible to accomodate RL programs.

This caues either register spilling or wrong code, as the registers
R5-R15 may be used as return results. This can happen w/o the
knowledge of the caller. It makes a 16 (x4)-register machine out of
Parrot, which is inadequate for complex code.

Actually the current situation is worse. As the caller has to preserve
P0, P1, and P2 in the upper half of registers, there are in fact only
13 allocatable P registers around a method call.


Performance issues

I've already shown that Parrot's cache usage is subobtimal. The
fibonacchi benchmark (or other call-bound code) spends more then half
of equivalent CPU cycles in level 2 cache misses. This makes Parrot
incompetitively slow and thus unattractive for potential users.

Additionally the fib benchmark shows that almost 20% of cycles is
burned due to function argument and return value copying.

Having fixed sized register frames has additionally two major
drawbacks during DOD/GC. First: all P and S registers have to be
cleared prior to usage, so that no garbage can be marked live. Second
all 32 registers per kind have to be scanned during the mark phase of
DOD for all existing register frames for the whole call chain. This is
a big degradation during DOD and of course causing more cache
pollution.


The indirect register addressing has prooven to be a big improvement
with overall benchmark performance but it's not utilizing the
potential of that scheme because of the fixed sized register frame,
which causes unneeded cache pollution for small function and method
calls like attribute setter/getters or for the fib benchmark.

This proposed scheme addresses two major impacts on speed of the
existing scheme (cache pollution and argument copying) and the problem
of return values clobbering caller's register, if a function is called
in void context.

So here we go.


Prelims:

* the register allocator patch is done and it allocates registers from
  zero up
* the register allocator has counts of needed non- and volatiles per
  sub call and function
* non-volatile registers are assigned to register numbers 0..n-1.
  For optimal register utilization, the register allocator should
  renumber registers and we might insert register moves around
  function calls, so that non-volatiles are arranged first.
* the register structure is like in interpreter.h with the define
  of SLIDING_BP, that is an array of structs instead of a struct of
  arrays

Changes to calling conventions:

* only I0,I1 are used for denoting used registers. This reduces opcount
  count a bit and it's better for cache usage.
  (we are not suffering from too much cycles the code executes, the
   cache usage is the problem - these few additional shifts for
   extracting register usage aren't going to kill performance)
* outgoing registers are placed in the incoming area of the called
  sub, which overlaps the volatile register area of the caller
* return values are used/fetched from that area


Changes to opcodes:

* opcodes do reflect the actual register consumption
  (all the implicit register usage needs special treatment inside
  imcc, which complicates the code in an unneeded way and slows down
  compilation)

* call opcodes get an extra argument how far to bump the register
  frame pointer

  invokecc PSub, Ibp_incr
  callmethcc PObj, Smeth, Ibp_incr

* the return opcode is distinct now. This makes the code more readable
  and clearly denotes the difference to a call:

  returncc

* all old register stack opcode except saveall/restoreall are unneeded
  and tossed. (And even these 2 opcodes are just needed for stack
  calling conventions, which will be slower then this scheme. I don't
  see much reasons to support it.)


Register usage of current sub

(Rx is one of the four register kinds)

  n   ... non-volatile register count (max of any kind)
  I0  ... argcP
  I1  ... argcIS_N := argcI + argcS << 8 + argcN << 24
  R2-R10 . incoming args / return values
  P0  ... optional overflow array
  P1  ... optional keyword/named args hash?
  P2  ... object / 1st arg

(or maybe alternatively P2 object, R3..R10 args, I1 argcI, I2 argc_SN)

The current sub, method, continuation, and object are available through
these new opcodes:

  get_sub, get_method, get_cc, get_self

The current object is a bit special. Perl5 and Python have it as the
first argument of a function call. Perl6 has a dedicated syntax to
pass the invocant(s) of methods. But as there are possibly more then
one, the current scheme with P2 seems inadequate too.

Register numbers for function arguments/returns:

  I0 + n   outgoing argcP (and returns)
  I1 + n   outgoing argcIS_N (and returns)
  R2+n - R10+n outgoing values (and returns)
  P0+n, P1+n like above


Register frame layout

Register frames are variable sized, they use just the needed amount of
registers per call:

  +---------+--------------+----+------
  | sub1    |  sub2        | s3 |
  +---------+--------------+----+------
                           ^    ^ - threshold/next_free
                           |
                           + - framepointer

For plain function calls the framepointer is just moved by the
required frame size.

After a continuation captures some frame, new frames are only
allocated beyond that threshold (or watermark)

  +---------+--------------+---------------+
  | free    |  sub2/cc     |               |
  +---------+--------------+---------------+
  ^                        ^
  |                        |
  next_free                + - threshold


Register frame for a function call

  +-----------~~~~~              ~~~+
  | R0..in-1   |                    .         (1)
  +------------+--+-----------------+
  | R0       Rn-1 | Rvol            |         (2)
  +---------------+-----------+-----+
  | R0            | Rout      |     .         (3)
  +---------------+-----------+---+-------+
                  | R0      Rn'-1 | Rvol' |   (4)
                  +-------+-------+-------+
                  | Rret' |                   (5)
  +---------------+-------+----~~~~~+
  | R0            | Rret' |         .         (6)
  +------+--------+-------+----~~~~~+
  | Rret |                          .         (7)
  +------+~~~~                   ~~~+

  R0..in-1   ... incoming args
  R0..Rn-1   ... non-volatile registers / preserved around call
  Rvol       ... volatile registers
  Rout       ... outgoing args for a call
  Rret       ... return values from call

  (1) directly after call of this function
  (2) function uses non- and volatiles
  (3) befor calling another sub
  (4) during the call the frame pointer is increased by n
  (5) function places return values
  (6) caller extracts return values
  (7) this function returns and places return values

  Framesize = min(32, n + max(Rvol, Rout, Rret))

A register frame is basically divided into two regions:
* non-volatile registers R0.. Rn-1
* volatile registers and call/return values Rn...

The called sub/method sees the latter in R0... because the frame
pointer is advanced by n. If we cross chunk boundaries, the current
scheme of copying arguments preserves that view of registers.

Copying is done if:

   continuation threshold > next_free
   free_space < n + 32

The count of non-volatiles C<n> is a constant, calculated at compile
time. So argument addressing is always a plain register access as now.
C<n> should be eventually adjusted for every call depending on the
register usage at that instruction.


A simple expansion of this scheme could allow register frames with
more then 32 registers per kind, so that register spilling isn't
needed at all. But I doubt that it's needed if we have a clear notion
of preserved (non-volatile) registers per call, which aren't clobbered
by possible return values like now.

May the dicussion begin,

leo

[PROPOSAL] for a new calling scheme

Reply via email to