I'm aware that something similar can be achieved by different approaches.

Note, though, that these IMs are a nice compromise between two extremes:
(1) doing no inlining at all, leaving every function call intact, or
(2) inlining absolutely everything.
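To make the idea concrete, here is a rough sketch of such a cache. This is not actual GOOPS code: `instantiate-method` is a made-up placeholder for "compile the method body with these argument types known", while `class-of` is the usual GOOPS accessor.

```scheme
;; Rough sketch of an IM cache (hypothetical; `instantiate-method' is a
;; placeholder, not a real GOOPS procedure).  `class-of' returns the
;; class of an object, so (map class-of args) is the type signature.
(define (make-caching-gf default-method)
  (let ((cache '()))                     ; alist: (type ...) -> IM
    (lambda args
      (let* ((types (map class-of args))
             (hit   (assoc types cache)))
        (if hit
            (apply (cdr hit) args)       ; direct call to the cached IM
            (let ((im (instantiate-method default-method types)))
              (set! cache (acons types im cache))
              (apply im args)))))))
```

The first call with a given type signature pays for instantiation; every later call with the same signature is just a lookup followed by a direct call.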

Of course, such an IM-based scheme could be supplemented by run-time
invocation statistics to make better choices about what to instantiate
and compile---as is done in several interpreters/compilers.
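Purely as an illustration (all names here are invented), such statistics could be as simple as a per-signature counter that gates instantiation:

```scheme
;; Hypothetical sketch: count invocations per type signature and only
;; instantiate/compile an IM once a signature has proven itself hot.
(define invocation-counts (make-hash-table))
(define hot-threshold 100)

(define (worth-instantiating? types)
  (let ((n (1+ (or (hash-ref invocation-counts types) 0))))
    (hash-set! invocation-counts types n)
    (> n hot-threshold)))                ; #t once the signature is hot
```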

On 9 Jan 2018 15:30, "Mikael Djurfeldt" <mik...@djurfeldt.com> wrote:

> Hi,
>
> I think this is a marvelous development and, for what it's worth, a
> step in the right direction. Many, many thanks!
>
> Maybe this is all completely obvious to you, but I want to remind you,
> again, of the plans and ideas I had for GOOPS before I had to leave it
> in its rather prototypical and unfinished state:
>
> As you recall, generic functions (GFs) then carried a cache (ideas taken
> from PCL) with "instantiated methods" (IM; there is probably a better term
> and I might have called them "compiled methods" or "cmethods"
> before)---method code where the types of the arguments are known since each
> instance comes from an actual invocation of the GF. Given a new invocation,
> the dispatch mechanism would just use argument types to look up the correct
> IM.
>
> I then had the idea that since we know the types of the arguments of the
> IM, a lot of the type dispatch could be eliminated within the IM based on
> flow information---very much in line with what you are doing here. If we
> now add one more piece, things get really exciting: Wherever there is a
> call to other GFs within one IM and the types of the arguments can be
> deduced at the point of the call, then the polymorphic type dispatch of the
> GF can be eliminated and we can replace the GF call with a direct call to
> the corresponding IM.
>
> Given now that most of the standard Scheme functions can be regarded as
> polymorphic GFs, I then imagined that most of the type dispatch in a
> program could be eliminated. Actual execution would mostly be direct calls
> of IMs to IMs, something which the optimizer could continue to work on,
> especially if it all was represented as CPS.
>
> Given your work here, I think that something like this could now rather
> easily be implemented. That is, we re-introduce IMs, make them directly
> callable, and substitute IM calls for GF calls when compiling them. I
> gather that the code of IMs does not necessarily have to be hung onto GFs but
> could be handled by some separate manager/data structures.
>
> Happy new year!
> Mikael
>
>
> On Mon, Jan 8, 2018 at 4:01 PM, Andy Wingo <wi...@pobox.com> wrote:
>
>> Hey all!
>>
>> This is an update along the road to Guile 3.  Check
>> https://lists.gnu.org/archive/html/guile-devel/2017-11/msg00016.html for
>> the previous entry.
>>
>> Since 25 November there have been around 100 commits.  Firstly I
>> merged in patches from stable-2.0, including patches corresponding to
>> the improvements in the Guile 2.2.3 stable series release.
>>
>> Then, I started to look at "instruction explosion" for vector-ref et
>> al.  Basically the idea is to decompose operations like vector-ref
>> into their constituent parts.  In the concrete case of
>> vector-ref, we have to check that the vector is a heap object, that its
>> heap object tag is "vector", we have to extract the length from the heap
>> object, then we have to check that the index is an integer between 0 and
>> length-1, and finally we dereference the field in the vector.
>> Instruction explosion turns all of these into different primcalls and
>> branches.
>>
>> One thing that became apparent was that with instruction explosion, we'd
>> have a lot more control flow.  Information that the optimizer would
>> learn in a specific way (e.g. via specialized type inference / effects
>> analysis handlers for vector-ref) would instead be learned by generic
>> control flow.
>>
>> Concretely --
>>
>>     scheme@(guile-user)> ,x (lambda (v i) (vector-ref v i))
>>     Disassembly of #<procedure 29f5e50 at <unknown port>:1:3 (v i)> at #x29f5b4c:
>>
>>        0    (assert-nargs-ee/locals 3 1)    ;; 4 slots (2 args)   at (unknown file):1:3
>>        1    (immediate-tag=? 2 7 0)         ;; heap-object?       at (unknown file):1:17
>>        3    (jne 23)                        ;; -> L3
>>        4    (heap-tag=? 2 127 13)           ;; vector?
>>        6    (jne 20)                        ;; -> L3
>>        7    (word-ref/immediate 3 2 0)
>>        8    (ursh/immediate 3 3 8)
>>        9    (immediate-tag=? 1 3 2)         ;; fixnum?
>>       11    (jne 13)                        ;; -> L2
>>       12    (untag-fixnum 0 1)
>>       13    (s64-imm<? 0 0)
>>       14    (jl 8)                          ;; -> L1
>>       15    (s64<? 0 3)
>>       16    (jnl 6)                         ;; -> L1
>>       17    (mov 3 0)
>>       18    (uadd/immediate 3 3 1)
>>       19    (scm-ref 2 2 3)
>>       20    (handle-interrupts)
>>       21    (return-values 2)               ;; 1 value
>>     L1:
>>       22    (throw/value+data 1 201)        ;; #(out-of-range "vector-ref" "Argument 2 out of range: ~S")
>>     L2:
>>       24    (throw/value+data 1 225)        ;; #(wrong-type-arg "vector-ref" "Wrong type argument in position 2 (expecting small integer): ~S")
>>     L3:
>>       26    (throw/value+data 2 239)        ;; #(wrong-type-arg "vector-ref" "Wrong type argument in position 1 (expecting vector): ~S")
>>
>> So this is a bit horrible and I need to make the disassembler do a
>> better job, but anyway.  Instructions 1 through 6 check that V is a
>> vector; instructions 7 and 8 extract the length; 9 and 11 check that the
>> index is a fixnum, 12 extracts the fixnum value as an untagged 64-bit
>> integer, and 13 through 16 check that the index is in range.
>>
>> L1, L2, and L3 are bailouts.  The idea here is that if this vector-ref
>> is followed by some other operation on the vector, we'll at least get to
>> elide the vector? checks, and maybe we can reuse the length extraction
>> too.  Et cetera.
>>
>> The optimizer makes decisions like when to elide redundant checks based
>> on flow information.  However, for historical reasons the "throw"
>> terms unfortunately did "continue" from the optimizer's perspective:
>> although the information flowing to e.g. L3 shouldn't flow on to
>> instruction 7, the IR had no way to denote terms that don't continue
>> at all.
>>
>> To fix this I had to make some changes to the IR.  On the plus side,
>> $throw is its own term kind now that doesn't have a continuation.  Also,
>> $branch is a term instead of an expression shoehorned into $continue;
>> $prompt is a term too.  Maybe one day we'll get a $select term for
>> proper case compilation.
>>
>> Finally, the instruction explosion.  I refactored the Tree-IL->CPS
>> compiler to allow individual primcalls to have custom lowering routines,
>> and tightened up $primcall so that it now always continues to $kargs.
>> Then I added custom lowering routines for vector-ref et al, and cons and
>> other things.  The CPS IR refactors allowed me to remove some useless
>> passes (elide-values, prune-bailouts).  There were some optimizer bugs
>> but generally things were already in a good state; e.g. here's a vector
>> sum routine:
>>
>>   scheme@(guile-user)> (define (v-sum v)
>>                          (let lp ((n 0) (sum 0))
>>                            (if (< n (vector-length v))
>>                                (lp (+ n 1) (+ sum (vector-ref v n)))
>>                                sum)))
>>   scheme@(guile-user)> ,x v-sum
>>   Disassembly of #<procedure v-sum (v)> at #x1fa2750:
>>
>>      0    (assert-nargs-ee/locals 2 3)    ;; 5 slots (1 arg)    at (unknown file):1:0
>>      1    (make-short-immediate 4 2)      ;; 0
>>      2    (immediate-tag=? 3 7 0)         ;; heap-object?       at (unknown file):1:51
>>      4    (jne 39)                        ;; -> L5
>>      5    (heap-tag=? 3 127 13)           ;; vector?
>>      7    (jne 36)                        ;; -> L5
>>      8    (word-ref/immediate 2 3 0)
>>      9    (ursh/immediate 2 2 8)
>>     10    (imm-s64<? 2 0)                                       at (unknown file):1:46
>>     11    (jnl 29)                        ;; -> L4
>>     12    (load-s64 1 0 0)                                      at (unknown file):1:89
>>     15    (s64<? 1 2)
>>     16    (jnl 22)                        ;; -> L3
>>     17    (mov 4 1)
>>     18    (uadd/immediate 4 4 1)
>>     19    (scm-ref 4 3 4)
>>     20    (add/immediate 4 4 0)                                 at (unknown file):1:82
>>     21    (load-s64 1 0 1)
>>     24    (s64<? 1 2)                                           at (unknown file):1:46
>>     25    (jnl 11)                        ;; -> L2
>>   L1:
>>     26    (handle-interrupts)
>>     27    (mov 0 1)                                             at (unknown file):1:74
>>     28    (uadd/immediate 0 0 1)
>>     29    (uadd/immediate 1 1 1)                                at (unknown file):1:89
>>     30    (scm-ref 1 3 1)
>>     31    (add 4 4 1)                                           at (unknown file):1:82
>>     32    (s64<? 0 2)                                           at (unknown file):1:46
>>     33    (jnl 7)                         ;; -> L4
>>     34    (mov 1 0)                                             at (unknown file):1:70
>>     35    (j -9)                          ;; -> L1
>>   L2:
>>     36    (mov 0 1)                                             at (unknown file):1:46
>>     37    (j 3)                           ;; -> L4
>>   L3:
>>     38    (throw/value+data 4 204)        ;; #(out-of-range "vector-ref" "Argument 2 out of range: ~S") at (unknown file):1:89
>>   L4:
>>     40    (handle-interrupts)
>>     41    (mov 3 4)
>>     42    (return-values 2)               ;; 1 value
>>   L5:
>>     43    (throw/value+data 3 233)        ;; #(wrong-type-arg "vector-length" "Wrong type argument in position 1 (expect…") at (unknown file):1:51
>>
>> The bit between L1 and L2 is the loop.  Note that the loop does no
>> dynamic checks besides checking that the unboxed index is within bounds
>> -- notably the vector? check, the length computation, and the index
>> untagging were hoisted out.  The "scm-ref" instruction is the raw
>> unchecked memory accessor instruction that will correspond to a load
>> when compiled to machine code.
>>
>> This code runs a little slower than it did a month ago, because
>> instruction explosion generally produces more instructions than we had
>> before.  But this is expected.  I expect the corresponding amount of
>> machine code to be smaller once we emit machine code, sometimes
>> significantly so.  Doing things this way takes away the temptation for
>> an eventual JIT to do optimization-like tasks -- we keep it all in
>> generic CPS, and the JIT will be easy to write and generate good code.
>> Until then, hacking on Guile is a bit slow though, in terms of compile
>> time.
>>
>> Next up is instruction explosion for other data structures (boxes,
>> structs, strings, etc); then I have to figure out what to do for
>> arithmetic that I can't specialize (like that "add" at IP 31).  I guess
>> polymorphic inline caches are the real solution there; we'll see.
>>
>> OK that's the update.  Happy new year and happy hacking with Guile :)
>>
>> Cheers,
>>
>> Andy
>>
>>
>
