> On Feb 4, 2017, at 5:35 AM, Andrew Trick <atr...@apple.com> wrote:
> 
> 
>> On Feb 3, 2017, at 9:37 PM, John McCall <rjmcc...@apple.com 
>> <mailto:rjmcc...@apple.com>> wrote:
>> 
>>>> IV. The function that performs the lookup:
>>>>  IV1) is parameterized by an isa
>>>>  IV2) is not parameterized by an isa
>>>> IV1 allows the same function to be used for super-dispatch but requires 
>>>> extra work to be inlined at the call site (possibly requiring a chain of 
>>>> resolution function calls).
>>> 
>>> In my first message I was trying to accomplish IV1. But IV2 is simpler
>>> and I can't see a fundamental advantage to IV1.
>> 
>> Well, you can use IV1 to implement super dispatch (+ sibling dispatch, if we 
>> add it)
>> by passing in the isa of either the superclass or the current class.  IV2 
>> means
>> that the dispatch function is always based on the isa from the object, so 
>> those
>> dispatch schemes need something else to implement them.
>> 
>>> Why would it need a lookup chain?
>> 
>> Code size, because you might not want to inline the isa load at every call 
>> site.
>> So, for a normal dispatch, you'd have an IV2 function (defined client-side?)
>> that just loads the isa and calls the IV1 function (defined by the class).
> 
> Right. Looks like I wrote the opposite of what I meant. The important thing 
> to me is that the vtable offset load + check is issued in parallel with the 
> isa load. I was originally pushing IV2 for this reason, but now think that 
> optimization could be entirely lazy via a client-side cache.
> 
>>>> V. For any particular function or piece of information, it can be accessed:
>>>>  V1) directly through a symbol
>>>>  V2) through a class-specific table
>>>>  V3) through a hierarchy-specific table (e.g. the class object)
>>>> V1 requires more global symbols, especially if the symbol is per-method, 
>>>> but doesn't have any index-computation problems, and it's generally a bit 
>>>> more efficient.
>>>> V2 allows stable assignment of fixed indexes to entries because of 
>>>> availability-sorting.
>>>> V3 does not; it requires some ability to (at least) slide indexes of 
>>>> entries because of changes elsewhere in the hierarchy.
>>>> If there are multiple instantiations of a table (perhaps with different 
>>>> information, like a v-table), V2 and V3 can be subject to table bloat.
>>> 
>>> I had proposed V2 as an option, but am strongly leaning toward V1 for
>>> ABI simplicity and lower static costs (why generate vtables and offset
>>> tables?)
>> 
>> V1 doesn't remove the need for tables, it just hides them from the ABI.
> 
> I like that it makes the offset tables lazy and optional. They don’t even 
> need to be complete.
> 
>>>> So I think your alternatives were:
>>>> 1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a 
>>>> per-method global function that performs the dispatch.  We could apply V2 
>>>> to this to decrease the number of global symbols required, but at the cost 
>>>> of inflating the call site and requiring a global variable whose address 
>>>> would have to be resolved at load time.  Has an open question about super 
>>>> dispatch.
>>>> 2. I1, V3 (for the v-table), V1 (for the global offset): a load of a 
>>>> per-method global variable giving an offset into the v-table.  Joe's 
>>>> suggestion adds a helper function as a code-size optimization that follows 
>>>> I2, II1, III1, IV2.  Again, we could also use V2 for the global offset to 
>>>> reduce the symbol-table costs.
>>>> 3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism 
>>>> table).  At least I think this is right?  The difference between 3a and 3b 
>>>> seems to be about initialization, but maybe also shifts a lot of 
>>>> code-generation to the call site?
>>> 
>>> I'll pick the following option as a starting point because it constrains 
>>> the ABI the least in
>>> terms of static costs and potential directions for optimization:
>>> 
>>> "I2; (II1+II2); III2; IV1; V1"
>>> 
>>> method_entry = resolveMethodAddress_ForAClass(isa, method_index, 
>>> &vtable_offset)
>>> 
>>> (where both modules would need to opt into the vtable_offset.)
>> 
>> Wait, remind me what this &vtable_offset is for at this point?  Is it 
>> basically just a client-side cache?  I can't figure out what it's doing for 
>> us.
> 
> It’s a client-side cache that can be checked in parallel with the `isa` load. 
> The resolver is not required to provide an offset, and the client does not 
> need to cache all the method offsets. It does burn an extra register, but gains 
> the ability to implement vtable dispatch entirely on the client side.

Okay.  I'm willing to consider that it might be an optimization. :)
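(A rough C sketch of the fast path as I understand it, with hypothetical names and `-1` standing in for "offset not yet cached": the cached-offset load has no data dependence on the isa load, so the two can issue in parallel, and the resolver is only reached on a miss.)

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*Method)(void *self);

/* Hypothetical layouts, for illustration only. */
typedef struct { Method vtable[16]; } Isa;
typedef struct { Isa *isa; } Object;

/* Example method implementation so the sketch is self-contained. */
static int A_foo_calls;
static void A_foo_impl(void *self) { (void)self; ++A_foo_calls; }

/* Per-method client-side cache; -1 means "offset not yet resolved". */
static intptr_t A_foo_cache = -1;

/* Stand-in for the class-provided resolver (IV1). It returns the entry and
   may, but is not required to, publish a vtable offset into the cache. */
static Method A_resolveMethod(Isa *isa, int method_index, intptr_t *cache) {
    intptr_t off = (intptr_t)offsetof(Isa, vtable)
                 + method_index * (intptr_t)sizeof(Method);
    if (cache) *cache = off;          /* optional: the resolver may decline */
    return isa->vtable[method_index];
}

static Method lookup_A_foo(Object *obj) {
    intptr_t offset = A_foo_cache;    /* no dependence on the isa load ...   */
    Isa *isa = obj->isa;              /* ... so these loads can issue together */
    if (offset >= 0)                  /* fast path: pure vtable dispatch      */
        return *(Method *)((char *)isa + offset);
    return A_resolveMethod(isa, 13, &A_foo_cache);  /* miss: fill the cache  */
}
```

On the second call the resolver is never reached; the fast path is just the load-compare-branch plus the vtable load.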

> You might be thinking of caching the method entry itself and checking `isa` 
> within `resolveMethod`. I didn’t mention that possibility because the cost of 
> calling the non-local `resolveMethod` function followed by an indirect call 
> largely defeats the purpose of something like an inline-cache.

Yes, I think doing a full-entry dynamic cache would not be a good optimization. 
(Among other things, we'd have to access it with a wide atomic.)
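(To spell out the wide-atomic point, a minimal C illustration with made-up names: a full-entry cache must pair the isa it was resolved against with the resolved entry, or a racing update could match a stale isa with a new method. That pair is two pointers wide.)

```c
#include <stddef.h>

typedef void (*Method)(void *self);

/* A full-entry inline cache pairs the isa it was resolved against with the
   resolved entry point. Reading the pair consistently under concurrent
   updates needs a double-width (two-pointer) atomic -- e.g. cmpxchg16b on
   x86-64 or an ldxp/stxp pair on ARM64. */
typedef struct {
    void  *cached_isa;   /* class the entry was resolved for */
    Method cached_entry; /* resolved method address */
} FullEntryCache;

_Static_assert(sizeof(FullEntryCache) == 2 * sizeof(void *),
               "a full entry is two pointers wide");
```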

John.

>>> I think any alternative would need to be demonstrably better in terms of 
>>> code size or dynamic dispatch cost.
>> 
>> That's a lot of stuff to materialize at every call site.  It makes calls
>> into something like a 10 instruction sequence on ARM64, ignoring
>> the actual formal arguments:
>> 
>>   %raw_isa = load %object                        // 1 instruction
>>   %isa_mask = load @swift_isaMask                // 3: 2 to materialize 
>> address from GOT (not necessarily within ±1MB), 1 to load from it
>>   %isa = and %raw_isa, %isa_mask                 // 1
>>   %method_index = 13                             // 1
>>   %cache = @local.A.foo.cache                    // 2: not necessarily 
>> within ±1MB
>>   %method = call @A.resolveMethod(%isa, %method_index, %cache) // 1
>>   call %method(...)                              // 1
>> 
>> On x86-64, it'd just be 8 instructions because the immediate range for 
>> leaq/movq
>> is ±2GB, which is Good Enough for the standard code model, but of course it 
>> still
>> expands to roughly the same amount of code.
>> 
>> Even without vtable_offset, it's a lot of code to inline.
>> 
>> So we'd almost certainly want a client-side resolver function that handled
>> the normal case.  Is that what you mean when you say II1+II2?  So the local
>> resolver would be I2; II1; III2; IV2; V1, which leaves us with a 
>> three-instruction
>> call sequence, which I think is equivalent to Objective-C, and that function
>> would do this sequence:
>> 
>> define @local_resolveMethodAddress(%object, %method_index)
>>   %raw_isa = load %object                        // 1 instruction
>>   %isa_mask = load @swift_isaMask                // 3: 2 to materialize 
>> address from GOT (not necessarily within ±1MB), 1 to load from it
>>   %isa = and %raw_isa, %isa_mask                 // 1
>>   %cache_table = @local.A.cache_table            // 2: not necessarily 
>> within ±1MB
>>   %cache = add %cache_table, %method_index * 8   // 1
>>   tailcall @A.resolveMethod(%isa, %method_index, %cache)  // 1
>> 
>> John.
> 
> Yes, exactly, except we haven’t even done any client-side vtable optimization 
> yet.
> 
> To me the point of the local cache is to avoid calling @A.resolveMethod in 
> the common case. So we need another load-compare-and-branch, which makes the 
> local helper 12-13 instructions. Then you have the vtable load itself, so 
> that’s 13-14 instructions. You would be saving on dynamic instructions but 
> paying with 4 extra static instructions per class.
> 
> It would be lame if we couldn't force @local.A.cache_table to be within ±1MB 
> of the helper.
> 
> Inlining the cache table address might be worthwhile because %method_index 
> would then be an immediate and hoisted to the top of the function.
> 
> -Andy
> 
> 

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev
