> On Feb 4, 2017, at 5:35 AM, Andrew Trick <atr...@apple.com> wrote:
>
>> On Feb 3, 2017, at 9:37 PM, John McCall <rjmcc...@apple.com> wrote:
>>
>>>> IV. The function that performs the lookup:
>>>> IV1) is parameterized by an isa
>>>> IV2) is not parameterized by an isa
>>>> IV1 allows the same function to be used for super-dispatch but
>>>> requires extra work to be inlined at the call site (possibly
>>>> requiring a chain of resolution function calls).
>>>
>>> In my first message I was trying to accomplish IV1. But IV2 is
>>> simpler and I can't see a fundamental advantage to IV1.
>>
>> Well, you can use IV1 to implement super dispatch (+ sibling
>> dispatch, if we add it) by passing in the isa of either the
>> superclass or the current class. IV2 means that the dispatch
>> function is always based on the isa from the object, so those
>> dispatch schemes need something else to implement them.
>>
>>> Why would it need a lookup chain?
>>
>> Code size, because you might not want to inline the isa load at
>> every call site. So, for a normal dispatch, you'd have an IV2
>> function (defined client-side?) that just loads the isa and calls
>> the IV1 function (defined by the class).
>
> Right. Looks like I wrote the opposite of what I meant. The important
> thing to me is that the vtable offset load + check is issued in
> parallel with the isa load. I was originally pushing IV2 for this
> reason, but now think that optimization could be entirely lazy via a
> client-side cache.
>
>>>> V. For any particular function or piece of information, it can be
>>>> accessed:
>>>> V1) directly through a symbol
>>>> V2) through a class-specific table
>>>> V3) through a hierarchy-specific table (e.g. the class object)
>>>> V1 requires more global symbols, especially if the symbol is
>>>> per-method, but doesn't have any index-computation problems, and
>>>> it's generally a bit more efficient.
>>>> V2 allows stable assignment of fixed indexes to entries because of
>>>> availability-sorting.
>>>> V3 does not; it requires some ability to (at least) slide indexes
>>>> of entries because of changes elsewhere in the hierarchy.
>>>> If there are multiple instantiations of a table (perhaps with
>>>> different information, like a v-table), V2 and V3 can be subject
>>>> to table bloat.
>>>
>>> I had proposed V2 as an option, but am strongly leaning toward V1
>>> for ABI simplicity and lower static costs (why generate vtables and
>>> offset tables?)
>>
>> V1 doesn't remove the need for tables, it just hides them from the ABI.
>
> I like that it makes the offset tables lazy and optional. They don't
> even need to be complete.
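To make the V1/V2 distinction concrete, here is a rough sketch in C.
Every name in it (A_foo_vtable_offset, A_method_offsets, the index
value) is made up for illustration and implies nothing about the real
ABI:

    #include <stdint.h>

    /* V1: the information is its own global symbol. The client just
       loads it; no index computation, but one exported symbol per
       method. */
    extern intptr_t A_foo_vtable_offset;

    /* V2: the information lives in one class-specific table, and the
       client uses a stable, availability-sorted index into it. One
       symbol per class, but the call site must materialize the index. */
    extern intptr_t A_method_offsets[];
    enum { A_foo_index = 13 };  /* stable across library versions */

    intptr_t foo_offset_v1(void) { return A_foo_vtable_offset; }
    intptr_t foo_offset_v2(void) { return A_method_offsets[A_foo_index]; }
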
>>>> So I think your alternatives were:
>>>> 1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct
>>>> call to a per-method global function that performs the dispatch.
>>>> We could apply V2 to this to decrease the number of global symbols
>>>> required, but at the cost of inflating the call site and requiring
>>>> a global variable whose address would have to be resolved at load
>>>> time. Has an open question about super dispatch.
>>>> 2. I1, V3 (for the v-table), V1 (for the global offset): a load of
>>>> a per-method global variable giving an offset into the v-table.
>>>> Joe's suggestion adds a helper function as a code-size
>>>> optimization that follows I2, II1, III1, IV2. Again, we could also
>>>> use V2 for the global offset to reduce the symbol-table costs.
>>>> 3. I2, II2, III2, IV1, V2 (for the class offset / dispatch
>>>> mechanism table).
>>>> At least I think this is right? The difference between 3a and 3b
>>>> seems to be about initialization, but maybe also shifts a lot of
>>>> code generation to the call site?
>>>
>>> I'll pick the following option as a starting point because it
>>> constrains the ABI the least in terms of static costs and potential
>>> directions for optimization:
>>>
>>>   "I2; (II1+II2); III2; IV1; V1"
>>>
>>>   method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)
>>>
>>> (where both modules would need to opt into the vtable_offset.)
>>
>> Wait, remind me what this &vtable_offset is for at this point? Is it
>> basically just a client-side cache? I can't figure out what it's
>> doing for us.
>
> It's a client-side cache that can be checked in parallel with the
> `isa` load. The resolver is not required to provide an offset, and
> the client does not need to cache all the method offsets. It does
> burn an extra register, but gains the ability to implement vtable
> dispatch entirely on the client side.

Okay. I'm willing to consider that it might be an optimization. :)
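Spelled out, the call-site fast path being described might look
something like this in C. This is only a sketch: the names
(A_resolveMethod, foo_offset_cache), the sentinel value, and the
layout are all hypothetical, and the isa mask and thread-safety of the
cache slot are ignored for brevity:

    #include <stddef.h>
    #include <stdint.h>

    typedef void (*Method)(void *self);

    typedef struct ClassMetadata {
        struct ClassMetadata *superclass;
        Method vtable[];          /* laid out by A's module */
    } ClassMetadata;

    typedef struct Object {
        ClassMetadata *isa;       /* ignoring the isa mask for brevity */
    } Object;

    /* Assumed to be defined by A's module: returns the method entry
       for (isa, method_index) and may store a vtable offset through
       the third parameter so the client can dispatch directly next
       time. Both sides opt into this. */
    extern Method A_resolveMethod(ClassMetadata *isa, size_t method_index,
                                  intptr_t *vtable_offset);

    /* Client-side cache slot for one method; 0 means "not resolved
       yet" (assume slot 0 of the vtable is never a real method). */
    static intptr_t foo_offset_cache = 0;

    void call_foo(Object *obj) {
        intptr_t offset = foo_offset_cache;  /* issues in parallel... */
        ClassMetadata *isa = obj->isa;       /* ...with the isa load */
        Method entry;
        if (offset != 0)
            entry = isa->vtable[offset];     /* pure client-side dispatch */
        else
            entry = A_resolveMethod(isa, /*method_index*/ 13,
                                    &foo_offset_cache);
        entry(obj);
    }
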
> You might be thinking of caching the method entry itself and
> checking `isa` within `resolveMethod`. I didn't mention that
> possibility because the cost of calling the non-local `resolveMethod`
> function followed by an indirect call largely defeats the purpose of
> something like an inline cache.

Yes, I think doing a full-entry dynamic cache would not be a good
optimization. (Among other things, we'd have to access it with a wide
atomic.)

John.

>>> I think any alternative would need to be demonstrably better in
>>> terms of code size or dynamic dispatch cost.
>>
>> That's a lot of stuff to materialize at every call site. It makes
>> calls into something like a 10-instruction sequence on ARM64,
>> ignoring the actual formal arguments:
>>
>>   %raw_isa = load %object         // 1 instruction
>>   %isa_mask = load @swift_isaMask // 3: 2 to materialize the address
>>                                   //    from the GOT (not necessarily
>>                                   //    within ±1MB), 1 to load from it
>>   %isa = and %raw_isa, %isa_mask  // 1
>>   %method_index = 13              // 1
>>   %cache = @local.A.foo.cache     // 2: not necessarily within ±1MB
>>   %method = call @A.resolveMethod(%isa, %method_index, %cache) // 1
>>   call %method(...)               // 1
>>
>> On x86-64, it'd just be 8 instructions because the immediate range
>> for leaq/movq is ±2GB, which is Good Enough for the standard code
>> model, but of course it still expands to roughly the same amount of
>> code.
>>
>> Even without vtable_offset, it's a lot of code to inline.
>>
>> So we'd almost certainly want a client-side resolver function that
>> handled the normal case. Is that what you mean when you say II1+II2?
>> So the local resolver would be I2; II1; III2; IV2; V1, which leaves
>> us with a three-instruction call sequence, which I think is
>> equivalent to Objective-C, and that function would do this sequence:
>>
>>   define @local_resolveMethodAddress(%object, %method_index)
>>     %raw_isa = load %object             // 1 instruction
>>     %isa_mask = load @swift_isaMask     // 3: 2 to materialize the
>>                                         //    address from the GOT (not
>>                                         //    necessarily within ±1MB),
>>                                         //    1 to load from it
>>     %isa = and %raw_isa, %isa_mask      // 1
>>     %cache_table = @local.A.cache_table // 2: not necessarily within ±1MB
>>     %cache = add %cache_table, %method_index * 8 // 1
>>     tailcall @A.resolveMethod(%isa, %method_index, %cache) // 1
>>
>> John.
>
> Yes, exactly, except we haven't even done any client-side vtable
> optimization yet.
>
> To me the point of the local cache is to avoid calling
> @A.resolveMethod in the common case. So we need another
> load-compare-and-branch, which makes the local helper 12-13
> instructions. Then you have the vtable load itself, so that's 13-14
> instructions. You would be saving on dynamic instructions but paying
> with 4 extra static instructions per class.
>
> It would be lame if we can't force @local.A.cache_table to be within
> ±1MB of the helper.
>
> Inlining the cache table address might be worthwhile because
> %method_index would then be an immediate and hoisted to the top of
> the function.
>
> -Andy
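For reference, here is what that cached local helper might look like,
sketched in C: the pseudo-IR sequence above plus the
load-compare-and-branch, so the common case never leaves the client
module. Same caveats as before: A_resolveMethod, A_cache_table, and
A_NUM_METHODS are hypothetical names, the isa mask and thread-safety
of the cache slots are ignored, and nothing here pins down the real
ABI:

    #include <stddef.h>
    #include <stdint.h>

    typedef void (*Method)(void *self);

    typedef struct ClassMetadata {
        struct ClassMetadata *superclass;
        Method vtable[];
    } ClassMetadata;

    typedef struct Object {
        ClassMetadata *isa;   /* ignoring the isa mask for brevity */
    } Object;

    /* Assumed to be defined by A's module; may store a vtable offset
       through cache_slot for future client-side dispatch. */
    extern Method A_resolveMethod(ClassMetadata *isa, size_t method_index,
                                  intptr_t *cache_slot);

    /* One client-side slot per method of A; 0 means "not resolved
       yet" (assume vtable slot 0 is never a real method). */
    enum { A_NUM_METHODS = 32 };  /* hypothetical count */
    static intptr_t A_cache_table[A_NUM_METHODS];

    Method local_resolveMethodAddress(Object *object, size_t method_index) {
        intptr_t offset = A_cache_table[method_index]; /* load (parallel */
        ClassMetadata *isa = object->isa;              /* with isa load) */
        if (offset != 0)                               /* compare+branch */
            return isa->vtable[offset];                /* vtable load */
        /* Slow path: ask A's module, which may populate the slot. */
        return A_resolveMethod(isa, method_index,
                               &A_cache_table[method_index]);
    }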