On 28.10.2021 16:32, Juergen Gross wrote:
> On 21.10.21 16:41, Jan Beulich wrote:
>> On 15.10.2021 14:51, Juergen Gross wrote:
>>> Instead of using a function table use the generated switch statement
>>> macros for calling the appropriate hypercall handlers.
>>>
>>> This is beneficial to performance and avoids speculation issues.
>>>
>>> With the handlers now being called with the correct number of parameters,
>>> it is possible to do the parameter register clobbering in the NDEBUG
>>> case after returning from the handler. This in turn removes the only
>>> users of hypercall_args_table[], which can therefore be removed now.
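
(Just to make the mechanism concrete for readers of the thread: the generated
dispatch plus post-call clobbering amounts to something along the lines of the
sketch below. Handler names, the poison value, and the omission of type
conversions are illustrative only, not the script's actual output.)

  switch ( num )
  {
  case __HYPERVISOR_xen_version:
      /* Handler is invoked with exactly its two parameters. */
      ret = do_xen_version(arg1, arg2);
      break;
  case __HYPERVISOR_event_channel_op:
      ret = do_event_channel_op(arg1, arg2);
      break;
  /* ... remaining hypercalls ... */
  }
#ifndef NDEBUG
  /* No handler ever sees more than its real arguments, so all parameter
   * registers can be clobbered after the call, without consulting an
   * argument-count table. */
  arg1 = arg2 = arg3 = arg4 = arg5 = 0xdeadbeefdeadf00dUL;
#endif
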
>>
>> "removed" reads misleadingly to me: you really replace it with new tables,
>> using script-generated initializers. Also it looks like you're doubling
>> the data, as the same sets were previously used by pv64/hvm64 and
>> pv32/hvm32 respectively.
> 
> Yes, I'll change that paragraph.
> 
> Regarding having 4 tables on x86 now: merging the pv/hvm tables would be
> possible, but this would add some complexity to the script generating
> the tables (it should test whether the number of parameters of pv and
> hvm match). As the tables are present in debug builds only, I don't think
> this is a real issue.

Sure, but that imo wants saying in the description.

>> Overall, besides these mainly cosmetic aspects the main thing missing
>> is an approach to prioritize the handful most frequently used functions,
>> for them to be pulled out of the switch() so we don't depend on the
>> compiler's choice for the order of comparisons done.
> 
> I have already prepared that step by generating the complete call
> sequence, so any change for prioritizing some hypercalls can be local to
> the generator script and the used input data.
> 
> The main question is how to do that. I've collected some hypercall
> statistics data for PV and PVH guests running some simple tests (once a
> build of the Xen hypervisor, and once a scp of a large file). The data
> is split between guest and dom0 (PV) counts. There is no clear "winner" as
> to which hypercall should be dispatched fastest, but several hypercalls are
> clearly not important.
> 
> Here is the data:
> 
> PV-hypercall    PV-guest build   PV-guest scp    dom0 build     dom0 scp
> mmu_update           186175729           2865         20936        33725

Builds should be local to the guest and I/O should involve gnttab ops
but no mmu-update. Hence I have a hard time seeing where the huge
difference here would be coming from. Did you have any thoughts here?

> stack_switch           1273311          62381        108589       270764
> multicall              2182803             50           302          524

A fair amount of the mmu-updates is going to be coming through
multicalls, I would guess. Priorities therefore may even differ for
the two separate dispatch points.

> update_va_mapping       571868             10            60           80
> xen_version              73061            850           859         5432
> grant_table_op               0              0         35557       139110
> iret                  75673006         484132        268157       757958

The huge differences for builds are puzzling me here ...

> vcpu_op                 453037          71199        138224       334988
> set_segment_base       1650249          62387        108645       270823
> mmuext_op             11225681            188          7239         3426

... and here as well. Did Dom0 and DomU use identical numbers of
vCPU-s and identical -j make option values?

> sched_op                280153         134645         70729       137943
> event_channel_op        192327          66204         71409       214191
> physdev_op                   0              0          7721         4315
> (the dom0 values are for the guest running the build or scp test, i.e.
> with dom0 acting as the backend)
> 
> HVM-hypercall   PVH-guest build    PVH-guest scp
> vcpu_op                  277684             2324
> event_channel_op         350233            57383
> (the related dom0 counter values are in the same range as with the test
> running in the PV guest)
> 
> It should be noted that during boot of the guests the numbers for the PV
> guest are more like the ones for the build test, with the exception of
> iret and sched_op being higher, while for PVH sched_op is by far the
> most frequently used hypercall.
> 
> I'm not sure how to translate those numbers into a good algorithm for
> generating the call sequence.

Well, there's never going to be a clear-cut answer fitting everything, I
suppose.

> I could add priorities to each hypercall in hypercall-defs.c and have a
> cascade of if (likely(foo)) call_foo; else if (likely(bla)) ... else
> switch(rest).
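
Roughly along these lines, I suppose (the choice of iret and mmu_update as
the prioritized entries here is purely illustrative, not a suggestion):

  if ( likely(num == __HYPERVISOR_iret) )
      ret = do_iret();
  else if ( likely(num == __HYPERVISOR_mmu_update) )
      ret = do_mmu_update(arg1, arg2, arg3, arg4);
  else
      switch ( num )
      {
      /* ... all remaining, less frequent hypercalls ... */
      }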

Personally I'd lean towards an approach like this one; perhaps there's not
even a need to specify priorities for every hypercall, but just for the
ones we deem most frequently used?

Jan

> Or I could have groups of hypercalls with a priority for each group and:
> 
> mask = 1ULL << num;
> if (likely(mask & prio_1_mask)) switch(num) ...
> else if (likely(mask & prio_2_mask)) switch (num) ...
> ...
> else switch (num) ...
> 
> Or I could combine those approaches, using the mask variant where multiple
> entries share the same priority and the direct call variant where only a
> single entry has a specific priority.
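
As a sketch only (PRIO_1_MASK is a made-up name for a generator-emitted
constant, and singling out iret is again just an example):

  uint64_t mask = 1ULL << num;

  if ( likely(num == __HYPERVISOR_iret) )       /* lone top-priority entry */
      ret = do_iret();
  else if ( likely(mask & PRIO_1_MASK) )        /* small group of hot calls */
      switch ( num ) { /* ... */ }
  else                                          /* everything else */
      switch ( num ) { /* ... */ }
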
> 
> And then there is the problem of setting the priorities (fairly simple for
> HVM, more difficult for PV).
> 
> 
> Juergen
> 

