On Tue, May 30, 2017 at 01:55:20PM +0100, Chris Wilson wrote:
> When looking at simple ioctls coupled with conveniently small user
> parameters, the overhead of the syscall and drm_ioctl() presents large
> low hanging fruit. Profiling trivial microbenchmarks around
> i915_gem_busy_ioctl, the low hanging fruit comprises the call to
> copy_user(). Those calls are only inlined by the macro where the
> constant is known at compile-time, but the ioctl argument size depends
> on the ioctl number. To help the compiler, explicitly add switches for
> the small sizes that expand to simple moves to/from user. Doing the
> multiple inlines does add significant code bloat, so its value is
> debatable. Back to the trivial, but frequently used,
> example of i915_gem_busy_ioctl() on a Broadwell avoiding the call gives
> us a 15-25% improvement:
> 
>                        before           after
>       single          100.173ns        84.496ns
>       parallel (x4)   204.275ns       152.957ns
> 
> On a baby Broxton nuc:
> 
>                       before           after
>       single          245.355ns       199.477ns
>       parallel (x2)   280.892ns       232.726ns
> 
> Looking at the cost distribution by moving an equivalent switch into
> arch/x86/lib/usercopy, the overhead to the copy user is split almost
> equally between the function call and the actual copy itself. It seems
> copy_user_enhanced_fast_string simply is not that good at small (single
> register) copies. Something as simple as
> 
> @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>         unsigned ret;
> 
> +       if (len <= 16)
> +               return copy_user_generic_unrolled(to, from, len);
> 
> is enough to speed up i915_gem_busy_ioctl() by 10% :|
> 
> Note that this overhead may entirely be x86 specific.
> 
> Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahti...@linux.intel.com>
> ---
> +     if (in_size) {
> +             if (unlikely(!access_ok(VERIFY_READ, arg, in_size)))
> +                     goto err_invalid_user;
> +
> +             switch (in_size) {
> +             case 4:
> +                     if (unlikely(__copy_from_user(kdata, (void __user *)arg,
> +                                                   4)))
> +                             goto err_invalid_user;
> +                     break;
> +             case 8:
> +                     if (unlikely(__copy_from_user(kdata, (void __user *)arg,
> +                                                   8)))
> +                             goto err_invalid_user;
> +                     break;
> +             case 16:
> +                     if (unlikely(__copy_from_user(kdata, (void __user *)arg,
> +                                                   16)))
> +                             goto err_invalid_user;
> +                     break;

For example, x86-32 currently only converts case 4 above. It could
trivially do case 8 as well, but by case 16 it may as well use the
function call to its loop in assembly. And x86-32 currently has no
optimisations for fixed-size puts.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx