> From: Jeremie Courreges-Anglas <[email protected]>
> Date: Fri, 23 Jul 2021 11:54:51 +0200
> Content-Type: text/plain
>
>
> I've been using a variation of this diff on my hifive unmatched for
> a few days. The goal is to at least optimize the aligned cases by
> using 8- or 4-byte loads/stores. On this hifive unmatched, I found
> that loops of unaligned 8- or 4-byte loads/stores are utterly slow,
> much slower than equivalent 1-byte loads/stores (say 40x slower).
>
> This improves e.g. I/O throughput and shaves between 10 and 15s off
> a total of 11m30s in ''make clean; make -j4'' kernel builds.
>
> I have another diff that tries to re-align initially unaligned
> addresses when possible, but it's uglier and it's hard to tell
> whether it makes any difference in real life.
>
> ok?
>
>
> Index: copy.S
> ===================================================================
> RCS file: /d/cvs/src/sys/arch/riscv64/riscv64/copy.S,v
> retrieving revision 1.6
> diff -u -p -p -u -r1.6 copy.S
> --- copy.S 28 Jun 2021 18:53:10 -0000 1.6
> +++ copy.S 23 Jul 2021 07:45:16 -0000
> @@ -49,8 +49,38 @@ ENTRY(copyin)
> SWAP_FAULT_HANDLER(a3, a4, a5)
> ENTER_USER_ACCESS(a4)
>
> -// XXX optimize?
> .Lcopyio:
> +.Lcopy8:
> + li a5, 8
> + bltu a2, a5, .Lcopy4
> +
> + or a7, a0, a1
> + andi a7, a7, 7
> + bnez a7, .Lcopy4
> +
> +1: ld a4, 0(a0)
> + addi a0, a0, 8
> + sd a4, 0(a1)
> + addi a1, a1, 8
> + addi a2, a2, -8
> + bgtu a2, a5, 1b
Shouldn't this be

	bgeu a2, a5, 1b

so that a remaining count of exactly 8 gets one more pass through the
8-byte loop?  With bgtu, the loop exits when a2 reaches 8, and those
last 8 bytes fall through to .Lcopy4 to be copied as two 4-byte stores
instead of one 8-byte store.  Still correct, just a wasted iteration.
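To make the off-by-one concrete, here's a small C model (helper names
are mine, not kernel code) of how many bytes the 8-byte loop handles
under each branch condition, given that the preceding bltu guarantees
n >= 8 on entry:

```c
#include <stddef.h>

/*
 * Model the 8-byte loop's trip count.  The loop body copies 8 bytes,
 * subtracts 8 from the count, then branches back comparing a2 against
 * a5 (which holds 8).
 */
static size_t
copied_bgtu(size_t n)	/* branch back while a2 > 8, as in the diff */
{
	size_t copied = 0;

	do {
		copied += 8;
		n -= 8;
	} while (n > 8);
	return copied;
}

static size_t
copied_bgeu(size_t n)	/* branch back while a2 >= 8, as suggested */
{
	size_t copied = 0;

	do {
		copied += 8;
		n -= 8;
	} while (n >= 8);
	return copied;
}
```

For n = 16 the bgtu version copies only 8 bytes and leaves 8 for the
4-byte loop, while the bgeu version copies all 16; for counts that
don't hit exactly 8 mid-loop (say n = 12) the two behave identically.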
> +
> +.Lcopy4:
> + li a5, 4
> + bltu a2, a5, .Lcopy1
> +
> + andi a7, a7, 3
> + bnez a7, .Lcopy1
> +
> +1: lw a4, 0(a0)
> + addi a0, a0, 4
> + sw a4, 0(a1)
> + addi a1, a1, 4
> + addi a2, a2, -4
> + bgtu a2, a5, 1b
Same here?  With bgtu, a remaining count of exactly 4 falls through to
.Lcopy1 and gets copied byte by byte.
> +
> +.Lcopy1:
> + beqz a2, .Lcopy0
> 1: lb a4, 0(a0)
> addi a0, a0, 1
> sb a4, 0(a1)
> @@ -58,6 +88,7 @@ ENTRY(copyin)
> addi a2, a2, -1
> bnez a2, 1b
>
> +.Lcopy0:
> EXIT_USER_ACCESS(a4)
> SET_FAULT_HANDLER(a3, a4)
> .Lcopyiodone:
>
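For anyone following along in C, here's a rough model of the patched
.Lcopyio path (a sketch with hypothetical names, not the kernel code;
it uses the >=-style loop conditions suggested above, recomputes the
alignment test where the assembly reuses a7, and goes through memcpy()
for the word stores to stay portable, where the assembly uses ld/sd
and lw/sw directly):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Model of the copy loop: try 8-byte chunks when src, dst and the
 * count allow, then 4-byte chunks, then single bytes for the tail
 * and for any unaligned case.
 */
static void
copyio_model(void *dstv, const void *srcv, size_t n)
{
	unsigned char *dst = dstv;
	const unsigned char *src = srcv;

	/* 8-byte copies, only if both addresses are 8-byte aligned */
	if (n >= 8 && (((uintptr_t)dst | (uintptr_t)src) & 7) == 0) {
		while (n >= 8) {
			memcpy(dst, src, 8);
			dst += 8;
			src += 8;
			n -= 8;
		}
	}
	/* 4-byte copies, only if both addresses are 4-byte aligned */
	if (n >= 4 && (((uintptr_t)dst | (uintptr_t)src) & 3) == 0) {
		while (n >= 4) {
			memcpy(dst, src, 4);
			dst += 4;
			src += 4;
			n -= 4;
		}
	}
	/* byte-at-a-time tail / unaligned fallback */
	while (n > 0) {
		*dst++ = *src++;
		n--;
	}
}
```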
>
> --
> jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE
>
>