https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62631
--- Comment #26 from Eric Botcazou <ebotcazou at gcc dot gnu.org> ---
> The generated code on PA looks optimal to me:
>
> zdep %r25,29,30,%r28
> b .L2
> ldi 99,%r19
> .L6:
> zdep %r25,29,30,%r28
> .L2:
> addl %r26,%r28,%r28
> ldo 1(%r25),%r25
> comb,>>= %r19,%r25,.L6
> stw %r0,0(%r28)
> bv,n %r0(%r2)
For most other architectures the BIV (%r25) is eliminated in favor of the GIV
(%r28), so you only have one additive operation in the loop. This is what
happens for 64-bit PA:
.L5:
ldo 4(%r26),%r26
cmpb,*>>,n %r28,%r26,.L5
stw %r0,0(%r26)
bve,n (%r2)
Why couldn't such code be generated for 32-bit PA too?
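
To make the difference concrete, here is a rough C-level sketch of the two loop
shapes. It is only an illustration: the function names, the do-while form and
the bound of 99 are assumptions read off the assembly above, not the actual
testcase from this PR.

```c
#include <string.h>

/* Assumed shape of the 32-bit PA loop: the BIV i is kept, and the
   address (the GIV) is recomputed from it on every iteration, so the
   loop carries a shift, an add and an increment.  */
static void clear_with_biv(int *a, unsigned int i)
{
  do {
    a[i] = 0;          /* address = a + (i << 2) */
    i++;               /* BIV update */
  } while (i <= 99);
}

/* Assumed shape after BIV elimination (as on 64-bit PA): only the
   pointer is stepped, and the exit test is rewritten against it.  */
static void clear_with_giv(int *a, unsigned int i)
{
  int *p = a + i;      /* GIV initial value */
  int *last = a + 99;  /* exit bound expressed in terms of the GIV */
  do {
    *p = 0;
    p++;               /* the single additive operation in the loop */
  } while (p <= last);
}

/* Hypothetical helper: returns 1 if both variants leave the same
   array contents for a given start index (start must be <= 99).  */
static int same_result(unsigned int start)
{
  int x[100], y[100];
  memset(x, 0xff, sizeof x);
  memset(y, 0xff, sizeof y);
  clear_with_biv(x, start);
  clear_with_giv(y, start);
  return memcmp(x, y, sizeof x) == 0;
}
```

Both functions store the same zeros for any start index up to 99; the second
form simply has nothing left in the loop but the pointer increment, the
compare and the store.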