2016-06-01 13:06 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
> On Wed, Jun 1, 2016 at 11:57 AM, Ilya Enkovich <enkovich....@gmail.com> wrote:
>> 2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubiz...@gmail.com>:
>>> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>>>> Hi Uros,
>>>>
>>>> Here is an initial patch to improve the performance of 64-bit integer
>>>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>>>> behind icc and clang on the rsa benchmark from the eembc2.0 suite.
>>>> The problem function looks like:
>>>>
>>>> typedef unsigned long long ull;
>>>> typedef unsigned long ul;
>>>>
>>>> ul mul_add(ul *rp, ul *ap, int num, ul w)
>>>> {
>>>>   ul c1 = 0;
>>>>   ull t;
>>>>   for (;;)
>>>>     {
>>>>       { t = (ull)w * ap[0] + rp[0] + c1;
>>>>         rp[0] = ((ul)t) & 0xffffffffL; c1 = ((ul)(t >> 32)) & 0xffffffffL; };
>>>>       if (--num == 0) break;
>>>>       { t = (ull)w * ap[1] + rp[1] + c1;
>>>>         rp[1] = ((ul)t) & 0xffffffffL; c1 = ((ul)(t >> 32)) & 0xffffffffL; };
>>>>       if (--num == 0) break;
>>>>       { t = (ull)w * ap[2] + rp[2] + c1;
>>>>         rp[2] = ((ul)t) & 0xffffffffL; c1 = ((ul)(t >> 32)) & 0xffffffffL; };
>>>>       if (--num == 0) break;
>>>>       { t = (ull)w * ap[3] + rp[3] + c1;
>>>>         rp[3] = ((ul)t) & 0xffffffffL; c1 = ((ul)(t >> 32)) & 0xffffffffL; };
>>>>       if (--num == 0) break;
>>>>       ap += 4;
>>>>       rp += 4;
>>>>     }
>>>>   return c1;
>>>> }
>>>>
>>>> If we apply the patch below we get a +6% speed-up for rsa on Silvermont.
>>>>
>>>> The patch looks like this (not complete, since there are other 64-bit
>>>> instructions, e.g. subtraction):
>>>>
>>>> Index: i386.md
>>>> ===================================================================
>>>> --- i386.md (revision 236181)
>>>> +++ i386.md (working copy)
>>>> @@ -5439,7 +5439,7 @@
>>>>            (clobber (reg:CC FLAGS_REG))]
>>>>    "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>>>>    "#"
>>>> -  "reload_completed"
>>>> +  "1"
>>>>    [(parallel [(set (reg:CCC FLAGS_REG)
>>>>                     (compare:CCC
>>>>                       (plus:DWIH (match_dup 1) (match_dup 2))
>>>>
>>>> What is your opinion?
>>>
>>> This splitter doesn't depend on hard registers, so there is no
>>> technical obstacle to the proposed patch. OTOH, this is a very old
>>> splitter; it is possible that it was introduced to handle some
>>> reload deficiencies. Maybe Jeff knows something about this approach.
>>> We have LRA now, so perhaps we have to rethink the purpose of these
>>> DImode splitters.
>>
>> The change doesn't spoil the splitter for the hard register case, so
>> the splitter should still be able to handle any reload deficiencies.
>> I think we should try to split all instructions working on multiword
>> registers (not only the PLUS case) at earlier passes to allow more
>> optimizations on the split code and relax register allocation (now we
>> need to allocate consecutive registers). Probably make a separate
>> split right after STV? This should help with PR70321.
>
> There are already pass_lower_subreg{,2}; not sure if x86 uses them for
> splitting DImode ops though.
Looking at the pass description, I see it works when "all the uses of a
multi-word register are via SUBREG, or are copies of the register to
another location". It doesn't cover cases where we operate on whole
multi-word registers.

Thanks,
Ilya

>
> Richard.
>
>> Thanks,
>> Ilya
>>
>>>
>>> A pragmatic approach would be - if the patch shows a measurable
>>> benefit, and doesn't introduce regressions, then Stage 1 is the time
>>> to try it.
>>>
>>> BTW: Use "&& 1" in the split condition of the combined insn_and_split
>>> pattern to copy the enable condition from the insn part. If there is
>>> no condition, you should just use "".
>>>
>>> Uros.