2016-06-01 13:06 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
> On Wed, Jun 1, 2016 at 11:57 AM, Ilya Enkovich <enkovich....@gmail.com> wrote:
>> 2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubiz...@gmail.com>:
>>> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>>>> Hi Uros,
>>>>
>>>> Here is an initial patch to improve the performance of 64-bit integer
>>>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>>>> behind icc and clang on the rsa benchmark from the eembc2.0 suite.
>>>> The problem function looks like:
>>>>
>>>> typedef unsigned long long ull;
>>>> typedef unsigned long ul;
>>>>
>>>> ul mul_add(ul *rp, ul *ap, int num, ul w)
>>>> {
>>>>   ul c1 = 0;
>>>>   ull t;
>>>>   for (;;)
>>>>     {
>>>>       { t = (ull)w * ap[0] + rp[0] + c1;
>>>>         rp[0] = ((ul)t) & 0xffffffffL; c1 = ((ul)(t >> 32)) & 0xffffffffL; };
>>>>       if (--num == 0) break;
>>>>       { t = (ull)w * ap[1] + rp[1] + c1;
>>>>         rp[1] = ((ul)t) & 0xffffffffL; c1 = ((ul)(t >> 32)) & 0xffffffffL; };
>>>>       if (--num == 0) break;
>>>>       { t = (ull)w * ap[2] + rp[2] + c1;
>>>>         rp[2] = ((ul)t) & 0xffffffffL; c1 = ((ul)(t >> 32)) & 0xffffffffL; };
>>>>       if (--num == 0) break;
>>>>       { t = (ull)w * ap[3] + rp[3] + c1;
>>>>         rp[3] = ((ul)t) & 0xffffffffL; c1 = ((ul)(t >> 32)) & 0xffffffffL; };
>>>>       if (--num == 0) break;
>>>>       ap += 4;
>>>>       rp += 4;
>>>>     }
>>>>   return c1;
>>>> }
>>>>
>>>> If we apply the patch below we get a +6% speed-up for rsa on Silvermont.
>>>>
>>>> The patch looks like this (not complete, since there are other 64-bit
>>>> instructions, e.g. subtraction):
>>>>
>>>> Index: i386.md
>>>> ===================================================================
>>>> --- i386.md (revision 236181)
>>>> +++ i386.md (working copy)
>>>> @@ -5439,7 +5439,7 @@
>>>>            (clobber (reg:CC FLAGS_REG))]
>>>>    "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>>>>    "#"
>>>> -  "reload_completed"
>>>> +  "1"
>>>>    [(parallel [(set (reg:CCC FLAGS_REG)
>>>>                     (compare:CCC
>>>>                       (plus:DWIH (match_dup 1) (match_dup 2))
>>>>
>>>> What is your opinion?
>>>
>>> This splitter doesn't depend on hard registers, so there is no
>>> technical obstacle to the proposed patch. OTOH, this is a very old
>>> splitter; it is possible that it was introduced to handle some
>>> reload deficiencies. Maybe Jeff knows something about this approach.
>>> We have LRA now, so perhaps we have to rethink the purpose of these
>>> DImode splitters.
>>
>> The change doesn't spoil the splitter for the hard register case, so
>> the splitter should still be able to handle any reload deficiencies.
>> I think we should try to split all instructions working on multiword
>> registers (not only the PLUS case) at earlier passes to allow more
>> optimizations on the split code and relax register allocation (now we
>> need to allocate consecutive registers). Probably make a separate
>> split right after STV? This should help with PR70321.
>
> There are already pass_lower_subreg{,2}; not sure if x86 uses them for
> splitting DImode ops though.
Looking at the pass description, I see it works when "all the uses of a
multi-word register are via SUBREG, or are copies of the register to
another location". It doesn't cover cases where we operate on whole
multi-word registers.

Thanks,
Ilya

>
> Richard.
>
>> Thanks,
>> Ilya
>>
>>>
>>> A pragmatic approach would be - if the patch shows a measurable
>>> benefit, and doesn't introduce regressions, then Stage 1 is the time
>>> to try it.
>>>
>>> BTW: Use "&& 1" in the split condition of the combined insn_and_split
>>> pattern to copy the enable condition from the insn part. If there is
>>> no condition, you should just use "".
>>>
>>> Uros.