http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
--- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-07-04 10:49:18 UTC ---
Created attachment 24670
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24670
memcpy/memset testing script

HJ, can you please run the attached script with the new glibc as

  sh test_stringop 64 640000000 gcc -march=native | tee out

In my quick testing with glibc 2.11 on a Core i5 and an AMD machine, inline
memcpy/memset is still a win on the i5 for all block sizes (our optimization
table is wrong, however, since it is inherited from the generic one). For
blocks of 512 bytes and above, however, the inline code is only about as fast
as the glibc code and obviously longer.

On the AMD machine the libcall is a win for blocks of 1k to 8k. For large
blocks inline seems to be a win again, for whatever reason. Probably the
prefetch logic is wrong in the older glibc.

If the glibc stringops have finally been made sane, we ought to revisit the
tables we generate the inline versions from.
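
For readers without the attachment at hand, the kind of measurement discussed above can be sketched roughly as follows. This is not the attached test_stringop script, whose contents are not shown here; the helper name bench_memcpy and its parameters are hypothetical, and the timing uses plain clock() for simplicity. It copies a fixed total number of bytes in blocks of a given size and reports the elapsed CPU time, so that builds using different stringop expansions can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not taken from the attached script: copy `total`
   bytes in chunks of `block` bytes and return the elapsed CPU time in
   seconds.  Returns a negative value if `block` is zero, if `total` is
   smaller than one block, or if allocation or verification fails. */
double bench_memcpy(size_t block, size_t total)
{
    if (block == 0 || total < block)
        return -1.0;

    unsigned char *src = malloc(block);
    unsigned char *dst = malloc(block);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, block);
    memset(dst, 0, block);

    size_t iters = total / block;
    clock_t start = clock();
    for (size_t i = 0; i < iters; i++) {
        memcpy(dst, src, block);
        /* Compiler barrier (GCC/Clang extension) so the copy is kept. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_t end = clock();

    /* Sanity check: the destination must now match the source. */
    int ok = memcmp(dst, src, block) == 0;
    free(src);
    free(dst);
    return ok ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling bench_memcpy(512, 640000000) from a small main and building the file once with -mstringop-strategy=libcall and once with, for example, -mstringop-strategy=rep_8byte should give a rough inline-versus-libcall comparison for that block size. Whether this matches what the attached script measures is an assumption.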