https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118072
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So we are just left with the instability of the choice, which is based on the cache; sometimes the cache is different depending on whether it was first populated from the divide or from the mod. I suspect that if you time the mod with and without the udiv instruction, both might end up being similar. NOTE: you need to test large values too, not just small ones, since the udiv instruction has an early out on almost all aarch64 cores.
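
A rough sketch of the kind of timing comparison suggested above. Everything here is an assumption for illustration and is not taken from the bug report: the divisor 7, the function names, and the idea of hiding the divisor behind a volatile load so the compiler has to emit a real divide for one variant. Whether the constant-divisor variant actually avoids udiv depends on the compiler version, target, and flags, so check the generated assembly (e.g. with -S) before trusting any numbers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define DIVISOR 7u /* hypothetical divisor, not from the bug report */

/* Read once at run time so the compiler cannot treat the divisor as a constant. */
volatile uint32_t runtime_divisor = DIVISOR;

/* Constant divisor: the compiler may pick a multiply-based sequence or udiv. */
uint32_t mod_const(uint32_t x) { return x % DIVISOR; }

/* Unknown divisor: expected to need a real divide (udiv + msub on aarch64). */
uint32_t mod_runtime(uint32_t x, uint32_t d) { return x % d; }

/* Sum x % DIVISOR over n consecutive inputs starting at base. */
uint32_t sum_mod_const(uint32_t base, uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += mod_const(base + i);
    return acc;
}

/* Same sum, but with the divisor loaded from the volatile at run time. */
uint32_t sum_mod_runtime(uint32_t base, uint32_t n)
{
    uint32_t d = runtime_divisor;
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += mod_runtime(base + i, d);
    return acc;
}

#ifdef MOD_BENCH_MAIN
int main(void)
{
    const uint32_t n = 1u << 24;
    /* One range of small inputs and one near UINT32_MAX, because udiv
       may finish early on small operands. */
    const uint32_t bases[] = { 0u, 0xFF000000u };

    for (int k = 0; k < 2; k++) {
        clock_t t0 = clock();
        uint32_t a = sum_mod_const(bases[k], n);
        clock_t t1 = clock();
        uint32_t b = sum_mod_runtime(bases[k], n);
        clock_t t2 = clock();

        assert(a == b); /* both variants must agree */
        printf("base=0x%08x const=%.3fs runtime=%.3fs\n",
               (unsigned)bases[k],
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
    }
    return 0;
}
#endif
```

Building with something like `gcc -O2 -DMOD_BENCH_MAIN` gives a standalone driver. The loops are deliberately simple, so the compiler may vectorize or otherwise transform them; treat the timings as a first approximation only.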