On Sat, Feb 18, 2023 at 1:30 PM Palmer Dabbelt <pal...@dabbelt.com> wrote:
>
> On Sat, 18 Feb 2023 13:06:02 PST (-0800), jeffreya...@gmail.com wrote:
> >
> >
> > On 2/18/23 11:26, Palmer Dabbelt wrote:
> >> On Fri, 17 Feb 2023 06:02:40 PST (-0800), gcc-patches@gcc.gnu.org wrote:
> >>> Hi all,
> >>> If we have division and remainder calculations with the same operands:
> >>>
> >>>   a = b / c;
> >>>   d = b % c;
> >>>
> >>> We can replace the calculation of remainder with multiplication +
> >>> subtraction, using the result from the previous division:
> >>>
> >>>   a = b / c;
> >>>   d = a * c;
> >>>   d = b - d;
> >>>
> >>> Which will be faster.
> >>
> >> Do you have any benchmarks that show that performance increase? The ISA
> >> manual specifically says the suggested sequence is div+mod, and while
> >> those suggestions don't always pan out for real hardware it's likely
> >> that at least some implementations will end up with the ISA-suggested
> >> fusions.
> > It'll almost certainly be visible in mcf. Been there, done that. In
> > fact, that's why I asked the team Matevos works on to poke at this case
> > as I went through this issue on another processor.
> >
> > It can also be run through LLVM's MCA to estimate counts if you've got a
> > pipeline description. The div+rem will come out at around 40c while a
> > div+mul+sub should weigh in around 25c for Veyron v1.
>
> Do you have a link to the patches somewhere? I couldn't find them
> online, just the custom instruction support. Or even just some docs
> describing what the pipeline does, as just basing one performance model
> on another is kind of a double-edged sword.
>
> That said, I think just knowing the processor doesn't do the div+mod
> fusion is sufficient to turn something like this on for the mtune for
> that processor. That's different than turning it on globally, though --
> unless it turns out nobody is actually doing the fusion suggested in the
> ISA manual, which wouldn't be super surprising.
>
> Maybe some of the SiFive and T-Head folks can chime in on whether or not
> their processors perform the fusion in question -- and if so, do the
> instructions need to stay back-to-back?

AFAIK, the sequence with the multiplication will normally be faster on
SiFive cores when both the quotient and the remainder are needed.

> It doesn't look like we're
> really targeting the code sequences the ISA suggests as it stands, so
> maybe it's OK to just switch the default over too?
>
> It also brings up the question of mulh+mul fusions, which I don't think
> we've really looked at (though maybe they're a lot less important for
> rv64).
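
For anyone who wants to sanity-check the rewrite being discussed, here is a
minimal C sketch of the two forms. The function names are made up for
illustration and are not from the patch. It only shows that the results
agree for ordinary operands under C's truncating division, and it says
nothing about which sequence is faster on a given core. It assumes c != 0
and that b / c does not overflow.

```c
#include <assert.h>

/* Hypothetical name: remainder computed directly, i.e. the div+rem form
   once the quotient is also needed elsewhere. */
static long rem_direct(long b, long c)
{
    return b % c;
}

/* Hypothetical name: remainder recovered from the quotient, i.e. the
   div+mul+sub form from the proposal.
   Assumes c != 0 and that b / c does not overflow. */
static long rem_via_mul_sub(long b, long c)
{
    long a = b / c;   /* quotient, truncated toward zero */
    long d = a * c;   /* largest multiple of c not past b in magnitude */
    return b - d;     /* what is left over is the remainder */
}
```

Since C99 guarantees (b / c) * c + b % c == b whenever b / c is
representable, the two helpers return the same value for those inputs,
including negative operands.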