On Fri, Oct 27, 2023 at 6:55 AM Jeff Law <jeffreya...@gmail.com> wrote:
>
>
>
> On 10/27/23 01:49, Robin Dapp wrote:
> >> @@ -346,7 +346,7 @@ static const struct riscv_tune_param rocket_tune_info
> >> = {
> >>    {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},    /* fp_mul */
> >>    {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},  /* fp_div */
> >>    {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},    /* int_mul */
> >> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},    /* int_div */
> >> +  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)},  /* int_div */
> >>    1,                                         /* issue_rate */
> >>    3,                                         /* branch_cost */
> >>    5,                                         /* memory_cost */
> >> @@ -361,7 +361,7 @@ static const struct riscv_tune_param
> >> sifive_7_tune_info = {
> >>    {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},    /* fp_mul */
> >>    {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},  /* fp_div */
> >>    {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},    /* int_mul */
> >> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},    /* int_div */
> >> +  {COSTS_N_INSNS (33), COSTS_N_INSNS (65)},  /* int_div */
> >>    2,                                         /* issue_rate */
> >>    4,                                         /* branch_cost */
> >>    3,                                         /* memory_cost */
> >> @@ -376,7 +376,7 @@ static const struct riscv_tune_param
> >> thead_c906_tune_info = {
> >>    {COSTS_N_INSNS (4), COSTS_N_INSNS (5)},    /* fp_mul */
> >>    {COSTS_N_INSNS (20), COSTS_N_INSNS (20)},  /* fp_div */
> >>    {COSTS_N_INSNS (4), COSTS_N_INSNS (4)},    /* int_mul */
> >> -  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},    /* int_div */
> >> +  {COSTS_N_INSNS (18), COSTS_N_INSNS (34)},  /* int_div */
> >>    1,                                         /* issue_rate */
> >>    3,                                         /* branch_cost */
> >>    5,                                         /* memory_cost */
> >
> > Instruction costs don't really correspond to latencies even though
> > sometimes they are used as if they were.  I'm a bit wary of using
> > e.g. 65 which would disparage each use of an integer division inside
> > a sequence.
> >
> > Could you check which costs we need in order to still emit your wanted
> > sequence?  Maybe we can use values a bit lower than yours and still
> > get the proper code.  Where is the decision being made actually?
> The main use of costing of a div/mod instruction is to guide the
> reciprocal division code when dividing by a constant.  In that context
> we're comparing costs against a sequence of multiplies, shifts, add/sub
> insns which are almost always costed by their latency.  So using latency
> for division is a reasonable place to start.
>
> The other thing that might be worth investigating for those processors
> would be to set "use_divmod_expansion" in the cost structure.  I've
> heard talk of fusing div/mod into divmod, though I'm not aware of any
> part implementing that fusion
I'm also unaware of existing implementations that fuse these
operations; div + mul + sub is probably best for most uarches...

> (from a prior life, that would seem to
> require a 2nd output port on the integer unit which could be highly
> undesirable).

...but it can be done more cheaply than this, so I wouldn't foreclose
on the possibility.  Nevertheless, future work, as you say.

> Anyway, this could be a followup item for Yangyu if it
> looks profitable.
>
> jeff
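
For anyone following along, here is a rough C sketch of the two
sequences discussed above.  It is only an illustration, not what GCC
emits for any particular tuning: the function and struct names are
made up for this example, and the divisor 10 and its magic constant
are just one convenient case.  The first function is the kind of
multiply + shift sequence whose cost gets compared against int_div
when dividing by a constant; the second is the div + mul + sub
sequence that yields quotient and remainder from a single divide.

```c
#include <stdint.h>

/* Hypothetical helper: n / 10 without a divide instruction.
   0xCCCCCCCD is ceil(2^35 / 10), so the high bits of the 64-bit
   product give the exact quotient for every 32-bit unsigned n.  */
static inline uint32_t
div10_by_multiply (uint32_t n)
{
  return (uint32_t) (((uint64_t) n * 0xCCCCCCCDu) >> 35);
}

/* Hypothetical result type for the example below.  */
struct divmod_result
{
  uint32_t quot;
  uint32_t rem;
};

/* Hypothetical helper: quotient and remainder using one divide,
   then a multiply and a subtract (div + mul + sub) instead of a
   separate remainder instruction.  Assumes d != 0.  */
static inline struct divmod_result
divmod_via_mul_sub (uint32_t n, uint32_t d)
{
  struct divmod_result r;
  r.quot = n / d;            /* div */
  r.rem = n - r.quot * d;    /* mul + sub */
  return r;
}
```

With a low int_div cost the first form looks unattractive next to a
plain divide; with a latency-like cost it wins, which is the effect
the patch is after.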