The strategy that the SHMEDIA port uses to do cse and loop invariant code motion for division by invariant reciprocal can in principle be used by any processor that has a reasonable fast instructions for count leading sign bits (can be substituted with count leading zero bits), widening or highpart multiply, and dynamic shifts, and that typically allows to add a few more pseudo registers without much cost (i.e. 32 GPRS, or efficient stack access like on x86).
Basically, the division is split into an invariant computation which takes the divisor as input, and computes a denormalization shift count (SHIFT), an approximate reciprocal factor (INV1), and a reciprocal adjustment factor (INV2). The division is then performed by multiplying the INV2 with the dividend, shifting it right by a constant amount, subtracting that from (or adding to) the product of INV2 with the dividend, denormalize the result, and do an adjustment to take account oif different rounding for signed / unsigned results. The details of how to best compute SHIFT, INV1 and INV2, if INV2 has the same or opposite signe of INV2, and the constant shift count depend on the target instruction set and microarchitecture. However, not all processors are as good as the SH5 in computing the invariant values compared to how fast they can do a straight division; thus instead of chopping the division into multiple RTL pieces and then allowing any kind of CSE and PRE to rip them apart, we would like some finer control that requires a minimum number of divisions with the same divisor before this optimization is applied. Also, exposing this at the tree level will enable other optimizations to work better, e.g. they can make better unrolling decisions, and take advantage of the non-trapping nature of the division by multiply with reciprocal. The question is now how best to respresent the invariant operations as trees. We could have a special tree code for this, but then - It would eat up a tree code. - Producing three results at once, it is likely to be trouble for ssa transformations. Having a single built-in function for this purpose would avoid the tree code cost, but the we'd be faced with a three-valued built-in function. So I think that the easiest way to integrate this with the rest of the compiler is to have a target hook that emits trees to compute SHIFT, INV1 and INV2. These might use additional temporaries, and could use standard arithmetic and/or memory operations, machine-specific builtin functions, or a mixture of both; but any one tree expression would compute only one value. Emitting the computation of the actual division using SHIFT, INV1, INV2 is then done by a second target hook. The parameter for the minimum number of divisions with the same divisor to trigger this optimization should be separate from the one for floating point.