I have 16 vector registers in the machine, R16-R31, each holding 128 cells of 16 bits. They support ALU operations and loads/stores just like normal registers, but operate on all cells in one clock. So `add R16 R17 R18` adds the whole R17 array to R18 (corresponding cells) and places the result in R16. The 'where' instruction places a mask on the array so that the operation is done only in the cells where a certain condition is met; in the example in the previous e-mail, that is where `a` is less than `b`. I've read the description of doloop and I don't think I can use it in this case. I'll have to dig more or settle for -O0 and cry.
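To make the semantics concrete, here is a rough scalar model in C of a masked vector add as described above. It is only a sketch: the names `vreg_t`, `vmask_t`, `where_lt` and `masked_add` are hypothetical, and it assumes an unsigned comparison, 16-bit wraparound on overflow, and that masked-off cells keep their old value. None of this is confirmed by the description of the actual hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 128 /* cells per vector register, as described above */

/* Hypothetical model of one vector register: 128 cells of 16 bits. */
typedef struct { uint16_t cell[LANES]; } vreg_t;

/* Hypothetical model of the mask set up by 'where'. */
typedef struct { bool on[LANES]; } vmask_t;

/* Models "where a < b": enable only the cells where the condition holds.
   Assumption: the comparison is unsigned. */
static vmask_t where_lt(const vreg_t *a, const vreg_t *b)
{
    vmask_t m;
    for (size_t i = 0; i < LANES; i++)
        m.on[i] = a->cell[i] < b->cell[i];
    return m;
}

/* Models "add dst x y" under a mask: cell-wise add in enabled cells only.
   Assumption: disabled cells of dst are left unchanged. */
static void masked_add(vreg_t *dst, const vreg_t *x, const vreg_t *y,
                       const vmask_t *m)
{
    for (size_t i = 0; i < LANES; i++)
        if (m->on[i])
            dst->cell[i] = (uint16_t)(x->cell[i] + y->cell[i]);
}
```

On the real machine the loop over the cells does not exist, of course; the whole thing is a 'where' followed by a single add.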
Is it possible to abstract such pieces of code in the input program out into an independent function whose prologue and epilogue do the necessary settings? Just curious. Uday.