Quoting Tom de Vries <vr...@codesourcery.com>:
> About the penalty, I don't really know. But since the optimization is both filling delay slots and removing duplicate code, it looks like a good idea to me.
It's usually beneficial, but for some microarchitectures, this kind of code confuses the branch predictor. So there should be a way for the port to turn this off.