As you can see, the compiler keeps the address of data in r9 and uses it for data[0], but it also loads data+8 into r7 instead of addressing it as an offset from r9. If I remove the loop, it no longer does this.
Currently this optimization is performed only by CSE, which is why it cannot look through loops.
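For reference, here is a minimal sketch of the kind of code being discussed. The original test case is not shown in this message, so the array `data`, its element type, and both function names are assumptions made up for illustration. With 4-byte elements, `data[2]` sits at byte offset 8 from `data`. Whether a given compiler emits a second address load in the loop version depends on the target, the compiler version, and the optimization flags.

```c
#include <stdint.h>

/* Hypothetical global array; not taken from the original test case. */
int32_t data[4];

/* No loop between the two accesses: the address of data[2] can be
   formed as an offset (8 bytes) from a register already holding data. */
void store_no_loop(int32_t v)
{
    data[0] = v;
    data[2] = v;
}

/* data[0] is written before the loop and data[2] inside it, so the
   two address computations are separated by a loop boundary. */
void store_with_loop(int32_t v, int n)
{
    data[0] = v;
    for (int i = 0; i < n; i++)
        data[2] += v;
}
```

Compiling both functions with `-O2 -S` and comparing the address computations for `data` and `data+8` should show whether the loop changes the generated code on a particular target.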
Paolo