https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86680
--- Comment #8 from Florian La Roche <florian.laroche at googlemail dot com> ---
I've found something the compiler optimized quite nicely:
(Good for the compiler, but I'd be happy to stay with the original code,
which was much easier for humans to read.)

extern unsigned long __bss_start[];
extern unsigned long __bss_end[];
//extern unsigned long __bss_size;

void clear_bss(void)
{
    unsigned long *bss = __bss_start;
    unsigned long i, end = __bss_end - __bss_start;
    //unsigned long i = __bss_size;

    for (i = 0; i < end; i += sizeof (unsigned long))
        *bss++ = 0UL;
}

On aarch64 this results in the following code:

0000000000000000 <clear_bss>:
   0:   90000001    adrp    x1, 0 <__bss_end>
   4:   90000002    adrp    x2, 0 <__bss_start>
   8:   f9400021    ldr     x1, [x1]
   c:   f9400042    ldr     x2, [x2]
  10:   cb020021    sub     x1, x1, x2
  14:   9343fc21    asr     x1, x1, #3
  18:   b40000c1    cbz     x1, 30 <clear_bss+0x30>
  1c:   d2800000    mov     x0, #0x0                    // #0
  20:   f822681f    str     xzr, [x0, x2]
  24:   91002000    add     x0, x0, #0x8
  28:   eb00003f    cmp     x1, x0
  2c:   54ffffa8    b.hi    20 <clear_bss+0x20>  // b.pmore
  30:   d65f03c0    ret

Jakub, your example code also resulted in pretty large code
(but I've only tested 8.0.1, not the newest release).

Thanks a lot,
best regards,

Florian La Roche
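
A side note on the loop in clear_bss(), offered as a small sketch rather than a claim about what the code was meant to do: `__bss_end - __bss_start` is a pointer difference, so `end` is a count of `unsigned long` elements (the `asr x1, x1, #3` in the listing), while `i` advances by `sizeof (unsigned long)` per store. The loop therefore seems to perform only about one store per `sizeof (unsigned long)` elements. The helper names below (`clear_like_original`, `clear_per_word`) are hypothetical, and an ordinary buffer stands in for the linker-provided symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Same loop shape as clear_bss() in the comment, but operating on a
 * caller-supplied range instead of linker symbols (an assumption made
 * so the sketch can run anywhere). Returns the number of stores. */
static size_t clear_like_original(unsigned long *start, unsigned long *stop)
{
    assert(start <= stop);

    unsigned long *p = start;
    unsigned long i, end = stop - start;   /* element count, not bytes */
    size_t stores = 0;

    /* i steps by the word size in bytes but is compared to a word count. */
    for (i = 0; i < end; i += sizeof (unsigned long)) {
        *p++ = 0UL;
        stores++;
    }
    return stores;
}

/* Pointer-walking variant that stores to every word in the range. */
static size_t clear_per_word(unsigned long *start, unsigned long *stop)
{
    assert(start <= stop);

    size_t stores = 0;
    for (unsigned long *p = start; p < stop; p++) {
        *p = 0UL;
        stores++;
    }
    return stores;
}
```

For a range of 64 words, the first helper stores to only the leading 64 / sizeof (unsigned long) words (rounded up), while the second stores to all 64.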