On Tue, Jul 08, 2014 at 05:33:16PM +0100, Peter Maydell wrote:
> > Incidentally, combination of --enable-gprof and (default) --enable-pie
> > won't build - it dies with ld(1) complaining about relocs in gcrt1.o.
>
> This sounds like a toolchain bug to me :-)

Debian stable/amd64, gcc 4.7.2, binutils 2.22.  And google search finds
this, for example: http://osdir.com/ml/qemu-devel/2013-05/msg00710.html.
That one has gcc 4.4.3.

Anyway, adding --disable-pie to --enable-gprof gets it to build, but as
I said, gprof is no better than perf and oprofile - same problem.

Stats I quoted were from qemu-system-alpha booting debian/lenny (5.10)
and going through their kernel package build.  I have perf report in
front of me right now; the top ones are

    41.77%  qemu-system-alp  perf-24701.map      [.] 0x7fbbee558930
    11.78%  qemu-system-alp  qemu-system-alpha   [.] cpu_alpha_exec
     4.95%  qemu-system-alp  [vdso]              [.] 0x7fffdd7ff8de
     2.40%  qemu-system-alp  qemu-system-alpha   [.] phys_page_find
     1.49%  qemu-system-alp  qemu-system-alpha   [.] address_space_translate_internal
     1.34%  qemu-system-alp  [kernel.kallsyms]   [k] read_hpet
     1.26%  qemu-system-alp  qemu-system-alpha   [.] tlb_set_page
     1.23%  qemu-system-alp  qemu-system-alpha   [.] find_next_bit
     1.04%  qemu-system-alp  qemu-system-alpha   [.] get_page_addr_code
     1.01%  qemu-system-alp  libpthread-2.13.so  [.] pthread_mutex_lock
     0.88%  qemu-system-alp  qemu-system-alpha   [.] helper_cmpbge
     0.80%  qemu-system-alp  libc-2.13.so        [.] __memset_sse2
     0.72%  qemu-system-alp  libpthread-2.13.so  [.] __pthread_mutex_unlock_usercnt
     0.70%  qemu-system-alp  qemu-system-alpha   [.] get_physical_address
     0.69%  qemu-system-alp  qemu-system-alpha   [.] address_space_translate
     0.68%  qemu-system-alp  qemu-system-alpha   [.] tcg_optimize
     0.67%  qemu-system-alp  qemu-system-alpha   [.] ldq_phys
     0.63%  qemu-system-alp  qemu-system-alpha   [.] qemu_get_ram_ptr
     0.62%  qemu-system-alp  qemu-system-alpha   [.] helper_le_ldq_mmu
     0.57%  qemu-system-alp  qemu-system-alpha   [.] memory_region_is_ram

and cpu_alpha_exec() spends most of the time in inlined tb_find_fast().
It might be worth checking the actual distribution of the hash of the
virtual address used by that sucker - I wonder if dividing its argument
by 4 wouldn't improve things, but I don't have stats on the actual
frequency of conflicts, etc.  In any case, the first lump (42%) seems to
be tastier ;-)

There are all kinds of microoptimizations possible (e.g. helper_cmpbge()
could be done by a couple of MMX insns on amd64 host[1]), but it would
be nice to have some details on what we spend the time on in tcg
output...

[1] The reason why helper_cmpbge() shows up is that string functions on
alpha use that insn a lot; it _might_ be worth optimizing.
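
To make the idea in [1] concrete, here is a rough sketch, not the actual QEMU helper.  It assumes CMPBGE semantics as "bit i of the result is set when byte i of the first operand is >= byte i of the second, unsigned", and it uses SSE2 intrinsics rather than MMX proper, since SSE2 is always there on an amd64 host.  The names cmpbge_scalar() and cmpbge_simd() are made up for illustration; on a non-SSE2 host the second one just falls back to the loop.

```c
#include <stdint.h>

#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Scalar reference (hypothetical name): bit i of the result is set when
 * byte i of a is >= byte i of b, both taken as unsigned values. */
static uint64_t cmpbge_scalar(uint64_t a, uint64_t b)
{
    uint64_t res = 0;

    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (i * 8));
        uint8_t y = (uint8_t)(b >> (i * 8));

        if (x >= y) {
            res |= (uint64_t)1 << i;
        }
    }
    return res;
}

/* Vector variant (hypothetical name), assuming an SSE2-capable host. */
static uint64_t cmpbge_simd(uint64_t a, uint64_t b)
{
#if defined(__SSE2__) && defined(__x86_64__)
    /* Put each operand into the low 8 byte lanes; the upper lanes are 0. */
    __m128i va = _mm_cvtsi64_si128((long long)a);
    __m128i vb = _mm_cvtsi64_si128((long long)b);

    /* For unsigned bytes, x >= y exactly when max(x, y) == x. */
    __m128i ge = _mm_cmpeq_epi8(_mm_max_epu8(va, vb), va);

    /* Collect one bit per lane; the upper 8 lanes compare 0 with 0 and
     * always match, so mask them off. */
    return (uint64_t)(_mm_movemask_epi8(ge) & 0xff);
#else
    /* No SSE2 available: fall back to the byte loop. */
    return cmpbge_scalar(a, b);
#endif
}
```

The usual string-function pattern is a compare against zero: cmpbge_simd(0, x) sets bit i exactly when byte i of x is zero, which is how a NUL byte gets located within a quadword.  Whether this would be a measurable win inside the helper is untested here.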