Hi,

On Wed, 29 Aug 2007, Paul Brook wrote:
> > I solved that by placing one of the T[012] operands into memory
> > for HOST_I386, thereby freeing one reg. Here's some justification
> > of why that doesn't really cost performance: with three free regs
> > GCC is already spilling like mad in the snippets, we just trade one
> > of those memory accesses (to stack) with one other mem access to
> > the cpu_state structure, which will be in cache.
>
> Do you have any evidence to support this claim?

Not really, only an apples-and-oranges comparison: a 10000-iteration
tests/sha1 run in the same Linux image, with -no-kqemu, on host and
target i386:

  time ./sha1 with qemu-0.8.2 (compiled by gcc 3.3-hammer):          7.92 seconds
  time ./sha1 with qemu-0.9.0-cvs (gcc 4.1 compiled, with the patch): 8.15 seconds

I'll try to get a better comparison. Note, though, that this is the
only easy solution. Any other solution would either involve
substantially improving GCC's reload pass, or rewriting many of the
op.c patterns (for all targets) to never require more than three
registers, while also ensuring that gcc doesn't become clever and
merge some insns together again (and in trying to do that we would
probably slow down the snippets again). Or doing away with dyngen
altogether. All three alternatives are fairly involved, so I'm
personally fine with the above slowdown (for i386 targets you won't
normally see any noticeable slowdown anyway, as you'd use kqemu).

> Last time I did this it caused a significant performance hit. I'd guess
> that most common ops are simple enough that we don't need more than 3
> registers.

For the i386 target I'm redefining only T1 (as I figured that A0 and
T0 are used a bit more). It might be that some of the code could be
generated in a way that doesn't use T1 as much; I haven't really poked
at that.

> > --- qemu-0.9.0.cvs.orig/softmmu_header.h
> > -                 : "%eax", "%ecx", "%edx", "memory", "cc");
> > +                 : "%eax", "%edx", "memory", "cc");
>
> This change is wrong. The inline asm calls C code which clobbers %ecx.

Indeed, I must have been blind. Still, these are too many clobbers for
poor gcc4, so a push/pop of %ecx around the call is needed instead.
Curious that this didn't lead to fire all over the place. See the
patch below.

Ciao,
Michael.

Index: softmmu_header.h
===================================================================
RCS file: /sources/qemu/qemu/softmmu_header.h,v
retrieving revision 1.15
diff -u -p -r1.15 softmmu_header.h
--- softmmu_header.h    23 May 2007 19:58:10 -0000      1.15
+++ softmmu_header.h    29 Aug 2007 17:27:37 -0000
@@ -230,9 +230,11 @@ static inline void glue(glue(st, SUFFIX)
 #else
 #error unsupported size
 #endif
+                 "pushl %%ecx\n"
                  "pushl %6\n"
                  "call %7\n"
                  "popl %%eax\n"
+                 "popl %%ecx\n"
                  "jmp 2f\n"
                  "1:\n"
                  "addl 8(%%edx), %%eax\n"
@@ -250,14 +252,18 @@ static inline void glue(glue(st, SUFFIX)
                  : "r" (ptr),
 /* NOTE: 'q' would be needed as constraint, but we
    could not use it with T1 ! */
+#if DATA_SIZE == 1 || DATA_SIZE == 2
+                 "q" (v),
+#else
                  "r" (v),
+#endif
                  "i" ((CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS),
                  "i" (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS),
                  "i" (TARGET_PAGE_MASK | (DATA_SIZE - 1)),
                  "m" (*(uint32_t *)offsetof(CPUState, tlb_table[CPU_MEM_INDEX][0].addr_write)),
                  "i" (CPU_MEM_INDEX),
                  "m" (*(uint8_t *)&glue(glue(__st, SUFFIX), MMUSUFFIX))
-                 : "%eax", "%ecx", "%edx", "memory", "cc");
+                 : "%eax", "%edx", "memory", "cc");
 }
 #else
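
For the curious, here is roughly what the T[012]-in-memory
redefinition from the first paragraph amounts to. This is a sketch
only: the CPUState field name "t1" and the exact register assignments
are illustrative (GCC global register variables, i386 host), not
copied from the actual patch.

#include <stdint.h>

typedef struct CPUX86State {
    uint32_t t1;                 /* hypothetical in-memory slot for T1 */
    /* ... rest of the CPU state ... */
} CPUX86State;

/* Before: env and all dyngen temporaries were pinned to host
   registers, leaving gcc4 almost nothing to allocate with on i386:

     register CPUX86State *env asm("ebp");
     register uint32_t T0 asm("ebx");
     register uint32_t T1 asm("esi");
     register uint32_t A0 asm("edi");
*/

/* After: T1 becomes a memory operand in the (hot, thus cached)
   CPUState, freeing one host register for the compiler: */
register CPUX86State *env asm("ebp");
register uint32_t T0 asm("ebx");
register uint32_t A0 asm("edi");
#define T1 (env->t1)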
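
The pushl/popl %%ecx hunk can also be shown in isolation. A minimal
sketch, assuming a hypothetical cdecl helper() on a non-PIC i386
build: since the called C code may trash %ecx, the asm saves and
restores it by hand, and %ecx can then stay out of the clobber list,
easing gcc4's register pressure.

#include <stdint.h>

extern void helper(uint32_t arg);    /* hypothetical cdecl C helper */

static inline void call_helper(uint32_t arg)
{
    asm volatile("pushl %%ecx\n"     /* the callee may clobber %ecx... */
                 "pushl %0\n"
                 "call helper\n"
                 "addl $4, %%esp\n"  /* drop the pushed argument */
                 "popl %%ecx\n"      /* ...so restore it by hand */
                 : /* no outputs */
                 : "r" (arg)
                 : "%eax", "%edx", "memory", "cc");  /* no %ecx needed */
}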
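
As for the "q"-constraint hunk: 'q' restricts the operand to
%eax/%ebx/%ecx/%edx, the only i386 registers whose low byte is
addressable, which a byte-sized store needs. Per the NOTE in the
source it couldn't be used before, because v may be T1, which was
pinned to a non-'q' register; with T1 in memory that objection goes
away. A standalone illustration (not QEMU code):

#include <stdint.h>

static inline void store_byte(uint8_t *p, uint8_t v)
{
    /* "%b1" asks gcc for the 8-bit name (%al, %bl, %cl or %dl) of
       operand 1; with a plain "r" constraint gcc could pick %esi or
       %edi, which have no low-byte form, and the movb would fail. */
    asm volatile("movb %b1, %0"
                 : "=m" (*p)
                 : "q" (v));
}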