On 26.01.2011, at 09:53, Edgar E. Iglesias wrote: > On Tue, Jan 25, 2011 at 02:07:15PM -0800, Richard Henderson wrote: >> On 01/25/2011 08:48 AM, Edgar E. Iglesias wrote: >>> OK, I see. Maybe we should try to emit an insn sequence more similar >>> to what tcg was emitting (for the non 8 & 16-bit deposits)? >>> That ought too at least give similar results as before for those and >>> give us a speedup for the byte and word moves. >> >> Please try this replacement version for tcg/i386/*. >> >> I was able to run your cris testsuite, and stuff looked ok there. >> But for some reason the microblaze kernel would not boot. It seems >> that the kernel commandline isn't in place properly and it isn't >> finding the disk image. > > Yes, you need to build qemu with libfdt for that image to boot. > IIRC, libfdt comes with the dtc package on some dists. In the future > I'll see if can upload an image that boots without the need of > devicetree manipulation in qemu. > > I tried your new patch and got similar results as before. Maaybe slightly > faster but within the noise. > > I looked a little bit more at it and realized that I'm probably not doing > a fair comparition with microblaze. The write_carry sequence actually > bit deposits a bit into two locations with a sequence of tcg ops that > is not much longer than the one to deposit a single bit. So I'm basically > comparing the cost of a single tcg deposit sequence with the cost of two > tcg deposit backend ops. I should probably just accept that the new > deposit op is not worth using for that particular case. > > It would be nice if somebody else also tested this patch. Before we > agree on applying it. > > One note, the tcg_scratch_alloc hunk from the previous version was > missing on this one.
I'm using the deposit instruction on S/390 with the implementation patch for x86 where it appears to not improve performance: Booting into initrd shell: 3.53s w/ x86 deposit patch 3.52s w/o x86 deposit patch But maybe this is the same problem as with the shift opcode that was reported earlier in the thread: agraf@toonie:/studio/s390/qemu-s390> grep deposit target-s390x/translate.c tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 32); tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 32); tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 16); tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 8); tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 48, 16); tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 32, 16); The 0, 32 and 0, 16 versions should get accelerated pretty well while the 32, 16 and 48, 16 are not I assume? I'm not saying that the deposit instruction doesn't make sense btw. Code cleanup wise it was awesome - 10 lines of tcg combined into a single call :). Maybe a simple andi and ori emission for unknown masks might make more sense though? Alex