On 10.04.2011, at 22:08, Aurelien Jarno wrote:

> On Sun, Apr 10, 2011 at 09:25:33PM +0200, Alexander Graf wrote:
>>
>> On 10.04.2011, at 21:23, Aurelien Jarno wrote:
>>
>>> On Tue, Apr 05, 2011 at 09:55:09AM +0200, Alexander Graf wrote:
>>>>
>>>> On 05.04.2011, at 06:54, Aurelien Jarno wrote:
>>>>
>>>>> On Mon, Apr 04, 2011 at 04:32:24PM +0200, Alexander Graf wrote:
>>>>>> With the s390x target we use the deposit instruction to store 32bit values
>>>>>> into 64bit registers without clobbering the upper 32 bits.
>>>>>>
>>>>>> This specific operation can be optimized slightly by using the ext operation
>>>>>> instead of an explicit and in the deposit instruction. This patch adds that
>>>>>> special case to the generic deposit implementation.
>>>>>>
>>>>>> Signed-off-by: Alexander Graf <ag...@suse.de>
>>>>>> ---
>>>>>> tcg/tcg-op.h | 6 +++++-
>>>>>> 1 files changed, 5 insertions(+), 1 deletions(-)
>>>>>
>>>>> Have you really measured a difference here? This should already be
>>>>> handled, at least on x86, by this code:
>>>>>
>>>>>     if (TCG_TARGET_REG_BITS == 64) {
>>>>>         if (val == 0xffffffffu) {
>>>>>             tcg_out_ext32u(s, r0, r0);
>>>>>             return;
>>>>>         }
>>>>>         if (val == (uint32_t)val) {
>>>>>             /* AND with no high bits set can use a 32-bit operation. */
>>>>>             rexw = 0;
>>>>>         }
>>>>>     }
>>>>
>>>> I've certainly looked at the -d op logs and seen that instead of creating
>>>> a const tcg variable plus an AND there was now an extu opcode issued, yes.
>>>> No idea why the case up there didn't trigger.
>>>>
>>>
>>> The question there is looking at -d out_asm. They should be the same at
>>> the end as the code I pasted above is from tcg/i386/tcg-target.c.
>>
>> Yes. I was trying to optimize for maximum op length. TCG defines a maximum
>> number of tcg ops to be issued by each target instruction. Since s390 is
>> very CISCy, there are instructions that translate into lots of microops, but
>> are still faster than a C call (register save/restore mostly).
>>
>> Without this patch, there are some places where we hit that number :).
>
> Is it on 32-bit or 64-bit? If we reach this number, it's probably
> better to either implement this instruction with a helper, or maybe
> increase the number of maximum ops. What is this instruction?

This was on x86_64. I hit limits with LMH and LM, but reduced them to fit
into the picture with this optimization :). If you like, I can give you a
statically linked binary that could exceed the limits.

Alex
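
For readers following the thread, here is a minimal sketch in plain C of why the special case is equivalent. It is not the TCG code from the patch: the function names `deposit64_generic` and `deposit64_low32` are hypothetical and only model the arithmetic. The generic form masks the source with a constant, while the offset 0 / length 32 case can replace that mask with a 32-to-64-bit zero-extension, which is what the ext op expresses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a generic deposit: replace `len` bits of `dst`,
 * starting at bit `ofs`, with the low `len` bits of `src`.
 * This needs a mask constant and an AND on the source value. */
uint64_t deposit64_generic(uint64_t dst, uint64_t src,
                           unsigned ofs, unsigned len)
{
    assert(len > 0 && len <= 64 && ofs <= 64 - len);
    uint64_t mask = (len == 64) ? UINT64_MAX : ((UINT64_C(1) << len) - 1);
    return (dst & ~(mask << ofs)) | ((src & mask) << ofs);
}

/* Hypothetical model of the special case ofs == 0, len == 32:
 * "src & 0xffffffff" is the same as zero-extending the low 32 bits,
 * so the source side needs no mask constant and no explicit AND. */
uint64_t deposit64_low32(uint64_t dst, uint64_t src)
{
    uint64_t low = (uint32_t)src;   /* stands in for a 32-bit zero-extend */
    return (dst & ~UINT64_C(0xffffffff)) | low;
}
```

Both functions return the same value whenever the generic one is called with `ofs = 0` and `len = 32`; the upper 32 bits of `dst` are preserved in either case. Whether the shorter op sequence also yields different host code is the separate question raised above about `-d out_asm`.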