On Mon, Aug 27, 2012 at 7:23 AM, Yeongkyoon Lee <yeongkyoon....@samsung.com> wrote:
> On 07/29/2012 00:39, Yeongkyoon Lee wrote:
>>
>> On 07/25/2012 23:00, Richard Henderson wrote:
>>>
>>> On 07/25/2012 12:35 AM, Yeongkyoon Lee wrote:
>>>>
>>>> +#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
>>>> +/* Macros/structures for qemu_ld/st IR code optimization:
>>>> +   TCG_MAX_HELPER_LABELS is defined as same as OPC_BUF_SIZE in exec-all.h. */
>>>> +#define TCG_MAX_QEMU_LDST 640
>>>
>>> Why statically size this ...
>>
>> This just followed the existing TCG code style, namely the allocation of
>> the "labels" array of TCGContext in tcg.c.
>>
>>>> +    /* labels info for qemu_ld/st IRs
>>>> +       The labels help to generate TLB miss case codes at the end of TB */
>>>> +    TCGLabelQemuLdst *qemu_ldst_labels;
>>>
>>> ... and then allocate the array dynamically?
>>
>> Ditto.
>>
>>>> +    /* jne slow_path */
>>>> +    /* XXX: How to avoid using OPC_JCC_long for peephole optimization? */
>>>> +    tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
>>>
>>> You can't, not and maintain the code-generate-until-address-reached
>>> exception invariant.
>>>
>>>> +#ifndef CONFIG_QEMU_LDST_OPTIMIZATION
>>>>  uint8_t __ldb_mmu(target_ulong addr, int mmu_idx);
>>>>  void __stb_mmu(target_ulong addr, uint8_t val, int mmu_idx);
>>>>  uint16_t __ldw_mmu(target_ulong addr, int mmu_idx);
>>>> @@ -28,6 +30,30 @@ void __stl_cmmu(target_ulong addr, uint32_t val, int mmu_idx);
>>>>  uint64_t __ldq_cmmu(target_ulong addr, int mmu_idx);
>>>>  void __stq_cmmu(target_ulong addr, uint64_t val, int mmu_idx);
>>>>  #else
>>>> +/* Extended versions of MMU helpers for qemu_ld/st optimization.
>>>> +   The additional argument is a host code address accessing guest memory */
>>>> +uint8_t ext_ldb_mmu(target_ulong addr, int mmu_idx, uintptr_t ra);
>>>
>>> Don't tie LDST_OPTIMIZATION directly to the extended function calls.
>>>
>>> For a host supporting predication, like ARM, the best code sequence
>>> may look like
>>>
>>>   (1) TLB check
>>>   (2) If hit, load value from memory
>>>   (3) If miss, call miss case (5)
>>>   (4) ... next code
>>>   ...
>>>   (5) Load call parameters
>>>   (6) Tail call (aka jump) to MMU helper
>>>
>>> so that (a) we need not explicitly load the address of (3) by hand
>>> for your RA parameter and (b) the mmu helper returns directly to (4).
>>>
>>> r~
>>
>> The difference between current HEAD and the code sequence you describe is,
>> I think, code locality.
>> My LDST_OPTIMIZATION patches improve code locality and also remove one jump.
>> They show about a 4% CoreMark performance improvement on an x86 host, which
>> supports predication like ARM.
>> The improvement for the AREG0 cases will probably be even larger.
>> I'm not yet sure exactly where the improvement comes from; I'll check it
>> with some tests later.
>>
>> In my humble opinion, there is nothing to lose with LDST_OPTIMIZATION
>> except for implicitly adding one argument to the MMU helpers, which does
>> not look critical.
>> What is your opinion?
>>
>> Thanks.
>
> It's been a long time.
>
> I've tested the performance effect of the one-jump difference on the fast
> qemu_ld/st path (TLB hit).
> The result shows a 3.6% CoreMark improvement from removing one jump, with
> the slow paths generated at the end of the block in both cases.
> That means removing one jump accounts for the majority of the performance
> gain from LDST_OPTIMIZATION.
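To make the layout under discussion concrete, here is a compilable sketch, not QEMU code: each qemu_ld/st fast path records a label when its forward branch is emitted, and all TLB-miss slow paths are generated together at the end of the TB, so the hit path falls straight through. Only the names TCGLabelQemuLdst, qemu_ldst_labels, and TCG_MAX_QEMU_LDST come from the patch series; the fields, helper names, and bodies are illustrative assumptions.

/* Compilable sketch of the slow-path bookkeeping; not QEMU code. */
#include <stdio.h>
#include <stdint.h>

#define TCG_MAX_QEMU_LDST 640

typedef struct TCGLabelQemuLdst {
    int is_ld;          /* load (1) or store (0) */
    int mem_index;      /* softmmu TLB index */
    uint8_t *label_ptr; /* fast-path "jne slow_path" to patch later */
    uint8_t *raddr;     /* host address the slow path returns to */
} TCGLabelQemuLdst;

static TCGLabelQemuLdst qemu_ldst_labels[TCG_MAX_QEMU_LDST];
static int nb_qemu_ldst_labels;

/* While emitting a fast path: remember the forward branch just emitted. */
static void add_qemu_ldst_label(int is_ld, int mem_index,
                                uint8_t *label_ptr, uint8_t *raddr)
{
    TCGLabelQemuLdst *l = &qemu_ldst_labels[nb_qemu_ldst_labels++];
    l->is_ld = is_ld;
    l->mem_index = mem_index;
    l->label_ptr = label_ptr;
    l->raddr = raddr;
}

/* Once per TB, after all fast paths: emit every TLB-miss slow path out of
   line, patching each recorded branch to reach it. */
static void gen_slow_paths(void)
{
    for (int i = 0; i < nb_qemu_ldst_labels; i++) {
        TCGLabelQemuLdst *l = &qemu_ldst_labels[i];
        printf("slow path %d: %s mem_index=%d patch=%p return=%p\n",
               i, l->is_ld ? "ld" : "st", l->mem_index,
               (void *)l->label_ptr, (void *)l->raddr);
    }
    nb_qemu_ldst_labels = 0; /* reset for the next TB */
}

int main(void)
{
    uint8_t code[2]; /* stand-ins for positions in emitted host code */
    add_qemu_ldst_label(1, 0, &code[0], &code[1]);
    gen_slow_paths();
    return 0;
}

This is also where the static-sizing question above comes from: one entry is needed per qemu_ld/st in a TB, which the patch bounds by OPC_BUF_SIZE.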
> As a result, the extended MMU helper functions are needed to attain that
> performance gain, and those extended functions are used only implicitly.
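As for what "used only implicitly" means: the generated fast path, not any C caller, materializes the extra return-address argument before branching to the slow path. A compilable sketch follows; only the ext_ldb_mmu signature is taken from the quoted patch, while the body, the target_ulong definition, and the call site are assumptions about intent.

#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>

typedef uint32_t target_ulong; /* assumption: a 32-bit guest */

/* Signature from the quoted patch; the body is a guess at the intent:
   'ra' is the host code address of the guest memory access, letting the
   slow path resynchronize guest CPU state for a precise exception. */
uint8_t ext_ldb_mmu(target_ulong addr, int mmu_idx, uintptr_t ra)
{
    printf("TLB miss: guest addr=0x%" PRIx32 " mmu_idx=%d host ra=%p\n",
           addr, mmu_idx, (void *)ra);
    return 0; /* placeholder: a real helper refills the TLB and reloads */
}

int main(void)
{
    /* Generated code would load 'ra' itself; here we approximate it with
       a GNU C label address, purely for demonstration. */
    uintptr_t ra = (uintptr_t)&&after_access;
    (void)ext_ldb_mmu(0x1000u, 0, ra);
after_access:
    return 0;
}

Richard's point (a) above is presumably that, in his predicated sequence, the call into the out-of-line stub already leaves the right return address behind, so no explicit parameter is required.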
> BTW, who will finally confirm my patches?
> I have sent four versions of my patches, in which I have applied all the
> reasonable feedback from this community.
> Currently, v4 is the final candidate, though it might need to be rebased
> onto the latest HEAD because it was sent a month ago.

I think the patches should be applied when 1.3 development opens.

> Thanks.