On Mon, Aug 27, 2012 at 7:23 AM, Yeongkyoon Lee <yeongkyoon....@samsung.com> wrote:
> On 07/29/2012 00:39, Yeongkyoon Lee wrote:
>>
>> On 07/25/2012 23:00, Richard Henderson wrote:
>>>
>>> On 07/25/2012 12:35 AM, Yeongkyoon Lee wrote:
>>>>
>>>> +#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
>>>> +/* Macros/structures for qemu_ld/st IR code optimization:
>>>> +   TCG_MAX_HELPER_LABELS is defined as same as OPC_BUF_SIZE in exec-all.h. */
>>>> +#define TCG_MAX_QEMU_LDST 640
>>>
>>> Why statically size this ...
>>
>> This just followed the existing TCG code style, namely the allocation of
>> the "labels" array of TCGContext in tcg.c.
>>
>>>> +    /* labels info for qemu_ld/st IRs
>>>> +       The labels help to generate TLB miss case codes at the end of TB */
>>>> +    TCGLabelQemuLdst *qemu_ldst_labels;
>>>
>>> ... and then allocate the array dynamically?
>>
>> Ditto.
>>
>>>> +    /* jne slow_path */
>>>> +    /* XXX: How to avoid using OPC_JCC_long for peephole optimization? */
>>>> +    tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
>>>
>>> You can't, not and maintain the code-generate-until-address-reached
>>> exception invariant.
>>>
>>>> +#ifndef CONFIG_QEMU_LDST_OPTIMIZATION
>>>>  uint8_t __ldb_mmu(target_ulong addr, int mmu_idx);
>>>>  void __stb_mmu(target_ulong addr, uint8_t val, int mmu_idx);
>>>>  uint16_t __ldw_mmu(target_ulong addr, int mmu_idx);
>>>> @@ -28,6 +30,30 @@ void __stl_cmmu(target_ulong addr, uint32_t val, int mmu_idx);
>>>>  uint64_t __ldq_cmmu(target_ulong addr, int mmu_idx);
>>>>  void __stq_cmmu(target_ulong addr, uint64_t val, int mmu_idx);
>>>>  #else
>>>> +/* Extended versions of MMU helpers for qemu_ld/st optimization.
>>>> +   The additional argument is a host code address accessing guest memory */
>>>> +uint8_t ext_ldb_mmu(target_ulong addr, int mmu_idx, uintptr_t ra);
>>>
>>> Don't tie LDST_OPTIMIZATION directly to the extended function calls.
>>>
>>> For a host supporting predication, like ARM, the best code sequence
>>> may look like
>>>
>>>   (1) TLB check
>>>   (2) If hit, load value from memory
>>>   (3) If miss, call miss case (5)
>>>   (4) ... next code
>>>   ...
>>>   (5) Load call parameters
>>>   (6) Tail call (aka jump) to MMU helper
>>>
>>> so that (a) we need not explicitly load the address of (3) by hand
>>> for your RA parameter and (b) the mmu helper returns directly to (4).
>>>
>>> r~
>>
>> The difference between current HEAD and the code sequence you describe is,
>> I think, code locality.
>> My LDST_OPTIMIZATION patches improve code locality and also remove one jump.
>> They show about a 4% CoreMark performance improvement on an x86 host, which
>> supports predication like ARM.
>> The improvement for the AREG0 cases will probably be even larger.
>> I'm not yet sure exactly where the improvement comes from; I'll check it
>> with some tests later.
>>
>> In my humble opinion, there is nothing to lose with LDST_OPTIMIZATION
>> except for implicitly adding one argument to the MMU helpers, which does
>> not look critical.
>> What is your opinion?
>>
>> Thanks.
>
> It's been a long time.
>
> I've tested the performance effect of the one-jump difference on the fast
> qemu_ld/st path (TLB hit).
> The result shows a 3.6% CoreMark improvement from removing one jump, with
> the slow paths generated at the end of the block in both cases.
> That means removing one jump accounts for the majority of the performance
> gain from LDST_OPTIMIZATION.
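To make the layout under discussion concrete, here is a compilable sketch, not QEMU code: each qemu_ld/st fast path records a label when its forward branch is emitted, and all TLB-miss slow paths are generated together at the end of the TB, so the hit path falls straight through. Only the names TCGLabelQemuLdst, qemu_ldst_labels, and TCG_MAX_QEMU_LDST come from the patch series; the fields, helper names, and bodies are illustrative assumptions.

/* Compilable sketch of the slow-path bookkeeping; not QEMU code. */
#include <stdio.h>
#include <stdint.h>

#define TCG_MAX_QEMU_LDST 640

typedef struct TCGLabelQemuLdst {
    int is_ld;          /* load (1) or store (0) */
    int mem_index;      /* softmmu TLB index */
    uint8_t *label_ptr; /* fast-path "jne slow_path" to patch later */
    uint8_t *raddr;     /* host address the slow path returns to */
} TCGLabelQemuLdst;

static TCGLabelQemuLdst qemu_ldst_labels[TCG_MAX_QEMU_LDST];
static int nb_qemu_ldst_labels;

/* While emitting a fast path: remember the forward branch just emitted. */
static void add_qemu_ldst_label(int is_ld, int mem_index,
                                uint8_t *label_ptr, uint8_t *raddr)
{
    TCGLabelQemuLdst *l = &qemu_ldst_labels[nb_qemu_ldst_labels++];
    l->is_ld = is_ld;
    l->mem_index = mem_index;
    l->label_ptr = label_ptr;
    l->raddr = raddr;
}

/* Once per TB, after all fast paths: emit every TLB-miss slow path out of
   line, patching each recorded branch to reach it. */
static void gen_slow_paths(void)
{
    for (int i = 0; i < nb_qemu_ldst_labels; i++) {
        TCGLabelQemuLdst *l = &qemu_ldst_labels[i];
        printf("slow path %d: %s mem_index=%d patch=%p return=%p\n",
               i, l->is_ld ? "ld" : "st", l->mem_index,
               (void *)l->label_ptr, (void *)l->raddr);
    }
    nb_qemu_ldst_labels = 0; /* reset for the next TB */
}

int main(void)
{
    uint8_t code[2]; /* stand-ins for positions in emitted host code */
    add_qemu_ldst_label(1, 0, &code[0], &code[1]);
    gen_slow_paths();
    return 0;
}

This is also where the static-sizing question above comes from: one entry is needed per qemu_ld/st in a TB, which the patch bounds by OPC_BUF_SIZE.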
> As a result, the extended MMU helper functions are needed to attain that
> performance gain, and those extended functions are used only implicitly.
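As for what "used only implicitly" means: the generated fast path, not any C caller, materializes the extra return-address argument before branching to the slow path. A compilable sketch follows; only the ext_ldb_mmu signature is taken from the quoted patch, while the body, the target_ulong definition, and the call site are assumptions about intent.

#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>

typedef uint32_t target_ulong; /* assumption: a 32-bit guest */

/* Signature from the quoted patch; the body is a guess at the intent:
   'ra' is the host code address of the guest memory access, letting the
   slow path resynchronize guest CPU state for a precise exception. */
uint8_t ext_ldb_mmu(target_ulong addr, int mmu_idx, uintptr_t ra)
{
    printf("TLB miss: guest addr=0x%" PRIx32 " mmu_idx=%d host ra=%p\n",
           addr, mmu_idx, (void *)ra);
    return 0; /* placeholder: a real helper refills the TLB and reloads */
}

int main(void)
{
    /* Generated code would load 'ra' itself; here we approximate it with
       a GNU C label address, purely for demonstration. */
    uintptr_t ra = (uintptr_t)&&after_access;
    (void)ext_ldb_mmu(0x1000u, 0, ra);
after_access:
    return 0;
}

Richard's point (a) above is presumably that, in his predicated sequence, the call into the out-of-line stub already leaves the right return address behind, so no explicit parameter is required.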
> BTW, who will finally confirm my patches?
> I have sent four versions of my patches, in which I have applied all the
> reasonable feedback from this community.
> Currently, v4 is the final candidate, though it might need to be rebased
> onto the latest HEAD because it was sent a month ago.

I think the patches should be applied when 1.3 development opens.

> Thanks.