On Fri, May 1, 2020 at 10:18 PM Richard Henderson <
richard.hender...@linaro.org> wrote:

> On 5/1/20 6:10 AM, Alex Bennée wrote:
> >
> > 罗勇刚(Yonggang Luo) <luoyongg...@gmail.com> writes:
> >
> >> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <bala...@eik.bme.hu> wrote:
> >>
> >>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> >>>> That's what I suggested: we keep a cache of pending
> >>>> floating-point operations:
> >>>> typedef struct FpRecord {
> >>>>  uint8_t op;
> >>>>  float32 A;
> >>>>  float32 B;
> >>>> }  FpRecord;
> >>>> FpRecord fp_cache[1024];
> >>>> int fp_cache_length;
> >>>> uint32_t fp_exceptions;
> >>>>
> >>>> 1. For each new fp operation, we push it onto fp_cache.
> >>>> 2. When fp_exceptions is read, we recompute it by re-running
> >>>>    the recorded FpRecord sequence, and then reset
> >>>>    fp_cache_length to zero.
> >>>
> >>> Why do you need to store more than the last fp op? The cumulative bits
> >>> can be tracked like it's done for other targets by not clearing
> >>> fp_status then you can read it from there. Only the non-sticky FI bit
> >>> needs to be computed but that's only determined by the last op so it's
> >>> enough to remember that and run that with softfloat (or even hardfloat
> >>> after clearing status but softfloat may be faster for this) to get the
> >>> bits for last op when status is read.
> >>>
> >> Yes, storing only the last fp op is also an option. Do you mean that we
> >> store the last fp op and compute its flags only when necessary? I am
> >> thinking about a general fp optimization method that suits all targets.
> >
> > I think that's getting a little ahead of yourself. Let's prove the
> > technique is valuable for PPC (given it has the most to gain). We can
> > always generalise later if it's worthwhile.
>
> Indeed.
>
> > Rather than creating a new structure I would suggest creating 3 new tcg
> > globals (op, inA, inB) and re-factor the front-end code so each FP op
> > loaded the TCG globals. The TCG optimizer should pick up aliased loads
> > and automatically eliminate the dead ones. We might need some new
> > machinery for the TCG to avoid spilling the values over potentially
> > faulting loads/stores but that is likely a phase 2 problem.
>
> There's no point in new tcg globals.
>
> Every fp operation can raise an exception, and therefore every fp operation
> will flush tcg globals to memory.  Therefore there is no optimization to be
> done at the tcg opcode level.
>
> However, every fp operation calls a helper function, and the quickest
> thing to do is store the inputs to env->(op, inA, inB, inC) in the helper
> before performing the operation.
>
>
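
A minimal sketch of how I read this suggestion, in plain C rather than
real QEMU code. ToyEnv, toy_helper_fmul and toy_last_op_inexact are
hypothetical names, not existing QEMU symbols. Only multiplication is
modelled, because the product of two finite floats is exactly
representable in a double, which gives a simple way to detect
inexactness without touching host FP flags.

```c
#include <stdbool.h>
#include <stdint.h>

enum { TOY_OP_NONE = 0, TOY_OP_MUL = 1 };

/* Hypothetical stand-in for the per-CPU state; not QEMU's CPUPPCState. */
typedef struct {
    uint8_t op;        /* last recorded operation */
    float in_a, in_b;  /* its inputs */
} ToyEnv;

/* Fast path: record the inputs, then just perform the operation. */
static float toy_helper_fmul(ToyEnv *env, float a, float b)
{
    env->op = TOY_OP_MUL;
    env->in_a = a;
    env->in_b = b;
    return a * b;
}

/*
 * Slow path, run only when the guest reads the status register:
 * replay the recorded operation and derive its inexact bit.
 * For finite inputs the double product is exact, so the float result
 * is inexact iff rounding that product to float changes its value.
 */
static bool toy_last_op_inexact(const ToyEnv *env)
{
    if (env->op != TOY_OP_MUL) {
        return false;  /* nothing recorded yet */
    }
    double exact = (double)env->in_a * (double)env->in_b;
    float rounded = (float)exact;
    /* exact == exact filters out NaN, which is never "inexact". */
    return exact == exact && (double)rounded != exact;
}
```

The point is only that the per-operation cost becomes three stores, and
the flag computation moves to the (rare) status read.
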
> > Next you will want to find places that care about the per-op bits of
> > cpu_fpscr and call a helper with the new globals to re-run the
> > computation and feed the values in.
>
> Before we even get to this deferred fp operation thing, there are several
> giant improvements to ppc emulation that can be made:
>
> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
> I've mentioned this before, that it's possible to leave the steady-state of
> env->fp_status.exception_flags == 0, so there's no need for a separate
> function call.  I suspect this is worth a decent speedup by itself.
>
Hi Richard, what kind of rearrangement of the fp helpers needs to be done?
Can you give me a more detailed example? I still don't get the idea.
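
My tentative reading of step 1, as a toy sketch and not real QEMU code:
the post-processing that folds the per-operation flags into the guest
status also zeroes them, so exception_flags == 0 holds between
operations and no separate reset call is needed. ToyEnv, toy_fmul,
toy_check_status and TOY_FLAG_INEXACT are hypothetical names.

```c
#include <stdint.h>

#define TOY_FLAG_INEXACT 0x01u

/* Hypothetical stand-in for the CPU state. */
typedef struct {
    uint8_t exception_flags;  /* per-operation scratch flags */
    uint32_t fpscr;           /* sticky guest status bits */
} ToyEnv;

/* The operation itself: may set bits in exception_flags. */
static float toy_fmul(ToyEnv *env, float a, float b)
{
    double exact = (double)a * (double)b;  /* exact for finite floats */
    float r = (float)exact;
    if (exact == exact && (double)r != exact) {
        env->exception_flags |= TOY_FLAG_INEXACT;
    }
    return r;
}

/* Fold the flags into the guest status AND restore the steady state. */
static void toy_check_status(ToyEnv *env)
{
    if (env->exception_flags) {
        env->fpscr |= env->exception_flags;
        env->exception_flags = 0;
    }
}

/* Helper without any reset call: it relies on the invariant above. */
static float toy_helper_fmul(ToyEnv *env, float a, float b)
{
    float r = toy_fmul(env, a, b);
    toy_check_status(env);
    return r;
}
```

Is this roughly what you have in mind?
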

>
> Step 2 is to notice when all fp exceptions are masked, so that no
> exception can be raised, and set a tb_flags bit.  This is the default fp
> environment that libc enables and therefore extremely common.
>
> Currently, ppc has 3 helpers called per fp operation.  If step 1 is handled
> correctly, then we're down to 2 fp helpers per fp operation.  If no
> exceptions need raising, then we can perform the entire operation with a
> single function call.
>
> We would require a parallel set of fp helpers that (1) performs the
> operation and (2) does any post-processing of the exception bits straight
> away, but (3) without raising any exceptions.  Sort of like helper_fadd +
> do_float_check_status, but less.  IIRC the only real extra work is
> categorizing invalid exceptions.  We could even plausibly extend softfloat
> to do that while it is recording the invalid exception.
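
A rough sketch of how step 2 could be wired up, under my own
assumptions: the mask 0xF8 is meant to cover the five FPSCR enable bits
(VE, OE, UE, ZE, XE), and TOY_TB_FP_MASKED, ToyEnv and the toy_* helpers
are hypothetical names. The helper_calls counter only models how many
helper calls each path would cost.

```c
#include <stdint.h>

/* Assumed position of the FPSCR enable bits VE, OE, UE, ZE, XE. */
#define TOY_FPSCR_ENABLES 0xF8u
/* Hypothetical tb_flags bit: no fp exception can be raised. */
#define TOY_TB_FP_MASKED  (1u << 0)

typedef struct {
    uint32_t fpscr;
    unsigned helper_calls;  /* models the per-operation helper cost */
} ToyEnv;

typedef float (*ToyFpHelper)(ToyEnv *env, float a, float b);

/* General path: the operation plus a separate status check that may raise. */
static float toy_fadd_full(ToyEnv *env, float a, float b)
{
    env->helper_calls += 2;
    return a + b;
}

/* Masked path: operation and status update in a single call, never raises. */
static float toy_fadd_masked(ToyEnv *env, float a, float b)
{
    env->helper_calls += 1;
    return a + b;
}

/* Computed once per translation block from the current fp environment. */
static uint32_t toy_tb_flags(uint32_t fpscr)
{
    return (fpscr & TOY_FPSCR_ENABLES) == 0 ? TOY_TB_FP_MASKED : 0;
}

/* At translation time, pick which helper to emit a call to. */
static ToyFpHelper toy_pick_fadd(uint32_t tb_flags)
{
    return (tb_flags & TOY_TB_FP_MASKED) ? toy_fadd_masked : toy_fadd_full;
}
```

Because the choice is made at translation time, the common
all-masked case pays nothing at run time for the check.
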
>
> Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute
> inexact from the inverse hardfloat operation.  This would let us relax the
> restriction of only using hardfloat when we already have an accrued
> inexact exception.
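
For addition, one way to realise the inverse-operation idea is Knuth's
TwoSum, which I am swapping in here as an illustration; it is not
necessarily the exact check that would go into softfloat.c. The sketch
assumes finite inputs, no overflow, round-to-nearest, and that float
arithmetic is really done in single precision (the volatile
temporaries are there to force that). toy_fadd_inexact is a
hypothetical name.

```c
#include <stdbool.h>

/* Returns a + b and reports whether that float addition was inexact. */
static float toy_fadd_inexact(float a, float b, bool *inexact)
{
    volatile float s  = a + b;   /* the host (hardfloat) operation */
    volatile float bb = s - a;   /* inverse: the part of b that reached s */
    volatile float aa = s - bb;  /* inverse: the part of a that reached s */
    volatile float ea = a - aa;  /* what was lost from a */
    volatile float eb = b - bb;  /* what was lost from b */
    volatile float err = ea + eb;

    /* Under the stated assumptions, err is the exact rounding error. */
    *inexact = (err != 0.0f);
    return s;
}
```

This costs a handful of extra host operations but needs no access to
the host FP status flags.
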
>
> Only after all of these are done is it worth experimenting with caching the
> last fp operation.
>
>
> r~
>


-- 
Yours sincerely,
Yonggang Luo
