BALATON Zoltan <bala...@eik.bme.hu> writes:
> On Fri, 1 May 2020, Alex Bennée wrote:
>> 罗勇刚(Yonggang Luo) <luoyongg...@gmail.com> writes:
>>> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <bala...@eik.bme.hu> wrote:
>>>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>>>>> That's what I suggested:
>>>>> we preserve a float computing cache
>>>>>
>>>>> typedef struct FpRecord {
>>>>>     uint8_t op;
>>>>>     float32 A;
>>>>>     float32 B;
>>>>> } FpRecord;
>>>>>
>>>>> FpRecord fp_cache[1024];
>>>>> int fp_cache_length;
>>>>> uint32_t fp_exceptions;
>>>>>
>>>>> 1. For each new fp operation we push it to the fp_cache.
>>>>> 2. Once we read fp_exceptions, we re-compute it by re-running
>>>>>    the recorded FpRecord sequence, then clear fp_cache_length.
>>>>
>>>> Why do you need to store more than the last fp op? The cumulative
>>>> bits can be tracked, as is done for other targets, by not clearing
>>>> fp_status; then you can read them from there. Only the non-sticky
>>>> FI bit needs to be computed, but that is determined solely by the
>>>> last op, so it's enough to remember that op and re-run it with
>>>> softfloat (or even hardfloat after clearing the status, though
>>>> softfloat may be faster for this) to get the bits for the last op
>>>> when the status is read.
>>>>
>>> Yes, storing only the last fp op is also an option. Do you mean we
>>> store the last fp op and compute it only when necessary? I am
>>> thinking about a general fp optimization method that suits all
>>> targets.
>>
>> I think that's getting a little ahead of yourself. Let's prove the
>> technique is valuable for PPC (given it has the most to gain). We
>> can always generalise later if it's worthwhile.
>>
>> Rather than creating a new structure I would suggest creating 3 new
>> TCG globals (op, inA, inB) and re-factoring the front-end code so
>> that each FP op loads the TCG globals.
>
> So that's basically wherever we see helper_reset_fpstatus() in
> target/ppc we would need to replace it with saving the op and args
> to globals? Or just repurpose this helper to do that.
> This is called before every fp op but not before the sub-ops within
> vector ops. Is that correct? Probably it is, as vector ops are a
> single op, but then how do we detect flag changes caused by the
> sub-ops? There might be some existing bugs here, I think.

I'll defer to the PPC front-end experts on this. I'm not familiar
with how it all goes together.

>> The TCG optimizer should pick up aliased loads and automatically
>> eliminate the dead ones. We might need some new machinery for TCG
>> to avoid spilling the values over potentially faulting loads/stores,
>> but that is likely a phase 2 problem.
>
> I have no idea how to do this or even where to look. Some more
> detailed explanation may be needed here.

Don't worry about it for now. Let's worry about it when we see how
often faulting instructions are interleaved with fp ops.

>> Next you will want to find the places that care about the per-op
>> bits of cpu_fpscr and call a helper with the new globals to re-run
>> the computation and feed the values in.
>
> So the code that cares about these bits is in the guest, thus we
> would need to compute them when we detect the guest accessing them.
> Detecting when the individual bits are accessed might be difficult,
> so at first we could just check whether the fpscr is read and
> recompute the FI bit before returning the value. You previously said
> this might happen when the fpscr is read or when generating
> exceptions, but I'm not sure where exactly these are done for ppc.
> (I'd expect an mffpscr, but there seem to be various other ops
> accessing parts of the fpscr instead, found in
> target/ppc/fp-impl.inc.c:567, so this would need studying the PPC
> docs to understand how the guest can access the FI bit of the fpscr
> register.)
>
>> That would give you a reasonable working prototype to start doing
>> some measurements of the overhead and whether it makes a
>> difference.
>>
>>>>> 3. If we clear fp_exceptions, we set fp_cache_length to 0 and
>>>>>    clear fp_exceptions.
>>>>> 4.
>>>>> If the fp_cache is full, we re-compute fp_exceptions by
>>>>>    re-running the FpRecord sequence.
>>>>
>>>> All this cache management, and keeping more than one element,
>>>> seems unnecessary to me, although I may be missing something.
>>>>
>>>>> Now the key point is how to track the reads and writes of the
>>>>> FPSCR register. The current code is:
>>>>>
>>>>> cpu_fpscr = tcg_global_mem_new(cpu_env,
>>>>>                                offsetof(CPUPPCState, fpscr),
>>>>>                                "fpscr");
>>>>
>>>> Maybe you could search for where the value is read, which should
>>>> be the places where we need to handle it, but changes may be
>>>> needed to make a clear API for this between target/ppc, TCG and
>>>> softfloat, which likely does not exist yet.
>>
>> Once the per-op calculation is fixed in the PPC front end, I think
>> the only change needed is to remove the #if defined(TARGET_PPC) in
>> softfloat.c - it's only really there because it avoids the overhead
>> of checking flags which we always know to be clear in its case.
>
> That's the theory, but I've found that removing that define
> currently makes general fp ops slower but vector ops faster, so I
> think there may be some bugs that would need to be found and fixed.
> Testing with a proper test suite might be needed.

You might want to do what Laurent did and hack up a testfloat with
"system" implementations:

  https://github.com/vivier/m68k-testfloat/blob/master/testfloat/M68K-Linux-GCC/systfloat.c

It would be nice to plumb that sort of support into our existing
testfloat fork in the code base (tests/fp), but I suspect getting an
out-of-tree fork building and running first would be the quickest
way forward.

>
> Regards,
> BALATON Zoltan

--
Alex Bennée