Since it's been working fine for me and, I think, Dibakar, I'll get it into a state to be submitted (I currently have the Ruby and classic changes in the same patch). I'll split them into two patches and post them to the reviewboard today.
On Tue, Nov 20, 2012 at 9:20 AM, Ali Saidi <sa...@umich.edu> wrote:

> Any updates Mitch?
>
> Thanks,
> Ali
>
> On 11.10.2012 20:44, Mitch Hayenga wrote:
>
> Hi,
>
> I have a patch that fixes this in classic and Ruby. I was waiting for
> another student (Dibakar, who runs a lot more parallel code than I do) to
> test it out before submitting it to the reviewboard. I'll bug him and see
> if he's tested it out yet.
>
> On Thu, Oct 11, 2012 at 7:32 PM, Ali Saidi <sa...@umich.edu> wrote:
>
>> Hi Mitch,
>>
>> Did you end up getting it working?
>>
>> Thanks,
>> Ali
>>
>> On Sep 26, 2012, at 3:39 PM, Steve Reinhardt wrote:
>>
>> That's a reasonable hardware implementation. Actually, you need a
>> register per hardware thread context, not just per core.
>>
>> Our software implementation is intended to model such a hardware
>> implementation, but the actual software is different for a couple of
>> reasons. The main one is that we don't want to do two address-based
>> lookups on every access; CAMs are much cheaper in HW than in SW.
>> Associating the LL state with each cache block means you can check the
>> lock state much more cheaply than by iterating over a set of lock
>> registers, particularly in the common case where there are no locks.
>> Also, the cache typically doesn't know how many CPUs or SMT thread
>> contexts it's supporting, so it's tricky to allocate the right number of
>> registers; the block-based model avoids this problem.
>>
>> I think you're right that the one thing we're not emulating properly is
>> that the recorded lock range should be tight and not be implicitly
>> expanded to cover the whole block as we've done. So you've convinced me
>> that that's not just the most straightforward fix, but probably the
>> right one.
>>
>> If you get it working, please submit the patch.
>>
>> Thanks!
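[Steve's block-based scheme can be sketched roughly as follows. This is an illustrative mock-up of the idea behind src/cache/blk.hh, not gem5's actual classes or API: each block carries a (usually empty) list of locks, an LL appends one, and any store clears them all, which is exactly the block-granularity behavior the thread is debating.]

```cpp
#include <list>

// Sketch of per-cache-block LL/SC tracking (hypothetical names).
struct Lock {
    int contextId;  // hardware thread context that issued the LL
    // Note: only the block is tracked, not the exact LL byte range --
    // the over-broad behavior discussed in this thread.
};

struct CacheBlock {
    std::list<Lock> lockList;  // usually empty: the cheap common case

    // Record a load-linked by this context.
    void trackLoadLocked(int contextId) {
        lockList.push_back({contextId});
    }

    // An ordinary store to this block clears all locks; a store-conditional
    // succeeds only if this context still holds a lock on the block.
    bool checkWrite(int contextId, bool isStoreConditional) {
        bool success = true;
        if (isStoreConditional) {
            success = false;
            for (const Lock &l : lockList)
                if (l.contextId == contextId) { success = true; break; }
        }
        lockList.clear();  // any write invalidates all outstanding LLs
        return success;
    }
};
```

[The design point Steve makes is visible here: checking `lockList` on a write is a cheap per-block test, and nothing needs to know how many thread contexts exist. The cost is that `clear()` is block-granular, which is what breaks the non-overlapping case in the trace below.]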
>>
>> Steve
>>
>> On Wed, Sep 26, 2012 at 1:25 PM, Mitch Hayenga <
>> mitch.hayenga+g...@gmail.com> wrote:
>>
>>> Hmm, I had always thought that LL/SC were handled with special
>>> address-range registers at the cache controller. Since a core should
>>> really only have one outstanding LL/SC pair, a register per core would
>>> suffice and exactly encode the range, basically achieving the same
>>> thing as your finer-grained locks within the cache block.
>>>
>>> On Wed, Sep 26, 2012 at 3:08 PM, Steve Reinhardt <ste...@gmail.com> wrote:
>>>
>>>> This is a pretty interesting issue. I'm not sure how it would be
>>>> handled in practice. Since the loads and stores in question are not to
>>>> the same address, in theory at least the store-set predictor should not
>>>> be involved. My guess is that the most straightforward fix would be to
>>>> record the actual range of the LL in the request structure and only
>>>> clear the lock flag on a store if the store truly overlaps (not just if
>>>> it's to the same block).
>>>>
>>>> Steve
>>>>
>>>> On Wed, Sep 26, 2012 at 12:50 PM, Mitch Hayenga <
>>>> mitch.hayenga+g...@gmail.com> wrote:
>>>>
>>>>> Thanks for the reply.
>>>>>
>>>>> Thinking about this... I don't know too much about the O3 store-set
>>>>> predictor, but it would seem that load-linked instructions should care
>>>>> about the entire cache line, not just whether the store happens to
>>>>> overlap. It looks like the pending stores write to the address range
>>>>> [0xf9c2c-0xf9c33], but the load-linked is to [0xf9c28-0xf9c2b]
>>>>> (non-overlapping, same cache line). So the load issues early, but the
>>>>> stores come in and clear the lock from the cache line. So either
>>>>> non-LLSC stores (from the same core) shouldn't clear the locks on a
>>>>> cache line (src/cache/blk.hh:279), or the store-set predictor should
>>>>> hold the load-linked until the stores (to the same cache line, but not
>>>>> overlapping) have written back.
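[Steve's proposed fix, recording the LL's actual range and clearing the lock only on a true overlap, reduces to a half-open interval check. A small sketch with hypothetical helper names, using the addresses from the trace above:]

```cpp
#include <cstdint>

using Addr = std::uint64_t;

// True if [addr1, addr1+size1) and [addr2, addr2+size2) share any byte.
bool rangesOverlap(Addr addr1, unsigned size1, Addr addr2, unsigned size2) {
    return addr1 < addr2 + size2 && addr2 < addr1 + size1;
}

// True if both addresses fall in the same cache line (lineSize a power of 2).
bool sameCacheLine(Addr a, Addr b, Addr lineSize) {
    return (a & ~(lineSize - 1)) == (b & ~(lineSize - 1));
}
```

[With the thread's example, the LL covers 0xf9c28 for 4 bytes and the pending stores cover 0xf9c2c for 8 bytes: `sameCacheLine` (64-byte line) is true, so the block-granularity check clears the lock, while `rangesOverlap` is false, so a range-based check would leave the lock intact and the SC would succeed.]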
>>>>> Dibakar, another grad student here, says this impacts Ruby as well.
>>>>>
>>>>> On Wed, Sep 26, 2012 at 1:27 PM, Ali Saidi <sa...@umich.edu> wrote:
>>>>>
>>>>>> Hi Mitch,
>>>>>>
>>>>>> I wonder if this happens in the steady state? With the
>>>>>> implementation, the store-set predictor should predict that the store
>>>>>> is going to conflict with the load and order them. Perhaps that isn't
>>>>>> getting trained correctly with LLSC ops. You really don't want to
>>>>>> mark the ops as serializing, as that slows down the CPU quite a bit.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Ali
>>>>>>
>>>>>> On 26.09.2012 13:14, Mitch Hayenga wrote:
>>>>>>
>>>>>> Background:
>>>>>> I have a non-O3, out-of-order CPU implemented on gem5. Since I don't
>>>>>> have a checker implemented yet, I tend to diff committed instructions
>>>>>> against O3. Yesterday's patches caused a few of these diffs to change
>>>>>> because of load-linked/store-conditional behavior (better prediction
>>>>>> on data ops that write the PC leads to denser load/store scheduling).
>>>>>>
>>>>>> Issue:
>>>>>> It seems O3's own loads/stores can cause its
>>>>>> load-linked/store-conditional pair to fail. Previously, running a
>>>>>> single core under SE, every load-linked/store-conditional pair would
>>>>>> succeed. Now I'm observing them failing 21% of the time (on
>>>>>> single-threaded programs). Although the programs functionally work
>>>>>> given how the LL/SC is coded currently, I think this points to the
>>>>>> fact that LL/SC should be handled slightly differently.
>>>>>>
>>>>>> Example:
>>>>>> Here is a trace from "Hello World" on ARM + O3 + single core + SE +
>>>>>> classic memory that shows this. It contains locks because I assume
>>>>>> the C++ library is thread-safe:
>>>>>> http://pastebin.com/sNjTPBWY
>>>>>> The O3 CPU is effectively doing a "Test and TestAndSet". It looks
>>>>>> like the load is for the Test and the load-linked is for the race for
>>>>>> memory.
>>>>>> Also, the CPU has a pending writeback to the same line. So
>>>>>> effectively, the TestAndSet fails (I haven't dug in to determine
>>>>>> whether it was the racing load or the writeback that caused the
>>>>>> failure).
>>>>>> Given this, shouldn't load-linked (in this case ldrex) instructions
>>>>>> be marked as non-speculative (or with one of the other flags) so that
>>>>>> they don't contend with earlier operations?
>>>>>> Thanks.
>>>>>>
>>>>>> _______________________________________________
>>>>>> gem5-users mailing list
>>>>>> gem5-users@gem5.org
>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
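[For reference, the "Test and TestAndSet" pattern seen in the trace is the standard spinlock-acquire idiom; on ARMv7 the TestAndSet half is an ldrex/strex pair. A sketch using `std::atomic` as a stand-in (its `compare_exchange_weak` typically lowers to ldrex/strex on ARMv7) -- an illustration of the idiom under discussion, not gem5 code:]

```cpp
#include <atomic>

// Test-and-test-and-set acquire: spin on a plain load (the "Test") and
// only attempt the atomic read-modify-write (the "TestAndSet", an
// ldrex/strex pair on ARMv7) once the lock looks free.
void ttasAcquire(std::atomic<int> &lock) {
    for (;;) {
        while (lock.load(std::memory_order_relaxed) != 0)
            ;  // Test: ordinary load, no exclusive monitor involved
        int expected = 0;
        // TestAndSet: ldrex sets the exclusive monitor; the strex succeeds
        // only if nothing disturbed the monitored location in between.
        if (lock.compare_exchange_weak(expected, 1,
                                       std::memory_order_acquire))
            return;
    }
}

void ttasRelease(std::atomic<int> &lock) {
    lock.store(0, std::memory_order_release);
}
```

[The bug in the thread is, in these terms, that the monitored location was being "disturbed" by the same core's own non-overlapping stores and writebacks to the line, so the strex kept failing even with no other core involved.]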