On 6/22/2023 4:54 PM, Khan Shaikhul Hadi via gem5-users wrote:
Hi,
I want to simulate a persistent-memory machine in gem5. gem5 has an NVM memory
interface, but at the instruction level it does not, for the most part, simulate
CLFLUSH (especially under the MESI cache coherence protocol). I am also not sure
whether it simulates memory fences properly: for the out-of-order cpu, MFenceOp.execute
seems to just return no fault without doing anything, whereas I expected it to
ensure the store buffer is drained before later instructions can proceed. Given
that, how does one run a persistent-memory benchmark in gem5?
Side note: to make an update persistent on the x86 architecture, the update must
be followed by a FLUSH and a FENCE.
I think you're mostly correct about this.
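(For reference, the usual user-level persist pattern your side note describes
looks like the following - a minimal sketch using the Intel intrinsics from
<immintrin.h>; compile with -mclwb, and the function name is just illustrative.)

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Make [dst, dst+len) persistent: store, flush each affected
    // 64-byte cache line, then fence.
    void persist_range(void *dst, const void *src, std::size_t len)
    {
        std::memcpy(dst, src, len);                          // the update
        auto p   = reinterpret_cast<std::uintptr_t>(dst) & ~std::uintptr_t(63);
        auto end = reinterpret_cast<std::uintptr_t>(dst) + len;
        for (; p < end; p += 64)
            _mm_clwb(reinterpret_cast<void *>(p));           // FLUSH (write back line)
        _mm_sfence();   // FENCE: flushes complete before later stores
    }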
I coded up an improved version, but have not extracted the code and submitted
it back. Maybe I can give that (largish) patch some priority, to help others
out. There are several issues involved:
- Decoding clwb (which I presume you want), which is fairly easy.
- Giving clflush, clflushopt, and clwb, along with sfence, mfence, and lfence,
the right ordering properties to model the x86 semantics (a given x86
implementation *might* impose more order, but clflushopt and clwb do not
follow total store order the way ordinary stores do).
- Having clflush, clflushopt, and clwb not complete until the data have
reached the memory controller. (I believe the version at present treats
them as done when the request reaches the cache.) A sketch of how such a
request might be expressed follows this list.
- I also added support for bulk cache flush operations (wbinvd and wbnoinvd),
which may be of less use because they're privileged, for security reasons.
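For the third item, one plausible way to express "flush to where coherence is
resolved" is gem5's existing cache maintenance plumbing (added for ARM's DC CVAC
and friends). Request::CLEAN, Request::DST_POC, and MemCmd::CleanSharedReq are,
as far as I know, real names in recent gem5; everything else in this fragment
is illustrative, not my actual patch:

    #include "mem/packet.hh"
    #include "mem/request.hh"

    // Illustrative: build a clean-to-point-of-coherence request for one
    // line. lineAddr, lineSize, and reqId stand in for values the cpu
    // model supplies. This shape fits clwb (clean, keep the line);
    // clflush would add Request::INVALIDATE and use
    // MemCmd::CleanInvalidReq instead.
    RequestPtr req = std::make_shared<Request>(
        lineAddr, lineSize,
        Request::CLEAN | Request::DST_POC,   // clean to point of coherence
        reqId);
    PacketPtr pkt = new Packet(req, MemCmd::CleanSharedReq);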
The resulting design has clflush and friends send a packet up to the Point of
Coherence (that's an ARM term, but means where coherence is resolved,
generally the memory bus). Then snoops are sent back down to all caches.
This means that asking for a flush of a line residing dirty in some
other cpu's L1 cache (for example) will indeed flush the line. When the data
(if any) are sent to the memory bus, the bus then sends a response down. (If
no caches hold the data, a response is also sent.) In principle it would be
possible to force a wait until the data are recorded in the memory array, but
since Intel guarantees persistence once data reach the controller, having the
packet cross the memory bus suffices.
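In packet terms, the round trip is roughly the following. I believe isClean()
and satisfied() are real Packet predicates in recent gem5; the helpers and the
control flow are a paraphrase of the idea, not the actual CoherentXBar code:

    // Paraphrase of the flow at the point of coherence; hypothetical helpers.
    void handleCleanAtPoC(PacketPtr pkt)
    {
        assert(pkt->isClean());
        snoopCachesBelow(pkt);          // snoops go back down to all caches
        if (pkt->satisfied()) {
            // Some cache held the line dirty: its writeback is headed
            // for the memory bus; respond only once it crosses.
            respondAfterWriteback(pkt);
        } else {
            respondNow(pkt);            // no dirty copy anywhere
        }
    }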
Dealing with the weaker ordering of clflushopt and clwb required substantial
surgery to the store queue part of gem5's out-of-order cpu model, since it
processed items strictly in order when TSO was set (which is the appropriate
setting for x86).
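To give a flavor of the rules involved: clflushopt and clwb are ordered against
fences and against older stores and flushes to the same cache line, but not
against stores to other lines. In hypothetical store-queue terms (these types
are mine, not gem5's; the rules encoded are the architectural ones):

    #include <cassert>
    #include <cstdint>

    enum class Kind { Store, Clflush, Clflushopt, Clwb, Sfence, Mfence };
    struct SQEntry { Kind kind; std::uint64_t lineAddr; };

    // May a younger clflushopt/clwb issue before an older entry drains?
    bool mayPassOlder(const SQEntry &younger, const SQEntry &older)
    {
        assert(younger.kind == Kind::Clflushopt || younger.kind == Kind::Clwb);
        switch (older.kind) {
          case Kind::Sfence:
          case Kind::Mfence:
            return false;   // fences order flushes behind them
          case Kind::Clflush:
            return false;   // clflush is strongly ordered (conservative)
          default:
            // Ordered only against older traffic to the same line.
            return younger.lineAddr != older.lineAddr;
        }
    }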
The bulk clean ops (wbinvd, etc.) required a kind of additional "engine" in
the caches to find dirty lines and write them back, and then to detect when
they had all reached the memory bus before indicating completion.
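In outline, that engine is just a walker over the tag array plus an
outstanding-writeback counter; every name in this sketch is hypothetical:

    // Conceptual outline of a wbinvd-style engine, not the real patch.
    void BulkFlushEngine::start()
    {
        outstanding = 0;
        for (auto *blk : tags->allBlocks())        // walk every set/way
            if (blk->isValid() && blk->isDirty()) {
                issueWriteback(blk);               // packet toward memory
                ++outstanding;
            }
        if (outstanding == 0)
            signalDone();                          // nothing was dirty
    }

    void BulkFlushEngine::recvWritebackAck()       // one call per response
    {
        if (--outstanding == 0)
            signalDone();                          // all data crossed the bus
    }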
Anyway, yes, the setup as is will give you the semantics, but not the timing.
One further observation. More recent Intel models support eADR, an obscure
feature which turns out to mean that if data reach the cache, they will be
persistent (on power failure, the caches are flushed to the persistent media
using residual energy). This means you no longer need to use clflush and friends.
Further, given that x86 implements total store order on ordinary stores, for
the most part you don't even need fences, unless for some reason you need to
know that a given store has actually reached the cache. (If you're processing
things in a transaction-like way, you simply update the commit record after
updating everything else. If, after a crash, the commit record indicates
"committed", you know the previous stores also reached the cache, so the
transaction is durable (persisted).) So a fence wanted purely for ordering
purposes is unnecessary. A fence *does* guarantee that the store queue empties
before you proceed - but on a substantially out-of-order machine, draining a
queue like that might have a noticeable impact on performance for small
transactions. There might be occasions where you need fences to keep loads and
stores from passing each other, but (except for the above-noted clflushopt and
clwb) x86 semantics requires loads to be handled in order, and stores to be
handled in order, though the two queues are separate. (A load does need to see
preceding stores to the same byte by the same cpu, though.)
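Concretely, under eADR + TSO the transaction-like pattern can be as plain as
the following sketch; the release store is only there to stop the *compiler*
from reordering, since TSO already constrains the hardware:

    #include <atomic>
    #include <cstdint>

    // Sketch: record assumed to live in persistent memory; with eADR,
    // reaching the cache suffices, so there is no clwb and no sfence.
    struct TxRecord {
        std::uint64_t payload[4];
        std::atomic<bool> committed;
    };

    void commit(TxRecord &rec, const std::uint64_t (&vals)[4])
    {
        for (int i = 0; i < 4; ++i)
            rec.payload[i] = vals[i];   // ordinary stores; TSO keeps order
        // Release store: the compiler may not hoist it above the payload
        // stores. After a crash, committed == true implies the payload
        // stores also reached the cache, hence are persistent.
        rec.committed.store(true, std::memory_order_release);
    }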
Regards - Eliot Moss
_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org