Hi Javed,

It seems there is a bug in handling CleanUnique requests. From the code 
(src/mem/ruby/protocol/chi/CHI-cache-transitions.sm):

transition({I, SC, UC, SD, UD, RU, RSC, RSD, RUSD, RUSC,
            SC_RSC, SD_RSD, SD_RSC, UC_RSC, UC_RU, UD_RU, UD_RSD, UD_RSC},
           CleanUnique, BUSY_BLKD) {
  Initiate_Request;
  Initiate_CleanUnique;
  Pop_ReqRdyQueue;
  ProcessNextState;
}

Profile_Miss/Profile_Hit are not called in this transition, so the hit/miss 
stats are not incremented for a CleanUnique arriving at the L3.

Could you create a JIRA ticket to track this bug?

Also note that some requests that miss in the L2 never go to the L3. E.g., if 
the line is UC/UD in another core's L1, the access always counts as a miss in 
the L2 because the copy has to be fetched from that core's L1, but no request 
is generated to the L3.
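
To make the accounting above concrete, here is a minimal toy model in Python. 
It is not gem5 code: the Counters class, the l2_demand_miss function and the 
rule that every other peer state forwards the request to the L3 are all 
assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical stat counters; not the real gem5 stat names."""
    l2_misses: int = 0
    l3_accesses: int = 0


def l2_demand_miss(peer_l1_state: str, c: Counters) -> str:
    """Account for one demand request that misses in the shared L2.

    peer_l1_state is the CHI state of the line in another core's L1
    within the same cluster (e.g. "I", "UC", "UD").
    """
    c.l2_misses += 1  # the L2 always records the miss
    if peer_l1_state in ("UC", "UD"):
        # A peer L1 holds the unique copy: the L2 fetches it from that
        # L1, so nothing is sent downstream to the L3.
        return "snoop_peer_l1"
    # Assumption for this sketch: every other case goes to the L3.
    c.l3_accesses += 1
    return "request_to_l3"


counters = Counters()
for state in ["I", "UC", "UD", "I"]:
    l2_demand_miss(state, counters)
print(counters)
```

In this sketch four L2 misses produce only two L3 accesses, which is the kind 
of gap you would see in the stats even without the CleanUnique profiling bug.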

Thanks,
Tiago


________________________________
From: Javed Osmany <javed.osm...@huawei.com>
Sent: Friday, July 29, 2022 5:22 AM
To: gem5 users mailing list <gem5-users@gem5.org>
Cc: Javed Osmany <javed.osm...@huawei.com>
Subject: [gem5-users] CHI protocol - Adding an intermediate L3$ between L2$ and 
LLC (in HNF)


Hello



I am modelling the following system:

a) Three clusters: big (1 x CPU), middle (3 x CPU), little (4 x CPU).
b) All CPUs have private L1I and L1D caches.
c) Each cluster has a shared and unified L2$.
d) A shared and unified L3$, shared between the [middle, little] clusters. 
   The L3$ is modelled as a CHI_Node.
e) 4 x HNF/LLC/Directory
f) 1 x SNF



I am using gem5-21.2.1.0.



An example of the command used to run the lu_ncb benchmark:

./build/ARM/gem5.opt 
--outdir=m5out_parsec_lu_ncb_134_8rnf_1snf_4hnf_3_clust_all_shr_l2_sincl_sincl_mincl_debug_ruby_cache
 --debug-flag=RubyCache configs/example/se_kirin_custom.py --ruby 
--topology=Crossbar --cpu-type=m1 --num-cpus=8 --num-dirs=1 --num-llc-caches=4 
--num-cpu-bigclust=1 --num-cpu-middleclust=3 --num-cpu-littleclust=4 
--num-clusters=3 --cpu-type-bigclust=m1 --cpu-type-middleclust=m1 
--cpu-type-littleclust=a76 --bigclust-l2cache=shared 
--middleclust-l2cache=shared --littleclust-l2cache=shared --l1i-size-big=64kB 
--l1d-size-big=64kB --l1i-assoc-big=4 --l1d-assoc-big=4 --l1i-size-middle=64kB 
--l1d-size-middle=64kB --l1i-assoc-middle=4 --l1d-assoc-middle=4 
--l1i-size-little=64kB --l1d-size-little=64kB --l1i-assoc-little=4 
--l1d-assoc-little=4 --l2-size-big=2048kB --l2-assoc-big=8 
--l2-size-middle=8192kB --l2-assoc-middle=16 --l2-size-little=8192kB 
--l2-assoc-little=16 --l3-size=2048kB --l3-assoc=16 --num-bigclust-subclust=1 
--num-middleclust-subclust=1 --num-littleclust-subclust=1 
--num-cpu-bigclust-subclust2=1 --num-cpu-middleclust-subclust2=1 
--num-cpu-littleclust-subclust2=1 --bp-type-littleclust=LTAGE 
--bp-type-middleclust=LTAGE --bp-type-bigclust=LTAGE --l2-big-clusivity=sincl 
--l2-middle-clusivity=sincl --l2-little-clusivity=sincl --l3-clusivity=sincl 
--l2-big-data-latency=12 --l2-middle-data-latency=12 
--l2-little-data-latency=12 --l2-big-tag-latency=5 --l2-middle-tag-latency=5 
--l2-little-tag-latency=5 --sc-size=1024kB --sc-assoc=16 --l3-data-latency=45 
--l3-tag-latency=10 --sc-data-latency=60 --sc-tag-latency=20 
--sc-clusivity=mincl --little-mid-clust-add-l3=true --big-cpu-clock=3GHz 
--middle-cpu-clock=2.6GHz --little-cpu-clock=2GHz --sys-clock=1.1GHz 
--ruby-clock=2GHz --cacheline_size=64 --verbose=true 
--cmd=tests/parsec/splash2/lu_ncb/splash2x.lu_ncb.hooks -o ' -p4 -n512 -b16'





I am running the Parsec/Splash2 benchmark suite.



Extracting the stats from the stats.txt file, I have the following:







Row 1: Demand L2$ miss, little cluster
Row 2: Demand L2$ accesses, little cluster
Row 3: Demand L3$ accesses, total

        Blackscholes   Canneal  Swaptions  Cholesky      FFT      Fmm
Row 1           7019   9605353       7656   2724902  2930037  1365976
Row 2          13506  33101031    1207307   6206252  3511657  3199668
Row 3           7165  10359847       9992   2686126  2929728  1321580

        Lu_cb   Lu_ncb  Raytrace  Volrend  Water_sq  Water_sp
Row 1   58955  1026556    594351    93401     24063     11435
Row 2  794479  4665754   2471593  1039411    393792    166955
Row 3   54026    51745    131095    22744     12840      8843

If I compare row 1 and row 3, the number of demand L3$ accesses is lower for 
the Splash2 benchmarks (and in some benchmarks, considerably lower) than the 
number of demand L2$ misses for the little cluster (the little cluster is the 
main compute cluster).
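
The comparison can be made concrete with a short script. This is only a 
sketch: the numbers are copied by hand from rows 1 and 3 of the table, and the 
variable and function names are illustrative, not taken from any gem5 script.

```python
# Benchmark names and counts copied from the stats table
# (row 1: demand L2$ misses, little cluster; row 3: demand L3$ accesses, total).
benchmarks = ["Blackscholes", "Canneal", "Swaptions", "Cholesky", "FFT", "Fmm",
              "Lu_cb", "Lu_ncb", "Raytrace", "Volrend", "Water_sq", "Water_sp"]
l2_miss_little = [7019, 9605353, 7656, 2724902, 2930037, 1365976,
                  58955, 1026556, 594351, 93401, 24063, 11435]
l3_access_total = [7165, 10359847, 9992, 2686126, 2929728, 1321580,
                   54026, 51745, 131095, 22744, 12840, 8843]


def l3_accesses_per_l2_miss(names, misses, accesses):
    """Return {benchmark: L3 accesses / little-cluster L2 misses}."""
    return {n: a / m for n, m, a in zip(names, misses, accesses)}


ratios = l3_accesses_per_l2_miss(benchmarks, l2_miss_little, l3_access_total)

# Print the benchmarks with the largest shortfall first.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {ratio:6.3f}")
```

A ratio well below 1 marks a benchmark where many little-cluster L2$ misses do 
not show up as L3$ accesses.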



Question: Why don't all the L2$ misses make their way to the L3$?



In the attachment, I have included my versions of CHI.py, CHI_config.py, 
config.ini, stats.txt.



Any insight greatly appreciated.



Best regards

JO




