Hi Javed,

It seems there is a bug in handling CleanUnique requests. From the code (src/mem/ruby/protocol/chi/CHI-cache-transitions.sm):

    transition({I, SC, UC, SD, UD, RU, RSC, RSD, RUSD, RUSC,
                SC_RSC, SD_RSD, SD_RSC, UC_RSC, UC_RU, UD_RU, UD_RSD, UD_RSC},
               CleanUnique, BUSY_BLKD) {
      Initiate_Request;
      Initiate_CleanUnique;
      Pop_ReqRdyQueue;
      ProcessNextState;
    }

Profile_Miss/Profile_Hit are not called in this transition, so the hit/miss stats are not incremented for a CleanUnique arriving at the L3. Could you create a JIRA ticket to track this bug?

Also note that some requests that miss in the L2 never go to the L3. For example, if the line is UC/UD in one of the other cores' L1, it will always count as a miss in the L2, because the copy has to be fetched from the other core's L1, but no request is generated to the L3.

Thanks,
Tiago

________________________________
From: Javed Osmany <javed.osm...@huawei.com>
Sent: Friday, July 29, 2022 5:22 AM
To: gem5 users mailing list <gem5-users@gem5.org>
Cc: Javed Osmany <javed.osm...@huawei.com>
Subject: [gem5-users] CHI protocol - Adding an intermediate L3$ between L2$ and LLC (in HNF)

Hello

I am modelling the following system:

a) Three clusters: big (1 x CPU), middle (3 x CPU), little (4 x CPU)
b) All CPUs have private L1I and L1D caches.
c) Each cluster has a shared and unified L2$.
d) A shared and unified L3$, shared between the [middle, little] clusters. The L3$ is modelled as a CHI_Node.
e) 4 x HNF/LLC/Directory
f) 1 x SNF

I am using gem5-21.2.1.0.
An example of the command used to run the lu_ncb benchmark:

    ./build/ARM/gem5.opt \
      --outdir=m5out_parsec_lu_ncb_134_8rnf_1snf_4hnf_3_clust_all_shr_l2_sincl_sincl_mincl_debug_ruby_cache \
      --debug-flag=RubyCache \
      configs/example/se_kirin_custom.py \
      --ruby --topology=Crossbar --cpu-type=m1 --num-cpus=8 --num-dirs=1 --num-llc-caches=4 \
      --num-cpu-bigclust=1 --num-cpu-middleclust=3 --num-cpu-littleclust=4 --num-clusters=3 \
      --cpu-type-bigclust=m1 --cpu-type-middleclust=m1 --cpu-type-littleclust=a76 \
      --bigclust-l2cache=shared --middleclust-l2cache=shared --littleclust-l2cache=shared \
      --l1i-size-big=64kB --l1d-size-big=64kB --l1i-assoc-big=4 --l1d-assoc-big=4 \
      --l1i-size-middle=64kB --l1d-size-middle=64kB --l1i-assoc-middle=4 --l1d-assoc-middle=4 \
      --l1i-size-little=64kB --l1d-size-little=64kB --l1i-assoc-little=4 --l1d-assoc-little=4 \
      --l2-size-big=2048kB --l2-assoc-big=8 \
      --l2-size-middle=8192kB --l2-assoc-middle=16 \
      --l2-size-little=8192kB --l2-assoc-little=16 \
      --l3-size=2048kB --l3-assoc=16 \
      --num-bigclust-subclust=1 --num-middleclust-subclust=1 --num-littleclust-subclust=1 \
      --num-cpu-bigclust-subclust2=1 --num-cpu-middleclust-subclust2=1 --num-cpu-littleclust-subclust2=1 \
      --bp-type-littleclust=LTAGE --bp-type-middleclust=LTAGE --bp-type-bigclust=LTAGE \
      --l2-big-clusivity=sincl --l2-middle-clusivity=sincl --l2-little-clusivity=sincl --l3-clusivity=sincl \
      --l2-big-data-latency=12 --l2-middle-data-latency=12 --l2-little-data-latency=12 \
      --l2-big-tag-latency=5 --l2-middle-tag-latency=5 --l2-little-tag-latency=5 \
      --sc-size=1024kB --sc-assoc=16 \
      --l3-data-latency=45 --l3-tag-latency=10 \
      --sc-data-latency=60 --sc-tag-latency=20 --sc-clusivity=mincl \
      --little-mid-clust-add-l3=true \
      --big-cpu-clock=3GHz --middle-cpu-clock=2.6GHz --little-cpu-clock=2GHz \
      --sys-clock=1.1GHz --ruby-clock=2GHz \
      --cacheline_size=64 --verbose=true \
      --cmd=tests/parsec/splash2/lu_ncb/splash2x.lu_ncb.hooks -o ' -p4 -n512 -b16'

I am running the Parsec/Splash2 benchmark suite.
Extracting the stats from the stats.txt file, I have the following:

Benchmark      Demand L2$ misses   Demand L2$ accesses   Demand L3$ accesses
               (little cluster)    (little cluster)      (total)
Blackscholes                7019                 13506                  7165
Canneal                  9605353              33101031              10359847
Swaptions                   7656               1207307                  9992
Cholesky                 2724902               6206252               2686126
FFT                      2930037               3511657               2929728
Fmm                      1365976               3199668               1321580
Lu_cb                      58955                794479                 54026
Lu_ncb                   1026556               4665754                 51745
Raytrace                  594351               2471593                131095
Volrend                    93401               1039411                 22744
Water_sq                   24063                393792                 12840
Water_sp                   11435                166955                  8843

If I compare the first and third columns, the number of demand L3$ accesses is lower for the Splash2 benchmarks (in some benchmarks considerably lower) than the number of demand L2$ misses for the little cluster (the little cluster is the main compute cluster).

Question: Why don't all the L2$ misses make their way to the L3$?

In the attachment, I have included my versions of CHI.py, CHI_config.py, config.ini and stats.txt.

Any insight greatly appreciated.

Best regards
JO
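
The comparison described in the message above can be reproduced with a short script. The following is only a sketch: the counters are copied by hand from the table, and the function names (l3_to_l2_miss_ratio, benchmarks_below) are hypothetical helpers, not part of gem5. Note that the L3$ counter is a total over both clusters sharing the L3$, while the L2$ miss counter covers only the little cluster, so a ratio above 1 is possible.

```python
# Counters copied by hand from the table in the message (not read from stats.txt).
BENCHMARKS = ["Blackscholes", "Canneal", "Swaptions", "Cholesky", "FFT", "Fmm",
              "Lu_cb", "Lu_ncb", "Raytrace", "Volrend", "Water_sq", "Water_sp"]
L2_MISSES = [7019, 9605353, 7656, 2724902, 2930037, 1365976,
             58955, 1026556, 594351, 93401, 24063, 11435]
L3_ACCESSES = [7165, 10359847, 9992, 2686126, 2929728, 1321580,
               54026, 51745, 131095, 22744, 12840, 8843]


def l3_to_l2_miss_ratio():
    """Return {benchmark: demand L3$ accesses / demand L2$ misses (little cluster)}."""
    return {name: l3 / l2
            for name, l2, l3 in zip(BENCHMARKS, L2_MISSES, L3_ACCESSES)}


def benchmarks_below(threshold):
    """List benchmarks (in table order) whose ratio is below the given threshold."""
    return [name for name, ratio in l3_to_l2_miss_ratio().items()
            if ratio < threshold]


if __name__ == "__main__":
    for name, ratio in l3_to_l2_miss_ratio().items():
        print(f"{name:<13} {ratio:6.3f}")
    print("Below 0.9:", ", ".join(benchmarks_below(0.9)))
```

With a threshold of 0.9 (an arbitrary cut-off chosen for illustration), the benchmarks that stand out are Lu_ncb, Raytrace, Volrend, Water_sq and Water_sp, which is where L2$ misses served without an L3$ request, or requests that are not profiled at the L3$, would be expected to matter most.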