Last-level cache miss rates can be quite high on SPEC. Effectively, the caches above it (L1 and L2) "filter" out all of the easily cacheable accesses, so the last-level cache only sees the accesses that tend to miss.
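This filtering effect is just arithmetic on local miss rates: each cache level only sees the misses of the level above it, so a cache with a tiny *global* miss rate can still show a *local* miss rate near 100%. A minimal sketch (the numbers are illustrative, not from this thread):

```python
# Toy arithmetic showing why a last-level cache's *local* miss rate can be
# high even when few of the program's accesses actually reach memory.

def local_miss_rates(accesses, misses_per_level):
    """misses_per_level[i] = misses counted at level i. Each level only
    sees the misses of the level above it, so its local miss rate is
    misses[i] divided by the accesses it actually receives."""
    rates = []
    seen = accesses
    for m in misses_per_level:
        rates.append(m / seen)  # local miss rate at this level
        seen = m                # the next level only sees these misses
    return rates

# 1000 CPU accesses; L1 misses 100 of them, L2 misses 90, L3 misses 85.
l1, l2, l3 = local_miss_rates(1000, [100, 90, 85])
# l1 = 0.10, l2 = 0.90, l3 ~ 0.94 -- yet only 8.5% of all accesses
# (the global miss rate) ever go to memory.
```

The 90%+ L2/L3 numbers in the thread below are consistent with this picture when the working set fits mostly in L1.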
Aamer Jaleel, of Intel, has published miss rates for L1/L2/L3 cache configurations on the SPEC reference workloads on his website; they show this trend: http://www.jaleels.org/ajaleel/workload/

On Wed, Mar 13, 2013 at 9:03 AM, hanfeng QIN <hanfeng.h...@gmail.com> wrote:
> Hi Andreas,
>
> Many thanks for your suggestion. The generated figure is just what I
> expected, so I can confirm that my cache configuration is correct.
>
> However, I still need to think through these surprising simulation results.
>
> Hanfeng
>
> On 03/13/2013 09:36 PM, gem5-users-requ...@gem5.org wrote:
>> Hi Hanfeng,
>>
>> Without having looked at the numbers, have you had a look at the
>> generated config.dot.pdf in m5out to ensure that the system architecture
>> ends up being what you intend?
>>
>> You need py-dot installed for the figure to be generated.
>>
>> Andreas
>>
>> On 13/03/2013 08:19, "hanfeng QIN" <hanfeng.h...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I am modelling a cache hierarchy with private L2 caches and a shared
>>> L3 in gem5 (classic memory model, SE mode). The main simulation
>>> parameters are as follows:
>>>
>>> 1 core
>>> private L1 dcache: 32KB/8-way; icache: 32KB/4-way
>>> private L2 cache: 256KB/8-way
>>> shared L3 cache: 1MB/16-way
>>>
>>> I select workloads from SPEC CPU2006. The first 500 million
>>> instructions are fast-forwarded, the next 500 million warm up the
>>> caches, and the following 500 million are simulated in detail with
>>> the O3 CPU.
>>>
>>> I get the stats below (measured on 'switch_cpus_1' during the
>>> detailed phase; the first column is the benchmark number, the second
>>> the corresponding miss rate):
>>>
>>> (A) L1D$ miss rate:
>>>
>>> 401  0.039705
>>> 403  0.154247
>>> 410  0.107384
>>> 450  0.169413
>>> 459  0.000102
>>> 462  0.031115
>>> 471  0.060141
>>>
>>> (B) L2$ miss rate:
>>>
>>> 401  0.463055
>>> 403  0.900824
>>> 410  0.344414
>>> 450  0.820350
>>> 459  0.989815
>>> 462  0.997665
>>> 471  0.964760
>>>
>>> (C) L3$ miss rate:
>>>
>>> 401  0.334149
>>> 403  0.291530
>>> 410  0.918030
>>> 450  0.561909
>>> 459  0.970418
>>> 462  0.989723
>>> 471  0.080612
>>>
>>> I am surprised by these statistics. Several workloads (403.gcc,
>>> 459.GemsFDTD, 462.libquantum and 471.omnetpp) have L2 and L3 miss
>>> rates very close to 100%. I am not sure whether this comes from my
>>> cache configuration settings or from the intrinsic behaviour of the
>>> benchmarks. The relevant cache configuration files follow.
>>>
>>> -------- config/common/CacheConfig.py --------
>>>
>>>     if options.l3cache:
>>>         system.l3 = l3_cache_class(clock=options.clock,
>>>                                    size=options.l3_size,
>>>                                    assoc=options.l3_assoc,
>>>                                    block_size=options.cacheline_size)
>>>
>>>         system.tol3bus = CoherentBus(clock=options.clock, width=32)
>>>         system.l3.cpu_side = system.tol3bus.master
>>>         system.l3.mem_side = system.membus.slave
>>>     else:
>>>         if options.l2cache:
>>>             system.l2 = l2_cache_class(clock=options.clock,
>>>                                        size=options.l2_size,
>>>                                        assoc=options.l2_assoc,
>>>                                        block_size=options.cacheline_size)
>>>
>>>             system.tol2bus = CoherentBus(clock=options.clock, width=32)
>>>             system.l2.cpu_side = system.tol2bus.master
>>>             system.l2.mem_side = system.membus.slave
>>>
>>>     for i in xrange(options.num_cpus):
>>>         if options.caches:
>>>             icache = icache_class(size=options.l1i_size,
>>>                                   assoc=options.l1i_assoc,
>>>                                   block_size=options.cacheline_size)
>>>             dcache = dcache_class(size=options.l1d_size,
>>>                                   assoc=options.l1d_assoc,
>>>                                   block_size=options.cacheline_size)
>>>
>>>             if options.l3cache:
>>>                 system.cpu[i].l2 = l2_cache_class(size=options.l2_size,
>>>                                                   assoc=options.l2_assoc,
>>>                                                   block_size=options.cacheline_size)
>>>                 system.cpu[i].tol2bus = CoherentBus()
>>>                 system.cpu[i].l2.cpu_side = system.cpu[i].tol2bus.master
>>>                 system.cpu[i].l2.mem_side = system.tol3bus.slave
>>>
>>>             if buildEnv['TARGET_ISA'] == 'x86':
>>>                 system.cpu[i].addPrivateSplitL1Caches(icache, dcache,
>>>                                                       PageTableWalkerCache(),
>>>                                                       PageTableWalkerCache())
>>>             else:
>>>                 system.cpu[i].addPrivateSplitL1Caches(icache, dcache)
>>>         system.cpu[i].createInterruptController()
>>>         if options.l3cache:
>>>             system.cpu[i].connectAllPorts(system.cpu[i].tol2bus,
>>>                                           system.membus)
>>>         else:
>>>             if options.l2cache:
>>>                 system.cpu[i].connectAllPorts(system.tol2bus,
>>>                                               system.membus)
>>>             else:
>>>                 system.cpu[i].connectAllPorts(system.membus)
>>>
>>> -------- config/common/Caches.py --------
>>>
>>> class L1Cache(BaseCache):
>>>     assoc = 2
>>>     hit_latency = 2
>>>     response_latency = 2
>>>     block_size = 64
>>>     mshrs = 4
>>>     tgts_per_mshr = 20
>>>     is_top_level = True
>>>
>>> class L2Cache(BaseCache):
>>>     assoc = 8
>>>     block_size = 64
>>>     hit_latency = 8
>>>     response_latency = 8
>>>     mshrs = 16
>>>     tgts_per_mshr = 16
>>>     write_buffers = 8
>>>
>>> class L3Cache(BaseCache):
>>>     assoc = 16
>>>     block_size = 64
>>>     hit_latency = 20
>>>     response_latency = 20
>>>     mshrs = 512
>>>     tgts_per_mshr = 20
>>>     write_buffers = 256
>>>
>>> Am I making any mistakes here?
>>>
>>> Thanks in advance,
>>>
>>> Hanfeng
>
> _______________________________________________
> gem5-users mailing list
> gem5-users@gem5.org
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users