I did some measurements before and after, and I noticed a few things.
First, the iTLB-load-misses rate drops from 0.25% all the way down to
0.02%. The frontend and backend stall cycles went down from 1.72% => 1.27%
and 13.90% => 10.62% respectively. The L1-icache-load-misses went *up* from
1.74% => 2.77%.

So it looks like performance is generally about the same or a little better
in most metrics, but for some reason the icache hit rate drops.

Performance measurements with partial linking:

        429,882.68 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
           145,986      page-faults:u             #    0.340 K/sec
 1,830,956,683,109      cycles:u                  #    4.259 GHz                      (35.71%)
    31,472,946,642      stalled-cycles-frontend:u #    1.72% frontend cycles idle     (35.71%)
   254,440,746,368      stalled-cycles-backend:u  #   13.90% backend cycles idle      (35.71%)
 4,117,921,862,700      instructions:u            #    2.25  insn per cycle
                                                  #    0.06  stalled cycles per insn  (35.71%)
   773,059,098,367      branches:u                # 1798.303 M/sec                    (35.71%)
     2,775,345,450      branch-misses:u           #    0.36% of all branches          (35.71%)
 2,329,109,097,524      L1-dcache-loads:u         # 5418.011 M/sec                    (35.71%)
    24,907,172,614      L1-dcache-load-misses:u   #    1.07% of all L1-dcache accesses  (35.71%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u
   872,678,362,265      L1-icache-loads:u         # 2030.038 M/sec                    (35.71%)
    15,221,564,231      L1-icache-load-misses:u   #    1.74% of all L1-icache accesses  (35.71%)
    48,763,102,717      dTLB-loads:u              #  113.434 M/sec                    (35.71%)
        75,459,133      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses  (35.71%)
     8,416,573,693      iTLB-loads:u              #   19.579 M/sec                    (35.72%)
        20,650,906      iTLB-load-misses:u        #    0.25% of all iTLB cache accesses  (35.72%)

     429.911532621 seconds time elapsed

     428.611864000 seconds user
       0.199257000 seconds sys


Performance measurements without partial linking:

        444,598.61 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
           145,528      page-faults:u             #    0.327 K/sec
 1,907,560,568,869      cycles:u                  #    4.291 GHz                      (35.71%)
    24,156,412,003      stalled-cycles-frontend:u #    1.27% frontend cycles idle     (35.72%)
   202,601,144,555      stalled-cycles-backend:u  #   10.62% backend cycles idle      (35.72%)
 4,118,200,832,359      instructions:u            #    2.16  insn per cycle
                                                  #    0.05  stalled cycles per insn  (35.72%)
   773,117,144,029      branches:u                # 1738.910 M/sec                    (35.72%)
     2,727,637,567      branch-misses:u           #    0.35% of all branches          (35.71%)
 2,326,960,449,159      L1-dcache-loads:u         # 5233.845 M/sec                    (35.71%)
    26,778,818,764      L1-dcache-load-misses:u   #    1.15% of all L1-dcache accesses  (35.71%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u
   903,186,314,629      L1-icache-loads:u         # 2031.465 M/sec                    (35.71%)
    25,017,115,665      L1-icache-load-misses:u   #    2.77% of all L1-icache accesses  (35.71%)
    50,448,039,415      dTLB-loads:u              #  113.469 M/sec                    (35.71%)
        78,186,127      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses  (35.71%)
     9,419,644,114      iTLB-loads:u              #   21.187 M/sec                    (35.71%)
         1,479,281      iTLB-load-misses:u        #    0.02% of all iTLB cache accesses  (35.71%)

     444.623341115 seconds time elapsed

     443.313786000 seconds user
       0.256109000 seconds sys

On Sat, Feb 6, 2021 at 5:20 AM Gabe Black <[email protected]> wrote:

> Out of curiosity I tried a quick x86 boot test, and saw that the run time
> with partial linking removed increased from just under 7 minutes to about 7
> and a half minutes.
>
> I thought about this for a while since at first I had no idea why that
> might happen. A theory I came up with was that with partial linking,
> related bits of the simulator are grouped together since they're generally
> in the same directory, and those will likely end up in the same part
> of the final binary. If those things are related, then you'll get better
> locality as far as TLB performance and maybe paging things in. gem5 is such
> a big executable that I doubt locality at that scale would make much of a
> difference at the granularity of cache lines. Also, possibly the
> relocations between those entities could be more efficient if the offsets
> they need to encode are smaller?
>
> If that's true, there are two ways I've thought of where we could get that
> sort of behavior back without reintroducing partial linking, both of which
> use attributes gcc provides which I assume clang would too.
>
> 1. The "hot" and "cold" attributes. "hot" makes a function get optimized
> particularly aggressively for performance, and "cold" makes the compiler
> optimize it for size. According to the docs, both could (and probably do)
> put the items in question into separate sections where they would have
> better locality, and the "cold" functions would stay out of the way.
>
> 2. Put things in different sections explicitly with the "section"
> attribute. This could explicitly group items we'd want to show up near
> each other, the way partial linking does implicitly/accidentally.
>
> A third option might be profile-guided optimization. I don't know
> how to get gcc or clang to use that or what it requires, but I think they
> at least *can* do something along those lines. That would hopefully give
> the compiler enough information that it could figure some of these things
> out on its own.
>
> The problem with this option might be that things we don't exercise in the
> profiling (devices or CPUs or features that aren't used) may look
> unimportant, but would be very important if the configuration of the
> simulator was different.
>
> One other thing we might want to try, and I'm not sure how this would
> work, might be to get gem5 loaded in with a larger page size somehow. Given
> how big the binary is, reducing pressure on the TLB that way would probably
> make a fairly big difference in performance.
>
> Gabe
>
_______________________________________________
gem5-dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]