[seL4] Re: some performance problem when test 4 cores SMP答复: Devel Digest, Vol 127, Issue 5

yadong . li Tue, 07 Dec 2021 02:23:59 -0800

Hi:
> So you would expect that the reported metric would scale following Amdahl's 
> law based on the proportion of an operation that is serialized inside the 
> kernel lock which would potentially vary across platforms.
> It cannot simply be throughput, else the doubled delay should be reflected in 
> a significantly reduced throughput, but it has almost no effect.
    1. This is what I also feel confused, I think whether it is depend on the 
implementation of ipc_normal_delay
    ipc_normal_delay will call below code:
         delay = OVERHEAD_FIXUP(REXP(id) * current_delay_cycle, overhead);
         insert a command such as 'zigset(...)'for every core, The ziggurat 
method for generating random variables, when double delay, the random variables 
coefficient will reduce the delay influence?
    2. " the proportion of an operation that is serialized inside the kernel 
lock ", CLH lock is serialized between cores, if 4 cores will have big 
different performance on different ARM platform?


> For 3 or 4  cores the combined latency of two IPCs is larger than the 500cy 
> delay and you expect lock contention, resulting in reduced scaling, while it 
> should still scale almost perfectly with the 1000cy delay.
> > > Addition：
> >   Our seL4_Call performance is same with other platform
> >                      XXXX             IMX8MM_EVK_64     TX2_64
> >   seL4_Call           367(0)             378(2)              492(16)        
> >   client->server, same vspace, ipc_len is 0
> >   seL4_ReplyRecv     396(0)             402(2)               513(16)        
> >   server->client, same vspace, ipc_len is 0
        1. the seL4_Call and seL4_ReplyRecv are measured on SMP kernel
    2. or 3 or 4  cores the combined latency of two IPCs is larger than the 
500cy delay and you expect lock contention ----  that's what I thought at 
first, But It can't be explained very well on IMX8MM_EVK_64, base on the test 
seL4_Call data, when run on 3 cores and 4cores, IMX8MM_EVK_64 still have a good 
scalability, so I think not all seL4_Call and ReplyRecv flow which cost maybe 
300 or 400 is locked(like el trap), TX2 should have a good scalability as 
expected when run on 3 or 4 cores, but it did not show good, from data 
below：1497740（3core， 500cycles），1545872（4core） 


> > My test results below:
> >                                                                       ARM 
> > platform
> >       Test item             XXX                  IMX8MM_EVK_64          TX2
> >                             mean(Stddev)
> >      500 cycles, 1 core        636545(46)            625605(29)            
> > 598142(365)
> >      500 cycles, 2 cores      897900(2327)      1154209(44)           
> > 994298(94)
> >      500 cycles, 3 cores     1301679(2036)      1726043(65)        
> > 1497740(127)
> >      500 cycles, 4 cores     1387678(549)       2172109(12674)   
> > 1545872(109)
> >     1000 cycles, 1 core       636529(42)            625599(22)            
> > 597627(161)
> >     1000 cycles, 2 cores      899212(3384)      1134110(34)          
> > 994437(541)
> >     1000 cycles, 3 cores     1297322(5028)      1695385(45)        
> > 1497547(714)
> >     1000 cycles, 4 cores     1387149(456)        2174605(81)        
> > 1545716(614)


> I notice your standard deviations for 2 and 3 cores are surprisingly high 
> (although still small in relative terms).
>
> Did you try running the same again? Are the numbers essentially the same or 
> are multiple runs all over the shop?
        1.  Yes，I test several times，I always get the same results, have little 
different
    2. I think the stddev is small relative to our test results, IMX8MM_EVK_64 
also have a big stddev(500 cycles, 3core) from the data.


-----邮件原件-----
发件人: [email protected] [mailto:[email protected]]
发送时间: 2021年12月7日 11:45
收件人: [email protected]
主题: Devel Digest, Vol 127, Issue 5

Send Devel mailing list submissions to
        [email protected]

To subscribe or unsubscribe via email, send a message with subject or body 
'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific than "Re: 
Contents of Devel digest..."

Today's Topics:

   1. Re: Use TimeServer by Group Components Questions (Kent Mcleod)
   2. Re: Incorporating seL4 into your project, risc-v architecture
      (Kent Mcleod)
   3. Re: some performance problem when test 4 cores SMP benchmark of seL4bench 
project 答复: Devel Digest, Vol 127, Issue 1
      (Kent Mcleod)
   4. Re: Use TimeServer by Group Components Questions
      ([email protected])
   5. Re: Use TimeServer by Group Components Questions (Kent Mcleod)
   6. Re: some performance problem when test 4 cores SMP benchmark of seL4bench 
project 答复: Devel Digest, Vol 127, Issue 1
      (Gernot Heiser)


----------------------------------------------------------------------

Message: 1
Date: Tue, 7 Dec 2021 13:03:00 +1100
From: Kent Mcleod <[email protected]>
Subject: [seL4] Re: Use TimeServer by Group Components Questions
To: [email protected]
Cc: devel <[email protected]>
Message-ID:
        <CA+-ozWeJgZm3B902EKBU6dUGxy6wU3Cp92jEUYwRKNFcVS=7...@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Fri, Dec 3, 2021 at 2:51 PM <[email protected]> wrote:
>
> I got an problem when I use TimeServer with Group Components. I wish more 
> than one component in the same Group Components, that can use TimeServer. 
> While the cdl only have one ep for all components in the Group to communicate 
> to TimeServer with an badge value that not equal to zero. And TimeServer must 
> to distinguish requester by the badge, so only one component can receive 
> TimeServer's response. How can I resolve this problem?

The TimeServer component cannot be used as part of a Group component.
The Group component mechanism isn't designed to transparently work for any 
component and the Connection types that the TimeServer uses are not compatible 
with the Group mechanism.



> _______________________________________________
> Devel mailing list -- [email protected] To unsubscribe send an email 
> to [email protected]

------------------------------

Message: 2
Date: Tue, 7 Dec 2021 13:14:30 +1100
From: Kent Mcleod <[email protected]>
Subject: [seL4] Re: Incorporating seL4 into your project, risc-v
        architecture
To: [email protected]
Cc: devel <[email protected]>
Message-ID:
        <ca+-ozweu7bi-byxhsvo9jnq_tcn90u7qra8_hh39p2rvguc...@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Fri, Dec 3, 2021 at 8:14 AM <[email protected]> wrote:
>
> Dear seL4 devs and community.
>
> I try to create a simple project using the seL4 micro kernel. For setting up 
> my environment I followed this (https://github.com/manu88/SeL4_101) and this 
> (https://docs.sel4.systems/projects/buildsystem/incorporating.html) article.  
> As kernel I'm using the last release of SeL4 
> (https://github.com/seL4/seL4/releases/tag/12.1.0). I created some kind of 
> user application, like Manu88 did with a simple Hello World C-Program. While 
> building with the delivered init-build.sh script I got a lot of repeating 
> errors like this:
> CMake Error at tools/cmake-tool/helpers/simulation.cmake:179 
> (add_custom_command):
>   Error evaluating generator expression:
>
>     $<TARGET_PROPERTY:rootserver_image,KERNEL_IMAGE_NAME>
>
>   Target "rootserver_image" not found.
> Call Stack (most recent call first):
>   CMakeLists.txt:22 (GenerateSimulateScript) While going through 
> Manu88s guide I saw that he also got this errors. But after adding a 
> CMakeList.txt file inside his user code, and declaring it as rootserver 
> inside the CMakeList.txt file (I just copied the contents from here: 
> https://github.com/manu88/SeL4_101/blob/master/projects/Hello/CMakeLists.txt) 
> he got rid of the errors. I still get them and searching through the internet 
> could not find a solution for me. Bellow I added the folder structure of my 
> project.
> Project/
> ├── kernel/
> ├── projects/
> │       ├── Hello/
> │       ├── musllibc/
> │       ├── utils_libs/
> │       ├── seL4_libs/
> ├── build/
> ├── tools/
> │   └── cmake-tool/
> │   └── elfloader-tool/
> ├── init-build.sh   -> tools/cmake-tool/init-build.sh
> ├── CMakeLists.txt
> ├── application_settings.cmake -> 
> tools/cmake-tool/helpers/application_settings.cmake
> ├── settings.cmake
>
> So I cannot tell where the issue lies. Any pointers would be appreciated.

If you are using
https://github.com/manu88/SeL4_101/blob/master/projects/Hello/CMakeLists.txt
inside projects/Hello/ then you need to make sure that the top level CMakeLists 
inside Project/ is either a symlink to tools/cmake-tool/default-CMakeLists.txt 
or is similar to https://github.com/manu88/SeL4_101/blob/master/CMakeLists.txt. 
 With this file layout, ../init-build.sh will end up calling the following 
CMake invocation: `cmake -DCMAKE_TOOLCHAIN_FILE=../kernel/gcc.cmake -G Ninja 
-DSEL4_CACHE_DIR=../.sel4_cache -C "../settings.cmake" "../"`.
What is the contents of your settings.cmake file?

>
> Kind regards,
> Sophia
> _______________________________________________
> Devel mailing list -- [email protected] To unsubscribe send an email 
> to [email protected]

------------------------------

Message: 3
Date: Tue, 7 Dec 2021 13:43:48 +1100
From: Kent Mcleod <[email protected]>
Subject: [seL4] Re: some performance problem when test 4 cores SMP
        benchmark of seL4bench project 答复: Devel Digest, Vol 127, Issue 1
To: Gernot Heiser <[email protected]>
Cc: "[email protected]" <[email protected]>
Message-ID:
        <CA+-ozWcihN6my4WOyPbpCcfp3=rksysnnegze2kc6c-bv4k...@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Dec 2, 2021 at 9:28 PM Gernot Heiser <[email protected]> wrote:
>
>
>
> > On 2 Dec 2021, at 16:17, yadong.li <[email protected]> wrote:
> >
> > First, I got the data of IMX8MM_EVK_64 and TX2 from 
> > https://github.com/seL4/sel4bench/actions/runs/1469475721#artifacts, 
> > the sel4bench-results-imx8mm_evk file and sel4bench-results-tx2 file, 
> > unpack the file out, I find xxxx_SMP_64.json Secondly, the test is the smp 
> > benchmark form sel4bench-manifest project, the source file is 
> > sel4bench/apps/smp/src/main.c The test scenario look like below:
> >   A pair thread of ping-pong on the same core, the ping thread will 
> > wait for "ipc_normal_delay" time then send 0 len ipc message to pong 
> > thread, then return. I think the 500 cycles mean how long 
> > ipc_normal_delay will really delay
>
> >   The above scenario will test on one core, or mutil core. If we run 4 
> > cores, every core will have a ping thread and a pong thread run like above 
> > description, then record the sum of all cores ping-pong counts.
>
> ok, but what is the metric reported? [Apologies for not being on top 
> of the details of our benchmarking setups.]

Looking at the sel4bench smp benchmark implementation, the metric is the total 
number of "operations" in a single second.  An operation is a round trip intra 
address space seL4_Call + seL4_ReplyRecv between 2 threads on the same core 
with each thread delaying for the cycle count before performing the next 
operation.  After 1 second of all cores performing these operations 
continuously and maintaining a core-local (on a separate cache line) count, the 
total number of operations is added together and reported as the final number. 
So you would expect that the reported metric would scale following Amdahl's law 
based on the proportion of an operation that is serialized inside the kernel 
lock which would potentially vary across platforms.


>
> It cannot simply be throughput, else the doubled delay should be reflected in 
> a significantly reduced throughput, but it has almost no effect.
>
> >   I think this experiment is used to illustrate in multi core, our seL4 
> > kernel big lock will not affect mutli-core performance, am I right ?
>
> Not quite. As there’s only one big lock, only one core can execute the kernel 
> at any time. If one core is in the IPC while another core is trying to IPC, 
> even though both IPCs are core-local, the second will have to wait until the 
> first gets out of the lock.
>
> As the delay is higher than the syscall latency, you’d expect perfect 
> scalability from one core to two (with the lock essentially synchronising the 
> threads). For 3 or 4  cores the combined latency of two IPCs is larger than 
> the 500cy delay and you expect lock contention, resulting in reduced scaling, 
> while it should still scale almost perfectly with the 1000cy delay. This is 
> exactly what you see for the i.MX8 and the TX2.
>
> > Addition：
> >   Our seL4_Call performance is same with other platform
> >                      XXXX             IMX8MM_EVK_64     TX2_64
> >   seL4_Call           367(0)             378(2)              492(16)        
> >   client->server, same vspace, ipc_len is 0
> >   seL4_ReplyRecv     396(0)             402(2)               513(16)        
> >   server->client, same vspace, ipc_len is 0
>
> OK, so baseline performance is good. But are these measured on a single-core 
> or SMP kernel (i.e. is locking included)?
>
> > My test results below:
> >                                                                       ARM 
> > platform
> >       Test item                         XXX                  IMX8MM_EVK_64  
> >         TX2
> >                                            mean(Stddev)
> >      500 cycles, 1 core        636545(46)            625605(29)            
> > 598142(365)
> >      500 cycles, 2 cores      897900(2327)      1154209(44)           
> > 994298(94)
> >      500 cycles, 3 cores     1301679(2036)      1726043(65)        
> > 1497740(127)
> >      500 cycles, 4 cores     1387678(549)       2172109(12674)   
> > 1545872(109)
> >     1000 cycles, 1 core       636529(42)            625599(22)            
> > 597627(161)
> >     1000 cycles, 2 cores      899212(3384)      1134110(34)          
> > 994437(541)
> >     1000 cycles, 3 cores     1297322(5028)      1695385(45)        
> > 1497547(714)
> >     1000 cycles, 4 cores     1387149(456)        2174605(81)        
> > 1545716(614)
>
> I notice your standard deviations for 2 and 3 cores are surprisingly high 
> (although still small in relative terms).
>
> Did you try running the same again? Are the numbers essentially the same or 
> are multiple runs all over the shop?
>
> There are some issues with our benchmarking methodology. Fixing up sel4bench 
> is one of the projects I’d like to do if I got a student for it, or maybe 
> someone from the community would want to help?
>
> But just from looking at the data I’m not sure that’s the issue here.
>
> Gernot
> _______________________________________________
> Devel mailing list -- [email protected] To unsubscribe send an email 
> to [email protected]

------------------------------

Message: 4
Date: Tue, 07 Dec 2021 03:06:38 -0000
From: [email protected]
Subject: [seL4] Re: Use TimeServer by Group Components Questions
To: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="utf-8"

Thank you very much for answering my question. I still have a doubt about the 
TimeServer. I can manually modify the cdl file and then I get the the expected 
source file(capdl_spec.c) from using capdl-parse. I also need to change the 
return value of timeserver <IF>_notification API. I have try and I can run the 
program as I expected. Is there any way to create adapted cdl file automatic by 
modifying the ADL parse project? 
I am looking forward to getting your answer again. Thank you very much.

my ADL file:
import <std_connector.camkes>;                             
import <global-connectors.camkes>;
import <TimeServer/TimeServer.camkes>;

component Client {
    control;
    uses Timer timeout;
}

assembly {
    composition {
        group grp {
            component Client c1;
            component Client c2;
        }
        component TimeServer time_server;

        connection seL4TimeServer ts1(
            from grp.c1.timeout, to time_server.the_timer);
        connection seL4TimeServer ts2(
            from grp.c2.timeout, to time_server.the_timer);
    }

    configuration {
        time_server.timers_per_client = 1;
    }
}

------------------------------

Message: 5
Date: Tue, 7 Dec 2021 14:23:54 +1100
From: Kent Mcleod <[email protected]>
Subject: [seL4] Re: Use TimeServer by Group Components Questions
To: [email protected]
Cc: devel <[email protected]>
Message-ID:
        <ca+-ozwestyqam4wy+x359clzlf13pzdcrbnfsfpy-3nyo9y...@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Tue, Dec 7, 2021 at 2:09 PM <[email protected]> wrote:
>
> Thank you very much for answering my question. I still have a doubt about the 
> TimeServer. I can manually modify the cdl file and then I get the the 
> expected source file(capdl_spec.c) from using capdl-parse. I also need to 
> change the return value of timeserver <IF>_notification API. I have try and I 
> can run the program as I expected. Is there any way to create adapted cdl 
> file automatic by modifying the ADL parse project?

There isn't a supported way to modify the generated cdl file as CAmkES expects 
to generate a full system specification.

What is your motivation for trying to make the timeserver a group component 
instead of a separate component? As part of a group component it will still 
have its own kernel objects for its threads, code, data and heap memories so 
you still have the costs of isolation but without the benefits because it still 
shares the virtual address space and capability space with other components in 
the same group.


> I am looking forward to getting your answer again. Thank you very much.
>
> my ADL file:
> import <std_connector.camkes>;
> import <global-connectors.camkes>;
> import <TimeServer/TimeServer.camkes>;
>
> component Client {
>     control;
>     uses Timer timeout;
> }
>
> assembly {
>     composition {
>         group grp {
>             component Client c1;
>             component Client c2;
>         }
>         component TimeServer time_server;
>
>         connection seL4TimeServer ts1(
>             from grp.c1.timeout, to time_server.the_timer);
>         connection seL4TimeServer ts2(
>             from grp.c2.timeout, to time_server.the_timer);
>     }
>
>     configuration {
>         time_server.timers_per_client = 1;
>     }
> }
> _______________________________________________
> Devel mailing list -- [email protected] To unsubscribe send an email 
> to [email protected]

------------------------------

Message: 6
Date: Tue, 7 Dec 2021 03:45:09 +0000
From: Gernot Heiser <[email protected]>
Subject: [seL4] Re: some performance problem when test 4 cores SMP
        benchmark of seL4bench project 答复: Devel Digest, Vol 127, Issue 1
To: "[email protected]" <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"

On 7 Dec 2021, at 13:43, Kent Mcleod <[email protected]> wrote:
> 
> Looking at the sel4bench smp benchmark implementation, the metric is 
> the total number of "operations" in a single second.  An operation is 
> a round trip intra address space seL4_Call + seL4_ReplyRecv between 2 
> threads on the same core with each thread delaying for the cycle count 
> before performing the next operation.  After 1 second of all cores 
> performing these operations continuously and maintaining a core-local 
> (on a separate cache line) count, the total number of operations is 
> added together and reported as the final number. So you would expect 
> that the reported metric would scale following Amdahl's law based on 
> the proportion of an operation that is serialized inside the kernel 
> lock which would potentially vary across platforms.

Thanks for the explanation, Kent. 

Observations:

1) The metric is essentially independent of the delay. Looking at the 
single-core figures for the i/MX8, I get 1598.5 ns in both cases, the 
difference being 15ps. Doesn’t make sense to me.

2) Assuming this processor runs at the 1.8GHz it seems speced for, this 
corresponds to 2877 cycles, which is huge, even if the 1000cy delay is 
subtracted!

3) As I said before, intra-AS IPC is a meaningless metric we should never use 
(but that’s incidental to the particular thing we want to measure here).

4) Having to do these calculations to understand the numbers is a sure 
indication that the results are presented in an unsuitable form.

I can’t see how these figures make sense.

Gernot

------------------------------

Subject: Digest Footer

_______________________________________________
Devel mailing list -- [email protected] To unsubscribe send an email to 
[email protected]


------------------------------

End of Devel Digest, Vol 127, Issue 5
*************************************
_______________________________________________
Devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]

[seL4] Re: some performance problem when test 4 cores SMP答复: Devel Digest, Vol 127, Issue 5

Reply via email to