from:"Reinette Chatre"

Re: [PATCH RESEND v3 1/2] selftests/resctrl: Fix schemata write error check

2023-09-14 Thread Reinette Chatre

Hi Maciej,

On 9/13/2023 11:01 PM, Maciej Wieczór-Retman wrote:
> On 2023-09-13 at 11:49:19 -0700, Reinette Chatre wrote:
>> On 9/12/2023 10:59 PM, Maciej Wieczór-Retman wrote:
>>> On 2023-09-12 at 09:00:28 -0700, Reinette Chatre wrote:
>>>> On 9/11/2023 11:32 PM, Maciej Wieczór-Retman wrote:
>>>>> On 2023-09-11 at 09:59:06 -0700, Reinette Chatre wrote:
>>>>>> Hi Maciej,
>>>>>> When I build the tests with this applied I encounter the following:
>>>>>>
>>>>>> resctrlfs.c: In function ‘write_schemata’:
>>>>>> resctrlfs.c:475:14: warning: implicit declaration of function ‘open’; 
>>>>>> did you mean ‘popen’? [-Wimplicit-function-declaration]
>>>>>>  475 | fd = open(controlgroup, O_WRONLY);
>>>>>>  |  ^~~~
>>>>>>  |  popen
>>>>>> resctrlfs.c:475:33: error: ‘O_WRONLY’ undeclared (first use in this 
>>>>>> function)
>>>>>>  475 | fd = open(controlgroup, O_WRONLY);
>>>>>>  | ^~~~
>>>>>> resctrlfs.c:475:33: note: each undeclared identifier is reported only 
>>>>>> once for each function it appears in
>>>>>
>>>>> Hmm, that's odd. How do you build the tests?
>>>>
>>>> I applied this series on top of kselftest repo's "next" branch.
>>>>
>>>> I use a separate build directory and first ran "make headers". After that,
>>>> $ make O= -C tools/testing/selftests/resctrl
>>>
>>> I do the same, just without the build directory, but that shouldn't
>>> matter here I guess.
>>>
>>>>> I use "make -C tools/testing/selftests/resctrl" while in the root kernel
>>>>> source directory. I tried to get the same error you experienced by
>>>>> compiling some dummy test program with "open" and "O_WRONLY". From the
>>>>> experiment I found that the "resctrl.h" header provides the declarations
>>>>> that are causing your errors.
>>>>
>>> >From what I can tell resctrl.h does not include fcntl.h that provides
>>>> what is needed.
>>>
>>> I found out you can run "gcc -M " and it will recursively tell you
>>> what headers are including other headers.
>>>
>>> Using this I found that "resctrl.h" includes  which in turn
>>> includes  out of /usr/include/sys directory. Is that also the
>>> case on your system?
>>>
>>
>> No. The test system I used is running glibc 2.35 and it seems that including
>> fcntl.h was added to sys/mount.h in 2.36. See glibc commit
>> 78a408ee7ba0 ("linux: Add open_tree")
>>
>> Generally we should avoid indirect inclusions and here I think certainly so
>> since it cannot be guaranteed that fcntl.h would be available via 
>> sys/mount.h.
> 
> Okay, would including the fcntl.h header to resctrl.h be okay in this
> case? Or is there some other more sophisticated way of doing that (some
> include guard or checking glibc version for example)?

Ideally fcntl.h would be included in the file it is used. Doing so you may
encounter the same problems as Ilpo in [1]. If that is the case and that fix 
works
for you then you may want to have this series depend on Ilpo's work.

Reinette

[1] 
https://lore.kernel.org/lkml/dfc53e-3f92-82e4-6af-d1a28e8c1...@linux.intel.com/

Re: [PATCH 1/5] selftests/resctrl: Extend signal handler coverage to unmount on receiving signal

2023-09-14 Thread Reinette Chatre

Hi Ilpo,

On 9/14/2023 10:05 AM, Ilpo Järvinen wrote:
> On Thu, 14 Sep 2023, Reinette Chatre wrote:
>> On 9/14/2023 3:16 AM, Ilpo Järvinen wrote:
>>> On Wed, 13 Sep 2023, Reinette Chatre wrote:
>>>> On 9/13/2023 3:01 AM, Ilpo Järvinen wrote:
>>>>> On Tue, 12 Sep 2023, Reinette Chatre wrote:
>>>>>> On 9/11/2023 4:19 AM, Ilpo Järvinen wrote:
>>>>>>> Unmounting resctrl FS has been moved into the per test functions in
>>>>>>> resctrl_tests.c by commit caddc0fbe495 ("selftests/resctrl: Move
>>>>>>> resctrl FS mount/umount to higher level"). In case a signal (SIGINT,
>>>>>>> SIGTERM, or SIGHUP) is received, the running selftest is aborted by
>>>>>>> ctrlc_handler() which then unmounts resctrl fs before exiting. The
>>>>>>> current section between signal_handler_register() and
>>>>>>> signal_handler_unregister(), however, does not cover the entire
>>>>>>> duration when resctrl FS is mounted.
>>>>>>>
>>>>>>> Move signal_handler_register() and signal_handler_unregister() call
>>>>>>> into the test functions in resctrl_tests.c to properly unmount resctrl
>>>>>>> fs. Adjust child process kill() call in ctrlc_handler() to only be
>>>>>>> invoked if the child was already forked.
>>>>>>
>>>>>> Thank you for catching this.
>>>>>>
>>>>>>>
>>>>>>> Fixes: caddc0fbe495 ("selftests/resctrl: Move resctrl FS mount/umount 
>>>>>>> to higher level")
>>>>>>> Signed-off-by: Ilpo Järvinen 
>>>>>>> Cc: 
>>>>>>> ---
>>>>>>>  tools/testing/selftests/resctrl/cat_test.c|  8 ---
>>>>>>>  .../testing/selftests/resctrl/resctrl_tests.c | 24 +++
>>>>>>>  tools/testing/selftests/resctrl/resctrl_val.c | 22 -
>>>>>>>  3 files changed, 34 insertions(+), 20 deletions(-)
>>>>>>>
>>>>>>> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
>>>>>>> b/tools/testing/selftests/resctrl/cat_test.c
>>>>>>> index 97b87285ab2a..224ba8544d8a 100644
>>>>>>> --- a/tools/testing/selftests/resctrl/cat_test.c
>>>>>>> +++ b/tools/testing/selftests/resctrl/cat_test.c
>>>>>>> @@ -167,12 +167,6 @@ int cat_perf_miss_val(int cpu_no, int n, char 
>>>>>>> *cache_type)
>>>>>>> strcpy(param.filename, RESULT_FILE_NAME1);
>>>>>>> param.num_of_runs = 0;
>>>>>>> param.cpu_no = sibling_cpu_no;
>>>>>>> -   } else {
>>>>>>> -   ret = signal_handler_register();
>>>>>>> -   if (ret) {
>>>>>>> -   kill(bm_pid, SIGKILL);
>>>>>>> -   goto out;
>>>>>>> -   }
>>>>>>> }
>>>>>>>  
>>>>>>> remove(param.filename);
>>>>>>> @@ -209,10 +203,8 @@ int cat_perf_miss_val(int cpu_no, int n, char 
>>>>>>> *cache_type)
>>>>>>> }
>>>>>>> close(pipefd[0]);
>>>>>>> kill(bm_pid, SIGKILL);
>>>>>>> -   signal_handler_unregister();
>>>>>>> }
>>>>>>>  
>>>>>>> -out:
>>>>>>> cat_test_cleanup();
>>>>>>>  
>>>>>>> return ret;
>>>>>>> diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c 
>>>>>>> b/tools/testing/selftests/resctrl/resctrl_tests.c
>>>>>>> index 823672a20a43..3d66fbdc2df3 100644
>>>>>>> --- a/tools/testing/selftests/resctrl/resctrl_tests.c
>>>>>>> +++ b/tools/testing/selftests/resctrl/resctrl_tests.c
>>>>>>> @@ -73,8 +73,13 @@ static void run_mbm_test(const char * const 
>>>>>>> *benchmark_cmd, int cpu_no)
>>>>>>>  
>>>>>>> ksft_print_msg("Starting MBM BW change ...\n");
>>>>>>>  
>>>>>>> +   res = signal_

Re: [PATCH v5 0/8] selftests/resctrl: Rework benchmark command handling

2023-09-26 Thread Reinette Chatre

Hi Shuah,

Could you please consider this series for inclusion?

Thank you very much

Reinette

On 9/4/2023 2:53 AM, Ilpo Järvinen wrote:
> The benchmark command handling (-b) in resctrl selftests is overly
> complicated code. This series turns the benchmark command immutable to
> preserve it for all selftests and improves benchmark command related
> error handling.
> 
> This series also ends up removing the strcpy() calls which were pointed
> out earlier.
> 
> v5:
> - Fix another off-by-one error
> - Reorder local var declarations in main() to follow rev. xmas tree
> 
> v4:
> - Correct off-by-one error in -b processing
> - Reordered code in main() to make freeing span_str simpler (in new patch)
> - Use consistent style for const char * const *
> 
> v3:
> - Removed DEFAULT_SPAN_STR for real and the duplicated copy of defines
>   that made to v2 likely due to my incorrect conflict resolutions
> 
> v2:
> - Added argument length check into patch 1/7
> - Updated also -b line in help message.
> - Document -b argument related "algorithm"
> - Use asprintf() to convert defined constant int to string
> - Improved changelog texts
> - Added \n to ksft_exit_fail_msg() call messages.
> - Print DEFAULT_SPAN with %u instead of %zu to avoid need to cast it
> 
> Ilpo Järvinen (8):
>   selftests/resctrl: Ensure the benchmark commands fits to its array
>   selftests/resctrl: Correct benchmark command help
>   selftests/resctrl: Remove bw_report and bm_type from main()
>   selftests/resctrl: Simplify span lifetime
>   selftests/resctrl: Reorder resctrl FS prep code and benchmark_cmd init
>   selftests/resctrl: Make benchmark command const and build it with
> pointers
>   selftests/resctrl: Remove ben_count variable
>   selftests/resctrl: Cleanup benchmark argument parsing
> 
>  tools/testing/selftests/resctrl/cache.c   |   5 +-
>  tools/testing/selftests/resctrl/cat_test.c|  13 +--
>  tools/testing/selftests/resctrl/cmt_test.c|  34 --
>  tools/testing/selftests/resctrl/mba_test.c|   4 +-
>  tools/testing/selftests/resctrl/mbm_test.c|   7 +-
>  tools/testing/selftests/resctrl/resctrl.h |  16 +--
>  .../testing/selftests/resctrl/resctrl_tests.c | 100 --
>  tools/testing/selftests/resctrl/resctrl_val.c |  10 +-
>  8 files changed, 104 insertions(+), 85 deletions(-)
>

Re: [PATCH v2 5/6] selftests/resctrl: Fix feature checks

2023-09-26 Thread Reinette Chatre

Hi Ilpo,

On 9/15/2023 8:44 AM, Ilpo Järvinen wrote:
> The MBA and CMT tests expect support of other features to be able to
> run.
> 
> When platform only supports MBA but not MBM, MBA test will fail with:
> Failed to open total bw file: No such file or directory
> 
> When platform only supports CMT but not CAT, CMT test will fail with:
> Failed to open bit mask file '/sys/fs/resctrl/info/L3/cbm_mask': No such file 
> or directory
> 
> Extend feature checks to cover these two conditions.

It may be helpful to expand on this more to justify why this is
stable material. At this point it sounds as though (from user space
perspective) one error is replaced by another. Perhaps just a snippet
that states that this fix will cause the test to be skipped on these
platforms.

> 
> Fixes: ee0415681eb6 ("selftests/resctrl: Use resctrl/info for feature 
> detection")
> Signed-off-by: Ilpo Järvinen 
> Cc:  # selftests/resctrl: Refactor feature check to 
> use resource and feature name

Thank you.

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 4/6] selftests/resctrl: Refactor feature check to use resource and feature name

2023-09-26 Thread Reinette Chatre

Hi Ilpo,

On 9/15/2023 8:44 AM, Ilpo Järvinen wrote:
> Feature check in validate_resctrl_feature_request() takes in the test
> name string and maps that to what to check per test.
> 
> Pass resource and feature names to validate_resctrl_feature_request()
> directly rather than deriving them from the test name inside the
> function which makes the feature check easier to extend for new test
> cases.
> 
> Use !! in the return statement to make the boolean conversion more
> obvious even if it is not strictly necessary from correctness point of
> view (to avoid it looking like the function is returning a freed
> pointer).
> 
> Signed-off-by: Ilpo Järvinen 
> Cc:  # selftests/resctrl: Remove duplicate feature 
> check from CMT test
> Cc:  # selftests/resctrl: Move _GNU_SOURCE define 
> into Makefile

Thank you.

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 6/6] selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests

2023-09-26 Thread Reinette Chatre

Hi Ilpo,

On 9/15/2023 8:44 AM, Ilpo Järvinen wrote:
> The initial value of 5% chosen for the maximum allowed percentage
> difference between resctrl mbm value and IMC mbm value in commit
> 06bd03a57f8c ("selftests/resctrl: Fix MBA/MBM results reporting
> format") was "randomly chosen value" (as admitted by the changelog).
> 
> When running tests in our lab across a large number platforms, 5%
> difference upper bound for success seems a bit on the low side for the
> MBA and MBM tests. Some platforms produce outliers that are slightly
> above that, typically 6-7%, which leads MBA/MBM test frequently
> failing.
> 
> Replace the "randomly chosen value" with a success bound that is based
> on those measurements across large number of platforms by relaxing the
> MBA/MBM success bound to 8%. The relaxed bound removes the failures due
> the frequent outliers.
> 
> Fixes: 06bd03a57f8c ("selftests/resctrl: Fix MBA/MBM results reporting 
> format")
> Cc: 
> Signed-off-by: Ilpo Järvinen 

Thank you.

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 2/6] selftests/resctrl: Remove duplicate feature check from CMT test

2023-09-26 Thread Reinette Chatre

Hi Ilpo,

On 9/15/2023 8:44 AM, Ilpo Järvinen wrote:
> The test runner run_cmt_test() in resctrl_tests.c checks for CMT
> feature and does not run cmt_resctrl_val() if CMT is not supported.
> Then cmt_resctrl_val() also check is CMT is supported.
> 
> Remove the duplicated feature check for CMT from cmt_resctrl_val().
> 
> Signed-off-by: Ilpo Järvinen 
> Cc: 

Thank you.

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 1/6] selftests/resctrl: Extend signal handler coverage to unmount on receiving signal

2023-09-26 Thread Reinette Chatre

Hi Ilpo,

On 9/15/2023 8:44 AM, Ilpo Järvinen wrote:
> Unmounting resctrl FS has been moved into the per test functions in
> resctrl_tests.c by commit caddc0fbe495 ("selftests/resctrl: Move
> resctrl FS mount/umount to higher level"). In case a signal (SIGINT,
> SIGTERM, or SIGHUP) is received, the running selftest is aborted by
> ctrlc_handler() which then unmounts resctrl fs before exiting. The
> current section between signal_handler_register() and
> signal_handler_unregister(), however, does not cover the entire
> duration when resctrl FS is mounted.
> 
> Move signal_handler_register() and signal_handler_unregister() calls
> from per test files into resctrl_tests.c to properly unmount resctrl
> fs. In order to not add signal_handler_register()/unregister() n times,
> create helpers test_prepare() and test_cleanup().
> 
> Adjust child process kill() call in ctrlc_handler() to only be invoked
> if the child was already forked.
> 
> Fixes: caddc0fbe495 ("selftests/resctrl: Move resctrl FS mount/umount to 
> higher level")
> Signed-off-by: Ilpo Järvinen 
> Cc: 
> ---
>  tools/testing/selftests/resctrl/cat_test.c|  8 ---
>  .../testing/selftests/resctrl/resctrl_tests.c | 65 +++
>  tools/testing/selftests/resctrl/resctrl_val.c | 22 +++
>  3 files changed, 48 insertions(+), 47 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
> b/tools/testing/selftests/resctrl/cat_test.c
> index 97b87285ab2a..224ba8544d8a 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -167,12 +167,6 @@ int cat_perf_miss_val(int cpu_no, int n, char 
> *cache_type)
>   strcpy(param.filename, RESULT_FILE_NAME1);
>   param.num_of_runs = 0;
>   param.cpu_no = sibling_cpu_no;
> - } else {
> - ret = signal_handler_register();
> - if (ret) {
> - kill(bm_pid, SIGKILL);
> - goto out;
> - }
>   }
>  
>   remove(param.filename);
> @@ -209,10 +203,8 @@ int cat_perf_miss_val(int cpu_no, int n, char 
> *cache_type)
>   }
>   close(pipefd[0]);
>   kill(bm_pid, SIGKILL);
> - signal_handler_unregister();
>   }
>  
> -out:
>   cat_test_cleanup();
>  
>   return ret;
> diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c 
> b/tools/testing/selftests/resctrl/resctrl_tests.c
> index 823672a20a43..524ba83d7568 100644
> --- a/tools/testing/selftests/resctrl/resctrl_tests.c
> +++ b/tools/testing/selftests/resctrl/resctrl_tests.c
> @@ -67,21 +67,41 @@ void tests_cleanup(void)
>   cat_test_cleanup();
>  }
>  
> -static void run_mbm_test(const char * const *benchmark_cmd, int cpu_no)
> +static int test_prepare()
>  {
>   int res;
>  
> - ksft_print_msg("Starting MBM BW change ...\n");
> + res = signal_handler_register();
> + if (res)
> + return res;
>  
>   res = mount_resctrlfs();
>   if (res) {
> + signal_handler_unregister();
>   ksft_exit_fail_msg("Failed to mount resctrl FS\n");
> - return;
> + return res;
>   }
> + return 0;
> +}
> +
> +static void test_cleanup()
> +{
> + umount_resctrlfs();
> + signal_handler_unregister();
> +}

Thank you for adding these.

> +
> +static void run_mbm_test(const char * const *benchmark_cmd, int cpu_no)
> +{
> + int res;
> +
> + ksft_print_msg("Starting MBM BW change ...\n");
> +
> + if (test_prepare())
> + return;
>  

I am not sure about this. With this exit the kselftest machinery is not
aware of the test passing or failing. I wonder if there should not rather
be a "goto" here that triggers ksft_test_result()? This needs some more
thought though. First, with this change test_prepare() officially gains
responsibility to determine if a failure is transient (just a single test
fails) or permanent (no use trying any other tests if this fails). For
the former it would then be up to the caller to call ksft_test_result()
and for the latter test_prepare() will call ksft_exit_fail_msg().
Second, that SNC warning may be an inconvenience with a new goto. Here
it may be ok to print that message before the test failure?

>   if (!validate_resctrl_feature_request(MBM_STR) || (get_vendor() != 
> ARCH_INTEL)) {
>   ksft_test_result_skip("Hardware does not support MBM or MBM is 
> disabled\n");
> - goto umount;
> + goto cleanup;
>   }
>  
>   res = mbm_bw_change(cpu_no, benchmark_cmd);
> @@ -89,8 +109,8 @@ static void run_mbm_test(const char * const 
> *benchmark_cmd, int cpu_no)
>   if ((get_vendor() == ARCH_INTEL) && res)
>   ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA 
> Clustering is enabled. Check BIOS configuration.\n");
>  
> -umount:
> - umount_resctrlfs();
> +cleanup:
> + test_cleanup();
>  }
>  
>  static void run_mba_test

Re: [PATCH v2 3/6] selftests/resctrl: Move _GNU_SOURCE define into Makefile

2023-09-26 Thread Reinette Chatre

Hi Ilpo,

On 9/15/2023 8:44 AM, Ilpo Järvinen wrote:
> Currently, _GNU_SOURCE is defined in resctrl.h. Defining _GNU_SOURCE

nitpick: If you do submit a new version, could you please drop the
"Currently".

> has a large impact on what gets defined when including headers either
> before or after it. This can result in compile failures if .c file
> decides to include a standard header file before resctrl.h.
> 
> It is safer to define _GNU_SOURCE in Makefile so it is always defined
> regardless of in which order includes are done.
> 
> Signed-off-by: Ilpo Järvinen 
> Cc: 

Thank you.

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v4 1/2] selftests/resctrl: Fix schemata write error check

2023-09-27 Thread Reinette Chatre

Hi Maciej,

On 9/22/2023 1:10 AM, Maciej Wieczor-Retman wrote:
> Writing bitmasks to the schemata can fail when the bitmask doesn't
> adhere to constraints defined by what a particular CPU supports.
> Some example of constraints are max length or having contiguous bits.
> The driver should properly return errors when any rule concerning
> bitmask format is broken.
> 
> Resctrl FS returns error codes from fprintf() only when fclose() is
> called. Current error checking scheme allows invalid bitmasks to be
> written into schemata file and the selftest doesn't notice because the
> fclose() error code isn't checked.
> 
> Substitute fopen(), flose() and fprintf() with open(), close() and
> write() to avoid error code buffering between fprintf() and fclose().
> 
> Remove newline character from the schema string after writing it to
> the schemata file so it prints correctly before function return.
> 
> Pass the string generated with strerror() to the "reason" buffer so
> the error message is more verbose. Extend "reason" buffer so it can hold
> longer messages.
> 
> Signed-off-by: Maciej Wieczor-Retman 
> ---
> Changelog v4:
> - Unify error checking between open() and write(). (Reinette)
> - Add fcntl.h for glibc backward compatiblitiy. (Reinette)
> 
> Changelog v3:
> - Rename fp to fd. (Ilpo)
> - Remove strlen, strcspn and just use the snprintf value instead. (Ilpo)
> 
> Changelog v2:
> - Rewrite patch message.
> - Double "reason" buffer size to fit longer error explanation.
> - Redo file interactions with syscalls instead of stdio functions.
> 
>  tools/testing/selftests/resctrl/resctrlfs.c | 30 -
>  1 file changed, 17 insertions(+), 13 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c 
> b/tools/testing/selftests/resctrl/resctrlfs.c
> index 3a8111362d26..edc8fc6e44b0 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -8,6 +8,7 @@
>   *Sai Praneeth Prakhya ,
>   *Fenghua Yu 
>   */
> +#include 
>  #include 
>  
>  #include "resctrl.h"
> @@ -490,9 +491,8 @@ int write_bm_pid_to_resctrl(pid_t bm_pid, char *ctrlgrp, 
> char *mongrp,
>   */
>  int write_schemata(char *ctrlgrp, char *schemata, int cpu_no, char 
> *resctrl_val)
>  {
> - char controlgroup[1024], schema[1024], reason[64];
> - int resource_id, ret = 0;
> - FILE *fp;
> + char controlgroup[1024], schema[1024], reason[128];
> + int resource_id, fd, schema_len = -1, ret = 0;

I am trying to understand the schema_len initialization. Could
you please elaborate why you chose -1? I'm a bit concerned with
the robustness here with it being used as an unsigned integer
in write() and also the negative array index later.

>  
>   if (strncmp(resctrl_val, MBA_STR, sizeof(MBA_STR)) &&
>   strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR)) &&
> @@ -520,27 +520,31 @@ int write_schemata(char *ctrlgrp, char *schemata, int 
> cpu_no, char *resctrl_val)
>  
>   if (!strncmp(resctrl_val, CAT_STR, sizeof(CAT_STR)) ||
>   !strncmp(resctrl_val, CMT_STR, sizeof(CMT_STR)))
> - sprintf(schema, "%s%d%c%s", "L3:", resource_id, '=', schemata);
> + schema_len = snprintf(schema, sizeof(schema), "%s%d%c%s\n",
> +   "L3:", resource_id, '=', schemata);
>   if (!strncmp(resctrl_val, MBA_STR, sizeof(MBA_STR)) ||
>   !strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR)))
> - sprintf(schema, "%s%d%c%s", "MB:", resource_id, '=', schemata);
> + schema_len = snprintf(schema, sizeof(schema), "%s%d%c%s\n",
> +   "MB:", resource_id, '=', schemata);
>  
> - fp = fopen(controlgroup, "w");
> - if (!fp) {
> - sprintf(reason, "Failed to open control group");
> + fd = open(controlgroup, O_WRONLY);
> + if (!fd) {

Be careful ... the error checking appropriate to the original
pointer needs a double check with this new usage.
According to "man 2 open" - open() returns -1 on error so I expect
that this should rather be:
if (fd < 0) {
or
if (fd == -1) {

The rest looks good to me.

Reinette

Re: [PATCH v4 0/2] selftests/resctrl: Bug fix and optimization

2023-09-27 Thread Reinette Chatre

Hi Maciej,

On 9/22/2023 1:09 AM, Maciej Wieczor-Retman wrote:
> The resctrlfs.c file defines functions that interact with the resctrl FS
> while resctrl_val.c file defines functions that perform measurements on
> the cache. Run_benchmark() fits logically into the second file before
> resctrl_val() function that uses it.

nitpick ... if there are comments in one patch of the series please consider
if it applies to other places in the series.

> 
> Move run_benchmark() from resctrlfs.c to resctrl_val.c and remove
> redundant part of the kernel-doc comment. Make run_benchmark() static
> and remove it from the header file.
> 
> Patch series is based on [1] which is based on [2] which are based on
> ksefltest next branch.

ksefltest -> kselftest

Reinette

Re: [PATCH v2 1/6] selftests/resctrl: Extend signal handler coverage to unmount on receiving signal

2023-09-28 Thread Reinette Chatre

Hi Ilpo,

On 9/28/2023 5:47 AM, Ilpo Järvinen wrote:
> On Tue, 26 Sep 2023, Reinette Chatre wrote:
>> On 9/15/2023 8:44 AM, Ilpo Järvinen wrote:

...

>>> +
>>> +static void run_mbm_test(const char * const *benchmark_cmd, int cpu_no)
>>> +{
>>> +   int res;
>>> +
>>> +   ksft_print_msg("Starting MBM BW change ...\n");
>>> +
>>> +   if (test_prepare())
>>> +   return;
>>>  
>>
>> I am not sure about this. With this exit the kselftest machinery is not
>> aware of the test passing or failing. I wonder if there should not rather
>> be a "goto" here that triggers ksft_test_result()?
> 
> Yes, ksft_test_result() is needed here (I forgot to add it).
> 
>> This needs some more 
>> thought though. First, with this change test_prepare() officially gains
>> responsibility to determine if a failure is transient (just a single test
>> fails) or permanent (no use trying any other tests if this fails). For
>> the former it would then be up to the caller to call ksft_test_result()
>> and for the latter test_prepare() will call ksft_exit_fail_msg().
> 
> Well, I didn't initially have test_prepare() at all but all this was 
> within the test functions (which will be consolidated to a single function 
> by the series that comes after the two series are done + one patch from 
> Maciej).
> 
> I was just trying to do what was done previously but it seems I forgot to 
> handle the result status on signal reg fail path.
> 
> TBH, I wouldn't mind if also the signal reg fail is just up'ed to 
> ksft_exit_fail_msg(). I don't think it can ever fail with the parameters 
> given to it so its error handling feels pretty much dead-code (unless some 
> crazy thing such as apparmor does something out of the blue, I don't know 
> if apparmor has capability override sigaction() but I've seen apparmor to
> create errors that from the surface make no sense whatsoever comparable
> to this case).
> 
> So basically this discussion is now about what to do with the mount 
> failing which already does _exit() before this patch (and possibly some
> hypotethical, new prepare code after the consolidation work which also
> will have some impact and I believe we might actually want to kill 
> test_prepare() at that point anyway).

Having failure during signal handler registration also trigger 
ksft_exit_fail_msg()
sounds fair to me. I am also ok with keeping the exit when mount fails.

If any future test_prepare() code does not imply a test exit then I hope it 
would
be obvious that ksft_test_result() needs to be called. Perhaps that can
be accomplished if test_prepare() does not exit the test but instead just
returns an error code (if needed it can use ksft_print_msg() internally for
any details about particular failures) and the caller call ksft_exit_fail_msg()
if test_prepare() fails? With the caller responsible for the 
ksft_exit_fail_msg()
as well as ksft_test_result() then any new addition may be guided to the
right calls. This considers hypothetical future changes to code that is
being consolidated so surely no strong opinions from my side.

>> Second, that SNC warning may be an inconvenience with a new goto. Here
>> it may be ok to print that message before the test failure?
> 
> I don't follow what you're referring to with "that SNC warning". To the 
> "Intel CMT may be inaccurate ..." one?

Yes, that is the warning. I envisioned addressing the issue by adding a
goto label right before the ksft_test_result() call within run_cmt_test()
in this case (but also in run_mbm_test()). Doing so would solve the issue
that test counters are incremented on test_prepare() failure but it will
also trigger the message you note and that would be confusing to the user if
the test failure was because of signal handler registration failure.

...

>>> diff --git a/tools/testing/selftests/resctrl/resctrl_val.c 
>>> b/tools/testing/selftests/resctrl/resctrl_val.c
>>> index 51963a6f2186..a9fe61133119 100644
>>> --- a/tools/testing/selftests/resctrl/resctrl_val.c
>>> +++ b/tools/testing/selftests/resctrl/resctrl_val.c
>>> @@ -468,7 +468,9 @@ pid_t bm_pid, ppid;
>>>  
>>>  void ctrlc_handler(int signum, siginfo_t *info, void *ptr)
>>>  {
>>> -   kill(bm_pid, SIGKILL);
>>> +   /* Only kill child after bm_pid is set after fork() */
>>> +   if (bm_pid)
>>> +   kill(bm_pid, SIGKILL);
>>> umount_resctrlfs();
>>> tests_cleanup();
>>> ksft_print_msg("Ending\n\n");
>>> @@ -485,6 +487,8 @@ int signal_handler_register(void)
>>> struct sigaction siga

Re: [PATCH v4 1/2] selftests/resctrl: Fix schemata write error check

2023-09-28 Thread Reinette Chatre

Hi Maciej,

On 9/27/2023 11:46 PM, Maciej Wieczór-Retman wrote:
> On 2023-09-27 at 15:15:06 -0700, Reinette Chatre wrote:
>> On 9/22/2023 1:10 AM, Maciej Wieczor-Retman wrote:

>>> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c 
>>> b/tools/testing/selftests/resctrl/resctrlfs.c
>>> index 3a8111362d26..edc8fc6e44b0 100644
>>> --- a/tools/testing/selftests/resctrl/resctrlfs.c
>>> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
>>> @@ -8,6 +8,7 @@
>>>   *Sai Praneeth Prakhya ,
>>>   *Fenghua Yu 
>>>   */
>>> +#include 
>>>  #include 
>>>  
>>>  #include "resctrl.h"
>>> @@ -490,9 +491,8 @@ int write_bm_pid_to_resctrl(pid_t bm_pid, char 
>>> *ctrlgrp, char *mongrp,
>>>   */
>>>  int write_schemata(char *ctrlgrp, char *schemata, int cpu_no, char 
>>> *resctrl_val)
>>>  {
>>> -   char controlgroup[1024], schema[1024], reason[64];
>>> -   int resource_id, ret = 0;
>>> -   FILE *fp;
>>> +   char controlgroup[1024], schema[1024], reason[128];
>>> +   int resource_id, fd, schema_len = -1, ret = 0;
>>
>> I am trying to understand the schema_len initialization. Could
>> you please elaborate why you chose -1? I'm a bit concerned with
>> the robustness here with it being used as an unsigned integer
>> in write() and also the negative array index later.
> 
> My idea was that if the initial value for schema_len was 0, then if
> resctrl_val wouldn't equal any of MBA_STR, MBM_STR, CAT_STR, CMT_STR

Ensuring that resctrl_val is equal to one of these seems to be the
first thing write_schemata() does.

> values schema_len would stay zero and write nothing.

Your alternative writes "-1". write() is declared as:
ssize_t write(int fd, const void *buf, size_t count);

note that "count" is size_t, which is an unsigned value. Providing
it -1 is thus a very large number and likely to cause overflow. In fact
if I even try to compile a program where the compiler can figure out
count will be -1 it fails the compile (stringop-overflow).
 
> I think it would be difficult to debug such an error because even later
> in ksft_print_msg the requested schema would get printed as if there was
> no error. In the case I mentioned above the function will just error out
> which I assume could be helpful.

You seem to rely on write() to cleanly catch giving it bad data.

> Other solutions that can accomplish the same goal would be checking
> write() not only for negative values but also for zero (since in
> here this is pretty much an error). Or checking schema_len for only
> positive values after the block of code where it gets assigned a
> value from sprintf.
> 
> Are any of the above safer or more logical in your opinion?

There is no error checking on schema_len. After it has been initialized it
can be checked for errors and write_schemata() can be exited immediately if
an error was encountered without attempting the write().

Reinette

Re: [PATCH v3 1/7] selftests/resctrl: Fix uninitialized .sa_flags

2023-09-29 Thread Reinette Chatre

Hi Ilpo,

On 9/29/2023 4:20 AM, Ilpo Järvinen wrote:
> signal_handler_unregister() calls sigaction() with uninitializing
> sa_flags in the struct sigaction.
> 
> Make sure sa_flags is always initialized in signal_handler_unregister()
> by initializing the struct sigaction when declaring it.
> 
> Fixes: 73c55fa5ab55 (selftests/resctrl: Commonize the signal handler 
> register/unregister for all tests)

Please place the title line in quotes (checkpatch warning).

> Suggested-by: Reinette Chatre 
> Signed-off-by: Ilpo Järvinen 
> Cc: 
> ---
>  tools/testing/selftests/resctrl/resctrl_val.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/resctrl/resctrl_val.c 
> b/tools/testing/selftests/resctrl/resctrl_val.c
> index 51963a6f2186..1e8b90077218 100644
> --- a/tools/testing/selftests/resctrl/resctrl_val.c
> +++ b/tools/testing/selftests/resctrl/resctrl_val.c
> @@ -504,7 +504,7 @@ int signal_handler_register(void)
>   */
>  void signal_handler_unregister(void)
>  {
> - struct sigaction sigact;
> + struct sigaction sigact = {};
>  
>   sigact.sa_handler = SIG_DFL;
>   sigemptyset(&sigact.sa_mask);

Could you please add this initialization to signal_handler_register()
also? I understand that the particular issue of sa_flags is not 
relevant to that function but there are other uninitialized fields.
I think initializing the struct makes the code more robust without
needing to reason/understand glibc behavior.

Reinette

Re: [PATCH v3 2/7] selftests/resctrl: Extend signal handler coverage to unmount on receiving signal

2023-09-29 Thread Reinette Chatre

Hi Ilpo,

On 9/29/2023 4:20 AM, Ilpo Järvinen wrote:
...
> diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c 
> b/tools/testing/selftests/resctrl/resctrl_tests.c
> index 823672a20a43..4eab2fad97fb 100644
> --- a/tools/testing/selftests/resctrl/resctrl_tests.c
> +++ b/tools/testing/selftests/resctrl/resctrl_tests.c
> @@ -67,21 +67,45 @@ void tests_cleanup(void)
>   cat_test_cleanup();
>  }
>  
> -static void run_mbm_test(const char * const *benchmark_cmd, int cpu_no)
> +static int test_prepare()

Please change to test_prepare(void) (checkpatch error).

>  {
>   int res;
>  
> - ksft_print_msg("Starting MBM BW change ...\n");
> + res = signal_handler_register();
> + if (res) {
> + ksft_print_msg("Failed to register signal handler\n");
> + return res;
> + }
>  
>   res = mount_resctrlfs();
>   if (res) {
> - ksft_exit_fail_msg("Failed to mount resctrl FS\n");
> + signal_handler_unregister();
> + ksft_print_msg("Failed to mount resctrl FS\n");
> + return res;
> + }
> + return 0;
> +}
> +
> +static void test_cleanup()

Please change to test_cleanup(void) (checkpatch error).

> +{
> +     umount_resctrlfs();
> + signal_handler_unregister();
> +}
> +

With the above two reports addressed you can add:

Reviewed-by: Reinette Chatre 

Thank you.

Reinette

Re: [PATCH v5 1/2] selftests/resctrl: Fix schemata write error check

2023-09-29 Thread Reinette Chatre

Hi Maciej,

On 9/29/2023 1:21 AM, Maciej Wieczor-Retman wrote:
...

> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c 
> b/tools/testing/selftests/resctrl/resctrlfs.c
> index 3a8111362d26..342a3dbcdbb6 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -8,6 +8,7 @@
>   *Sai Praneeth Prakhya ,
>   *Fenghua Yu 
>   */
> +#include 
>  #include 
>  
>  #include "resctrl.h"
> @@ -490,9 +491,8 @@ int write_bm_pid_to_resctrl(pid_t bm_pid, char *ctrlgrp, 
> char *mongrp,
>   */
>  int write_schemata(char *ctrlgrp, char *schemata, int cpu_no, char 
> *resctrl_val)
>  {
> - char controlgroup[1024], schema[1024], reason[64];
> - int resource_id, ret = 0;
> - FILE *fp;
> + char controlgroup[1024], schema[1024], reason[128];
> + int resource_id, fd, schema_len = 0, ret = 0;
>  
>   if (strncmp(resctrl_val, MBA_STR, sizeof(MBA_STR)) &&
>   strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR)) &&
> @@ -520,27 +520,37 @@ int write_schemata(char *ctrlgrp, char *schemata, int 
> cpu_no, char *resctrl_val)
>  
>   if (!strncmp(resctrl_val, CAT_STR, sizeof(CAT_STR)) ||
>   !strncmp(resctrl_val, CMT_STR, sizeof(CMT_STR)))
> - sprintf(schema, "%s%d%c%s", "L3:", resource_id, '=', schemata);
> + schema_len = snprintf(schema, sizeof(schema), "%s%d%c%s\n",
> +   "L3:", resource_id, '=', schemata);
>   if (!strncmp(resctrl_val, MBA_STR, sizeof(MBA_STR)) ||
>   !strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR)))
> - sprintf(schema, "%s%d%c%s", "MB:", resource_id, '=', schemata);
> + schema_len = snprintf(schema, sizeof(schema), "%s%d%c%s\n",
> +   "MB:", resource_id, '=', schemata);
> + if (schema_len < 1) {

I find that this complicates the code since this is not the typical snprintf()
error checking (0 is a valid snprintf() return). I think it will make the code
easier to understand if it sticks to snprintf() error checking and initialize
schema_len to match. What I mean with this is something like this:

int schema_len = -1;

if (...)
schema_len = snprintf(...);

if (schema_len < 0 || schema_len >= sizeof(schema))
/* error handling */

> + snprintf(reason, sizeof(reason),
> +  "snprintf() failed with return value : %d", 
> schema_len);
> + ret = -1;
> + goto out;
> + }
>  
> - fp = fopen(controlgroup, "w");
> - if (!fp) {
> - sprintf(reason, "Failed to open control group");
> + fd = open(controlgroup, O_WRONLY);
> + if (fd < 0) {
> + snprintf(reason, sizeof(reason),
> +  "open() failed : %s", strerror(errno));
>   ret = -1;
>  
>   goto out;
>   }
> -
> - if (fprintf(fp, "%s\n", schema) < 0) {
> - sprintf(reason, "Failed to write schemata in control group");
> - fclose(fp);
> + if (write(fd, schema, schema_len) < 0) {
> + snprintf(reason, sizeof(reason),
> +  "write() failed : %s", strerror(errno));
> + close(fd);
>   ret = -1;
>  
>   goto out;
>   }
> - fclose(fp);
> + close(fd);
> + schema[schema_len - 1] = 0;
>  
>  out:
>   ksft_print_msg("Write schema \"%s\" to resctrl FS%s%s\n",

Also please note that _if_ there is an early exit then uninitialized schema
will be printed. Maybe this also needs a schema[1024] = {} ?

Reinette

Re: [PATCH v3 1/8] selftests: Add printf attribute to ksefltest prints

2023-09-29 Thread Reinette Chatre

Hi Maciej,

In subject: ksefltest -> kselftest

On 9/22/2023 2:06 AM, Maciej Wieczor-Retman wrote:
> Kselftest header defines multiple variadic functions that use printf
> along with other logic.
> 
> There is no format checking for the variadic functions that use
> printing inside kselftest.h. Because of this the compiler won't
> be able to catch instances of mismatched printf formats and debugging
> tests might be more difficult.
> 
> Add the common __printf attribute macro to kselftest.h.
> 
> Add __printf attribute to every function using formatted printing with
> variadic arguments.
> 
> Reviewed-by: Ilpo Järvinen 
> Signed-off-by: Maciej Wieczor-Retman 

With typo fixed:
Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 8/8] selftests/resctrl: Fix wrong format specifier

2023-09-29 Thread Reinette Chatre

Hi Maciej,

On 9/22/2023 2:06 AM, Maciej Wieczor-Retman wrote:
> A long unsigned int variable is passed to the ksft_print_msg() and the
> format specifier used expects a variable of type int.
> 
> Change the format specifier to match the passed variable.
> 
> Reviewed-by: Ilpo Järvinen 
> Signed-off-by: Maciej Wieczor-Retman 
> ---

Thank you very much for this cleanup.

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v4 1/7] selftests/resctrl: Fix uninitialized .sa_flags

2023-10-02 Thread Reinette Chatre

Hi Ilpo,

On 10/2/2023 2:48 AM, Ilpo Järvinen wrote:
> signal_handler_unregister() calls sigaction() with uninitializing
> sa_flags in the struct sigaction.
> 
> Make sure sa_flags is always initialized in signal_handler_unregister()
> by initializing the struct sigaction when declaring it. Also add the
> initialization to signal_handler_register() even if there are no know
> bugs in there because correctness is then obvious from the code itself.
> 
> Fixes: 73c55fa5ab55 ("selftests/resctrl: Commonize the signal handler 
> register/unregister for all tests")
> Suggested-by: Reinette Chatre 
> Signed-off-by: Ilpo Järvinen 
> Cc: 

Thank you.

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v4 0/7] selftests/resctrl: Fixes to failing tests

2023-10-02 Thread Reinette Chatre




On 10/2/2023 4:31 AM, Ilpo Järvinen wrote:
> On Mon, 2 Oct 2023, Ilpo Järvinen wrote:
> 
>> Fix four issues with resctrl selftests.
>>
>> The signal handling fix became necessary after the mount/umount fixes
>> and the uninitialized member bug was discovered during the review.
>>
>> The other two came up when I ran resctrl selftests across the server
>> fleet in our lab to validate the upcoming CAT test rewrite (the rewrite
>> is not part of this series).
>>
>> These are developed and should apply cleanly at least on top the
>> benchmark cleanup series (might apply cleanly also w/o the benchmark
>> series, I didn't test).
> 
> LKP seems to no longer happy to apply this cleanly without the benchmark 
> rework series as the signal handling fix got now a bigger footprint 
> (maybe LKP didn't build v3 at all as the changes from v3->v4 were really 
> small and I did not get error out of v3).
> 
> Shuah, would you want me to reorganize this such that it goes in before 
> the benchmark series? In any case, I'll wait until Reinette has had time 
> to look at the first patch if I'm to send the series reordered.

That sounds unnecessary to me because I assume that doing such reorganization
would require a new version of the benchmark series [1] that has been ready for
a while now.

Both series look good to me. I just added my "Reviewed-by" to the first patch of
this series and it (this series) applies cleanly on top of the benchmark series.

Reinette

[1] 
https://lore.kernel.org/lkml/20230904095339.11321-1-ilpo.jarvi...@linux.intel.com/

Re: [PATCH v2 1/6] selftests/resctrl: Extend signal handler coverage to unmount on receiving signal

2023-10-04 Thread Reinette Chatre

Hi Shaopeng,

On 9/28/2023 1:10 AM, Shaopeng Tan (Fujitsu) wrote:
>> On 9/15/2023 8:44 AM, Ilpo Järvinen wrote:

...

>>> +static void run_mbm_test(const char * const *benchmark_cmd, int
>>> +cpu_no) {
>>> +   int res;
>>> +
>>> +   ksft_print_msg("Starting MBM BW change ...\n");
>>> +
>>> +   if (test_prepare())
>>> +   return;
>>>
>>
>> I am not sure about this. With this exit the kselftest machinery is not 
>> aware of
>> the test passing or failing. I wonder if there should not rather be a "goto" 
>> here
>> that triggers ksft_test_result()? This needs some more thought though. First,
>> with this change test_prepare() officially gains responsibility to determine 
>> if a
>> failure is transient (just a single test
>> fails) or permanent (no use trying any other tests if this fails). For the 
>> former it
>> would then be up to the caller to call ksft_test_result() and for the latter
>> test_prepare() will call ksft_exit_fail_msg().
>> Second, that SNC warning may be an inconvenience with a new goto. Here it
>> may be ok to print that message before the test failure?
> 
> If a failure may be permanent, it may be best to detect it before running all 
> tests, rather than in test_prepare().
> Now some detections are completed before running all tests. For example:
> 273 if (geteuid() != 0)
> 274 return ksft_exit_skip("Not running as root. 
> Skipping...\n");
> 275
> 276 if (!check_resctrlfs_support())
> 277 return ksft_exit_skip("resctrl FS does not exist. Enable 
> X86_CPU_RESCTRL config option.\n");
> 278
> 279 if (umount_resctrlfs())
> 280 return ksft_exit_skip("resctrl FS unmount failed.\n");
> 

You are correct that the tests should aim to detect as early as possible if
no test has a chance of succeeding. This is covered in the checks you mention.
The purpose of test_prepare()/test_cleanup() pair is to perform actions that
should be done for every test. For example, resctrl is mounted before each
test and unmounted after each test. Since these actions are required to be done
for every test it cannot be a single call before all tests are run.

It may be possible to add a test_prepare() directly followed by a test_cleanup()
before any test is run to be more explicit about early detection but that
does not seem necessary considering the checks would be done anyway when the
first test is run. Even when doing so it would not eliminate the need for 
test_prepare()/test_cleanup() to form part of every test run and needing to exit
if, for example, a previous test triggered a fault preventing resctrl from
being mounted.

Reinette

Re: [PATCH v6 1/2] selftests/resctrl: Fix schemata write error check

2023-10-09 Thread Reinette Chatre

  }
> -
> - if (fprintf(fp, "%s\n", schema) < 0) {
> - sprintf(reason, "Failed to write schemata in control group");
> - fclose(fp);
> + if (write(fd, schema, schema_len) < 0) {
> + snprintf(reason, sizeof(reason),
> +  "write() failed : %s", strerror(errno));
> + close(fd);
>   ret = -1;
>  
>   goto out;
>   }
> - fclose(fp);
> + close(fd);
> + schema[schema_len - 1] = 0;
>  
>  out:
>   ksft_print_msg("Write schema \"%s\" to resctrl FS%s%s\n",


As changelog states, the newline is removed from schema to
ensure it is printed correctly. Note that this is not done when an
error is encountered during open() or write() so when an error is
encountered in these places then the print does not look as intended.

I think a new goto label inserted just before the newline removal
should be sufficient, with the open() and write() error paths jumping
to it.

With that addressed:

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v6 0/8] Add printf attribute to kselftest functions

2023-10-13 Thread Reinette Chatre




On 10/13/2023 2:03 PM, Shuah wrote:
> Applied this series and the following three - all 25 patches are
> in linux-kselftest next for Linux 6.7-rc1.
> 
> [1] 
> https://lore.kernel.org/all/cover.1696932728.git.maciej.wieczor-ret...@intel.com/
> [2] 
> https://lore.kernel.org/all/20231002094813.6633-1-ilpo.jarvi...@linux.intel.com/
> [3] 
> https://lore.kernel.org/all/20230904095339.11321-1-ilpo.jarvi...@linux.intel.com/
> 

Thank you very much Shuah.

Reinette

Re: [PATCH 1/1] selftests/resctrl: Don't fail MBM test when schemata doesn't support MB:x=x line

2023-10-17 Thread Reinette Chatre

Hi Ilpo,

The subject could be shortened to:
selftests/resctrl: Fix MBM test failure when MBA unavailable

On 10/17/2023 5:17 AM, Ilpo Järvinen wrote:
> Commit 20d96b25cc4c ("selftests/resctrl: Fix schemata write error
> check") exposed a problem in feature detection logic in MBM selftest.
> If schemata does not support MB:x=x entries, the schemata write to
> initialize 100% memory bandwidth allocation in mbm_setup() will now
> fail with -EINVAL due to the error handling corrected by 20d96b25cc4c.
> Commit 20d96b25cc4c just uncovers the failed write, it is not wrong
> itself.
> 
> If MB:x=x is not supported by schemata, it is safe to assume 100%
> memory bandwidth is always set. Therefore, the previously ignored error
> does not make the MBM test itself wrong.
> 
> Restore the previous behavior of MBM test by checking MB support before
> attempting to write it into schemata which results in behavior
> equivalent to ignoring the write error.
> 
> Fixes: 20d96b25cc4c ("selftests/resctrl: Fix schemata write error check")
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/mbm_test.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/resctrl/mbm_test.c 
> b/tools/testing/selftests/resctrl/mbm_test.c
> index d3c0d30c676a..85987957e7f5 100644
> --- a/tools/testing/selftests/resctrl/mbm_test.c
> +++ b/tools/testing/selftests/resctrl/mbm_test.c
> @@ -95,7 +95,7 @@ static int mbm_setup(struct resctrl_val_param *p)
>   return END_OF_TESTS;
>  
>   /* Set up shemata with 100% allocation on the first run. */
> - if (p->num_of_runs == 0)
> + if ((p->num_of_runs == 0) && validate_resctrl_feature_request("MB", 
> NULL))
>   ret = write_schemata(p->ctrlgrp, "100", p->cpu_no,
>    p->resctrl_val);
>  

Thank you for catching this.

Could you please ensure that this patch passes a "checkpatch.pl --strict" check?
With that addressed:
Reviewed-by: Reinette Chatre  

Reinette

Re: [PATCH 07/24] selftests/resctrl: Split measure_cache_vals() function

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> The measure_cache_vals() function does a different thing depending on

No need to say "function" when using "()". The above can just be:
"measure_cache_vals() does a different ..."

> the test case that called it:
>   - For CAT, it measures LLC perf misses.
>   - For CMT, it measures LLC occupancy through resctrl.
> 
> Split these two functionalities into own functions the CAT and CMT
> tests can call directly.
> 
> Co-developed-by: Fenghua Yu 
> Signed-off-by: Fenghua Yu 
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/cache.c   | 37 ++-
>  tools/testing/selftests/resctrl/resctrl.h |  2 +-
>  tools/testing/selftests/resctrl/resctrl_val.c |  2 +-
>  3 files changed, 21 insertions(+), 20 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cache.c 
> b/tools/testing/selftests/resctrl/cache.c
> index bcbca356d56a..299d9508221f 100644
> --- a/tools/testing/selftests/resctrl/cache.c
> +++ b/tools/testing/selftests/resctrl/cache.c
> @@ -170,35 +170,36 @@ static int print_results_cache(char *filename, int 
> bm_pid,
>   return 0;
>  }
>  
> -int measure_cache_vals(struct resctrl_val_param *param, int bm_pid)
> +static int measure_llc_perf(struct resctrl_val_param *param, int bm_pid)
>  {
> - unsigned long llc_perf_miss = 0, llc_occu_resc = 0, llc_value = 0;
> + unsigned long llc_perf_miss = 0;
>   int ret;
>  
>   /*
>* Measure cache miss from perf.
>*/

I'd expect these comments to move as part of a change like this,
but seems like this is merged with other changes in patch 10. Does not
seem like an issue done in this way.

> - if (!strncmp(param->resctrl_val, CAT_STR, sizeof(CAT_STR))) {
> - ret = get_llc_perf(&llc_perf_miss);
> - if (ret < 0)
> - return ret;
> - llc_value = llc_perf_miss;
> - }
> + ret = get_llc_perf(&llc_perf_miss);
> + if (ret < 0)
> + return ret;
> +
> + ret = print_results_cache(param->filename, bm_pid, llc_perf_miss);
> + return ret;
> +}
> +
> +int measure_llc_resctrl(struct resctrl_val_param *param, int bm_pid)
> +{
> + unsigned long llc_occu_resc = 0;
> + int ret;
>  
>   /*
>* Measure llc occupancy from resctrl.
>*/
> - if (!strncmp(param->resctrl_val, CMT_STR, sizeof(CMT_STR))) {
> - ret = get_llc_occu_resctrl(&llc_occu_resc);
> - if (ret < 0)
> - return ret;
> - llc_value = llc_occu_resc;
> - }
> - ret = print_results_cache(param->filename, bm_pid, llc_value);
> - if (ret)
> + ret = get_llc_occu_resctrl(&llc_occu_resc);
> + if (ret < 0)
>   return ret;
>  
> - return 0;
> + ret = print_results_cache(param->filename, bm_pid, llc_occu_resc);
> + return ret;
>  }
>  
>  /*
> @@ -253,7 +254,7 @@ int cat_val(struct resctrl_val_param *param, size_t span)
>   }
>  
>   sleep(1);
> - ret = measure_cache_vals(param, bm_pid);
> + ret = measure_llc_perf(param, bm_pid);
>   if (ret)
>   goto pe_close;
>   }


Reinette

Re: [PATCH 09/24] selftests/resctrl: Remove unnecessary __u64 -> unsigned long conversion

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> Perf counters are __u64 but the code converts them to unsigned long
> before printing them out.
> 
> Remove unnecessary type conversion and the potential loss of meaningful
> bits due to different sizes of types.

This motivation is not clear to me. Is this work done in
preparation for 32 bit testing? This raises a lot more topics if
this is the case.

Reinette

Re: [PATCH 10/24] selftests/resctrl: Remove nested calls in perf event handling

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> Perf event handling has functions that are the sole caller of another
> perf event handling related function:
>   - reset_enable_llc_perf() calls perf_event_open_llc_miss()
>   - reset_enable_llc_perf() calls ioctl_perf_event_ioc_reset_enable()
>   - measure_llc_perf() calls get_llc_perf()
> 
> Remove the extra layer of calls to make the code easier to follow by
> moving the code into the calling function.
> 
> In addition, converts print_results_cache() unsigned long parameter to
> __u64 that matches the type coming from perf.

Is this referring to work from previous patch?

> 
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/cache.c | 86 +++--
>  1 file changed, 25 insertions(+), 61 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cache.c 
> b/tools/testing/selftests/resctrl/cache.c
> index d39ef4eebc37..208af1ecae28 100644
> --- a/tools/testing/selftests/resctrl/cache.c
> +++ b/tools/testing/selftests/resctrl/cache.c
> @@ -29,25 +29,6 @@ static void initialize_perf_event_attr(void)
>   pea_llc_miss.disabled = 1;
>  }
>  
> -static void ioctl_perf_event_ioc_reset_enable(void)
> -{
> - ioctl(fd_lm, PERF_EVENT_IOC_RESET, 0);
> - ioctl(fd_lm, PERF_EVENT_IOC_ENABLE, 0);
> -}
> -
> -static int perf_event_open_llc_miss(pid_t pid, int cpu_no)
> -{
> - fd_lm = perf_event_open(&pea_llc_miss, pid, cpu_no, -1,
> - PERF_FLAG_FD_CLOEXEC);
> - if (fd_lm == -1) {
> - perror("Error opening leader");
> - ctrlc_handler(0, NULL, NULL);
> - return -1;
> - }
> -
> - return 0;
> -}
> -
>  static void initialize_llc_perf(void)
>  {
>   memset(&pea_llc_miss, 0, sizeof(struct perf_event_attr));
> @@ -63,42 +44,16 @@ static void initialize_llc_perf(void)
>  
>  static int reset_enable_llc_perf(pid_t pid, int cpu_no)
>  {
> - int ret = 0;
> -
> - ret = perf_event_open_llc_miss(pid, cpu_no);
> - if (ret < 0)
> - return ret;
> -
> - /* Start counters to log values */
> - ioctl_perf_event_ioc_reset_enable();
> -
> - return 0;
> -}
> -
> -/*
> - * get_llc_perf: llc cache miss through perf events
> - * @llc_perf_miss:   LLC miss counter that is filled on success
> - *
> - * Perf events like HW_CACHE_MISSES could be used to validate number of
> - * cache lines allocated.
> - *
> - * Return: =0 on success.  <0 on failure.
> - */
> -static int get_llc_perf(__u64 *llc_perf_miss)
> -{
> - int ret;
> -
> - /* Stop counters after one span to get miss rate */
> -
> - ioctl(fd_lm, PERF_EVENT_IOC_DISABLE, 0);
> -
> - ret = read(fd_lm, &rf_cqm, sizeof(struct read_format));
> - if (ret == -1) {
> - perror("Could not get llc misses through perf");
> + fd_lm = perf_event_open(&pea_llc_miss, pid, cpu_no, -1, 
> PERF_FLAG_FD_CLOEXEC);
> + if (fd_lm == -1) {
> + perror("Error opening leader");
> + ctrlc_handler(0, NULL, NULL);

I understand you just copied the code here ... but it is not clear to me
why this particular error handling deserves a ctrlc_handler().

>   return -1;
>   }
>  
> - *llc_perf_miss = rf_cqm.values[0].value;
> + /* Start counters to log values */
> + ioctl(fd_lm, PERF_EVENT_IOC_RESET, 0);
> + ioctl(fd_lm, PERF_EVENT_IOC_ENABLE, 0);
>  
>   return 0;
>  }
> @@ -166,20 +121,29 @@ static int print_results_cache(char *filename, int 
> bm_pid, __u64 llc_value)
>   return 0;
>  }
>  
> +/*
> + * measure_llc_perf: measure perf events
> + * @bm_pid:  child pid that runs benchmark

I expected "bm_pid" to reflect a "benchmark pid" that
is not unique to the child. Are both parent and child
not running the benchmark?

Missing doc of a parameter here.

> + *
> + * Measure things like cache misses from perf events.

"things like cache misses" is vague. The function's name 
still contains "llc" which makes me think it is not quite
generic yet.



> + *
> + * Return: =0 on success.  <0 on failure.
> + */
>  static int measure_llc_perf(struct resctrl_val_param *param, int bm_pid)
>  {
> - __u64 llc_perf_miss = 0;
>   int ret;
>  
> - /*
> -  * Measure cache miss from perf.
> -  */
> - ret = get_llc_perf(&llc_perf_miss);
> - if (ret < 0)
> - return ret;
> + /* Stop counters after one span to get miss rate */
> + ioctl(fd_lm, PERF_EVENT_IOC_DISABLE, 0);
>  
> - ret = print_results_cache(param->filename, bm_pid, llc_perf_miss);
> - return ret;
> + ret = read(fd_lm, &rf_cqm, sizeof(struct read_format));
> + close(fd_lm);

I am not able to tell where this close() moved from.

> + if (ret == -1) {
> + perror("Could not get perf value");
> + return -1;
> + }
> +
> + return print_results_cache(param->filename, bm_pid, 
> rf_cqm.values[0].value);
>  }
>  
>  int measure_llc_resctrl(struct resctrl_val_param *pa

Re: [PATCH 17/24] selftests/resctrl: Create struct for input parameter

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:

...

> diff --git a/tools/testing/selftests/resctrl/mba_test.c 
> b/tools/testing/selftests/resctrl/mba_test.c
> index d3bf4368341e..5157a3f74fee 100644
> --- a/tools/testing/selftests/resctrl/mba_test.c
> +++ b/tools/testing/selftests/resctrl/mba_test.c
> @@ -141,13 +141,13 @@ void mba_test_cleanup(void)
>   remove(RESULT_FILE_NAME);
>  }
>  
> -int mba_schemata_change(int cpu_no, const char * const *benchmark_cmd)
> +int mba_schemata_change(const struct user_params *uparams)
>  {
>   struct resctrl_val_param param = {
>   .resctrl_val= MBA_STR,
>   .ctrlgrp= "c1",
>   .mongrp = "m1",
> - .cpu_no = cpu_no,
> + .cpu_no = uparams->cpu,
>   .filename   = RESULT_FILE_NAME,
>   .bw_report  = "reads",
>   .setup  = mba_setup
> @@ -156,7 +156,7 @@ int mba_schemata_change(int cpu_no, const char * const 
> *benchmark_cmd)
>  
>   remove(RESULT_FILE_NAME);
>  
> - ret = resctrl_val(benchmark_cmd, ¶m);
> + ret = resctrl_val(uparams->benchmark_cmd, ¶m);
>   if (ret)
>   goto out;
>  

How about a new member of struct resctrl_val_param that points to 
uparams? That would remove cpu_no from resctrl_val_param
and have everything available when a test needs to run ... not copying
some user parameters into struct resctrl_val_param and passing
others as parameters.

Reinette

Re: [PATCH 03/24] selftests/resctrl: Refactor get_cbm_mask()

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:

...

> @@ -229,6 +228,31 @@ int get_cbm_mask(char *cache_type, char *cbm_mask)
>   return 0;
>  }
>  
> +/*
> + * get_cbm_mask - Get cbm bit mask

I know you just copied code here but please keep an eye out for acronyms
to be written in caps.

This needs not be named to reflect verbatim what the function does.
Looking ahead I wonder if "get_full_mask()/get_max_mask()" may not be a
clear indication of what this does?

Something like:
get_full_mask()/get_max_mask()  Get maximum Cache Bit Mask (CBM) 
@cache_type:Cache level(? or should this be "type") as "L2" or L3".
@mask:  Full/Maximum portion of cache available for
allocation represented by bit mask
returned as unsigned long.

> + * @cache_type:  Cache level L2/L3
> + * @mask:cbm_mask returned as unsigned long
> + *
> + * Return: = 0 on success, < 0 on failure.
> + */
> +int get_cbm_mask(const char *cache_type, unsigned long *mask)
> +{
> + char cbm_mask_path[1024];
> + int ret;
> +
> + if (!cache_type)
> + return -1;

Just to confirm ... error checking on mask is intentionally deferred 
until get_bit_mask()?

> +
> + snprintf(cbm_mask_path, sizeof(cbm_mask_path), "%s/%s/cbm_mask",
> +  INFO_PATH, cache_type);
> +
> + ret = get_bit_mask(cbm_mask_path, mask);
> + if (ret)
> + return -1;
> +
> + return 0;
> +}
> +
>  /*
>   * get_core_sibling - Get sibling core id from the same socket for given CPU
>   * @cpu_no:  CPU number

Reinette

Re: [PATCH 05/24] selftests/resctrl: Create cache_size() helper

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> CAT and CMT tests calculate the span size from the n-bits cache
> allocation on their own.
> 
> Add cache_size() helper which calculates size of the cache portion for
> the given number of bits and use it to replace the existing span
> calculations. This also prepares for the new CAT test that will need to
> determine the size of the cache portion also during results processing.
> 
> cache_size local variables were renamed out of the way to
> cache_total_size.

Please do stick to imperative mood ... "Rename cache_size local
variables ..."


...

> diff --git a/tools/testing/selftests/resctrl/resctrl.h 
> b/tools/testing/selftests/resctrl/resctrl.h
> index 2f3f0ee439d8..da06b2d492f9 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -117,4 +117,18 @@ int show_cache_info(unsigned long sum_llc_val, int 
> no_of_bits,
>   unsigned long max_diff_percent, unsigned long num_of_runs,
>   bool platform, bool cmt);
>  
> +/*
> + * cache_size - Calculate the size of a cache portion
> + * @cache_size:  Cache size in bytes
> + * @mask:Cache portion mask
> + * @cache_mask:  Full bitmask for the cache
> + *
> + * Return: The size of the cache portion in bytes.
> + */
> +static inline int cache_size(unsigned long cache_size, unsigned long mask,
> +  unsigned long cache_mask)
> +{
> + return cache_size * count_bits(mask) / count_bits(cache_mask);
> +}
> +
>  #endif /* RESCTRL_H */


The get_cache_size() and cache_size() naming appears similar enough to me
to cause confusion. Considering the "portion" term above, what do you think
of "cache_portion_size()" or even "cache_portion_bytes()"?

Reinette

Re: [PATCH 18/24] selftests/resctrl: Introduce generalized test framework

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
...
> diff --git a/tools/testing/selftests/resctrl/resctrl.h 
> b/tools/testing/selftests/resctrl/resctrl.h
> index ec6efd36f60a..e017adf1390d 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -37,6 +37,8 @@
>  
>  #define DEFAULT_SPAN (250 * MB)
>  
> +#define ARRAY_SIZE(arr)  (sizeof(arr) / sizeof((arr)[0]))
> +

Shuah worked hard to remove all these copies within kselftest. Please
use the one in kselftest.h instead.

>  #define PARENT_EXIT(err_msg) \
>   do {\
>   perror(err_msg);\


...

> @@ -233,25 +183,26 @@ int main(int argc, char **argv)
>   case 't':
>   token = strtok(optarg, ",");
>  
> - mbm_test = false;
> - mba_test = false;
> - cmt_test = false;
> - cat_test = false;
> + if (!test_param_seen) {
> + for (i = 0; i < ARRAY_SIZE(resctrl_tests); i++)
> + resctrl_tests[i]->disabled = true;
> + tests = 0;
> + test_param_seen = true;
> + }
>   while (token) {
> - if (!strncmp(token, MBM_STR, sizeof(MBM_STR))) {
> - mbm_test = true;
> - tests++;
> - } else if (!strncmp(token, MBA_STR, 
> sizeof(MBA_STR))) {
> - mba_test = true;
> - tests++;
> - } else if (!strncmp(token, CMT_STR, 
> sizeof(CMT_STR))) {
> - cmt_test = true;
> - tests++;
> - } else if (!strncmp(token, CAT_STR, 
> sizeof(CAT_STR))) {
> - cat_test = true;
> - tests++;
> - } else {
> - printf("invalid argument\n");
> + bool found = false;
> +
> + for (i = 0; i < ARRAY_SIZE(resctrl_tests); i++) 
> {
> + if (!strcasecmp(token, 
> resctrl_tests[i]->name)) {
> + if (resctrl_tests[i]->disabled)
> + tests++;
> + resctrl_tests[i]->disabled = 
> false;
> + found = true;
> + }
> + }

Could providing multiple "-t" result in the test count not
matching the number of tests run?

> +
> + if (!found) {
> + printf("invalid test: %s\n", token);
>  
>   return -1;
>   }

A great improvement, thanks.

Reinette

Re: [PATCH 01/24] selftests/resctrl: Split fill_buf to allow tests finer-grained control

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> MBM, MBA and CMT test cases use run_fill_buf() to loop indefinitely
> around the buffer. CAT test case is different and doesn't want to loop
> around the buffer continuously.

Please do not that the changelog starts by describing issue with
run_fill_buf() (that does not appear in patch) and then switches to
describe solution for fill_cache() below.

> 
> Split fill_cache() so that both the use cases are easier to control by
> creating separate functions for buffer allocation and looping around
> the buffer. Make those functions available for tests. The new interface
> is based on returning/passing pointers instead of the startptr global
> pointer variable that can now be removed. The deallocation can use
> free() directly.

Seems like startptr removal has been done already:
5e3e4f1a03f0 ("selftests/resctrl: Remove unnecessary startptr global from 
fill_buf") 

Reinette

Re: [PATCH 24/24] selftests/resctrl: Ignore failures from L2 CAT test with <= 2 bits

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> L2 CAT test with low number of bits tends to occasionally fail because
> of what seems random variation. The margin is quite small to begin with
> for <= 2 bits in CBM. At times, the result can even become negative.
> While it would be possible to allow negative values for those cases, it
> would be more confusing to user.
> 
> Ignore failures from the tests where <= 2 were used to avoid false
> negative results.
> 

I think the core message is that 2 or fewer bits should not be used. Instead
of running the test and ignoring the results the test should perhaps just not
be run.

Reinette

Re: [PATCH 08/24] selftests/resctrl: Split show_cache_info() to test specific and generic parts

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> show_cache_info() calculates results and provides generic cache
> information. This makes it hard to alter pass/fail conditions.
> 
> Separate the test specific checks into CAT and CMT test files and
> leave only the generic information part into show_cache_info().
> 
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/cache.c| 40 --
>  tools/testing/selftests/resctrl/cat_test.c | 30 ++--
>  tools/testing/selftests/resctrl/cmt_test.c | 32 +++--
>  tools/testing/selftests/resctrl/resctrl.h  |  6 ++--
>  4 files changed, 65 insertions(+), 43 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cache.c 
> b/tools/testing/selftests/resctrl/cache.c
> index 299d9508221f..95489d4b42b7 100644
> --- a/tools/testing/selftests/resctrl/cache.c
> +++ b/tools/testing/selftests/resctrl/cache.c
> @@ -267,43 +267,17 @@ int cat_val(struct resctrl_val_param *param, size_t 
> span)
>  }
>  
>  /*
> - * show_cache_info:  show cache test result information
> - * @sum_llc_val: sum of LLC cache result data
> + * show_cache_info:  show generic cache test information
>   * @no_of_bits:  number of bits
> - * @cache_span:  cache span in bytes for CMT or in lines for CAT
> - * @max_diff:max difference
> - * @max_diff_percent:max difference percentage
> - * @num_of_runs: number of runs
> - * @platform:show test information on this platform
> - * @cmt: CMT test or CAT test
> - *
> - * Return:   0 on success. non-zero on failure.
> + * @avg_llc_val: avg of LLC cache result data
> + * @cache_span:  cache span
> + * @lines:   cache span in lines or bytes

It may be helpful to directly connect it to the other parameter:
@cache_span in lines or bytes

>   */
> -int show_cache_info(unsigned long sum_llc_val, int no_of_bits,
> - size_t cache_span, unsigned long max_diff,
> - unsigned long max_diff_percent, unsigned long num_of_runs,
> - bool platform, bool cmt)
> +void show_cache_info(int no_of_bits, unsigned long avg_llc_val,
> +  size_t cache_span, bool lines)
>  {
> - unsigned long avg_llc_val = 0;
> - float diff_percent;
> - long avg_diff = 0;
> - int ret;
> -
> - avg_llc_val = sum_llc_val / num_of_runs;
> - avg_diff = (long)abs(cache_span - avg_llc_val);
> - diff_percent = ((float)cache_span - avg_llc_val) / cache_span * 100;
> -
> - ret = platform && abs((int)diff_percent) > max_diff_percent &&
> -   (cmt ? (abs(avg_diff) > max_diff) : true);
> -
> - ksft_print_msg("%s Check cache miss rate within %lu%%\n",
> -ret ? "Fail:" : "Pass:", max_diff_percent);
> -
> - ksft_print_msg("Percent diff=%d\n", abs((int)diff_percent));
>   ksft_print_msg("Number of bits: %d\n", no_of_bits);
>   ksft_print_msg("Average LLC val: %lu\n", avg_llc_val);
> - ksft_print_msg("Cache span (%s): %zu\n", cmt ? "bytes" : "lines",
> + ksft_print_msg("Cache span (%s): %zu\n", !lines ? "bytes" : "lines",
>  cache_span);

I think it would be easier to read as: lines ? "lines" : "bytes"

Reinette

Re: [PATCH 16/24] selftests/resctrl: Rewrite Cache Allocation Technology (CAT) test

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> CAT test spawns two processes into two different control groups with
> exclusive schemata. Both the processes alloc a buffer from memory
> matching their allocated LLC block size and flush the entire buffer out
> of caches. Since the processes are reading through the buffer only once
> during the measurement and initially all the buffer was flushed, the
> test isn't testing CAT.
> 
> Rewrite the CAT test to allocate a buffer sized to half of LLC. Then
> perform a sequence of tests with different LLC alloc sizes starting
> from half of the CBM bits down to 1-bit CBM. Flush the buffer before
> each test and read the buffer twice. Observe the LLC misses on the
> second read through the buffer. As the allocated LLC block gets smaller
> and smaller, the LLC misses will become larger and larger giving a
> strong signal on CAT working properly.
> 
> The new CAT test is using only a single process because it relies on
> measured effect against another run of itself rather than another
> process adding noise. The rest of the system is allocated the CBM bits
> not used by the CAT test to keep the test isolated.
> 
> Replace count_bits() with count_contiguous_bits() to get the first bit
> position in order to be able to calculate masks based on it.
> 
> This change has been tested with a number of systems from different
> generations.

Thank you very much for doing this.

> 
> Suggested-by: Reinette Chatre 
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/cat_test.c  | 286 +---
>  tools/testing/selftests/resctrl/fill_buf.c  |   6 +-
>  tools/testing/selftests/resctrl/resctrl.h   |   5 +-
>  tools/testing/selftests/resctrl/resctrlfs.c |  44 +--
>  4 files changed, 137 insertions(+), 204 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
> b/tools/testing/selftests/resctrl/cat_test.c
> index e71690a9bbb3..7518c520c5cc 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -11,65 +11,68 @@
>  #include "resctrl.h"
>  #include 
>  
> -#define RESULT_FILE_NAME1"result_cat1"
> -#define RESULT_FILE_NAME2"result_cat2"
> +#define RESULT_FILE_NAME "result_cat"
>  #define NUM_OF_RUNS  5
> -#define MAX_DIFF_PERCENT 4
> -#define MAX_DIFF 100
>  
>  /*
> - * Change schemata. Write schemata to specified
> - * con_mon grp, mon_grp in resctrl FS.
> - * Run 5 times in order to get average values.
> + * Minimum difference in LLC misses between a test with n+1 bits CBM mask to
> + * the test with n bits. With e.g. 5 vs 4 bits in the CBM mask, the minimum
> + * difference must be at least MIN_DIFF_PERCENT_PER_BIT * (4 - 1) = 3 
> percent.

This formula is not clear to me. In the code the formula is always:
MIN_DIFF_PERCENT_PER_BIT * (bits - 1) ... is the "-1" because it always
decrements the number of bits tested by one? So, for example, if testing
5 then 3 bits it would be  MIN_DIFF_PERCENT_PER_BIT * (3 - 2)?
Would above example thus be:
MIN_DIFF_PERCENT_PER_BIT * (4 - (5 - 4)) = 3 ?



> + *
> + * The relationship between number of used CBM bits and difference in LLC
> + * misses is not expected to be linear. With a small number of bits, the
> + * margin is smaller than with larger number of bits. For selftest purposes,
> + * however, linear approach is enough because ultimately only pass/fail
> + * decision has to be made and distinction between strong and stronger
> + * signal is irrelevant.
>   */

...

>  /*
>   * cat_test: execute CAT benchmark and measure LLC cache misses
>   * @param:   parameters passed to cat_test()
>   * @span:buffer size for the benchmark
> + * @current_mask start mask for the first iteration
> + *
> + * Run CAT test, bits are removed one-by-one from the current_mask for each
> + * subsequent test.
>   *

Could this please be expanded to provide more details about how test works,
measurements taken, and how pass/fail is determined?

> - * Return:   0 on success. non-zero on failure.
> + * Return:   0 on success. non-zero on failure.

Is non-zero specific enough? Does that mean that <0 and >0 are failure?

>   */
> -static int cat_test(struct resctrl_val_param *param, size_t span)
> +static int cat_test(struct resctrl_val_param *param, size_t span, unsigned 
> long current_mask)
>  {
> - int memflush = 1, operation = 0, ret = 0;
>   char *resctrl_val = param->resctrl_val;
>   static struct perf_event_read pe_read;
>   struct perf_event_attr pea;
> + unsigned char *buf;
> + char schemata[64];
> + int ret, i, pe_fd;
&g

Re: [PATCH 23/24] selftests/resctrl: Add L2 CAT test

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> CAT selftests only cover L3 but some newer CPUs come also with L2 CAT
> support.

No need to use "new" language. L2 CAT has been available for a long time
... since Apollo Lake. Which systems actually support it is a different
topic. This is an architectural feature that has been available for a
long time. Whether a system supports it will be detected and the test
run based on that. 

> 
> Add L2 CAT selftest. As measuring L2 misses is not easily available
> with perf, use L3 accesses as a proxy for L2 CAT working or not.

I understand the exact measurement is not available but I do notice some
L2 related symbolic counters when I run "perf list". l2_rqsts.all_demand_miss
looks promising.

L3 cannot be relied on for those systems, like Apollo lake, that do
not have an L3.

> 
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/cat_test.c| 68 +--
>  tools/testing/selftests/resctrl/resctrl.h |  1 +
>  .../testing/selftests/resctrl/resctrl_tests.c |  1 +
>  3 files changed, 63 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
> b/tools/testing/selftests/resctrl/cat_test.c
> index 48a96acd9e31..a9c72022bb5a 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -131,8 +131,47 @@ void cat_test_cleanup(void)
>   remove(RESULT_FILE_NAME);
>  }
>  
> +/*
> + * L2 CAT test measures L2 misses indirectly using L3 accesses as a proxy
> + * because perf cannot directly provide the number of L2 misses (there are
> + * only platform specific ways to get the number of L2 misses).
> + *
> + * This function sets up L3 CAT to reduce noise from other processes during
> + * L2 CAT test.

This motivation is not clear to me. Does the same isolation used during L3 CAT
testing not work? I expected it to follow the same idea with the L2 cache split
in two, the test using one part and the rest of the system using the other.
Is that not enough isolation?

Reinette

Re: [PATCH 06/24] selftests/resctrl: Exclude shareable bits from schemata in CAT test

2023-11-02 Thread Reinette Chatre

Hi Ilpo,

On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> CAT test doesn't take shareable bits into account, i.e., the test might
> be sharing cache with some devices (e.g., graphics).
> 
> Introduce get_mask_no_shareable() and use it to provision an
> environment for CAT test where the allocated LLC is isolated better.
> Excluding shareable_bits may create hole(s) into the cbm_mask, thus add
> a new helper count_contiguous_bits() to find the longest contiguous set
> of CBM bits.
> 
> create_bit_mask() is needed by an upcoming CAT test rewrite so make it
> available in resctrl.h right away.
> 
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/cat_test.c  | 12 ++-
>  tools/testing/selftests/resctrl/resctrl.h   |  3 +
>  tools/testing/selftests/resctrl/resctrlfs.c | 84 +
>  3 files changed, 95 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
> b/tools/testing/selftests/resctrl/cat_test.c
> index 80861c362a53..e5861e7cba7e 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -92,13 +92,17 @@ int cat_perf_miss_val(int cpu_no, int n, char *cache_type)
>   unsigned long l_mask, l_mask_1;
>   int ret, pipefd[2], sibling_cpu_no;
>   unsigned long cache_total_size = 0;
> - unsigned long long_mask;
> + unsigned long full_cache_mask, long_mask;

Please do maintain reverse fir order.

>   int count_of_bits;
>   char pipe_message;
>   size_t span;
>  
>   /* Get default cbm mask for L3/L2 cache */
> - ret = get_cbm_mask(cache_type, &long_mask);
> + ret = get_cbm_mask(cache_type, &full_cache_mask);
> + if (ret)
> + return ret;
> + /* Get the exclusive portion of the cache */
> + ret = get_mask_no_shareable(cache_type, &long_mask);
>   if (ret)
>   return ret;
>  
> +
> +/*
> + * count_contiguous_bits - Returns the longest train of bits in a bit mask
> + * @val  A bit mask
> + * @startThe location of the least-significant bit of the longest train
> + *
> + * Return:   The length of the contiguous bits in the longest train of bits
> + */
> +static unsigned int count_contiguous_bits(unsigned long val, unsigned int 
> *start)
> +{
> + unsigned long last_val;
> + int count = 0;

could count be unsigned int?

> +
> + while (val) {
> + last_val = val;
> + val &= (val >> 1);
> + count++;
> + }
> +
> + if (start) {
> + if (count)
> + *start = ffsl(last_val) - 1;
> + else
> + *start = 0;
> + }
> +
> + return count;
> +}
> +
>  /*
>   * get_cbm_mask - Get cbm bit mask
>   * @cache_type:  Cache level L2/L3
> @@ -253,6 +291,52 @@ int get_cbm_mask(const char *cache_type, unsigned long 
> *mask)
>   return 0;
>  }
>  
> +/*
> + * get_shareable_mask - Get shareable mask from shareable_bits for given 
> cache
> + * @cache_type:  Cache level L2/L3
> + * @shareable_mask:  shareable mask returned as unsigned long

A nitpick but please be consistent in starting sentences with capital letter.

> + *
> + * Return: = 0 on success, < 0 on failure.
> + */
> +int get_shareable_mask(const char *cache_type, unsigned long *shareable_mask)
> +{
> + char mask_path[1024];
> +
> + if (!cache_type)
> + return -1;

Same question as earlier about error checking.

> +
> + snprintf(mask_path, sizeof(mask_path), "%s/%s/shareable_bits",
> +  INFO_PATH, cache_type);
> +
> + return get_bit_mask(mask_path, shareable_mask);
> +}
> +
> +/*
> + * get_mask_no_shareable - Get CBM mask without shareable_bits for given 
> cache
> + * @cache_type:  Cache level L2/L3
> + * @mask:mask returned as unsigned long

Apart from comment about capital letter this comment does not seem to
reveal more about what code does.

> + *
> + * Return: = 0 on success, < 0 on failure.
> + */
> +int get_mask_no_shareable(const char *cache_type, unsigned long *mask)
> +{
> + unsigned long full_mask, shareable_mask;
> + unsigned int start, len;
> +
> + if (get_cbm_mask(cache_type, &full_mask) < 0)
> + return -1;
> + if (get_shareable_mask(cache_type, &shareable_mask) < 0)
> + return -1;
> +
> + len = count_contiguous_bits(full_mask & ~shareable_mask, &start);
> + if (!len)
> + return -1;
> +
> + *mask = create_bit_mask(start, len);
> +
> + return 0;
> +}

Nicely done, thanks.

> +
>  /*
>   * get_core_sibling - Get sibling core id from the same socket for given CPU
>   * @cpu_no:  CPU number


Reinette

Re: [PATCH 05/24] selftests/resctrl: Create cache_size() helper

2023-11-03 Thread Reinette Chatre

Hi Ilpo,

On 11/3/2023 1:53 AM, Ilpo Järvinen wrote:
> 
> Yes, I'm more than happy to rename them. This naming was what you 
> suggested earlier. (I used cache_alloc_size() or something like that
> initially and you were against using "alloc" in the name.)

My apologies for giving poor guidance. I cannot recall the thread behind
that guidance but looking at these changes together the cache_size() and
get_cache_size() naming does seem close enough to cause confusion.

Reinette

Re: [PATCH 16/24] selftests/resctrl: Rewrite Cache Allocation Technology (CAT) test

2023-11-03 Thread Reinette Chatre

Hi Ilpo,

On 11/3/2023 3:57 AM, Ilpo Järvinen wrote:
> On Thu, 2 Nov 2023, Reinette Chatre wrote:
>> On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:

...

>>>  /*
>>> - * Change schemata. Write schemata to specified
>>> - * con_mon grp, mon_grp in resctrl FS.
>>> - * Run 5 times in order to get average values.
>>> + * Minimum difference in LLC misses between a test with n+1 bits CBM mask 
>>> to
>>> + * the test with n bits. With e.g. 5 vs 4 bits in the CBM mask, the minimum
>>> + * difference must be at least MIN_DIFF_PERCENT_PER_BIT * (4 - 1) = 3 
>>> percent.
>>
>> This formula is not clear to me. In the code the formula is always:
>> MIN_DIFF_PERCENT_PER_BIT * (bits - 1) ... is the "-1" because it always
>> decrements the number of bits tested by one?
> 
> No, -1 is not related to decrementing bits by one, but setting a boundary 
> which workds for 1 bit masks. In general,  the smaller the number of bits 
> in the allocation mask is, less change there will be between n+1 -> n bits
> results. When n is 1, the result with some platforms is close to zero so I 
> just had to make the min diff to allow it. Thus, n-1 to set the failure 
> threshold at 0%.
> 
>> So, for example, if testing
>> 5 then 3 bits it would be  MIN_DIFF_PERCENT_PER_BIT * (3 - 2)?
>> Would above example thus be:
>> MIN_DIFF_PERCENT_PER_BIT * (4 - (5 - 4)) = 3 ?
> 
> I suspect you're overthinking it here. The CAT selftest currently doesn't 
> jump from 5 to 3 bits so I don't know what you're trying to calculate 
> here.

I was trying to understand the formula. The example starts with "e.g 5 vs 4 bits
in the CBM mask ..." and then jumps to "MIN_DIFF_PERCENT_PER_BIT * (4 - 1) = 3"
It is not obvious to me where the "5" and "4" from the example produces the
resulting formula.

Perhaps it would be simpler (to me) to say something like 

Minimum difference in LLC misses between a test with n+1 bits CBM mask 
to
the test with n bits is MIN_DIFF_PERCENT_PER_BIT * (n - 1). With e.g. 5 
vs 4
bits in the CBM mask, the minimum difference must be at least
MIN_DIFF_PERCENT_PER_BIT * (4 - 1) = 3 percent.

> 
>>> - * Return: 0 on success. non-zero on failure.
>>> + * Return: 0 on success. non-zero on failure.
>>
>> Is non-zero specific enough? Does that mean that <0 and >0 are failure?
> 
> I suspect it is, after all the cleanups and fixes that have been done.
> The wording is from the original though.
> 
>>>   */
>>> -static int cat_test(struct resctrl_val_param *param, size_t span)
>>> +static int cat_test(struct resctrl_val_param *param, size_t span, unsigned 
>>> long current_mask)
>>>  {
>>> -   int memflush = 1, operation = 0, ret = 0;
>>> char *resctrl_val = param->resctrl_val;
>>> static struct perf_event_read pe_read;
>>> struct perf_event_attr pea;
>>> +   unsigned char *buf;
>>> +   char schemata[64];
>>> +   int ret, i, pe_fd;
>>> pid_t bm_pid;
>>> -   int pe_fd;
>>>  
>>> if (strcmp(param->filename, "") == 0)
>>> sprintf(param->filename, "stdio");
>>> @@ -143,54 +168,64 @@ static int cat_test(struct resctrl_val_param *param, 
>>> size_t span)
>>> if (ret)
>>> return ret;
>>>  
>>> +   buf = alloc_buffer(span, 1);
>>> +   if (buf == NULL)
>>> +   return -1;
>>> +
>>> perf_event_attr_initialize(&pea, PERF_COUNT_HW_CACHE_MISSES);
>>> perf_event_initialize_read_format(&pe_read);
>>>  
>>> -   /* Test runs until the callback setup() tells the test to stop. */
>>> -   while (1) {
>>> -   ret = param->setup(param);
>>> -   if (ret == END_OF_TESTS) {
>>> -   ret = 0;
>>> -   break;
>>> -   }
>>> -   if (ret < 0)
>>> -   break;
>>> -   pe_fd = perf_event_reset_enable(&pea, bm_pid, param->cpu_no);
>>> -   if (pe_fd < 0) {
>>> -   ret = -1;
>>> -   break;
>>> -   }
>>> +   while (current_mask) {
>>> +   snprintf(schemata, sizeof(schemata), "%lx", param->mask & 
>>> ~current_mask);
>>> +   ret = write_schemata("", schemata, param->cpu_no, 
>>> param->resctrl_val);
>>> +   if (ret)
>&

Re: [PATCH 17/24] selftests/resctrl: Create struct for input parameter

2023-11-03 Thread Reinette Chatre

Hi Ilpo,

On 11/3/2023 4:24 AM, Ilpo Järvinen wrote:
> On Thu, 2 Nov 2023, Reinette Chatre wrote:
>> On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
>>
>>> diff --git a/tools/testing/selftests/resctrl/mba_test.c 
>>> b/tools/testing/selftests/resctrl/mba_test.c
>>> index d3bf4368341e..5157a3f74fee 100644
>>> --- a/tools/testing/selftests/resctrl/mba_test.c
>>> +++ b/tools/testing/selftests/resctrl/mba_test.c
>>> @@ -141,13 +141,13 @@ void mba_test_cleanup(void)
>>> remove(RESULT_FILE_NAME);
>>>  }
>>>  
>>> -int mba_schemata_change(int cpu_no, const char * const *benchmark_cmd)
>>> +int mba_schemata_change(const struct user_params *uparams)
>>>  {
>>> struct resctrl_val_param param = {
>>> .resctrl_val= MBA_STR,
>>> .ctrlgrp= "c1",
>>> .mongrp = "m1",
>>> -   .cpu_no = cpu_no,
>>> +   .cpu_no = uparams->cpu,
>>> .filename   = RESULT_FILE_NAME,
>>> .bw_report  = "reads",
>>> .setup  = mba_setup
>>> @@ -156,7 +156,7 @@ int mba_schemata_change(int cpu_no, const char * const 
>>> *benchmark_cmd)
>>>  
>>> remove(RESULT_FILE_NAME);
>>>  
>>> -   ret = resctrl_val(benchmark_cmd, ¶m);
>>> +   ret = resctrl_val(uparams->benchmark_cmd, ¶m);
>>> if (ret)
>>> goto out;
>>>  
>>
>> How about a new member of struct resctrl_val_param that points to 
>> uparams? That would remove cpu_no from resctrl_val_param
>> and have everything available when a test needs to run ... not copying
>> some user parameters into struct resctrl_val_param and passing
>> others as parameters.
> 
> I'm a bit allergic to adding more stuff into resctrl_val_param. It seems 
> a structure where random stuff has been thrown at just because it exists.
> In general, your point is very valid though because the members of 
> resctrl_val_param should be auditted through to see how many of them are 
> even useful after adding uparams and struct resctrl_test.
> 
> I could get rid of copying parameters from uparams to params and just 
> passing uparams instead of benchmark_cmd into resctrl_val(). Would you be 
> okay with that?

I am ok with that. I assume this implies that cpu_no will be removed from
resctrl_val_param?

> Oh, and I really should rename resctrl_val() one day to something more 
> meaningful too. :-) (but it won't be part of this series and will likely 
> be another conflicty nightmare because resctrl_val_param too needs to 
> be renamed...).

"Naming only" changes that are not part of something more substantive are not
very appealing though.

Reinette

Re: [PATCH 18/24] selftests/resctrl: Introduce generalized test framework

2023-11-03 Thread Reinette Chatre

Hi Ilpo,

On 11/3/2023 2:54 AM, Ilpo Järvinen wrote:
> On Thu, 2 Nov 2023, Reinette Chatre wrote:
>> On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
>> ...
>>> diff --git a/tools/testing/selftests/resctrl/resctrl.h 
>>> b/tools/testing/selftests/resctrl/resctrl.h
>>> index ec6efd36f60a..e017adf1390d 100644
>>
>>> @@ -233,25 +183,26 @@ int main(int argc, char **argv)
>>> case 't':
>>> token = strtok(optarg, ",");
>>>  
>>> -   mbm_test = false;
>>> -   mba_test = false;
>>> -   cmt_test = false;
>>> -   cat_test = false;
>>> +   if (!test_param_seen) {
>>> +   for (i = 0; i < ARRAY_SIZE(resctrl_tests); i++)
>>> +   resctrl_tests[i]->disabled = true;
>>> +   tests = 0;
>>> +   test_param_seen = true;
>>> +   }
>>> while (token) {
>>> -   if (!strncmp(token, MBM_STR, sizeof(MBM_STR))) {
>>> -   mbm_test = true;
>>> -   tests++;
>>> -   } else if (!strncmp(token, MBA_STR, 
>>> sizeof(MBA_STR))) {
>>> -   mba_test = true;
>>> -   tests++;
>>> -   } else if (!strncmp(token, CMT_STR, 
>>> sizeof(CMT_STR))) {
>>> -   cmt_test = true;
>>> -   tests++;
>>> -   } else if (!strncmp(token, CAT_STR, 
>>> sizeof(CAT_STR))) {
>>> -   cat_test = true;
>>> -   tests++;
>>> -   } else {
>>> -   printf("invalid argument\n");
>>> +   bool found = false;
>>> +
>>> +   for (i = 0; i < ARRAY_SIZE(resctrl_tests); i++) 
>>> {
>>> +   if (!strcasecmp(token, 
>>> resctrl_tests[i]->name)) {
>>> +   if (resctrl_tests[i]->disabled)
>>> +   tests++;
>>> +   resctrl_tests[i]->disabled = 
>>> false;
>>> +   found = true;
>>> +   }
>>> +   }
>>
>> Could providing multiple "-t" result in the test count not
>> matching the number of tests run?
> 
> bool test_param_seen covers the case with more than one -t parameter.
> Because of it, the code above resets tests and ->disabled only when the 
> first -t is encountered. tests++ is only done when ->disabled is set from 
> true to false.
> 
> I don't see how they could get out of sync but if you had a more specific 
> case in mind, just let me know.
> 

Thank you for your detailed explanation. I can now see how this is safeguarded.

Reinette

Re: [PATCH 23/24] selftests/resctrl: Add L2 CAT test

2023-11-03 Thread Reinette Chatre

Hi Ilpo,

On 11/3/2023 3:39 AM, Ilpo Järvinen wrote:
> On Thu, 2 Nov 2023, Reinette Chatre wrote:
>> On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> 
>>> Add L2 CAT selftest. As measuring L2 misses is not easily available
>>> with perf, use L3 accesses as a proxy for L2 CAT working or not.
>>
>> I understand the exact measurement is not available but I do notice some
>> L2 related symbolic counters when I run "perf list". l2_rqsts.all_demand_miss
>> looks promising.
> 
> Okay, I was under impression that L2 misses are not available. Both based 
> on what you mentioned to me half an year ago and because of what flags I 
> found from the header. But I'll take another look into it.

You are correct that when I did L2 testing a long time ago I used
the model specific L2 miss counts. I was hoping that things have improved
so that model specific counters are not needed, as you have tried here.
I found the l2_rqsts symbol while looking for alternatives but I am not
familiar enough with perf to know how these symbolic names are mapped.
I was hoping that they could be a simple drop-in replacement to
experiment with.

> 
>> L3 cannot be relied on for those systems, like Apollo lake, that do
>> not have an L3.
> 
> Do you happen know what perf will report for such CPUs, will it return 
> L2 as LLC?

I don't know.

> 
>>> Signed-off-by: Ilpo Järvinen 
>>> ---
>>>  tools/testing/selftests/resctrl/cat_test.c| 68 +--
>>>  tools/testing/selftests/resctrl/resctrl.h |  1 +
>>>  .../testing/selftests/resctrl/resctrl_tests.c |  1 +
>>>  3 files changed, 63 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
>>> b/tools/testing/selftests/resctrl/cat_test.c
>>> index 48a96acd9e31..a9c72022bb5a 100644
>>> --- a/tools/testing/selftests/resctrl/cat_test.c
>>> +++ b/tools/testing/selftests/resctrl/cat_test.c
>>> @@ -131,8 +131,47 @@ void cat_test_cleanup(void)
>>> remove(RESULT_FILE_NAME);
>>>  }
>>>  
>>> +/*
>>> + * L2 CAT test measures L2 misses indirectly using L3 accesses as a proxy
>>> + * because perf cannot directly provide the number of L2 misses (there are
>>> + * only platform specific ways to get the number of L2 misses).
>>> + *
>>> + * This function sets up L3 CAT to reduce noise from other processes during
>>> + * L2 CAT test.
>>
>> This motivation is not clear to me. Does the same isolation used during 
>> L3 CAT testing not work? I expected it to follow the same idea with the 
>> L2 cache split in two, the test using one part and the rest of the 
>> system using the other. Is that not enough isolation?
> 
> Isolation for L2 is done very same way as with L3 and I think it itself 
> works just fine.
> 
> However, because L2 CAT selftest as is measures L3 accesses that in ideal 
> world equals to L2 misses, isolating selftest related L3 accesses from the 
> rest of the system should reduce noise in the # of L3 accesses. It's not 
> mandatory though so if L3 CAT is not available the function just prints a 
> warning about the potential noise and does setup nothing for L3.

This is not clear to me. If the read misses L2 and then accesses L3 then
it should not matter which part of L3 cache the work is isolated to. What noise
do you have in mind?

Reinette

Re: [PATCH 24/24] selftests/resctrl: Ignore failures from L2 CAT test with <= 2 bits

2023-11-03 Thread Reinette Chatre

Hi Ilpo,

On 11/3/2023 3:24 AM, Ilpo Järvinen wrote:
> On Thu, 2 Nov 2023, Reinette Chatre wrote:
>> On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
>>> L2 CAT test with low number of bits tends to occasionally fail because
>>> of what seems random variation. The margin is quite small to begin with
>>> for <= 2 bits in CBM. At times, the result can even become negative.
>>> While it would be possible to allow negative values for those cases, it
>>> would be more confusing to user.
>>>
>>> Ignore failures from the tests where <= 2 were used to avoid false
>>> negative results.
>>>
>>
>> I think the core message is that 2 or fewer bits should not be used. Instead
>> of running the test and ignoring the results the test should perhaps just not
>> be run.
> 
> I considered that but it often does work so it felt shame to now present
> them when they're successful. Then I just had to decide how to deal with
> the cases where they failed.
> 
> Also, if I make it to not run down to 1 bit, those numbers will never ever 
> be seen by anyone. It doesn't say 2 and 1 bit results don't contain any 
> information to a human reader who is able to do more informed decisions 
> whether something is truly working or not. We could, hypothetically, have 
> a HW issue one day which makes 1-bit L2 mask to misbehave and if the 
> number is never seen by anyone, it's extremely unlikely to be caught 
> easily.
> 
> They are just reliable enough for simple automated threshold currently. 
> Maybe something else than average value would be, it would need to be 
> explored but I suspect also the memory address of the buffer might affect 
> the value, with L3 it definitely should because of how the things work but 
> I don't know if that holds for L2 too. I have earlier tried playing with 
> the buffer addresses with L3 but as I didn't immediately yield positive 
> outcome to guard against outliers, I postponed that investigation (e.g., 
> my alloc pattern might have been too straightforward and didn't provide 
> enough entropy into the buffer start address because I just alloc'ed n x 
> buf_size buffers back-to-back).
> 
> But I don't have very strong opinion on this so if you prefer I just stop 
> at 3 bits, I can change it?
> 

We seem to have different users in mind when thinking about this. I was
considering the users that just run the selftest to get a pass/fail. You
seem to also consider folks using this for validation. I'm ok with keeping
this change to accommodate both.

Reinette

Re: [PATCH 23/24] selftests/resctrl: Add L2 CAT test

2023-11-06 Thread Reinette Chatre

Hi Ilpo,

On 11/6/2023 1:53 AM, Ilpo Järvinen wrote:
> On Fri, 3 Nov 2023, Reinette Chatre wrote:
>> On 11/3/2023 3:39 AM, Ilpo Järvinen wrote:
>>> On Thu, 2 Nov 2023, Reinette Chatre wrote:
>>>> On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
>>>
>>>>> Add L2 CAT selftest. As measuring L2 misses is not easily available
>>>>> with perf, use L3 accesses as a proxy for L2 CAT working or not.
>>>>
>>>> I understand the exact measurement is not available but I do notice some
>>>> L2 related symbolic counters when I run "perf list". 
>>>> l2_rqsts.all_demand_miss
>>>> looks promising.
>>>
>>> Okay, I was under impression that L2 misses are not available. Both based 
>>> on what you mentioned to me half an year ago and because of what flags I 
>>> found from the header. But I'll take another look into it.
>>
>> You are correct that when I did L2 testing a long time ago I used
>> the model specific L2 miss counts. I was hoping that things have improved
>> so that model specific counters are not needed, as you have tried here.
>> I found the l2_rqsts symbol while looking for alternatives but I am not
>> familiar enough with perf to know how these symbolic names are mapped.
>> I was hoping that they could be a simple drop-in replacement to
>> experiment with.
> 
> According to perf_event_open() manpage, mapping those symbolic names 
> requires libpfm so this would add a library dependency?

I do not see perf list using this library to determine the event and
umask but I am in unfamiliar territory. I'll have to spend some more
time here to determine options.

> 
>>>>> Signed-off-by: Ilpo Järvinen 
>>>>> ---
>>>>>  tools/testing/selftests/resctrl/cat_test.c| 68 +--
>>>>>  tools/testing/selftests/resctrl/resctrl.h |  1 +
>>>>>  .../testing/selftests/resctrl/resctrl_tests.c |  1 +
>>>>>  3 files changed, 63 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
>>>>> b/tools/testing/selftests/resctrl/cat_test.c
>>>>> index 48a96acd9e31..a9c72022bb5a 100644
>>>>> --- a/tools/testing/selftests/resctrl/cat_test.c
>>>>> +++ b/tools/testing/selftests/resctrl/cat_test.c
>>>>> @@ -131,8 +131,47 @@ void cat_test_cleanup(void)
>>>>>   remove(RESULT_FILE_NAME);
>>>>>  }
>>>>>  
>>>>> +/*
>>>>> + * L2 CAT test measures L2 misses indirectly using L3 accesses as a proxy
>>>>> + * because perf cannot directly provide the number of L2 misses (there 
>>>>> are
>>>>> + * only platform specific ways to get the number of L2 misses).
>>>>> + *
>>>>> + * This function sets up L3 CAT to reduce noise from other processes 
>>>>> during
>>>>> + * L2 CAT test.
>>>>
>>>> This motivation is not clear to me. Does the same isolation used during 
>>>> L3 CAT testing not work? I expected it to follow the same idea with the 
>>>> L2 cache split in two, the test using one part and the rest of the 
>>>> system using the other. Is that not enough isolation?
>>>
>>> Isolation for L2 is done very same way as with L3 and I think it itself 
>>> works just fine.
>>>
>>> However, because L2 CAT selftest as is measures L3 accesses that in ideal 
>>> world equals to L2 misses, isolating selftest related L3 accesses from the 
>>> rest of the system should reduce noise in the # of L3 accesses. It's not 
>>> mandatory though so if L3 CAT is not available the function just prints a 
>>> warning about the potential noise and does setup nothing for L3.
>>
>> This is not clear to me. If the read misses L2 and then accesses L3 then
>> it should not matter which part of L3 cache the work is isolated to. 
>> What noise do you have in mind?
> 
> The way it is currently done is to measure L3 accesses. If something else 
> runs at the same time as the CAT selftest, it can do mem accesses that 
> cause L3 accesses which is noise in the # of L3 accesses number since 
> those accesses were unrelated to the L2 CAT selftest.
> 

Creating a CAT allocation sets aside a portion of cache where a task/cpu
can allocation into cache, it does not prevent one task from accessing
the cache concurrently with another.

Reinette

Re: [PATCH 23/24] selftests/resctrl: Add L2 CAT test

2023-11-06 Thread Reinette Chatre

Hi Ilpo,

On 11/6/2023 9:03 AM, Reinette Chatre wrote:
> On 11/6/2023 1:53 AM, Ilpo Järvinen wrote:
>> On Fri, 3 Nov 2023, Reinette Chatre wrote:
>>> On 11/3/2023 3:39 AM, Ilpo Järvinen wrote:
>>>> On Thu, 2 Nov 2023, Reinette Chatre wrote:
>>>>> On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
>>>>
>>>>>> Add L2 CAT selftest. As measuring L2 misses is not easily available
>>>>>> with perf, use L3 accesses as a proxy for L2 CAT working or not.
>>>>>
>>>>> I understand the exact measurement is not available but I do notice some
>>>>> L2 related symbolic counters when I run "perf list". 
>>>>> l2_rqsts.all_demand_miss
>>>>> looks promising.
>>>>
>>>> Okay, I was under impression that L2 misses are not available. Both based 
>>>> on what you mentioned to me half an year ago and because of what flags I 
>>>> found from the header. But I'll take another look into it.
>>>
>>> You are correct that when I did L2 testing a long time ago I used
>>> the model specific L2 miss counts. I was hoping that things have improved
>>> so that model specific counters are not needed, as you have tried here.
>>> I found the l2_rqsts symbol while looking for alternatives but I am not
>>> familiar enough with perf to know how these symbolic names are mapped.
>>> I was hoping that they could be a simple drop-in replacement to
>>> experiment with.
>>
>> According to perf_event_open() manpage, mapping those symbolic names 
>> requires libpfm so this would add a library dependency?
> 
> I do not see perf list using this library to determine the event and
> umask but I am in unfamiliar territory. I'll have to spend some more
> time here to determine options.

tools/perf/pmu-events/README cleared it up for me. The architecture specific
tables are included in the perf binary. Potentially pmu-events.h could be
included or the test could just stick with the architectural events.
A quick look at the various cache.json files created the impression that
the events of interest may actually have the same event code and umask across
platforms.
I am not familiar with libpfm. This can surely be considered if it supports
this testing. Several selftests have library dependencies.

Reinette

Re: [PATCH 23/24] selftests/resctrl: Add L2 CAT test

2023-11-08 Thread Reinette Chatre

Hi Ilpo,

On 11/7/2023 1:33 AM, Ilpo Järvinen wrote:
> man perf_event_open() says this:
> 
> "If type is PERF_TYPE_RAW, then a custom "raw" config  value  is  needed.
> Most  CPUs  support  events  that  are  not covered by the "generalized"
> events.  These are implementation defined; see your CPU manual (for  ex-
> ample  the  Intel Volume 3B documentation or the AMD BIOS and Kernel De-
> veloper Guide).  The libpfm4 library can be used to translate  from  the
> name in the architectural manuals to the raw hex value perf_event_open()
> expects in this field."
> 
> ...I've not come across libpfm myself either but to me it looks libpfm 
> bridges between those architecture specific tables and perf_event_open(). 
> That is, it could provide the binary value necessary in constructing the 
> perf_event_attr struct.
> 
> I think this is probably the function which maps string -> 
> perf_event_attr:
> 
> https://man7.org/linux/man-pages/man3/pfm_get_os_event_encoding.3.html
> 

This sounds promising. If this works out I think that it would be ideal if
the L2 CAT test is not blocked by absence of libpfm. That is, the resctrl
tests should not fail to build if libpfm is not present but instead
L2 CAT just turns into a simple functional test. To accomplish this it looks
like tools/build/Makefile.feature can be helpful and already has a check
for libpfm.

Reinette

Re: [PATCH v2 01/26] selftests/resctrl: Don't use ctrlc_handler() outside signal handling

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> perf_event_open_llc_miss() calls ctrlc_handler() to cleanup if
> perf_event_open() returns an error. Those cleanups, however, are not
> the responsability of perf_event_open_llc_miss() and it thus interferes

responsibility

> unnecessarily with the usual cleanup pattern. Worse yet,
> ctrlc_handler() calls exit() in the end preventing the ordinary cleanup
> done in the calling function from executing.
> 
> ctrlc_handler() should only be used as a signal handler, not during
> normal error handling.
> 
> Remove call to ctrlc_handler() from perf_event_open_llc_miss(). As
> unmounting resctrlfs and test cleanup are already handled properly
> by error rollbacks in the calling functions, no other changes are
> necessary.
> 
> Suggested-by: Reinette Chatre 
> Signed-off-by: Ilpo Järvinen 

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 02/26] selftests/resctrl: Split fill_buf to allow tests finer-grained control

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> MBM, MBA and CMT test cases call run_fill_buf() that in turn calls
> fill_cache() to alloc and loop indefinitely around the buffer. This
> binds buffer allocation and running the benchmark into a single bundle
> so that a selftest cannot allocate a buffer once and reuse it. CAT test
> doesn't want to loop around the buffer continuously and after rewrite
> it needs the ability to allocate the buffer separately.
> 
> Split buffer allocation out of fill_cache() into alloc_buffer(). This
> change is part of preparation for the new CAT test that allocates a
> buffer and does multiple passes over the same buffer (but not in an
> infinite loop).
> 
> Co-developed-by: Fenghua Yu 
> Signed-off-by: Fenghua Yu 
> Signed-off-by: Ilpo Järvinen 
> ---

Could you please list the changes to a patch in this area instead of
lumping all in the cover letter? Without this it is not clear what,
if anything, changed in the new version.

>  tools/testing/selftests/resctrl/fill_buf.c | 26 +-
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/fill_buf.c 
> b/tools/testing/selftests/resctrl/fill_buf.c
> index 0d425f26583a..6f32f44128e1 100644
> --- a/tools/testing/selftests/resctrl/fill_buf.c
> +++ b/tools/testing/selftests/resctrl/fill_buf.c
> @@ -135,33 +135,37 @@ static int fill_cache_write(unsigned char *buf, size_t 
> buf_size, bool once)
>   return 0;
>  }
>  
> -static int fill_cache(size_t buf_size, int memflush, int op, bool once)
> +static unsigned char *alloc_buffer(size_t buf_size, int memflush)
>  {
>   unsigned char *buf;
> - int ret;
>  
>   buf = malloc_and_init_memory(buf_size);
>   if (!buf)
> - return -1;
> + return NULL;
>  
>   /* Flush the memory before using to avoid "cache hot pages" effect */
>   if (memflush)
>   mem_flush(buf, buf_size);
>  
> + return buf;
> +}
> +
> +static int fill_cache(size_t buf_size, int memflush, int op, bool once)
> +{
> + unsigned char *buf;
> + int ret;
> +
> + buf = alloc_buffer(buf_size, memflush);
> + if (!buf)
> + return -1;
> +
>   if (op == 0)
>   ret = fill_cache_read(buf, buf_size, once);
>   else
>   ret = fill_cache_write(buf, buf_size, once);
> -
>   free(buf);
>  
> - if (ret) {
> - printf("\n Error in fill cache read/write...\n");
> - return -1;
> - }
> -

The changelog does not motivate the removal of this error message.
It seems ok at this point since the only failing functions already
print their own error message. Without a motivation of this change
it is not clear if it is actually intended.

In any case, this looks good.

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 03/26] selftests/resctrl: Refactor fill_buf functions

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> There are unnecessary nested calls in fill_buf.c:
>   - run_fill_buf() calls fill_cache()
>   - alloc_buffer() calls malloc_and_init_memory()
> 
> Simplify the code flow and remove those unnecessary call levels by
> moving the called code inside the calling function.
> 
> Resolve the difference in run_fill_buf() and fill_cache() parameter
> name into 'buf_size' which is more descriptive than 'span'. Also, while
> moving the allocation related code, rename 'p' into 'buf' to be
> consistent in naming the variables.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 04/26] selftests/resctrl: Refactor get_cbm_mask() and rename to get_full_cbm()

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> Callers of get_cbm_mask() are required to pass a string into which the
> capacity bitmask (CBM) is read. Neither CAT nor CMT tests need the
> bitmask as string but just convert it into an unsigned long value.
> 
> Another limitation is that the bit mask reader can only read
> .../cbm_mask files.
> 
> Generalize the bit mask reading function into get_bit_mask() such that
> it can be used to handle other files besides the .../cbm_mask and
> handles the unsigned long conversion within get_bit_mask() using
> fscanf(). Change get_cbm_mask() to use get_bit_mask() and rename it to
> get_full_cbm() to better indicates what the function does.

"to better indicates" -> "to better indicate"

> 
> Also mark cache_type const while at it and remove useless comments that
> are related to processing of CBM bits.
> 
> Co-developed-by: Fenghua Yu 
> Signed-off-by: Fenghua Yu 
> Signed-off-by: Ilpo Järvinen 
> ---

...

> @@ -229,6 +228,32 @@ int get_cbm_mask(char *cache_type, char *cbm_mask)
>   return 0;
>  }
>  
> +/*
> + * get_full_cbm - Get full Cache Bit Mask (CBM)
> + * @cache_type:  Cache type as "L2" or "L3"
> + * @mask:Full cache bit mask representing the maximal portion of cache
> + *   available for allocation, returned as unsigned long.
> + *
> + * Return: = 0 on success, < 0 on failure.
> + */
> +int get_full_cbm(const char *cache_type, unsigned long *mask)
> +{
> + char cbm_path[PATH_MAX];
> + int ret;
> +
> + if (!cache_type)
> + return -1;
> +
> + snprintf(cbm_path, sizeof(cbm_path), "%s/%s/cbm_mask",
> +  INFO_PATH, cache_type);
> +
> + ret = get_bit_mask(cbm_path, mask);
> + if (ret)
> + return -1;
> +
> + return 0;

Can this just be "return get_bit_mask()" ? But actually, I would
like to propose that this also returns a failure if the returned
mask is zero. This would make the code more robust, especially looking
ahead at utility like cache_portion_size() that divides by this full
mask.

Reinette

Re: [PATCH v2 08/26] selftests/resctrl: Split measure_cache_vals()

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> measure_cache_vals() does a different thing depending on the test case
> that called it:
>   - For CAT, it measures LLC misses through perf.
>   - For CMT, it measures LLC occupancy through resctrl.
> 
> Split these two functionalities into own functions the CAT and CMT
> tests can call directly. Replace passing the struct resctrl_val_param
> parameter with the filename because it's more generic and all those
> functions need out of resctrl_val.
> 
> Co-developed-by: Fenghua Yu 
> Signed-off-by: Fenghua Yu 
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/cache.c   | 66 ---
>  tools/testing/selftests/resctrl/resctrl.h |  2 +-
>  tools/testing/selftests/resctrl/resctrl_val.c |  2 +-
>  3 files changed, 43 insertions(+), 27 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cache.c 
> b/tools/testing/selftests/resctrl/cache.c
> index 8aa6d67db978..129d1c293518 100644
> --- a/tools/testing/selftests/resctrl/cache.c
> +++ b/tools/testing/selftests/resctrl/cache.c
> @@ -147,7 +147,7 @@ static int get_llc_occu_resctrl(unsigned long 
> *llc_occupancy)
>   *
>   * Return:   0 on success. non-zero on failure.
>   */
> -static int print_results_cache(char *filename, int bm_pid,
> +static int print_results_cache(const char *filename, int bm_pid,
>  unsigned long llc_value)
>  {
>   FILE *fp;
> @@ -169,35 +169,51 @@ static int print_results_cache(char *filename, int 
> bm_pid,
>   return 0;
>  }
>  
> -int measure_cache_vals(struct resctrl_val_param *param, int bm_pid)
> +/*
> + * perf_event_measure - Measure perf events
> + * @filename:Filename for writing the results
> + * @bm_pid:  PID that runs the benchmark
> + *
> + * Measures perf events (e.g., cache misses) and writes the results into
> + * @filename. @bm_pid is written to the results file along with the measured
> + * value.
> + *
> + * Return: =0 on success. <0 on failure.

I do not think this is accurate. It looks like this function returns
the return value of print_results_cache() which returns errno on failure.
If this is the case then I think this proves that returning a
positive integer on failure should be avoided since it just creates
traps.

> + */
> +static int perf_event_measure(const char *filename, int bm_pid)
>  {
> - unsigned long llc_perf_miss = 0, llc_occu_resc = 0, llc_value = 0;
> + unsigned long llc_perf_miss = 0;
>   int ret;
>  
> - /*
> -  * Measure cache miss from perf.
> -  */
> - if (!strncmp(param->resctrl_val, CAT_STR, sizeof(CAT_STR))) {
> - ret = get_llc_perf(&llc_perf_miss);
> - if (ret < 0)
> - return ret;
> - llc_value = llc_perf_miss;
> - }
> + ret = get_llc_perf(&llc_perf_miss);
> + if (ret < 0)
> + return ret;
>  
> - /*
> -  * Measure llc occupancy from resctrl.
> -  */
> - if (!strncmp(param->resctrl_val, CMT_STR, sizeof(CMT_STR))) {
> - ret = get_llc_occu_resctrl(&llc_occu_resc);
> - if (ret < 0)
> - return ret;
> - llc_value = llc_occu_resc;
> - }
> - ret = print_results_cache(param->filename, bm_pid, llc_value);
> - if (ret)
> + ret = print_results_cache(filename, bm_pid, llc_perf_miss);
> + return ret;
> +}

Perhaps print_results_cache() can be made to return negative error
and this just be "return print_results_cache(...)" and the function
comment be accurate?

> +
> +/*
> + * measure_llc_resctrl - Measure resctrl llc value from resctrl

llc -> LLC

> + * @filename:Filename for writing the results
> + * @bm_pid:  PID that runs the benchmark
> + *
> + * Measures llc occupancy from resctrl and writes the results into @filename.

llc -> LLC

> + * @bm_pid is written to the results file along with the measured value.
> + *
> + * Return: =0 on success. <0 on failure.

same issue ?


Reinette

Re: [PATCH v2 10/26] selftests/resctrl: Remove unnecessary __u64 -> unsigned long conversion

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> Perf counters are __u64 but the code converts them to unsigned long
> before printing them out.
> 
> Remove unnecessary type conversion and retain the value as __u64 whole
> the time.

"whole the time" -> "the whole time"

Even so, this does not seem quite accurate since not all callers of
show_cache_info() and print_results_cache() use __u64. This seems ok
but it would be nicer if this was highlighted and not use "whole time".

> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 14/26] selftests/resctrl: Convert perf related globals to locals

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> Perf related variables pea_llc_miss, pe_read, and pe_fd are globals in
> cache.c.
> 
> Convert them to locals for better scoping and make pea_llc_miss simpler
> by renaming it to pea. Make close(pe_fd) handling easier to understand
> by doing it inside cat_val().
> 
> Make also sizeof()s use safer way to determine the right struct.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 15/26] selftests/resctrl: Move cat_val() to cat_test.c and rename to cat_test()

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> The main CAT test function is called cat_val() and resides in cache.c
> which is illogical.
> 
> Rename the function to cat_test() and move it into cat_test.c.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 16/26] selftests/resctrl: Open perf fd before start & add error handling

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> Perf fd (pe_fd) is opened, reset, and enabled during every test the CAT
> selftest runs. Also, ioctl(pe_fd, ...) calls are not error checked even
> if ioctl() could return an error.
> 
> Open perf fd only once before the tests and only reset and enable the
> counter within the test loop. Add error checking to pe_fd ioctl()
> calls.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 17/26] selftests/resctrl: Replace file write with volatile variable

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> The fill_buf code prevents compiler optimizating the entire read loop
> away by writing the final value of the variable into a file. While it
> achieves the goal, writing into a file requires significant amount of
> work within the innermost test loop and also error handling.
> 
> A simpler approach is to take advantage of volatile. Writing to a
> variable through a volatile pointer is enough to prevent compiler from
> optimizing the write away, and therefore compiler cannot remove the
> read loop either.
> 
> Add a volatile 'value_sink' into resctrl_tests.c and make fill_buf to
> write into it. As a result, the error handling in fill_buf.c can be
> simplified.
> 

The subject and changelog describes the need for a volatile variable.
The patch introduces two volatile variables. Could you please elaborate
why two volatile variables are needed?

Reinette

Re: [PATCH v2 18/26] selftests/resctrl: Read in less obvious order to defeat prefetch optimizations

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> When reading memory in order, HW prefetching optimizations will
> interfere with measuring how caches and memory are being accessed. This
> adds noise into the results.
> 
> Change the fill_buf reading loop to not use an obvious in-order access
> using multiply by a prime and modulo.
> 
> Using a prime multiplier with modulo ensures the entire buffer is
> eventually read. 23 is small enough that the reads are spread out but
> wrapping does not occur very frequently (wrapping too often can trigger
> L2 hits more frequently which causes noise to the test because getting
> the data from LLC is not required).
> 
> It was discovered that not all primes work equally well and some can
> cause wildly unstable results (e.g., in an earlier version of this
> patch, the reads were done in reversed order and 59 was used as the
> prime resulting in unacceptably high and unstable results in MBA and
> MBM test on some architectures).
> 
> Link: 
> https://lore.kernel.org/linux-kselftest/tyapr01mb6330025b5e6537f94da49acb8b...@tyapr01mb6330.jpnprd01.prod.outlook.com/
> Signed-off-by: Ilpo Järvinen 
> ---

I am not very comfortable with all the uncertainty involved in this
patch. A consolation is that this is surely an improvement.

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 19/26] selftests/resctrl: Rewrite Cache Allocation Technology (CAT) test

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> CAT test spawns two processes into two different control groups with
> exclusive schemata. Both the processes alloc a buffer from memory
> matching their allocated LLC block size and flush the entire buffer out
> of caches. Since the processes are reading through the buffer only once
> during the measurement and initially all the buffer was flushed, the
> test isn't testing CAT.
> 
> Rewrite the CAT test to allocate a buffer sized to half of LLC. Then
> perform a sequence of tests with different LLC alloc sizes starting
> from half of the CBM bits down to 1-bit CBM. Flush the buffer before
> each test and read the buffer twice. Observe the LLC misses on the
> second read through the buffer. As the allocated LLC block gets smaller
> and smaller, the LLC misses will become larger and larger giving a
> strong signal on CAT working properly.
> 
> The new CAT test is using only a single process because it relies on
> measured effect against another run of itself rather than another
> process adding noise. The rest of the system is set to use the CBM bits
> not used by the CAT test to keep the test isolated.
> 
> Replace count_bits() with count_contiguous_bits() to get the first bit
> position in order to be able to calculate masks based on it.
> 
> This change has been tested with a number of systems from different
> generations.
> 
> Suggested-by: Reinette Chatre 
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/cat_test.c  | 282 +---
>  tools/testing/selftests/resctrl/fill_buf.c  |   6 +-
>  tools/testing/selftests/resctrl/resctrl.h   |   5 +-
>  tools/testing/selftests/resctrl/resctrlfs.c |  44 +--
>  4 files changed, 138 insertions(+), 199 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
> b/tools/testing/selftests/resctrl/cat_test.c
> index cfda87667b46..4169b17b8f91 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -11,65 +11,69 @@
>  #include "resctrl.h"
>  #include 
>  
> -#define RESULT_FILE_NAME1"result_cat1"
> -#define RESULT_FILE_NAME2"result_cat2"
> +#define RESULT_FILE_NAME "result_cat"
>  #define NUM_OF_RUNS  5
> -#define MAX_DIFF_PERCENT 4
> -#define MAX_DIFF 100
>  
>  /*
> - * Change schemata. Write schemata to specified
> - * con_mon grp, mon_grp in resctrl FS.
> - * Run 5 times in order to get average values.
> + * Minimum difference in LLC misses between a test with n+1 bits CBM to the
> + * test with n bits is MIN_DIFF_PERCENT_PER_BIT * (n - 1). With e.g. 5 vs 4
> + * bits in the CBM mask, the minimum difference must be at least
> + * MIN_DIFF_PERCENT_PER_BIT * (4 - 1) = 3 percent.
> + *
> + * The relationship between number of used CBM bits and difference in LLC
> + * misses is not expected to be linear. With a small number of bits, the
> + * margin is smaller than with larger number of bits. For selftest purposes,
> + * however, linear approach is enough because ultimately only pass/fail
> + * decision has to be made and distinction between strong and stronger
> + * signal is irrelevant.
>   */
> -static int cat_setup(struct resctrl_val_param *p)
> -{
> - char schemata[64];
> - int ret = 0;
> -
> - /* Run NUM_OF_RUNS times */
> - if (p->num_of_runs >= NUM_OF_RUNS)
> - return END_OF_TESTS;
> -
> - if (p->num_of_runs == 0) {
> - sprintf(schemata, "%lx", p->mask);
> - ret = write_schemata(p->ctrlgrp, schemata, p->cpu_no,
> -  p->resctrl_val);
> - }
> - p->num_of_runs++;
> -
> - return ret;
> -}
> +#define MIN_DIFF_PERCENT_PER_BIT 1
>  
>  static int show_results_info(__u64 sum_llc_val, int no_of_bits,
> -  unsigned long cache_span, unsigned long max_diff,
> -  unsigned long max_diff_percent, unsigned long 
> num_of_runs,
> -  bool platform)
> +  unsigned long cache_span, long min_diff_percent,

With all care taken in unsigned use I wonder why min_diff_percent is just long?

> +  unsigned long num_of_runs, bool platform,
> +  __s64 *prev_avg_llc_val)
>  {
>   __u64 avg_llc_val = 0;
> - float diff_percent;
> - int ret;
> + float avg_diff;
> + int ret = 0;
>  
>   avg_llc_val = sum_llc_val / num_of_runs;
> - diff_percent = ((float)cache_span - avg_llc_val) / cache_span * 100;
> + if (*prev_avg_llc_val)

Re: [PATCH v2 20/26] selftests/resctrl: Create struct for input parameters

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:

...

>  
> +static void init_user_params(struct user_params *uparams)
> +{
> + uparams->cpu = 1;
> + uparams->bits = 0;
> +}
> +
>  int main(int argc, char **argv)
>  {
>   bool mbm_test = true, mba_test = true, cmt_test = true;
> - const char *benchmark_cmd[BENCHMARK_ARGS] = {};
> - int c, cpu_no = 1, i, no_of_bits = 0;
> + struct user_params uparams = {};
>   char *span_str = NULL;
>   bool cat_test = true;
>   int tests = 0;
> - int ret;
> + int ret, c, i;
> +
> + init_user_params(&uparams);
>  

It is unexpected that this uses a combination of designated
initializer (via {}) as well as an explicit init function to
initialize the struct. I think it would be simpler to do
all initialization in the init function.

Reinette

Re: [PATCH v2 21/26] selftests/resctrl: Introduce generalized test framework

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
...

> +
> +static bool cmt_feature_check(const struct resctrl_test *test)
> +{
> + return validate_resctrl_feature_request("L3_MON", "llc_occupancy") &&
> +validate_resctrl_feature_request("L3", NULL);
> +}
> +
...

> +
> +static bool mba_feature_check(const struct resctrl_test *test)
> +{
> + return test_resource_feature_check(test) &&
> +validate_resctrl_feature_request("L3_MON", "mbm_local_bytes");
> +}
> +

Could cmt_feature_check() not also use test_resource_feature_check(test)?
Why are cmt_feature_check() and mba_feature_check() different in this regard?

...

>  
> +/*
> + * resctrl_test: resctrl test definition
> + * @name:Test name
> + * @resource:Resource to test (e.g., MB, L3, L2, etc.)
> + * @vendor_specific: Bitmask for vendor-specific tests (can be 0 for 
> universal tests)

I do not think these values were originally intended to be used in
a bitmask. The current values do make this possible but I would like to
suggest that their definition gets a comment to highlight how those
values are used.

The rest looks good to me. This is a good addition. Thank you.

Reinette

Re: [PATCH v2 22/26] selftests/resctrl: Pass write_schemata() resource instead of test name

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> write_schemata() takes the test name as an argument and determines the
> relevant resource based on the test name. Such mapping from name to
> resource does not really belong to resctrlfs.c that should provide
> only generic, test-independent functions.
> 
> Pass the resource stored in the test information structure to
> write_schemata() instead of the test name. The new API is also more
> flexible as it enables to use write_schemata() for more than one
> resource within a test.
> 
> While touching the sprintf(), move the unnecessary %c that is always
> '=' directly into the format string.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

...

>   break;
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c 
> b/tools/testing/selftests/resctrl/resctrlfs.c
> index b711326b2141..fda5ad812faa 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -502,23 +502,17 @@ int write_bm_pid_to_resctrl(pid_t bm_pid, char 
> *ctrlgrp, char *mongrp,
>   * @ctrlgrp: Name of the con_mon grp
>   * @schemata:Schemata that should be updated to
>   * @cpu_no:  CPU number that the benchmark PID is binded to
> - * @resctrl_val: Resctrl feature (Eg: mbm, mba.. etc)
> + * @resource:Resctrl resource (Eg: MB, L3, L2, etc.)
>   *
>   * Update schemata of a con_mon grp *only* if requested resctrl feature is
>   * allocation type
>   *

Note above there remains a usage of "feature" that has just been changed to 
"resource".

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 23/26] selftests/resctrl: Add helper to convert L2/3 to integer

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> "L2"/"L3" conversion to integer is embedded into get_cache_size()
> which prevents reuse.
> 
> Create a helper for the cache string to integer conversion to make
> it reusable.
> 
> Signed-off-by: Ilpo Järvinen 
> ---
>  tools/testing/selftests/resctrl/resctrlfs.c | 28 +++--
>  1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c 
> b/tools/testing/selftests/resctrl/resctrlfs.c
> index fda5ad812faa..38ca3ae562e9 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -94,6 +94,23 @@ int umount_resctrlfs(void)
>   return 0;
>  }
>  
> +/*
> + * get_cache_level - Convert cache level from string to integer
> + * @cache_type:  Cache level as string
> + *
> + * Return: cache level as integer or -1 if @cache_type is invalid.
> + */
> +static int get_cache_level(const char *cache_type)
> +{
> + if (!strcmp(cache_type, "L3"))
> + return 3;
> + if (!strcmp(cache_type, "L2"))
> + return 2;
> +
> + perror("Invalid cache level");

I know that you are just copying code here but this usage of
perror() does not look right. strcmp() does not set errno, it
does not "fail".

Reinette

Re: [PATCH v2 25/26] selftests/resctrl: Get domain id from cache id

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> Domain id is acquired differently depending on CPU. AMD tests use id
> from L3 cache, whereas CPUs from other vendors base the id on topology
> package id. In order to support L2 CAT test, this has to be
> generalized.
> 
> The driver side code seems to get the domain ids from cache ids so the
> approach used by the AMD branch seems to match the kernel-side code. It
> will also work with L2 domain IDs as long as the cache level is
> generalized.
> 
> Using the topology id was always fragile due to mismatch with the
> kernel-side way to acquire the domain id. It got incorrect domain id,
> e.g., when Cluster-on-Die (CoD) is enabled for CPU (but CoD is not well
> suited for resctrl in the first place so it has not been a big issue if
> tests don't work correctly with it).
> 
> Taking all the above into account, generalize acquiring the domain id
> by taking it from the cache id and do not hard-code the cache level.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v2 26/26] selftests/resctrl: Add test groups and name L3 CAT test L3_CAT

2023-11-28 Thread Reinette Chatre

Hi Ilpo,

On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:
> To select test to run -t parameter can be used. However, -t cat
> currently maps to L3 CAT test which is confusing after more CAT related
> tests are added.
> 
> Allow selecting tests as groups and call L3 CAT test "L3_CAT", "CAT"
> group will enable all CAT related tests.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Could you please defer this patch to accompany the series that
introduces other CAT related tests?

Reinette

Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events

2023-12-05 Thread Reinette Chatre

Hi Peter,

On 12/1/2023 12:56 PM, Peter Newman wrote:
> Hi Reinette,
> 
> On Tue, May 16, 2023 at 5:06 PM Reinette Chatre
>  wrote:
>> On 5/15/2023 7:42 AM, Peter Newman wrote:
>>>
>>> I used a simple parent-child pipe loop benchmark with the parent in
>>> one monitoring group and the child in another to trigger 2M
>>> context-switches on the same CPU and compared the sample-based
>>> profiles on an AMD and Intel implementation. I used perf diff to
>>> compare the samples between hard and soft RMID switches.
>>>
>>> Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz:
>>>
>>>   +44.80%  [kernel.kallsyms]  [k] __rmid_read
>>> 10.43% -9.52%  [kernel.kallsyms]  [k] __switch_to
>>>
>>> AMD EPYC 7B12 64-Core Processor:
>>>
>>>   +28.27%  [kernel.kallsyms]  [k] __rmid_read
>>> 13.45%-13.44%  [kernel.kallsyms]  [k] __switch_to
>>>
>>> Note that a soft RMID switch that doesn't change CLOSID skips the
>>> PQR_ASSOC write completely, so from this data I can roughly say that
>>> __rmid_read() is a little over 2x the length of a PQR_ASSOC write that
>>> changes the current RMID on the AMD implementation and about 4.5x
>>> longer on Intel.
>>>
>>> Let me know if this clarifies the cost enough or if you'd like to also
>>> see instrumented measurements on the individual WRMSR/RDMSR
>>> instructions.
>>
>> I can see from the data the portion of total time spent in __rmid_read().
>> It is not clear to me what the impact on a context switch is. Is it
>> possible to say with this data that: this solution makes a context switch
>> x% slower?
>>
>> I think it may be optimistic to view this as a replacement of a PQR write.
>> As you point out, that requires that a CPU switches between tasks with the
>> same CLOSID. You demonstrate that resctrl already contributes a significant
>> delay to __switch_to - this work will increase that much more, it has to
>> be clear about this impact and motivate that it is acceptable.
> 
> We were operating under the assumption that if the overhead wasn't
> acceptable, we would have heard complaints about it by now, but we
> ultimately learned that this feature wasn't deployed as much as we had
> originally thought on AMD hardware and that the overhead does need to
> be addressed.
> 
> I am interested in your opinion on two options I'm exploring to
> mitigate the overhead, both of which depend on an API like the one
> Babu recently proposed for the AMD ABMC feature [1], where a new file
> interface will allow the user to indicate which mon_groups are
> actively being measured. I will refer to this as "assigned" for now,
> as that's the current proposal.
> 
> The first is likely the simpler approach: only read MBM event counters
> which have been marked as "assigned" in the filesystem to avoid paying
> the context switch cost on tasks in groups which are not actively
> being measured. In our use case, we calculate memory bandwidth on
> every group every few minutes by reading the counters twice, 5 seconds
> apart. We would just need counters read during this 5-second window.

I assume that tasks within a monitoring group can be scheduled on any
CPU and from the cover letter of this work I understand that only an
RMID assigned to a processor can be guaranteed to be tracked by hardware.

Are you proposing for this option that you keep this "soft RMID" approach
with CPUs  permanently assigned a "hard RMID" but only update the counts for a
"soft RMID" that is "assigned"? I think that means that the context
switch cost for the monitored group would increase even more than with the
implementation in this series since the counters need to be read on context
switch in as well as context switch out.

If I understand correctly then only one monitoring group can be measured
at a time. If such a measurement takes 5 seconds then theoretically 12 groups
can be measured in one minute. It may be possible to create many more
monitoring groups than this. Would it be possible to reach monitoring
goals in your environment?

> 
> The second involves avoiding the situation where a hardware counter
> could be deallocated: Determine the number of simultaneous RMIDs
> supported, reduce the effective number of RMIDs available to that
> number. Use the default RMID (0) for all "unassigned" monitoring

hmmm ... so on the one side there is "only the RMID within the PQR
register can be guaranteed to be tracked by hardware" and on the 
other side there is "A given implementation may have in

Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events

2023-12-05 Thread Reinette Chatre

Hi Peter,

On 12/5/2023 4:33 PM, Peter Newman wrote:
> On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre
>  wrote:
>> On 12/1/2023 12:56 PM, Peter Newman wrote:
>>> On Tue, May 16, 2023 at 5:06 PM Reinette Chatre
>>>> I think it may be optimistic to view this as a replacement of a PQR write.
>>>> As you point out, that requires that a CPU switches between tasks with the
>>>> same CLOSID. You demonstrate that resctrl already contributes a significant
>>>> delay to __switch_to - this work will increase that much more, it has to
>>>> be clear about this impact and motivate that it is acceptable.
>>>
>>> We were operating under the assumption that if the overhead wasn't
>>> acceptable, we would have heard complaints about it by now, but we
>>> ultimately learned that this feature wasn't deployed as much as we had
>>> originally thought on AMD hardware and that the overhead does need to
>>> be addressed.
>>>
>>> I am interested in your opinion on two options I'm exploring to
>>> mitigate the overhead, both of which depend on an API like the one
>>> Babu recently proposed for the AMD ABMC feature [1], where a new file
>>> interface will allow the user to indicate which mon_groups are
>>> actively being measured. I will refer to this as "assigned" for now,
>>> as that's the current proposal.
>>>
>>> The first is likely the simpler approach: only read MBM event counters
>>> which have been marked as "assigned" in the filesystem to avoid paying
>>> the context switch cost on tasks in groups which are not actively
>>> being measured. In our use case, we calculate memory bandwidth on
>>> every group every few minutes by reading the counters twice, 5 seconds
>>> apart. We would just need counters read during this 5-second window.
>>
>> I assume that tasks within a monitoring group can be scheduled on any
>> CPU and from the cover letter of this work I understand that only an
>> RMID assigned to a processor can be guaranteed to be tracked by hardware.
>>
>> Are you proposing for this option that you keep this "soft RMID" approach
>> with CPUs  permanently assigned a "hard RMID" but only update the counts for 
>> a
>> "soft RMID" that is "assigned"?
> 
> Yes
> 
>> I think that means that the context
>> switch cost for the monitored group would increase even more than with the
>> implementation in this series since the counters need to be read on context
>> switch in as well as context switch out.
>>
>> If I understand correctly then only one monitoring group can be measured
>> at a time. If such a measurement takes 5 seconds then theoretically 12 groups
>> can be measured in one minute. It may be possible to create many more
>> monitoring groups than this. Would it be possible to reach monitoring
>> goals in your environment?
> 
> We actually measure all of the groups at the same time, so thinking
> about this more, the proposed ABMC fix isn't actually a great fit: the
> user would have to assign all groups individually when a global
> setting would have been fine.
> 
> Ignoring any present-day resctrl interfaces, what we minimally need is...
> 
> 1. global "start measurement", which enables a
> read-counters-on-context switch flag, and broadcasts an IPI to all
> CPUs to read their current count
> 2. wait 5 seconds
> 3. global "end measurement", to IPI all CPUs again for final counts
> and clear the flag from step 1
> 
> Then the user could read at their leisure all the (frozen) event
> counts from memory until the next measurement begins.
> 
> In our case, if we're measuring as often as 5 seconds for every
> minute, that will already be a 12x aggregate reduction in overhead,
> which would be worthwhile enough.

The "con" here would be that during those 5 seconds (which I assume would be
controlled via user space so potentially shorter or longer) all tasks in the
system is expected to have significant (but yet to be measured) impact
on context switch delay.
I expect the overflow handler should only be run during the measurement
timeframe, to not defeat the "at their leisure" reading of counters.

>>> The second involves avoiding the situation where a hardware counter
>>> could be deallocated: Determine the number of simultaneous RMIDs
>>> supported, reduce the effective number of RMIDs available to that
>>> number. Use the default RMID (0) for all "unassigned" monitoring
>>
>> hmmm ... so on the on

Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events

2023-12-06 Thread Reinette Chatre

Hi Peter,

On 12/6/2023 10:38 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Tue, Dec 5, 2023 at 5:47 PM Reinette Chatre
>  wrote:
>>
>> On 12/5/2023 4:33 PM, Peter Newman wrote:
>>> On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre
>>>  wrote:
>>>> On 12/1/2023 12:56 PM, Peter Newman wrote:
> 
>>> Ignoring any present-day resctrl interfaces, what we minimally need is...
>>>
>>> 1. global "start measurement", which enables a
>>> read-counters-on-context switch flag, and broadcasts an IPI to all
>>> CPUs to read their current count
>>> 2. wait 5 seconds
>>> 3. global "end measurement", to IPI all CPUs again for final counts
>>> and clear the flag from step 1
>>>
>>> Then the user could read at their leisure all the (frozen) event
>>> counts from memory until the next measurement begins.
>>>
>>> In our case, if we're measuring as often as 5 seconds for every
>>> minute, that will already be a 12x aggregate reduction in overhead,
>>> which would be worthwhile enough.
>>
>> The "con" here would be that during those 5 seconds (which I assume would be
>> controlled via user space so potentially shorter or longer) all tasks in the
>> system is expected to have significant (but yet to be measured) impact
>> on context switch delay.
> 
> Yes, of course. In the worst case I've measured, Zen2, it's roughly a
> 1700-cycle context switch penalty (~20%) for tasks in different
> monitoring groups. Bad, but the benefit we gain from the per-RMID MBM
> data makes up for it several times over if we only pay the cost during a
> measurement.

I see.

> 
>> I expect the overflow handler should only be run during the measurement
>> timeframe, to not defeat the "at their leisure" reading of counters.
> 
> Yes, correct. We wouldn't be interested in overflows of the hardware
> counter when not actively measuring bandwidth.
> 
> 
>>
>>>>> The second involves avoiding the situation where a hardware counter
>>>>> could be deallocated: Determine the number of simultaneous RMIDs
>>>>> supported, reduce the effective number of RMIDs available to that
>>>>> number. Use the default RMID (0) for all "unassigned" monitoring
>>>>
>>>> hmmm ... so on the one side there is "only the RMID within the PQR
>>>> register can be guaranteed to be tracked by hardware" and on the
>>>> other side there is "A given implementation may have insufficient
>>>> hardware to simultaneously track the bandwidth for all RMID values
>>>> that the hardware supports."
>>>>
>>>> From the above there seems to be something in the middle where
>>>> some subset of the RMID values supported by hardware can be used
>>>> to simultaneously track bandwidth? How can it be determined
>>>> what this number of RMID values is?
>>>
>>> In the context of AMD, we could use the smallest number of CPUs in any
>>> L3 domain as a lower bound of the number of counters.
>>
>> Could you please elaborate on this? (With the numbers of CPUs nowadays this
>> may be many RMIDs, perhaps even more than what ABMC supports.)
> 
> I think the "In the context of AMD" part is key. This feature would only
> be applicable to the AMD implementations we have today which do not
> implement ABMC.  I believe the difficulties are unique to the topologies
> of these systems: many small L3 domains per node with a relatively small
> number of CPUs in each. If the L3 domains were large and few, simply
> restricting the number of RMIDs and allocating on group creation as we
> do today would probably be fine.
> 
>> I am missing something here since it is not obvious to me how this lower
>> bound is determined. Let's assume that there are as many monitor groups
>> (and thus as many assigned RMIDs) as there are CPUs in a L3 domain.
>> Each monitor group may have many tasks. It can be expected that at any
>> moment in time only a subset of assigned RMIDs are assigned to CPUs
>> via the CPUs' PQR registers. Of those RMIDs that are not assigned to
>> CPUs, how can it be certain that they continue to be tracked by hardware?
> 
> Are you asking whether the counters will ever be reclaimed proactively?
> The behavior I've observed is that writing a new RMID into a PQR_ASSOC
> register when all hardware counters in the domain are allocated will
> trigger the reallocation.

"When all hardware counters in the domain are allocat

Re: [PATCH v2 08/26] selftests/resctrl: Split measure_cache_vals()

2023-12-07 Thread Reinette Chatre

Hi Ilpo,

On 12/7/2023 6:32 AM, Ilpo Järvinen wrote:
> On Tue, 28 Nov 2023, Reinette Chatre wrote:
>> On 11/20/2023 3:13 AM, Ilpo Järvinen wrote:

...
>>> -   /*
>>> -* Measure llc occupancy from resctrl.
>>> -*/
>>> -   if (!strncmp(param->resctrl_val, CMT_STR, sizeof(CMT_STR))) {
>>> -   ret = get_llc_occu_resctrl(&llc_occu_resc);
>>> -   if (ret < 0)
>>> -   return ret;
>>> -   llc_value = llc_occu_resc;
>>> -   }
>>> -   ret = print_results_cache(param->filename, bm_pid, llc_value);
>>> -   if (ret)
>>> +   ret = print_results_cache(filename, bm_pid, llc_perf_miss);
>>> +   return ret;
>>> +}
>>
>> Perhaps print_results_cache() can be made to return negative error
>> and this just be "return print_results_cache(...)" and the function
>> comment be accurate?
> 
> I think, I'll just change all "return errno;" to "return -1" before this,
> however, one open question which impacts whether this is actually Fixes 
> class issue:
> 
> It seems that perror()'s manpage doesn't answer one important question, 
> whether it ifself can alter errno or not. The resctrl selftest code 
> assumes it doesn't but some evidence I came across says otherwise so doing 
> return errno; after calling perror() might not even be valid at all.
> 
> So I'm tempted to create an additional Fixes patch about the return change 
> into the front of the series.
> 

I would not trust errno to contain code of earlier calls after a call to 
perror().
If errno is needed I think it should be saved before calling perror(). Looking
at perror() at [1] I do not see an effort to restore errno before it returns,
and looking at the implementation of perror() there appears to be many
opportunities for errno to change.

Reinette

[1] 
https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/perror.c;h=51e621e332a5e2aa76ecefb3bcf325efb43b345f;hb=HEAD#l47

Re: [PATCH v2 08/26] selftests/resctrl: Split measure_cache_vals()

2023-12-07 Thread Reinette Chatre




On 12/7/2023 10:33 AM, Ilpo Järvinen wrote:
> 
> I already spent some moments in converting all return error -> return -1, 
> since all such places do perror() calls anyway (which I also converted to 
> ksft_perror() or ksft_print_msg() where perror() didn't make any sense) 
> there's not much added value in returning the errno which was not 
> correctly done in the existing code anyway.

Thank you very much Ilpo.

Reinette

Re: [PATCH v3 01/29] selftests/resctrl: Convert perror() to ksft_perror() or ksft_print_msg()

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:17 AM, Ilpo Järvinen wrote:
> The resctrl selftest code contains a number of perror() calls. Some of
> them come with hash character and some don't. The kselftest framework
> provides ksft_perror() that is compatible with test output formatting
> so it should be used instead of adding custom hash signs.
> 
> Some perror() calls are too far away from anything that sets error.
> For those call sites, ksft_print_msg() must be used instead.
> 
> Convert perror() to ksft_perror() or ksft_print_msg().
> 
> Other related changes:
> - Remove hash signs
> - Remove trailing stops & newlines from ksft_perror()
> - Add terminating newlines for converted ksft_print_msg()
> - Use consistent capitalization
> 

Another great cleanup. Also thanks for fixing some non-sensical messages.

...

> @@ -149,7 +149,7 @@ int cat_perf_miss_val(int cpu_no, int n, char *cache_type)
>   param.num_of_runs = 0;
>  
>   if (pipe(pipefd)) {
> - perror("# Unable to create pipe");
> + ksft_perror("Unable to create pipe");
>   return errno;
>   }
>  
> @@ -185,7 +185,7 @@ int cat_perf_miss_val(int cpu_no, int n, char *cache_type)
>* Just print the error message.
>* Let while(1) run and wait for itself to be killed.
>*/
> - perror("# failed signaling parent process");
> + ksft_perror("Failed signaling parent process");
>  

Partial writes are not actually errors and it cannot be expected that errno be 
set
in these cases. In these cases I think ksft_print_msg() would be more 
appropriate.

>   close(pipefd[1]);
>   while (1)
> @@ -197,7 +197,7 @@ int cat_perf_miss_val(int cpu_no, int n, char *cache_type)
>   while (pipe_message != 1) {
>   if (read(pipefd[0], &pipe_message,
>sizeof(pipe_message)) < sizeof(pipe_message)) {
> - perror("# failed reading from child process");
> + ksft_perror("Failed reading from child 
> process");
>   break;
>   }

Same with partial reads.


...

  
> @@ -540,14 +540,14 @@ static int print_results_bw(char *filename,  int 
> bm_pid, float bw_imc,
>   } else {
>   fp = fopen(filename, "a");
>   if (!fp) {
> - perror("Cannot open results file");
> + ksft_perror("Cannot open results file");
>  
>   return errno;
>   }
>   if (fprintf(fp, "Pid: %d \t Mem_BW_iMC: %f \t Mem_BW_resc: %lu 
> \t Difference: %lu\n",
>   bm_pid, bw_imc, bw_resc, diff) <= 0) {
> + ksft_perror("Could not log results");
>   fclose(fp);
> - perror("Could not log results.");
>  
>   return errno;

>From what I can tell fprintf() does not set errno on error. Perhaps this
should rather be ksft_print_msg()?


...

> @@ -738,15 +743,17 @@ int resctrl_val(const char * const *benchmark_cmd, 
> struct resctrl_val_param *par
>   sigact.sa_flags = SA_SIGINFO;
>  
>   /* Register for "SIGUSR1" signal from parent */
> - if (sigaction(SIGUSR1, &sigact, NULL))
> - PARENT_EXIT("Can't register child for signal");
> + if (sigaction(SIGUSR1, &sigact, NULL)) {
> + ksft_perror("Can't register child for signal");
> + PARENT_EXIT();
> + }
>  
>   /* Tell parent that child is ready */
>   close(pipefd[0]);
>   pipe_message = 1;
>   if (write(pipefd[1], &pipe_message, sizeof(pipe_message)) <
>   sizeof(pipe_message)) {
> - perror("# failed signaling parent process");
> + ksft_perror("Failed signaling parent process");
>   close(pipefd[1]);
>   return -1;

another partial write possibility

>   }
> @@ -755,7 +762,8 @@ int resctrl_val(const char * const *benchmark_cmd, struct 
> resctrl_val_param *par
>   /* Suspend child until delivery of "SIGUSR1" from parent */
>   sigsuspend(&sigact.sa_mask);
>  
> - PARENT_EXIT("Child is done");
> + ksft_perror("Child is done");
> + PARENT_EXIT();
>   }
>  
>   ksft_print_msg("Benchmark PID: %d\n", bm_pid);
> @@ -796,7 +804,7 @@ int resctrl_val(const char * const *benchmark_cmd, struct 
> resctrl_val_param *par
>   while (pipe_message != 1) {
>   if (read(pipefd[0], &pipe_message, sizeof(pipe_message)) <
>   sizeof(pipe_message)) {
> - perror("# failed reading message from child process");
> + ksft_perror("Failed reading message from child 
>

Re: [PATCH v3 02/29] selftests/resctrl: Return -1 instead of errno on error

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:17 AM, Ilpo Järvinen wrote:
> A number of functions in the resctrl selftests return errno. It is
> problematic because errno is positive which is often counterintuitive.
> Also, every sites returning errno prints the error message already with

every sites -> every site

> ksft_perror() so there is not much added value in returning the precise
> error code.
> 
> Simply convert all places returning errno to return -1 that is typical
> userspace error code in case of failures.
> 
> While at it, improve resctrl_val() comment to state that 0 means the
> test was run (either pass or fail).
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 04/29] selftests/resctrl: Change function comments to say < 0 on error

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> A number function comments state the function return non-zero on
> failure but in reality they can only return 0 on success and < 0 on
> error.
> 
> Update the comments to say < 0 on error to match the behavior.
> 
> While at it, improve cat_val() comment to state that 0 means the test
> was run (either pass or fail).
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 07/29] selftests/resctrl: Refactor get_cbm_mask() and rename to get_full_cbm()

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:

> -int get_cbm_mask(char *cache_type, char *cbm_mask)
> +static int get_bit_mask(const char *filename, unsigned long *mask)
>  {
> - char cbm_mask_path[1024];
>   FILE *fp;
>  
> - if (!cbm_mask)
> + if (!filename || !mask)
>   return -1;
>  
> - sprintf(cbm_mask_path, "%s/%s/cbm_mask", INFO_PATH, cache_type);
> -
> - fp = fopen(cbm_mask_path, "r");
> + fp = fopen(filename, "r");
>   if (!fp) {
> - ksft_perror("Failed to open cache level");
> -
> + fprintf(stderr, "Failed to open bit mask file '%s': %s\n",
> + filename, strerror(errno));
>   return -1;
>   }
> - if (fscanf(fp, "%s", cbm_mask) <= 0) {
> - ksft_perror("Could not get max cbm_mask");
> +
> + if (fscanf(fp, "%lx", mask) <= 0) {
> + fprintf(stderr, "Could not read bit mask file '%s': %s\n",
> + filename, strerror(errno));
>   fclose(fp);
>  
>   return -1;

After seeing the new effort to correct the perror() messages it is
not obvious to me why this patch changes these particular messages to
use fprintf(stderr, ...).

Reinette

Re: [PATCH v3 09/29] selftests/resctrl: Create cache_portion_size() helper

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> +/*
> + * cache_portion_size - Calculate the size of a cache portion
> + * @cache_size:  Total cache size in bytes
> + * @portion_mask:Cache portion mask
> + * @full_cache_mask: Full Cache Bit Mask (CBM) for the cache
> + *
> + * Return: The size of the cache portion in bytes.
> + */
> +static inline int cache_portion_size(unsigned long cache_size,
> +  unsigned long portion_mask,
> +  unsigned long full_cache_mask)
> +{
> + return cache_size * count_bits(portion_mask) / 
> count_bits(full_cache_mask);
> +}
> +

Even after you added the new zero check the static checker I tried
was not able to recognize that this is safe. Could you please add a check
to ensure that there will be no divide-by-zero here?

Reinette

Re: [PATCH v3 10/29] selftests/resctrl: Exclude shareable bits from schemata in CAT test

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> CAT test doesn't take shareable bits into account, i.e., the test might
> be sharing cache with some devices (e.g., graphics).
> 
> Introduce get_mask_no_shareable() and use it to provision an
> environment for CAT test where the allocated LLC is isolated better.
> Excluding shareable_bits may create hole(s) into the cbm_mask, thus add
> a new helper count_contiguous_bits() to find the longest contiguous set
> of CBM bits.
> 
> create_bit_mask() is needed by an upcoming CAT test rewrite so make it
> available in resctrl.h right away.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 11/29] selftests/resctrl: Split measure_cache_vals()

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> measure_cache_vals() does a different thing depending on the test case
> that called it:
>   - For CAT, it measures LLC misses through perf.
>   - For CMT, it measures LLC occupancy through resctrl.
> 
> Split these two functionalities into own functions the CAT and CMT
> tests can call directly. Replace passing the struct resctrl_val_param
> parameter with the filename because it's more generic and all those
> functions need out of resctrl_val.
> 
> Co-developed-by: Fenghua Yu 
> Signed-off-by: Fenghua Yu 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 20/29] selftests/resctrl: Replace file write with volatile variable

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> The fill_buf code prevents compiler optimizating the entire read loop
> away by writing the final value of the variable into a file. While it
> achieves the goal, writing into a file requires significant amount of
> work within the innermost test loop and also error handling.
> 
> A simpler approach is to take advantage of volatile. Writing through
> a pointer to a volatile variable is enough to prevent compiler from
> optimizing the write away, and therefore compiler cannot remove the
> read loop either.
> 
> Add a volatile 'value_sink' into resctrl_tests.c and make fill_buf to
> write into it. As a result, the error handling in fill_buf.c can be
> simplified.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 22/29] selftests/resctrl: Rewrite Cache Allocation Technology (CAT) test

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> CAT test spawns two processes into two different control groups with
> exclusive schemata. Both the processes alloc a buffer from memory
> matching their allocated LLC block size and flush the entire buffer out
> of caches. Since the processes are reading through the buffer only once
> during the measurement and initially all the buffer was flushed, the
> test isn't testing CAT.
> 
> Rewrite the CAT test to allocate a buffer sized to half of LLC. Then
> perform a sequence of tests with different LLC alloc sizes starting
> from half of the CBM bits down to 1-bit CBM. Flush the buffer before
> each test and read the buffer twice. Observe the LLC misses on the
> second read through the buffer. As the allocated LLC block gets smaller
> and smaller, the LLC misses will become larger and larger giving a
> strong signal on CAT working properly.
> 
> The new CAT test is using only a single process because it relies on
> measured effect against another run of itself rather than another
> process adding noise. The rest of the system is set to use the CBM bits
> not used by the CAT test to keep the test isolated.
> 
> Replace count_bits() with count_contiguous_bits() to get the first bit
> position in order to be able to calculate masks based on it.
> 
> This change has been tested with a number of systems from different
> generations.
> 
> Suggested-by: Reinette Chatre 
> Signed-off-by: Ilpo Järvinen 
> ---

Thank you!

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 23/29] selftests/resctrl: Restore the CPU affinity after CAT test

2023-12-13 Thread Reinette Chatre

chmark) to a specified cpu
> - * @bm_pid:  PID that should be binded
> - * @cpu_no:  CPU number at which the PID would be binded
> + * @bm_pid:  PID that should be binded
> + * @cpu_no:  CPU number at which the PID would be binded
> + * @old_affinity:When not NULL, set to old CPU affinity
>   *
>   * Return: 0 on success, < 0 on error.
>   */
> -int taskset_benchmark(pid_t bm_pid, int cpu_no)
> +int taskset_benchmark(pid_t bm_pid, int cpu_no, cpu_set_t *old_affinity)
>  {
>   cpu_set_t my_set;
>  
> + if (old_affinity) {
> + CPU_ZERO(old_affinity);
> + if (sched_getaffinity(bm_pid, sizeof(*old_affinity),
> +   old_affinity)) {
> + ksft_perror("Unable to read previous CPU affinity");

"previous" can be confusing here (it is not trying to determine something
from the past but instead the current state). It can just be "Unable to read
CPU affinity"

> + return -1;
> + }
> + }
> +
>   CPU_ZERO(&my_set);
>   CPU_SET(cpu_no, &my_set);
>  
> @@ -366,6 +376,23 @@ int taskset_benchmark(pid_t bm_pid, int cpu_no)
>   return 0;
>  }
>  
> +/*
> + * taskset_restore - Taskset PID to the earlier CPU affinity
> + * @bm_pid:  PID that should be reset
> + * @old_affinity:The old CPU affinity to restore
> + *
> + * Return: 0 on success, < 0 on error.
> + */
> +int taskset_restore(pid_t bm_pid, cpu_set_t *old_affinity)
> +{
> + if (sched_setaffinity(bm_pid, sizeof(*old_affinity), old_affinity)) {
> + ksft_perror("Unable to restore taskset");

This message is not clear to me. How about "Unable to restore CPU affinity"?

> + return -1;
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * create_grp - Create a group only if one doesn't exist
>   * @grp_name:Name of the group

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 24/29] selftests/resctrl: Create struct for input parameters

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> resctrl_tests reads a set of parameters and passes them individually
> for each tests which causes variations in the call signature between
> the tests.
> 
> Add struct input_params to hold all input parameters. It can be easily
> passed to every test without varying the call signature.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 25/29] selftests/resctrl: Introduce generalized test framework

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> Each test currently has a "run test" function in per test file and
> another resctrl_tests.c. The functions in resctrl_tests.c are almost
> identical.
> 
> Generalize the one in resctrl_tests.c such that it can be shared
> between all of the tests. It makes adding new tests easier and removes
> the per test if () forests.
> 
> Also add comment to CPU vendor IDs that they must be defined as bits
> for a bitmask.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 27/29] selftests/resctrl: Add helper to convert L2/3 to integer

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> "L2"/"L3" conversion to integer is embedded into get_cache_size()
> which prevents reuse.
> 
> Create a helper for the cache string to integer conversion to make
> it reusable.
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 28/29] selftests/resctrl: Rename resource ID to domain ID

2023-12-13 Thread Reinette Chatre

Hi Ilpo,

On 12/11/2023 4:18 AM, Ilpo Järvinen wrote:
> Kernel-side calls the instances of a resource domains.
> 
> Change the resource_id naming in the selftest code to domain_id to
> match the kernel side better.
> 
> Suggested-by: Maciej Wieczór-Retman 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v3 01/29] selftests/resctrl: Convert perror() to ksft_perror() or ksft_print_msg()

2023-12-14 Thread Reinette Chatre

Hi Ilpo,

On 12/14/2023 2:12 AM, Ilpo Järvinen wrote:
> On Wed, 13 Dec 2023, Reinette Chatre wrote:
> 
>> Hi Ilpo,
>>
>> On 12/11/2023 4:17 AM, Ilpo Järvinen wrote:
>>> The resctrl selftest code contains a number of perror() calls. Some of
>>> them come with hash character and some don't. The kselftest framework
>>> provides ksft_perror() that is compatible with test output formatting
>>> so it should be used instead of adding custom hash signs.
>>>
>>> Some perror() calls are too far away from anything that sets error.
>>> For those call sites, ksft_print_msg() must be used instead.
>>>
>>> Convert perror() to ksft_perror() or ksft_print_msg().
>>>
>>> Other related changes:
>>> - Remove hash signs
>>> - Remove trailing stops & newlines from ksft_perror()
>>> - Add terminating newlines for converted ksft_print_msg()
>>> - Use consistent capitalization
>>>
>>
>> Another great cleanup. Also thanks for fixing some non-sensical messages.
>>
>> ...
>>
>>> @@ -149,7 +149,7 @@ int cat_perf_miss_val(int cpu_no, int n, char 
>>> *cache_type)
>>> param.num_of_runs = 0;
>>>  
>>> if (pipe(pipefd)) {
>>> -   perror("# Unable to create pipe");
>>> +   ksft_perror("Unable to create pipe");
>>> return errno;
>>> }
>>>  
>>> @@ -185,7 +185,7 @@ int cat_perf_miss_val(int cpu_no, int n, char 
>>> *cache_type)
>>>  * Just print the error message.
>>>  * Let while(1) run and wait for itself to be killed.
>>>  */
>>> -   perror("# failed signaling parent process");
>>> +   ksft_perror("Failed signaling parent process");
>>>  
>>
>> Partial writes are not actually errors and it cannot be expected that errno 
>> be set
>> in these cases. In these cases I think ksft_print_msg() would be more 
>> appropriate.
> 
> I can change those to use print instead although I don't think these will
> fail for other reasons than a real error as the pipe should be empty and 
> only single byte is written to it.
> 

Apologies, I did not pay attention to the actual size of the message. Yes,
leaving it as ksft_perror() is reasonable.

Reinette

Re: [PATCH v4 01/29] selftests/resctrl: Convert perror() to ksft_perror() or ksft_print_msg()

2023-12-15 Thread Reinette Chatre

Hi Ilpo,

On 12/15/2023 7:04 AM, Ilpo Järvinen wrote:
> The resctrl selftest code contains a number of perror() calls. Some of
> them come with hash character and some don't. The kselftest framework
> provides ksft_perror() that is compatible with test output formatting
> so it should be used instead of adding custom hash signs.
> 
> Some perror() calls are too far away from anything that sets error.
> For those call sites, ksft_print_msg() must be used instead.
> 
> Convert perror() to ksft_perror() or ksft_print_msg().
> 
> Other related changes:
> - Remove hash signs
> - Remove trailing stops & newlines from ksft_perror()
> - Add terminating newlines for converted ksft_print_msg()
> - Use consistent capitalization
> - Small fixes/tweaks to typos & grammar of the messages
> - Extract error printing out of PARENT_EXIT() to be able to
>   differentiate
> 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v4 07/29] selftests/resctrl: Refactor get_cbm_mask() and rename to get_full_cbm()

2023-12-15 Thread Reinette Chatre

Hi Ilpo,

On 12/15/2023 7:04 AM, Ilpo Järvinen wrote:
> Callers of get_cbm_mask() are required to pass a string into which the
> capacity bitmask (CBM) is read. Neither CAT nor CMT tests need the
> bitmask as string but just convert it into an unsigned long value.
> 
> Another limitation is that the bit mask reader can only read
> .../cbm_mask files.
> 
> Generalize the bit mask reading function into get_bit_mask() such that
> it can be used to handle other files besides the .../cbm_mask and
> handles the unsigned long conversion within get_bit_mask() using
> fscanf(). Change get_cbm_mask() to use get_bit_mask() and rename it to
> get_full_cbm() to better indicate what the function does.
> 
> Return error from get_full_cbm() if the bitmask is zero for some reason
> because it makes the code more robust as the selftests naturally assume
> the bitmask has some bits.
> 
> Also mark cache_type const while at it and remove useless comments that
> are related to processing of CBM bits.
> 
> Co-developed-by: Fenghua Yu 
> Signed-off-by: Fenghua Yu 
> Signed-off-by: Ilpo Järvinen 
> ---

Reviewed-by: Reinette Chatre 

Reinette

Re: [PATCH v4 00/29] selftests/resctrl: CAT test improvements & generalized test framework

2023-12-15 Thread Reinette Chatre

Hi Ilpo and Shuah,

On 12/15/2023 7:04 AM, Ilpo Järvinen wrote:
> Here's v4 series to improve resctrl selftests with generalized test
> framework and rewritten CAT test.
> 
> The series contains following improvements:
> 
> - Excludes shareable bits from CAT test allocation to avoid interference
> - Replaces file "sink" with a volatile variable
> - Alters read pattern to defeat HW prefetcher optimizations
> - Rewrites CAT test to make the CAT test reliable and truly measure
>   if CAT is working or not
> - Introduces generalized test framework making easier to add new tests
> - Lots of other cleanups & refactoring
> 
> This series has been tested across a large number of systems from
> different generations.

Ilpo, thank you very much for this great cleanup and a creating a
reliable CAT test. This work is focused on kernel health and greatly
appreciated.

All patches in this series should have my reviewed-by tag. For
confirmation, for this whole series:
Reviewed-by: Reinette Chatre 

Shuah, could you please consider this series for inclusion at
your convenience?

Thank you very much.

Reinette

Re: [PATCH v4 00/29] selftests/resctrl: CAT test improvements & generalized test framework

2023-12-15 Thread Reinette Chatre




On 12/15/2023 9:45 AM, Reinette Chatre wrote:
> Hi Ilpo and Shuah,
> 
> On 12/15/2023 7:04 AM, Ilpo Järvinen wrote:
>> Here's v4 series to improve resctrl selftests with generalized test
>> framework and rewritten CAT test.
>>
>> The series contains following improvements:
>>
>> - Excludes shareable bits from CAT test allocation to avoid interference
>> - Replaces file "sink" with a volatile variable
>> - Alters read pattern to defeat HW prefetcher optimizations
>> - Rewrites CAT test to make the CAT test reliable and truly measure
>>   if CAT is working or not
>> - Introduces generalized test framework making easier to add new tests
>> - Lots of other cleanups & refactoring
>>
>> This series has been tested across a large number of systems from
>> different generations.
> 
> Ilpo, thank you very much for this great cleanup and a creating a
> reliable CAT test. This work is focused on kernel health and greatly
> appreciated.
> 
> All patches in this series should have my reviewed-by tag. For
> confirmation, for this whole series:
> Reviewed-by: Reinette Chatre 
> 
> Shuah, could you please consider this series for inclusion at
> your convenience?

Just in case somebody tries this series out against kernel v6.7-rc5 ...

A problematic perf patch made it into v6.7-rc5 (v6.7-rc4 and before are fine).
When testing this series against kernel v6.7-rc5 the splat reported at [1]
is triggered. A perf fix [2] is already queued up so all will be fine when 
testing
this series against a kernel with [2] merged.

Reinette


[1] 
https://lore.kernel.org/lkml/20231214000620.3081018-1-lucas.demar...@intel.com/
[2] https://lore.kernel.org/lkml/20231215112450.3972309-1-mark.rutl...@arm.com/

Re: [PATCH v2 1/4] selftests/resctrl: Add test groups and name L3 CAT test L3_CAT

2024-01-08 Thread Reinette Chatre

Hi Maciej,

On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:
> From: Ilpo Järvinen 
> 
> To select test to run -t parameter can be used. However, -t cat
> currently maps to L3 CAT test which is confusing after more CAT related
> tests are added.
> 
> Allow selecting tests as groups and call L3 CAT test "L3_CAT", "CAT"
> group will enable all CAT related tests.
> 
> Signed-off-by: Ilpo Järvinen 
> Signed-off-by: Maciej Wieczor-Retman 
> ---
> Changelog v2:
> - Move this patch from Ilpo's series here (Ilpo).
> 
>  tools/testing/selftests/resctrl/cat_test.c  |  3 ++-
>  tools/testing/selftests/resctrl/resctrl.h   |  2 ++
>  tools/testing/selftests/resctrl/resctrl_tests.c | 16 +++-
>  3 files changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
> b/tools/testing/selftests/resctrl/cat_test.c
> index 24af8310288a..39fc9303b8e8 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -295,7 +295,8 @@ static int cat_run_test(const struct resctrl_test *test, 
> const struct user_param
>  }
>  
>  struct resctrl_test l3_cat_test = {
> - .name = "CAT",
> + .name = "L3_CAT",
> + .group = "CAT",
>   .resource = "L3",
>   .feature_check = test_resource_feature_check,
>   .run_test = cat_run_test,
> diff --git a/tools/testing/selftests/resctrl/resctrl.h 
> b/tools/testing/selftests/resctrl/resctrl.h
> index c54efcf1412a..739e16d08a7b 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -65,6 +65,7 @@ struct user_params {
>  /*
>   * resctrl_test: resctrl test definition
>   * @name:Test name
> + * @group:   Test group (e.g., L2 and L3 CAT test belong to CAT 
> group), can be NULL

Could you please remove references to L2 CAT test that is not available yet? A
detailed description of what a test group is will be helpful.

Reinette

Re: [PATCH v2 2/4] selftests/resctrl: Add helpers for the non-contiguous test

2024-01-08 Thread Reinette Chatre

Hi Maciej,

On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:
> The CAT non-contiguous selftests have to read the file responsible for
> reporting support of non-contiguous CBMs in Intel CAT. Then the test

"in Intel CAT" -> "in kernel (resctrl)"

> compares if that information matches what is reported by CPUID output.
> 
> Add a generic helper function to read a chosen functionality support
> information.

Since this is a generic function that just reads a value from a file it
cannot be assumed that the value represents functionality support.

> 
> Signed-off-by: Maciej Wieczor-Retman 
> ---
> Changelog v2:
> - Added this patch.
> 
>  tools/testing/selftests/resctrl/resctrl.h   |  1 +
>  tools/testing/selftests/resctrl/resctrlfs.c | 25 +
>  2 files changed, 26 insertions(+)
> 
> diff --git a/tools/testing/selftests/resctrl/resctrl.h 
> b/tools/testing/selftests/resctrl/resctrl.h
> index 739e16d08a7b..8f72d94b9cbe 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -161,6 +161,7 @@ unsigned int count_contiguous_bits(unsigned long val, 
> unsigned int *start);
>  int get_full_cbm(const char *cache_type, unsigned long *mask);
>  int get_mask_no_shareable(const char *cache_type, unsigned long *mask);
>  int get_cache_size(int cpu_no, const char *cache_type, unsigned long 
> *cache_size);
> +int read_info_res_file(const char *resource, const char *filename);
>  void ctrlc_handler(int signum, siginfo_t *info, void *ptr);
>  int signal_handler_register(void);
>  void signal_handler_unregister(void);
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c 
> b/tools/testing/selftests/resctrl/resctrlfs.c
> index 0e97036a64b8..70333440ff2f 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -249,6 +249,31 @@ static int get_bit_mask(const char *filename, unsigned 
> long *mask)
>   return 0;
>  }
>  
> +int read_info_res_file(const char *resource, const char *filename)

Considering that this is intended to be a new generic utility, could you
please add some function documentation?

> +{
> + char file_path[PATH_MAX];
> + FILE *fp;
> + int ret;
> +
> + snprintf(file_path, sizeof(file_path), "%s/%s/%s", INFO_PATH, resource,
> +  filename);
> +
> + fp = fopen(file_path, "r");
> + if (!fp) {
> + perror("Error in opening sparse_masks file\n");

The error messages do not match the goal of this function to be generic.
Also, please note the recent cleanup done by Ilpo to replace the perror()
by ksft_perror().

> + return -1;
> + }
> +
> + if (fscanf(fp, "%u", &ret) <= 0) {

I find this to be potentially confusing. The function claims to be a generic
utility to read a value from a resctrl file ... but hidden within is that the
value is required to be unsigned, which is then cast into an int. This could be
made more specific and robust with something like below:
int resource_info_unsigned_get(const char *resource, const char 
*filename,
unsigned int *val)

The return value will be the result of the request. If 
resource_info_unsigned_get()
returns 0 then @val will contain the value read. 

> + perror("Could not get sparse_masks contents\n");
> + fclose(fp);
> + return -1;
> + }
> +
> + fclose(fp);
> + return ret;
> +}
> +
>  /*
>   * create_bit_mask- Create bit mask from start, len pair
>   * @start:   LSB of the mask

Reinette

Re: [PATCH v2 3/4] selftests/resctrl: Split validate_resctrl_feature_request()

2024-01-08 Thread Reinette Chatre

Hi Maciej,

On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:
> validate_resctrl_feature_request() is used to test both if a resource is
> present in the info directory, and if a passed monitoring feature is
> present in the mon_features file. There exists a different way to
> represent feature support and that is by the presence of 0 or 1 in
> single file in the info/resource directory. In this case the filename
> represents what feature support is being indicated.
> 
> Split validate_resctrl_feature_request() into three smaller functions
> that each accomplish one check:
> - Resource directory presence in the /sys/fs/resctrl/info directory.
> - Feature name presence in the /sys/fs/resctrl/info/RESOURCE/mon_features 
> file.
> - Feature file presence in a given /sys/fs/resctrl/info/RESOURCE directory.

Please present refactoring of existing code and introduction of new
feature as separate patches. 

> 
> Signed-off-by: Maciej Wieczor-Retman 
> ---
> Changelog v2:
> - Added this patch.
> 
>  tools/testing/selftests/resctrl/cat_test.c  |  2 -
>  tools/testing/selftests/resctrl/cmt_test.c  |  4 +-
>  tools/testing/selftests/resctrl/mba_test.c  |  5 +-
>  tools/testing/selftests/resctrl/mbm_test.c  |  6 +--
>  tools/testing/selftests/resctrl/resctrl.h   |  6 ++-
>  tools/testing/selftests/resctrl/resctrlfs.c | 59 +
>  6 files changed, 63 insertions(+), 19 deletions(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
> b/tools/testing/selftests/resctrl/cat_test.c
> index 39fc9303b8e8..7dc7206b3b99 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -1,9 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /*
>   * Cache Allocation Technology (CAT) test
> - *
>   * Copyright (C) 2018 Intel Corporation
> - *
>   * Authors:
>   *Sai Praneeth Prakhya ,
>   *Fenghua Yu 

Some unrelated changes here. 

> diff --git a/tools/testing/selftests/resctrl/cmt_test.c 
> b/tools/testing/selftests/resctrl/cmt_test.c
> index dd5ca343c469..7b63aec8e2c4 100644
> --- a/tools/testing/selftests/resctrl/cmt_test.c
> +++ b/tools/testing/selftests/resctrl/cmt_test.c
> @@ -169,8 +169,8 @@ static int cmt_run_test(const struct resctrl_test *test, 
> const struct user_param
>  
>  static bool cmt_feature_check(const struct resctrl_test *test)
>  {
> - return test_resource_feature_check(test) &&
> -validate_resctrl_feature_request("L3_MON", "llc_occupancy");
> + return resctrl_mon_feature_exists("L3_MON", "llc_occupancy") &&
> +resctrl_resource_exists("L3");
>  }
>  
>  struct resctrl_test cmt_test = {
> diff --git a/tools/testing/selftests/resctrl/mba_test.c 
> b/tools/testing/selftests/resctrl/mba_test.c
> index da256d2dbe5c..ecf1c186448d 100644
> --- a/tools/testing/selftests/resctrl/mba_test.c
> +++ b/tools/testing/selftests/resctrl/mba_test.c
> @@ -170,8 +170,9 @@ static int mba_run_test(const struct resctrl_test *test, 
> const struct user_param
>  
>  static bool mba_feature_check(const struct resctrl_test *test)
>  {
> - return test_resource_feature_check(test) &&
> -validate_resctrl_feature_request("L3_MON", "mbm_local_bytes");
> + return resctrl_resource_exists(test->resource) &&
> +resctrl_mon_feature_exists("L3_MON",
> +   "mbm_local_bytes");
>  }
>  
>  struct resctrl_test mba_test = {
> diff --git a/tools/testing/selftests/resctrl/mbm_test.c 
> b/tools/testing/selftests/resctrl/mbm_test.c
> index 34879e7b71a0..d67ffa3ec63a 100644
> --- a/tools/testing/selftests/resctrl/mbm_test.c
> +++ b/tools/testing/selftests/resctrl/mbm_test.c
> @@ -97,7 +97,7 @@ static int mbm_setup(const struct resctrl_test *test,
>   return END_OF_TESTS;
>  
>   /* Set up shemata with 100% allocation on the first run. */
> - if (p->num_of_runs == 0 && validate_resctrl_feature_request("MB", NULL))
> + if (p->num_of_runs == 0 && resctrl_resource_exists("MB"))
>   ret = write_schemata(p->ctrlgrp, "100", uparams->cpu, 
> test->resource);
>  
>   p->num_of_runs++;
> @@ -140,8 +140,8 @@ static int mbm_run_test(const struct resctrl_test *test, 
> const struct user_param
>  
>  static bool mbm_feature_check(const struct resctrl_test *test)
>  {
> - return validate_resctrl_feature_request("L3_MON", "mbm_total_bytes") &&
> -validate_resctrl_feature_request("L3_MON", "mbm_local_bytes");
> + return resctrl_mon_feature_exists("L3_MON", "mbm_total_bytes") &&
> +resctrl_mon_feature_exists("L3_MON", "mbm_local_bytes");
>  }
>  
>  struct resctrl_test mbm_test = {
> diff --git a/tools/testing/selftests/resctrl/resctrl.h 
> b/tools/testing/selftests/resctrl/resctrl.h
> index 8f72d94b9cbe..74041a35d4ba 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -135,7 +135,11 @@ int get_domain_id(const char *resource, int cpu_no, int 
> *domain_id);
>

Re: [PATCH v2 4/4] selftests/resctrl: Add non-contiguous CBMs CAT test

2024-01-08 Thread Reinette Chatre

Hi Maciej,

On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:
> Add tests for both L2 and L3 CAT to verify the return values
> generated by writing non-contiguous CBMs don't contradict the
> reported non-contiguous support information.
> 
> Use a logical XOR to confirm return value of write_schemata() and
> non-contiguous CBMs support information match.
> 
> Signed-off-by: Maciej Wieczor-Retman 
> ---
> Changelog v2:
> - Redo the patch message. (Ilpo)
> - Tidy up __cpuid_count calls. (Ilpo)
> - Remove redundant AND in noncont_mask calculations (Ilpo)
> - Fix bit_center offset.
> - Add newline before function return. (Ilpo)
> - Group non-contiguous tests with CAT tests. (Ilpo)
> - Use a helper for reading sparse_masks file. (Ilpo)
> - Make get_cache_level() available in other source files. (Ilpo)
> 
>  tools/testing/selftests/resctrl/cat_test.c| 75 +++
>  tools/testing/selftests/resctrl/resctrl.h |  3 +
>  .../testing/selftests/resctrl/resctrl_tests.c |  2 +
>  tools/testing/selftests/resctrl/resctrlfs.c   |  2 +-
>  4 files changed, 81 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
> b/tools/testing/selftests/resctrl/cat_test.c
> index 7dc7206b3b99..ecf553a89aae 100644
> --- a/tools/testing/selftests/resctrl/cat_test.c
> +++ b/tools/testing/selftests/resctrl/cat_test.c
> @@ -292,6 +292,65 @@ static int cat_run_test(const struct resctrl_test *test, 
> const struct user_param
>   return ret;
>  }
>  
> +static int noncont_cat_run_test(const struct resctrl_test *test,
> + const struct user_params *uparams)
> +{
> + unsigned long full_cache_mask, cont_mask, noncont_mask;
> + unsigned int eax, ebx, ecx, edx, ret;
> + int level, bit_center, sparse_masks;
> + char schemata[64];
> +
> + /* Check to compare sparse_masks content to cpuid output. */

"cpuid" -> "CPUID" (to note it is an instruction)

> + sparse_masks = read_info_res_file(test->resource, "sparse_masks");
> + if (sparse_masks < 0)
> + return sparse_masks;
> +
> + level = get_cache_level(test->resource);
> + if (level < 0)
> + return -EINVAL;
> + __cpuid_count(0x10, 4 - level, eax, ebx, ecx, edx);

Please do not invent relationships. Please replace the "4 - level" with
specific index used that depends on particular cache. The cache level
may not even be needed, just use the resource to determine the correct
index.

> +
> + if (sparse_masks != ((ecx >> 3) & 1))
> + return -1;

Can a message be displayed to support the debugging this test failure?

> +
> + /* Write checks initialization. */
> + ret = get_full_cbm(test->resource, &full_cache_mask);
> + if (ret < 0)
> + return ret;

I assume this test failure relies on the error message from get_bit_mask()
that is called via get_full_cbm()?

> + bit_center = count_bits(full_cache_mask) / 2;
> + cont_mask = full_cache_mask >> bit_center;
> +
> + /* Contiguous mask write check. */
> + snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
> + ret = write_schemata("", schemata, uparams->cpu, test->resource);
> + if (ret)
> + return ret;

How will user know what failed? I am seeing this single test exercise a few 
scenarios
and it is not obvious to me if the issue will be clear if this test,
noncont_cat_run_test(), fails.

> +
> + /*
> +  * Non-contiguous mask write check. CBM has a 0xf hole approximately in 
> the middle.
> +  * Output is compared with support information to catch any edge case 
> errors.
> +  */
> + noncont_mask = ~(0xf << (bit_center - 2)) & full_cache_mask;
> + snprintf(schemata, sizeof(schemata), "%lx", noncont_mask);
> + ret = write_schemata("", schemata, uparams->cpu, test->resource);
> + if (ret && sparse_masks)
> + ksft_print_msg("Non-contiguous CBMs supported but write 
> failed\n");
> + else if (ret && !sparse_masks)
> + ksft_print_msg("Non-contiguous CBMs not supported and write 
> failed as expected\n");
> + else if (!ret && !sparse_masks)
> + ksft_print_msg("Non-contiguous CBMs not supported but write 
> succeeded\n");

Can these messages be made more specific with a "write" changed to "write of
non-contiguous CBM"

> +
> + return !ret == !sparse_masks;

Please return negative on error. Ilpo just did a big cleanup to address this.

> +}
> +
> +static bool noncont_cat_feature_check(const struct resctrl_test *test)
> +{
> + if (!resctrl_resource_exists(test->resource))
> + return false;
> +
> + return resctrl_cache_feature_exists(test->resource, "sparse_masks");
> +}
> +
>  struct resctrl_test l3_cat_test = {
>   .name = "L3_CAT",
>   .group = "CAT",
> @@ -299,3 +358,19 @@ struct resctrl_test l3_cat_test = {
>   .feature_check = test_resource_feature_check,
>   .run_test = cat_run_test,
>  };
> +
> +struct resctrl_test l3_noncont_

Re: [PATCH v2 4/4] selftests/resctrl: Add non-contiguous CBMs CAT test

2024-01-09 Thread Reinette Chatre

Hi Ilpo,

On 1/9/2024 1:13 AM, Ilpo Järvinen wrote:
> On Mon, 8 Jan 2024, Reinette Chatre wrote:
> 
>> Hi Maciej,
>>
>> On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:
>>> Add tests for both L2 and L3 CAT to verify the return values
>>> generated by writing non-contiguous CBMs don't contradict the
>>> reported non-contiguous support information.
>>>
>>> Use a logical XOR to confirm return value of write_schemata() and
>>> non-contiguous CBMs support information match.
>>>
>>> Signed-off-by: Maciej Wieczor-Retman 
>>> ---
>>> Changelog v2:
>>> - Redo the patch message. (Ilpo)
>>> - Tidy up __cpuid_count calls. (Ilpo)
>>> - Remove redundant AND in noncont_mask calculations (Ilpo)
>>> - Fix bit_center offset.
>>> - Add newline before function return. (Ilpo)
>>> - Group non-contiguous tests with CAT tests. (Ilpo)
>>> - Use a helper for reading sparse_masks file. (Ilpo)
>>> - Make get_cache_level() available in other source files. (Ilpo)
>>>
>>>  tools/testing/selftests/resctrl/cat_test.c| 75 +++
>>>  tools/testing/selftests/resctrl/resctrl.h |  3 +
>>>  .../testing/selftests/resctrl/resctrl_tests.c |  2 +
>>>  tools/testing/selftests/resctrl/resctrlfs.c   |  2 +-
>>>  4 files changed, 81 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/tools/testing/selftests/resctrl/cat_test.c 
>>> b/tools/testing/selftests/resctrl/cat_test.c
>>> index 7dc7206b3b99..ecf553a89aae 100644
>>> --- a/tools/testing/selftests/resctrl/cat_test.c
>>> +++ b/tools/testing/selftests/resctrl/cat_test.c
>>> @@ -292,6 +292,65 @@ static int cat_run_test(const struct resctrl_test 
>>> *test, const struct user_param
>>> return ret;
>>>  }
>>>  
>>> +static int noncont_cat_run_test(const struct resctrl_test *test,
>>> +   const struct user_params *uparams)
>>> +{
>>> +   unsigned long full_cache_mask, cont_mask, noncont_mask;
>>> +   unsigned int eax, ebx, ecx, edx, ret;
>>> +   int level, bit_center, sparse_masks;
>>> +   char schemata[64];
>>> +
>>> +   /* Check to compare sparse_masks content to cpuid output. */
>>
>> "cpuid" -> "CPUID" (to note it is an instruction)
>>
>>> +   sparse_masks = read_info_res_file(test->resource, "sparse_masks");
>>> +   if (sparse_masks < 0)
>>> +   return sparse_masks;
>>> +
>>> +   level = get_cache_level(test->resource);
>>> +   if (level < 0)
>>> +   return -EINVAL;
>>> +   __cpuid_count(0x10, 4 - level, eax, ebx, ecx, edx);
>>
>> Please do not invent relationships. Please replace the "4 - level" with
>> specific index used that depends on particular cache. The cache level
>> may not even be needed, just use the resource to determine the correct
>> index.
> 
> This is actually my fault, I suggested Maciej could use arithmetics there.

No problem. The math works for the current values but there is no such
relationship. If hypothetically a new cache level needs to be supported
then this computation cannot be relied upon to continue to be correct.

>>> +   return !ret == !sparse_masks;
>>
>> Please return negative on error. Ilpo just did a big cleanup to address this.
> 
> Test failure is not same as an error. So tests should return negative for 
> errors which prevent even running test at all, and 0/1 for test 
> success/fail.
> 

Thanks for catching this. I missed this subtlety in the framework.

Reinette

Re: [PATCH v2 3/4] selftests/resctrl: Split validate_resctrl_feature_request()

2024-01-17 Thread Reinette Chatre

Hi Maciej,

On 1/17/2024 1:49 AM, Maciej Wieczór-Retman wrote:
> On 2024-01-08 at 14:38:45 -0800, Reinette Chatre wrote:
>> On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:


>>> +
>>> +   snprintf(res_path, sizeof(res_path), "%s/%s/%s", INFO_PATH, resource,
>>> +feature);
>>> +
>>> +   if (stat(res_path, &statbuf))
>>> +   return false;
>>
>> I think it will be more robust to look at statbuf to learn if the file type
>> is correct and the file is actually readable.
> 
> Could that file be unreadable or of wrong type?

It should be readable and the correct type when all goes well. Hence the term
"more robust".

> 
> Also I thought that since read_info_res_file() opens and reads that file any
> errors should be handled there. Shouldn't this part of the test only return
> whether the file is there or not since that indicates if something is 
> supported
> in the kernel?

ok.


Reinette

Re: [PATCH v2 4/4] selftests/resctrl: Add non-contiguous CBMs CAT test

2024-01-17 Thread Reinette Chatre

Hi Maciej,

On 1/17/2024 12:26 AM, Maciej Wieczór-Retman wrote:
> On 2024-01-08 at 14:42:11 -0800, Reinette Chatre wrote:
>> On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:

>>> +
>>> +   if (sparse_masks != ((ecx >> 3) & 1))
>>> +   return -1;
>>
>> Can a message be displayed to support the debugging this test failure?
> 
> Sure, that is a very good idea. I'll add ksft_perror() here.

I do not think ksft_perror() is appropriate since perror() is for
system error messages (something that sets errno). Perhaps just
ksft_print_msg().

>>> +   bit_center = count_bits(full_cache_mask) / 2;
>>> +   cont_mask = full_cache_mask >> bit_center;
>>> +
>>> +   /* Contiguous mask write check. */
>>> +   snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>>> +   ret = write_schemata("", schemata, uparams->cpu, test->resource);
>>> +   if (ret)
>>> +   return ret;
>>
>> How will user know what failed? I am seeing this single test exercise a few 
>> scenarios
>> and it is not obvious to me if the issue will be clear if this test,
>> noncont_cat_run_test(), fails.
> 
> write_schemata() either succeeds with '0' or errors out with a negative 
> value. If
> the contiguous mask write fails, write_schemata should print out what was 
> wrong
> and I believe that the test will report an error rather than failure.

Right. I am trying to understand whether the user will be able to decipher what 
failed
in case there is an error. Seems like in this case the user is expected to look 
at the
source code of the test to understand what the test was trying to do at the 
time it
encountered the failure. In this case user may be "lucky" that this test only 
has
one write_schemata() call _not_ followed by a ksft_print_msg() so user can use 
that
reasoning to figure out which write_schemata() failed to further dig what test 
was
trying to do. 

Reinette

Re: [PATCH v2 4/4] selftests/resctrl: Add non-contiguous CBMs CAT test

2024-01-18 Thread Reinette Chatre

Hi Maciej,

On 1/18/2024 4:02 AM, Maciej Wieczór-Retman wrote:
> On 2024-01-17 at 10:49:06 -0800, Reinette Chatre wrote:
>> On 1/17/2024 12:26 AM, Maciej Wieczór-Retman wrote:
>>> On 2024-01-08 at 14:42:11 -0800, Reinette Chatre wrote:
>>>> On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:

>>>>> + bit_center = count_bits(full_cache_mask) / 2;
>>>>> + cont_mask = full_cache_mask >> bit_center;
>>>>> +
>>>>> + /* Contiguous mask write check. */
>>>>> + snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>>>>> + ret = write_schemata("", schemata, uparams->cpu, test->resource);
>>>>> + if (ret)
>>>>> + return ret;
>>>>
>>>> How will user know what failed? I am seeing this single test exercise a 
>>>> few scenarios
>>>> and it is not obvious to me if the issue will be clear if this test,
>>>> noncont_cat_run_test(), fails.
>>>
>>> write_schemata() either succeeds with '0' or errors out with a negative 
>>> value. If
>>> the contiguous mask write fails, write_schemata should print out what was 
>>> wrong
>>> and I believe that the test will report an error rather than failure.
>>
>> Right. I am trying to understand whether the user will be able to decipher 
>> what failed
>> in case there is an error. Seems like in this case the user is expected to 
>> look at the
>> source code of the test to understand what the test was trying to do at the 
>> time it
>> encountered the failure. In this case user may be "lucky" that this test 
>> only has
>> one write_schemata() call _not_ followed by a ksft_print_msg() so user can 
>> use that
>> reasoning to figure out which write_schemata() failed to further dig what 
>> test was
>> trying to do. 
> 
> When a write_schemata() is executed the string that is being written gets
> printed. If there are multiple calls in a single tests and one fails I'd 
> imagine
> it would be easy for the user to figure out which one failed.

It would be easy for the user the figure out if (a) it is obvious to the user
what schema a particular write_schema() call attempted to write and (b) all the
write_schema() calls attempt to write different schema.

> On a side note I'm not sure if that's true but I'm getting a feeling that the
> harder errors (not just test failures) are more of a clue for developers 
> working
> on the tests. Would you agree that it seems like users probably won't see
> write_schemata() fail here (if the test execution managed to get to this point
> without erroring out due to bad parameters or kernel compiled without required
> options)?

I do agree that users probably won't see such failures. I do not think these
errors are clues to developers working on the tests though, but instead clues
to resctrl developers or kernel development CI systems.

Reinette

Re: [PATCH v2 4/4] selftests/resctrl: Add non-contiguous CBMs CAT test

2024-01-19 Thread Reinette Chatre

Hi Maciej,

On 1/18/2024 11:37 PM, Maciej Wieczór-Retman wrote:
> On 2024-01-18 at 09:15:46 -0800, Reinette Chatre wrote:
>> On 1/18/2024 4:02 AM, Maciej Wieczór-Retman wrote:
>>> On 2024-01-17 at 10:49:06 -0800, Reinette Chatre wrote:
>>>> On 1/17/2024 12:26 AM, Maciej Wieczór-Retman wrote:
>>>>> On 2024-01-08 at 14:42:11 -0800, Reinette Chatre wrote:
>>>>>> On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:
>>
>>>>>>> +   bit_center = count_bits(full_cache_mask) / 2;
>>>>>>> +   cont_mask = full_cache_mask >> bit_center;
>>>>>>> +
>>>>>>> +   /* Contiguous mask write check. */
>>>>>>> +   snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>>>>>>> +   ret = write_schemata("", schemata, uparams->cpu, 
>>>>>>> test->resource);
>>>>>>> +   if (ret)
>>>>>>> +   return ret;
>>>>>>
>>>>>> How will user know what failed? I am seeing this single test exercise a 
>>>>>> few scenarios
>>>>>> and it is not obvious to me if the issue will be clear if this test,
>>>>>> noncont_cat_run_test(), fails.
>>>>>
>>>>> write_schemata() either succeeds with '0' or errors out with a negative 
>>>>> value. If
>>>>> the contiguous mask write fails, write_schemata should print out what was 
>>>>> wrong
>>>>> and I believe that the test will report an error rather than failure.
>>>>
>>>> Right. I am trying to understand whether the user will be able to decipher 
>>>> what failed
>>>> in case there is an error. Seems like in this case the user is expected to 
>>>> look at the
>>>> source code of the test to understand what the test was trying to do at 
>>>> the time it
>>>> encountered the failure. In this case user may be "lucky" that this test 
>>>> only has
>>>> one write_schemata() call _not_ followed by a ksft_print_msg() so user can 
>>>> use that
>>>> reasoning to figure out which write_schemata() failed to further dig what 
>>>> test was
>>>> trying to do. 
>>>
>>> When a write_schemata() is executed the string that is being written gets
>>> printed. If there are multiple calls in a single tests and one fails I'd 
>>> imagine
>>> it would be easy for the user to figure out which one failed.
>>
>> It would be easy for the user the figure out if (a) it is obvious to the user
>> what schema a particular write_schema() call attempted to write and (b) all 
>> the
>> write_schema() calls attempt to write different schema.
> 
> Okay, your comment made me wonder if on error the schemata still is printed. I
> double checked in the code and whether write_schemata() fails or not it has a
> goto path where before returning it will print out the schema. So I believe 
> that
> satisfies your (a) condition.

Let me try with an example.
Scenario 1:
The test has the following code:
...
write_schemata(..., "0xfff", ...);
...
write_schemata(..., "0xf0f", ...);
...

Scenario 2:
The test has the following code:
...
write_schemata(..., schemata, ...);
...
write_schemata(..., schemata, ...);
...

Any failure of write_schemata() in scenario 1 will be easy to trace. As you
state, write_schemata() prints the schemata attempted and it will thus be
easy to look at the code to see which write_schemata() call failed since it
is obvious from the code which schemata was attempted.
A failure of one of the write_schemata() in scenario 2 will not be as easy
to trace since the user first needs to determine what the value of "schemata"
is at each call and that may depend on the platform, bit shifting done in test,
and state of system state at time of test.

> As for (b) depends on what you meant. Other tests that run more than one
> write_schemata() use different ones every time (CAT, MBM, MBA). Do you suggest
> that the non-contiguous test should attempt more schematas? For example shift
> the bit hole from one side to the other? I assumed one CBM with a centered bit
> hole would be enough to check if non-contiguous CBM feature works properly and
> more CBMs would be redundant.

Let me try with an example.
Scenario 1:
The test has the following code:
...
write_schemata(..., "0xfff", ...);
...
write_schemata(..., "0xf0f", ...);
...

Scenario 2:
The test has the following code:
...
write_schemata(..., "0xfff", ...);
...
write_schemata(..., "0xfff", ...);
...

A failure of either write_schemata() in scenario 1 will be easy to trace since 
the schemata attempted is different in each case. The schemata printed by the
write_schemata() error message can thus easily be connected to the specific
write_schemata() call.
A failure of either write_schemata() in scenario 2 is not so obvious since they
both attempted the same schemata so the error message printed by 
write_schemata()
could belong to either. 

Reinette

Re: [PATCH v2 4/4] selftests/resctrl: Add non-contiguous CBMs CAT test

2024-01-22 Thread Reinette Chatre

Hi Maciej,

On 1/21/2024 11:56 PM, Maciej Wieczór-Retman wrote:
> Hi!
> 
> On 2024-01-19 at 08:39:31 -0800, Reinette Chatre wrote:
>> Hi Maciej,
>>
>> On 1/18/2024 11:37 PM, Maciej Wieczór-Retman wrote:
>>> On 2024-01-18 at 09:15:46 -0800, Reinette Chatre wrote:
>>>> On 1/18/2024 4:02 AM, Maciej Wieczór-Retman wrote:
>>>>> On 2024-01-17 at 10:49:06 -0800, Reinette Chatre wrote:
>>>>>> On 1/17/2024 12:26 AM, Maciej Wieczór-Retman wrote:
>>>>>>> On 2024-01-08 at 14:42:11 -0800, Reinette Chatre wrote:
>>>>>>>> On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:
>>>>
>>>>>>>>> + bit_center = count_bits(full_cache_mask) / 2;
>>>>>>>>> + cont_mask = full_cache_mask >> bit_center;
>>>>>>>>> +
>>>>>>>>> + /* Contiguous mask write check. */
>>>>>>>>> + snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>>>>>>>>> + ret = write_schemata("", schemata, uparams->cpu, 
>>>>>>>>> test->resource);
>>>>>>>>> + if (ret)
>>>>>>>>> + return ret;
>>>>>>>>
>>>>>>>> How will user know what failed? I am seeing this single test exercise 
>>>>>>>> a few scenarios
>>>>>>>> and it is not obvious to me if the issue will be clear if this test,
>>>>>>>> noncont_cat_run_test(), fails.
>>>>>>>
>>>>>>> write_schemata() either succeeds with '0' or errors out with a negative 
>>>>>>> value. If
>>>>>>> the contiguous mask write fails, write_schemata should print out what 
>>>>>>> was wrong
>>>>>>> and I believe that the test will report an error rather than failure.
>>>>>>
>>>>>> Right. I am trying to understand whether the user will be able to 
>>>>>> decipher what failed
>>>>>> in case there is an error. Seems like in this case the user is expected 
>>>>>> to look at the
>>>>>> source code of the test to understand what the test was trying to do at 
>>>>>> the time it
>>>>>> encountered the failure. In this case user may be "lucky" that this test 
>>>>>> only has
>>>>>> one write_schemata() call _not_ followed by a ksft_print_msg() so user 
>>>>>> can use that
>>>>>> reasoning to figure out which write_schemata() failed to further dig 
>>>>>> what test was
>>>>>> trying to do. 
>>>>>
>>>>> When a write_schemata() is executed the string that is being written gets
>>>>> printed. If there are multiple calls in a single tests and one fails I'd 
>>>>> imagine
>>>>> it would be easy for the user to figure out which one failed.
>>>>
>>>> It would be easy for the user the figure out if (a) it is obvious to the 
>>>> user
>>>> what schema a particular write_schema() call attempted to write and (b) 
>>>> all the
>>>> write_schema() calls attempt to write different schema.
>>>
>>> Okay, your comment made me wonder if on error the schemata still is 
>>> printed. I
>>> double checked in the code and whether write_schemata() fails or not it has 
>>> a
>>> goto path where before returning it will print out the schema. So I believe 
>>> that
>>> satisfies your (a) condition.
>>
>> Let me try with an example.
>> Scenario 1:
>> The test has the following code:
>>  ...
>>  write_schemata(..., "0xfff", ...);
>>  ...
>>  write_schemata(..., "0xf0f", ...);
>>  ...
>>
>> Scenario 2:
>> The test has the following code:
>>  ...
>>  write_schemata(..., schemata, ...);
>>  ...
>>  write_schemata(..., schemata, ...);
>>  ...
>>
>> Any failure of write_schemata() in scenario 1 will be easy to trace. As you
>> state, write_schemata() prints the schemata attempted and it will thus be
>> easy to look at the code to see which write_schemata() call failed since it
>> is obvious from the code which schemata was attempted.
>> A failure of one of the write_schemata() in scenario 2 will not be as easy
>&g

resctrl selftests ready for inclusion

2024-01-22 Thread Reinette Chatre

Hi Shuah,

Could you please consider Ilpo's resctrl selftest enhancements [1]
for inclusion into kselftest's "next" branch in preparation for the
next merge window?

Thank you very much.

Reinette

[1] 
https://lore.kernel.org/lkml/20231215150515.36983-1-ilpo.jarvi...@linux.intel.com/

Re: [PATCH v2 4/4] selftests/resctrl: Add non-contiguous CBMs CAT test

2024-01-23 Thread Reinette Chatre

Hi Maciej,

On 1/22/2024 11:58 PM, Maciej Wieczór-Retman wrote:
> On 2024-01-22 at 08:32:36 -0800, Reinette Chatre wrote:
>> Hi Maciej,
>>
>> On 1/21/2024 11:56 PM, Maciej Wieczór-Retman wrote:
>>> Hi!
>>>
>>> On 2024-01-19 at 08:39:31 -0800, Reinette Chatre wrote:
>>>> Hi Maciej,
>>>>
>>>> On 1/18/2024 11:37 PM, Maciej Wieczór-Retman wrote:
>>>>> On 2024-01-18 at 09:15:46 -0800, Reinette Chatre wrote:
>>>>>> On 1/18/2024 4:02 AM, Maciej Wieczór-Retman wrote:
>>>>>>> On 2024-01-17 at 10:49:06 -0800, Reinette Chatre wrote:
>>>>>>>> On 1/17/2024 12:26 AM, Maciej Wieczór-Retman wrote:
>>>>>>>>> On 2024-01-08 at 14:42:11 -0800, Reinette Chatre wrote:
>>>>>>>>>> On 12/12/2023 6:52 AM, Maciej Wieczor-Retman wrote:
>>>>>>
>>>>>>>>>>> +   bit_center = count_bits(full_cache_mask) / 2;
>>>>>>>>>>> +   cont_mask = full_cache_mask >> bit_center;
>>>>>>>>>>> +
>>>>>>>>>>> +   /* Contiguous mask write check. */
>>>>>>>>>>> +   snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>>>>>>>>>>> +   ret = write_schemata("", schemata, uparams->cpu, 
>>>>>>>>>>> test->resource);
>>>>>>>>>>> +   if (ret)
>>>>>>>>>>> +   return ret;
>>>>>>>>>>
>>>>>>>>>> How will user know what failed? I am seeing this single test 
>>>>>>>>>> exercise a few scenarios
>>>>>>>>>> and it is not obvious to me if the issue will be clear if this test,
>>>>>>>>>> noncont_cat_run_test(), fails.
>>>>>>>>>
>>>>>>>>> write_schemata() either succeeds with '0' or errors out with a 
>>>>>>>>> negative value. If
>>>>>>>>> the contiguous mask write fails, write_schemata should print out what 
>>>>>>>>> was wrong
>>>>>>>>> and I believe that the test will report an error rather than failure.
>>>>>>>>
>>>>>>>> Right. I am trying to understand whether the user will be able to 
>>>>>>>> decipher what failed
>>>>>>>> in case there is an error. Seems like in this case the user is 
>>>>>>>> expected to look at the
>>>>>>>> source code of the test to understand what the test was trying to do 
>>>>>>>> at the time it
>>>>>>>> encountered the failure. In this case user may be "lucky" that this 
>>>>>>>> test only has
>>>>>>>> one write_schemata() call _not_ followed by a ksft_print_msg() so user 
>>>>>>>> can use that
>>>>>>>> reasoning to figure out which write_schemata() failed to further dig 
>>>>>>>> what test was
>>>>>>>> trying to do. 
>>>>>>>
>>>>>>> When a write_schemata() is executed the string that is being written 
>>>>>>> gets
>>>>>>> printed. If there are multiple calls in a single tests and one fails 
>>>>>>> I'd imagine
>>>>>>> it would be easy for the user to figure out which one failed.
>>>>>>
>>>>>> It would be easy for the user the figure out if (a) it is obvious to the 
>>>>>> user
>>>>>> what schema a particular write_schema() call attempted to write and (b) 
>>>>>> all the
>>>>>> write_schema() calls attempt to write different schema.
>>>
>>>>> As for (b) depends on what you meant. Other tests that run more than one
>>>>> write_schemata() use different ones every time (CAT, MBM, MBA). Do you 
>>>>> suggest
>>>>> that the non-contiguous test should attempt more schematas? For example 
>>>>> shift
>>>>> the bit hole from one side to the other? I assumed one CBM with a 
>>>>> centered bit
>>>>> hole would be enough to check if non-contiguous CBM feature works 
>>>>> properly and
>>>>> more CBMs would be redundant.
>>&g

1 2 3 >

1 - 100 of 250 matches

Mail list logo