Re: [PATCH v2 1/7] drivers: dio: Avoids building driver if CONFIG_DIO is disabled

2018-09-27 Thread Leonardo Bras
Dear Rolf,

On Thu, Sep 27, 2018 at 3:35 AM, Rolf Eike Beer  wrote:
> Am Donnerstag, 27. September 2018, 03:39:56 CEST schrieb Leonardo Brás:
>> Avoids building driver if 'make drivers/dio/' is called and
>> CONFIG_DIO is disabled.
>>
>> Signed-off-by: Leonardo Brás 
>> ---
>>  drivers/dio/Makefile | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/dio/Makefile b/drivers/dio/Makefile
>> index ae92d17083f2..f5cffe232448 100644
>> --- a/drivers/dio/Makefile
>> +++ b/drivers/dio/Makefile
>> @@ -2,4 +2,4 @@
>>  # Makefile for the linux kernel.
>>  #
>>
>> -obj-y := dio.o dio-driver.o dio-sysfs.o
>> +obj-$(CONFIG_DIO) := dio.o dio-driver.o dio-sysfs.o
>> \ No newline at end of file
>   ^^
>
> One more.

Thanks, I will fix this for v3.


Re: [PATCH v2 6/7] drivers: oprofile: Avoids building driver from direct make command

2018-09-27 Thread Leonardo Bras
Hello Rolf,

On Thu, Sep 27, 2018 at 3:34 AM, Rolf Eike Beer  wrote:
> Am Donnerstag, 27. September 2018, 03:41:38 CEST schrieb Leonardo Brás:
>> Creates new Makefile to avoid building driver if
>> 'make drivers/oprofile/' is called directly.
>>
>> This driver is usually built from arch/$ARCH and seems to have
>> no purpose when built alone.
>>
>> Signed-off-by: Leonardo Brás 
>> ---
>>  drivers/oprofile/Makefile | 2 ++
>>  1 file changed, 2 insertions(+)
>>  create mode 100644 drivers/oprofile/Makefile
>>
>> diff --git a/drivers/oprofile/Makefile b/drivers/oprofile/Makefile
>> new file mode 100644
>> index ..acaed2ad6eee
>> --- /dev/null
>> +++ b/drivers/oprofile/Makefile
>> @@ -0,0 +1,2 @@
>> +#Does nothing, since the source is called from arch/$ARCH/ tree.
>> +
>
> Now there is a blank line where it does not need to be.
>
> Eike

Oh, it's a Makefile, and like all text files, it has to end with a newline.
If I am wrong, please let me know.

Thanks for your feedback,

Leonardo Bras,


Re: [PATCH v3 3/7] drivers: parisc: Avoids building driver if CONFIG_PARISC is disabled

2018-10-09 Thread Leonardo Bras
Hello Michael,

> That already works, doesn't it? So all that you'd need is an offline
> tool to precompute what drivers to actually build with a given config.
>
> 'make -n' with some suitable output mangling might do the job.
>
> There may well be other ways to achieve your stated goal, without any
> need to make changes to the kernel build process (which is the result of
> many years of evolution and tuning, BTW).

Thanks for the info, I will try to use it.
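
Just to make the idea concrete, what I have in mind is roughly the following
(untested sketch; the config file name is only a placeholder):

    # Pre-compute, offline, which objects a given .config would build,
    # without compiling anything.
    cp my.config .config
    make olddefconfig
    make -n 2>/dev/null | grep -oE '[^ ]+\.o' | sort -u > objects-to-build.txt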

> > This change is not supposed to interfere with the usual way of building the kernel, and
>
> Enough people have voiced their concern to warrant that you should back
> up that claim, IMO. Have you verified that your patchset does not change
> current behaviour when building the entire set of default configurations
> for each supported architecture? Does it reduce or increase overall
> complexity of the build process?
>
I have tried it on some ARCHs and it worked fine. Out of curiosity, I will
try it on all of them.

> > it is not even supposed to add overhead to kernel compilation. And it would,
> > at least, solve my problem with the 3h limit, and enable the tool
> > I am building on GitlabCI to help other developers.
>
> (Apropos of nothing: Am I the only one who thinks gitlab might take a
> rather dim view of your creativity in dealing with their limit?)
>

They make 50k minutes a month available for OSS projects. I don't believe they
care how it's spent as long as it's used to build/deploy the project. They even
allow using several 'jobs' in parallel in order to speed up the process.

Thanks for your help,

Leonardo Bras


Re: [PATCH 1/1] kbuild: Optimize tests and remove shadowed local variable.

2018-10-09 Thread Leonardo Bras
Thank you!

Please let me know if it needs any rework!


Leonardo Bras

On Sat, Oct 6, 2018 at 4:50 PM Masahiro Yamada
 wrote:
>
> Hi Leonardo, David,
>
>
>
> On Fri, Oct 5, 2018 at 11:32 AM Leonardo Bras  wrote:
> >
> > Hello David,
> >
> > My name is Leonardo and I am new to kernel development.
> >
> > Is this patch acceptable? Does it need some rework? Does the change make sense?
> > Is there a way to better follow the workflow for this patch?
> >
> > Please let me know if it needs anything.
>
>
> I thought David would pick this up,
> but he is not responding.
>
> Now, applied to my kbuild tree.
>
>
> Thanks.
>
> >
> > Best regards,
> >
> > Leonardo Bras
> >
> > On Wed, Sep 19, 2018 at 4:11 AM Masahiro Yamada
> >  wrote:
> > >
> > > FW: David Howells
> > >
> > >
> > > 2018-09-19 8:00 GMT+09:00 Leonardo Brás :
> > > > Removes an unnecessary shadowed local variable (start).
> > > > Optimize test of isdigit:
> > > > - If isalpha returns true, isdigit will return false, so no need to 
> > > > test.
> > > >
> > > > Signed-off-by: Leonardo Brás 
> > >
> > >
> > > This patch was sent to me, but maybe belong to David's field.
> > >
> > > David, will you take care of this patch?
> > >
> > > https://lore.kernel.org/patchwork/patch/988171/
> > >
> > > I think the commit subject should be changed kbuild: -> ASN.1:
> > >
> > >
> > > Anyway, this patch looks trivial,
> > > FWIW
> > >
> > > Reviewed-by: Masahiro Yamada 
> > >
> > >
> > >
> > >
> > >
> > > > ---
> > > >  scripts/asn1_compiler.c | 8 
> > > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/scripts/asn1_compiler.c b/scripts/asn1_compiler.c
> > > > index c146020fc783..a0056df4e358 100644
> > > > --- a/scripts/asn1_compiler.c
> > > > +++ b/scripts/asn1_compiler.c
> > > > @@ -413,7 +413,7 @@ static void tokenise(char *buffer, char *end)
> > > >
> > > > /* Handle string tokens */
> > > > if (isalpha(*p)) {
> > > > -   const char **dir, *start = p;
> > > > +   const char **dir;
> > > >
> > > > /* Can be a directive, type name or 
> > > > element
> > > >  * name.  Find the end of the name.
> > > > @@ -454,10 +454,10 @@ static void tokenise(char *buffer, char *end)
> > > >
> > > > tokens[tix++].token_type = 
> > > > TOKEN_TYPE_NAME;
> > > > continue;
> > > > -   }
> > > >
> > > > -   /* Handle numbers */
> > > > -   if (isdigit(*p)) {
> > > > +   } else if (isdigit(*p)) {
> > > > +   /* Handle numbers */
> > > > +
> > > > /* Find the end of the number */
> > > > q = p + 1;
> > > > while (q < nl && (isdigit(*q)))
> > > > --
> > > > 2.19.0
> > > >
> > >
> > >
> > >
> > > --
> > > Best Regards
> > > Masahiro Yamada
>
>
>
> --
> Best Regards
> Masahiro Yamada


Re: [PATCH v3 0/7] Remove errors building drivers/DRIVERNAME

2018-10-09 Thread Leonardo Bras
Thanks Finn, I will take a good look at that and try to use it in my build.

Thank you,
Leonardo Bras

On Wed, Oct 3, 2018 at 11:00 PM Finn Thain  wrote:
>
> On Wed, 3 Oct 2018, Leonardo Bras wrote:
>
> > Both ccache and distcc seem very interesting, I will take my time to
> > study them better as they can solve some situations I face. Thanks for
> > sharing!
> >
>
> You might also want to check out 'gcc -O0', 'gcc -fopt-info' and 'gcc
> --help=optimizers' etc to see if you can reduce the compute cost.
>
> To reduce IO cost, my build tests always use 'make O=/some/path' where
> /some/path is on a tmpfs mountpoint.
>
> HTH.
>
> --


Re: [PATCH v3 2/5] kbuild: Removes unnecessary shadowed local variable.

2018-10-29 Thread Leonardo Bras
Sorry, I will be more careful next time.

Thank you,

Leonardo Bras

On Sun, Oct 28, 2018 at 1:37 PM Masahiro Yamada
 wrote:
>
> On Wed, Oct 24, 2018 at 1:04 PM Leonardo Bras  wrote:
> >
> > Removes an unnecessary shadowed local variable (start).
> > It was used only once, with the same value it was started before
> > the if block.
> >
> > Signed-off-by: Leonardo Bras 
>
>
>
> Applied to linux-kbuild
> with some fixups in the subject.
>
> Please do not add a period to the end of the subject.
>
>
>
>
>
>
> > ---
> >  scripts/asn1_compiler.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/scripts/asn1_compiler.c b/scripts/asn1_compiler.c
> > index c146020fc783..1b28787028d3 100644
> > --- a/scripts/asn1_compiler.c
> > +++ b/scripts/asn1_compiler.c
> > @@ -413,7 +413,7 @@ static void tokenise(char *buffer, char *end)
> >
> > /* Handle string tokens */
> > if (isalpha(*p)) {
> > -   const char **dir, *start = p;
> > +   const char **dir;
> >
> > /* Can be a directive, type name or element
> >  * name.  Find the end of the name.
> > --
> > 2.19.1
> >
>
>
> --
> Best Regards
> Masahiro Yamada


Re: [PATCH v2 3/5] Creates macro to avoid variable shadowing

2018-10-29 Thread Leonardo Bras
Thank you!
On Sun, Oct 28, 2018 at 1:38 PM Masahiro Yamada
 wrote:
>
> On Tue, Oct 23, 2018 at 10:11 AM Leonardo Brás  wrote:
> >
> > Creates DEF_FIELD_ADDR_VAR as a more generic version of the DEF_FIELD_ADDR
> > macro, allowing usage of a variable name other than the struct element name.
> > Also, sets DEF_FIELD_ADDR as a specific usage of DEF_FIELD_ADDR_VAR in which
> > the var name is the same as the struct element name.
> >
> > Signed-off-by: Leonardo Brás 
> > ---
>
>
> Applied to linux-kbuild.
>
>
>
> >  scripts/mod/file2alias.c | 24 +---
> >  1 file changed, 17 insertions(+), 7 deletions(-)
> >
> > diff --git a/scripts/mod/file2alias.c b/scripts/mod/file2alias.c
> > index 7be43697ff84..3015c0bdecb2 100644
> > --- a/scripts/mod/file2alias.c
> > +++ b/scripts/mod/file2alias.c
> > @@ -95,12 +95,20 @@ extern struct devtable *__start___devtable[], 
> > *__stop___devtable[];
> >   */
> >  #define DEF_FIELD(m, devid, f) \
> > typeof(((struct devid *)0)->f) f = TO_NATIVE(*(typeof(f) *)((m) + 
> > OFF_##devid##_##f))
> > +
> > +/* Define a variable v that holds the address of field f of struct devid
> > + * based at address m.  Due to the way typeof works, for a field of type
> > + * T[N] the variable has type T(*)[N], _not_ T*.
> > + */
> > +#define DEF_FIELD_ADDR_VAR(m, devid, f, v) \
> > +   typeof(((struct devid *)0)->f) *v = ((m) + OFF_##devid##_##f)
> > +
> >  /* Define a variable f that holds the address of field f of struct devid
> >   * based at address m.  Due to the way typeof works, for a field of type
> >   * T[N] the variable has type T(*)[N], _not_ T*.
> >   */
> >  #define DEF_FIELD_ADDR(m, devid, f) \
> > -   typeof(((struct devid *)0)->f) *f = ((m) + OFF_##devid##_##f)
> > +   DEF_FIELD_ADDR_VAR(m, devid, f, f)
> >
> >  /* Add a table entry.  We test function type matches while we're here. */
> >  #define ADD_TO_DEVTABLE(device_id, type, function) \
> > @@ -641,25 +649,27 @@ static void do_pnp_card_entries(void *symval, 
> > unsigned long size,
> > unsigned int i;
> >
> > device_id_check(mod->name, "pnp", size, id_size, symval);
> > +   DEF_FIELD_ADDR(symval, pnp_card_device_id, devs);
> > +   typeof(devs) devs_last;
> >
> > for (i = 0; i < count; i++) {
> > unsigned int j;
> > -   DEF_FIELD_ADDR(symval + i*id_size, pnp_card_device_id, 
> > devs);
> > +   devs_last = devs + i * id_size;
> >
> > for (j = 0; j < PNP_MAX_DEVICES; j++) {
> > -   const char *id = (char *)(*devs)[j].id;
> > -   int i2, j2;
> > +   const char *id = (char *)(*devs_last)[j].id;
> > +   int j2;
> > int dup = 0;
> >
> > if (!id[0])
> > break;
> >
> > /* find duplicate, already added value */
> > -   for (i2 = 0; i2 < i && !dup; i2++) {
> > -   DEF_FIELD_ADDR(symval + i2*id_size, 
> > pnp_card_device_id, devs);
> > +   while ((devs_last -= id_size) >= devs && !dup) {
> >
> > for (j2 = 0; j2 < PNP_MAX_DEVICES; j2++) {
> > -   const char *id2 = (char 
> > *)(*devs)[j2].id;
> > +   const char *id2 =
> > +   (char *)(*devs_last)[j2].id;
> >
> > if (!id2[0])
> > break;
> > --
> > 2.19.1
> >
>
>
> --
> Best Regards
> Masahiro Yamada


Re: [PATCH v3 4/5] modpost: Changes parameter name to avoid shadowing.

2018-10-31 Thread Leonardo Bras
On Mon, 2018-10-29 at 01:42 +0900, Masahiro Yamada wrote:
> On Wed, Oct 24, 2018 at 1:05 PM Leonardo Bras 
> wrote:
> > Changes the parameter name to avoid shadowing a variable.
> > 
> > Signed-off-by: Leonardo Bras 
> 
> For this one, I'd rather like to see code refactoring
> than renaming the variable.
> 
> I will take a closer look.


What do you suggest refactoring?
I volunteer for this work.

Leonardo Bras



[PATCH 1/3] modpost: Refactor and removes global variable *modules

2018-10-31 Thread Leonardo Bras
Refactors the code to accept the modules list as a parameter in
the functions that need it, and moves the variable from the
global scope to the main function scope.

This also fixes the parameter shadowing of add_depends.

Signed-off-by: Leonardo Bras 
---
 scripts/mod/modpost.c | 39 ---
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 0d998c54564d..bebe65ed957a 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -118,10 +118,7 @@ void *do_nofail(void *ptr, const char *expr)
return ptr;
 }
 
-/* A list of all modules we processed */
-static struct module *modules;
-
-static struct module *find_module(const char *modname)
+static struct module *find_module(const char *modname, struct module *modules)
 {
struct module *mod;
 
@@ -131,7 +128,7 @@ static struct module *find_module(const char *modname)
return mod;
 }
 
-static struct module *new_module(const char *modname)
+static struct module *new_module(const char *modname, struct module **pmodules)
 {
struct module *mod;
char *p;
@@ -149,8 +146,8 @@ static struct module *new_module(const char *modname)
/* add to list */
mod->name = p;
mod->gpl_compatible = -1;
-   mod->next = modules;
-   modules = mod;
+   mod->next = *pmodules;
+   *pmodules = mod;
 
return mod;
 }
@@ -1929,7 +1926,7 @@ static char *remove_dot(char *s)
return s;
 }
 
-static void read_symbols(const char *modname)
+static void read_symbols(const char *modname, struct module **pmodules)
 {
const char *symname;
char *version;
@@ -1941,7 +1938,7 @@ static void read_symbols(const char *modname)
if (!parse_elf(&info, modname))
return;
 
-   mod = new_module(modname);
+   mod = new_module(modname, pmodules);
 
/* When there's no vmlinux, don't print warnings about
 * unresolved symbols (since there'll be too many ;) */
@@ -1992,7 +1989,8 @@ static void read_symbols(const char *modname)
mod->unres = alloc_symbol("module_layout", 0, mod->unres);
 }
 
-static void read_symbols_from_files(const char *filename)
+static void read_symbols_from_files(const char *filename,
+   struct module **pmodules)
 {
FILE *in = stdin;
char fname[PATH_MAX];
@@ -2006,7 +2004,7 @@ static void read_symbols_from_files(const char *filename)
while (fgets(fname, PATH_MAX, in) != NULL) {
if (strends(fname, "\n"))
fname[strlen(fname)-1] = '\0';
-   read_symbols(fname);
+   read_symbols(fname, pmodules);
}
 
if (in != stdin)
@@ -2318,7 +2316,8 @@ static void write_if_changed(struct buffer *b, const char 
*fname)
 /* parse Module.symvers file. line format:
  * 0x12345678symbolmodule[[export]something]
  **/
-static void read_dump(const char *fname, unsigned int kernel)
+static void read_dump(const char *fname, unsigned int kernel,
+ struct module **pmodules)
 {
unsigned long size, pos = 0;
void *file = grab_file(fname, &size);
@@ -2347,11 +2346,11 @@ static void read_dump(const char *fname, unsigned int 
kernel)
crc = strtoul(line, &d, 16);
if (*symname == '\0' || *modname == '\0' || *d != '\0')
goto fail;
-   mod = find_module(modname);
+   mod = find_module(modname, *pmodules);
if (!mod) {
if (is_vmlinux(modname))
have_vmlinux = 1;
-   mod = new_module(modname);
+   mod = new_module(modname, pmodules);
mod->skip = 1;
}
s = sym_add_exported(symname, mod, export_no(export));
@@ -2415,6 +2414,8 @@ int main(int argc, char **argv)
int err;
struct ext_sym_list *extsym_iter;
struct ext_sym_list *extsym_start = NULL;
+   /* A list of all modules we processed */
+   static struct module *modules;
 
while ((opt = getopt(argc, argv, "i:I:e:mnsST:o:awM:K:E")) != -1) {
switch (opt) {
@@ -2466,21 +2467,21 @@ int main(int argc, char **argv)
}
 
if (kernel_read)
-   read_dump(kernel_read, 1);
+   read_dump(kernel_read, 1, &modules);
if (module_read)
-   read_dump(module_read, 0);
+   read_dump(module_read, 0, &modules);
while (extsym_start) {
-   read_dump(extsym_start->file, 0);
+   read_dump(extsym_start->file, 0, &modules);
extsym_iter = extsym_start->next;
free(extsym_start);
extsym_start = ex

[PATCH 1/1] modpost: Refactor and removes global variable *modules

2018-10-31 Thread Leonardo Bras
Refactors the code to accept the modules list as a parameter in
the functions that need it, and moves the variable from the
global scope to the main function scope.

This also fixes the parameter shadowing of add_depends.

Signed-off-by: Leonardo Bras 
---
 scripts/mod/modpost.c | 39 ---
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 0d998c54564d..bebe65ed957a 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -118,10 +118,7 @@ void *do_nofail(void *ptr, const char *expr)
return ptr;
 }
 
-/* A list of all modules we processed */
-static struct module *modules;
-
-static struct module *find_module(const char *modname)
+static struct module *find_module(const char *modname, struct module *modules)
 {
struct module *mod;
 
@@ -131,7 +128,7 @@ static struct module *find_module(const char *modname)
return mod;
 }
 
-static struct module *new_module(const char *modname)
+static struct module *new_module(const char *modname, struct module **pmodules)
 {
struct module *mod;
char *p;
@@ -149,8 +146,8 @@ static struct module *new_module(const char *modname)
/* add to list */
mod->name = p;
mod->gpl_compatible = -1;
-   mod->next = modules;
-   modules = mod;
+   mod->next = *pmodules;
+   *pmodules = mod;
 
return mod;
 }
@@ -1929,7 +1926,7 @@ static char *remove_dot(char *s)
return s;
 }
 
-static void read_symbols(const char *modname)
+static void read_symbols(const char *modname, struct module **pmodules)
 {
const char *symname;
char *version;
@@ -1941,7 +1938,7 @@ static void read_symbols(const char *modname)
if (!parse_elf(&info, modname))
return;
 
-   mod = new_module(modname);
+   mod = new_module(modname, pmodules);
 
/* When there's no vmlinux, don't print warnings about
 * unresolved symbols (since there'll be too many ;) */
@@ -1992,7 +1989,8 @@ static void read_symbols(const char *modname)
mod->unres = alloc_symbol("module_layout", 0, mod->unres);
 }
 
-static void read_symbols_from_files(const char *filename)
+static void read_symbols_from_files(const char *filename,
+   struct module **pmodules)
 {
FILE *in = stdin;
char fname[PATH_MAX];
@@ -2006,7 +2004,7 @@ static void read_symbols_from_files(const char *filename)
while (fgets(fname, PATH_MAX, in) != NULL) {
if (strends(fname, "\n"))
fname[strlen(fname)-1] = '\0';
-   read_symbols(fname);
+   read_symbols(fname, pmodules);
}
 
if (in != stdin)
@@ -2318,7 +2316,8 @@ static void write_if_changed(struct buffer *b, const char 
*fname)
 /* parse Module.symvers file. line format:
  * 0x12345678symbolmodule[[export]something]
  **/
-static void read_dump(const char *fname, unsigned int kernel)
+static void read_dump(const char *fname, unsigned int kernel,
+ struct module **pmodules)
 {
unsigned long size, pos = 0;
void *file = grab_file(fname, &size);
@@ -2347,11 +2346,11 @@ static void read_dump(const char *fname, unsigned int 
kernel)
crc = strtoul(line, &d, 16);
if (*symname == '\0' || *modname == '\0' || *d != '\0')
goto fail;
-   mod = find_module(modname);
+   mod = find_module(modname, *pmodules);
if (!mod) {
if (is_vmlinux(modname))
have_vmlinux = 1;
-   mod = new_module(modname);
+   mod = new_module(modname, pmodules);
mod->skip = 1;
}
s = sym_add_exported(symname, mod, export_no(export));
@@ -2415,6 +2414,8 @@ int main(int argc, char **argv)
int err;
struct ext_sym_list *extsym_iter;
struct ext_sym_list *extsym_start = NULL;
+   /* A list of all modules we processed */
+   static struct module *modules;
 
while ((opt = getopt(argc, argv, "i:I:e:mnsST:o:awM:K:E")) != -1) {
switch (opt) {
@@ -2466,21 +2467,21 @@ int main(int argc, char **argv)
}
 
if (kernel_read)
-   read_dump(kernel_read, 1);
+   read_dump(kernel_read, 1, &modules);
if (module_read)
-   read_dump(module_read, 0);
+   read_dump(module_read, 0, &modules);
while (extsym_start) {
-   read_dump(extsym_start->file, 0);
+   read_dump(extsym_start->file, 0, &modules);
extsym_iter = extsym_start->next;
free(extsym_start);
extsym_start = ex

Re: [PATCH v3 4/5] modpost: Changes parameter name to avoid shadowing.

2018-10-31 Thread Leonardo Bras
On Wed, 2018-10-31 at 20:24 -0300, Leonardo Bras wrote:
> On Mon, 2018-10-29 at 01:42 +0900, Masahiro Yamada wrote:
> > On Wed, Oct 24, 2018 at 1:05 PM Leonardo Bras 
> > wrote:
> > > Changes the parameter name to avoid shadowing a variable.
> > > 
> > > Signed-off-by: Leonardo Bras 
> > 
> > For this one, I'd rather like to see code refactoring
> > than renaming the variable.
> > 
> > I will take a closer look.
> 
> What do you suggest refactoring?
> I volunteer for this work.
> 
> Leonardo Bras
> 

I refactored the code to move the global variable *modules into the local
scope of main(), and changed the functions to accept *modules
as a parameter.

Is that what you had in mind?

https://lkml.org/lkml/2018/11/1/725

Thanks for reading,

Leonardo Bras



Re: [Lkcamp] [PATCH 2/4] Renames variable to fix shadow warning.

2018-10-17 Thread Leonardo Bras
Hello Helen,
Thanks for the suggestions!

On Tue, Oct 16, 2018 at 11:57 PM Helen Koike  wrote:
>
> Hi Leonardo,
>
> Thanks for the patch, just some small comments below.
>
> Please, check previous log messages with git log
> arch/x86/entry/vdso/vdso2c.h, you will see that most patches had the
> prefix "x86/vdso:" in the commit message.
Ok, added! This change will be available in v2.

>
> On 10/16/18 9:08 PM, Leonardo Brás wrote:
> > Renames the char variable to avoid shadowing a variable previously
> > declared on this function.
> >
> > Signed-off-by: Leonardo Brás 
> > ---
> >  arch/x86/entry/vdso/vdso2c.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
> > index fa847a620f40..9466998d0f28 100644
> > --- a/arch/x86/entry/vdso/vdso2c.h
> > +++ b/arch/x86/entry/vdso/vdso2c.h
> > @@ -93,11 +93,11 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
> >   int k;
> >   ELF(Sym) *sym = raw_addr + GET_LE(&symtab_hdr->sh_offset) +
> >   GET_LE(&symtab_hdr->sh_entsize) * i;
> > - const char *name = raw_addr + GET_LE(&strtab_hdr->sh_offset) +
> > + const char *name2 = raw_addr + GET_LE(&strtab_hdr->sh_offset) 
> > +
> >   GET_LE(&sym->st_name);
>
> Could you please use a more meaningful name instead of name2 please? I
> believe sym_name should be fine.

Ok, I renamed it as you suggested.
Also, I renamed the "name" variable to "image_name", as Andy suggested.
This change will be available in v2.

>
> >
> >   for (k = 0; k < NSYMS; k++) {
> > - if (!strcmp(name, required_syms[k].name)) {
> > + if (!strcmp(name2, required_syms[k].name)) {
> >   if (syms[k]) {
> >   fail("duplicate symbol %s\n",
> >required_syms[k].name);
> >
>
> Regards
> Helen

Regards,
Leonardo Bras


Re: [PATCH 2/4] Renames variable to fix shadow warning.

2018-10-17 Thread Leonardo Bras
Thanks Ingo,
On Wed, Oct 17, 2018 at 3:01 AM Ingo Molnar  wrote:
>
>
> * Leonardo Brás  wrote:
>
> > Renames the char variable to avoid shadowing a variable previously
> > declared on this function.
> >
> > Signed-off-by: Leonardo Brás 
> > ---
> >  arch/x86/entry/vdso/vdso2c.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
> > index fa847a620f40..9466998d0f28 100644
> > --- a/arch/x86/entry/vdso/vdso2c.h
> > +++ b/arch/x86/entry/vdso/vdso2c.h
> > @@ -93,11 +93,11 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
> >   int k;
> >   ELF(Sym) *sym = raw_addr + GET_LE(&symtab_hdr->sh_offset) +
> >   GET_LE(&symtab_hdr->sh_entsize) * i;
> > - const char *name = raw_addr + GET_LE(&strtab_hdr->sh_offset) +
> > + const char *name2 = raw_addr + GET_LE(&strtab_hdr->sh_offset) 
> > +
> >   GET_LE(&sym->st_name);
> >
> >   for (k = 0; k < NSYMS; k++) {
> > - if (!strcmp(name, required_syms[k].name)) {
> > + if (!strcmp(name2, required_syms[k].name)) {
> >   if (syms[k]) {
> >   fail("duplicate symbol %s\n",
> >required_syms[k].name);
>
> NAK.
>
> Please read and understand the code and rename both variables to
> meaningful names, not just a mindless name/name2 ...
>

It's changed! This change will be available in v2.

> Thanks,
>
> Ingo

Regards,
Leonardo


Re: [PATCH 2/4] Renames variable to fix shadow warning.

2018-10-17 Thread Leonardo Bras
Hello Andy,
Thanks for the suggestion.
I renamed them as suggested and it will be available in v2.

Regards,

Leonardo

On Wed, Oct 17, 2018 at 2:54 PM Andy Lutomirski  wrote:
>
> On Tue, Oct 16, 2018 at 11:01 PM Ingo Molnar  wrote:
> >
> >
> > * Leonardo Brás  wrote:
> >
> > > Renames the char variable to avoid shadowing a variable previously
> > > declared on this function.
> > >
> > > Signed-off-by: Leonardo Brás 
> > > ---
> > >  arch/x86/entry/vdso/vdso2c.h | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
> > > index fa847a620f40..9466998d0f28 100644
> > > --- a/arch/x86/entry/vdso/vdso2c.h
> > > +++ b/arch/x86/entry/vdso/vdso2c.h
> > > @@ -93,11 +93,11 @@ static void BITSFUNC(go)(void *raw_addr, size_t 
> > > raw_len,
> > >   int k;
> > >   ELF(Sym) *sym = raw_addr + GET_LE(&symtab_hdr->sh_offset) +
> > >   GET_LE(&symtab_hdr->sh_entsize) * i;
> > > - const char *name = raw_addr + 
> > > GET_LE(&strtab_hdr->sh_offset) +
> > > + const char *name2 = raw_addr + 
> > > GET_LE(&strtab_hdr->sh_offset) +
> > >   GET_LE(&sym->st_name);
> > >
> > >   for (k = 0; k < NSYMS; k++) {
> > > - if (!strcmp(name, required_syms[k].name)) {
> > > + if (!strcmp(name2, required_syms[k].name)) {
> > >   if (syms[k]) {
> > >   fail("duplicate symbol %s\n",
> > >required_syms[k].name);
> >
> > NAK.
> >
> > Please read and understand the code and rename both variables to
> > meaningful names, not just a mindless name/name2 ...
> >
>
> Maybe image_name and sym_name?
>
> --Andy


Re: [Lkcamp] [PATCH 1/4] Adds -Wshadow=local on KBUILD_HOSTCFLAGS

2018-10-17 Thread Leonardo Bras
Hello Helen,

On Wed, Oct 17, 2018 at 1:32 AM Helen Koike  wrote:
>
> Hi Leonardo,
>
> Thanks for the patch.
>
> On 10/16/18 9:08 PM, Leonardo Brás wrote:
> > Adds -Wshadow=local on KBUILD_HOSTCFLAGS to show shadow warnings
> > on tools built for HOST.
> >
> > Signed-off-by: Leonardo Brás 
> > ---
> >  Makefile | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/Makefile b/Makefile
> > index e8b599b4dcde..fb0a9ac195e7 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -360,7 +360,7 @@ HOST_LFS_LIBS := $(shell getconf LFS_LIBS 2>/dev/null)
> >
> >  HOSTCC   = gcc
> >  HOSTCXX  = g++
> > -KBUILD_HOSTCFLAGS   := -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 \
> > +KBUILD_HOSTCFLAGS   := -Wall -Wshadow=local -Wmissing-prototypes 
> > -Wstrict-prototypes -O2 \
> >   -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS) \
> >   $(HOSTCFLAGS)
> >  KBUILD_HOSTCXXFLAGS := -O2 $(HOST_LFS_CFLAGS) $(HOSTCXXFLAGS)
> >
>
> I believe this should be the last patch on this series and not the first
> one to avoid commits in between where we have those warnings.
>

You are right, I will change the order for v2.
Thanks!

> Regards
> Helen

Regards,
Leonardo


Re: [PATCH 1/4] Adds -Wshadow=local on KBUILD_HOSTCFLAGS

2018-10-17 Thread Leonardo Bras
On Wed, Oct 17, 2018 at 5:21 AM Masahiro Yamada
 wrote:
>
> On Wed, Oct 17, 2018 at 5:11 PM Borislav Petkov  wrote:
> >
> > On Tue, Oct 16, 2018 at 09:08:09PM -0300, Leonardo Brás wrote:
> > > Adds -Wshadow=local on KBUILD_HOSTCFLAGS to show shadow warnings
> > > on tools built for HOST.
> > >
> > > Signed-off-by: Leonardo Brás 
> > > ---
> > >  Makefile | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/Makefile b/Makefile
> > > index e8b599b4dcde..fb0a9ac195e7 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -360,7 +360,7 @@ HOST_LFS_LIBS := $(shell getconf LFS_LIBS 2>/dev/null)
> > >
> > >  HOSTCC   = gcc
> > >  HOSTCXX  = g++
> > > -KBUILD_HOSTCFLAGS   := -Wall -Wmissing-prototypes -Wstrict-prototypes 
> > > -O2 \
> > > +KBUILD_HOSTCFLAGS   := -Wall -Wshadow=local -Wmissing-prototypes 
> > > -Wstrict-prototypes -O2 \
> > >   -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS) \
> > >   $(HOSTCFLAGS)
> > >  KBUILD_HOSTCXXFLAGS := -O2 $(HOST_LFS_CFLAGS) $(HOSTCXXFLAGS)
> > > --
> >
> > You might wanna take a look at scripts/Makefile.extrawarn which already
> > has -Wshadow.
>
>
> scripts/Makefile.extrawarn provides options for the target compiler (CC),
> whereas this patch adds -Wshadow=local for the host compiler (HOSTCC).
>
>


Thanks for helping, :)

Leonardo Bras

>
> --
> Best Regards
> Masahiro Yamada


Re: [PATCH 1/4] Adds -Wshadow=local on KBUILD_HOSTCFLAGS

2018-10-17 Thread Leonardo Bras
Hello Boris,


> Next question: why not -Wshadow simply?

Good question. I can change it to -Wshadow and fix stuff for v2.

>
> Also, -Wshadow for the target compiler is an extra warning (W=2). Why is
> it enabled by default here?

The idea was to enable it by default and fix all the shadowing warnings.
What do you think?  I am open to suggestions.

Thank you,
Leonardo Bras


> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [Lkcamp] [PATCH 4/4] Changes macro usage to avoid shadowing a variable.

2018-10-17 Thread Leonardo Bras
Hello Helen,

On Wed, Oct 17, 2018 at 1:18 AM Helen Koike  wrote:
>
> Hi Leonardo,
>
> Thanks for the patch, just some small comments below.
>
> On 10/16/18 9:09 PM, Leonardo Brás wrote:
> > Changes the usage of DEF_FIELD_ADDR in this function to create a
> > reference and operate over it using an aux variable.
> > It also changes the loop logic used to find duplicates, to avoid
> > creating another variable.
> >
> > Signed-off-by: Leonardo Brás 
> > ---
> >  scripts/mod/file2alias.c | 14 --
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/scripts/mod/file2alias.c b/scripts/mod/file2alias.c
> > index 7be43697ff84..9ea1db2aefdb 100644
> > --- a/scripts/mod/file2alias.c
> > +++ b/scripts/mod/file2alias.c
> > @@ -641,25 +641,27 @@ static void do_pnp_card_entries(void *symval, 
> > unsigned long size,
> >   unsigned int i;
> >
> >   device_id_check(mod->name, "pnp", size, id_size, symval);
> > + DEF_FIELD_ADDR(symval, pnp_card_device_id, devs);
> > + typeof(devs) devs_last;
> >
> >   for (i = 0; i < count; i++) {
> >   unsigned int j;
> > - DEF_FIELD_ADDR(symval + i*id_size, pnp_card_device_id, devs);
> > + devs_last = devs + i * id_size;
> >
> >   for (j = 0; j < PNP_MAX_DEVICES; j++) {
> > - const char *id = (char *)(*devs)[j].id;
> > - int i2, j2;
> > + const char *id = (char *)(*devs_last)[j].id;
> > + int j2;
> >   int dup = 0;
> >
> >   if (!id[0])
> >   break;
> >
> >   /* find duplicate, already added value */
> > - for (i2 = 0; i2 < i && !dup; i2++) {
> > - DEF_FIELD_ADDR(symval + i2*id_size, 
> > pnp_card_device_id, devs);
> > + while ((devs_last -= id_size) >= devs) {
>
> You forgot to consider the dup variable.

Sorry, I did not get it. Could you explain it?

> Also you inverted the order of this loop, I am not sure this is
> important (I just took a quick look) but you need to be careful.

It seems to be looking for a duplicate. I don't think inverting it will
break anything.
But I will take a better look anyway.

>
> This is also hard to read,
Humm, IMO it looks easier to read this way. But ok, I can change it back.

> I would define another variant macro where
> you can set any name of the variable, e.g.
>
> #define DEF_FIELD_ADDR_VAR(m, devid, f, var) \
> typeof(((struct devid *)0)->f) *var = ((m) + OFF_##devid##_##f)
>
> In this way you don't need to change the logic, just the name of the
> variable.
>

Good suggestion. Is it ok to create a macro that is used only once?
It looks like it's not worth the trouble.


> >
> >   for (j2 = 0; j2 < PNP_MAX_DEVICES; j2++) {
> > -     const char *id2 = (char 
> > *)(*devs)[j2].id;
> > + const char *id2 =
> > + (char *)(*devs_last)[j2].id;
> >
> >   if (!id2[0])
> >   break;
> >
>
> Regards,
> Helen

Thank you for reviewing,

Leonardo Bras


Re: [Lkcamp] [PATCH 3/4] kbuild: Removes unnecessary shadowed local variable and optimize testing.

2018-10-17 Thread Leonardo Bras
Hello Helen,

On Wed, Oct 17, 2018 at 12:06 AM Helen Koike  wrote:
>
> Hi Leonardo,
>
> On 10/16/18 9:09 PM, Leonardo Brás wrote:
> > Removes an unnecessary shadowed local variable (start).
> > Optimize test of isdigit:
> > - If isalpha returns true, isdigit will return false, so no need to 
> > test.
>
> This patch does two different things, it should be in two separated patches.
Sure, no problem.

>
> >
> > Signed-off-by: Leonardo Brás 
> > ---
> >  scripts/asn1_compiler.c | 7 +++
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/scripts/asn1_compiler.c b/scripts/asn1_compiler.c
> > index c146020fc783..08bb6e5fd6ad 100644
> > --- a/scripts/asn1_compiler.c
> > +++ b/scripts/asn1_compiler.c
> > @@ -413,7 +413,7 @@ static void tokenise(char *buffer, char *end)
> >
> >   /* Handle string tokens */
> >   if (isalpha(*p)) {
> > - const char **dir, *start = p;
> > + const char **dir;
> >
> >   /* Can be a directive, type name or element
> >* name.  Find the end of the name.
> > @@ -454,10 +454,9 @@ static void tokenise(char *buffer, char *end)
> >
> >   tokens[tix++].token_type = TOKEN_TYPE_NAME;
> >   continue;
> > - }
> > + } else if (isdigit(*p)) {
> > + /* Handle numbers */
>
> Actually you can't do that, p is being altered in the first if statement.

Yeah, you are right. I will remove this logic for v2.

>
> >
> > - /* Handle numbers */
> > - if (isdigit(*p)) {
> >   /* Find the end of the number */
> >   q = p + 1;
> >   while (q < nl && (isdigit(*q)))
> >
>
> Regards
> Helen

Thanks!

Leonardo Brás


Re: [PATCH v3 3/7] drivers: parisc: Avoids building driver if CONFIG_PARISC is disabled

2018-10-03 Thread Leonardo Bras
On Fri, Sep 28, 2018 at 4:15 AM James Bottomley
 wrote:
>
> On Thu, 2018-09-27 at 23:08 -0300, Leonardo Brás wrote:
> > Avoids building driver if 'make drivers/parisc/' is called and
> > CONFIG_PARISC is disabled.
>
> Is that really a problem? The drivers/Makefile has this:
>
> obj-$(CONFIG_PARISC)+= parisc/
> And you just overrode that by forcing the build.  It's not even clear
> we should refuse the build in that case; how would we know you don't
> have a legitimate reason for the override?
>

Sorry I did not explain my reasons earlier. I sent everybody involved an
e-mail explaining the full reason for this change.
(For reference it's here: https://lkml.org/lkml/2018/10/3/707)

> Signed-off-by: Leonardo Brás 
> > ---
> >  drivers/parisc/Makefile | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/parisc/Makefile b/drivers/parisc/Makefile
> > index 3cd5e6cb8478..80049d763aa0 100644
> > --- a/drivers/parisc/Makefile
> > +++ b/drivers/parisc/Makefile
> > @@ -24,5 +24,5 @@ obj-$(CONFIG_EISA)  += eisa.o
> > eisa_enumerator.o eisa_eeprom.o
> >  obj-$(CONFIG_SUPERIO)+= superio.o
> >  obj-$(CONFIG_CHASSIS_LCD_LED)+= led.o
> >  obj-$(CONFIG_PDC_STABLE) += pdc_stable.o
> > -obj-y+= power.o
> > +obj-$(CONFIG_PARISC) += power.o
>
> If we conclude the use case is legitimate, that's not enough: the two
> inner symbols are PARISC only but CONFIG_EISA isn't.

You are right.
It worked for my needs because I am only building the drivers, and not
linking them. But I believe doing something like I did in zorro/Makefile
would fix all of this. (For reference, https://lkml.org/lkml/2018/9/28/150 )

If you agree, I will send the next patchset with this change.

Thanks for your help!

Leonardo Bras


Re: [PATCH v3 5/7] drivers: s390: Avoids building drivers if ARCH is not s390.

2018-10-03 Thread Leonardo Bras
On Mon, Oct 1, 2018 at 9:46 AM Heiko Carstens  wrote:
>
> On Thu, Sep 27, 2018 at 11:08:14PM -0300, Leonardo Brás wrote:
> > Avoids building s390 drivers if 'make drivers/s390/' is called but
> > ARCH is not s390.
> >
> > Signed-off-by: Leonardo Brás 
> > ---
> >  drivers/s390/Makefile | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/s390/Makefile b/drivers/s390/Makefile
> > index a863b0462b43..0575f02dba45 100644
> > --- a/drivers/s390/Makefile
> > +++ b/drivers/s390/Makefile
> > @@ -3,7 +3,7 @@
> >  # Makefile for the S/390 specific device drivers
> >  #
> >
> > -obj-y += cio/ block/ char/ crypto/ net/ scsi/ virtio/
> > -
> > -drivers-y += drivers/s390/built-in.a
> > -
> > +ifeq ($(ARCH),s390)
> > + obj-y += cio/ block/ char/ crypto/ net/ scsi/ virtio/
> > + drivers-y += drivers/s390/built-in.a
> > +endif
>
> And then somebody wants to build with e.g. "make drivers/s390/cio/" and it
> still doesn't work. So _if_ this should be supported then it should work
> with all directory levels and all configuration options. Otherwise this is
> going to be a never ending story.
>


It makes sense.
I proposed this change to help me solve a problem described here
(https://lkml.org/lkml/2018/10/3/707), and for this it was enough if it didn't
build when "make drivers/s390/" was called.

Sorry I didn't send the e-mail with the reason earlier.

It was not necessary to solve my problem, but if you think it's interesting,
I could refactor all drivers/s390 Makefiles to make them all build only if
we are dealing with the s390 architecture.

What do you think?

Thanks for the reply,

Leonardo Bras


Re: [PATCH v3 0/7] Remove errors building drivers/DRIVERNAME

2018-10-03 Thread Leonardo Bras
On Wed, Oct 3, 2018 at 8:27 PM Finn Thain  wrote:
>
> On Wed, 3 Oct 2018, Leonardo Bras wrote:
>
> >
> > Sorry, I was not very clear about my reasons why this change is important;
> > I will try to briefly explain the whole story.
> >
> > Some weeks ago I was trying to solve a task that needed to change some
> > compiling options, build the whole kernel (allyesconfig) and look for
> > errors. The problem was: It would take a long time to build everything
> > in my computer. And many friends with slimmer laptops would take much
> > longer.
> >
>
> It seems to me that you shouldn't need expensive optimization for
> continuous integration. In theory that could make a big difference though
> I admit to no experience of build farms.
>
> Have you looked at ccache? It will save you from having to recompile any
> files not changed.
>
> You could also leverage all of your laptops together using distcc.
>
> HTH.
>
> --

Both ccache and distcc seem very interesting,
I will take my time to study them better as they can solve some situations
I face. Thanks for sharing!
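If I understood it correctly, even something as simple as the line below
might already help with repeated local builds (untested on my side):

    make CC="ccache gcc" -j"$(nproc)"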

Still, I believe my patchset is relevant, as it would enable people to rebuild
the whole kernel (with build flag changes, for example) at the cost of only
git-pushing the changes to a fork of my Gitlab repo, and waiting 3 hours.

Thanks for helping!


Re: [PATCH v3 3/7] drivers: parisc: Avoids building driver if CONFIG_PARISC is disabled

2018-10-04 Thread Leonardo Bras
> Well it's not really that persuasive.  Most people simply let the build
> run to completion, but if you have a problem with a job control 3h
> timelimit, then create a job that kills itself at 2:59 and then
> resubmits itself.  That will produce a complete build in 3h chunks
> without any need to call sub Makefiles.
>

Humm, I probably should have explained better how GitlabCI works.
It works by creating a Docker container that has a limited lifespan of 3h max.
Once that time is over, the container ceases to exist, leaving no build
objects, only the console log. So there is no way of 'resuming' the build
from where it stopped. I used the term 'job' because that's what they call it,
and I understand it's easily confused with shell jobs.

> All of our Makefiles are coded assuming the upper level can prevent
> descent into the lower ones.  You're proposing to change that
> assumption, requiring a fairly large patch set, which doesn't really
> seem to provide a huge benefit.
>
> James

I understand your viewpoint.
But what I propose is not to change that assumption, but instead to give some
Makefiles the additional ability to be called directly and still not build
anything that is not enabled in .config.

But, why these chosen Makefiles, and not all of them?
Granularity.
What I am trying to achieve with this patchset is the ability to build
smaller sets of drivers without accidentally building what is not enabled
in .config.
And, in my view, building a single drivers/DRIVERNAME is small enough to
be fast in most situations.
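
For example, each GitlabCI job could then do something like the following
(a rough sketch of the intent only, not the actual CI script):

    # One job per driver directory, each finishing well inside the 3h limit.
    make allyesconfig
    make -j"$(nproc)" drivers/scsi/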

This change is not supposed to interfere with the usual way of building the
kernel, and it is not even supposed to add overhead to kernel compilation. And
it would, at least, solve my problem with the 3h limit, and enable the tool
I am building on GitlabCI to help other developers.

Thanks for reading,

Leonardo Bras


Re: [PATCH 1/1] kbuild: Optimize tests and remove shadowed local variable.

2018-10-04 Thread Leonardo Bras
Hello David,

My name is Leonardo and I am new to kernel development.

Is this patch acceptable? Does it need some rework? Does the change make sense?
Is there a way to better follow the workflow for this patch?

Please let me know if it needs anything.

Best regards,

Leonardo Bras

On Wed, Sep 19, 2018 at 4:11 AM Masahiro Yamada
 wrote:
>
> FW: David Howells
>
>
> 2018-09-19 8:00 GMT+09:00 Leonardo Brás :
> > Removes an unnecessary shadowed local variable (start).
> > Optimize test of isdigit:
> > - If isalpha returns true, isdigit will return false, so no need to test.
> >
> > Signed-off-by: Leonardo Brás 
>
>
> This patch was sent to me, but maybe belong to David's field.
>
> David, will you take care of this patch?
>
> https://lore.kernel.org/patchwork/patch/988171/
>
> I think the commit subject should be changed kbuild: -> ASN.1:
>
>
> Anyway, this patch looks trivial,
> FWIW
>
> Reviewed-by: Masahiro Yamada 
>
>
>
>
>
> > ---
> >  scripts/asn1_compiler.c | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/scripts/asn1_compiler.c b/scripts/asn1_compiler.c
> > index c146020fc783..a0056df4e358 100644
> > --- a/scripts/asn1_compiler.c
> > +++ b/scripts/asn1_compiler.c
> > @@ -413,7 +413,7 @@ static void tokenise(char *buffer, char *end)
> >
> > /* Handle string tokens */
> > if (isalpha(*p)) {
> > -   const char **dir, *start = p;
> > +   const char **dir;
> >
> > /* Can be a directive, type name or element
> >  * name.  Find the end of the name.
> > @@ -454,10 +454,10 @@ static void tokenise(char *buffer, char *end)
> >
> > tokens[tix++].token_type = TOKEN_TYPE_NAME;
> > continue;
> > -   }
> >
> > -   /* Handle numbers */
> > -   if (isdigit(*p)) {
> > +   } else if (isdigit(*p)) {
> > +   /* Handle numbers */
> > +
> > /* Find the end of the number */
> > q = p + 1;
> > while (q < nl && (isdigit(*q)))
> > --
> > 2.19.0
> >
>
>
>
> --
> Best Regards
> Masahiro Yamada


Re: [PATCH v2 3/5] Creates macro to avoid variable shadowing

2018-10-23 Thread Leonardo Bras
Hello Helen,
On Tue, Oct 23, 2018 at 6:35 PM Helen Koike  wrote:
> Like this you are not changing the logic in this file, you are just
> renaming the variable, which is more than enough to solve the -Wshadow
> warning.
> I don't think it is worth changing the logic.

You are right, that was what the patch was supposed to do. For some
crazy reason I must have submitted the wrong commit as a patch.

I will fix it and send you all a v3 ASAP.

Sorry for the inconvenience.
Thanks!

Leonardo Brás


Re: [PATCH v2 1/5] x86/vdso: Renames variable to fix shadow warning.

2018-10-23 Thread Leonardo Bras
Hello Andy,

Yeah, it's much better. I will replace this part of the commit message.
It will be fixed for v3.

Thank you!


On Mon, Oct 22, 2018 at 11:13 PM Andy Lutomirski  wrote:

> How about “The go32() and go64() functions has an argument and a local 
> variable called ‘name’.  Rename both to clarify the code and to fix a warning 
> with -Wshadow.”
>
> Other than that, Acked-by: Andy Lutomirski 

[PATCH v3 1/5] x86/vdso: Renames variable to fix shadow warning.

2018-10-23 Thread Leonardo Bras
The go32() and go64() functions have an argument and a local variable
called ‘name’.  Rename both to clarify the code and to fix a warning
with -Wshadow.

Signed-off-by: Leonardo Bras 
---
 arch/x86/entry/vdso/vdso2c.h | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index fa847a620f40..a20b134de2a8 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -7,7 +7,7 @@
 
 static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
 void *stripped_addr, size_t stripped_len,
-FILE *outfile, const char *name)
+FILE *outfile, const char *image_name)
 {
int found_load = 0;
unsigned long load_size = -1;  /* Work around bogus warning */
@@ -93,11 +93,12 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
int k;
ELF(Sym) *sym = raw_addr + GET_LE(&symtab_hdr->sh_offset) +
GET_LE(&symtab_hdr->sh_entsize) * i;
-   const char *name = raw_addr + GET_LE(&strtab_hdr->sh_offset) +
-   GET_LE(&sym->st_name);
+   const char *sym_name = raw_addr +
+  GET_LE(&strtab_hdr->sh_offset) +
+  GET_LE(&sym->st_name);
 
for (k = 0; k < NSYMS; k++) {
-   if (!strcmp(name, required_syms[k].name)) {
+   if (!strcmp(sym_name, required_syms[k].name)) {
if (syms[k]) {
fail("duplicate symbol %s\n",
 required_syms[k].name);
@@ -134,7 +135,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
if (syms[sym_vvar_start] % 4096)
fail("vvar_begin must be a multiple of 4096\n");
 
-   if (!name) {
+   if (!image_name) {
fwrite(stripped_addr, stripped_len, 1, outfile);
return;
}
@@ -157,7 +158,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
}
fprintf(outfile, "\n};\n\n");
 
-   fprintf(outfile, "const struct vdso_image %s = {\n", name);
+   fprintf(outfile, "const struct vdso_image %s = {\n", image_name);
fprintf(outfile, "\t.data = raw_data,\n");
fprintf(outfile, "\t.size = %lu,\n", mapping_size);
if (alt_sec) {
-- 
2.19.1



[PATCH v3 4/5] modpost: Changes parameter name to avoid shadowing.

2018-10-23 Thread Leonardo Bras
Changes the parameter name to avoid shadowing a variable.

Signed-off-by: Leonardo Bras 
---
 scripts/mod/modpost.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 0d998c54564d..368fe42340df 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -2228,13 +2228,13 @@ static int add_versions(struct buffer *b, struct module 
*mod)
 }
 
 static void add_depends(struct buffer *b, struct module *mod,
-   struct module *modules)
+   struct module *module_list)
 {
struct symbol *s;
struct module *m;
int first = 1;
 
-   for (m = modules; m; m = m->next)
+   for (m = module_list; m; m = m->next)
m->seen = is_vmlinux(m->name);
 
buf_printf(b, "\n");
-- 
2.19.1



[PATCH v3 2/5] kbuild: Removes unnecessary shadowed local variable.

2018-10-23 Thread Leonardo Bras
Removes an unnecessary shadowed local variable (start).
It was used only once, with the same value it was started before
the if block.

Signed-off-by: Leonardo Bras 
---
 scripts/asn1_compiler.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/asn1_compiler.c b/scripts/asn1_compiler.c
index c146020fc783..1b28787028d3 100644
--- a/scripts/asn1_compiler.c
+++ b/scripts/asn1_compiler.c
@@ -413,7 +413,7 @@ static void tokenise(char *buffer, char *end)
 
/* Handle string tokens */
if (isalpha(*p)) {
-   const char **dir, *start = p;
+   const char **dir;
 
/* Can be a directive, type name or element
 * name.  Find the end of the name.
-- 
2.19.1



[PATCH v3 3/5] Creates macro to avoid variable shadowing

2018-10-23 Thread Leonardo Bras
Creates DEF_FIELD_ADDR_VAR as a more generic version of the DEF_FIELD_ADDR
macro, allowing usage of a variable name other than the struct element name.
Also, sets DEF_FIELD_ADDR as a specific usage of DEF_FIELD_ADDR_VAR in which
the var name is the same as the struct element name.
Then, makes use of DEF_FIELD_ADDR_VAR to create a variable of another name,
in order to avoid variable shadowing.

Signed-off-by: Leonardo Bras 
---
 scripts/mod/file2alias.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/scripts/mod/file2alias.c b/scripts/mod/file2alias.c
index 7be43697ff84..ed468313ddeb 100644
--- a/scripts/mod/file2alias.c
+++ b/scripts/mod/file2alias.c
@@ -95,12 +95,20 @@ extern struct devtable *__start___devtable[], 
*__stop___devtable[];
  */
 #define DEF_FIELD(m, devid, f) \
typeof(((struct devid *)0)->f) f = TO_NATIVE(*(typeof(f) *)((m) + 
OFF_##devid##_##f))
+
+/* Define a variable v that holds the address of field f of struct devid
+ * based at address m.  Due to the way typeof works, for a field of type
+ * T[N] the variable has type T(*)[N], _not_ T*.
+ */
+#define DEF_FIELD_ADDR_VAR(m, devid, f, v) \
+   typeof(((struct devid *)0)->f) *v = ((m) + OFF_##devid##_##f)
+
 /* Define a variable f that holds the address of field f of struct devid
  * based at address m.  Due to the way typeof works, for a field of type
  * T[N] the variable has type T(*)[N], _not_ T*.
  */
 #define DEF_FIELD_ADDR(m, devid, f) \
-   typeof(((struct devid *)0)->f) *f = ((m) + OFF_##devid##_##f)
+   DEF_FIELD_ADDR_VAR(m, devid, f, f)
 
 /* Add a table entry.  We test function type matches while we're here. */
 #define ADD_TO_DEVTABLE(device_id, type, function) \
@@ -644,7 +652,7 @@ static void do_pnp_card_entries(void *symval, unsigned long 
size,
 
for (i = 0; i < count; i++) {
unsigned int j;
-   DEF_FIELD_ADDR(symval + i*id_size, pnp_card_device_id, devs);
+   DEF_FIELD_ADDR(symval + i * id_size, pnp_card_device_id, devs);
 
for (j = 0; j < PNP_MAX_DEVICES; j++) {
const char *id = (char *)(*devs)[j].id;
@@ -656,10 +664,13 @@ static void do_pnp_card_entries(void *symval, unsigned 
long size,
 
/* find duplicate, already added value */
for (i2 = 0; i2 < i && !dup; i2++) {
-   DEF_FIELD_ADDR(symval + i2*id_size, 
pnp_card_device_id, devs);
+   DEF_FIELD_ADDR_VAR(symval + i2 * id_size,
+  pnp_card_device_id,
+  devs, devs_dup);
 
for (j2 = 0; j2 < PNP_MAX_DEVICES; j2++) {
-   const char *id2 = (char 
*)(*devs)[j2].id;
+   const char *id2 =
+   (char *)(*devs_dup)[j2].id;
 
if (!id2[0])
break;
-- 
2.19.1



[PATCH v3 5/5] Adds -Wshadow on KBUILD_HOSTCFLAGS

2018-10-23 Thread Leonardo Bras
Adds -Wshadow on KBUILD_HOSTCFLAGS to show shadow warnings
on tools built for HOST.

Signed-off-by: Leonardo Bras 
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index e8b599b4dcde..3edae5d359b5 100644
--- a/Makefile
+++ b/Makefile
@@ -360,7 +360,7 @@ HOST_LFS_LIBS := $(shell getconf LFS_LIBS 2>/dev/null)
 
 HOSTCC   = gcc
 HOSTCXX  = g++
-KBUILD_HOSTCFLAGS   := -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 \
+KBUILD_HOSTCFLAGS   := -Wall -Wshadow -Wmissing-prototypes -Wstrict-prototypes 
-O2 \
-fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS) \
$(HOSTCFLAGS)
 KBUILD_HOSTCXXFLAGS := -O2 $(HOST_LFS_CFLAGS) $(HOSTCXXFLAGS)
-- 
2.19.1



[PATCH v3 0/5] Adds -Wshadow on KBUILD_HOSTCFLAGS and fix warnings

2018-10-23 Thread Leonardo Bras
This patchset add -Wshadow on KBUILD_HOSTCFLAGS and fixes
all code that show this warning.

Changes in v3:
- Better cover letter
- Better commit message for patch 1/5
- Fixes patch 3/5 so it contains what it was actually supposed to change
- Removes the accent from my surname (better for searching at lkml.org)


v2: https://lkml.org/lkml/2018/10/23/151
v1: https://lkml.org/lkml/2018/10/17/169

Leonardo Bras (5):
  x86/vdso: Renames variable to fix shadow warning.
  kbuild: Removes unnecessary shadowed local variable.
  Creates macro to avoid variable shadowing
  modpost: Changes parameter name to avoid shadowing.
  Adds -Wshadow on KBUILD_HOSTCFLAGS

 Makefile |  2 +-
 arch/x86/entry/vdso/vdso2c.h | 13 +++--
 scripts/asn1_compiler.c  |  2 +-
 scripts/mod/file2alias.c | 19 +++
 scripts/mod/modpost.c|  4 ++--
 5 files changed, 26 insertions(+), 14 deletions(-)

-- 
2.19.1



Re: [PATCH 1/4] Adds -Wshadow=local on KBUILD_HOSTCFLAGS

2018-10-18 Thread Leonardo Bras
On Thu, Oct 18, 2018 at 11:42 PM Masahiro Yamada
 wrote:
> Adding -Wshadow to KBUILD_HOSTCFLAGS emits another warning in Kconfig.
> Of course, it is easy to fix.
For v2, I already replaced '-Wshadow=local' with '-Wshadow' and fixed this
warning.

> But, I just started to think this option is a kind of harsh...

The v2 is almost done. Do you think it will be useful, or should I drop it?

Regards,
Leonardo Bras


Re: [PATCH V12 03/14] riscv: errata: Move errata vendor func-id into vendorid_list.h

2024-01-03 Thread Leonardo Bras
On Mon, Dec 25, 2023 at 07:58:36AM -0500, guo...@kernel.org wrote:
> From: Guo Ren 
> 
> Move errata vendor func-id definitions from errata_list into
> vendorid_list.h. Unifying these definitions is also for following
> rwonce errata implementation.
> 
> Suggested-by: Leonardo Bras 
> Link: https://lore.kernel.org/linux-riscv/zqlfj1cmq8pao...@redhat.com/
> Signed-off-by: Guo Ren 
> Signed-off-by: Guo Ren 
> ---
>  arch/riscv/include/asm/errata_list.h   | 18 --
>  arch/riscv/include/asm/vendorid_list.h | 18 ++
>  2 files changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/riscv/include/asm/errata_list.h 
> b/arch/riscv/include/asm/errata_list.h
> index 83ed25e43553..31bbd9840e97 100644
> --- a/arch/riscv/include/asm/errata_list.h
> +++ b/arch/riscv/include/asm/errata_list.h
> @@ -11,24 +11,6 @@
>  #include 
>  #include 
>  
> -#ifdef CONFIG_ERRATA_ANDES
> -#define ERRATA_ANDESTECH_NO_IOCP 0
> -#define ERRATA_ANDESTECH_NUMBER  1
> -#endif
> -
> -#ifdef CONFIG_ERRATA_SIFIVE
> -#define  ERRATA_SIFIVE_CIP_453 0
> -#define  ERRATA_SIFIVE_CIP_1200 1
> -#define  ERRATA_SIFIVE_NUMBER 2
> -#endif
> -
> -#ifdef CONFIG_ERRATA_THEAD
> -#define  ERRATA_THEAD_PBMT 0
> -#define  ERRATA_THEAD_CMO 1
> -#define  ERRATA_THEAD_PMU 2
> -#define  ERRATA_THEAD_NUMBER 3
> -#endif
> -
>  #ifdef __ASSEMBLY__
>  
>  #define ALT_INSN_FAULT(x)\
> diff --git a/arch/riscv/include/asm/vendorid_list.h 
> b/arch/riscv/include/asm/vendorid_list.h
> index e55407ace0c3..c503373193d2 100644
> --- a/arch/riscv/include/asm/vendorid_list.h
> +++ b/arch/riscv/include/asm/vendorid_list.h
> @@ -9,4 +9,22 @@
>  #define SIFIVE_VENDOR_ID 0x489
>  #define THEAD_VENDOR_ID  0x5b7
>  
> +#ifdef CONFIG_ERRATA_ANDES
> +#define ERRATA_ANDESTECH_NO_IOCP 0
> +#define ERRATA_ANDESTECH_NUMBER  1
> +#endif
> +
> +#ifdef CONFIG_ERRATA_SIFIVE
> +#define  ERRATA_SIFIVE_CIP_453 0
> +#define  ERRATA_SIFIVE_CIP_1200 1
> +#define  ERRATA_SIFIVE_NUMBER 2
> +#endif
> +
> +#ifdef CONFIG_ERRATA_THEAD
> +#define      ERRATA_THEAD_PBMT 0
> +#define  ERRATA_THEAD_CMO 1
> +#define  ERRATA_THEAD_PMU 2
> +#define  ERRATA_THEAD_NUMBER 3
> +#endif
> +
>  #endif
> -- 
> 2.40.1
> 

LGTM:
Reviewed-by: Leonardo Bras 




Re: [PATCH V12 06/14] riscv: qspinlock: Introduce combo spinlock

2024-01-03 Thread Leonardo Bras
On Mon, Dec 25, 2023 at 07:58:39AM -0500, guo...@kernel.org wrote:
> From: Guo Ren 
> 
> Combo spinlock could support queued and ticket in one Linux Image and
> select them during boot time via command line. Here is the func
> size (Bytes) comparison table below:
> 
> TYPE  : COMBO | TICKET | QUEUED
> arch_spin_lock: 106   | 60 | 50
> arch_spin_unlock  : 54| 36 | 26
> arch_spin_trylock : 110   | 72 | 54
> arch_spin_is_locked   : 48| 34 | 20
> arch_spin_is_contended: 56| 40 | 24
> arch_spin_value_unlocked  : 48| 34 | 24
> 
> One example of disassemble combo arch_spin_unlock:
>   <+14>:nop# detour slot
>   <+18>:fence   rw,w   --+-> queued_spin_unlock
>   <+22>:sb  zero,0(a4) --+   (2 instructions)
>   <+26>:ld  s0,8(sp)
>   <+28>:addisp,sp,16
>   <+30>:ret
>   <+32>:lw  a5,0(a4)   --+-> ticket_spin_unlock
>   <+34>:sext.w  a5,a5|   (7 instructions)
>   <+36>:fence   rw,w |
>   <+40>:addiw   a5,a5,1  |
>   <+42>:sllia5,a5,0x30   |
>   <+44>:srlia5,a5,0x30   |
>   <+46>:sh  a5,0(a4)   --+
>   <+50>:ld  s0,8(sp)
>   <+52>:addisp,sp,16
>   <+54>:ret
> The qspinlock is smaller and faster than ticket-lock when all are in a
> fast path.
> 
> The combo spinlock could provide a compatible Linux Image for different
> micro-arch designs that have/haven't forward progress guarantee. Use
> command line options to select between qspinlock and ticket-lock, and
> the default is ticket-lock.
> 
> Signed-off-by: Guo Ren 
> Signed-off-by: Guo Ren 
> ---
>  .../admin-guide/kernel-parameters.txt |  2 +
>  arch/riscv/Kconfig|  9 +++-
>  arch/riscv/include/asm/spinlock.h | 48 +++
>  arch/riscv/kernel/setup.c | 34 +
>  include/asm-generic/qspinlock.h   |  2 +
>  include/asm-generic/ticket_spinlock.h |  2 +
>  6 files changed, 96 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 65731b060e3f..2ac9f1511774 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4753,6 +4753,8 @@
>   [KNL] Number of legacy pty's. Overwrites compiled-in
>   default number.
>  
> + qspinlock   [RISCV] Use native qspinlock.
> +
>   quiet   [KNL] Disable most log messages
>  
>   r128=   [HW,DRM]
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index f345df0763b2..b7673c5c0997 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -434,7 +434,7 @@ config NODES_SHIFT
>  
>  choice
>   prompt "RISC-V spinlock type"
> - default RISCV_TICKET_SPINLOCKS
> + default RISCV_COMBO_SPINLOCKS
>  
>  config RISCV_TICKET_SPINLOCKS
>   bool "Using ticket spinlock"
> @@ -446,6 +446,13 @@ config RISCV_QUEUED_SPINLOCKS
>   help
> Make sure your micro arch give cmpxchg/xchg forward progress
> guarantee. Otherwise, stay at ticket-lock.
> +
> +config RISCV_COMBO_SPINLOCKS
> + bool "Using combo spinlock"
> + depends on SMP && MMU
> + select ARCH_USE_QUEUED_SPINLOCKS
> + help
> +   Select queued spinlock or ticket-lock by cmdline.
>  endchoice
>  
>  config RISCV_ALTERNATIVE
> diff --git a/arch/riscv/include/asm/spinlock.h 
> b/arch/riscv/include/asm/spinlock.h
> index 98a3da4b1056..d07643c07aae 100644
> --- a/arch/riscv/include/asm/spinlock.h
> +++ b/arch/riscv/include/asm/spinlock.h
> @@ -7,12 +7,60 @@
>  #define _Q_PENDING_LOOPS (1 << 9)
>  #endif
>  
> +#ifdef CONFIG_RISCV_COMBO_SPINLOCKS
> +#define __no_arch_spinlock_redefine
> +#include 
> +#include 
> +#include 
> +
> +DECLARE_STATIC_KEY_TRUE(combo_qspinlock_key);
> +
> +#define COMBO_SPINLOCK_BASE_DECLARE(op)  
> \
> +static __always_inline void arch_spin_##op(arch_spinlock_t *lock)\
> +{\
> + if (static_branch_likely(&combo_qspinlock_key)) \
> + queued_spin_##op(lock); \
> + else\
> + ticket_spin_##op(lock); \
> +}
> +COMBO_SPINLOCK_BASE_DECLARE(lock)
> +COMBO_SPINLOCK_BASE_DECLARE(unlock)
> +
> +#define COMBO_SPINLOCK_IS_DECLARE(op)
> \
> +static __always_inline int arch_spin_##op(arch_spinlock_t *lock) \
> +{\
> + if (static_branch_likely(&combo_qspinlock_key)) \
> + return queued_spin_##op(lock);  \
> + else   

Re: [PATCH V12 07/14] riscv: qspinlock: Add virt_spin_lock() support for VM guest

2024-01-03 Thread Leonardo Bras
On Mon, Dec 25, 2023 at 07:58:40AM -0500, guo...@kernel.org wrote:
> From: Guo Ren 
> 
> Add a static key controlling whether virt_spin_lock() should be
> called or not. When running on bare metal set the new key to
> false.
> 
> The VM guests should fall back to a Test-and-Set spinlock,
> because fair locks have horrible lock 'holder' preemption issues.
> The virt_spin_lock_key would shortcut for the queued_spin_lock_-
> slowpath() function that allow virt_spin_lock to hijack it.
> 
> Signed-off-by: Guo Ren 
> Signed-off-by: Guo Ren 
> ---
>  .../admin-guide/kernel-parameters.txt |  4 +++
>  arch/riscv/include/asm/spinlock.h | 22 
>  arch/riscv/kernel/setup.c | 26 +++
>  3 files changed, 52 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 2ac9f1511774..b7794c96d91e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3997,6 +3997,10 @@
>   no_uaccess_flush
>   [PPC] Don't flush the L1-D cache after accessing user 
> data.
>  
> + no_virt_spin[RISC-V] Disable virt_spin_lock in VM guest to use
> + native_queued_spinlock when the nopvspin option is 
> enabled.
> + This would help vcpu=pcpu scenarios.
> +
>   novmcoredd  [KNL,KDUMP]
>   Disable device dump. Device dump allows drivers to
>   append dump data to vmcore so you can collect driver
> diff --git a/arch/riscv/include/asm/spinlock.h 
> b/arch/riscv/include/asm/spinlock.h
> index d07643c07aae..7bbcf3d9fff0 100644
> --- a/arch/riscv/include/asm/spinlock.h
> +++ b/arch/riscv/include/asm/spinlock.h
> @@ -4,6 +4,28 @@
>  #define __ASM_RISCV_SPINLOCK_H
>  
>  #ifdef CONFIG_QUEUED_SPINLOCKS
> +/*
> + * The KVM guests fall back to a Test-and-Set spinlock, because fair locks
> + * have horrible lock 'holder' preemption issues. The virt_spin_lock_key
> + * would shortcut for the queued_spin_lock_slowpath() function that allow
> + * virt_spin_lock to hijack it.
> + */
> +DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key);
> +
> +#define virt_spin_lock virt_spin_lock
> +static inline bool virt_spin_lock(struct qspinlock *lock)
> +{
> + if (!static_branch_likely(&virt_spin_lock_key))
> + return false;
> +
> + do {
> + while (atomic_read(&lock->val) != 0)
> + cpu_relax();
> + } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
> +
> + return true;
> +}
> +
>  #define _Q_PENDING_LOOPS (1 << 9)
>  #endif
>  
> diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
> index d9072a59831c..0bafb9fd6ea3 100644
> --- a/arch/riscv/kernel/setup.c
> +++ b/arch/riscv/kernel/setup.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -266,6 +267,27 @@ early_param("qspinlock", queued_spinlock_setup);
>  DEFINE_STATIC_KEY_TRUE(combo_qspinlock_key);
>  EXPORT_SYMBOL(combo_qspinlock_key);
>  
> +#ifdef CONFIG_QUEUED_SPINLOCKS
> +static bool no_virt_spin __ro_after_init;
> +static int __init no_virt_spin_setup(char *p)
> +{
> + no_virt_spin = true;
> +
> + return 0;
> +}
> +early_param("no_virt_spin", no_virt_spin_setup);
> +
> +DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
> +
> +static void __init virt_spin_lock_init(void)
> +{
> + if (no_virt_spin)
> + static_branch_disable(&virt_spin_lock_key);
> + else
> + pr_info("Enable virt_spin_lock\n");
> +}
> +#endif
> +
>  static void __init riscv_spinlock_init(void)
>  {
>   if (!enable_qspinlock) {
> @@ -274,6 +296,10 @@ static void __init riscv_spinlock_init(void)
>   } else {
>   pr_info("Queued spinlock: enabled\n");
>   }
> +
> +#ifdef CONFIG_QUEUED_SPINLOCKS
> + virt_spin_lock_init();
> +#endif
>  }
>  #endif
>  
> -- 
> 2.40.1
> 

LGTM:
Reviewed-by: Leonardo Bras 




Re: [PATCH V12 08/14] riscv: qspinlock: Force virt_spin_lock for KVM guests

2024-01-03 Thread Leonardo Bras
On Mon, Dec 25, 2023 at 07:58:41AM -0500, guo...@kernel.org wrote:
> From: Guo Ren 
> 
> Force to enable virt_spin_lock when KVM guest, because fair locks
> have horrible lock 'holder' preemption issues.
> 
> Suggested-by: Leonardo Bras 
> Link: https://lkml.kernel.org/kvm/zqk9-tn2mepxl...@redhat.com/
> Signed-off-by: Guo Ren 
> Signed-off-by: Guo Ren 
> ---
>  arch/riscv/include/asm/sbi.h | 8 
>  arch/riscv/kernel/sbi.c  | 2 +-
>  arch/riscv/kernel/setup.c| 6 +-
>  3 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
> index 0892f4421bc4..8f748d9e1b85 100644
> --- a/arch/riscv/include/asm/sbi.h
> +++ b/arch/riscv/include/asm/sbi.h
> @@ -51,6 +51,13 @@ enum sbi_ext_base_fid {
>   SBI_EXT_BASE_GET_MIMPID,
>  };
>  
> +enum sbi_ext_base_impl_id {
> + SBI_EXT_BASE_IMPL_ID_BBL = 0,
> + SBI_EXT_BASE_IMPL_ID_OPENSBI,
> + SBI_EXT_BASE_IMPL_ID_XVISOR,
> + SBI_EXT_BASE_IMPL_ID_KVM,
> +};
> +
>  enum sbi_ext_time_fid {
>   SBI_EXT_TIME_SET_TIMER = 0,
>  };
> @@ -276,6 +283,7 @@ int sbi_console_getchar(void);
>  long sbi_get_mvendorid(void);
>  long sbi_get_marchid(void);
>  long sbi_get_mimpid(void);
> +long sbi_get_firmware_id(void);
>  void sbi_set_timer(uint64_t stime_value);
>  void sbi_shutdown(void);
>  void sbi_send_ipi(unsigned int cpu);
> diff --git a/arch/riscv/kernel/sbi.c b/arch/riscv/kernel/sbi.c
> index 5a62ed1da453..4330aedf65fd 100644
> --- a/arch/riscv/kernel/sbi.c
> +++ b/arch/riscv/kernel/sbi.c
> @@ -543,7 +543,7 @@ static inline long sbi_get_spec_version(void)
>   return __sbi_base_ecall(SBI_EXT_BASE_GET_SPEC_VERSION);
>  }
>  
> -static inline long sbi_get_firmware_id(void)
> +long sbi_get_firmware_id(void)
>  {
>   return __sbi_base_ecall(SBI_EXT_BASE_GET_IMP_ID);
>  }
> diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
> index 0bafb9fd6ea3..e33430e9d97e 100644
> --- a/arch/riscv/kernel/setup.c
> +++ b/arch/riscv/kernel/setup.c
> @@ -281,6 +281,9 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
>  
>  static void __init virt_spin_lock_init(void)
>  {
> + if (sbi_get_firmware_id() != SBI_EXT_BASE_IMPL_ID_KVM)
> + no_virt_spin = true;
> +
>   if (no_virt_spin)
>   static_branch_disable(&virt_spin_lock_key);
>   else
> @@ -290,7 +293,8 @@ static void __init virt_spin_lock_init(void)
>  
>  static void __init riscv_spinlock_init(void)
>  {
> - if (!enable_qspinlock) {
> + if ((!enable_qspinlock) &&
> + (sbi_get_firmware_id() != SBI_EXT_BASE_IMPL_ID_KVM)) {
>   static_branch_disable(&combo_qspinlock_key);
>   pr_info("Ticket spinlock: enabled\n");
>   } else {
> -- 
> 2.40.1
> 

LGTM:
Reviewed-by: Leonardo Bras 




Re: [PATCH net-next] virtio_net: Fix napi_skb_cache_put warning

2024-09-03 Thread Leonardo Bras
On Mon, Jul 15, 2024 at 04:25:06AM -0700, Breno Leitao wrote:
> Hello Michael,
> 
> On Sun, Jul 14, 2024 at 03:38:42AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Jul 12, 2024 at 04:53:25AM -0700, Breno Leitao wrote:
> > > After the commit bdacf3e34945 ("net: Use nested-BH locking for
> > > napi_alloc_cache.") was merged, the following warning began to appear:
> > > 
> > >WARNING: CPU: 5 PID: 1 at net/core/skbuff.c:1451 
> > > napi_skb_cache_put+0x82/0x4b0
> > > 
> > > __warn+0x12f/0x340
> > > napi_skb_cache_put+0x82/0x4b0
> > > napi_skb_cache_put+0x82/0x4b0
> > > report_bug+0x165/0x370
> > > handle_bug+0x3d/0x80
> > > exc_invalid_op+0x1a/0x50
> > > asm_exc_invalid_op+0x1a/0x20
> > > __free_old_xmit+0x1c8/0x510
> > > napi_skb_cache_put+0x82/0x4b0
> > > __free_old_xmit+0x1c8/0x510
> > > __free_old_xmit+0x1c8/0x510
> > > __pfx___free_old_xmit+0x10/0x10
> > > 
> > > The issue arises because virtio is assuming it's running in NAPI context
> > > even when it's not, such as in the netpoll case.
> > > 
> > > To resolve this, modify virtnet_poll_tx() to only set NAPI when budget
> > > is available. Same for virtnet_poll_cleantx(), which always assumed that
> > > it was in a NAPI context.
> > > 
> > > Fixes: df133f3f9625 ("virtio_net: bulk free tx skbs")
> > > Suggested-by: Jakub Kicinski 
> > > Signed-off-by: Breno Leitao 
> > 
> > Acked-by: Michael S. Tsirkin 
> > 
> > though I'm not sure I understand the connection with bdacf3e34945.
> 
> The warning above appeared after bdacf3e34945 landed.

Hi Breno,
Thanks for fixing this!

I think the confusion is around the fact that the commit in the Fixes tag
(df133f3f9625) is different from the commit in the commit message
(bdacf3e34945).

Please help me check if the following is correct:
###
Any tree which includes df133f3f9625 ("virtio_net: bulk free tx skbs") 
should also include your patch, since it fixes stuff in there.

The fact that the warning was only made visible in 
bdacf3e34945 ("net: Use nested-BH locking for napi_alloc_cache.")
does not change the fact that it was already present before.

Also, having bdacf3e34945 is not necessary for the backport, since
it only made the bug visible.
###

Are above statements right?

It's important to make it clear since this helps the backporting process.

Thanks!
Leo




[PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-07 Thread Leonardo Bras
According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
will let the OS know all possible pagesizes that can be used for creating a
new DDW.

Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
128M, 256M and 16G.

Enabling bigger pages would be interesting for direct mapping systems
with a lot of RAM, while using fewer TCE entries.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 49 ++
 1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..6cda1c92597d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,20 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
+#define QUERY_DDW_PGSIZE_4K0x01
+#define QUERY_DDW_PGSIZE_64K   0x02
+#define QUERY_DDW_PGSIZE_16M   0x04
+#define QUERY_DDW_PGSIZE_32M   0x08
+#define QUERY_DDW_PGSIZE_64M   0x10
+#define QUERY_DDW_PGSIZE_128M  0x20
+#define QUERY_DDW_PGSIZE_256M  0x40
+#define QUERY_DDW_PGSIZE_16G   0x80
+
+struct iommu_ddw_pagesize {
+   u32 mask;
+   int shift;
+};
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
struct iommu_table_group *table_group;
@@ -1099,6 +1113,31 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
 }
 
+/* Returns page shift based on "IO Page Sizes" output at 
ibm,query-pe-dma-window. See LoPAR */
+static int iommu_get_page_shift(u32 query_page_size)
+{
+   const struct iommu_ddw_pagesize ddw_pagesize[] = {
+   { QUERY_DDW_PGSIZE_16G,  __builtin_ctz(SZ_16G)  },
+   { QUERY_DDW_PGSIZE_256M, __builtin_ctz(SZ_256M) },
+   { QUERY_DDW_PGSIZE_128M, __builtin_ctz(SZ_128M) },
+   { QUERY_DDW_PGSIZE_64M,  __builtin_ctz(SZ_64M)  },
+   { QUERY_DDW_PGSIZE_32M,  __builtin_ctz(SZ_32M)  },
+   { QUERY_DDW_PGSIZE_16M,  __builtin_ctz(SZ_16M)  },
+   { QUERY_DDW_PGSIZE_64K,  __builtin_ctz(SZ_64K)  },
+   { QUERY_DDW_PGSIZE_4K,   __builtin_ctz(SZ_4K)   }
+   };
+
+   int i;
+
+   for (i = 0; i < ARRAY_SIZE(ddw_pagesize); i++) {
+   if (query_page_size & ddw_pagesize[i].mask)
+   return ddw_pagesize[i].shift;
+   }
+
+   /* No valid page size found. */
+   return 0;
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1206,13 +1245,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
}
-   if (query.page_size & 4) {
-   page_shift = 24; /* 16MB */
-   } else if (query.page_size & 2) {
-   page_shift = 16; /* 64kB */
-   } else if (query.page_size & 1) {
-   page_shift = 12; /* 4kB */
-   } else {
+
+   page_shift = iommu_get_page_shift(query.page_size);
+   if (!page_shift) {
dev_dbg(&dev->dev, "no supported direct page size in mask %x",
  query.page_size);
goto out_failed;
-- 
2.30.2



Re: [PATCH 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-07 Thread Leonardo Bras
Hello Alexey,

On Tue, 2021-03-23 at 18:41 +1100, Alexey Kardashevskiy wrote:
[...]
> > +#define IOMMU_PAGE_SHIFT_16G   34
> > +#define IOMMU_PAGE_SHIFT_256M  28
> > +#define IOMMU_PAGE_SHIFT_128M  27
> > +#define IOMMU_PAGE_SHIFT_64M   26
> > +#define IOMMU_PAGE_SHIFT_32M   25
> > +#define IOMMU_PAGE_SHIFT_16M   24
> > +#define IOMMU_PAGE_SHIFT_64K   16
> 
> 
> These are not very descriptive, these are just normal shifts, could be 
> as simple as __builtin_ctz(SZ_4K) (gcc will optimize this) and so on.
> 
> OTOH the PAPR page sizes need macros as they are the ones which are 
> weird and screaming for macros.
> 
> I'd steal/rework spapr_page_mask_to_query_mask() from QEMU. Thanks,
> 

Thanks for this feedback!
I just sent a v2 applying your suggestions.

Best regards,
Leonardo Bras




Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-07 Thread Leonardo Bras
Hello Michael, thank you for this feedback!
Comments inline:

On Thu, 2021-04-08 at 15:37 +1000, Michael Ellerman wrote:
> Leonardo Bras  writes:
> > According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
> > will let the OS know all possible pagesizes that can be used for creating a
> > new DDW.
> > 
> > Currently Linux will only try using 3 of the 8 available options:
> > 4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
> > 128M, 256M and 16G.
> 
> Do we know of any hardware & hypervisor combination that will actually
> give us bigger pages?
> 
> > Enabling bigger pages would be interesting for direct mapping systems
> > with a lot of RAM, while using less TCE entries.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  arch/powerpc/platforms/pseries/iommu.c | 49 ++
> >  1 file changed, 42 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index 9fc5217f0c8e..6cda1c92597d 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -53,6 +53,20 @@ enum {
> >     DDW_EXT_QUERY_OUT_SIZE = 2
> >  };
> 
> A comment saying where the values come from would be good.

Sure, I will add the information about LoPAR.

> 
> > +#define QUERY_DDW_PGSIZE_4K0x01
> > +#define QUERY_DDW_PGSIZE_64K   0x02
> > +#define QUERY_DDW_PGSIZE_16M   0x04
> > +#define QUERY_DDW_PGSIZE_32M   0x08
> > +#define QUERY_DDW_PGSIZE_64M   0x10
> > +#define QUERY_DDW_PGSIZE_128M  0x20
> > +#define QUERY_DDW_PGSIZE_256M  0x40
> > +#define QUERY_DDW_PGSIZE_16G   0x80
> 
> I'm not sure the #defines really gain us much vs just putting the
> literal values in the array below?

My v1 did not use the define approach, what do you think of that?
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210322190943.715368-1-leobra...@gmail.com/

> 
> > +struct iommu_ddw_pagesize {
> > +   u32 mask;
> > +   int shift;
> > +};
> > +
> >  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
> >  {
> >     struct iommu_table_group *table_group;
> > @@ -1099,6 +1113,31 @@ static void reset_dma_window(struct pci_dev *dev, 
> > struct device_node *par_dn)
> >  ret);
> >  }
> >  
> > +/* Returns page shift based on "IO Page Sizes" output at 
> > ibm,query-pe-dma-window. See LoPAR */
> > +static int iommu_get_page_shift(u32 query_page_size)
> > +{
> > +   const struct iommu_ddw_pagesize ddw_pagesize[] = {
> > +   { QUERY_DDW_PGSIZE_16G,  __builtin_ctz(SZ_16G)  },
> > +   { QUERY_DDW_PGSIZE_256M, __builtin_ctz(SZ_256M) },
> > +   { QUERY_DDW_PGSIZE_128M, __builtin_ctz(SZ_128M) },
> > +   { QUERY_DDW_PGSIZE_64M,  __builtin_ctz(SZ_64M)  },
> > +   { QUERY_DDW_PGSIZE_32M,  __builtin_ctz(SZ_32M)  },
> > +   { QUERY_DDW_PGSIZE_16M,  __builtin_ctz(SZ_16M)  },
> > +   { QUERY_DDW_PGSIZE_64K,  __builtin_ctz(SZ_64K)  },
> > +   { QUERY_DDW_PGSIZE_4K,   __builtin_ctz(SZ_4K)   }
> > +   };
> 
> 
> cheers

Best regards,
Leonardo Bras




Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-07 Thread Leonardo Bras
On Thu, 2021-04-08 at 03:20 -0300, Leonardo Bras wrote:
> > > +#define QUERY_DDW_PGSIZE_4K  0x01
> > > +#define QUERY_DDW_PGSIZE_64K 0x02
> > > +#define QUERY_DDW_PGSIZE_16M 0x04
> > > +#define QUERY_DDW_PGSIZE_32M 0x08
> > > +#define QUERY_DDW_PGSIZE_64M 0x10
> > > +#define QUERY_DDW_PGSIZE_128M0x20
> > > +#define QUERY_DDW_PGSIZE_256M0x40
> > > +#define QUERY_DDW_PGSIZE_16G 0x80
> > 
> > I'm not sure the #defines really gain us much vs just putting the
> > literal values in the array below?
> 
> My v1 did not use the define approach, what do you think of that?
> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210322190943.715368-1-leobra...@gmail.com/
> 
> 
(Of course, that would be without the page-shift defines as well, using
the __builtin_ctz() approach suggested by Alexey.)



[PATCH v3 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-08 Thread Leonardo Bras
According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
will let the OS know all possible pagesizes that can be used for creating a
new DDW.

Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
128M, 256M and 16G.

Enabling bigger pages would be interesting for direct mapping systems
with a lot of RAM, while using fewer TCE entries.

Signed-off-by: Leonardo Bras 
---
Changes since v2:
 - Restore 'int array & shift' strategy
 - Remove defines for RTAS "IO Page Size" output of ibm,query-pe-dma-window
 - Added/Improved comments
Link: 
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210407195613.131140-1-leobra...@gmail.com/
Changes since v1:
- Remove page shift defines, replace by __builtin_ctzll(SZ_XXX)
- Add bit field defines for RTAS "IO Page Shift" output of 
ibm,query-pe-dma-window
- Use struct array instead of int array to be more explicit on pagesizes
Link: 
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210322190943.715368-1-leobra...@gmail.com/
 

 arch/powerpc/platforms/pseries/iommu.c | 37 +-
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..67c9953a6503 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1099,6 +1099,33 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
 }
 
+/* Return largest page shift based on "IO Page Sizes" output of 
ibm,query-pe-dma-window. */
+static int iommu_get_page_shift(u32 query_page_size)
+{
+   /* Supported IO page-sizes according to LoPAR */
+   const int shift[] = {
+   __builtin_ctzll(SZ_4K),   __builtin_ctzll(SZ_64K), 
__builtin_ctzll(SZ_16M),
+   __builtin_ctzll(SZ_32M),  __builtin_ctzll(SZ_64M), 
__builtin_ctzll(SZ_128M),
+   __builtin_ctzll(SZ_256M), __builtin_ctzll(SZ_16G)
+   };
+
+   int i = ARRAY_SIZE(shift) - 1;
+
+   /*
+* On LoPAR, ibm,query-pe-dma-window outputs "IO Page Sizes" using a 
bit field:
+* - bit 31 means 4k pages are supported,
+* - bit 30 means 64k pages are supported, and so on.
+* Larger pagesizes map more memory with the same amount of TCEs, so 
start probing them.
+*/
+   for (; i >= 0 ; i--) {
+   if (query_page_size & (1 << i))
+   return shift[i];
+   }
+
+   /* No valid page size found. */
+   return 0;
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1206,13 +1233,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
}
-   if (query.page_size & 4) {
-   page_shift = 24; /* 16MB */
-   } else if (query.page_size & 2) {
-   page_shift = 16; /* 64kB */
-   } else if (query.page_size & 1) {
-   page_shift = 12; /* 4kB */
-   } else {
+
+   page_shift = iommu_get_page_shift(query.page_size);
+   if (!page_shift) {
dev_dbg(&dev->dev, "no supported direct page size in mask %x",
  query.page_size);
goto out_failed;
-- 
2.30.2



Re: [PATCH 1/3] powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug

2021-04-08 Thread Leonardo Bras
Hello David, thanks for your feedback.

On Mon, 2021-03-22 at 17:49 +1100, David Gibson wrote:
> I don't love this approach.  Adding the extra flag at this level seems
> a bit inelegant, and it means we're passing up an easy opportunity to
> reduce our resource footprint on the host.

I understand, but trying to reduce the resource footprint in the host, and
mostly failing, is what causes hot-add and hot-remove to take so long.

> But... maybe we'll have to do it.  I'd like to see if we can get
> things to work well enough with just the "batching" to avoid multiple
> resize attempts first.

This batching is something I had thought a lot about.
The problem is that there are a lot of generic interfaces between memory
hotplug and actually resizing the HPT. I tried a simpler approach in
patches 2 & 3, so I don't touch much stuff there.

Best regards,
Leonardo Bras






Re: [PATCH 2/3] powerpc/mm/hash: Avoid multiple HPT resize-ups on memory hotplug

2021-04-08 Thread Leonardo Bras
Hello David, thanks for the feedback!

On Mon, 2021-03-22 at 18:55 +1100, David Gibson wrote:
> > +void hash_memory_batch_expand_prepare(unsigned long newsize)
> > +{
> > +   /*
> > +* Resizing-up HPT should never fail, but there are some cases system 
> > starts with higher
> > +* SHIFT than required, and we go through the funny case of resizing 
> > HPT down while
> > +* adding memory
> > +*/
> > +
> > +   while (resize_hpt_for_hotplug(newsize, false) == -ENOSPC) {
> > +   newsize *= 2;
> > +   pr_warn("Hash collision while resizing HPT\n");
> 
> This unbounded increase in newsize makes me nervous - we should be
> bounded by the current size of the HPT at least.  In practice we
> should be fine, since the resize should always succeed by the time we
> reach our current HPT size, but that's far from obvious from this
> point in the code.

Sure, I will add bounds in v2.

> 
> And... you're doubling newsize which is a value which might not be a
> power of 2.  I'm wondering if there's an edge case where this could
> actually cause us to skip the current size and erroneously resize to
> one bigger than we have currently.

I also thought that at the start, but it seems quite reliable.
Before using this value, htab_shift_for_mem_size() will always round it
to the next power of 2.
Ex.
Any value between 0b0101 and 0b1000 will be rounded to 0b1000 for shift
calculation. If we multiply it by 2 (same as << 1), we have that
anything between 0b01010 and 0b10000 will be rounded to 0b10000.

This works just fine as long as we are multiplying. 
Division may have the behavior you expect, as 0b0101 >> 1 would become
0b010 and skip a shift.
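
To make that concrete, here is a tiny stand-alone illustration (plain C,
only mimicking the rounding done by htab_shift_for_mem_size(), not the
kernel code itself):

#include <assert.h>

/* Round size up to the next power of 2 and return its shift. */
static int round_up_shift(unsigned long size)
{
	int shift = 0;

	while ((1UL << shift) < size)
		shift++;
	return shift;
}

int main(void)
{
	/* 0b0101 rounds to shift 3 (0b1000); doubling it rounds to shift 4
	 * (0b10000): always exactly one step up, never skipping a shift. */
	assert(round_up_shift(0b0101) == 3);
	assert(round_up_shift(0b0101 << 1) == 4);

	/* Halving can skip: 0b0101 >> 1 == 0b010 rounds to shift 1,
	 * two steps below the original shift 3. */
	assert(round_up_shift(0b0101 >> 1) == 1);
	return 0;
}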

> > +void memory_batch_expand_prepare(unsigned long newsize)
> 
> This wrapper doesn't seem useful.

Yeah, it does little, but I can't just jump into hash_* functions
directly from hotplug-memory.c without even knowing if it's using hash
pagetables (in case the suggestion is to test for disable_radix
inside hash_memory_batch*).

> 
> > +{
> > +   if (!radix_enabled())
> > +   hash_memory_batch_expand_prepare(newsize);
> > +}
> >  #endif /* CONFIG_MEMORY_HOTPLUG */
> >  
> > 
> > +   memory_batch_expand_prepare(memblock_phys_mem_size() +
> > +drmem_info->n_lmbs * drmem_lmb_size());
> 
> This doesn't look right.  memory_add_by_index() is adding a *single*
> LMB, I think using drmem_info->n_lmbs here means you're counting this
> as adding again as much memory as you already have hotplugged.

Yeah, my mistake. This makes sense.
I will change it to something like 
memblock_phys_mem_size() + drmem_lmb_size()

> > 
> > +   memory_batch_expand_prepare(memblock_phys_mem_size() + lmbs_to_add * 
> > drmem_lmb_size());
> > +
> >     for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
> >     if (lmb->flags & DRCONF_MEM_ASSIGNED)
> >     continue;
> 
> I don't see memory_batch_expand_prepare() suppressing any existing HPT
> resizes.  Won't this just resize to the right size for the full add,
> then resize several times again as we perform the add?  Or.. I guess
> that will be suppressed by patch 1/3. 

Correct.

>  That's seems kinda fragile, though.

What do you mean by fragile here?
What would you suggest doing differently?

Best regards,
Leonardo Bras



Re: [PATCH 3/3] powerpc/mm/hash: Avoid multiple HPT resize-downs on memory hotunplug

2021-04-08 Thread Leonardo Bras
Hello David, thanks for commenting.

On Tue, 2021-03-23 at 10:45 +1100, David Gibson wrote:
> > @@ -805,6 +808,10 @@ static int resize_hpt_for_hotplug(unsigned long 
> > new_mem_size, bool shrinking)
> >     if (shrinking) {
> > 
> > +   /* When batch removing entries, only resizes HPT at the end. */
> > +   if (atomic_read_acquire(&hpt_resize_disable))
> > +   return 0;
> > +
> 
> I'm not quite convinced by this locking.  Couldn't hpt_resize_disable
> be set after this point, but while you're still inside
> resize_hpt_for_hotplug()?  Probably better to use an explicit mutex
> (and mutex_trylock()) to make the critical sections clearer.

Sure, I can do that for v2.

> Except... do we even need the fancy mechanics to suppress the resizes
> in one place to do them elswhere.  Couldn't we just replace the
> existing resize calls with the batched ones?

What do you think of having batched resize-downs in the HPT?
Other than the current approach, I could only think of a way that would
touch a lot of generic code, and/or duplicate some functions, as
dlpar_add_lmb() does a lot of other stuff.

> > +void hash_memory_batch_shrink_end(void)
> > +{
> > +   unsigned long newsize;
> > +
> > +   /* Re-enables HPT resize-down after hot-unplug */
> > +   atomic_set_release(&hpt_resize_disable, 0);
> > +
> > +   newsize = memblock_phys_mem_size();
> > +   /* Resize to smallest SHIFT possible */
> > +   while (resize_hpt_for_hotplug(newsize, true) == -ENOSPC) {
> > +   newsize *= 2;
> 
> As noted earlier, doing this without an explicit cap on the new hpt
> size (of the existing size) this makes me nervous. 
> 

I can add a stop in v2.

>  Less so, but doing
> the calculations on memory size, rather than explictly on HPT size /
> HPT order also seems kinda clunky.

Agreed, but at this point it would seem kind of a waste to find the
shift from newsize, then calculate (1 << shift) for each retry of
resize_hpt_for_hotplug(), only to make the point that we are retrying the
order value.

But sure, if you think it looks better, I can change that. 

> > +void memory_batch_shrink_begin(void)
> > +{
> > +   if (!radix_enabled())
> > +   hash_memory_batch_shrink_begin();
> > +}
> > +
> > +void memory_batch_shrink_end(void)
> > +{
> > +   if (!radix_enabled())
> > +   hash_memory_batch_shrink_end();
> > +}
> 
> Again, these wrappers don't seem particularly useful to me.

Options would be to add 'if (!radix_enabled())' to hotplug-memory.c
functions or to hash* functions, which looks kind of wrong.

> > +   memory_batch_shrink_end();
> 
> remove_by_index only removes a single LMB, so there's no real point to
> batching here.

Sure, will be fixed for v2.

> > @@ -700,6 +712,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
> >     if (lmbs_added != lmbs_to_add) {
> >     pr_err("Memory hot-add failed, removing any added LMBs\n");
> > 
> > +   memory_batch_shrink_begin();
> 
> 
> The effect of these on the memory grow path is far from clear.
> 

On hotplug, the HPT is resized up before adding LMBs.
On hotunplug, the HPT is resized down after removing LMBs.
And each one has its own mechanism to batch HPT resizes...

I can't understand exactly how using it on the hotplug fail path can be any
different from using it on hotunplug.
> 

Can you please help me understand this?

Best regards,
Leonardo Bras



[PATCH 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-04 Thread Leonardo Bras
Before guest entry, the TBU40 register is changed to reflect the guest
timebase. After exiting the guest, the register is reverted to its
original value.

If one tries to get the timestamp from the host between those changes, it
will present an incorrect value.

An example would be trying to add a tracepoint in
kvmppc_guest_entry_inject_int(), which depending on last tracepoint
acquired could actually cause the host to crash.

Save the Timebase Offset to PACA and use it on sched_clock() to always
get the correct timestamp.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
 arch/powerpc/kernel/asm-offsets.c | 1 +
 arch/powerpc/kernel/time.c| 3 ++-
 arch/powerpc/kvm/book3s_hv.c  | 2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
 5 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 078f4648ea27..e2c12a10eed2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -131,6 +131,7 @@ struct kvmppc_host_state {
u64 cfar;
u64 ppr;
u64 host_fscr;
+   u64 tb_offset;  /* Timebase offset: keeps correct timebase 
while on guest */
 #endif
 };
 
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index b12d7c049bfe..0beb8fdc6352 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -706,6 +706,7 @@ int main(void)
HSTATE_FIELD(HSTATE_CFAR, cfar);
HSTATE_FIELD(HSTATE_PPR, ppr);
HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
+   HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #else /* CONFIG_PPC_BOOK3S */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 67feb3524460..adf6648e3572 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -699,7 +699,8 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
  */
 notrace unsigned long long sched_clock(void)
 {
-   return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
+   return mulhdu(get_tb() - boot_tb - local_paca->kvm_hstate.tb_offset, 
tb_to_ns_scale)
+   << tb_to_ns_shift;
 }
 
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b3731572295e..c08593c63353 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if ((tb & 0xff) < (new_tb & 0xff))
mtspr(SPRN_TBU40, new_tb + 0x100);
vc->tb_offset_applied = vc->tb_offset;
+   local_paca->kvm_hstate.tb_offset = vc->tb_offset;
}
 
if (vc->pcr)
@@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if ((tb & 0xff) < (new_tb & 0xff))
mtspr(SPRN_TBU40, new_tb + 0x100);
vc->tb_offset_applied = 0;
+   local_paca->kvm_hstate.tb_offset = 0;
}
 
mtspr(SPRN_HDEC, 0x7fff);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b73140607875..8f7a9f7f4ee6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -632,6 +632,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
cmpdi   r8,0
beq 37f
std r8, VCORE_TB_OFFSET_APPL(r5)
+   std r8, HSTATE_TB_OFFSET(r13)
mftbr6  /* current host timebase */
add r8,r8,r6
mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
@@ -1907,6 +1908,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
beq 17f
li  r0, 0
std r0, VCORE_TB_OFFSET_APPL(r5)
+   std r0, HSTATE_TB_OFFSET(r13)
mftbr6  /* current guest timebase */
subfr8,r8,r6
mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
-- 
2.29.2



[PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-04 Thread Leonardo Bras
Before guest entry, the TBU40 register is changed to reflect the guest
timebase. After exiting the guest, the register is reverted to its
original value.

If one tries to get the timestamp from the host between those changes, it
will present an incorrect value.

An example would be trying to add a tracepoint in
kvmppc_guest_entry_inject_int(), which depending on last tracepoint
acquired could actually cause the host to crash.

Save the Timebase Offset to PACA and use it on sched_clock() to always
get the correct timestamp.

Signed-off-by: Leonardo Bras 
Suggested-by: Paul Mackerras 
---
Changes since v1:
- Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
  CONFIG_PPC_BOOK3S_64 are defined.
---
 arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
 arch/powerpc/kernel/asm-offsets.c | 1 +
 arch/powerpc/kernel/time.c| 8 +++-
 arch/powerpc/kvm/book3s_hv.c  | 2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
 5 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 078f4648ea27..e2c12a10eed2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -131,6 +131,7 @@ struct kvmppc_host_state {
u64 cfar;
u64 ppr;
u64 host_fscr;
+   u64 tb_offset;  /* Timebase offset: keeps correct timebase 
while on guest */
 #endif
 };
 
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index b12d7c049bfe..0beb8fdc6352 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -706,6 +706,7 @@ int main(void)
HSTATE_FIELD(HSTATE_CFAR, cfar);
HSTATE_FIELD(HSTATE_PPR, ppr);
HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
+   HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #else /* CONFIG_PPC_BOOK3S */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 67feb3524460..f27f0163792b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -699,7 +699,13 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
  */
 notrace unsigned long long sched_clock(void)
 {
-   return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
+   u64 tb = get_tb() - boot_tb;
+
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_HANDLER)
+   tb -= local_paca->kvm_hstate.tb_offset;
+#endif
+
+   return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
 }
 
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b3731572295e..c08593c63353 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if ((tb & 0xff) < (new_tb & 0xff))
mtspr(SPRN_TBU40, new_tb + 0x100);
vc->tb_offset_applied = vc->tb_offset;
+   local_paca->kvm_hstate.tb_offset = vc->tb_offset;
}
 
if (vc->pcr)
@@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if ((tb & 0xff) < (new_tb & 0xff))
mtspr(SPRN_TBU40, new_tb + 0x100);
vc->tb_offset_applied = 0;
+   local_paca->kvm_hstate.tb_offset = 0;
}
 
mtspr(SPRN_HDEC, 0x7fff);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b73140607875..8f7a9f7f4ee6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -632,6 +632,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
cmpdi   r8,0
beq 37f
std r8, VCORE_TB_OFFSET_APPL(r5)
+   std r8, HSTATE_TB_OFFSET(r13)
mftbr6  /* current host timebase */
add r8,r8,r6
mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
@@ -1907,6 +1908,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
beq 17f
li  r0, 0
std r0, VCORE_TB_OFFSET_APPL(r5)
+   std r0, HSTATE_TB_OFFSET(r13)
mftbr6  /* current guest timebase */
subfr8,r8,r6
mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
-- 
2.29.2



Re: [PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-04 Thread Leonardo Bras
Hey Nick, thanks for reviewing :)

On Fri, 2021-02-05 at 16:28 +1000, Nicholas Piggin wrote:
> Excerpts from Leonardo Bras's message of February 5, 2021 4:06 pm:
> > Before guest entry, TBU40 register is changed to reflect guest timebase.
> > After exitting guest, the register is reverted to it's original value.
> > 
> > If one tries to get the timestamp from host between those changes, it
> > will present an incorrect value.
> > 
> > An example would be trying to add a tracepoint in
> > kvmppc_guest_entry_inject_int(), which depending on last tracepoint
> > acquired could actually cause the host to crash.
> > 
> > Save the Timebase Offset to PACA and use it on sched_clock() to always
> > get the correct timestamp.
> 
> Ouch. Not sure how reasonable it is to half switch into guest registers 
> and expect to call into the wider kernel, fixing things up as we go. 
> What if mftb is used in other places?

IIUC, the CPU is not supposed to call anything as host between guest
entry and guest exit, except guest-related cases, like
kvmppc_guest_entry_inject_int(), but anyway, if something calls mftb it
will still get the same value as before.

This is only supposed to change stuff that depends on sched_clock, like
tracepoints, which can happen in those exceptions.


> Especially as it doesn't seem like there is a reason that function _has_
> to be called after the timebase is switched to guest, that's just how 
> the code is structured.

Correct, but if called, like in the ring-buffer (rb) routines used by
tracepoints, the difference between the last tb and the current (lower) tb
may cause the CPU to trap a PROGRAM exception, crashing the host.

> As a local hack to work out a bug okay. If you really need it upstream 
> could you put it under a debug config option?

You mean something that is automatically selected whenever those
configs are enabled? 

CONFIG_TRACEPOINT && CONFIG_KVM_BOOK3S_HANDLER && CONFIG_PPC_BOOK3S_64

Or something the user need to select himself in menuconfig?

> 
> Thanks,
> Nick
> 

Thank you!
Leonardo Bras

> > Signed-off-by: Leonardo Bras 
> > Suggested-by: Paul Mackerras 
> > ---
> > Changes since v1:
> > - Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
> >   CONFIG_PPC_BOOK3S_64 are defined.
> > ---
> >  arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
> >  arch/powerpc/kernel/asm-offsets.c | 1 +
> >  arch/powerpc/kernel/time.c| 8 +++-
> >  arch/powerpc/kvm/book3s_hv.c  | 2 ++
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
> >  5 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
> > b/arch/powerpc/include/asm/kvm_book3s_asm.h
> > index 078f4648ea27..e2c12a10eed2 100644
> > --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
> > +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
> > @@ -131,6 +131,7 @@ struct kvmppc_host_state {
> >     u64 cfar;
> >     u64 ppr;
> >     u64 host_fscr;
> > +   u64 tb_offset;  /* Timebase offset: keeps correct timebase 
> > while on guest */
> >  #endif
> >  };
> >  
> > diff --git a/arch/powerpc/kernel/asm-offsets.c 
> > b/arch/powerpc/kernel/asm-offsets.c
> > index b12d7c049bfe..0beb8fdc6352 100644
> > --- a/arch/powerpc/kernel/asm-offsets.c
> > +++ b/arch/powerpc/kernel/asm-offsets.c
> > @@ -706,6 +706,7 @@ int main(void)
> >     HSTATE_FIELD(HSTATE_CFAR, cfar);
> >     HSTATE_FIELD(HSTATE_PPR, ppr);
> >     HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
> > +   HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
> >  #endif /* CONFIG_PPC_BOOK3S_64 */
> >  
> &

Re: [PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-05 Thread Leonardo Bras
Hello Fabiano, 
Thanks for reviewing! 
(answers inline)

On Fri, 2021-02-05 at 10:09 -0300, Fabiano Rosas wrote:
> Leonardo Bras  writes:
> 
> > Before guest entry, TBU40 register is changed to reflect guest timebase.
> > After exitting guest, the register is reverted to it's original value.
> > 
> > If one tries to get the timestamp from host between those changes, it
> > will present an incorrect value.
> > 
> > An example would be trying to add a tracepoint in
> > kvmppc_guest_entry_inject_int(), which depending on last tracepoint
> > acquired could actually cause the host to crash.
> > 
> > Save the Timebase Offset to PACA and use it on sched_clock() to always
> > get the correct timestamp.
> > 
> > Signed-off-by: Leonardo Bras 
> > Suggested-by: Paul Mackerras 
> > ---
> > Changes since v1:
> > - Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
> >   CONFIG_PPC_BOOK3S_64 are defined.
> > ---
> >  arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
> >  arch/powerpc/kernel/asm-offsets.c | 1 +
> >  arch/powerpc/kernel/time.c| 8 +++-
> >  arch/powerpc/kvm/book3s_hv.c  | 2 ++
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
> >  5 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
> > b/arch/powerpc/include/asm/kvm_book3s_asm.h
> > index 078f4648ea27..e2c12a10eed2 100644
> > --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
> > +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
> > @@ -131,6 +131,7 @@ struct kvmppc_host_state {
> >     u64 cfar;
> >     u64 ppr;
> >     u64 host_fscr;
> > +   u64 tb_offset;  /* Timebase offset: keeps correct
> > timebase while on guest */
> 
> Couldn't you use the vc->tb_offset_applied for this? We have a reference
> for the vcore in the hstate already.

But it's a pointer, which means we would have to keep checking for NULL
every time we need sched_clock(). 
Potentially it would cost a cache miss for the PACA memory region that
contains vc, and another for getting the part of *vc that contains
tb_offset_applied, instead of only one for the PACA struct region that
contains tb_offset.

On the other hand, it got me thinking: If the offset is applied per
cpu, why don't we get this info only in PACA, instead of in vc?
It could be a general way to get an offset applied for any purpose and
still get the sched_clock() right. 
(Not that I have any idea of any other purpose we could use it for.)

Best regards!
Leonardo Bras

> 
> >  #endif
> >  };
> > 
> > diff --git a/arch/powerpc/kernel/asm-offsets.c 
> > b/arch/powerpc/kernel/asm-offsets.c
> > index b12d7c049bfe..0beb8fdc6352 100644
> > --- a/arch/powerpc/kernel/asm-offsets.c
> > +++ b/arch/powerpc/kernel/asm-offsets.c
> > @@ -706,6 +706,7 @@ int main(void)
> >     HSTATE_FIELD(HSTATE_CFAR, cfar);
> >     HSTATE_FIELD(HSTATE_PPR, ppr);
> >     HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
> > +   HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
> >  #endif /* CONFIG_PPC_BOOK3S_64 */
> > 
> >  #else /* CONFIG_PPC_BOOK3S */
> > diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> > index 67feb3524460..f27f0163792b 100644
> > --- a/arch/powerpc/kernel/time.c
> > +++ b/arch/powerpc/kernel/time.c
> > @@ -699,7 +699,13 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
> >   */
> >  notrace unsigned long long sched_clock(void)
> >  {
> > -   return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
> > +   u64 tb = get_tb() - boot_tb;
> > +
> > +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_HANDLER)
> > +   tb -= local_paca->kvm_hstate.tb_offset;
> > +#endif
> > +
> > +   return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
> >  }
> > 
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index b3731572295e..c08593c63353 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
> > *vcpu, u64 time_limit,
> >     if ((tb & 0xff) < (new_tb & 0xff))
> >     mtspr(SPRN_TBU40, new_tb + 0x100);
> >     vc->tb_offset_applied = vc->tb_offset;
> > +   local_paca->kvm_hstate.tb_offset = vc->tb_offset;
> >     }
> > 
> >     if (vc->pcr)
> > @@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
> > *vcpu, u6

Re: [PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-08 Thread Leonardo Bras
Hello Nick,

On Sat, 2021-02-06 at 13:03 +1000, Nicholas Piggin wrote:
> Excerpts from Leonardo Bras's message of February 5, 2021 5:01 pm:
> > Hey Nick, thanks for reviewing :)
> > 
> > On Fri, 2021-02-05 at 16:28 +1000, Nicholas Piggin wrote:
> > > Excerpts from Leonardo Bras's message of February 5, 2021 4:06 pm:
> > > > Before guest entry, TBU40 register is changed to reflect guest timebase.
> > > > After exitting guest, the register is reverted to it's original value.
> > > > 
> > > > If one tries to get the timestamp from host between those changes, it
> > > > will present an incorrect value.
> > > > 
> > > > An example would be trying to add a tracepoint in
> > > > kvmppc_guest_entry_inject_int(), which depending on last tracepoint
> > > > acquired could actually cause the host to crash.
> > > > 
> > > > Save the Timebase Offset to PACA and use it on sched_clock() to always
> > > > get the correct timestamp.
> > > 
> > > Ouch. Not sure how reasonable it is to half switch into guest registers 
> > > and expect to call into the wider kernel, fixing things up as we go. 
> > > What if mftb is used in other places?
> > 
> > IIUC, the CPU is not supposed to call anything as host between guest
> > entry and guest exit, except guest-related cases, like
> 
> When I say "call", I'm including tracing in that. If a function is not 
> marked as no trace, then it will call into the tracing subsystem.
> 
> > kvmppc_guest_entry_inject_int(), but anyway, if something calls mftb it
> > will still get the same value as before.
> 
> Right, so it'll be out of whack again.
> 
> > This is only supposed to change stuff that depends on sched_clock, like
> > Tracepoints, that can happen in those exceptions.
> 
> If they depend on sched_clock that's one thing. Do they definitely have 
> no dependencies on mftb from other calls?

We could change that on get_tb() or mftb() @ timebase.h, which would
have a broader reach, but would not reach any mftb from asm code.

> > > Especially as it doesn't seem like there is a reason that function _has_
> > > to be called after the timebase is switched to guest, that's just how 
> > > the code is structured.
> > 
> > Correct, but if called, like in rb routines, used by tracepoints, the
> > difference between last tb and current (lower) tb may cause the CPU to
> > trap PROGRAM exception, crashing host. 
> 
> Yes, so I agree with Michael any function that is involved when we begin 
> to switch into guest context (or have not completed switching back to 
> host going the other way) should be marked as no trace (noinstr even, 
> perhaps).

Sure, that would avoid having to get paca->tb_offset for every mftb()
call, and avoid inconsistencies when different ways to get time are
used in the code.

On the other hand, it would make it very hard to debug functions like
kvmppc_guest_entry_inject_int() as I am doing right now.

> 
> > > As a local hack to work out a bug okay. If you really need it upstream 
> > > could you put it under a debug config option?
> > 
> > You mean something that is automatically selected whenever those
> > configs are enabled? 
> > 
> > CONFIG_TRACEPOINT && CONFIG_KVM_BOOK3S_HANDLER && CONFIG_PPC_BOOK3S_64
> > 
> > Or something the user need to select himself in menuconfig?
> 
> Yeah I meant a default n thing under powerpc kernel debugging somewhere.

So, IIUC, all we can do is split this into 2 changes:
1 - Adding notrace to those functions
2 - Introducing a kernel debug config that reverts (1) and 'fixes' mftb

If that's correct, I have some ideas we can use. 

For the debug option, should we add the offset in get_tb() or mftb()?

Another option would be to add this tb_offset only in the routines
used by tracing. This would probably mean having to add a function
in arch-generic code, but it is still an option.
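
For concreteness, one possible shape of the debug-option variant — the
Kconfig symbol and helper name below are invented for illustration; only
the kvm_hstate.tb_offset field comes from this patch:

/* Hypothetical sketch: only a debug build pays for the extra PACA read. */
#ifdef CONFIG_PPC_KVM_TB_OFFSET_DEBUG	/* invented option name */
static inline u64 get_tb_for_sched_clock(void)
{
	return mftb() - local_paca->kvm_hstate.tb_offset;
}
#else
static inline u64 get_tb_for_sched_clock(void)
{
	return mftb();
}
#endif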

What do you think?

> 
> Thanks,
> Nick

Thank you!
Leonardo Bras



Re: [PATCH 1/1] kernel/smp: Split call_single_queue into 3 queues

2021-02-08 Thread Leonardo Bras
Hello Sebastian, 
Thanks for the feedback!

On Thu, 2021-01-28 at 11:33 +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-28 03:55:06 [-0300], Leonardo Bras wrote:
> > Currently, during flush_smp_call_function_queue():
> > - All items are traversed once, for inverting.
> > - The SYNC items are traversed twice.
> > - The ASYNC & IRQ_WORK items are traversed three times.
> > - The TTWU items are traversed four times.
> > 
> > Also, a lot of extra work is done to keep track and remove the items
> > already processed in each step.
> > 
> > By using three queues, it's possible to avoid all this work, and
> > all items in the list are traversed only twice: once for inverting,
> > and once for processing.
> > 
> > In exchange, this requires 2 extra llist_del_all() in the beginning
> > of flush_smp_call_function_queue(), and some extra logic to decide
> > the correct queue to add the desired csd.
> > 
> > This is not supposed to cause any change in the order the items are
> > processed, but will change the order of printing (cpu offlining)
> > to the order the items will be processed.
> > 
> > (The above traversal count ignores the cpu-offlining case, in
> > which all items would be traversed again, in both cases.)
> 
> Numbers would be good.
> 

Sure, I will try to get some time to compare performance.


>  Having three queues increases the memory foot
> print from one pointer to three but we still remain in one cache line.
> One difference your patch makes is this hunk:
> 
> > +   if (smp_add_to_queue(cpu, node))
> >     send_call_function_single_ipi(cpu);
> 
> Previously only the first addition resulted in sending an IPI. With this
> change you could send two IPIs, one for adding to two independent queues.

Yes, you are correct. 
I need to change this to look into all queues, which should just add
a few compares, given that all llist_heads are in the same cacheline.
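
A rough sketch of what that v2 check could look like — the struct layout
and function name are illustrative only, and the memory-ordering details
the real patch needs are glossed over:

/* Illustrative only, not the actual patch. */
struct csd_queues {
	struct llist_head sync;
	struct llist_head async;	/* ASYNC + IRQ_WORK */
	struct llist_head ttwu;
};
static DEFINE_PER_CPU_SHARED_ALIGNED(struct csd_queues, csd_queues);

/* Returns true when the caller must send an IPI: this csd became the
 * first pending item across all three queues of the target CPU. */
static bool smp_add_to_sync_queue(int cpu, struct llist_node *node)
{
	struct csd_queues *q = &per_cpu(csd_queues, cpu);
	/* All three heads share one cacheline, so these reads are cheap. */
	bool others_empty = llist_empty(&q->async) && llist_empty(&q->ttwu);

	return llist_add(node, &q->sync) && others_empty;
}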

> 
> A quick smoke test ended up
>   -0   [005] d..h1..   146.255996: 
> flush_smp_call_function_queue: A1 S2 I0 T0 X3
> 
> with the patch at the bottom of the mail. This shows that in my
> smoke test at least, the number of items in the individual list is low.

Yes, but depending on workload this list may get longer.

My patch also needs some other changes, so I will send a v2 with those
+ the proposed changes.

> Sebastian

Best regards,
Leonardo Bras



Re: [PATCH v5 2/2] powerpc/rtas: Implement reentrant rtas call

2020-05-18 Thread Leonardo Bras
On Sat, 2020-05-16 at 17:36 +1000, Nicholas Piggin wrote:
> Good, I think this should work as you want now. Can you allocate it like 
> lppacas? Put it under PSERIES (and in the paca) and check for !HV?

Sure, I will do that. 

> Oh and while there, could you prefix the name with rtas_?

Sure, replacing reentrant_args with rtas_args_reentrant.

>
> Thanks,
> Nick

Thank you for the feedback!
Leonardo Bras



[PATCH v6 1/2] powerpc/rtas: Move type/struct definitions from rtas.h into rtas-types.h

2020-05-18 Thread Leonardo Bras
In order to get any rtas* struct into other headers, rtas.h would have
to be included, which may cause a lot of errors due to the include
dependencies needed by its inline functions.

Create rtas-types.h and move there all type/struct definitions
from rtas.h, then include rtas-types.h into rtas.h.

Also, as suggested by checkpatch.pl, replace uint8_t with u8, and keep
the same type pattern for the whole file, as they are the same
according to powerpc/boot/types.h.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/rtas-types.h | 126 ++
 arch/powerpc/include/asm/rtas.h   | 118 +---
 2 files changed, 127 insertions(+), 117 deletions(-)
 create mode 100644 arch/powerpc/include/asm/rtas-types.h

diff --git a/arch/powerpc/include/asm/rtas-types.h 
b/arch/powerpc/include/asm/rtas-types.h
new file mode 100644
index ..87354e28f160
--- /dev/null
+++ b/arch/powerpc/include/asm/rtas-types.h
@@ -0,0 +1,126 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _POWERPC_RTAS_TYPES_H
+#define _POWERPC_RTAS_TYPES_H
+#ifdef __KERNEL__
+
+#include 
+
+typedef __be32 rtas_arg_t;
+
+struct rtas_args {
+   __be32 token;
+   __be32 nargs;
+   __be32 nret;
+   rtas_arg_t args[16];
+   rtas_arg_t *rets; /* Pointer to return values in args[]. */
+};
+
+struct rtas_t {
+   unsigned long entry;/* physical address pointer */
+   unsigned long base; /* physical address pointer */
+   unsigned long size;
+   arch_spinlock_t lock;
+   struct rtas_args args;
+   struct device_node *dev;/* virtual address pointer */
+};
+
+struct rtas_suspend_me_data {
+   atomic_t working; /* number of cpus accessing this struct */
+   atomic_t done;
+   int token; /* ibm,suspend-me */
+   atomic_t error;
+   struct completion *complete; /* wait on this until working == 0 */
+};
+
+struct rtas_error_log {
+   /* Byte 0 */
+   u8  byte0;  /* Architectural version */
+
+   /* Byte 1 */
+   u8  byte1;
+   /* 
+* XXX  3: Severity level of error
+*XX2: Degree of recovery
+*  X   1: Extended log present?
+*   XX 2: Reserved
+*/
+
+   /* Byte 2 */
+   u8  byte2;
+   /* 
+*  4: Initiator of event
+*  4: Target of failed operation
+*/
+   u8  byte3;  /* General event or error*/
+   __be32  extended_log_length;/* length in bytes */
+   unsigned char   buffer[1];  /* Start of extended log */
+   /* Variable length.  */
+};
+
+/* RTAS general extended event log, Version 6. The extended log starts
+ * from "buffer" field of struct rtas_error_log defined above.
+ */
+struct rtas_ext_event_log_v6 {
+   /* Byte 0 */
+   u8 byte0;
+   /* 
+* X1: Log valid
+*  X   1: Unrecoverable error
+*   X  1: Recoverable (correctable or successfully retried)
+*X 1: Bypassed unrecoverable error (degraded operation)
+* X1: Predictive error
+*  X   1: "New" log (always 1 for data returned from RTAS)
+*   X  1: Big Endian
+*X 1: Reserved
+*/
+
+   /* Byte 1 */
+   u8 byte1;   /* reserved */
+
+   /* Byte 2 */
+   u8 byte2;
+   /* 
+* X1: Set to 1 (indicating log is in PowerPC format)
+*  XXX 3: Reserved
+*  4: Log format used for bytes 12-2047
+*/
+
+   /* Byte 3 */
+   u8 byte3;   /* reserved */
+   /* Byte 4-11 */
+   u8 reserved[8]; /* reserved */
+   /* Byte 12-15 */
+   __be32  company_id; /* Company ID of the company*/
+   /* that defines the format for  */
+   /* the vendor specific log type */
+   /* Byte 16-end of log */
+   u8 vendor_log[1];   /* Start of vendor specific log */
+   /* Variable length. */
+};
+
+/* Vendor specific Platform Event Log Format, Version 6, section header */
+struct pseries_errorlog {
+   __be16 id;  /* 0x00 2-byte ASCII section ID */
+   __be16 length;  /* 0x02 Section length in bytes */
+   u8 version; /* 0x04 Section version */
+   u8 subtype; /* 0x05 Section subtype */
+   __be16 creator_component;   /* 0x06 Creator component ID*/
+   u8 data[];  /* 0x08 Start of section data   */
+};
+
+/* RTAS pseries hotplug errorlo

[PATCH v6 0/2] Implement reentrant rtas call

2020-05-18 Thread Leonardo Bras
Patch 2 implement rtas_call_reentrant() for reentrant rtas-calls:
"ibm,int-on", "ibm,int-off",ibm,get-xive" and  "ibm,set-xive",
according to LoPAPR Version 1.1 (March 24, 2016).

For that, it's necessary that every call uses a different
rtas buffer (rtas_args). Paul Mackerras suggested using the PACA
structure for creating a per-cpu buffer for these calls.

Patch 1 was necessary to make PACA have a 'struct rtas_args' member.

Reentrant rtas calls can be useful to avoid deadlocks during a crash,
where rtas-calls are needed, but some other thread crashed while holding
the rtas.lock.

This is a backtrace of a deadlock from a kdump testing environment:

  #0 arch_spin_lock
  #1  lock_rtas () 
  #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
  #3  ics_rtas_mask_real_irq (hw_irq=4100) 
  #4  machine_kexec_mask_interrupts
  #5  default_machine_crash_shutdown
  #6  machine_crash_shutdown 
  #7  __crash_kexec
  #8  crash_kexec
  #9  oops_end

Signed-off-by: Leonardo Bras 

Special thanks to Nick Piggin, who has been helping me a lot with
this series!

---
Changes since v5:
- Renames new paca member from reentrant_args to rtas_args_reentrant
- Compile out rtas_args_reentrant if CONFIG_PPC_RTAS=n
- new_rtas_args() is skipped (returns NULL) if CPU_FTR_HVMODE

Changes since v4:
- Instead of having the full buffer on PACA, add only a pointer and
  allocate it during allocate_paca(), making sure it's in a memory
  range available for RTAS (32-bit). (Thanks Nick Piggin!)

Changes since v3:
- Adds protection from preemption and interruption

Changes since v2:
- Fixed build failure from ppc64e, by including spinlock_types.h on 
  rtas-types.h
- Improved commit messages

Changes since v1:
- Moved buffer from stack to PACA (as suggested by Paul Mackerras)
- Added missing output bits
- Improve documentation following kernel-doc format (as suggested by
  Nathan Lynch)


Leonardo Bras (2):
  powerpc/rtas: Move type/struct definitions from rtas.h into
rtas-types.h
  powerpc/rtas: Implement reentrant rtas call

 arch/powerpc/include/asm/paca.h   |   2 +
 arch/powerpc/include/asm/rtas-types.h | 126 ++
 arch/powerpc/include/asm/rtas.h   | 119 +---
 arch/powerpc/kernel/rtas.c|  42 +
 arch/powerpc/sysdev/xics/ics-rtas.c   |  22 ++---
 5 files changed, 183 insertions(+), 128 deletions(-)
 create mode 100644 arch/powerpc/include/asm/rtas-types.h

-- 
2.25.4



[PATCH v6 2/2] powerpc/rtas: Implement reentrant rtas call

2020-05-18 Thread Leonardo Bras
Implement rtas_call_reentrant() for reentrant rtas-calls:
"ibm,int-on", "ibm,int-off",ibm,get-xive" and  "ibm,set-xive".

On LoPAPR Version 1.1 (March 24, 2016), from 7.3.10.1 to 7.3.10.4,
items 2 and 3 say:

2 - For the PowerPC External Interrupt option: The * call must be
reentrant to the number of processors on the platform.
3 - For the PowerPC External Interrupt option: The * argument call
buffer for each simultaneous call must be physically unique.

So, these rtas-calls can be called in a lockless way, as long as each
cpu doing such a rtas call uses a different buffer.

For this, it was suggested to add the buffer (struct rtas_args)
to the PACA struct, so each cpu can have its own buffer.
The PACA struct received a pointer to the rtas buffer, which is
allocated in the memory range available to 32-bit RTAS.

Reentrant rtas calls are useful to avoid deadlocks during a crash,
where rtas-calls are needed, but some other thread crashed while holding
the rtas.lock.

This is a backtrace of a deadlock from a kdump testing environment:

  #0 arch_spin_lock
  #1  lock_rtas ()
  #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
  #3  ics_rtas_mask_real_irq (hw_irq=4100)
  #4  machine_kexec_mask_interrupts
  #5  default_machine_crash_shutdown
  #6  machine_crash_shutdown
  #7  __crash_kexec
  #8  crash_kexec
  #9  oops_end

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/paca.h |  4 +++
 arch/powerpc/include/asm/rtas.h |  1 +
 arch/powerpc/kernel/paca.c  | 34 +++
 arch/powerpc/kernel/rtas.c  | 52 +
 arch/powerpc/sysdev/xics/ics-rtas.c | 22 ++--
 5 files changed, 102 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index e3cc9eb9204d..1e2d45f3f84c 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -270,6 +271,9 @@ struct paca_struct {
 #ifdef CONFIG_MMIOWB
struct mmiowb_state mmiowb_state;
 #endif
+#ifdef CONFIG_PPC_RTAS
+   struct rtas_args *rtas_args_reentrant;
+#endif /* CONFIG_PPC_RTAS */
 } cacheline_aligned;
 
 extern void copy_mm_to_paca(struct mm_struct *mm);
diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index c35c5350b7e4..fa7509c85881 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -236,6 +236,7 @@ extern struct rtas_t rtas;
 extern int rtas_token(const char *service);
 extern int rtas_service_present(const char *service);
 extern int rtas_call(int token, int, int, int *, ...);
+int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...);
 void rtas_call_unlocked(struct rtas_args *args, int token, int nargs,
int nret, ...);
 extern void __noreturn rtas_restart(char *cmd);
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 3f91ccaa9c74..04855ad455c7 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "setup.h"
 
@@ -164,6 +165,32 @@ static struct slb_shadow * __init new_slb_shadow(int cpu, 
unsigned long limit)
 
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
+#ifdef CONFIG_PPC_RTAS
+
+/**
+ * new_rtas_args() - Allocates rtas args
+ * @cpu:   CPU number
+ * @limit: Memory limit for this allocation
+ *
+ * Allocates a struct rtas_args and return it's pointer,
+ * if not in Hypervisor mode
+ *
+ * Return: Pointer to allocated rtas_args
+ * NULL if CPU in Hypervisor Mode
+ */
+static struct rtas_args * __init new_rtas_args(int cpu, unsigned long limit)
+{
+   limit = min_t(unsigned long, limit, RTAS_INSTANTIATE_MAX);
+
+   if (early_cpu_has_feature(CPU_FTR_HVMODE))
+   return NULL;
+
+   return alloc_paca_data(sizeof(struct rtas_args), L1_CACHE_BYTES,
+  limit, cpu);
+}
+
+#endif /*CONFIG_PPC_RTAS*/
+
 /* The Paca is an array with one entry per processor.  Each contains an
  * lppaca, which contains the information shared between the
  * hypervisor and Linux.
@@ -202,6 +229,10 @@ void __init __nostackprotector initialise_paca(struct 
paca_struct *new_paca, int
/* For now -- if we have threads this will be adjusted later */
new_paca->tcd_ptr = &new_paca->tcd;
 #endif
+
+#ifdef CONFIG_PPC_RTAS
+   new_paca->rtas_args_reentrant = NULL;
+#endif
 }
 
 /* Put the paca pointer into r13 and SPRG_PACA */
@@ -273,6 +304,9 @@ void __init allocate_paca(int cpu)
 #endif
 #ifdef CONFIG_PPC_BOOK3S_64
paca->slb_shadow_ptr = new_slb_shadow(cpu, limit);
+#endif
+#ifdef CONFIG_PPC_RTAS
+   paca->rtas_args_reentrant = new_rtas_args(cpu, limit);
 #endif
paca_struct_size += sizeof(struct paca_struct);
 }
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kerne
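
The rtas.c hunk above is cut short in this archive. As a rough, hypothetical
sketch of what a reentrant call using the per-CPU PACA buffer could look like
(not the actual patch; va_rtas_call_unlocked() is assumed to be the internal
va_list helper behind rtas_call_unlocked()):

int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...)
{
        va_list list;
        struct rtas_args *args;
        unsigned long flags;
        int i, ret = 0;

        if (!rtas.entry || token == RTAS_UNKNOWN_SERVICE)
                return -1;

        local_irq_save(flags);
        preempt_disable();

        /* Use the per-cpu buffer from the PACA instead of rtas.args + rtas.lock */
        args = local_paca->rtas_args_reentrant;

        va_start(list, outputs);
        va_rtas_call_unlocked(args, token, nargs, nret, list);
        va_end(list);

        /* Copy out return values, as rtas_call() does */
        if (nret > 1 && outputs)
                for (i = 0; i < nret - 1; i++)
                        outputs[i] = be32_to_cpu(args->rets[i + 1]);
        if (nret > 0)
                ret = be32_to_cpu(args->rets[0]);

        preempt_enable();
        local_irq_restore(flags);

        return ret;
}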

[PATCH v3 1/1] powerpc/crash: Use NMI context for printk when starting to crash

2020-05-18 Thread Leonardo Bras
Currently, if the printk lock (logbuf_lock) is held by another thread during
a crash, there is a chance of deadlocking the crash on the next printk, and
blocking a possibly desired kdump.

At the start of default_machine_crash_shutdown, make printk enter
NMI context, as it will use per-cpu buffers to store the message,
and avoid locking logbuf_lock.

Suggested-by: Michael Ellerman 
Signed-off-by: Leonardo Bras 

---
Changes since v2:
- Changes usage of printk_nmi_enter() to nmi_enter()
  (Suggested by Nick Piggin)

Changes since v1:
- Added in-code comment explaining the need of context change
- Function moved to the start of default_machine_crash_shutdown,
  to avoid locking any printk on crashing routine.
- Title was 'Use NMI context for printk after crashing other CPUs'

---
 arch/powerpc/kexec/crash.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index d488311efab1..53c5cf9b6d3c 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -311,6 +312,13 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
unsigned int i;
int (*old_handler)(struct pt_regs *regs);
 
+   /*
+* Avoid hardlocking with irresponsive CPU holding logbuf_lock,
+* by using printk nmi_context
+*/
+   if (!in_nmi())
+   nmi_enter();
+
/*
 * This function is only called after the system
 * has panicked or is otherwise in a critical state.
-- 
2.25.4



Re: [PATCH v5 0/4] Allow bigger 64bit window by removing default DMA window

2020-08-11 Thread Leonardo Bras
Hello Michael,

Do you suggest any change for this patchset?
Any chance it can get in this merge window?

Best regards,
Leonardo Bras

On Wed, 2020-08-05 at 00:04 -0300, Leonardo Bras wrote:
> There are some devices in which a hypervisor may only allow 1 DMA window
> to exist at a time, and in those cases, a DDW is never created to them,
> since the default DMA window keeps using this resource.
> 
> LoPAR recommends this procedure:
> 1. Remove the default DMA window,
> 2. Query for which configs the DDW can be created,
> 3. Create a DDW.
> 
> Patch #1:
> Create defines for outputs of ibm,ddw-applicable, so it's easier to
> identify them.
> 
> Patch #2:
> - After LoPAR level 2.8, there is an extension that can make
>   ibm,query-pe-dma-windows to have 6 outputs instead of 5. This changes the
>   order of the outputs, and that can cause some trouble. 
> - query_ddw() was updated to check how many outputs the 
>   ibm,query-pe-dma-windows is supposed to have, update the rtas_call() and
>   deal correctly with the outputs in both cases.
> - This patch looks somehow unrelated to the series, but it can avoid future
>   problems on DDW creation.
> 
> Patch #3 moves the window-removing code from remove_ddw() to
> remove_dma_window(), creating a way to delete any DMA window, so it can be
> used to delete the default DMA window.
> 
> Patch #4 makes use of the remove_dma_window() from patch #3 to remove the
> default DMA window before query_ddw(). It also implements a new rtas call
> to recover the default DMA window, in case anything fails after it was
> removed, and a DDW couldn't be created.
> 
> ---
> Changes since v4:
> - Removed patches 5+ in order to deal with a feature at a time
> - Remove unnecessary parentesis in patch #4
> - Changed patch #4 title from 
>   "Remove default DMA window before creating DDW"
> - Included David Dai tested-by
> - v4 link: 
> http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=190051&state=%2A&archive=both
> 
> Changes since v3:
> - Introduces new patch #5, to prepare for an important change in #6
> - struct iommu_table was not being updated, so include a way to do this
>   in patch #6.
> - Improved patch #4 based in a suggestion from Alexey, to make code
>   more easily understandable
> - v3 link: 
> http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=187348&state=%2A&archive=both
> 
> Changes since v2:
> - Change the way ibm,ddw-extensions is accessed, using a proper function
>   instead of doing this inline everytime it's used.
> - Remove previous patch #6, as it doesn't look like it would be useful.
> - Add new patch, for changing names from direct* to dma*, as indirect 
>   mapping can be used from now on.
> - Fix some typos, corrects some define usage.
> - v2 link: 
> http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=185433&state=%2A&archive=both
> 
> Changes since v1:
> - Add defines for ibm,ddw-applicable and ibm,ddw-extensions outputs
> - Merge aux function query_ddw_out_sz() into query_ddw()
> - Merge reset_dma_window() patch (prev. #2) into remove default DMA
>   window patch (#4).
> - Keep device_node *np name instead of using pdn in remove_*()
> - Rename 'device_node *pdn' into 'parent' in new functions
> - Rename dfl_win to default_win
> - Only remove the default DMA window if there is no window available
>   in first query.
> - Check if default DMA window can be restored before removing it.
> - Fix 'unitialized use' (found by travis mpe:ci-test)
> - New patches #5 and #6
> - v1 link: 
> http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=184420&state=%2A&archive=both
> 
> Special thanks for Alexey Kardashevskiy, Brian King and
> Oliver O'Halloran for the feedback provided!
> 
> 
> Leonardo Bras (4):
>   powerpc/pseries/iommu: Create defines for operations in
> ibm,ddw-applicable
>   powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
>   powerpc/pseries/iommu: Move window-removing part of remove_ddw into
> remove_dma_window
>   powerpc/pseries/iommu: Allow bigger 64bit window by removing default
> DMA window
> 
>  arch/powerpc/platforms/pseries/iommu.c | 242 -
>  1 file changed, 195 insertions(+), 47 deletions(-)
> 



Re: [PATCH v5 0/4] Allow bigger 64bit window by removing default DMA window

2020-08-05 Thread Leonardo Bras
Travis reported successful compilation with mpe/merge:

https://travis-ci.org/github/LeoBras/linux-ppc/builds/715028857



[PATCH v2 1/1] powerpc/kvm: Fix mask size for emulated msgsndp

2020-12-08 Thread Leonardo Bras
According to ISAv3.1 and ISAv3.0b, msgsndp is described as splitting RB into:
msgtype <- (RB) 32:36
payload <- (RB) 37:63
t   <- (RB) 57:63

The current way of getting 'msgtype' and 't' is missing their MSB:
msgtype: ((arg >> 27) & 0xf) : Gets (RB) 33:36, missing bit 32
t:   (arg &= 0x3f)   : Gets (RB) 58:63, missing bit 57

Fix this by applying the correct mask.
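
For reference, a small illustrative helper showing how an ISA big-endian bit
field a:b is extracted from the 64-bit (RB) value, matching the masks above
(arg mirrors the variable used in the patch):

/* ISA numbers bits from the MSB: bit 0 is the top bit of the 64-bit register */
static inline unsigned long rb_field(unsigned long rb, int a, int b)
{
        return (rb >> (63 - b)) & ((1UL << (b - a + 1)) - 1);
}

/* msgtype = rb_field(arg, 32, 36) == (arg >> 27) & 0x1f   (5 bits) */
/* t       = rb_field(arg, 57, 63) ==  arg        & 0x7f   (7 bits) */
/* The old masks 0xf and 0x3f kept only 4 and 6 bits, dropping the MSB. */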

Signed-off-by: Leonardo Bras 
---
Changes since v1:
- Commit message 's/LSB/MSB/', because ISA ordering is big-endian.

 arch/powerpc/kvm/book3s_hv.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e3b1839fc251..5af0a429cee8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1241,9 +1241,9 @@ static int kvmppc_emulate_doorbell_instr(struct kvm_vcpu 
*vcpu)
switch (get_xop(inst)) {
case OP_31_XOP_MSGSNDP:
arg = kvmppc_get_gpr(vcpu, rb);
-   if (((arg >> 27) & 0xf) != PPC_DBELL_SERVER)
+   if (((arg >> 27) & 0x1f) != PPC_DBELL_SERVER)
break;
-   arg &= 0x3f;
+   arg &= 0x7f;
if (arg >= kvm->arch.emul_smt_mode)
break;
tvcpu = kvmppc_find_vcpu(kvm, vcpu->vcpu_id - thr + arg);
@@ -1256,7 +1256,7 @@ static int kvmppc_emulate_doorbell_instr(struct kvm_vcpu 
*vcpu)
break;
case OP_31_XOP_MSGCLRP:
arg = kvmppc_get_gpr(vcpu, rb);
-   if (((arg >> 27) & 0xf) != PPC_DBELL_SERVER)
+   if (((arg >> 27) & 0x1f) != PPC_DBELL_SERVER)
break;
vcpu->arch.vcore->dpdes = 0;
vcpu->arch.doorbell_request = 0;
-- 
2.25.4



[PATCH 1/1] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs

2021-03-18 Thread Leonardo Bras
Currently both iommu_alloc_coherent() and iommu_free_coherent() align the
desired allocation size to PAGE_SIZE, and get system pages and IOMMU
mappings (TCEs) for that value.

When IOMMU_PAGE_SIZE < PAGE_SIZE, this behavior may cause unnecessary
TCEs to be created for mapping the whole system page.

Example:
- PAGE_SIZE = 64k, IOMMU_PAGE_SIZE() = 4k
- iommu_alloc_coherent() is called for 128 bytes
- 1 system page (64k) is allocated
- 16 IOMMU pages (16 x 4k) are allocated (16 TCEs used)

It would be enough to use a single TCE for this, so 15 TCEs are
wasted in the process.

Update iommu_*_coherent() to make sure the size alignment happens only
for IOMMU_PAGE_SIZE() before calling iommu_alloc() and iommu_free().

Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
with IOMMU_PAGE_ALIGN(n, tbl), which is easier to read and does the
same.
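
A worked example of the change, with the numbers above (PAGE_SIZE = 64k,
IOMMU_PAGE_SIZE() = 4k, 128-byte request); values are illustrative only:

/* Before: TCEs are computed from the PAGE_SIZE-aligned size */
size      = PAGE_ALIGN(128);               /* 65536 */
nio_pages = size >> tbl->it_page_shift;    /* 65536 >> 12 = 16 TCEs */

/* After: TCEs are computed from the IOMMU_PAGE_SIZE-aligned size */
size_io   = IOMMU_PAGE_ALIGN(128, tbl);    /* 4096 */
nio_pages = size_io >> tbl->it_page_shift; /* 4096 >> 12 = 1 TCE */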

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5b69a6a72a0e..3329ef045805 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -851,6 +851,7 @@ void *iommu_alloc_coherent(struct device *dev, struct 
iommu_table *tbl,
unsigned int order;
unsigned int nio_pages, io_order;
struct page *page;
+   size_t size_io = size;
 
size = PAGE_ALIGN(size);
order = get_order(size);
@@ -877,8 +878,9 @@ void *iommu_alloc_coherent(struct device *dev, struct 
iommu_table *tbl,
memset(ret, 0, size);
 
/* Set up tces to cover the allocated range */
-   nio_pages = size >> tbl->it_page_shift;
-   io_order = get_iommu_order(size, tbl);
+   size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
+   nio_pages = size_io >> tbl->it_page_shift;
+   io_order = get_iommu_order(size_io, tbl);
mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
  mask >> tbl->it_page_shift, io_order, 0);
if (mapping == DMA_MAPPING_ERROR) {
@@ -893,10 +895,9 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t 
size,
 void *vaddr, dma_addr_t dma_handle)
 {
if (tbl) {
-   unsigned int nio_pages;
+   size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
+   unsigned int nio_pages = size_io >> tbl->it_page_shift;
 
-   size = PAGE_ALIGN(size);
-   nio_pages = size >> tbl->it_page_shift;
iommu_free(tbl, dma_handle, nio_pages);
size = PAGE_ALIGN(size);
free_pages((unsigned long)vaddr, get_order(size));
-- 
2.29.2



[PATCH 1/1] powerpc/kernel/iommu: Use largepool as a last resort when !largealloc

2021-03-18 Thread Leonardo Bras
As of today, an iommu_range_alloc() with !largealloc (npages <= 15) will
only be able to use 3/4 of the available pages, given that pages on the
largepool are not available for !largealloc allocations.

This could prevent some drivers from fully using all the available pages
of the DMA window.

Add pages on largepool as a last resort for !largealloc, making all pages
of the DMA window available.
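
As an illustration of how much is left unused (assuming the usual pool split
in iommu_init_table(), where the top quarter of the table is reserved for the
large pool; numbers are an example, not from a real system):

/* Example: 8GB window with 4k IOMMU pages -> it_size = 2M TCEs */
unsigned long it_size    = 1UL << 21;
unsigned long smallpools = it_size * 3 / 4;      /* 1572864 TCEs, shared by the small pools */
unsigned long largepool  = it_size - smallpools; /*  524288 TCEs in the large pool          */
/* Before this patch, allocations with npages <= 15 could never use those
 * 524288 entries, even when the small pools were exhausted. */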

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 3329ef045805..ae6ad8dca605 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -255,6 +255,15 @@ static unsigned long iommu_range_alloc(struct device *dev,
pass++;
goto again;
 
+   } else if (pass == tbl->nr_pools + 1) {
+   /* Last resort: try largepool */
+   spin_unlock(&pool->lock);
+   pool = &tbl->large_pool;
+   spin_lock(&pool->lock);
+   pool->hint = pool->start;
+   pass++;
+   goto again;
+
} else {
/* Give up */
spin_unlock_irqrestore(&(pool->lock), flags);
-- 
2.29.2



[PATCH 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-03-22 Thread Leonardo Bras
According to LoPAR, the ibm,query-pe-dma-window output named "IO Page Sizes"
lets the OS know all possible page sizes that can be used for creating a
new DDW.

Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, the hypervisor may also offer 32M, 64M,
128M, 256M and 16G.

Enabling bigger pages would be interesting for direct mapping systems
with a lot of RAM, while using fewer TCE entries.
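
For instance, for a direct-mapped 1TB window (illustrative arithmetic):

/* TCEs needed = window size / IOMMU page size, for a 1TB window: */
/*   64k pages: (1UL << 40) >> 16 = 16777216 TCEs                 */
/*   16M pages: (1UL << 40) >> 24 =    65536 TCEs                 */
/*   16G pages: (1UL << 40) >> 34 =       64 TCEs                 */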

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/iommu.h   |  8 
 arch/powerpc/platforms/pseries/iommu.c | 28 +++---
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..c170048b7a1b 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -19,6 +19,14 @@
 #include 
 #include 
 
+#define IOMMU_PAGE_SHIFT_16G   34
+#define IOMMU_PAGE_SHIFT_256M  28
+#define IOMMU_PAGE_SHIFT_128M  27
+#define IOMMU_PAGE_SHIFT_64M   26
+#define IOMMU_PAGE_SHIFT_32M   25
+#define IOMMU_PAGE_SHIFT_16M   24
+#define IOMMU_PAGE_SHIFT_64K   16
+
 #define IOMMU_PAGE_SHIFT_4K  12
 #define IOMMU_PAGE_SIZE_4K   (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K)
 #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..02958e80aa91 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1099,6 +1099,24 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
 }
 
+/* Returns page shift based on "IO Page Sizes" output at 
ibm,query-pe-dma-window. SeeL LoPAR */
+static int iommu_get_page_shift(u32 query_page_size)
+{
+   const int shift[] = {IOMMU_PAGE_SHIFT_4K,   IOMMU_PAGE_SHIFT_64K,  
IOMMU_PAGE_SHIFT_16M,
+IOMMU_PAGE_SHIFT_32M,  IOMMU_PAGE_SHIFT_64M,  
IOMMU_PAGE_SHIFT_128M,
+IOMMU_PAGE_SHIFT_256M, IOMMU_PAGE_SHIFT_16G};
+   int i = ARRAY_SIZE(shift) - 1;
+
+   /* Looks for the largest page size supported */
+   for (; i >= 0; i--) {
+   if (query_page_size & (1 << i))
+   return shift[i];
+   }
+
+   /* No valid page size found. */
+   return 0;
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1206,13 +1224,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
}
-   if (query.page_size & 4) {
-   page_shift = 24; /* 16MB */
-   } else if (query.page_size & 2) {
-   page_shift = 16; /* 64kB */
-   } else if (query.page_size & 1) {
-   page_shift = 12; /* 4kB */
-   } else {
+
+   page_shift = iommu_get_page_shift(query.page_size);
+   if (!page_shift) {
dev_dbg(&dev->dev, "no supported direct page size in mask %x",
  query.page_size);
goto out_failed;
-- 
2.29.2



[PATCH 1/1] of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-bit memory addresses

2021-04-15 Thread Leonardo Bras
Many other resource flag parsers already add this flag when the input
has bits 24 & 25 set, so update this one to do the same.

Some devices (like virtio-net) have more than one memory resource
(like MMIO32 and MMIO64), and without this flag one would need to
verify the address range to know which one is which.
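
For context, bits 24 & 25 are the "ss" space-code field of the first
(phys.hi) cell of a PCI OF address. A minimal sketch of the decoding
(illustrative only, not the kernel helper):

#include <stdint.h>

/* ss = 0x01: I/O space, 0x02: 32-bit MMIO, 0x03: 64-bit MMIO */
static unsigned int pci_space_code(uint32_t phys_hi)
{
        return (phys_hi >> 24) & 0x03;
}

/* e.g. a 64-bit prefetchable BAR has phys.hi = 0x43000010 -> space code 0x03 */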

Signed-off-by: Leonardo Bras 
---
 drivers/of/address.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 73ddf2540f3f..dc7147843783 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -116,9 +116,12 @@ static unsigned int of_bus_pci_get_flags(const __be32 
*addr)
flags |= IORESOURCE_IO;
break;
case 0x02: /* 32 bits */
-   case 0x03: /* 64 bits */
flags |= IORESOURCE_MEM;
break;
+
+   case 0x03: /* 64 bits */
+   flags |= IORESOURCE_MEM | IORESOURCE_MEM_64;
+   break;
}
if (w & 0x4000)
flags |= IORESOURCE_PREFETCH;
-- 
2.30.2



Re: [PATCH 1/1] of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-bit memory addresses

2021-04-16 Thread Leonardo Bras
Hello Rob, thanks for this feedback!

On Thu, 2021-04-15 at 13:59 -0500, Rob Herring wrote:
> +PPC and PCI lists
> 
> On Thu, Apr 15, 2021 at 1:01 PM Leonardo Bras  wrote:
> > 
> > Many other resource flag parsers already add this flag when the input
> > has bits 24 & 25 set, so update this one to do the same.
> 
> Many others? Looks like sparc and powerpc to me. 
> 

s390 also does that, but it looks like it comes from a device-tree.

> Those would be the
> ones I worry about breaking. Sparc doesn't use of/address.c so it's
> fine. Powerpc version of the flags code was only fixed in 2019, so I
> don't think powerpc will care either.

In powerpc I reach this function with this stack, while configuring a
virtio-net device for a qemu/KVM pseries guest:

pci_process_bridge_OF_ranges+0xac/0x2d4
pSeries_discover_phbs+0xc4/0x158
discover_phbs+0x40/0x60
do_one_initcall+0x60/0x2d0
kernel_init_freeable+0x308/0x3a8
kernel_init+0x2c/0x168
ret_from_kernel_thread+0x5c/0x70

For this, both MMIO32 and MMIO64 resources will have flags 0x200.

> 
> I noticed both sparc and powerpc set PCI_BASE_ADDRESS_MEM_TYPE_64 in
> the flags. AFAICT, that's not set anywhere outside of arch code. So
> never for riscv, arm and arm64 at least. That leads me to
> pci_std_update_resource() which is where the PCI code sets BARs and
> just copies the flags in PCI_BASE_ADDRESS_MEM_MASK ignoring
> IORESOURCE_* flags. So it seems like 64-bit is still not handled and
> neither is prefetch.
> 

I am not sure if you mean here:
a) it's ok to add IORESOURCE_MEM_64 here, because it does not affect
anything else, or
b) it should be using PCI_BASE_ADDRESS_MEM_TYPE_64 
(or IORESOURCE_MEM_64 | PCI_BASE_ADDRESS_MEM_TYPE_64) instead, since
it's how it's added in powerpc/sparc, and else there is no point.

Again, thanks for helping!

Best regards,
Leonardo Bras



Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-13 Thread Leonardo Bras
On Mon, 2021-04-12 at 17:21 -0500, Segher Boessenkool wrote:
> On Fri, Apr 09, 2021 at 02:36:16PM +1000, Alexey Kardashevskiy wrote:
> > On 08/04/2021 19:04, Michael Ellerman wrote:
> > > > > > +#define QUERY_DDW_PGSIZE_4K0x01
> > > > > > +#define QUERY_DDW_PGSIZE_64K   0x02
> > > > > > +#define QUERY_DDW_PGSIZE_16M   0x04
> > > > > > +#define QUERY_DDW_PGSIZE_32M   0x08
> > > > > > +#define QUERY_DDW_PGSIZE_64M   0x10
> > > > > > +#define QUERY_DDW_PGSIZE_128M  0x20
> > > > > > +#define QUERY_DDW_PGSIZE_256M  0x40
> > > > > > +#define QUERY_DDW_PGSIZE_16G   0x80
> > > > > 
> > > > > I'm not sure the #defines really gain us much vs just putting the
> > > > > literal values in the array below?
> > > > 
> > > > Then someone says "u magic values" :) I do not mind either way. 
> > > > Thanks,
> > > 
> > > Yeah that's true. But #defining them doesn't make them less magic, if
> > > you only use them in one place :)
> > 
> > Defining them with "QUERY_DDW" in the names kinda tells where they are 
> > from. Can also grep QEMU using these to see how the other side handles 
> > it. Dunno.
> 
> And *not* defining anything reduces the mental load a lot.  You can add
> a comment at the single spot you use them, explaining what this is, in a
> much better way!
> 
> Comments are *good*.
> 
> 
> Segher

Thanks for the feedback Alexey, Michael and Segher!

I have sent a v3 for this patch. 
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210408201915.174217-1-leobra...@gmail.com/

Please let me know of your feedback in it.

Best regards,
Leonardo Bras



Re: [PATCH 1/1] of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-bit memory addresses

2021-04-19 Thread Leonardo Bras
On Mon, 2021-04-19 at 10:44 -0500, Rob Herring wrote:
> On Fri, Apr 16, 2021 at 3:58 PM Leonardo Bras  wrote:
> > 
> > Hello Rob, thanks for this feedback!
> > 
> > On Thu, 2021-04-15 at 13:59 -0500, Rob Herring wrote:
> > > +PPC and PCI lists
> > > 
> > > On Thu, Apr 15, 2021 at 1:01 PM Leonardo Bras  wrote:
> > > > 
> > > > Many other resource flag parsers already add this flag when the input
> > > > has bits 24 & 25 set, so update this one to do the same.
> > > 
> > > Many others? Looks like sparc and powerpc to me.
> > > 
> > 
> > s390 also does that, but it look like it comes from a device-tree.
> 
> I'm only looking at DT based platforms, and s390 doesn't use DT.

Correct. 
Sorry, I somehow wrote above the opposite of what I was thinking.

> 
> > > Those would be the
> > > ones I worry about breaking. Sparc doesn't use of/address.c so it's
> > > fine. Powerpc version of the flags code was only fixed in 2019, so I
> > > don't think powerpc will care either.
> > 
> > In powerpc I reach this function with this stack, while configuring a
> > virtio-net device for a qemu/KVM pseries guest:
> > 
> > pci_process_bridge_OF_ranges+0xac/0x2d4
> > pSeries_discover_phbs+0xc4/0x158
> > discover_phbs+0x40/0x60
> > do_one_initcall+0x60/0x2d0
> > kernel_init_freeable+0x308/0x3a8
> > kernel_init+0x2c/0x168
> > ret_from_kernel_thread+0x5c/0x70
> > 
> > For this, both MMIO32 and MMIO64 resources will have flags 0x200.
> 
> Oh good, powerpc has 2 possible flags parsing functions. So in the
> above path, do we need to set PCI_BASE_ADDRESS_MEM_TYPE_64?
> 
> Does pci_parse_of_flags() get called in your case?
> 

It's called in some cases, but not for the device I am debugging
(virtio-net pci@8002000). 

For the above device, here is an expanded stack trace:

of_bus_pci_get_flags() (from parser->bus->get_flags()) 
of_pci_range_parser_one() (from macro for_each_of_pci_range)
pci_process_bridge_OF_ranges+0xac/0x2d4
pSeries_discover_phbs+0xc4/0x158
discover_phbs+0x40/0x60
do_one_initcall+0x60/0x2d0
kernel_init_freeable+0x308/0x3a8
kernel_init+0x2c/0x168
ret_from_kernel_thread+0x5c/0x70

For other devices, I could also see the following stack trace:
## device ethernet@8

pci_parse_of_flags()
of_create_pci_dev+0x7f0/0xa40
__of_scan_bus+0x248/0x320
pcibios_scan_phb+0x370/0x3b0
pcibios_init+0x8c/0x12c
do_one_initcall+0x60/0x2d0
kernel_init_freeable+0x308/0x3a8
kernel_init+0x2c/0x168
ret_from_kernel_thread+0x5c/0x70

Devices that get parsed with of_bus_pci_get_flags() appear first in
dmesg (around 0.015s in my test), while devices that get parsed by
pci_parse_of_flags() appear later (0.025s in my test).

I am not really used to this code, but having the term "discover phbs"
in the first trace and the term "scan phb" in the second makes me
wonder if the first trace is seen on devices that are described in
the device-tree, while the second trace is seen on devices not present in
the device-tree and found by scanning the pci bus.

> > > I noticed both sparc and powerpc set PCI_BASE_ADDRESS_MEM_TYPE_64 in
> > > the flags. AFAICT, that's not set anywhere outside of arch code. So
> > > never for riscv, arm and arm64 at least. That leads me to
> > > pci_std_update_resource() which is where the PCI code sets BARs and
> > > just copies the flags in PCI_BASE_ADDRESS_MEM_MASK ignoring
> > > IORESOURCE_* flags. So it seems like 64-bit is still not handled and
> > > neither is prefetch.
> > > 
> > 
> > I am not sure if you mean here:
> > a) it's ok to add IORESOURCE_MEM_64 here, because it does not affect
> > anything else, or
> > b) it should be using PCI_BASE_ADDRESS_MEM_TYPE_64
> > (or IORESOURCE_MEM_64 | PCI_BASE_ADDRESS_MEM_TYPE_64) instead, since
> > it's how it's added in powerpc/sparc, and else there is no point.
> 
> I'm wondering if a) is incomplete and PCI_BASE_ADDRESS_MEM_TYPE_64
> also needs to be set. The question is ultimately are BARs getting set
> correctly for 64-bit? It looks to me like they aren't.

I am not used to these terms; does BAR mean 'Base Address Register'?

If so, those are the addresses stored in pci->phb->mem_resources[i] and
pci->phb->mem_offset[i], printed from enable_ddw() (which takes place
well after discovering the device, 0.17s into my run).

resource #1 pci@8002000: start=0x20008000
end=0x2000 flags=0x200 desc=0x0 offset=0x2000
resource #2 pci@8002000: start=0x2100
end=0x21ff flags=0x200 desc=0x0 offset=0x0

The message above was printed without this patch.
With the patch, the flags for memory resource #2 get ORed with
0x0010.

Is it enough to know if BARs are correctly set for 64-bit?
If it's not, how can I check?

> 
> Rob

Thanks Rob!

Leonardo Brás



Re: [PATCH 1/1] of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-bit memory addresses

2021-04-19 Thread Leonardo Bras
On Mon, 2021-04-19 at 20:39 -0500, Rob Herring wrote:
> On Mon, Apr 19, 2021 at 7:35 PM Leonardo Bras  wrote:
> > 
> > On Mon, 2021-04-19 at 10:44 -0500, Rob Herring wrote:
> > > On Fri, Apr 16, 2021 at 3:58 PM Leonardo Bras  wrote:
> > > > 
> > > > Hello Rob, thanks for this feedback!
> > > > 
> > > > On Thu, 2021-04-15 at 13:59 -0500, Rob Herring wrote:
> > > > > +PPC and PCI lists
> > > > > 
> > > > > On Thu, Apr 15, 2021 at 1:01 PM Leonardo Bras  
> > > > > wrote:
> > > > > > 
> > > > > > Many other resource flag parsers already add this flag when the 
> > > > > > input
> > > > > > has bits 24 & 25 set, so update this one to do the same.
> > > > > 
> > > > > Many others? Looks like sparc and powerpc to me.
> > > > > 
> > > > 
> > > > s390 also does that, but it look like it comes from a device-tree.
> > > 
> > > I'm only looking at DT based platforms, and s390 doesn't use DT.
> > 
> > Correct.
> > Sorry, I somehow write above the opposite of what I was thinking.
> > 
> > > 
> > > > > Those would be the
> > > > > ones I worry about breaking. Sparc doesn't use of/address.c so it's
> > > > > fine. Powerpc version of the flags code was only fixed in 2019, so I
> > > > > don't think powerpc will care either.
> > > > 
> > > > In powerpc I reach this function with this stack, while configuring a
> > > > virtio-net device for a qemu/KVM pseries guest:
> > > > 
> > > > pci_process_bridge_OF_ranges+0xac/0x2d4
> > > > pSeries_discover_phbs+0xc4/0x158
> > > > discover_phbs+0x40/0x60
> > > > do_one_initcall+0x60/0x2d0
> > > > kernel_init_freeable+0x308/0x3a8
> > > > kernel_init+0x2c/0x168
> > > > ret_from_kernel_thread+0x5c/0x70
> > > > 
> > > > For this, both MMIO32 and MMIO64 resources will have flags 0x200.
> > > 
> > > Oh good, powerpc has 2 possible flags parsing functions. So in the
> > > above path, do we need to set PCI_BASE_ADDRESS_MEM_TYPE_64?
> > > 
> > > Does pci_parse_of_flags() get called in your case?
> > > 
> > 
> > It's called in some cases, but not for the device I am debugging
> > (virtio-net pci@8002000).
> > 
> > For the above device, here is an expanded stack trace:
> > 
> > of_bus_pci_get_flags() (from parser->bus->get_flags())
> > of_pci_range_parser_one() (from macro for_each_of_pci_range)
> > pci_process_bridge_OF_ranges+0xac/0x2d4
> > pSeries_discover_phbs+0xc4/0x158
> > discover_phbs+0x40/0x60
> > do_one_initcall+0x60/0x2d0
> > kernel_init_freeable+0x308/0x3a8
> > kernel_init+0x2c/0x168
> > ret_from_kernel_thread+0x5c/0x70
> > 
> > For other devices, I could also see the following stack trace:
> > ## device ethernet@8
> > 
> > pci_parse_of_flags()
> > of_create_pci_dev+0x7f0/0xa40
> > __of_scan_bus+0x248/0x320
> > pcibios_scan_phb+0x370/0x3b0
> > pcibios_init+0x8c/0x12c
> > do_one_initcall+0x60/0x2d0
> > kernel_init_freeable+0x308/0x3a8
> > kernel_init+0x2c/0x168
> > ret_from_kernel_thread+0x5c/0x70
> > 
> > Devices that get parsed with of_bus_pci_get_flags() appears first at
> > dmesg (around 0.015s in my test), while devices that get parsed by
> > pci_parse_of_flags() appears later (0.025s in my test).
> > 
> > I am not really used to this code, but having the term "discover phbs"
> > in the first trace and the term "scan phb" in the second, makes me
> > wonder if the first trace is seen on devices that are seen/described in
> > the device-tree and the second trace is seen in devices not present in
> > the device-tree and found scanning pci bus.
> 
> That was my guess as well. I think on pSeries that most PCI devices
> are in the DT whereas on Arm and other flattened DT (non OpenFirmware)
> platforms PCI devices are not in DT.
> 

It makes sense to me. 

>  Of course, for virtio devices,
> they would not be in DT in either case.

I don't get this part... in pseries it looks like virtio devices can be
in device-tree.

Oh, I think I get it... this pci@8002000 looks like a bus
(described in device-tree, so discovered), and then the devices are
inside it, getting scanned.

The virtio device gets the correct flags (from pci_parse_of_flags), but
the bus (pci@8002000) does not

[PATCH 1/1] powerpc/pseries/iommu: Fix window size for direct mapping with pmem

2021-04-19 Thread Leonardo Bras
As of today, if the DDW is big enough to fit (1 << MAX_PHYSMEM_BITS), it's
possible to use direct DMA mapping even with a pmem region.

But if that happens, the window size (len) is set to
(MAX_PHYSMEM_BITS - page_shift) instead of MAX_PHYSMEM_BITS, creating a
DDW pagesize times smaller than needed, which is insufficient for correct
usage.

Fix this so the correct window size is used in this case.
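
A worked example, assuming MAX_PHYSMEM_BITS = 51 and 16MB IOMMU pages
(illustrative values only):

/* page_shift = 24 (16MB TCE pages)                                        */
/*   buggy: len = MAX_PHYSMEM_BITS - page_shift = 27 -> window = 128MB     */
/*   fixed: len = MAX_PHYSMEM_BITS              = 51 -> window = 2PB       */
/* query.largest_available_block counts TCEs, so it is compared against    */
/* 1ULL << (MAX_PHYSMEM_BITS - page_shift); the subtraction belongs in the */
/* comparison, not in the window order 'len'.                              */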

Fixes: bf6e2d562bbc4 ("powerpc/dma: Fallback to dma_ops when persistent memory
present")
Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..836cbbe0ecc5 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1229,7 +1229,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
if (pmem_present) {
if (query.largest_available_block >=
(1ULL << (MAX_PHYSMEM_BITS - page_shift)))
-   len = MAX_PHYSMEM_BITS - page_shift;
+   len = MAX_PHYSMEM_BITS;
else
dev_info(&dev->dev, "Skipping ibm,pmemory");
}
-- 
2.30.2



Re: [PATCH 1/1] powerpc/pseries/iommu: Fix window size for direct mapping with pmem

2021-04-19 Thread Leonardo Bras
On Tue, 2021-04-20 at 15:18 +1000, Alexey Kardashevskiy wrote:
> 
> On 20/04/2021 14:54, Leonardo Bras wrote:
> > As of today, if the DDW is big enough to fit (1 << MAX_PHYSMEM_BITS) it's
> > possible to use direct DMA mapping even with pmem region.
> > 
> > But, if that happens, the window size (len) is set to
> > (MAX_PHYSMEM_BITS - page_shift) instead of MAX_PHYSMEM_BITS, causing a
> > pagesize times smaller DDW to be created, being insufficient for correct
> > usage.
> > 
> > Fix this so the correct window size is used in this case.
> 
> Good find indeed.
> 
> afaict this does not create a huge problem though as 
> query.largest_available_block is always smaller than (MAX_PHYSMEM_BITS - 
> page_shift) where it matters (phyp).
> 
> 
> Reviewed-by: Alexey Kardashevskiy 
> 

Thanks for reviewing!

Leonardo Bras



[PATCH 0/3] powerpc/mm/hash: Time improvements for memory hot(un)plug

2021-03-11 Thread Leonardo Bras
This patchset intends to reduce time needed for processing memory
hotplug/hotunplug in hash guests.

The first one makes sure guests with pagesize over 4k don't need to
go through HPT resize-downs after memory hotplug.

The second and third patches make hotplug / hotunplug perform a single
HPT resize per operation, instead of one for each shift change, or one
for each LMB in case of resize-down error.

Why isn't the same mechanism used for both memory hotplug and hotunplug?
They have different requirements:

Memory hotplug (usually) causes HPT resize-ups, which are fine to happen
at the start of hotplug, but resize-ups should never be disabled, as
other mechanisms may try to increase memory and hit issues with an HPT
that is too small.

Memory hotunplug causes HPT resize-downs, which can be disabled (the HPT
will just remain larger for a while), but they need to happen at the end of
a hotunplug operation. If we want to batch them, we need to disable
resize-downs and perform a single one at the end.
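
The resulting hot-unplug shape is roughly the following (simplified sketch;
the real loop in hotplug-memory.c also handles DRC state and error paths):

memory_batch_shrink_begin();    /* disable per-LMB HPT resize-downs  */
for_each_drmem_lmb(lmb)
        dlpar_remove_lmb(lmb);  /* remove memory, no HPT resize here */
memory_batch_shrink_end();      /* single resize-down to final size  */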

Tests done with this patchset in the same machine / guest config:
Starting memory: 129GB, DIMM: 256GB
Before patchset: hotplug = 710s, hotunplug = 621s.
After patchset: hotplug  = 21s, hotunplug = 100s.

Any feedback will be appreciated!
I believe the code may not be very well placed in available files,
so please give some feedback on that.

Best regards,

Leonardo Bras (3):
  powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug
  powerpc/mm/hash: Avoid multiple HPT resize-ups on memory hotplug
  powerpc/mm/hash: Avoid multiple HPT resize-downs on memory hotunplug

 arch/powerpc/include/asm/book3s/64/hash.h |  4 +
 arch/powerpc/include/asm/sparsemem.h  |  4 +
 arch/powerpc/mm/book3s64/hash_utils.c | 78 +++
 arch/powerpc/mm/book3s64/pgtable.c| 18 +
 .../platforms/pseries/hotplug-memory.c| 22 ++
 5 files changed, 111 insertions(+), 15 deletions(-)

-- 
2.29.2



[PATCH 1/3] powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug

2021-03-11 Thread Leonardo Bras
Because hypervisors may need to create HPTs without knowing the guest
page size, the smallest used page-size (4k) may be chosen, resulting in
a HPT that is possibly bigger than needed.

On a guest with bigger page-sizes, the amount of entries for HPT may be
too high, causing the guest to ask for an HPT resize-down on the first
hotplug.

This becomes a problem when HPT resize-down fails, and causes the
HPT resize to be performed on every LMB added, until the HPT size is
compatible with the guest memory size, causing a major slowdown.

So, avoiding HPT resizing-down on hot-add significantly improves memory
hotplug times.

As an example, hotplugging 256GB on a 129GB guest took 710s without this
patch, and 21s after applied.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 36 ---
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 73b06adb6eeb..cfb3ec164f56 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -794,7 +794,7 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-static int resize_hpt_for_hotplug(unsigned long new_mem_size)
+static int resize_hpt_for_hotplug(unsigned long new_mem_size, bool shrinking)
 {
unsigned target_hpt_shift;
 
@@ -803,19 +803,25 @@ static int resize_hpt_for_hotplug(unsigned long 
new_mem_size)
 
target_hpt_shift = htab_shift_for_mem_size(new_mem_size);
 
-   /*
-* To avoid lots of HPT resizes if memory size is fluctuating
-* across a boundary, we deliberately have some hysterisis
-* here: we immediately increase the HPT size if the target
-* shift exceeds the current shift, but we won't attempt to
-* reduce unless the target shift is at least 2 below the
-* current shift
-*/
-   if (target_hpt_shift > ppc64_pft_size ||
-   target_hpt_shift < ppc64_pft_size - 1)
-   return mmu_hash_ops.resize_hpt(target_hpt_shift);
+   if (shrinking) {
 
-   return 0;
+   /*
+* To avoid lots of HPT resizes if memory size is fluctuating
+* across a boundary, we deliberately have some hysterisis
+* here: we immediately increase the HPT size if the target
+* shift exceeds the current shift, but we won't attempt to
+* reduce unless the target shift is at least 2 below the
+* current shift
+*/
+
+   if (target_hpt_shift >= ppc64_pft_size - 1)
+   return 0;
+
+   } else if (target_hpt_shift <= ppc64_pft_size) {
+   return 0;
+   }
+
+   return mmu_hash_ops.resize_hpt(target_hpt_shift);
 }
 
 int hash__create_section_mapping(unsigned long start, unsigned long end,
@@ -828,7 +834,7 @@ int hash__create_section_mapping(unsigned long start, 
unsigned long end,
return -1;
}
 
-   resize_hpt_for_hotplug(memblock_phys_mem_size());
+   resize_hpt_for_hotplug(memblock_phys_mem_size(), false);
 
rc = htab_bolt_mapping(start, end, __pa(start),
   pgprot_val(prot), mmu_linear_psize,
@@ -847,7 +853,7 @@ int hash__remove_section_mapping(unsigned long start, 
unsigned long end)
int rc = htab_remove_mapping(start, end, mmu_linear_psize,
 mmu_kernel_ssize);
 
-   if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
+   if (resize_hpt_for_hotplug(memblock_phys_mem_size(), true) == -ENOSPC)
pr_warn("Hash collision while resizing HPT\n");
 
return rc;
-- 
2.29.2



[PATCH 3/3] powerpc/mm/hash: Avoid multiple HPT resize-downs on memory hotunplug

2021-03-11 Thread Leonardo Bras
During memory hotunplug, after each LMB is removed, the HPT may be
resized-down if it would map a max of 4 times the current amount of memory.
(2 shifts, due to the introduced hysteresis)

It usually is not an issue, but it can take a lot of time if HPT
resizing-down fails. This happens because resize-down failures
usually repeat at each LMB removal, until no bolted entries conflict
anymore, which can take a while to happen.

This can be solved by doing a single HPT resize at the end of memory
hotunplug, after all requested entries are removed.

To make this happen, it's necessary to temporarily disable all HPT
resize-downs before hotunplug, re-enable them after hotunplug ends,
and then resize-down HPT to the current memory size.

As an example, hotunplugging 256GB from a 385GB guest took 621s without
this patch, and 100s after applied.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/book3s/64/hash.h |  2 ++
 arch/powerpc/include/asm/sparsemem.h  |  2 ++
 arch/powerpc/mm/book3s64/hash_utils.c | 28 +++
 arch/powerpc/mm/book3s64/pgtable.c| 12 
 .../platforms/pseries/hotplug-memory.c| 16 +++
 5 files changed, 60 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 843b0a178590..f92697c107f7 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -256,6 +256,8 @@ int hash__create_section_mapping(unsigned long start, 
unsigned long end,
 int hash__remove_section_mapping(unsigned long start, unsigned long end);
 
 void hash_memory_batch_expand_prepare(unsigned long newsize);
+void hash_memory_batch_shrink_begin(void);
+void hash_memory_batch_shrink_end(void);
 
 #endif /* !__ASSEMBLY__ */
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index 16b5f5300c84..a7a8a0d070fc 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -18,6 +18,8 @@ extern int memory_add_physaddr_to_nid(u64 start);
 #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
 
 void memory_batch_expand_prepare(unsigned long newsize);
+void memory_batch_shrink_begin(void);
+void memory_batch_shrink_end(void);
 
 #ifdef CONFIG_NUMA
 extern int hot_add_scn_to_nid(unsigned long scn_addr);
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 1f6aa0bf27e7..e16f207de8e4 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -794,6 +794,9 @@ static unsigned long __init htab_get_table_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+
+atomic_t hpt_resize_disable = ATOMIC_INIT(0);
+
 static int resize_hpt_for_hotplug(unsigned long new_mem_size, bool shrinking)
 {
unsigned target_hpt_shift;
@@ -805,6 +808,10 @@ static int resize_hpt_for_hotplug(unsigned long 
new_mem_size, bool shrinking)
 
if (shrinking) {
 
+   /* When batch removing entries, only resizes HPT at the end. */
+   if (atomic_read_acquire(&hpt_resize_disable))
+   return 0;
+
/*
 * To avoid lots of HPT resizes if memory size is fluctuating
 * across a boundary, we deliberately have some hysterisis
@@ -872,6 +879,27 @@ void hash_memory_batch_expand_prepare(unsigned long 
newsize)
pr_warn("Hash collision while resizing HPT\n");
}
 }
+
+void hash_memory_batch_shrink_begin(void)
+{
+   /* Disable HPT resize-down during hot-unplug */
+   atomic_set_release(&hpt_resize_disable, 1);
+}
+
+void hash_memory_batch_shrink_end(void)
+{
+   unsigned long newsize;
+
+   /* Re-enables HPT resize-down after hot-unplug */
+   atomic_set_release(&hpt_resize_disable, 0);
+
+   newsize = memblock_phys_mem_size();
+   /* Resize to smallest SHIFT possible */
+   while (resize_hpt_for_hotplug(newsize, true) == -ENOSPC) {
+   newsize *= 2;
+   pr_warn("Hash collision while resizing HPT\n");
+   }
+}
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static void __init hash_init_partition_table(phys_addr_t hash_table,
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index f1cd8af0f67f..e01681e22e00 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -199,6 +199,18 @@ void memory_batch_expand_prepare(unsigned long newsize)
if (!radix_enabled())
hash_memory_batch_expand_prepare(newsize);
 }
+
+void memory_batch_shrink_begin(void)
+{
+   if (!radix_enabled())
+   hash_memory_batch_shrink_begin();
+}
+
+void memory_batch_shrink_end(void)
+{
+   if (!radix_enabled())
+   hash_memory_batch_shrink_end();
+}
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 void __init mmu_partition_table

[PATCH 2/3] powerpc/mm/hash: Avoid multiple HPT resize-ups on memory hotplug

2021-03-11 Thread Leonardo Bras
Every time a memory hotplug happens, and the memory limit crosses a 2^n
value, it may be necessary to perform HPT resizing-up, which can take
some time (over 100ms in my tests).

It usually is not an issue, but it can take some time if a lot of memory
is added to a guest with little starting memory:
Adding 256G to a 2GB guest, for example, will require 8 HPT resizes.

Perform an HPT resize before memory hotplug, updating HPT to its
final size (considering a successful hotplug), taking the number of
HPT resizes to at most one per memory hotplug action.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/book3s/64/hash.h   |  2 ++
 arch/powerpc/include/asm/sparsemem.h|  2 ++
 arch/powerpc/mm/book3s64/hash_utils.c   | 14 ++
 arch/powerpc/mm/book3s64/pgtable.c  |  6 ++
 arch/powerpc/platforms/pseries/hotplug-memory.c |  6 ++
 5 files changed, 30 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index d959b0195ad9..843b0a178590 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -255,6 +255,8 @@ int hash__create_section_mapping(unsigned long start, 
unsigned long end,
 int nid, pgprot_t prot);
 int hash__remove_section_mapping(unsigned long start, unsigned long end);
 
+void hash_memory_batch_expand_prepare(unsigned long newsize);
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_HASH_H */
diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index d072866842e4..16b5f5300c84 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -17,6 +17,8 @@ extern int remove_section_mapping(unsigned long start, 
unsigned long end);
 extern int memory_add_physaddr_to_nid(u64 start);
 #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
 
+void memory_batch_expand_prepare(unsigned long newsize);
+
 #ifdef CONFIG_NUMA
 extern int hot_add_scn_to_nid(unsigned long scn_addr);
 #else
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index cfb3ec164f56..1f6aa0bf27e7 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -858,6 +858,20 @@ int hash__remove_section_mapping(unsigned long start, 
unsigned long end)
 
return rc;
 }
+
+void hash_memory_batch_expand_prepare(unsigned long newsize)
+{
+   /*
+* Resizing-up HPT should never fail, but there are some cases system 
starts with higher
+* SHIFT than required, and we go through the funny case of resizing 
HPT down while
+* adding memory
+*/
+
+   while (resize_hpt_for_hotplug(newsize, false) == -ENOSPC) {
+   newsize *= 2;
+   pr_warn("Hash collision while resizing HPT\n");
+   }
+}
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static void __init hash_init_partition_table(phys_addr_t hash_table,
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 5b3a3bae21aa..f1cd8af0f67f 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -193,6 +193,12 @@ int __meminit remove_section_mapping(unsigned long start, 
unsigned long end)
 
return hash__remove_section_mapping(start, end);
 }
+
+void memory_batch_expand_prepare(unsigned long newsize)
+{
+   if (!radix_enabled())
+   hash_memory_batch_expand_prepare(newsize);
+}
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 void __init mmu_partition_table_init(void)
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 8377f1f7c78e..353c71249214 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -671,6 +671,8 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
if (lmbs_available < lmbs_to_add)
return -EINVAL;
 
+   memory_batch_expand_prepare(memblock_phys_mem_size() + lmbs_to_add * 
drmem_lmb_size());
+
for_each_drmem_lmb(lmb) {
if (lmb->flags & DRCONF_MEM_ASSIGNED)
continue;
@@ -734,6 +736,8 @@ static int dlpar_memory_add_by_index(u32 drc_index)
 
pr_info("Attempting to hot-add LMB, drc index %x\n", drc_index);
 
+   memory_batch_expand_prepare(memblock_phys_mem_size() +
+drmem_info->n_lmbs * drmem_lmb_size());
lmb_found = 0;
for_each_drmem_lmb(lmb) {
if (lmb->drc_index == drc_index) {
@@ -788,6 +792,8 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 
drc_index)
if (lmbs_available < lmbs_to_add)
return -EINVAL;
 
+   memory_batch_expand_prepare(memblock_phys_mem_size() + lmbs_to_add * 
drmem_lmb_size());
+

[PATCH 1/1] kernel/smp: Split call_single_queue into 3 queues

2021-01-27 Thread Leonardo Bras
Currently, during flush_smp_call_function_queue():
- All items are traversed once, for reversing the list order.
- The SYNC items are traversed twice.
- The ASYNC & IRQ_WORK items are traversed three times.
- The TTWU items are traversed four times.

Also, a lot of extra work is done to keep track of and remove the items
already processed in each step.

By using three queues, it's possible to avoid all this work, and
all items in the list are traversed only twice: once for reversing,
and once for processing.

In exchange, this requires 2 extra llist_del_all() in the beginning
of flush_smp_call_function_queue(), and some extra logic to decide
the correct queue to add the desired csd.

This is not supposed to cause any change in the order the items are
processed, but will change the order of printing (on cpu offlining)
to match the order the items will be processed.

(The above traversal counts ignore the cpu-offlining case, in
which all items would be traversed again, in both cases.)

Signed-off-by: Leonardo Bras 
---
 kernel/smp.c | 173 ++-
 1 file changed, 87 insertions(+), 86 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 1b6070bf97bb..67fb415873f9 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -37,7 +37,13 @@ struct call_function_data {
 
 static DEFINE_PER_CPU_ALIGNED(struct call_function_data, cfd_data);
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
+struct call_multi_queue {
+   struct llist_head sync;
+   struct llist_head async_n_irq_work;
+   struct llist_head ttwu;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_multi_queue, call_mq);
 
 static void flush_smp_call_function_queue(bool warn_cpu_offline);
 
@@ -93,8 +99,13 @@ void __init call_function_init(void)
 {
int i;
 
-   for_each_possible_cpu(i)
-   init_llist_head(&per_cpu(call_single_queue, i));
+   for_each_possible_cpu(i) {
+   struct call_multi_queue *cmq = &per_cpu(call_mq, i);
+
+   init_llist_head(&cmq->sync);
+   init_llist_head(&cmq->async_n_irq_work);
+   init_llist_head(&cmq->ttwu);
+   }
 
smpcfd_prepare_cpu(smp_processor_id());
 }
@@ -253,6 +264,31 @@ static __always_inline void csd_unlock(call_single_data_t 
*csd)
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
 
+static __always_inline bool smp_add_to_queue(int cpu, struct llist_node *node)
+{
+   struct llist_head *head;
+   call_single_data_t *csd = llist_entry(node, call_single_data_t, node.llist);
+   struct call_multi_queue *cmq = &per_cpu(call_mq, cpu);
+
+   switch (CSD_TYPE(csd)) {
+   case CSD_TYPE_SYNC:
+   head = &cmq->sync;
+   break;
+   case CSD_TYPE_ASYNC:
+   case CSD_TYPE_IRQ_WORK:
+   head = &cmq->async_n_irq_work;
+   break;
+   case CSD_TYPE_TTWU:
+   head = &cmq->ttwu;
+   break;
+   default:
+   pr_warn("IPI callback, unknown type %d blocked from %d\n", cpu, CSD_TYPE(csd));
+   return false;
+   }
+
+   return llist_add(node, head);
+}
+
 void __smp_call_single_queue(int cpu, struct llist_node *node)
 {
/*
@@ -266,7 +302,7 @@ void __smp_call_single_queue(int cpu, struct llist_node 
*node)
 * locking and barrier primitives. Generic code isn't really
 * equipped to do the right thing...
 */
-   if (llist_add(node, &per_cpu(call_single_queue, cpu)))
+   if (smp_add_to_queue(cpu, node))
send_call_function_single_ipi(cpu);
 }
 
@@ -327,124 +363,89 @@ void generic_smp_call_function_single_interrupt(void)
  * to ensure that all pending IPI callbacks are run before it goes completely
  * offline.
  *
- * Loop through the call_single_queue and run all the queued callbacks.
+ * Loop through the call_multi_queue->* and run all the queued callbacks.
  * Must be called with interrupts disabled.
  */
 static void flush_smp_call_function_queue(bool warn_cpu_offline)
 {
-   call_single_data_t *csd, *csd_next;
-   struct llist_node *entry, *prev;
-   struct llist_head *head;
+   call_single_data_t *csd;
+   struct llist_node *entry_sync, *entry_async_n_irq_work, *entry_ttwu;
+   struct call_multi_queue *cmq;
static bool warned;
 
lockdep_assert_irqs_disabled();
 
-   head = this_cpu_ptr(&call_single_queue);
-   entry = llist_del_all(head);
-   entry = llist_reverse_order(entry);
+   cmq = this_cpu_ptr(&call_mq);
+   entry_sync = llist_del_all(&cmq->sync);
+   entry_async_n_irq_work = llist_del_all(&cmq->async_n_irq_work);
+   entry_ttwu = llist_del_all(&cmq->ttwu);
+
+   entry_sync = llist_reverse_order(entry_sync);
+   entry_async_n_irq_work = llist_reverse_order(entry_async_n_irq_work);

Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-29 Thread Leonardo Bras
> Thats a good point -- Leonardo, is the
> "net.bridge.bridge-nf-call-ip6tables" sysctl on?

Running
# sudo sysctl -a
I can see:
net.bridge.bridge-nf-call-ip6tables = 1
 
So these packets are sent to host iptables for processing?


(Sorry for the delay, I did not receive the previous e-mails.
Please include me in to/cc.)




Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-29 Thread Leonardo Bras
On Thu, 2019-08-29 at 17:04 -0300, Leonardo Bras wrote:
> > Thats a good point -- Leonardo, is the
> > "net.bridge.bridge-nf-call-ip6tables" sysctl on?
> 
> Running
> # sudo sysctl -a
> I can see:
> net.bridge.bridge-nf-call-ip6tables = 1

Also, doing
# echo 0 >  /proc/sys/net/bridge/bridge-nf-call-ip6tables 
And then trying to boot the guest will not crash the host.

Which would make sense, since host iptables is not dealing with guest
IPv6 packets.

So, the real cause of this bug is the bridge making host ip6tables deal
with guest IPv6 packets?
If so, would it be ok if I write a patch testing ipv6_mod_enabled()
before passing guest IPv6 packets to host ip6tables?
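
Something like the sketch below is what I have in mind (a rough sketch
only; the exact placement inside the br_netfilter pre-routing hook is an
assumption on my part):

	/* Sketch: when the IPv6 module is disabled, never hand the frame
	 * to the ip6tables emulation; just accept it at the bridge level.
	 */
	if (skb->protocol == htons(ETH_P_IPV6) && !ipv6_mod_enabled())
		return NF_ACCEPT;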


Best regards,

>  
> So these packets are sent to host iptables for processing?
> 
> 
> (Sorry for the delay, I did not receive the previous e-mails.
> Please include me in to/cc.)




Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-30 Thread Leonardo Bras
On Thu, 2019-08-29 at 22:58 +0200, Florian Westphal wrote:
> In any case your patch looks ok to me.

Great! Please give your feedback on v3: 
http://patchwork.ozlabs.org/patch/1154040/

[...]
> 
> Even if we disable call-ip6tables in br_netfilter we will at least
> in addition need a patch for nft_fib_netdev.c.
> 
> From a "avoid calls to ipv6 stack when its disabled" standpoint,
> the safest fix is to disable call-ip6tables functionality if ipv6
> module is off *and* fix nft_fib_netdev.c to BREAK in ipv6 is off case.
> 
> I started to place a list of suspicous modules here, but that got out
> of hand quickly.
> 
> So, given I don't want to plaster ipv6_mod_enabled() everywhere, I
> would suggest this course of action:
> 
> 1. add a patch to BREAK in nft_fib_netdev.c for !ipv6_mod_enabled()
> 2. change net/bridge/br_netfilter_hooks.c, br_nf_pre_routing() to
>make sure ipv6_mod_enabled() is true before doing the ipv6 stack
>"emulation".
> 
> Makes sense?
IMHO sure.

Shortly, I will send a couple of patches proposing the above changes.
(Or my best understanding of them :) )

> 
> Thanks,
> Florian

Thank you,
Leonardo Bras




Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-30 Thread Leonardo Bras
On Thu, 2019-08-29 at 22:58 +0200, Florian Westphal wrote:
[...]
> 1. add a patch to BREAK in nft_fib_netdev.c for !ipv6_mod_enabled()
[...]

But is this still needed? I mean, in nft_fib_netdev_eval there are only
2 functions being called for the IPv6 protocol: nft_fib6_eval and
nft_fib6_eval_type. Both are already protected by this current patch.

Is your 1st suggestion about this patch, or do you think it's better to
move this change to nft_fib_netdev_eval?
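
To make the question concrete, this is roughly what I picture for the
netdev case (a sketch only; the exact spot inside nft_fib_netdev_eval()'s
protocol handling is an assumption):

	case ETH_P_IPV6:
		/* Sketch: break out before any fib6 helper is reached
		 * when IPv6 is disabled on boot.
		 */
		if (!ipv6_mod_enabled()) {
			regs->verdict.code = NFT_BREAK;
			return;
		}
		/* ... existing fib6 handling continues here ... */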

Best regards,
Leonardo Bras




Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-30 Thread Leonardo Bras
On Thu, 2019-08-29 at 22:58 +0200, Florian Westphal wrote:

> Ah, it was the latter.
> Making bridge netfilter not pass packets up with ipv6 off closes
> the problem for fib_ipv6 and inet, so only _netdev.c needs fixing.

Ok then, preparing a v4.
https://lkml.org/lkml/2019/8/30/843





Re: [PATCH 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-21 Thread Leonardo Bras
On Wed, 2019-08-21 at 11:58 +0200, Pablo Neira Ayuso wrote:
> On Tue, Aug 20, 2019 at 01:15:58PM -0300, Leonardo Bras wrote:
> > On Tue, 2019-08-20 at 07:36 +0200, Florian Westphal wrote:
> > > Wouldn't fib_netdev.c have the same problem?
> > Probably, but I haven't hit this issue yet.
> > 
> > > If so, might be better to place this test in both
> > > nft_fib6_eval_type and nft_fib6_eval.
> > 
> > I think that is possible, and not very hard to do.
> > 
> > But in my humble viewpoint, it looks like it's nft_fib_inet_eval() and
> > nft_fib_netdev_eval() have the responsibility to choose a valid
> > protocol or drop the package. 
> > I am not sure if it would be a good move to transfer this
> > responsibility to nft_fib6_eval_type() and nft_fib6_eval(), so I would
> > rather add the same test to nft_fib_netdev_eval().
> > 
> > Does it make sense?
> 
> Please, update common code to netdev and ip6 extensions as Florian
> suggests.
> 
> Thanks.

Ok then, I will send a v2 with that change.

Thanks,




[PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-21 Thread Leonardo Bras
If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
dealing with a IPv6 package, it causes a kernel panic in
fib6_node_lookup_1(), crashing in bad_page_fault.

The panic is caused by trying to dereference a very low address (0x38
in ppc64le), due to ipv6.fib6_main_tbl = NULL.
BUG: Kernel NULL pointer dereference at 0x0038

Fix this behavior by dropping IPv6 packages if !ipv6_mod_enabled().

Signed-off-by: Leonardo Bras 
---
Changes from v1:
- Move drop logic from nft_fib_inet_eval() to nft_fib6_eval{,_type}
so it can affect other usages of these functions. 

 net/ipv6/netfilter/nft_fib_ipv6.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv6/netfilter/nft_fib_ipv6.c 
b/net/ipv6/netfilter/nft_fib_ipv6.c
index 7ece86afd079..75acc417e2ff 100644
--- a/net/ipv6/netfilter/nft_fib_ipv6.c
+++ b/net/ipv6/netfilter/nft_fib_ipv6.c
@@ -125,6 +125,11 @@ void nft_fib6_eval_type(const struct nft_expr *expr, 
struct nft_regs *regs,
u32 *dest = ®s->data[priv->dreg];
struct ipv6hdr *iph, _iph;
 
+   if (!ipv6_mod_enabled()) {
+   regs->verdict.code = NF_DROP;
+   return;
+   }
+
iph = skb_header_pointer(pkt->skb, noff, sizeof(_iph), &_iph);
if (!iph) {
regs->verdict.code = NFT_BREAK;
@@ -150,6 +155,11 @@ void nft_fib6_eval(const struct nft_expr *expr, struct 
nft_regs *regs,
struct rt6_info *rt;
int lookup_flags;
 
+   if (!ipv6_mod_enabled()) {
+   regs->verdict.code = NF_DROP;
+   return;
+   }
+
if (priv->flags & NFTA_FIB_F_IIF)
oif = nft_in(pkt);
else if (priv->flags & NFTA_FIB_F_OIF)
-- 
2.20.1



Re: [PATCH 1/1] fs/splice.c: Fix old documentation about moving pages

2019-08-16 Thread Leonardo Bras
On Thu, 2019-08-08 at 15:19 -0300, Leonardo Bras wrote:
> On Thu, 2019-08-01 at 19:38 -0300, Leonardo Bras wrote:
> > Since commit 485ddb4b9741 ("1/2 splice: dont steal")' (2007),
> > the SPLICE_F_MOVE support was removed (became a no-op according
> > to man pages), and thus disabling steal operation that would make
> > moving pages possible.
> > 
> > This fixes the comment, making clear pages are not moved.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  fs/splice.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/fs/splice.c b/fs/splice.c
> > index 14cb602d9a2f..0ba151c40cef 100644
> > --- a/fs/splice.c
> > +++ b/fs/splice.c
> > @@ -671,8 +671,7 @@ ssize_t splice_from_pipe(struct pipe_inode_info *pipe, 
> > struct file *out,
> >   * @flags: splice modifier flags
> >   *
> >   * Description:
> > - *Will either move or copy pages (determined by @flags options) from
> > - *the given pipe inode to the given file.
> > + *Will copy pages from the given pipe inode to the given file.
> >   *This one is ->write_iter-based.
> >   *
> >   */
> 
> Could you give any feedback on this patch?
Please provide feedback on this patch





Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-27 Thread Leonardo Bras
On Tue, 2019-08-27 at 12:35 +0200, Pablo Neira Ayuso wrote:
> On Wed, Aug 21, 2019 at 11:15:06AM -0300, Leonardo Bras wrote:
> > If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
> > dealing with a IPv6 package, it causes a kernel panic in
> > fib6_node_lookup_1(), crashing in bad_page_fault.
> 
> Q: How do you get to see IPv6 packets if IPv6 module is disable?
I could reproduce this bug on a host ('ipv6.disable=1') starting a
guest with a virtio-net interface with 'filterref' over a virtual
bridge. It crashes the host during guest boot (just before login).

From that I understand that guest IPv6 network traffic (via virtio-net)
may cause this kernel panic.

> 
> > The panic is caused by trying to deference a very low address (0x38
> > in ppc64le), due to ipv6.fib6_main_tbl = NULL.
> > BUG: Kernel NULL pointer dereference at 0x0038
> > 
> > Fix this behavior by dropping IPv6 packages if !ipv6_mod_enabled().
> 
> I'd suggest: s/package/packet/
Sure, I will make sure to put it on v3.
(Sorry, I am not very used to the net subsystem.)
> 
> [...]
> > diff --git a/net/ipv6/netfilter/nft_fib_ipv6.c 
> > b/net/ipv6/netfilter/nft_fib_ipv6.c
> > index 7ece86afd079..75acc417e2ff 100644
> > --- a/net/ipv6/netfilter/nft_fib_ipv6.c
> > +++ b/net/ipv6/netfilter/nft_fib_ipv6.c
> > @@ -125,6 +125,11 @@ void nft_fib6_eval_type(const struct nft_expr *expr, 
> > struct nft_regs *regs,
> > u32 *dest = ®s->data[priv->dreg];
> > struct ipv6hdr *iph, _iph;
> >  
> > +   if (!ipv6_mod_enabled()) {
> > +   regs->verdict.code = NF_DROP;
> 
> NFT_BREAK instead to stop evaluating this rule, this results in a
> mismatch, so you let the user decide what to do with packets that do
> not match your policy.
Ok, I will replace for v3.

> 
> The drop case at the bottom of the fib eval function never actually
> never happens.
Which one do you mean?





Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-27 Thread Leonardo Bras
On Tue, 2019-08-27 at 20:51 +0200, Pablo Neira Ayuso wrote:
> > > The drop case at the bottom of the fib eval function never actually
> > > never happens.
> > 
> > Which one do you mean?
> 
> Line 31 of net/netfilter/nft_fib_inet.c.
Oh, yeah, I was thinking about that when I wrote the patch.
Thanks for explaining :)

I will send the v3 in a few minutes.

Best regards,




[PATCH v3 1/1] netfilter: nf_tables: fib: Drop IPV6 packets if IPv6 is disabled on boot

2019-08-27 Thread Leonardo Bras
If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
dealing with a IPv6 packet, it causes a kernel panic in
fib6_node_lookup_1(), crashing in bad_page_fault.

The panic is caused by trying to dereference a very low address (0x38
in ppc64le), due to ipv6.fib6_main_tbl = NULL.
BUG: Kernel NULL pointer dereference at 0x0038

Fix this behavior by dropping IPv6 packets if !ipv6_mod_enabled().

Signed-off-by: Leonardo Bras 
---
Changes from v2:
- Change verdict.code from NF_DROP to NFT_BREAK
- Updated commit message (s/package/packet)

Changes from v1:
- Move drop logic from nft_fib_inet_eval() to nft_fib6_eval{,_type}
so it can affect other usages of these functions.

 net/ipv6/netfilter/nft_fib_ipv6.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv6/netfilter/nft_fib_ipv6.c 
b/net/ipv6/netfilter/nft_fib_ipv6.c
index 7ece86afd079..8496e43b73bd 100644
--- a/net/ipv6/netfilter/nft_fib_ipv6.c
+++ b/net/ipv6/netfilter/nft_fib_ipv6.c
@@ -125,6 +125,11 @@ void nft_fib6_eval_type(const struct nft_expr *expr, 
struct nft_regs *regs,
u32 *dest = ®s->data[priv->dreg];
struct ipv6hdr *iph, _iph;
 
+   if (!ipv6_mod_enabled()) {
+   regs->verdict.code = NFT_BREAK;
+   return;
+   }
+
iph = skb_header_pointer(pkt->skb, noff, sizeof(_iph), &_iph);
if (!iph) {
regs->verdict.code = NFT_BREAK;
@@ -150,6 +155,11 @@ void nft_fib6_eval(const struct nft_expr *expr, struct 
nft_regs *regs,
struct rt6_info *rt;
int lookup_flags;
 
+   if (!ipv6_mod_enabled()) {
+   regs->verdict.code = NFT_BREAK;
+   return;
+   }
+
if (priv->flags & NFTA_FIB_F_IIF)
oif = nft_in(pkt);
else if (priv->flags & NFTA_FIB_F_OIF)
-- 
2.20.1



Re: [PATCH 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot

2019-08-26 Thread Leonardo Bras
Hello Pablo, Florian,

I implemented a V2 of this patch with the changes you proposed.
Could you please give your feedback on that patch?
https://lkml.org/lkml/2019/8/21/527

Thanks!

On Wed, 2019-08-21 at 11:58 +0200, Pablo Neira Ayuso wrote:
> On Tue, Aug 20, 2019 at 01:15:58PM -0300, Leonardo Bras wrote:
> > On Tue, 2019-08-20 at 07:36 +0200, Florian Westphal wrote:
> > > Wouldn't fib_netdev.c have the same problem?
> > Probably, but I haven't hit this issue yet.
> > 
> > > If so, might be better to place this test in both
> > > nft_fib6_eval_type and nft_fib6_eval.
> > 
> > I think that is possible, and not very hard to do.
> > 
> > But in my humble viewpoint, it looks like it's nft_fib_inet_eval() and
> > nft_fib_netdev_eval() have the responsibility to choose a valid
> > protocol or drop the package. 
> > I am not sure if it would be a good move to transfer this
> > responsibility to nft_fib6_eval_type() and nft_fib6_eval(), so I would
> > rather add the same test to nft_fib_netdev_eval().
> > 
> > Does it make sense?
> 
> Please, update common code to netdev and ip6 extensions as Florian
> suggests.
> 
> Thanks.




[PATCH 1/1] powerpc: mm: Check if serialize_against_pte_lookup() really needs to run

2019-09-16 Thread Leonardo Bras
If a process (qemu) with a lot of CPUs (128) tries to munmap() a large chunk
of memory (496GB) mapped with THP, it takes an average of 275 seconds,
which can cause a lot of problems for the load (in the qemu case, the guest
will lock up for this time).

Trying to find the source of this bug, I found out most of this time is
spent on serialize_against_pte_lookup(). This function will take a lot of
time in smp_call_function_many() if there are more than a couple of CPUs
running the user process. Since it has to happen for every mapped THP, it will
take a very long time for large amounts of memory.

By the docs, serialize_against_pte_lookup() is needed in order to avoid the
pmd_t to pte_t casting inside find_current_mm_pte() happening concurrently
with the next part of the functions it's called in.
It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[].

So, from what I could understand, if there is no find_current_mm_pte()
running, there is no need to call serialize_against_pte_lookup().

So, to avoid the cost of running serialize_against_pte_lookup(), I propose
a counter that keeps track of how many find_current_mm_pte() calls are
currently running, and if there are none, just skip smp_call_function_many().

On my workload (qemu), I could see munmap's time reduction from 275 seconds
to 418ms.

Signed-off-by: Leonardo Bras 

---
I need more experienced people's help in order to understand if this is
really a valid improvement, and if mm_struct is the best place to put such
a counter.

Thanks!
---
 arch/powerpc/include/asm/pte-walk.h | 3 +++
 arch/powerpc/mm/book3s64/pgtable.c  | 2 ++
 include/linux/mm_types.h| 1 +
 3 files changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/pte-walk.h 
b/arch/powerpc/include/asm/pte-walk.h
index 33fa5dd8ee6a..3b82cb3bd563 100644
--- a/arch/powerpc/include/asm/pte-walk.h
+++ b/arch/powerpc/include/asm/pte-walk.h
@@ -40,6 +40,8 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, 
unsigned long ea,
 {
pte_t *pte;
 
+   atomic64_inc(¤t->mm->lockless_ptlookup_count);
+
VM_WARN(!arch_irqs_disabled(), "%s called with irq enabled\n", 
__func__);
VM_WARN(pgdir != current->mm->pgd,
"%s lock less page table lookup called on wrong mm\n", 
__func__);
@@ -53,6 +55,7 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, 
unsigned long ea,
if (hshift)
WARN_ON(*hshift);
 #endif
+   atomic64_dec(¤t->mm->lockless_ptlookup_count);
return pte;
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..8f6fc2f80071 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,6 +95,8 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
smp_mb();
+   if (atomic64_read(&mm->lockless_ptlookup_count) == 0)
+   return;
smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..97fb2545e967 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -518,6 +518,7 @@ struct mm_struct {
 #endif
} __randomize_layout;
 
+   atomic64_t lockless_ptlookup_count;
/*
 * The mm_cpumask needs to be at the end of mm_struct, because it
 * is dynamically sized based on nr_cpu_ids.
-- 
2.20.1



Re: [PATCH 1/1] powerpc: mm: Check if serialize_against_pte_lookup() really needs to run

2019-09-17 Thread Leonardo Bras
Hello Aneesh, thanks for the feedback!

On Tue, 2019-09-17 at 08:26 +0530, Aneesh Kumar K.V wrote:
> We could possibly avoid that serialize for a full task exit unmap. ie, 
> when tlb->fullmm == 1 . But that won't help the Qemu case because it 
> does an umap of the guest ram range for which tlb->fullmm != 1.
Yes, in qemu we can hot-add memory to guests using memory allocated by
qemu using mmap(). If we want/need to free this memory to other
processes (or guests), we remove it from this guest and do a common
munmap(). This happens without exiting qemu, or pausing the guest.


> That is not correct. We need to keep the count updated  across the 
> local_irq_disable()/local_irq_enable(). 
So, by that you mean incrementing the counter before
local_irq_{disable,save} and decrementing only after
local_irq_{enable,restore}?

> That is we mostly should have a variant like 
> start_lockless_pgtbl_walk()/end_lockless_pgtbl_walk(). 
And with that, you mean replacing the current local_irq_{disable,save}
with a start_lockless_pgtbl_walk() (that contains increment + irq
disable)? (and doing similarly with local_irq_{enable,restore} and
end_lockless_pgtbl_walk()) 
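
If that is the idea, I picture something like the sketch below (just to
check my understanding; the counter name and its final location are
placeholders for now):

	static inline unsigned long start_lockless_pgtbl_walk(struct mm_struct *mm)
	{
		unsigned long irq_mask;

		/* Make the walker visible before the walk itself starts. */
		atomic_inc(&mm->context.lockless_pgtbl_walk_count);
		smp_mb__after_atomic();
		local_irq_save(irq_mask);

		return irq_mask;
	}

	static inline void end_lockless_pgtbl_walk(struct mm_struct *mm,
						   unsigned long irq_mask)
	{
		local_irq_restore(irq_mask);

		/* Only drop the count after the last use of the returned ptep. */
		smp_mb__before_atomic();
		atomic_dec(&mm->context.lockless_pgtbl_walk_count);
	}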

> Also there is hash_page_mm which we need to make sure we are protected 
> against w.r.t collapse pmd.
I did not get what I need to do here. 
Is it a new thing to protect? 
If so, I need to better understand how to protect it.
If not, I would like to understand how it's protected by the current
behavior.

> Move that to ppc64 specific mm_context_t.
Ok, fixed! I will send that on v2.

Best regards,
Leonardo Bras




Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks

2019-10-03 Thread Leonardo Bras
Hello Peter, thanks for the feedback!

On Thu, 2019-10-03 at 09:29 +0200, Peter Zijlstra wrote:
> On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote:
> > If a process (qemu) with a lot of CPUs (128) try to munmap() a large
> > chunk of memory (496GB) mapped with THP, it takes an average of 275
> > seconds, which can cause a lot of problems to the load (in qemu case,
> > the guest will lock for this time).
> > 
> > Trying to find the source of this bug, I found out most of this time is
> > spent on serialize_against_pte_lookup(). This function will take a lot
> > of time in smp_call_function_many() if there is more than a couple CPUs
> > running the user process. Since it has to happen to all THP mapped, it
> > will take a very long time for large amounts of memory.
> > 
> > By the docs, serialize_against_pte_lookup() is needed in order to avoid
> > pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
> > pagetable walk, to happen concurrently with THP splitting/collapsing.
> > 
> > It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[],
> > after interrupts are re-enabled.
> > Since, interrupts are (usually) disabled during lockless pagetable
> > walk, and serialize_against_pte_lookup will only return after
> > interrupts are enabled, it is protected.
> 
> This is something entirely specific to Power, you shouldn't be touching
> generic code at all.

Up to v4, I was declaring dummy functions so it would not mess with
other archs: http://patchwork.ozlabs.org/patch/1168779/

But I was recommended to create a generic function that could guide the
way to archs: http://patchwork.ozlabs.org/patch/1168775/

The idea was to concentrate all routines of beginning/ending lockless
pagetable walks on these functions, and call them instead of
irq_disable/irq_enable.
Then it was easy to place the refcount-based tracking in these
functions. It should only be enabled in case the config chooses to do
so. 

> 
> Also, I'm not sure I understand things properly.
> 
> So serialize_against_pte_lookup() wants to wait for all currently
> out-standing __find_linux_pte() instances (which are very similar to
> gup_fast).
> 
> It seems to want to do this before flushing the THP TLB for some reason;
> why? Should not THP observe the normal page table freeing rules which
> includes a RCU-like grace period like this already.
> 
> Why is THP special here? This doesn't seem adequately explained.

"It's necessary to monitor lockless pagetable walks, in order to avoid
doing THP splitting/collapsing during them."

If there is a THP split/collapse during the lockless pagetable walk,
the returned ptep can be pointing to an invalid pte.

To avoid that, the pmd is updated, then serialize_against_pte_lookup is
run. Serialize runs a do_nothing() on every cpu in cpu_mask.

So, after all cpus finish running do_nothing(), there is a guarantee
that if there is any 'lockless pagetable walk' it is running on top of
an updated version of this pmd, and so, collapsing/splitting THP is
safe.
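
And that is exactly where the counting helps: when the counter reads zero
there is nobody to wait for, so the IPI fan-out can be skipped entirely.
Roughly (a sketch of what this series does):

	void serialize_against_pte_lookup(struct mm_struct *mm)
	{
		smp_mb();

		/* No lockless walker in flight: nothing can still hold a
		 * stale ptep, so the expensive IPI round-trip is unneeded.
		 */
		if (!running_lockless_pgtbl_walk(mm))
			return;

		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
	}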

> 
> Also, specifically to munmap(), this seems entirely superfluous,
> munmap() uses the normal page-table freeing code and should be entirely
> fine without additional waiting.

To be honest, I remember it being needed in the munmap case, but I really
don't remember the details. I will take a deeper look and come back
with this answer.

> Furthermore, Power never accurately tracks mm_cpumask(), so using that
> makes the whole thing more expensive than it needs to be. Also, I
> suppose that is buggered vs file backed THP.

The accuracy of mm_cpumask is above my knowledge right now. =)

I agree that it's too expensive to do that. That's why I suggested this
method, which can check if there is any 'lockless pagetable walk'
running before trying to serialize. It reduced the waiting time a lot
for large amounts of memory (more details in the cover letter).

Best regards,

Leonardo Brás




Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks

2019-10-03 Thread Leonardo Bras
On Thu, 2019-10-03 at 09:44 +0200, Peter Zijlstra wrote:
> This shouldn't be a user visible option at all. Either the arch needs
> it and selects it or not.

You are right. I will do that on v6.
Thanks for the feedback!




Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks

2019-10-03 Thread Leonardo Bras
Hello Peter, thanks for the feedback!

On Thu, 2019-10-03 at 13:51 +0200, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> > > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > > index 818691846c90..3043ea9812d5 100644
> > > --- a/include/asm-generic/pgtable.h
> > > +++ b/include/asm-generic/pgtable.h
> > > @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
> > >  #endif
> > >  #endif
> > >  
> > > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > > +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct 
> > > *mm)
> > > +{
> > > + unsigned long irq_mask;
> > > +
> > > + if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > > + atomic_inc(&mm->lockless_pgtbl_walkers);
> > 
> > This will not work for file backed THP. Also, this is a fairly serious
> > contention point all on its own.
> 
> Kiryl says we have tmpfs-thp, this would be broken vs that, as would
> your (PowerPC) use of mm_cpumask() for that IPI.

Could you please explain it?
I mean, why does this break tmpfs-thp?

Also, why is mm_cpumask() broken as well?

> 
> > > + /*
> > > +  * Interrupts must be disabled during the lockless page table walk.
> > > +  * That's because the deleting or splitting involves flushing TLBs,
> > > +  * which in turn issues interrupts, that will block when disabled.
> > > +  */
> > > + local_irq_save(irq_mask);
> > > +
> > > + /*
> > > +  * This memory barrier pairs with any code that is either trying to
> > > +  * delete page tables, or split huge pages. Without this barrier,
> > > +  * the page tables could be read speculatively outside of interrupt
> > > +  * disabling.
> > > +  */
> > > + smp_mb();
> > 
> > I don't think this is something smp_mb() can guarantee. smp_mb() is
> > defined to order memory accesses, in this case the store of the old
> > flags vs whatever comes after this.
> > 
> > It cannot (in generic) order against completion of prior instructions,
> > like clearing the interrupt enabled flags.
> > 
> > Possibly you want barrier_nospec().
> 
> I'm still really confused about this barrier. It just doesn't make
> sense.
> 
> If an interrupt happens before the local_irq_disable()/save(), then it
> will discard any and all speculation that would be in progress to handle
> the exception.
> 
> If there isn't an interrupt (or it happens after disable) it is
> irrelevant.
> 
> Specifically, that serialize-IPI thing wants to ensure in-progress
> lookups are complete, and I can't find a scenario where
> local_irq_disable/enable() needs additional help vs IPIs. The moment an
> interrupt lands it kills speculation and forces things into
> program-order.
> 
> Did you perhaps want something like:
> 
>   if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
>   atomic_inc(&foo);
>   smp_mb__after_atomic();
>   }
> 
>   ...
> 
>   if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
>   smp_mb__before_atomic();
>   atomic_dec(&foo);
>   }
> 
> To ensure everything happens inside of the increment?
> 

I need to rethink this barrier, but yes. I think that's it.
It's how it was in v4.

I changed it because I thought it would be better this way. Well,
it was probably a mistake on my part.

> And I still think all that wrong, you really shouldn't need to wait on
> munmap().

That is something I need to better understand. I mean, before coming
up with this patch, I thought exactly this: don't serialize on munmap.

But along the way I was convinced it would not work for munmap. I need to
recall why, and if it was false to assume this, re-think the whole
solution.

Best regards,

Leonardo Brás




Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks

2019-10-03 Thread Leonardo Bras
On Thu, 2019-10-03 at 13:49 -0700, John Hubbard wrote:
> Yes. And to clarify, I was assuming that the changes to mm/gup.c were 
> required in order to accomplish your goals. Given that assumption, I 
> wanted the generic code to be "proper", and that's what that feedback
> is about.

You assumed right. In my counting approach it's necessary to count all
'lockless pagetable walks', including the ones in generic code.

And, I think even without the counting approach, it was a good way to
focus this 'lockless pagetable walk' routine in a single place.

> 
> Although the other questions about file-backed THP
> make it sound like some rethinking across the board is required now.

Yeap, I need to better understand how the file-backed THP problem
works.

Thanks,

Leonardo Brás




Re: [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks

2019-09-27 Thread Leonardo Bras
On Fri, 2019-09-27 at 11:46 -0300, Leonardo Bras wrote:
> I am not sure if it would be ok to use irq_{save,restore} in real mode,
> I will do some more reading of the docs before addressing this. 

It looks like it's unsafe to merge irq_{save,restore} in
{start,end}_lockless_pgtbl_walk(), due to a possible access of code
that is not accessible in real mode.

I am sending a v4 for the changes so far.
I will look forward for your feedback.




[PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks

2019-09-27 Thread Leonardo Bras
It's necessary to monitor lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where a lot of cpus are used by the process.

In order to speed up some cases, I propose a refcount-based approach that
counts the number of lockless pagetable walks happening on the process.

This method does not exclude the current irq-oriented method. It works as a
complement to skip unnecessary waiting.

start_lockless_pgtbl_walk(mm)
Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
Insert after the end of any lockless pgtable walk
(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
Returns the number of lockless pgtable walks running

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/mm/book3s64/mmu_context.c   |  1 +
 arch/powerpc/mm/book3s64/pgtable.c   | 45 
 3 files changed, 49 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 23b83d3593e2..13b006e7dde4 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -116,6 +116,9 @@ typedef struct {
/* Number of users of the external (Nest) MMU */
atomic_t copros;
 
+   /* Number of running instances of lockless pagetable walk*/
+   atomic_t lockless_pgtbl_walk_count;
+
struct hash_mm_context *hash_context;
 
unsigned long vdso_base;
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
b/arch/powerpc/mm/book3s64/mmu_context.c
index 2d0cb5ba9a47..3dd01c0ca5be 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -200,6 +200,7 @@ int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm)
 #endif
atomic_set(&mm->context.active_cpus, 0);
atomic_set(&mm->context.copros, 0);
+   atomic_set(&mm->context.lockless_pgtbl_walk_count, 0);
 
return 0;
 }
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..6ba6195bff1b 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,51 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* start_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte()
+ */
+void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+   atomic_inc(&mm->context.lockless_pgtbl_walk_count);
+   /* Avoid reordering to guarantee that the increment will happen before any
+* part of the lockless pagetable walk after it.
+*/
+   smp_mb();
+}
+EXPORT_SYMBOL(start_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+*/
+void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+   /* Avoid reordering to guarantee that it will only decrement after the last
+* use of the returned ptep from the lockless pagetable walk.
+*/
+   smp_mb();
+   atomic_dec(&mm->context.lockless_pgtbl_walk_count);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ *   currently running. If it returns 0, there is no running pagetable walk, and
+ *   THP split/collapse can be safely done. This can be used to avoid more
+ *   expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+   return atomic_read(&mm->context.lockless_pgtbl_walk_count);
+}
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
-- 
2.20.1



[PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks

2019-09-27 Thread Leonardo Bras
If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems for the load (in the qemu case,
the guest will lock up for this time).

Trying to find the source of this bug, I found out most of this time is
spent on serialize_against_pte_lookup(). This function will take a lot
of time in smp_call_function_many() if there are more than a couple of
CPUs running the user process. Since it has to happen for every mapped
THP, it will take a very long time for large amounts of memory.

By the docs, serialize_against_pte_lookup() is needed in order to avoid
pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
pagetable walk, to happen concurrently with THP splitting/collapsing.

It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[],
after interrupts are re-enabled.
Since, interrupts are (usually) disabled during lockless pagetable
walk, and serialize_against_pte_lookup will only return after
interrupts are enabled, it is protected.

So, from what I could understand, if there is no lockless pagetable walk
running, there is no need to call serialize_against_pte_lookup().

So, to avoid the cost of running serialize_against_pte_lookup(), I
propose a counter that keeps track of how many find_current_mm_pte()
calls are currently running, and if there are none, just skip
smp_call_function_many().

The related functions are:
start_lockless_pgtbl_walk(mm)
Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
Insert after the end of any lockless pgtable walk
(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
Returns the number of lockless pgtable walks running
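
A typical call site then looks like this (a sketch; variable names are
placeholders and the walk itself is unchanged, only the bracketing is new):

	start_lockless_pgtbl_walk(mm);
	local_irq_save(flags);

	ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
	if (ptep)
		pte = READ_ONCE(*ptep);	/* use the pte while IRQs are off */

	local_irq_restore(flags);
	end_lockless_pgtbl_walk(mm);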

On my workload (qemu), I could see munmap's time reduction from 275
seconds to 418ms.

Also, I documented some lockless pagetable walks in which it's not
necessary to keep track, given they work on init_mm or guest pgd.

Changes since v3:
 Adds memory barrier to {start,end}_lockless_pgtbl_walk()
 Explain (comments) why some lockless pgtbl walks don't need
local_irq_disable (real mode + MSR_EE=0)
 Explain (comments) places where counting method is not needed (guest pgd,
which is not touched by THP)
 Fixes some misplaced local_irq_restore()
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=132417

Changes since v2:
 Rebased to v5.3
 Adds support on __get_user_pages_fast
 Adds usage decription to *_lockless_pgtbl_walk()
 Better style to dummy functions
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839

Changes since v1:
 Isolated atomic operations in functions *_lockless_pgtbl_walk()
 Fixed behavior of decrementing before last ptep was used
 Link: http://patchwork.ozlabs.org/patch/1163093/


Leonardo Bras (11):
  powerpc/mm: Adds counting method to monitor lockless pgtable walks
  asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
walks
  mm/gup: Applies counting method to monitor gup_pgd_range
  powerpc/mce_power: Applies counting method to monitor lockless pgtbl
walks
  powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
pgtbl walks
  powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
walks
  powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
pgtbl walks
  powerpc/kvm/book3s_64: Applies counting method to monitor lockless
pgtbl walks
  powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
walk
  powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  5 ++
 arch/powerpc/kernel/mce_power.c  | 13 --
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 30 
 arch/powerpc/kvm/book3s_64_vio_hv.c  |  3 ++
 arch/powerpc/kvm/book3s_hv_nested.c  | 22 -
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  | 18 ++--
 arch/powerpc/kvm/e500_mmu_host.c |  6 ++-
 arch/powerpc/mm/book3s64/hash_tlb.c  |  2 +
 arch/powerpc/mm/book3s64/hash_utils.c| 12 -
 arch/powerpc/mm/book3s64/mmu_context.c   |  1 +
 arch/powerpc/mm/book3s64/pgtable.c   | 48 +++-
 arch/powerpc/perf/callchain.c|  5 +-
 include/asm-generic/pgtable.h| 15 ++
 mm/gup.c |  8 
 16 files changed, 180 insertions(+), 13 deletions(-)

-- 
2.20.1



[PATCH v4 09/11] powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks

2019-09-27 Thread Leonardo Bras
Applies the counting-based method for monitoring all book3s_64-related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection due to guest pgd not being a target of THP collapse/split, or
due to being called from Realmode + MSR_EE = 0.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  2 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 30 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c|  3 +++
 3 files changed, 35 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 9a75f0e1933b..fcd3dad1297f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -620,6 +620,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 * We need to protect against page table destruction
 * hugepage split and collapse.
 */
+   start_lockless_pgtbl_walk(kvm->mm);
local_irq_save(flags);
ptep = find_current_mm_pte(current->mm->pgd,
   hva, NULL, NULL);
@@ -629,6 +630,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
write_ok = 1;
}
local_irq_restore(flags);
+   end_lockless_pgtbl_walk(kvm->mm);
}
}
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..9b374b9838fa 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -813,6 +813,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 * Read the PTE from the process' radix tree and use that
 * so we get the shift and attribute bits.
 */
+   start_lockless_pgtbl_walk(kvm->mm);
local_irq_disable();
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
/*
@@ -821,12 +822,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 */
if (!ptep) {
local_irq_enable();
+   end_lockless_pgtbl_walk(kvm->mm);
if (page)
put_page(page);
return RESUME_GUEST;
}
pte = *ptep;
local_irq_enable();
+   end_lockless_pgtbl_walk(kvm->mm);
 
/* If we're logging dirty pages, always map single pages */
large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
@@ -972,10 +975,16 @@ int kvm_unmap_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
 
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
 kvm->arch.lpid);
+
return 0;   
 }
 
@@ -989,6 +998,11 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot 
*memslot,
int ref = 0;
unsigned long old, *rmapp;
 
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1027,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned int shift;
int ref = 0;
 
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
@@ -1030,6 +1049,11 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
int ret = 0;
unsigned long old, *rmapp;
 
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.

[PATCH v4 02/11] asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable walks

2019-09-27 Thread Leonardo Bras
There is a need to monitor lockless pagetable walks, in order to avoid
doing THP splitting/collapsing during them.

Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where a lot of cpus are used by the process.

In order to speed up these cases, I propose a refcount-based approach that
counts the number of lockless pagetable walks happening on the process.

Given that there are lockless pagetable walks on generic code, it's
necessary to create dummy functions for archs that won't use the approach.

Signed-off-by: Leonardo Bras 
---
 include/asm-generic/pgtable.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..0831475e72d3 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1172,6 +1172,21 @@ static inline bool arch_has_pfn_modify_check(void)
 #endif
 #endif
 
+#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+   return 0;
+}
+#endif
+
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
-- 
2.20.1



[PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range

2019-09-27 Thread Leonardo Bras
As described, gup_pgd_range is a lockless pagetable walk. So, in order to
monitor against THP split/collapse with the counting method, it's necessary
to bound it with {start,end}_lockless_pgtbl_walk.

There are dummy functions, so it is not going to add any overhead on archs
that don't use this method.

Signed-off-by: Leonardo Bras 
---
 mm/gup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..7105c829cf44 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, 
unsigned long end)
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages)
 {
+   struct mm_struct *mm;
unsigned long len, end;
unsigned long flags;
int nr = 0;
@@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
 
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
+   mm = current->mm;
+   start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
local_irq_restore(flags);
+   end_lockless_pgtbl_walk(mm);
}
 
return nr;
@@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages)
 {
unsigned long addr, len, end;
+   struct mm_struct *mm;
int nr = 0, ret = 0;
 
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
@@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int 
nr_pages,
 
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
+   mm = current->mm;
+   start_lockless_pgtbl_walk(mm);
local_irq_disable();
gup_pgd_range(addr, end, gup_flags, pages, &nr);
local_irq_enable();
+   end_lockless_pgtbl_walk(mm);
ret = nr;
}
 
-- 
2.20.1


