[RESEND PATCH V8 04/11] KVM: Document Memory ROE

2019-01-21 Thread Ahmed Abd El Mawgood
The ROE version documented here is implemented in the next two patches.
Signed-off-by: Ahmed Abd El Mawgood 
---
 Documentation/virtual/kvm/hypercalls.txt | 40 
 1 file changed, 40 insertions(+)

diff --git a/Documentation/virtual/kvm/hypercalls.txt 
b/Documentation/virtual/kvm/hypercalls.txt
index da24c138c8..a31f316ce6 100644
--- a/Documentation/virtual/kvm/hypercalls.txt
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -141,3 +141,43 @@ a0 corresponds to the APIC ID in the third argument (a2), 
bit 1
 corresponds to the APIC ID a2+1, and so on.
 
 Returns the number of CPUs to which the IPIs were delivered successfully.
+
+7. KVM_HC_ROE
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to apply Read-Only Enforcement to guest memory and
+registers
+Usage 1:
+ a0: ROE_VERSION
+
+Returns an unsigned number that represents the current version of the ROE
+implementation.
+
+Usage 2:
+
+ a0: ROE_MPROTECT  (requires version >= 1)
+ a1: Start address aligned to page boundary.
+ a2: Number of pages to be protected.
+
+This configuration lets a guest kernel have part of its read/write memory
+converted into read-only.  This action is irreversible.
+Upon successful run, the number of pages protected is returned.
+
+Usage 3:
+ a0: ROE_MPROTECT_CHUNK (requires version >= 2)
+ a1: Start address aligned to page boundary.
+ a2: Number of bytes to be protected.
+This configuration lets a guest kernel have part of its read/write memory
+converted into read-only with byte granularity. ROE_MPROTECT_CHUNK is
+relatively slow compared to ROE_MPROTECT. This action is irreversible.
+Upon successful run, the number of bytes protected is returned.
+
+Error codes:
+   -KVM_ENOSYS: the hypercall was triggered from ring 3 or is not
+   implemented.
+   -EINVAL: error based on given parameters.
+
+Notes: KVM_HC_ROE cannot be triggered from guest ring 3 (user mode). The
+reason is that user-mode malicious software could use it to enforce read-only
+protection on an arbitrary memory page and thus crash the kernel.
-- 
2.19.2
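
As a quick guest-side illustration of the hypercall interface documented in the
patch above (a minimal sketch, assuming the standard kvm_hypercall helpers from
<linux/kvm_para.h>; the ROE constants match those introduced later in this
series):

```
#include <linux/kvm_para.h>
#include <linux/types.h>

#define KVM_HC_ROE		11
#define ROE_VERSION		0
#define ROE_MPROTECT		1
#define ROE_MPROTECT_CHUNK	2

/* Query the ROE version supported by the host. */
static long roe_version(void)
{
	return kvm_hypercall1(KVM_HC_ROE, ROE_VERSION);
}

/* Irreversibly make 'pg_count' pages starting at 'addr' read-only. */
static long roe_mprotect(void *addr, long pg_count)
{
	return kvm_hypercall3(KVM_HC_ROE, ROE_MPROTECT, (u64)addr, pg_count);
}

/* Same, with byte granularity (requires ROE version >= 2). */
static long roe_mprotect_chunk(void *addr, long size)
{
	return kvm_hypercall3(KVM_HC_ROE, ROE_MPROTECT_CHUNK, (u64)addr, size);
}
```

The cover letter of this series carries a fuller demo module built around the
same calls.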



Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel

2019-01-21 Thread hpa
On January 20, 2019 5:45:53 PM PST, Joel Fernandes  
wrote:
>On Sun, Jan 20, 2019 at 01:58:15PM -0800, h...@zytor.com wrote:
>> On January 20, 2019 8:10:03 AM PST, Joel Fernandes
> wrote:
>> >On Sat, Jan 19, 2019 at 11:01:13PM -0800, h...@zytor.com wrote:
>> >> On January 19, 2019 2:36:06 AM PST, Greg KH
>> > wrote:
>> >> >On Sat, Jan 19, 2019 at 02:28:00AM -0800, Christoph Hellwig
>wrote:
>> >> >> This seems like a pretty horrible idea and waste of kernel
>memory.
>> >> >
>> >> >It's only a waste if you want it to be a waste, i.e. if you load
>the
>> >> >kernel module.
>> >> >
>> >> >This really isn't any different from how /proc/config.gz works.
>> >> >
>> >> >> Just add support to kbuild to store a compressed archive in
>> >initramfs
>> >> >> and unpack it in the right place.
>> >> >
>> >> >I think the issue is that some devices do not use initramfs, or
>> >switch
>> >> >away from it after init happens or something like that.  Joel has
>> >all
>> >> >of
>> >> >the looney details that he can provide.
>> >> >
>> >> >thanks,
>> >> >
>> >> >greg k-h
>> >> 
>> >> Yeah, well... but it is kind of a losing game... the more
>in-kernel
>> >stuff there is the less likely are things to actually be supported.
>> >
>> >It is better than nothing, and if this makes things a bit easier and
>> >solves
>> >real-world issues people have been having, and is optional, then I
>> >don't see
>> >why not.
>> >
>> >> Modularizing it is in some ways even crazier in the sense that at
>> >that point you are relying on the filesystem containing the module,
>> >which has to be loaded into the kernel by a root user. One could
>even
>> >wonder if a better way to do this would be to have "make
>> >modules_install" park an archive file – or even a directory as
>opposed
>> >to a symlink – with this stuff in /lib/modules. We could even
>provide a
>> >tmpfs shim which autoloads such an archive via the firmware loader;
>> >this might even be generically useful, who knows.
>> >
>> >All this seems to assume where the modules are located. In Android,
>we
>> >don't
>> >have /lib/modules. This patch generically fits into the grand scheme of
>> >things
>> >and I think is just better made a part of the kernel since it is not
>> >that
>> >huge once compressed, as Dan also pointed out. The more complex, and the
>> >more
>> >assumptions we make, the less likely people writing tools will get
>it
>> >right
>> >and be able to easily use it.
>> >
>> >> 
>> >> Note also that initramfs contents can be built into the kernel.
>> >Extracting such content into a single-instance tmpfs would again be
>a
>> >possibility
>> >
>> >Such an approach would bloat the kernel image size though, which may
>> >not work
>> >for everyone. The module based approach, on the other hand, gives an
>> >option
>> >to the user to enable the feature, but not have it loaded into
>memory
>> >or used
>> >until it is really needed.
>> >
>> >thanks,
>> >
>> > - Joel
>> 
>> Well, where are the modules? They must exist in the filesystem.
>
>The scheme of loading a module doesn't depend on _where_ the module is
>on the
>filesystem. As long as a distro knows how to load a module in its own
>way (by
>looking into whichever paths it cares about), that's all that matters. 
>And
>the module contains compressed headers which saves space, vs storing it
>uncompressed on the file system.
>
>To remove complete reliance on the filesystem, there is an option of
>not
>building it as a module, and making it as a built-in.
>
>I think I see your point now - you're saying if its built-in, then it
>becomes kernel memory that is lost and unswappable. Did I get that
>right?
>I am saying that if that's a major concern, then:
>1. Don't make it a built-in, make it a module.
>2. Don't enable it at for your distro, and use a linux-headers package
>or
>whatever else you have been using so far that works for you.
>
>thanks,
>
> - Joel

My point is that if we're going to actually solve a problem, we need to make it 
so that the distro won't just disable it anyway, and it ought to be something 
scalable; otherwise nothing is gained.

I am *not* disagreeing with the problem statement!

Now, /proc isn't something that will autoload modules. A filesystem *will*, 
although you need to be able to mount it; furthermore, it makes it trivial to 
extend it (and the firmware interface provides an easy way to feed the data 
to such a filesystem without having to muck with anything magic.)

Heck, we could even make it a squashfs image that can just be mounted.

So, first of all, where does Android keep its modules, and what is actually 
included? Is /sbin/modprobe used to load the modules, as is normal? We might 
even be able to address this with some fairly trivial enhancements to modprobe; 
specifically to search in the module paths for something that isn't a module 
per se.

The best scenario would be if we could simply have the tools find the location 
equivalent of /lib/modules/$version/source...
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel

2019-01-21 Thread hpa
On January 20, 2019 8:10:03 AM PST, Joel Fernandes  
wrote:
>On Sat, Jan 19, 2019 at 11:01:13PM -0800, h...@zytor.com wrote:
>> On January 19, 2019 2:36:06 AM PST, Greg KH
> wrote:
>> >On Sat, Jan 19, 2019 at 02:28:00AM -0800, Christoph Hellwig wrote:
>> >> This seems like a pretty horrible idea and waste of kernel memory.
>> >
>> >It's only a waste if you want it to be a waste, i.e. if you load the
>> >kernel module.
>> >
>> >This really isn't any different from how /proc/config.gz works.
>> >
>> >> Just add support to kbuild to store a compressed archive in
>initramfs
>> >> and unpack it in the right place.
>> >
>> >I think the issue is that some devices do not use initramfs, or
>switch
>> >away from it after init happens or something like that.  Joel has
>all
>> >of
>> >the looney details that he can provide.
>> >
>> >thanks,
>> >
>> >greg k-h
>> 
>> Yeah, well... but it is kind of a losing game... the more in-kernel
>stuff there is the less likely are things to actually be supported.
>
>It is better than nothing, and if this makes things a bit easier and
>solves
>real-world issues people have been having, and is optional, then I
>don't see
>why not.
>
>> Modularizing it is in some ways even crazier in the sense that at
>that point you are relying on the filesystem containing the module,
>which has to be loaded into the kernel by a root user. One could even
>wonder if a better way to do this would be to have "make
>modules_install" park an archive file – or even a directory as opposed
>to a symlink – with this stuff in /lib/modules. We could even provide a
>tmpfs shim which autoloads such an archive via the firmware loader;
>this might even be generically useful, who knows.
>
>All this seems to assume where the modules are located. In Android, we
>don't
>have /lib/modules. This patch generically fits into the grand scheme of
>things
>and I think is just better made a part of the kernel since it is not
>that
>huge once compressed, as Dan also pointed out. The more complex, and the
>more
>assumptions we make, the less likely people writing tools will get it
>right
>and be able to easily use it.
>
>> 
>> Note also that initramfs contents can be built into the kernel.
>Extracting such content into a single-instance tmpfs would again be a
>possibility
>
>Such an approach would bloat the kernel image size though, which may
>not work
>for everyone. The module based approach, on the other hand, gives an
>option
>to the user to enable the feature, but not have it loaded into memory
>or used
>until it is really needed.
>
>thanks,
>
> - Joel

Well, where are the modules? They must exist in the filesystem.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


[RESEND PATCH V8 10/11] KVM: Log ROE violations in system log

2019-01-21 Thread Ahmed Abd El Mawgood
Signed-off-by: Ahmed Abd El Mawgood 
---
 virt/kvm/kvm_main.c|  3 ++-
 virt/kvm/roe.c | 25 +
 virt/kvm/roe_generic.h |  3 ++-
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d92d300539..b3dc7255b0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1945,13 +1945,14 @@ static u64 roe_gfn_to_hva(struct kvm_memory_slot *slot, 
gfn_t gfn, int offset,
addr = __gfn_to_hva_many(slot, gfn, NULL, false);
return addr;
 }
+
 static int __kvm_write_guest_page(struct kvm_memory_slot *memslot, gfn_t gfn,
  const void *data, int offset, int len)
 {
int r;
unsigned long addr;
-
addr = roe_gfn_to_hva(memslot, gfn, offset, len);
+   kvm_roe_check_and_log(memslot, gfn, data, offset, len);
if (kvm_is_error_hva(addr))
return -EFAULT;
r = __copy_to_user((void __user *)addr + offset, data, len);
diff --git a/virt/kvm/roe.c b/virt/kvm/roe.c
index 9540473f89..e424b45e1c 100644
--- a/virt/kvm/roe.c
+++ b/virt/kvm/roe.c
@@ -76,6 +76,31 @@ void kvm_roe_free(struct kvm_memory_slot *slot)
kvfree(slot->prot_list);
 }
 
+static void kvm_warning_roe_violation(u64 addr, const void *data, int len)
+{
+   int i;
+   const unsigned char *d = data;
+   char *buf = kvmalloc(len * 3 + 1, GFP_KERNEL);
+
+   for (i = 0; i < len; i++)
+   sprintf(buf+3*i, " %02x", d[i]);
+   pr_warn("ROE violation:\n");
+   pr_warn("\tAttempt to write %d bytes at address 0x%08llx\n", len, addr);
+   pr_warn("\tData: %s\n", buf);
+   kvfree(buf);
+}
+
+void kvm_roe_check_and_log(struct kvm_memory_slot *memslot, gfn_t gfn,
+   const void *data, int offset, int len)
+{
+   if (!memslot)
+   return;
+   if (!gfn_is_full_roe(memslot, gfn) &&
+   !kvm_roe_check_range(memslot, gfn, offset, len))
+   return;
+   kvm_warning_roe_violation((gfn << PAGE_SHIFT) + offset, data, len);
+}
+
 static void kvm_roe_protect_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, u64 npages, bool partial)
 {
diff --git a/virt/kvm/roe_generic.h b/virt/kvm/roe_generic.h
index f1ce4a8aec..6c5f0cf381 100644
--- a/virt/kvm/roe_generic.h
+++ b/virt/kvm/roe_generic.h
@@ -14,5 +14,6 @@ void kvm_roe_free(struct kvm_memory_slot *slot);
 int kvm_roe_init(struct kvm_memory_slot *slot);
 bool kvm_roe_check_range(struct kvm_memory_slot *slot, gfn_t gfn, int offset,
int len);
-
+void kvm_roe_check_and_log(struct kvm_memory_slot *memslot, gfn_t gfn,
+   const void *data, int offset, int len);
 #endif
-- 
2.19.2
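
For reference, a blocked write would surface in the host's kernel log roughly
in this shape (address and data bytes invented here; only the format comes from
the pr_warn() calls in the patch above):

```
ROE violation:
	Attempt to write 4 bytes at address 0x3c0d1000
	Data:  de ad be ef
```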



Re: [PATCH] doc:process: add missing internal link in stable-kernel-rules

2019-01-21 Thread Jonathan Corbet
On Sun, 20 Jan 2019 12:16:29 +0100
Federico Vaga  wrote:

> Keep the document consistent. In the document, option references
> are always linked, except for the one I fixed with this patch
> 
> Signed-off-by: Federico Vaga 

Applied, thanks.

jon


Re: [PATCH v5] coding-style: Clarify the expectations around bool

2019-01-21 Thread Jonathan Corbet
On Fri, 18 Jan 2019 15:50:47 -0700
Jason Gunthorpe  wrote:

> There has been some confusion since checkpatch started warning about bool
> use in structures, and people have been avoiding using it.
> 
> Many people feel there is still a legitimate place for bool in structures,
> so provide some guidance on bool usage derived from the entire thread that
> spawned the checkpatch warning.
> 
> Link: 
> https://lkml.kernel.org/r/ca+55afwvzk1ofb9t2v014ptakfhtvan_zj2dojncy3x6e4u...@mail.gmail.com
> Signed-off-by: Joe Perches 
> Acked-by: Joe Perches 
> Reviewed-by: Bart Van Assche 
> Acked-by: Jani Nikula 
> Reviewed-by: Joey Pabalinas 
> Signed-off-by: Jason Gunthorpe 

So this seems ready; I've applied it.  

Thanks,

jon


Re: [PATCH v2 1/1] doc: networking: integrate scaling document into doc tree

2019-01-21 Thread Jonathan Corbet
On Fri, 18 Jan 2019 21:38:32 +0100
Otto Sabart  wrote:

> Convert scaling document into reStructuredText and add reference to
> scaling document into main table of contents in network documentation.
> 
> There are no semantic changes.
> 
> There are no references to the "scaling.txt" file. The whole kernel tree was
> checked using:
> $ grep -r "scaling\.txt"
> 
> Signed-off-by: Otto Sabart 

So this didn't apply cleanly to docs-next due to a conflict with
commit d96bedb2b248 ("doc: networking: add offload documents into main
index file") by one Otto Sabart.  I've fixed that up, but in the future
it will be helpful if you can be sure that your patches are against the
docs tree if you want me to merge them.

Thanks,

jon


Re: [PATCH] doc:it_IT: add translations in process/

2019-01-21 Thread Federico Vaga
On Monday, January 21, 2019 2:56:17 AM CET Jonathan Corbet wrote:
> On Sat, 19 Jan 2019 23:13:41 +0100
> 
> Federico Vaga  wrote:
> > This patch adds the Italian translation for the following documents
> > in Documentation/process:
> > 
> > - applying-patches
> > - submit-checklist
> > - submitting-drivers
> > - changes
> > - stable api nonsense
> > 
> > Signed-off-by: Federico Vaga 
> 
> In general, this looks good.  One thing jumped at me, though...(OK, more
> 
> than one, but only one to fix):
> > +Perl
> > +
> > +
> > +Per compilare il kernel vi servirà perl 5 e i seguenti moduli
> > ``Getopt::Long``, +``Getopt::Std``, ``File::Basename``, e ``File::Find``.
> 
> Didn't Rob Landley go though some considerable pain a while back to
> eliminate Perl from the basic kernel build?  This, perhaps, should come out
> of the original.  (As long as it's still there, it makes sense to be in the
> translation, of course).

I can have a deeper look and try to fix the original document too

> 
> > +Modifiche architetturali
> > +
> > +
> > +DevFS è stato reso obsoleto da udev
> > +(http://www.kernel.org/pub/linux/utils/kernel/hotplug/)
> > +
> > +Il supporto per UID a 32-bit è ora disponibile.  Divertitevi!
> 
> Speaking of stuff that should come out of the original...this is not
> exactly news at this point...

Well, generally speaking this document mixes things from the past and the 
"present". It probably needs a review.
 
> Meanwhile, here is the thing that actually needs to be fixed:
> > +Raiserfsprogs
> > +-
> > +
> > +Il pacchetto raiserfsprogs dovrebbe essere usato con raiserfs-3.6.x
> > (Linux
> > +kernel 2.4.x).  Questo è un pacchetto combinato che contiene versioni
> > +funzionanti di ``mkreiserfs``, ``resize_reiserfs``, ``debugreiserfs`` e
> > +``reiserfsck``.  Questi programmi funzionano sulle piattaforme i386 e
> > alpha.
> Even in Italian, I believe that 'Reiser" is spelled "Reiser"...

oops, you are right. I will fix this and send a V2 patch. Then I will have a 
look at the original document and try to update it

> Thanks,
> 
> jon


-- 
Federico Vaga
http://www.federicovaga.it/




Re: [PATCH] doc:process: remove note from 'stable api nonsense'

2019-01-21 Thread Federico Vaga
On Monday, January 21, 2019 2:43:38 AM CET Jonathan Corbet wrote:
> On Fri, 18 Jan 2019 22:58:04 +0100
> 
> Federico Vaga  wrote:
> > The link referred to by the note can't be retrieved: this patch just
> > removes that old note.
> > 
> > Signed-off-by: Federico Vaga 
> > ---
> > 
> >  Documentation/process/stable-api-nonsense.rst | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/process/stable-api-nonsense.rst
> > b/Documentation/process/stable-api-nonsense.rst index
> > 24f5aeecee91..57d95a49c096 100644
> > --- a/Documentation/process/stable-api-nonsense.rst
> > +++ b/Documentation/process/stable-api-nonsense.rst
> > @@ -171,8 +171,7 @@ is also a rough job.
> > 
> >  Simple, get your kernel driver into the main kernel tree (remember we
> >  are talking about GPL released drivers here, if your code doesn't fall
> > 
> > -under this category, good luck, you are on your own here, you leech
> > -.)  If your
> > +under this category, good luck, you are on your own here, you leech).  If
> > your> 
> >  driver is in the tree, and a kernel interface changes, it will be fixed
> >  up by the person who did the kernel change in the first place.  This
> >  ensures that your driver is always buildable, and works over time, with
> 
> I've applied this.  I do wonder if the "you leech" should maybe come out
> too, though.  I don't think that parasitic worms are a protected class
> under the CoC, but they might still suffer emotionally from being
> compared to the purveyors of proprietary modules...

I agree, do you want me to change the patch?

> 
> jon


-- 
Federico Vaga
http://www.federicovaga.it/




Re: [PATCH] Documentation: DMA-API: fix two typos

2019-01-21 Thread Jonathan Corbet
On Fri, 18 Jan 2019 13:38:22 +
Corentin Labbe  wrote:

> This patch fixes two typos: a missing "e", and dma-api/driver_filter being
> incorrectly typed as dma-api/driver-filter.
> 
> Signed-off-by: Corentin Labbe 

So I've applied this, but...

>  Documentation/DMA-API.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
> index 78114ee63057..bd009f3069f8 100644
> --- a/Documentation/DMA-API.txt
> +++ b/Documentation/DMA-API.txt
> @@ -531,7 +531,7 @@ that simply cannot make consistent memory.
>  dma_addr_t dma_handle, unsigned long attrs)
>  
>  Free memory allocated by the dma_alloc_attrs().  All parameters common
> -parameters must identical to those otherwise passed to dma_fre_coherent,
> +parameters must identical to those otherwise passed to dma_free_coherent,

That sentence clearly needs a lot more help than just an extra "e"; I
took the liberty of fixing it up on the way in.

Thanks,

jon


[RESEND PATCH V8 03/11] KVM: X86: Add helper function to convert SPTE to GFN

2019-01-21 Thread Ahmed Abd El Mawgood
Signed-off-by: Ahmed Abd El Mawgood 
---
 arch/x86/kvm/mmu.c | 7 +++
 arch/x86/kvm/mmu.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 098df7d135..bbfe3f2863 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1053,6 +1053,13 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page 
*sp, int index)
 
return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
+gfn_t spte_to_gfn(u64 *spte)
+{
+   struct kvm_mmu_page *sp;
+
+   sp = page_header(__pa(spte));
+   return kvm_mmu_page_get_gfn(sp, spte - sp->spt);
+}
 
 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c7b333147c..49d7f2f002 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -211,4 +211,5 @@ void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, 
gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn);
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
+gfn_t spte_to_gfn(u64 *sptep);
 #endif
-- 
2.19.2
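
A sketch of how later patches in this series might consume the new helper from
inside an rmap walk (illustrative only; the surrounding walk and the ROE check
are assumptions, and the helper is assumed to live in arch/x86/kvm/mmu.c next
to the existing rmap code):

```
/* Illustrative only: visit every gfn mapped through one rmap head. */
static void walk_rmap_gfns(struct kvm_rmap_head *rmap_head)
{
	u64 *sptep;
	struct rmap_iterator iter;

	for_each_rmap_spte(rmap_head, &iter, sptep) {
		gfn_t gfn = spte_to_gfn(sptep);

		/* e.g. check whether this gfn is ROE-protected */
		(void)gfn;
	}
}
```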



Re: [PATCH] doc:process: remove note from 'stable api nonsense'

2019-01-21 Thread Greg KH
On Mon, Jan 21, 2019 at 09:14:00AM +0100, Federico Vaga wrote:
> On Monday, January 21, 2019 2:43:38 AM CET Jonathan Corbet wrote:
> > On Fri, 18 Jan 2019 22:58:04 +0100
> > 
> > Federico Vaga  wrote:
> > > The link referred to by the note can't be retrieved: this patch just
> > > removes that old note.
> > > 
> > > Signed-off-by: Federico Vaga 
> > > ---
> > > 
> > >  Documentation/process/stable-api-nonsense.rst | 3 +--
> > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > 
> > > diff --git a/Documentation/process/stable-api-nonsense.rst
> > > b/Documentation/process/stable-api-nonsense.rst index
> > > 24f5aeecee91..57d95a49c096 100644
> > > --- a/Documentation/process/stable-api-nonsense.rst
> > > +++ b/Documentation/process/stable-api-nonsense.rst
> > > @@ -171,8 +171,7 @@ is also a rough job.
> > > 
> > >  Simple, get your kernel driver into the main kernel tree (remember we
> > >  are talking about GPL released drivers here, if your code doesn't fall
> > > 
> > > -under this category, good luck, you are on your own here, you leech
> > > -.)  If your
> > > +under this category, good luck, you are on your own here, you leech).  If
> > > your> 
> > >  driver is in the tree, and a kernel interface changes, it will be fixed
> > >  up by the person who did the kernel change in the first place.  This
> > >  ensures that your driver is always buildable, and works over time, with
> > 
> > I've applied this.  I do wonder if the "you leech" should maybe come out
> > too, though.  I don't think that parasitic worms are a protected class
> > under the CoC, but they might still suffer emotionally from being
> > compared to the purveyors of proprietary modules...
> 
> I agree, do you want me to change the patch?

I would leave it as-is for now please.  When this was written, there was
a lot of discussion about closed source modules, and how the companies
that created them were leeches on our development community.  No one
disagreed with that statement, and a number of companies privately
agreed with us.

That still has not changed.

So I would like to see this remain.

thanks,

greg k-h


[RESEND PATCH V8 05/11] KVM: Create architecture independent ROE skeleton

2019-01-21 Thread Ahmed Abd El Mawgood
This patch introduces a hypercall that can help defend against a subset of
kernel rootkits. It works by placing read-only protection in the shadow PTEs.
The resulting protection is also kept in a bitmap for each kvm_memory_slot and
is used as a reference when updating SPTEs. The overall goal is to protect the
guest kernel's static data from modification when an attacker is running in
guest ring 0; for this reason there is no hypercall to revert the effect of the
Memory ROE hypercall. This patch doesn't implement an integrity check on the
guest TLB, so an obvious attack on the current implementation would involve
guest virtual address -> guest physical address remapping, but there are plans
to fix that. For this patch to work on a given arch/, one needs to implement
two architecture-specific functions, kvm_roe_arch_commit_protection() and
kvm_roe_arch_is_userspace(), and to have kvm_roe() invoked through the
appropriate hypercall mechanism (see the sketch below).
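
For illustration only, a rough sketch of what that per-arch wiring could look
like on x86; the CPL check and the dispatch snippet below are assumptions made
for this sketch, not the actual x86 patches of this series:

```
#include <linux/kvm_host.h>
#include <kvm/roe.h>

/* ROE hypercalls issued from guest user mode (CPL > 0) must be refused. */
bool kvm_roe_arch_is_userspace(struct kvm_vcpu *vcpu)
{
	return kvm_x86_ops->get_cpl(vcpu) != 0;
}

/*
 * In the x86 hypercall dispatcher (kvm_emulate_hypercall()), KVM_HC_ROE
 * would then be routed to the generic handler roughly like this:
 *
 *	case KVM_HC_ROE:
 *		ret = kvm_roe(vcpu, a0, a1, a2, a3);
 *		break;
 */
```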

Signed-off-by: Ahmed Abd El Mawgood 
---
 include/kvm/roe.h |  16 
 include/linux/kvm_host.h  |   1 +
 include/uapi/linux/kvm_para.h |   4 +
 virt/kvm/kvm_main.c   |  19 +++--
 virt/kvm/roe.c| 136 ++
 virt/kvm/roe_generic.h|  19 +
 6 files changed, 190 insertions(+), 5 deletions(-)
 create mode 100644 include/kvm/roe.h
 create mode 100644 virt/kvm/roe.c
 create mode 100644 virt/kvm/roe_generic.h

diff --git a/include/kvm/roe.h b/include/kvm/roe.h
new file mode 100644
index 00..6a86866623
--- /dev/null
+++ b/include/kvm/roe.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __KVM_ROE_H__
+#define __KVM_ROE_H__
+/*
+ * KVM Read Only Enforcement
+ * Copyright (c) 2018 Ahmed Abd El Mawgood
+ *
+ * Author Ahmed Abd El Mawgood 
+ *
+ */
+void kvm_roe_arch_commit_protection(struct kvm *kvm,
+   struct kvm_memory_slot *slot);
+int kvm_roe(struct kvm_vcpu *vcpu, u64 a0, u64 a1, u64 a2, u64 a3);
+bool kvm_roe_arch_is_userspace(struct kvm_vcpu *vcpu);
+#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c38cc5eb7e..a627c6e81a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -297,6 +297,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct 
kvm_vcpu *vcpu)
 struct kvm_memory_slot {
gfn_t base_gfn;
unsigned long npages;
+   unsigned long *roe_bitmap;
unsigned long *dirty_bitmap;
struct kvm_arch_memory_slot arch;
unsigned long userspace_addr;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 6c0ce49931..e6004e0750 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -28,7 +28,11 @@
 #define KVM_HC_MIPS_CONSOLE_OUTPUT 8
 #define KVM_HC_CLOCK_PAIRING   9
 #define KVM_HC_SEND_IPI10
+#define KVM_HC_ROE 11
 
+/* ROE Functionality parameters */
+#define ROE_VERSION0
+#define ROE_MPROTECT   1
 /*
  * hypercalls use architecture specific
  */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2f37b4b6a2..88b5fbcbb0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -61,6 +61,7 @@
 #include "coalesced_mmio.h"
 #include "async_pf.h"
 #include "vfio.h"
+#include "roe_generic.h"
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -551,9 +552,10 @@ static void kvm_free_memslot(struct kvm *kvm, struct 
kvm_memory_slot *free,
  struct kvm_memory_slot *dont,
  enum kvm_mr_change change)
 {
-   if (change == KVM_MR_DELETE)
+   if (change == KVM_MR_DELETE) {
+   kvm_roe_free(free);
kvm_destroy_dirty_bitmap(free);
-
+   }
kvm_arch_free_memslot(kvm, free, dont);
 
free->npages = 0;
@@ -1018,6 +1020,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (kvm_create_dirty_bitmap(&new) < 0)
goto out_free;
}
+   if (kvm_roe_init(&new) < 0)
+   goto out_free;
 
slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
if (!slots)
@@ -1348,13 +1352,18 @@ static bool memslot_is_readonly(struct kvm_memory_slot 
*slot)
return slot->flags & KVM_MEM_READONLY;
 }
 
+static bool gfn_is_readonly(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+   return gfn_is_full_roe(slot, gfn) || memslot_is_readonly(slot);
+}
+
 static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
   gfn_t *nr_pages, bool write)
 {
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
return KVM_HVA_ERR_BAD;
 
-   if (memslot_is_readonly(slot) && write)
+   if (gfn_is_readonly(slot, gfn) && write)
return KVM_HVA_ERR_RO_BAD;
 
if (nr_pages)
@@ -1402,7 +1411,7 @@ unsigned long gfn_to_hva_memslot_prot(struct 
kvm_memory_slot *slot,
unsigned long hva = __gfn_to_hva_many(slot, gfn, NUL

[RESEND PATCH V8 11/11] KVM: ROE: Store protected chunks in red black tree

2019-01-21 Thread Ahmed Abd El Mawgood
The old way of storing protected chunks was a linked list, which made the
search for a chunk linear in the number of chunks. At around 2000 chunks, the
time taken to read the last chunk was about 10 times higher than for the first
chunk. This patch stores the chunks in a red-black tree for faster search.

Signed-off-by: Ahmed Abd El Mawgood 
---
 include/linux/kvm_host.h |  36 ++-
 virt/kvm/roe.c   | 228 +++
 virt/kvm/roe_generic.h   |   3 +
 3 files changed, 197 insertions(+), 70 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9acf5f54ac..5f4bec0662 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -301,7 +302,7 @@ static inline int kvm_vcpu_exiting_guest_mode(struct 
kvm_vcpu *vcpu)
 struct protected_chunk {
gpa_t gpa;
u64 size;
-   struct list_head list;
+   struct rb_node node;
 };
 
 static inline bool kvm_roe_range_overlap(struct protected_chunk *chunk,
@@ -316,12 +317,43 @@ static inline bool kvm_roe_range_overlap(struct 
protected_chunk *chunk,
(gpa + len - 1 >= chunk->gpa);
 }
 
+static inline int kvm_roe_range_cmp_position(struct protected_chunk *chunk,
+   gpa_t gpa, int len) {
+   /*
+* returns -1 if the gpa and len are smaller than chunk.
+* returns 0 if they overlap or strictly adjacent
+* returns 1 if gpa and len are bigger than the chunk
+*/
+
+   if (gpa + len <= chunk->gpa)
+   return -1;
+   if (gpa >= chunk->gpa + chunk->size)
+   return 1;
+   return 0;
+}
+
+static inline int kvm_roe_range_cmp_mergability(struct protected_chunk *chunk,
+   gpa_t gpa, int len) {
+   /*
+* returns -1 if the gpa and len are smaller than chunk and not adjacent
+* to it
+* returns 0 if they overlap or strictly adjacent
+* returns 1 if gpa and len are bigger than the chunk and not adjacent
+* to it
+*/
+   if (gpa + len < chunk->gpa)
+   return -1;
+   if (gpa > chunk->gpa + chunk->size)
+   return 1;
+   return 0;
+
+}
 struct kvm_memory_slot {
gfn_t base_gfn;
unsigned long npages;
unsigned long *roe_bitmap;
unsigned long *partial_roe_bitmap;
-   struct list_head *prot_list;
+   struct rb_root  *prot_root;
unsigned long *dirty_bitmap;
struct kvm_arch_memory_slot arch;
unsigned long userspace_addr;
diff --git a/virt/kvm/roe.c b/virt/kvm/roe.c
index e424b45e1c..15297c0e57 100644
--- a/virt/kvm/roe.c
+++ b/virt/kvm/roe.c
@@ -23,10 +23,10 @@ int kvm_roe_init(struct kvm_memory_slot *slot)
sizeof(unsigned long), GFP_KERNEL);
if (!slot->partial_roe_bitmap)
goto fail2;
-   slot->prot_list = kvzalloc(sizeof(struct list_head), GFP_KERNEL);
-   if (!slot->prot_list)
+   slot->prot_root = kvzalloc(sizeof(struct rb_root), GFP_KERNEL);
+   if (!slot->prot_root)
goto fail3;
-   INIT_LIST_HEAD(slot->prot_list);
+   *slot->prot_root = RB_ROOT;
return 0;
 fail3:
kvfree(slot->partial_roe_bitmap);
@@ -40,12 +40,19 @@ int kvm_roe_init(struct kvm_memory_slot *slot)
 static bool kvm_roe_protected_range(struct kvm_memory_slot *slot, gpa_t gpa,
int len)
 {
-   struct list_head *pos;
-   struct protected_chunk *cur_chunk;
-
-   list_for_each(pos, slot->prot_list) {
-   cur_chunk = list_entry(pos, struct protected_chunk, list);
-   if (kvm_roe_range_overlap(cur_chunk, gpa, len))
+   struct rb_node *node = slot->prot_root->rb_node;
+
+   while (node) {
+   struct protected_chunk *cur_chunk;
+   int cmp;
+
+   cur_chunk = rb_entry(node, struct protected_chunk, node);
+   cmp = kvm_roe_range_cmp_position(cur_chunk, gpa, len);
+   if (cmp < 0)/*target chunk is before current node*/
+   node = node->rb_left;
+   else if (cmp > 0)/*target chunk is after current node*/
+   node = node->rb_right;
+   else
return true;
}
return false;
@@ -62,18 +69,24 @@ bool kvm_roe_check_range(struct kvm_memory_slot *slot, 
gfn_t gfn, int offset,
 }
 EXPORT_SYMBOL_GPL(kvm_roe_check_range);
 
-void kvm_roe_free(struct kvm_memory_slot *slot)
+static void kvm_roe_destroy_tree(struct rb_node *node)
 {
-   struct protected_chunk *pos, *n;
-   struct list_head *head = slot->prot_list;
+   struct protected_chunk *cur_chunk;
+
+   if (!node)
+   return;
+   kvm_roe_destroy_tree(node->rb_left);
+   kvm_roe_destroy_tree(node->rb_right);
+   cur_chunk = rb_entry(node, struct protected_chunk, node);
+   kvfree(cur_chunk);
+}
 
+void kvm_roe_free(str

[RESEND PATCH V8 02/11] KVM: X86: Add arbitrary data pointer in kvm memslot iterator functions

2019-01-21 Thread Ahmed Abd El Mawgood
This helps share data with the slot_level_handler callback. In my case I need
to share a counter for the pages traversed, to use it with a bitmap. Being able
to pass an arbitrary memory pointer into the slot_level_handler callback makes
that easy (see the sketch below).
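
A minimal illustration of the pattern this enables (the handler and caller
below are made-up examples, not code from this series; they assume the
slot_handle_level_range() signature introduced by this patch and would live
next to the other handlers in arch/x86/kvm/mmu.c):

```
/* Count the rmap heads visited during a memslot walk via the data pointer. */
static bool count_rmap_heads(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
			     void *data)
{
	u64 *counter = data;

	(*counter)++;
	return false;	/* no TLB flush needed */
}

/* Caller, with mmu_lock held: */
static u64 count_slot_rmap_heads(struct kvm *kvm, struct kvm_memory_slot *slot)
{
	u64 count = 0;

	slot_handle_level_range(kvm, slot, count_rmap_heads,
				PT_PAGE_TABLE_LEVEL, PT_MAX_HUGEPAGE_LEVEL,
				slot->base_gfn,
				slot->base_gfn + slot->npages - 1,
				false, &count);
	return count;
}
```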

Signed-off-by: Ahmed Abd El Mawgood 
---
 arch/x86/kvm/mmu.c | 65 ++
 1 file changed, 37 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce770b4462..098df7d135 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1525,7 +1525,7 @@ static bool spte_write_protect(u64 *sptep, bool 
pt_protect)
 
 static bool __rmap_write_protect(struct kvm *kvm,
 struct kvm_rmap_head *rmap_head,
-bool pt_protect)
+bool pt_protect, void *data)
 {
u64 *sptep;
struct rmap_iterator iter;
@@ -1564,7 +1564,8 @@ static bool wrprot_ad_disabled_spte(u64 *sptep)
  * - W bit on ad-disabled SPTEs.
  * Returns true iff any D or W bits were cleared.
  */
-static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head 
*rmap_head)
+static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head 
*rmap_head,
+   void *data)
 {
u64 *sptep;
struct rmap_iterator iter;
@@ -1590,7 +1591,8 @@ static bool spte_set_dirty(u64 *sptep)
return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
+static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+   void *data)
 {
u64 *sptep;
struct rmap_iterator iter;
@@ -1622,7 +1624,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm 
*kvm,
while (mask) {
rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + 
__ffs(mask),
  PT_PAGE_TABLE_LEVEL, slot);
-   __rmap_write_protect(kvm, rmap_head, false);
+   __rmap_write_protect(kvm, rmap_head, false, NULL);
 
/* clear the first set bit */
mask &= mask - 1;
@@ -1648,7 +1650,7 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
while (mask) {
rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + 
__ffs(mask),
  PT_PAGE_TABLE_LEVEL, slot);
-   __rmap_clear_dirty(kvm, rmap_head);
+   __rmap_clear_dirty(kvm, rmap_head, NULL);
 
/* clear the first set bit */
mask &= mask - 1;
@@ -1701,7 +1703,8 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 
for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
rmap_head = __gfn_to_rmap(gfn, i, slot);
-   write_protected |= __rmap_write_protect(kvm, rmap_head, true);
+   write_protected |= __rmap_write_protect(kvm, rmap_head, true,
+   NULL);
}
 
return write_protected;
@@ -1715,7 +1718,8 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 
gfn)
return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
 }
 
-static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
+static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+   void *data)
 {
u64 *sptep;
struct rmap_iterator iter;
@@ -1735,7 +1739,7 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct 
kvm_rmap_head *rmap_head,
   struct kvm_memory_slot *slot, gfn_t gfn, int level,
   unsigned long data)
 {
-   return kvm_zap_rmapp(kvm, rmap_head);
+   return kvm_zap_rmapp(kvm, rmap_head, NULL);
 }
 
 static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
@@ -5552,13 +5556,15 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
 }
 
 /* The return value indicates if tlb flush on all vcpus is needed. */
-typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head 
*rmap_head);
+typedef bool (*slot_level_handler) (struct kvm *kvm,
+   struct kvm_rmap_head *rmap_head, void *data);
 
 /* The caller should hold mmu-lock before calling this function. */
 static __always_inline bool
 slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
slot_level_handler fn, int start_level, int end_level,
-   gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
+   gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb,
+   void *data)
 {
struct slot_rmap_walk_iterator iterator;
bool flush = false;
@@ -5566,7 +5572,7 @@ slot_handle_level_range(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
end_gfn, &iterator) {
 

[RESEND PATCH V8 01/11] KVM: State whether memory should be freed in kvm_free_memslot

2019-01-21 Thread Ahmed Abd El Mawgood
The conditions under which kvm_free_memslot() frees memory are kind of ad hoc,
and that would make it hard to extend memslots with allocatable data that needs
to be freed. So I replaced the current mechanism with a clear flag that states
whether the memory slot should be freed.

Signed-off-by: Ahmed Abd El Mawgood 
---
 virt/kvm/kvm_main.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1f888a103f..2f37b4b6a2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -548,9 +548,10 @@ static void kvm_destroy_dirty_bitmap(struct 
kvm_memory_slot *memslot)
  * Free any memory in @free but not in @dont.
  */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
- struct kvm_memory_slot *dont)
+ struct kvm_memory_slot *dont,
+ enum kvm_mr_change change)
 {
-   if (!dont || free->dirty_bitmap != dont->dirty_bitmap)
+   if (change == KVM_MR_DELETE)
kvm_destroy_dirty_bitmap(free);
 
kvm_arch_free_memslot(kvm, free, dont);
@@ -566,7 +567,7 @@ static void kvm_free_memslots(struct kvm *kvm, struct 
kvm_memslots *slots)
return;
 
kvm_for_each_memslot(memslot, slots)
-   kvm_free_memslot(kvm, memslot, NULL);
+   kvm_free_memslot(kvm, memslot, NULL, KVM_MR_DELETE);
 
kvfree(slots);
 }
@@ -1061,14 +1062,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
kvm_arch_commit_memory_region(kvm, mem, &old, &new, change);
 
-   kvm_free_memslot(kvm, &old, &new);
+   kvm_free_memslot(kvm, &old, &new, change);
kvfree(old_memslots);
return 0;
 
 out_slots:
kvfree(slots);
 out_free:
-   kvm_free_memslot(kvm, &new, &old);
+   kvm_free_memslot(kvm, &new, &old, change);
 out:
return r;
 }
-- 
2.19.2



[RESEND PATCH V8 0/11] KVM: X86: Introducing ROE Protection Kernel Hardening

2019-01-21 Thread Ahmed Abd El Mawgood
-- Summary --

ROE is a hypercall that enables the host operating system to restrict a guest's
access to its own memory. This provides a hardening mechanism that can be used
to stop rootkits from manipulating kernel static data structures and code. Once
a memory region is protected, the guest kernel can't even request undoing the
protection.

Memory protected by ROE should be non-swappable because even if a ROE-protected
page got swapped out, it won't be possible to write anything in its place.

The ROE hypercall should be capable of protecting either a whole memory frame
or parts of it. With these two, it should be possible for the guest kernel to
protect its memory and all the page table entries for that memory inside the
page table. I am still not sure whether this should be part of ROE's job or the
guest's job.

Our threat model assumes that an attacker has full root access to a running
guest and that their goal is to manipulate kernel code/data (hook syscalls,
overwrite the IDT, etc.).


-- Why didn't I implement ROE in host's userspace ? --

The reason why it is better to implement this inside KVM: it would be a big
performance hit to vmexit and switch to user-space mode on each fault; on the
other hand, having the permissions handled by EPT gives a remarkable
performance gain when writing to a non-protected page that contains protected
chunks. My tests showed that the bottleneck is the time spent in context
switching; reducing the number of switches improved performance a lot. A full,
lengthy explanation with numbers can be found in [2].


-- Future Work -- 
There is future work in progress to also put some sort of protection on the
page table register CR3 and other critical registers that can be intercepted by
KVM. This way it won't be possible for an attacker to manipulate any part of
the guest's page table.


-- Test Case --

I was requested to add a test to tools/testing/selftests/kvm/. But the original
testing suite didn't work on my machine: I experienced shutdowns due to a triple
fault caused by an EPT fault with the current tests. I tried bisecting, but the
triple fault was there from the very first commit.

So instead I provide here a demo kernel module to test the current
implementation:

```
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/kvm_para.h>
#include <asm/io.h> /* virt_to_phys() */
MODULE_LICENSE("GPL");
MODULE_AUTHOR("OddCoder");
MODULE_DESCRIPTION("ROE Hello world Module");
MODULE_VERSION("0.0.1");

#define KVM_HC_ROE 11
#define ROE_VERSION 0
#define ROE_MPROTECT 1
#define ROE_MPROTECT_CHUNK 2

static long roe_version(void){
return kvm_hypercall1 (KVM_HC_ROE, ROE_VERSION);
}

static long roe_mprotect(void *addr, long pg_count) {
return kvm_hypercall3(KVM_HC_ROE, ROE_MPROTECT, (u64)addr, pg_count);
}

static long roe_mprotect_chunk(void *addr, long size) {
return kvm_hypercall3(KVM_HC_ROE, ROE_MPROTECT_CHUNK, (u64)addr, size);
}

static int __init hello(void ) {
int x;
struct page *pg1, *pg2;
void *memory;
pg1 = alloc_page(GFP_KERNEL);
pg2 = alloc_page(GFP_KERNEL);
memory = page_to_virt(pg1);
pr_info ("ROE_VERSION: %ld\n", roe_version());
pr_info ("Allocated memory: 0x%llx\n", (u64)memory);
pr_info("Physical Address: 0x%llx\n", virt_to_phys(memory));
strcpy((char *)memory, "ROE PROTECTED");
pr_info("memory_content: %s\n", (char *)memory);
x = roe_mprotect((void *)memory, 1);
strcpy((char *)memory, "The strcpy should silently fail and"
"memory content won't be modified");
pr_info("memory_content: %s\n", (char *)memory);
memory = page_to_virt(pg2);
pr_info ("Allocated memory: 0x%llx\n", (u64)memory);
pr_info("Physical Address: 0x%llx\n", virt_to_phys(memory));
strcpy((char *)memory, "ROE PROTECTED PARTIALLY");
roe_mprotect_chunk((void *)memory, strlen((char *)memory));
pr_info("memory_content: %s\n", (char *)memory);
strcpy((char *)memory, "XXX"
" <- Text here not modified still Can 
concat");
pr_info("memory_content: %s\n", (char *)memory);
return 0;
}
static void __exit bye(void) {
pr_info("Allocated Memory May never be freed at all!\n");
pr_info("Actually this is more of an ABI demonstration\n");
pr_info("than actual use case\n");
}
module_init(hello);
module_exit(bye);

```

I tried this on Gentoo host with Ubuntu guest and Qemu from git after applyin

[PATCH v2 3/3] sched: Document Energy Aware Scheduling

2019-01-21 Thread Quentin Perret
Add some documentation detailing the main design points of EAS, as well
as a list of its dependencies.

Parts of this documentation are taken from Morten Rasmussen's original
EAS posting: https://lkml.org/lkml/2015/7/7/754

Reviewed-by: Qais Yousef 
Co-authored-by: Morten Rasmussen 
Signed-off-by: Quentin Perret 
---
 Documentation/scheduler/sched-energy.txt | 431 +++
 1 file changed, 431 insertions(+)
 create mode 100644 Documentation/scheduler/sched-energy.txt

diff --git a/Documentation/scheduler/sched-energy.txt 
b/Documentation/scheduler/sched-energy.txt
new file mode 100644
index ..b91899cb2846
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,431 @@
+  ===
+  Energy Aware Scheduling
+  ===
+
+1. Introduction
+---
+
+Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
+the impact of its decisions on the energy consumed by CPUs. EAS relies on an
+Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
+with a minimal impact on throughput. This document aims at providing an
+introduction on how EAS works, what the main design decisions behind it are,
+and details what is needed to get it to run.
+
+Before going any further, please note that at the time of writing:
+
+   /!\ EAS does not support platforms with symmetric CPU topologies /!\
+
+EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
+because this is where the potential for saving energy through scheduling is
+the highest.
+
+The actual EM used by EAS is _not_ maintained by the scheduler, but by a
+dedicated framework. For details about this framework and what it provides,
+please refer to its documentation (which is available under
+Documentation/driver-api/pm/energy-model.rst).
+
+
+2. Background and Terminology
+-
+
+To make it clear from the start:
+ - energy = [joule] (resource like a battery on powered devices)
+ - power = energy/time = [joule/second] = [watt]
+
+The goal of EAS is to minimize energy, while still getting the job done. That
+is, we want to maximize:
+
+   performance [inst/s]
+   --------------------
+       power [W]
+
+which is equivalent to minimizing:
+
+    energy [J]
+   -----------
+   instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance.
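
Spelling out the unit algebra behind that equivalence (a worked step added for
clarity, not part of the patch text above):

```
\frac{\text{performance}}{\text{power}}
  = \frac{\text{inst}/\text{s}}{\text{J}/\text{s}}
  = \frac{\text{inst}}{\text{J}},
\qquad
\max \frac{\text{inst}}{\text{J}} \iff \min \frac{\text{J}}{\text{inst}}
```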
+
+The idea behind introducing an EM is to allow the scheduler to evaluate the
+implications of its decisions rather than blindly applying energy-saving
+techniques that may have positive effects only on some platforms. At the same
+time, the EM must be as simple as possible to minimize the scheduler latency
+impact.
+
+In short, EAS changes the way tasks are assigned to CPUs. When it is time
+for the scheduler to decide where a task should run (during wake-up), the EM
+is used to break the tie between several good CPU candidates and pick the one
+that is predicted to yield the best energy consumption without harming the
+system's throughput. EAS is applied only to CFS tasks at the time of writing,
+but it could be extended to other scheduling classes in the future.
+
+The predictions made by EAS rely on specific elements of knowledge about the
+platform's topology, which include the 'capacity' of CPUs (defined in
+Section 3. below), and their respective energy costs.
+
+
+3. Topology information
+---
+
+EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
+differentiate CPUs with different computing throughput. The 'capacity' of a CPU
+represents the amount of work it can absorb when running at its highest
+frequency compared to the most capable CPU of the system. Capacity values are
+normalized in a 1024 range, and are comparable with the utilization signals of
+tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism.
+Thanks to capacity and utilization values, EAS is able to estimate how big/busy a
+task/CPU is, and to take this into consideration when evaluating performance vs
+energy trade-offs. The capacity of CPUs is provided via arch-specific code
+through the arch_scale_cpu_capacity() callback. As an example, arm and arm64
+share an implementation of this callback which uses a combination of CPUFreq
+data and device-tree bindings to compute the capacity of CPUs (see
+Documentation/devicetree/bindings/arm/cpu-capacity.txt for more details).
+
+The rest of platform knowledge used by EAS is directly read from the Energy
+Model (EM) framework. The EM of a platform is composed of a power cost table
+per 'performance domain' in the system (for further details about performance
+domains, see Documentation/driver-api/pm/energy-mod

[PATCH v2 2/3] PM / EM: Document the Energy Model framework

2019-01-21 Thread Quentin Perret
Introduce a documentation file summarizing the key design points and
APIs of the newly introduced Energy Model framework.

Reviewed-by: Juri Lelli 
Signed-off-by: Quentin Perret 

---

Juri: Although I did change some things to the doc in v2 (translated to
rst mainly), I kept your 'Reviewed-by' as the content is still pretty
much the same. Please scream if you disagree :-)

Thanks,
Quentin
---
 Documentation/driver-api/pm/energy-model.rst | 150 +++
 Documentation/driver-api/pm/index.rst|   1 +
 2 files changed, 151 insertions(+)
 create mode 100644 Documentation/driver-api/pm/energy-model.rst

diff --git a/Documentation/driver-api/pm/energy-model.rst 
b/Documentation/driver-api/pm/energy-model.rst
new file mode 100644
index ..c447528c4e29
--- /dev/null
+++ b/Documentation/driver-api/pm/energy-model.rst
@@ -0,0 +1,150 @@
+
+Energy Model of CPUs
+
+
+Overview
+
+
+The Energy Model (EM) framework serves as an interface between drivers knowing
+the power consumed by CPUs at various performance levels, and the kernel
+subsystems willing to use that information to make energy-aware decisions.
+
+The source of the information about the power consumed by CPUs can vary greatly
+from one platform to another. These power costs can be estimated using
+devicetree data in some cases. In others, the firmware will know better.
+Alternatively, userspace might be best positioned. And so on. To avoid having
+each and every client subsystem re-implement support for each and every
+possible source of information on its own, the EM framework acts as an
+abstraction layer which standardizes the format of power cost tables in the
+kernel, thus avoiding redundant work.
+
+The figure below depicts an example of drivers (Arm-specific here, but the
+approach is applicable to any architecture) providing power costs to the EM
+framework, and interested clients reading the data from it.
+
+.. code-block:: none
+
+  +---+  +-+  +---+
+  | Thermal (IPA) |  | Scheduler (EAS) |  | Other |
+  +---+  +-+  +---+
+  |   | em_pd_energy()|
+  |   | em_cpu_get()  |
+  +-+ | +-+
+| | |
+v v v
+   +-+
+   |Energy Model |
+   | Framework   |
+   +-+
+  ^   ^   ^
+  |   |   | em_register_perf_domain()
+   +--+   |   +-+
+   |  | |
+   +---+  +---+  +--+
+   |  cpufreq-dt   |  |   arm_scmi|  |Other |
+   +---+  +---+  +--+
+   ^  ^ ^
+   |  | |
+   +--+   +---+  +--+
+   | Device Tree  |   |   Firmware|  |  ?   |
+   +--+   +---+  +--+
+
+The EM framework manages power cost tables per 'performance domain' in the
+system. A performance domain is a group of CPUs whose performance is scaled
+together. Performance domains generally have a 1-to-1 mapping with CPUFreq
+policies. All CPUs in a performance domain are required to have the same
+micro-architecture. CPUs in different performance domains can have different
+micro-architectures.
+
+
+Core APIs overview
+==
+
+Config options
+--
+
+`CONFIG_ENERGY_MODEL` must be enabled to use the EM framework.
+
+
+Registration of performance domains
+---
+
+Drivers are expected to register performance domains into the EM framework by
+calling the :c:func:`em_register_perf_domain()` API. Drivers must specify the
+CPUs of the performance domains using a cpumask, and provide a callback
+function returning <frequency, power> tuples for each capacity state. The
+callback function provided by the driver is free to fetch data from any
+relevant location (DT, firmware, ...), and by any means deemed necessary.
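
To make that registration flow concrete, here is a minimal driver-side sketch
(illustrative only, not part of the patch: the estimate_power() body, the
500 MHz step and the power numbers are placeholders; only
em_register_perf_domain(), struct em_data_callback and EM_DATA_CB() come from
the framework):

```
#include <linux/cpumask.h>
#include <linux/energy_model.h>
#include <linux/kernel.h>

/*
 * Return, for the lowest capacity state at or above *KHz, its frequency
 * (in KHz) and power cost (in mW). A real driver would read DT or
 * firmware tables here; the numbers below are made up.
 */
static int estimate_power(unsigned long *mW, unsigned long *KHz, int cpu)
{
	*KHz = DIV_ROUND_UP(*KHz + 1, 500000) * 500000;	/* next 500 MHz step */
	*mW = 50 * (*KHz / 500000);			/* grows with frequency */
	return 0;
}

static struct em_data_callback em_cb = EM_DATA_CB(estimate_power);
static cpumask_t pd_cpus;	/* CPUs of one performance domain */

static int example_register_em(void)
{
	/* Typically the CPUs of one CPUFreq policy. */
	cpumask_copy(&pd_cpus, cpumask_of(0));

	/* Register one domain with, say, 5 capacity states. */
	return em_register_perf_domain(&pd_cpus, 5, &em_cb);
}
```

A client such as EAS would then look the table up with em_cpu_get(cpu) and feed
it to em_pd_energy(), as described in the next section.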
+
+
+Accessing performance domains
+-
+
+Subsystems interested in the energy model of a CPU can retrieve it using the
+:c:func:`em_cpu_get()` API. The energy model tables are allocated once upon
+creation of the performance domains, and kept in memory untouched.
+
+The energy consumed by a performance domain can be estimated using the
+:c:func:`em_pd_energy()` API. The estimation is performed assuming that the
+scheduti

[PATCH v2 0/3] Documentation: Explain EAS and EM

2019-01-21 Thread Quentin Perret
The recently introduced Energy Aware Scheduling (EAS) feature relies on
a large set of concepts, assumptions, and design choices that are
probably not obvious for an outsider. Moreover, enabling EAS on a
particular platform isn't straightforward because of all its
dependencies. This series tries to address this by introducing proper
documentation files for the scheduler's part of EAS and for the newly
introduced Energy Model (EM) framework. These are meant to explain not
only the design choices of EAS but also to list its dependencies in a
human-readable location.

Changes in v2:
 - Fixed typos and style in sched-energy.txt (Juri)
 - Moved EM doc under Documentation/driver-api/pm/ (Rafael)
 - Translated EM doc into .rst (Rafael)
 - Fixed EM kerneldoc comments to avoid htmldoc build errors

Quentin Perret (3):
  PM / EM: Fix broken kerneldoc
  PM / EM: Document the Energy Model framework
  sched: Document Energy Aware Scheduling

 Documentation/driver-api/pm/energy-model.rst | 150 +++
 Documentation/driver-api/pm/index.rst|   1 +
 Documentation/scheduler/sched-energy.txt | 431 +++
 include/linux/energy_model.h |   4 +-
 kernel/power/energy_model.c  |   2 +-
 5 files changed, 585 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/driver-api/pm/energy-model.rst
 create mode 100644 Documentation/scheduler/sched-energy.txt

-- 
2.20.1



[PATCH v2 1/3] PM / EM: Fix broken kerneldoc

2019-01-21 Thread Quentin Perret
Some of the kerneldoc comments about the Energy Model framework are
slightly broken, hence causing errors when compiling the html doc.

Fix them.

Signed-off-by: Quentin Perret 
---
 include/linux/energy_model.h | 4 ++--
 kernel/power/energy_model.c  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index aa027f7bcb3e..57889589e638 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -11,7 +11,7 @@
 
 #ifdef CONFIG_ENERGY_MODEL
 /**
- * em_cap_state - Capacity state of a performance domain
+ * struct em_cap_state - Capacity state of a performance domain
  * @frequency: The CPU frequency in KHz, for consistency with CPUFreq
  * @power: The power consumed by 1 CPU at this level, in milli-watts
  * @cost:  The cost coefficient associated with this level, used during
@@ -24,7 +24,7 @@ struct em_cap_state {
 };
 
 /**
- * em_perf_domain - Performance domain
+ * struct em_perf_domain - Performance domain
  * @table: List of capacity states, in ascending order
  * @nr_cap_states: Number of capacity states
  * @cpus:  Cpumask covering the CPUs of the domain
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index d9dc2c38764a..1e3a88ea4728 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -137,7 +137,7 @@ EXPORT_SYMBOL_GPL(em_cpu_get);
  * If multiple clients register the same performance domain, all but the first
  * registration will be ignored.
  *
- * Return 0 on success
+ * Return: 0 on success
  */
 int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
struct em_data_callback *cb)
-- 
2.20.1



[PATCH] pinctrl.txt: Remove outdated information

2019-01-21 Thread Ramon Fried
Returning -EAGAIN is no longer supported by pin_config_group_set()
since ad42fc6c8479 ("pinctrl: rip out the direct pinconf API")

Remove the relevant section from the documentation.

Signed-off-by: Ramon Fried 
---
 Documentation/driver-api/pinctl.rst | 9 -
 1 file changed, 9 deletions(-)

diff --git a/Documentation/driver-api/pinctl.rst 
b/Documentation/driver-api/pinctl.rst
index 6cb68d67fa75..2bb1bc484278 100644
--- a/Documentation/driver-api/pinctl.rst
+++ b/Documentation/driver-api/pinctl.rst
@@ -274,15 +274,6 @@ configuration in the pin controller ops like this::
.confops = &foo_pconf_ops,
};
 
-Since some controllers have special logic for handling entire groups of pins
-they can exploit the special whole-group pin control function. The
-pin_config_group_set() callback is allowed to return the error code -EAGAIN,
-for groups it does not want to handle, or if it just wants to do some
-group-level handling and then fall through to iterate over all pins, in which
-case each individual pin will be treated by separate pin_config_set() calls as
-well.
-
-
 Interaction with the GPIO subsystem
 ===
 
-- 
2.17.1



Re: [LKP] [/proc/stat] 3047027b34: reaim.jobs_per_min -4.8% regression

2019-01-21 Thread Kees Cook
On Fri, Jan 18, 2019 at 9:44 PM kernel test robot  wrote:
>
> Greeting,
>
> FYI, we noticed a -4.8% regression of reaim.jobs_per_min due to commit:
>
>
> commit: 3047027b34b8c6404b509903058b89836093acc7 ("[PATCH 2/2] /proc/stat: 
> Add sysctl parameter to control irq counts latency")
> url: 
> https://github.com/0day-ci/linux/commits/Waiman-Long/proc-stat-Reduce-irqs-counting-performance-overhead/20190108-104818

Is this expected? (And it seems like other things in the report below
are faster? I don't understand why this particular regression was
called out?)

-Kees

>
>
> in testcase: reaim
> on test machine: 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 
> 256G memory
> with following parameters:
>
> runtime: 300s
> nr_task: 5000
> test: shared_memory
> cpufreq_governor: performance
> ucode: 0x3d
>
> test-description: REAIM is an updated and improved version of AIM 7 benchmark.
> test-url: https://sourceforge.net/projects/re-aim-7/
>
>
>
> Details are as below:
> -->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml  # job file is attached in this email
> bin/lkp run job.yaml
>
> =
> compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase/ucode:
>   
> gcc-7/performance/x86_64-rhel-7.2/5000/debian-x86_64-2018-04-03.cgz/300s/lkp-hsw-ep5/shared_memory/reaim/0x3d
>
> commit:
>   51e8bce392 ("/proc/stat: Extract irqs counting code into show_stat_irqs()")
>   3047027b34 ("/proc/stat: Add sysctl parameter to control irq counts 
> latency")
>
> 51e8bce392dd2cc9 3047027b34b8c6404b50990305
>  --
>fail:runs  %reproductionfail:runs
>| | |
>   1:4  -25%:4 kmsg.igb#:#:#:exceed_max#second
>  %stddev %change %stddev
>  \  |\
> 101.96+7.5% 109.60reaim.child_systime
>  33.32-1.8%  32.73reaim.child_utime
>5534451-4.8%5271308reaim.jobs_per_min
>   1106-4.8%   1054reaim.jobs_per_min_child
>5800927-4.9%5517884reaim.max_jobs_per_min
>   5.42+5.0%   5.69reaim.parent_time
>   1.51+5.3%   1.59reaim.std_dev_time
>   29374932-2.8%   28558608reaim.time.minor_page_faults
>   1681+1.6%   1708
> reaim.time.percent_of_cpu_this_job_got
>   3841+4.5%   4012reaim.time.system_time
>   1234-4.4%   1180reaim.time.user_time
>   1850-2.7%   1800reaim.workload
>5495296 ±  9%  -9.5%4970496meminfo.DirectMap2M
>   5142 ± 18% -43.2%   2920 ± 46%  numa-vmstat.node0.nr_shmem
>  29.00 ± 32% +56.9%  45.50 ± 10%  vmstat.procs.r
>  67175 ± 37% +66.6% 111910 ± 20%  numa-meminfo.node0.AnonHugePages
>  20591 ± 18% -43.2%  11691 ± 46%  numa-meminfo.node0.Shmem
>  64688 ±  6% -36.8%  40906 ± 19%  slabinfo.kmalloc-8.active_objs
>  64691 ±  6% -36.8%  40908 ± 19%  slabinfo.kmalloc-8.num_objs
>  37.36 ±  7% +11.1%  41.53 ±  4%  boot-time.boot
>  29.15 ±  6% +14.3%  33.31 ±  3%  boot-time.dhcp
> 847.73 ±  9% +12.9% 957.09 ±  4%  boot-time.idle
> 202.50 ±100%+101.7% 408.50proc-vmstat.nr_mlock
>   8018 ±  9% -12.3%   7034 ±  2%  proc-vmstat.nr_shmem
>   29175944-2.8%   28369676proc-vmstat.numa_hit
>   29170351-2.8%   28364111proc-vmstat.numa_local
>   5439 ±  5% -18.7%   4423 ±  7%  proc-vmstat.pgactivate
>   30220220-2.8%   29374906proc-vmstat.pgalloc_normal
>   30182224-2.7%   29368266proc-vmstat.pgfault
>   30186671-2.8%   29341792proc-vmstat.pgfree
>  69510 ± 12% -34.2%  45759 ± 33%  sched_debug.cfs_rq:/.load.avg
>  30.21 ± 24% -33.6%  20.05 ± 20%  
> sched_debug.cfs_rq:/.runnable_load_avg.avg
>  66447 ± 12% -37.6%  41460 ± 37%  
> sched_debug.cfs_rq:/.runnable_weight.avg
>  12.35 ±  4% +88.0%  23.22 ± 15%  sched_debug.cpu.clock.stddev
>  12.35 ±  4% +88.0%  23.22 ± 15%  
> sched_debug.cpu.clock_task.stddev
>  30.06 ± 12% -26.5%  22.10 ± 13%  sched_debug.cpu.cpu_load[0].avg
>  29.37 ±  9% -22.6%  22.72 ± 13%  sched_debug.cpu.cpu_load[1].avg
>  28.71 ±  6% -21.1%  22.66 ± 16%  sched_debug.cpu.cpu_load[2].avg
>  17985   -12.0%  15823 ± 

[PATCH v9 0/3] watchdog: allow setting deadline for opening /dev/watchdogN

2019-01-21 Thread Rasmus Villemoes
If a watchdog driver tells the framework that the device is running,
the framework takes care of feeding the watchdog until userspace opens
the device. If the userspace application which is supposed to do that
never comes up properly, the watchdog is fed indefinitely by the
kernel. This can be especially problematic for embedded devices.

The existing handle_boot_enabled cmdline parameter/config option
partially solves that, but that is only usable for the subset of
hardware watchdogs that have (or can be configured by the bootloader
to have) a timeout that is sufficient to make it realistic for
userspace to come up. Many devices have timeouts of only a few
seconds, or even less, making handle_boot_enabled insufficient.

These patches allow one to set a maximum time for which the kernel
will feed the watchdog, thus ensuring that either userspace has come
up, or the board gets reset. This allows fallback logic in the
bootloader to attempt some recovery (for example, if an automatic
update is in progress, it could roll back to the previous version).
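
As an illustration only (not part of this series), a minimal sketch of the
userspace side that has to show up before the open timeout expires, using
the standard /dev/watchdog ioctl interface, could look like this (device
path and ping interval are placeholders):

  /*
   * Hypothetical minimal watchdog daemon.  Once this process has opened
   * the device, the kernel stops pinging the hardware on its behalf and
   * this loop has to do it instead.
   */
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/watchdog.h>

  int main(void)
  {
          int fd = open("/dev/watchdog0", O_WRONLY);

          if (fd < 0)
                  return 1;

          for (;;) {
                  ioctl(fd, WDIOC_KEEPALIVE, 0);  /* ping the hardware */
                  sleep(1);
          }
  }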

The patches have been tested on a Raspberry Pi 2 and a Wandboard.

Changes in v9: Make the unit seconds instead of milliseconds.

Changes in v8: Redo on top of 5.0-rc1 - in particular, adapt to the
jiffies->ktime_t conversion (1ff68820 "watchdog: core: make sure the
watchdog_worker is not deferred"). Add a patch to make the hardware
time out at the deadline, as requested by Guenter - which was actually
made very easy by the ktime_t conversion.

v7 submission at 


Rasmus Villemoes (3):
  watchdog: introduce watchdog.open_timeout commandline parameter
  watchdog: introduce CONFIG_WATCHDOG_OPEN_TIMEOUT
  watchdog: make the device time out at open_deadline when open_timeout
is used

 .../watchdog/watchdog-parameters.txt  |  8 
 drivers/watchdog/Kconfig  |  9 
 drivers/watchdog/watchdog_dev.c   | 42 +++
 3 files changed, 52 insertions(+), 7 deletions(-)

-- 
2.20.1



[PATCH v9 1/3] watchdog: introduce watchdog.open_timeout commandline parameter

2019-01-21 Thread Rasmus Villemoes
The watchdog framework takes care of feeding a hardware watchdog until
userspace opens /dev/watchdogN. If that never happens for some reason
(buggy init script, corrupt root filesystem or whatnot) but the kernel
itself is fine, the machine stays up indefinitely. This patch allows
setting an upper limit for how long the kernel will take care of the
watchdog, thus ensuring that the watchdog will eventually reset the
machine.

A value of 0 (the default) means infinite timeout, preserving the
current behaviour.
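
For example, booting with the (arbitrarily chosen) setting

  watchdog.open_timeout=60

gives userspace 60 seconds to open /dev/watchdogN; if nothing has opened
the device by then, the kernel stops feeding the hardware watchdog and the
board eventually resets.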

This is particularly useful for embedded devices where some fallback
logic is implemented in the bootloader (e.g., use a different root
partition, boot from network, ...).

There is already handle_boot_enabled serving a similar purpose. However,
such a binary choice is unsuitable if the hardware watchdog cannot be
programmed by the bootloader to provide a timeout long enough for
userspace to get up and running. Many of the embedded devices we see use
external (gpio-triggered) watchdogs with a fixed timeout of the order of
1-2 seconds.

The open timeout is also used as a maximum time for an application to
re-open /dev/watchdogN after closing it. Again, while the kernel already
has a nowayout mechanism, using that means userspace is at the mercy of
whatever timeout the hardware has.

Since this is a module parameter, one can revert to the ordinary behaviour
of having the kernel maintain the watchdog indefinitely by simply writing 0
to /sys/... after initially opening /dev/watchdog; conversely, one can of
course also keep the current behaviour of allowing indefinite time until
the first open, and only then set the module parameter.

Signed-off-by: Rasmus Villemoes 
---
 .../watchdog/watchdog-parameters.txt  |  8 +
 drivers/watchdog/watchdog_dev.c   | 30 +--
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/Documentation/watchdog/watchdog-parameters.txt 
b/Documentation/watchdog/watchdog-parameters.txt
index 0b88e333f9e1..907c4bb13810 100644
--- a/Documentation/watchdog/watchdog-parameters.txt
+++ b/Documentation/watchdog/watchdog-parameters.txt
@@ -8,6 +8,14 @@ See Documentation/admin-guide/kernel-parameters.rst for 
information on
 providing kernel parameters for builtin drivers versus loadable
 modules.
 
+The watchdog core parameter watchdog.open_timeout is the maximum time,
+in seconds, for which the watchdog framework will take care of pinging
+a hardware watchdog until userspace opens the corresponding
+/dev/watchdogN device. A value of 0 (the default) means an infinite
+timeout. Setting this to a non-zero value can be useful to ensure that
+either userspace comes up properly, or the board gets reset and allows
+fallback logic in the bootloader to try something else.
+
 
 -
 acquirewdt:
diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
index f6c24b22b37c..ab2ad20f13eb 100644
--- a/drivers/watchdog/watchdog_dev.c
+++ b/drivers/watchdog/watchdog_dev.c
@@ -69,6 +69,7 @@ struct watchdog_core_data {
struct mutex lock;
ktime_t last_keepalive;
ktime_t last_hw_keepalive;
+   ktime_t open_deadline;
struct hrtimer timer;
struct kthread_work work;
unsigned long status;   /* Internal status bits */
@@ -87,6 +88,19 @@ static struct kthread_worker *watchdog_kworker;
 static bool handle_boot_enabled =
IS_ENABLED(CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED);
 
+static unsigned open_timeout;
+
+static bool watchdog_past_open_deadline(struct watchdog_core_data *data)
+{
+   return ktime_after(ktime_get(), data->open_deadline);
+}
+
+static void watchdog_set_open_deadline(struct watchdog_core_data *data)
+{
+   data->open_deadline = open_timeout ?
+   ktime_get() + ktime_set(open_timeout, 0) : KTIME_MAX;
+}
+
 static inline bool watchdog_need_worker(struct watchdog_device *wdd)
 {
/* All variables in milli-seconds */
@@ -211,7 +225,13 @@ static bool watchdog_worker_should_ping(struct 
watchdog_core_data *wd_data)
 {
struct watchdog_device *wdd = wd_data->wdd;
 
-   return wdd && (watchdog_active(wdd) || watchdog_hw_running(wdd));
+   if (!wdd)
+   return false;
+
+   if (watchdog_active(wdd))
+   return true;
+
+   return watchdog_hw_running(wdd) && 
!watchdog_past_open_deadline(wd_data);
 }
 
 static void watchdog_ping_work(struct kthread_work *work)
@@ -297,7 +317,7 @@ static int watchdog_stop(struct watchdog_device *wdd)
return -EBUSY;
}
 
-   if (wdd->ops->stop) {
+   if (wdd->ops->stop && !open_timeout) {
clear_bit(WDOG_HW_RUNNING, &wdd->status);
err = wdd->ops->stop(wdd);
} else {
@@ -883,6 +903,7 @@ static int watchdog_release(struct inode *inode, struct 
file *file)
watchdog_ping(wdd);
}
 
+   watchdog_set_open_deadline(wd_data);
watchdog_update_worker(

[PATCH v9 2/3] watchdog: introduce CONFIG_WATCHDOG_OPEN_TIMEOUT

2019-01-21 Thread Rasmus Villemoes
This allows setting a default value for the watchdog.open_timeout
commandline parameter via Kconfig.

Some BSPs allow remote updating of the kernel image and root file
system, but updating the bootloader requires physical access. Hence, if
one has a firmware update that requires relaxing the
watchdog.open_timeout a little, the value used must be baked into the
kernel image itself and cannot come from the u-boot environment via the
kernel command line.
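
For example (the value is of course board specific), such a BSP could put

  CONFIG_WATCHDOG_OPEN_TIMEOUT=120

into its defconfig, and later relax or tighten it again purely through a
kernel update, without touching the bootloader environment.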

Being able to set the initial value in .config doesn't change the fact
that the value on the command line, if present, takes precedence. The
command line parameter is of course immensely useful for development
purposes while one has console access, and remains usable in the cases
where one can make a permanent update of the kernel command line.

Signed-off-by: Rasmus Villemoes 
---
 Documentation/watchdog/watchdog-parameters.txt | 8 
 drivers/watchdog/Kconfig   | 9 +
 drivers/watchdog/watchdog_dev.c| 5 +++--
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/Documentation/watchdog/watchdog-parameters.txt 
b/Documentation/watchdog/watchdog-parameters.txt
index 907c4bb13810..2fdbd07f9791 100644
--- a/Documentation/watchdog/watchdog-parameters.txt
+++ b/Documentation/watchdog/watchdog-parameters.txt
@@ -11,10 +11,10 @@ modules.
 The watchdog core parameter watchdog.open_timeout is the maximum time,
 in seconds, for which the watchdog framework will take care of pinging
 a hardware watchdog until userspace opens the corresponding
-/dev/watchdogN device. A value of 0 (the default) means an infinite
-timeout. Setting this to a non-zero value can be useful to ensure that
-either userspace comes up properly, or the board gets reset and allows
-fallback logic in the bootloader to try something else.
+/dev/watchdogN device. A value of 0 means an infinite timeout. Setting
+this to a non-zero value can be useful to ensure that either userspace
+comes up properly, or the board gets reset and allows fallback logic
+in the bootloader to try something else.
 
 
 -
diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index 57f017d74a97..e1de5beb4e80 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -63,6 +63,15 @@ config WATCHDOG_SYSFS
  Say Y here if you want to enable watchdog device status read through
  sysfs attributes.
 
+config WATCHDOG_OPEN_TIMEOUT
+   int "Timeout value for opening watchdog device"
+   default 0
+   help
+ The maximum time, in seconds, for which the watchdog framework takes
+ care of pinging a hardware watchdog.  A value of 0 means infinite. The
+ value set here can be overridden by the commandline parameter
+ "watchdog.open_timeout".
+
 #
 # General Watchdog drivers
 #
diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
index ab2ad20f13eb..b763080741cc 100644
--- a/drivers/watchdog/watchdog_dev.c
+++ b/drivers/watchdog/watchdog_dev.c
@@ -88,7 +88,7 @@ static struct kthread_worker *watchdog_kworker;
 static bool handle_boot_enabled =
IS_ENABLED(CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED);
 
-static unsigned open_timeout;
+static unsigned open_timeout = CONFIG_WATCHDOG_OPEN_TIMEOUT;
 
 static bool watchdog_past_open_deadline(struct watchdog_core_data *data)
 {
@@ -1206,4 +1206,5 @@ MODULE_PARM_DESC(handle_boot_enabled,
 
 module_param(open_timeout, uint, 0644);
 MODULE_PARM_DESC(open_timeout,
-   "Maximum time (in seconds, 0 means infinity) for userspace to take over 
a running watchdog (default=0)");
+   "Maximum time (in seconds, 0 means infinity) for userspace to take over 
a running watchdog (default="
+   __MODULE_STRING(CONFIG_WATCHDOG_OPEN_TIMEOUT) ")");
-- 
2.20.1



[PATCH v9 3/3] watchdog: make the device time out at open_deadline when open_timeout is used

2019-01-21 Thread Rasmus Villemoes
When the watchdog device is not open by userspace, the kernel takes
care of pinging it. When the open_timeout feature is in use, we should
ensure that the hardware fires close to open_timeout seconds after the
kernel has assumed responsibility for the device (either at boot, or
after userspace has had it open and magic-closed it).

To do this, simply reuse the logic that is already in place for
ensuring the same thing when userspace is responsible for regularly
pinging the device:

- When watchdog_active(wdd), this patch doesn't change anything.

- When !watchdog_active(wdd), the "virtual timeout" should be taken to
be ->open_deadline. When the open_timeout feature is not used (i.e.,
when open_timeout was 0 the last time watchdog_set_open_deadline was
called), ->open_deadline is KTIME_MAX, and the arithmetic ends up
returning keepalive_interval as we used to.

This has been tested on a Wandboard with various combinations of
open_timeout and timeout-sec properties for the on-board watchdog by
booting with 'init=/bin/sh', timestamping the lines on the serial
console, and comparing the timestamp of the 'imx2-wdt 20bc000.wdog:
timeout nnn sec' line with the timestamp of the 'U-Boot SPL ...'
line (which appears just after reset).

Suggested-by: Guenter Roeck 
Signed-off-by: Rasmus Villemoes 
---
 drivers/watchdog/watchdog_dev.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
index b763080741cc..8ca310c4a9e4 100644
--- a/drivers/watchdog/watchdog_dev.c
+++ b/drivers/watchdog/watchdog_dev.c
@@ -133,14 +133,15 @@ static ktime_t watchdog_next_keepalive(struct 
watchdog_device *wdd)
ktime_t virt_timeout;
unsigned int hw_heartbeat_ms;
 
-   virt_timeout = ktime_add(wd_data->last_keepalive,
-ms_to_ktime(timeout_ms));
+   if (watchdog_active(wdd))
+   virt_timeout = ktime_add(wd_data->last_keepalive,
+ms_to_ktime(timeout_ms));
+   else
+   virt_timeout = wd_data->open_deadline;
+
hw_heartbeat_ms = min_not_zero(timeout_ms, wdd->max_hw_heartbeat_ms);
keepalive_interval = ms_to_ktime(hw_heartbeat_ms / 2);
 
-   if (!watchdog_active(wdd))
-   return keepalive_interval;
-
/*
 * To ensure that the watchdog times out wdd->timeout seconds
 * after the most recent ping from userspace, the last
-- 
2.20.1



[PATCH V2] doc:it_IT: add translations in process/

2019-01-21 Thread Federico Vaga
This patch adds the Italian translation for the following documents
in Documentation/process:

- applying-patches
- submit-checklist
- submitting-drivers
- changes
- stable-api-nonsense

Signed-off-by: Federico Vaga 
---
 .../translations/it_IT/doc-guide/sphinx.rst   |   2 +
 .../it_IT/process/applying-patches.rst|  12 +-
 .../translations/it_IT/process/changes.rst| 487 +-
 .../it_IT/process/stable-api-nonsense.rst | 202 +++-
 .../it_IT/process/submit-checklist.rst| 127 -
 .../it_IT/process/submitting-drivers.rst  |   8 +-
 6 files changed, 822 insertions(+), 16 deletions(-)

diff --git a/Documentation/translations/it_IT/doc-guide/sphinx.rst 
b/Documentation/translations/it_IT/doc-guide/sphinx.rst
index 474b7e127893..793b5cc33403 100644
--- a/Documentation/translations/it_IT/doc-guide/sphinx.rst
+++ b/Documentation/translations/it_IT/doc-guide/sphinx.rst
@@ -3,6 +3,8 @@
 .. note:: Per leggere la documentazione originale in inglese:
  :ref:`Documentation/doc-guide/index.rst `
 
+.. _it_sphinxdoc:
+
 Introduzione
 
 
diff --git a/Documentation/translations/it_IT/process/applying-patches.rst 
b/Documentation/translations/it_IT/process/applying-patches.rst
index f5e9c7d0b16d..1d30e5cd2a3d 100644
--- a/Documentation/translations/it_IT/process/applying-patches.rst
+++ b/Documentation/translations/it_IT/process/applying-patches.rst
@@ -1,13 +1,15 @@
 .. include:: ../disclaimer-ita.rst
 
 :Original: :ref:`Documentation/process/applying-patches.rst `
-
+:Translator: Federico Vaga 
 
 .. _it_applying_patches:
 
-Applicare modifiche al kernel Linux
-===
+Applicare patch al kernel Linux

 
-.. warning::
+.. note::
 
-TODO ancora da tradurre
+   Questo documento è obsoleto.  Nella maggior parte dei casi, piuttosto
+   che usare ``patch`` manualmente, vorrete usare Git.  Per questo motivo
+   il documento non verrà tradotto.
diff --git a/Documentation/translations/it_IT/process/changes.rst 
b/Documentation/translations/it_IT/process/changes.rst
index 956cf95a1214..d0874327f301 100644
--- a/Documentation/translations/it_IT/process/changes.rst
+++ b/Documentation/translations/it_IT/process/changes.rst
@@ -1,12 +1,495 @@
 .. include:: ../disclaimer-ita.rst
 
 :Original: :ref:`Documentation/process/changes.rst `
+:Translator: Federico Vaga 
 
 .. _it_changes:
 
 Requisiti minimi per compilare il kernel
 
 
-.. warning::
+Introduzione
+
 
-TODO ancora da tradurre
+Questo documento fornisce una lista dei software necessari per eseguire i
+kernel 4.x.
+
+Questo documento è basato sul file "Changes" del kernel 2.0.x e quindi le
+persone che lo scrissero meritano credito (Jared Mauch, Axel Boldt,
+Alessandro Sigala, e tanti altri nella rete).
+
+Requisiti minimi correnti
+*
+
+Prima di pensare d'avere trovato un baco, aggiornate i seguenti programmi
+**almeno** alla versione indicata!  Se non siete certi della versione che state
+usando, il comando indicato dovrebbe dirvelo.
+
+Questa lista presume che abbiate già un kernel Linux funzionante.  In aggiunta,
+non tutti gli strumenti sono necessari ovunque; ovviamente, se non avete un
+modem ISDN, per esempio, probabilmente non dovreste preoccuparvi di
+isdn4k-utils.
+
+====================== ================  ==========================================
+Programma              Versione minima   Comando per verificare la versione
+====================== ================  ==========================================
+GNU C                  4.6               gcc --version
+GNU make               3.81              make --version
+binutils               2.20              ld -v
+flex                   2.5.35            flex --version
+bison                  2.0               bison --version
+util-linux             2.10o             fdformat --version
+kmod                   13                depmod -V
+e2fsprogs              1.41.4            e2fsck -V
+jfsutils               1.1.3             fsck.jfs -V
+reiserfsprogs          3.6.3             reiserfsck -V
+xfsprogs               2.6.0             xfs_db -V
+squashfs-tools         4.0               mksquashfs -version
+btrfs-progs            0.18              btrfsck
+pcmciautils            004               pccardctl -V
+quota-tools            3.09              quota -V
+PPP                    2.4.0             pppd --version
+isdn4k-utils           3.1pre1           isdnctrl 2>&1|grep version
+nfs-utils              1.0.5             showmount --version
+procps                 3.2.0             ps --version
+oprofile               0.9               oprofiled --version
+udev                   081               udevd --version
+grub                   0.93              grub --version || grub-install --version
+mcelog                 0.6               mcelog --version
+iptables   1

Re: [LKP] [/proc/stat] 3047027b34: reaim.jobs_per_min -4.8% regression

2019-01-21 Thread Alexey Dobriyan
On Tue, Jan 22, 2019 at 09:02:53AM +1300, Kees Cook wrote:
> On Fri, Jan 18, 2019 at 9:44 PM kernel test robot  
> wrote:
> >
> > Greeting,
> >
> > FYI, we noticed a -4.8% regression of reaim.jobs_per_min due to commit:
> >
> >
> > commit: 3047027b34b8c6404b509903058b89836093acc7 ("[PATCH 2/2] /proc/stat: 
> > Add sysctl parameter to control irq counts latency")
> > url: 
> > https://github.com/0day-ci/linux/commits/Waiman-Long/proc-stat-Reduce-irqs-counting-performance-overhead/20190108-104818
> 
> Is this expected? (And it seems like other things in the report below
> are faster? I don't understand why this particular regression was
> called out?)

No, but the sysctl has been dropped, so the point is moot.


Re: [RESEND PATCH V8 05/11] KVM: Create architecture independent ROE skeleton

2019-01-21 Thread Chao Gao
On Mon, Jan 21, 2019 at 01:39:34AM +0200, Ahmed Abd El Mawgood wrote:
>This patch introduces a hypercall that can assist against subset of kernel
>rootkits, it works by place readonly protection in shadow PTE. The end
>result protection is also kept in a bitmap for each kvm_memory_slot and is
>used as reference when updating SPTEs. The whole goal is to protect the
>guest kernel static data from modification if attacker is running from
>guest ring 0, for this reason there is no hypercall to revert effect of
>Memory ROE hypercall. This patch doesn't implement integrity check on guest
>TLB so obvious attack on the current implementation will involve guest
>virtual address -> guest physical address remapping, but there are plans to
>fix that.

Hello Ahmed,

I don't quite understand the attack. Do you mean that even if a guest
page is protected by ROE, an attacker can remap its virtual address to
another, unprotected guest page by editing the guest page table?

Thanks
Chao