Re: [PATCH v2 05/11] Doc: DT: thermal: new irq-mode for trip point

2018-11-13 Thread Lukasz Luba
Hi Krzysztof,

Thanks for the comments.

On 11/12/18 9:51 AM, Krzysztof Kozlowski wrote:
> On Wed, 7 Nov 2018 at 18:10, Lukasz Luba  wrote:
>>
> 
> Subject prefix:
> dt-bindings: thermal:
> 
>> Thermal trip point gets new flag in DT: irq-mode.
>> Trip point may have a new explicit flag which indicate
>> IRQ support when the temperature is met (so the thermal framework
>> deos not need to set polling for it).
>> It is useful for 'passive' cooling trip point,
>> which now will not register for polling the temperature.
> 
> You wrap lines in weird way making it more difficult to read.
> I already asked about this while reviewing v1. Please fix it.
> https://elixir.bootlin.com/linux/latest/source/Documentation/process/submitting-patches.rst#L633
>
I will fix it in v3.
>>
>> Update documentation about irq-mode for trip points.
>>
>> Cc: Zhang Rui 
>> Cc: Eduardo Valentin 
>> Cc: Daniel Lezcano 
>> Cc: Rob Herring 
>> Cc: Mark Rutland 
>> Cc: devicet...@vger.kernel.org
>> Signed-off-by: Lukasz Luba 
>> ---
>>   Documentation/devicetree/bindings/thermal/thermal.txt | 7 +++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/thermal/thermal.txt 
>> b/Documentation/devicetree/bindings/thermal/thermal.txt
>> index ca14ba9..bee21e3 100644
>> --- a/Documentation/devicetree/bindings/thermal/thermal.txt
>> +++ b/Documentation/devicetree/bindings/thermal/thermal.txt
>> @@ -90,6 +90,10 @@ Required properties:
>>  "critical": Hardware not reliable.
>> Type: string
>>
>> +- irq-mode: A flag indicating that trip rises irq, so there is no
> 
> "rises IRQ" (it is an abbreviation).
ACK
> 
> Best regards,
> Krzysztof
> 


Regards,
Lukasz


Re: [PATCH v2 06/11] arm64: dts: exynos5433: add support for thermal trip irq-mode

2018-11-13 Thread Lukasz Luba
Hi Krzysztof,

On 11/12/18 10:00 AM, Krzysztof Kozlowski wrote:
> Thanks Lukasz for patches. I like your work!
> 
> I have few more comments which will probably apply for all the DTS commits.
> 
> On Wed, 7 Nov 2018 at 18:10, Lukasz Luba  wrote:
>>
>> This patch adds support for new flag which indicates
> 
> This patch does not add support. Support for flag was added in your
> first drivers/thermal patches in this set. This patch adds new flag.
> (and does something more, so read on)
Got it, will change the commit message.
> 
>> that trip point triggers IRQ when temperature is met.
>> Exynos5433 supports 8 trip point which will trigger IRQ.
> 
> /8 trip point/8 trip points/
> 
>> Above that number other trip points should be registered
>> without 'irq-mode' flag.
> 
> Please fix the line-wrap.
OK
> 
>>
>> Cc: Kukjin Kim 
>> Cc: Krzysztof Kozlowski 
>> Cc: devicet...@vger.kernel.org
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-samsung-...@vger.kernel.org
>> Signed-off-by: Lukasz Luba 
>> ---
>>   arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi | 105 
>> -
>>   1 file changed, 70 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi 
>> b/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
>> index fe3a0b1..c4330f6 100644
>> --- a/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
>> +++ b/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
>> @@ -17,37 +17,44 @@ thermal-zones {
>>  atlas0_alert_0: atlas0-alert-0 {
>>  temperature = <65000>;  /* millicelsius */
>>  hysteresis = <1000>;/* millicelsius */
>> -   type = "active";
>> +   type = "passive";
> 
> This change is not explained in commit msg.
> 
> Basically you are doing here two things (related to each other). You
> clearly define which trip points are IRQ-type and you correct the type
> of trip point. Active is incorrect. This second thing is missing in
> your commit msg and I believe it is main reason behind this patch. You
> should focus then on this reason - start from it. Subject could be
> like "Use proper passive type for thermal trip points".
> 
> Without this explanation I don't see the strong reason behind that
> patch. IOW, everything was working fine before... so why adding this
> new flag? Or maybe something was not fine... and then it is real
> reason why you are sending the patch. Usually commit message should
> answer to the most important question "WHY". Now, it explains
> "WHAT"... but I can see it from the source code. However from the code
> it is not easy to guess WHY.

OK, I will add a 'why' explanation, similar to the description from the
cover letter, to all the patches which add this flag in DT files.

> 
> Best regards,
> Krzysztof
> 
> 

Thank you for the review.

Regards,
Lukasz


Re: [PATCH v2 05/11] Doc: DT: thermal: new irq-mode for trip point

2018-11-13 Thread Lukasz Luba
Hi Rob,

On 11/12/18 8:09 PM, Rob Herring wrote:
> On Wed, Nov 07, 2018 at 06:09:47PM +0100, Lukasz Luba wrote:
>> Thermal trip point gets new flag in DT: irq-mode.
>> Trip point may have a new explicit flag which indicate
>> IRQ support when the temperature is met (so the thermal framework
>> deos not need to set polling for it).
>> It is useful for 'passive' cooling trip point,
>> which now will not register for polling the temperature.
>>
>> Update documentation about irq-mode for trip points.
> 
> This patch should come before you use it.
OK, I will re-order the patch set in v3.
> 
>>
>> Cc: Zhang Rui 
>> Cc: Eduardo Valentin 
>> Cc: Daniel Lezcano 
>> Cc: Rob Herring 
>> Cc: Mark Rutland 
>> Cc: devicet...@vger.kernel.org
>> Signed-off-by: Lukasz Luba 
>> ---
>>   Documentation/devicetree/bindings/thermal/thermal.txt | 7 +++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/thermal/thermal.txt 
>> b/Documentation/devicetree/bindings/thermal/thermal.txt
>> index ca14ba9..bee21e3 100644
>> --- a/Documentation/devicetree/bindings/thermal/thermal.txt
>> +++ b/Documentation/devicetree/bindings/thermal/thermal.txt
>> @@ -90,6 +90,10 @@ Required properties:
>>  "critical": Hardware not reliable.
>> Type: string
>>   
>> +- irq-mode: A flag indicating that trip rises irq, so there is no
>> +  Type: bool  need of polling in thermal framework.
>> +  Size: one cell
> 
> Should be optional, right?
Yes, it is optional. I will mention it in v3.
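For illustration only, a trip point using the proposed optional flag might look like this (node and property names other than irq-mode are taken from the existing thermal bindings; this is not the final v3 wording):

```dts
trips {
	cpu_alert0: cpu-alert-0 {
		temperature = <65000>;	/* millicelsius */
		hysteresis = <1000>;	/* millicelsius */
		type = "passive";
		/* proposed optional flag: this trip raises an IRQ,
		 * so the thermal framework does not need to poll it */
		irq-mode;
	};
};
```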

Thank you for the review.

Regards,
Lukasz


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Oleksandr Natalenko

Hi.


KSM by default works only on memory that was added by
madvise().

And the only ways to get it working on other applications are:
  * Use LD_PRELOAD and libraries
  * Patch the kernel

Let's use the kernel task list and add logic to import VMAs from tasks.

That behaviour is controlled by new attributes:
  * mode:
    I tried to mimic the hugepages attribute, so mode has two states:
      * madvise  - old default behaviour
      * always [new] - allow ksm to get tasks' VMAs and
                       try working on them.
  * seeker_sleep_millisecs:
    Add pauses between importing tasks' VMAs

For rate-limiting purposes and to bound tasklist locking time,
the ksm seeker thread only imports VMAs from one task per loop.

Some numbers from different non-madvised workloads.
Formulas:
  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
  Memory used = free -h
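As a sketch, the formulas above expressed in Python (the counter values are made up for illustration and roughly match the 'K8s test VM' entry below):

```python
def ksm_stats(pages_sharing, pages_shared, pages_unshared, page_size_kib=4):
    """Dedup ratio and memory saved from KSM's sysfs counters.

    The three counters mirror the files under /sys/kernel/mm/ksm/;
    4 KiB pages are assumed, as in the formulas above.
    """
    deduped = pages_sharing - pages_shared
    ratio = 100.0 * deduped / pages_unshared      # percentage ratio
    saved_mib = deduped * page_size_kib / 1024    # memory saved, MiB
    return ratio, saved_mib

# Made-up counters: 40960 deduped pages out of 102400 unshared ones.
ratio, saved = ksm_stats(45000, 4040, 102400)
print(round(ratio), round(saved))  # → 40 160
```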

  * Name: My working laptop
Description: Many different chrome/electron apps + KDE
Ratio: 5%
Saved: ~100  MiB
Used:  ~2000 MiB

  * Name: K8s test VM
Description: Some small random running docker images
Ratio: 40%
Saved: ~160 MiB
Used:  ~920 MiB

  * Name: Ceph test VM
Description: Ceph Mon/OSD, some containers
Ratio: 20%
Saved: ~60 MiB
Used:  ~600 MiB

  * Name: BareMetal K8s backend server
Description: Different server apps in containers C, Java, GO & etc
Ratio: 72%
Saved: ~5800 MiB
Used:  ~35.7 GiB

  * Name: BareMetal K8s processing server
Description: Many instance of one CPU intensive application
Ratio: 55%
Saved: ~2600 MiB
Used:  ~28.0 GiB

  * Name: BareMetal Ceph node
Description: Only OSD storage daemons running
Ratio: 2%
Saved: ~190 MiB
Used:  ~11.7 GiB


Out of curiosity, have you compared these results with UKSM [1]?

Thanks.

--
  Oleksandr Natalenko (post-factum)

[1] https://github.com/dolohow/uksm


Re: [PATCH v3 2/3] dt-bindings: i3c: Document Synopsys DesignWare I3C master bindings

2018-11-13 Thread vitor

Hi Boris,

On 12/11/18 09:17, Boris Brezillon wrote:

On Thu,  8 Nov 2018 17:14:10 +
Vitor soares  wrote:


Document Synopsys DesignWare I3C master module

Signed-off-by: Vitor soares 
---
Changes in v3:
- Remove dummy characters

Changes in v2:
- Address the changes in Documentation/devicetree/bindings/i3c/i3c.txt
- Add controller version on compatible string

  .../devicetree/bindings/i3c/snps,dw-i3c-master.txt | 42 ++
  1 file changed, 42 insertions(+)
  create mode 100644 
Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt

diff --git a/Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt 
b/Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
new file mode 100644
index 000..b930c09
--- /dev/null
+++ b/Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
@@ -0,0 +1,42 @@
+Bindings for Synopsys DesignWare I3C master block
+=
+
+Required properties:
+
+- compatible: shall be "snps,dw-i3c-master-1.00a"
+- clocks: shall reference the core_clk

Are you sure this IP requires only one clock? If you're unsure,
you'd better have a clock-names prop.

Yes, the controller only has this clock dependency to configure the SCL.



+- interrupts: the interrupt line connected to this I3C master
+- reg: Offset and length of I3C master registers
+
+Mandatory properties defined by the generic binding (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details):
+
+- #address-cells: shall be set to 3
+- #size-cells: shall be set to 0
+
+Optional properties defined by the generic binding (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details):
+
+- i2c-scl-hz
+- i3c-scl-hz
+
+I3C device connected on the bus follow the generic description (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details).
+
+Example:
+
+   i3c-master@2000 {
+   compatible = "snps,dw-i3c-master-1.00a";
+   #address-cells = <3>;
+   #size-cells = <0>;
+   reg = <0x02000 0x1000>;
+   interrupts = <0>;
+   clocks = <&i3cclk>;
+
+   eeprom@57{
+   compatible = "atmel,24c01";
+   reg = < 0x57 0x0 0x10>;

   ^ remove this white space.


+   pagesize = <0x8>;
+   };
+   };
+


Best regards,

Vitor Soares



Re: [PATCH v3 2/3] dt-bindings: i3c: Document Synopsys DesignWare I3C master bindings

2018-11-13 Thread vitor

Hi Rob,

On 13/11/18 02:09, Rob Herring wrote:

On Thu, Nov 08, 2018 at 05:14:10PM +, Vitor soares wrote:

Document Synopsys DesignWare I3C master module

Signed-off-by: Vitor soares 

Please make your author and S-o-b emails match.


This is an alias, and both emails will work. Anyway, I agree that it should
be the same one that appears in the mailing list, and I will change it.



---
Changes in v3:
- Remove dummy characters

Changes in v2:
- Address the changes in Documentation/devicetree/bindings/i3c/i3c.txt
- Add controller version on compatible string

  .../devicetree/bindings/i3c/snps,dw-i3c-master.txt | 42 ++
  1 file changed, 42 insertions(+)
  create mode 100644 
Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt

Otherwise,

Reviewed-by: Rob Herring 


Thanks for review.


Best regards,

Vitor Soares



[PATCH] Documentation/features: Refresh the features list to v4.20-rc2

2018-11-13 Thread Ingo Molnar


* Palmer Dabbelt  wrote:

> I didn't even know this existed until David submitted a patch set that
> included updates to the documentation as a result of some features he
> added to RISC-V.  It looks like there may be a handful of other people
> who don't know this exists either, so I figured I'd just mail out a
> patch set containing all the updates split out as well as I can.
> 
> This smells like something that should be automatic, so if I'm jumping
> the gun here then feel free to drop this.  If nobody says anything then
> I guess I'll submit this as a separate PR to Linus from my personal
> tree, as it's not really a RISC-V thing but it seems like it's worth
> having docs that match the code where it's trivial -- I assume that's
> what this does, I didn't actually read anything but the diff because I
> never trust documentation to be up to date...
> 
> I feel compelled to say something like "maybe this should be part of
> checkpatch?", but I'm definitely not looking to learn perl :)

I don't think it should be automated or part of checkpatch, but I 
(obviously) agree with the changes, except that I think it should be a 
single patch (combined patch attached below).

Thanks,

Ingo

Subject: Documentation/features: Refresh the features list to v4.20-rc2
From: Palmer Dabbelt 

Run Documentation/features/scripts/feature-refresh.sh to refresh the 
kernel features support matrix list:

 - The new 'csky' architecture was added
 - s390    now supports KASAN
 - powerpc now supports stackprotector
 - xtensa  now supports sg-chain
 - arm64   now supports queued-spinlocks
 - parisc  now supports kprobes-events
 - RISC-V  now supports pte_special

[ mingo: combined the patches and the changelogs. ]

Signed-off-by: Palmer Dabbelt 
Signed-off-by: Ingo Molnar 
---

 Documentation/features/core/cBPF-JIT/arch-support.txt  | 1 +
 Documentation/features/core/eBPF-JIT/arch-support.txt  | 1 +
 Documentation/features/core/generic-idle-thread/arch-support.txt   | 1 +
 Documentation/features/core/jump-labels/arch-support.txt   | 1 +
 Documentation/features/core/tracehook/arch-support.txt | 1 +
 Documentation/features/debug/KASAN/arch-support.txt| 3 ++-
 Documentation/features/debug/gcov-profile-all/arch-support.txt | 1 +
 Documentation/features/debug/kgdb/arch-support.txt | 1 +
 Documentation/features/debug/kprobes-on-ftrace/arch-support.txt| 1 +
 Documentation/features/debug/kprobes/arch-support.txt  | 1 +
 Documentation/features/debug/kretprobes/arch-support.txt   | 1 +
 Documentation/features/debug/optprobes/arch-support.txt| 1 +
 Documentation/features/debug/stackprotector/arch-support.txt   | 3 ++-
 Documentation/features/debug/uprobes/arch-support.txt  | 1 +
 Documentation/features/debug/user-ret-profiler/arch-support.txt| 1 +
 Documentation/features/io/dma-contiguous/arch-support.txt  | 1 +
 Documentation/features/io/sg-chain/arch-support.txt| 3 ++-
 Documentation/features/locking/cmpxchg-local/arch-support.txt  | 1 +
 Documentation/features/locking/lockdep/arch-support.txt| 1 +
 Documentation/features/locking/queued-rwlocks/arch-support.txt | 1 +
 Documentation/features/locking/queued-spinlocks/arch-support.txt   | 3 ++-
 Documentation/features/locking/rwsem-optimized/arch-support.txt| 1 +
 Documentation/features/perf/kprobes-event/arch-support.txt | 3 ++-
 Documentation/features/perf/perf-regs/arch-support.txt | 1 +
 Documentation/features/perf/perf-stackdump/arch-support.txt| 1 +
 Documentation/features/sched/membarrier-sync-core/arch-support.txt | 1 +
 Documentation/features/sched/numa-balancing/arch-support.txt   | 1 +
 Documentation/features/seccomp/seccomp-filter/arch-support.txt | 1 +
 Documentation/features/time/arch-tick-broadcast/arch-support.txt   | 1 +
 Documentation/features/time/clockevents/arch-support.txt   | 1 +
 Documentation/features/time/context-tracking/arch-support.txt  | 1 +
 Documentation/features/time/irq-time-acct/arch-support.txt | 1 +
 Documentation/features/time/modern-timekeeping/arch-support.txt| 1 +
 Documentation/features/time/virt-cpuacct/arch-support.txt  | 1 +
 Documentation/features/vm/ELF-ASLR/arch-support.txt| 1 +
 Documentation/features/vm/PG_uncached/arch-support.txt | 1 +
 Documentation/features/vm/THP/arch-support.txt | 1 +
 Documentation/features/vm/TLB/arch-support.txt | 1 +
 Documentation/features/vm/huge-vmap/arch-support.txt   | 1 +
 Documentation/features/vm/ioremap_prot/arch-support.txt| 1 +
 Documentation/features/vm/numa-memblock/arch-support.txt   | 1 +
 Documentation/features/vm/pte_special/arch-support.txt | 3 ++-
 42 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/Documentation/feat

Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
вт, 13 нояб. 2018 г. в 04:49, Matthew Wilcox :
>
> On Tue, Nov 13, 2018 at 02:13:44AM +0300, Timofey Titovets wrote:
> > Some numbers from different not madvised workloads.
> > Formulas:
> >   Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
> >   Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
> >   Memory used = free -h
> >
> >   * Name: My working laptop
> > Description: Many different chrome/electron apps + KDE
> > Ratio: 5%
> > Saved: ~100  MiB
> > Used:  ~2000 MiB
>
> Your _laptop_ saves 100MB of RAM?  That's extraordinary.  Essentially
> that's like getting an extra 100MB of page cache for free.  Is there
> any observable slowdown?  I could even see there being a speedup (due
> to your working set being allowed to be 5% larger)
>
> I am now a big fan of this patch and shall try to give it the review
> that it deserves.

I'm not sure if this is sarcasm;
anyway, I'll try to do my best to get this working.

On any x86 desktop with a mixed load (browser, docs, games, etc.)
you will always see something like 40-200 MiB of deduped pages,
depending on the type of load, of course.

I'm just not trying to use these numbers as a reason to get general KSM
deduplication into the kernel,
because in the current generation, with several gigabytes of memory,
a few saved MiB does not look serious to most people.

Thanks!


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
вт, 13 нояб. 2018 г. в 05:25, Pavel Tatashin :
>
> On 18-11-13 02:13:44, Timofey Titovets wrote:
> > From: Timofey Titovets 
> >
> > ksm by default working only on memory that added by
> > madvise().
> >
> > And only way get that work on other applications:
> >   * Use LD_PRELOAD and libraries
> >   * Patch kernel
> >
> > Lets use kernel task list and add logic to import VMAs from tasks.
> >
> > That behaviour controlled by new attributes:
> >   * mode:
> > I try mimic hugepages attribute, so mode have two states:
> >   * madvise  - old default behaviour
> >   * always [new] - allow ksm to get tasks vma and
> >try working on that.
> >   * seeker_sleep_millisecs:
> > Add pauses between imports tasks VMA
> >
> > For rate limiting proporses and tasklist locking time,
> > ksm seeker thread only import VMAs from one task per loop.
> >
> > Some numbers from different not madvised workloads.
> > Formulas:
> >   Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
> >   Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
> >   Memory used = free -h
> >
> >   * Name: My working laptop
> > Description: Many different chrome/electron apps + KDE
> > Ratio: 5%
> > Saved: ~100  MiB
> > Used:  ~2000 MiB
> >
> >   * Name: K8s test VM
> > Description: Some small random running docker images
> > Ratio: 40%
> > Saved: ~160 MiB
> > Used:  ~920 MiB
> >
> >   * Name: Ceph test VM
> > Description: Ceph Mon/OSD, some containers
> > Ratio: 20%
> > Saved: ~60 MiB
> > Used:  ~600 MiB
> >
> >   * Name: BareMetal K8s backend server
> > Description: Different server apps in containers C, Java, GO & etc
> > Ratio: 72%
> > Saved: ~5800 MiB
> > Used:  ~35.7 GiB
> >
> >   * Name: BareMetal K8s processing server
> > Description: Many instance of one CPU intensive application
> > Ratio: 55%
> > Saved: ~2600 MiB
> > Used:  ~28.0 GiB
> >
> >   * Name: BareMetal Ceph node
> > Description: Only OSD storage daemons running
> > Raio: 2%
> > Saved: ~190 MiB
> > Used:  ~11.7 GiB
> >
> > Changes:
> >   v1 -> v2:
> > * Rebase on v4.19.1 (must also apply on 4.20-rc2+)
> >   v2 -> v3:
> > * Reformat patch description
> > * Rename mode normal to madvise
> > * Add some memory numbers
> > * Fix checkpatch.pl warnings
> > * Separate ksm vma seeker to another kthread
> > * Fix: "BUG: scheduling while atomic: ksmd"
> >   by move seeker to another thread
> >
> > Signed-off-by: Timofey Titovets 
> > CC: Matthew Wilcox 
> > CC: linux...@kvack.org
> > CC: linux-doc@vger.kernel.org
> > ---
> >  Documentation/admin-guide/mm/ksm.rst |  15 ++
> >  mm/ksm.c | 215 +++
> >  2 files changed, 198 insertions(+), 32 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/ksm.rst 
> > b/Documentation/admin-guide/mm/ksm.rst
> > index 9303786632d1..7cffd47f9b38 100644
> > --- a/Documentation/admin-guide/mm/ksm.rst
> > +++ b/Documentation/admin-guide/mm/ksm.rst
> > @@ -116,6 +116,21 @@ run
> >  Default: 0 (must be changed to 1 to activate KSM, except if
> >  CONFIG_SYSFS is disabled)
> >
> > +mode
> > +* set always to allow ksm deduplicate memory of every process
> > +* set madvise to use only madvised memory
> > +
> > +Default: madvise (dedupulicate only madvised memory as in
> > +earlier releases)
> > +
> > +seeker_sleep_millisecs
> > +how many milliseconds ksmd task seeker should sleep try another
> > +task.
> > +e.g. ``echo 1000 > /sys/kernel/mm/ksm/seeker_sleep_millisecs``
> > +
> > +Default: 1000 (chosen for rate limit purposes)
> > +
> > +
> >  use_zero_pages
> >  specifies whether empty pages (i.e. allocated pages that only
> >  contain zeroes) should be treated specially.  When set to 1,
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 5b0894b45ee5..1a03b28b6288 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -273,6 +273,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
> >  /* Milliseconds ksmd should sleep between batches */
> >  static unsigned int ksm_thread_sleep_millisecs = 20;
> >
> > +/* Milliseconds ksmd seeker should sleep between runs */
> > +static unsigned int ksm_thread_seeker_sleep_millisecs = 1000;
> > +
> >  /* Checksum of an empty (zeroed) page */
> >  static unsigned int zero_checksum __read_mostly;
> >
> > @@ -295,7 +298,12 @@ static int ksm_nr_node_ids = 1;
> >  static unsigned long ksm_run = KSM_RUN_STOP;
> >  static void wait_while_offlining(void);
> >
> > +#define KSM_MODE_MADVISE 0
> > +#define KSM_MODE_ALWAYS  1
> > +static unsigned long ksm_mode = KSM_MODE_MADVISE;
> > +
> >  static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
> > +static DECLARE_WAIT_QUEUE_HEAD(ksm_seeker_thread_wait);
> >  static DEFINE_MUTEX(ksm_thread_mutex);
> >  static DEFINE_SPINLOCK(ksm_mmlist_lock);
> >
> > @@ -303,6 +3

[PATCH v4 0/8] clk: clkdev: managed clk lookup and provider registrations

2018-11-13 Thread Matti Vaittinen
This series adds bd71837/bd71847 PMIC clock support + managed interfaces.

A few clk drivers appear to be leaking clkdev lookup registrations at
driver remove. The patch series adds devm versions of the lookup
registrations and cleans up a few drivers. The driver clean-up patches have
not been tested as I lack the HW. All testing, and comments on whether
driver/device removal is even possible for the changed drivers, are highly
appreciated. If removal is not possible I will gladly drop the patches
from the series - although leaking lookups may serve as a bad example for new
developers =)
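As a rough sketch (not code from the series - helper names follow patch 1, and error handling is abbreviated), a sub-device driver probe using the new managed registrations might look like:

```c
/*
 * Hypothetical MFD sub-device clk driver probe using the proposed
 * devm helpers; foo_hw is assumed to be set up elsewhere.
 */
static int foo_clk_probe(struct platform_device *pdev)
{
	struct clk_hw *hw = &foo_hw;
	int ret;

	ret = devm_clk_hw_register(&pdev->dev, hw);
	if (ret)
		return ret;

	/* clkdev lookup is dropped automatically at driver detach */
	ret = devm_clk_hw_register_clkdev(&pdev->dev, hw, "foo-clk", NULL);
	if (ret)
		return ret;

	/* the clk DT properties live in the MFD parent's node */
	return devm_of_clk_add_parent_hw_provider(&pdev->dev,
						  of_clk_hw_simple_get, hw);
}
```

With all three registrations managed, the driver's remove callback no longer needs to undo them by hand.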

Patch 8 adds support for clock gate in ROHM bd71837 and bd71847 PMICs.
This change is included in the series because it depends on new managed
interfaces introduced in this series.

The bd718x7 driver and the devm interfaces were tested on a BeagleBoneBlack
and a bd71837 break-out board. The clk area register interface of the bd71847
is identical to the bd71837.

Changed drivers are:
clk-max77686, clk-st, clk-hi655x, rk808, clk-twl6040
and apcs-msm8916. New driver is clk-bd718x7

This series has been discussed for a while now. For those who want to
see the whole discussion:

The bd71837 driver was originally proposed here
https://lore.kernel.org/lkml/d99c8762b0fbbcb18ec4f4d104191364c0ea798c.1528117485.git.matti.vaitti...@fi.rohmeurope.com/

clk portion was separated from that series and devm variants were
proposed here
https://lore.kernel.org/linux-clk/cover.1535630942.git.matti.vaitti...@fi.rohmeurope.com/

Cleanup of the other drivers was initiated in this series while waiting for
the MFD portions of bd718x7 to be applied. Now that the MFD dependencies are
in-tree, this version (4) combines the bd718x7 driver back into this series.

Changelog (for this series) v4
- Add support for ROHM bd718x7 PMIC clock gate. Included in this patch
  series because it depends on managed interfaces added in patch 1.

Changelog (for this series) v3
Address issues spotted by Krzysztof Kozlowski
- Drop patch 3 for clk-s3c2410-dclk as this device can never be removed
- Fix indentation for clk-max77686
- Rest of the patches are unchanged.

Changelog (for this series) v2
Issue spotted by 0-Day test suite
- Add a stub function 'devm_of_clk_add_parent_hw_provider' for builds without OF support.
- patches 2-8 are unchanged.

This patch series is based on clk-next

---

Matti Vaittinen (8):
  clk: clkdev/of_clk - add managed lookup and provider registrations
  clk: clk-max77686: Clean clkdev lookup leak and use devm
  clk: clk-st: avoid clkdev lookup leak at remove
  clk: clk-hi655x: Free of_provider at remove
  clk: rk808: use managed version of of_provider registration
  clk: clk-twl6040: Free of_provider at remove
  clk: apcs-msm8916: simplify probe cleanup by using devm
  clk: bd718x7: Initial support for ROHM bd71837/bd71847 PMIC clock

 Documentation/driver-model/devres.txt |   3 +
 drivers/clk/Kconfig   |   7 ++
 drivers/clk/Makefile  |   1 +
 drivers/clk/clk-bd718x7.c | 131 ++
 drivers/clk/clk-hi655x.c  |   4 +-
 drivers/clk/clk-max77686.c|  29 ++--
 drivers/clk/clk-rk808.c   |  15 +---
 drivers/clk/clk-twl6040.c |   5 +-
 drivers/clk/clk.c |  28 ++--
 drivers/clk/clkdev.c  | 122 ---
 drivers/clk/qcom/apcs-msm8916.c   |   5 +-
 drivers/clk/x86/clk-st.c  |   3 +-
 include/linux/clk-provider.h  |  11 +++
 include/linux/clkdev.h|   4 ++
 14 files changed, 292 insertions(+), 76 deletions(-)
 create mode 100644 drivers/clk/clk-bd718x7.c

-- 
2.14.3


-- 
Matti Vaittinen
ROHM Semiconductors

~~~ "I don't think so," said Rene Descartes.  Just then, he vanished ~~~


[PATCH v4 1/8] clk: clkdev/of_clk - add managed lookup and provider registrations

2018-11-13 Thread Matti Vaittinen
With MFD devices the clk properties may be contained in the MFD (parent) DT
node. The current devm_of_clk_add_hw_provider assumes the clk is bound to the
MFD subdevice, not to the MFD device (parent). Add
devm_of_clk_add_parent_hw_provider to tackle this issue.

Also, clkdev registration lacks managed registration functions, and it
seems a few drivers do not drop clkdev lookups at exit. Add
devm_clk_hw_register_clkdev and devm_clk_release_clkdev to ease lookup
releasing at exit.

Signed-off-by: Matti Vaittinen 
---
 Documentation/driver-model/devres.txt |   3 +
 drivers/clk/clk.c |  28 ++--
 drivers/clk/clkdev.c  | 122 ++
 include/linux/clk-provider.h  |  11 +++
 include/linux/clkdev.h|   4 ++
 5 files changed, 136 insertions(+), 32 deletions(-)

diff --git a/Documentation/driver-model/devres.txt 
b/Documentation/driver-model/devres.txt
index 43681ca0837f..fac63760b01c 100644
--- a/Documentation/driver-model/devres.txt
+++ b/Documentation/driver-model/devres.txt
@@ -238,6 +238,9 @@ CLOCK
   devm_clk_put()
   devm_clk_hw_register()
   devm_of_clk_add_hw_provider()
+  devm_of_clk_add_parent_hw_provider()
+  devm_clk_hw_register_clkdev()
+  devm_clk_release_clkdev()
 
 DMA
   dmaenginem_async_device_register()
diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index af011974d4ec..9bb921eb90f6 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -3893,12 +3893,12 @@ static void devm_of_clk_release_provider(struct device 
*dev, void *res)
of_clk_del_provider(*(struct device_node **)res);
 }
 
-int devm_of_clk_add_hw_provider(struct device *dev,
+static int __devm_of_clk_add_hw_provider(struct device *dev,
struct clk_hw *(*get)(struct of_phandle_args *clkspec,
  void *data),
-   void *data)
+   struct device_node *of_node, void *data)
 {
-   struct device_node **ptr, *np;
+   struct device_node **ptr;
int ret;
 
ptr = devres_alloc(devm_of_clk_release_provider, sizeof(*ptr),
@@ -3906,10 +3906,9 @@ int devm_of_clk_add_hw_provider(struct device *dev,
if (!ptr)
return -ENOMEM;
 
-   np = dev->of_node;
-   ret = of_clk_add_hw_provider(np, get, data);
+   *ptr = of_node;
+   ret = of_clk_add_hw_provider(of_node, get, data);
if (!ret) {
-   *ptr = np;
devres_add(dev, ptr);
} else {
devres_free(ptr);
@@ -3917,8 +3916,25 @@ int devm_of_clk_add_hw_provider(struct device *dev,
 
return ret;
 }
+int devm_of_clk_add_hw_provider(struct device *dev,
+   struct clk_hw *(*get)(struct of_phandle_args *clkspec,
+ void *data),
+   void *data)
+{
+   return __devm_of_clk_add_hw_provider(dev, get, dev->of_node, data);
+}
 EXPORT_SYMBOL_GPL(devm_of_clk_add_hw_provider);
 
+int devm_of_clk_add_parent_hw_provider(struct device *dev,
+   struct clk_hw *(*get)(struct of_phandle_args *clkspec,
+ void *data),
+   void *data)
+{
+   return __devm_of_clk_add_hw_provider(dev, get, dev->parent->of_node,
+data);
+}
+EXPORT_SYMBOL_GPL(devm_of_clk_add_parent_hw_provider);
+
 /**
  * of_clk_del_provider() - Remove a previously registered clock provider
  * @np: Device node pointer associated with clock provider
diff --git a/drivers/clk/clkdev.c b/drivers/clk/clkdev.c
index 9ab3db8b3988..f6100b6e06fd 100644
--- a/drivers/clk/clkdev.c
+++ b/drivers/clk/clkdev.c
@@ -401,6 +401,25 @@ static struct clk_lookup *__clk_register_clkdev(struct 
clk_hw *hw,
return cl;
 }
 
+static int do_clk_register_clkdev(struct clk_hw *hw,
+   struct clk_lookup **cl, const char *con_id, const char *dev_id)
+{
+
+   if (IS_ERR(hw))
+   return PTR_ERR(hw);
+   /*
+* Since dev_id can be NULL, and NULL is handled specially, we must
+* pass it as either a NULL format string, or with "%s".
+*/
+   if (dev_id)
+   *cl = __clk_register_clkdev(hw, con_id, "%s",
+  dev_id);
+   else
+   *cl = __clk_register_clkdev(hw, con_id, NULL);
+
+   return *cl ? 0 : -ENOMEM;
+}
+
 /**
  * clk_register_clkdev - register one clock lookup for a struct clk
  * @clk: struct clk to associate with all clk_lookups
@@ -420,20 +439,10 @@ int clk_register_clkdev(struct clk *clk, const char 
*con_id,
 {
struct clk_lookup *cl;
 
-   if (IS_ERR(clk))
-   return PTR_ERR(clk);
-
-   /*
-* Since dev_id can be NULL, and NULL is handled specially, we must
-* pass it as either a NULL format string, or with "%s".
-*/
-   if (dev_id)
-   cl = __clk_register_clkdev(__clk_get_hw(c

Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
вт, 13 нояб. 2018 г. в 14:06, Oleksandr Natalenko :
>
> Hi.
>
> > ksm by default working only on memory that added by
> > madvise().
> >
> > And only way get that work on other applications:
> >   * Use LD_PRELOAD and libraries
> >   * Patch kernel
> >
> > Lets use kernel task list and add logic to import VMAs from tasks.
> >
> > That behaviour controlled by new attributes:
> >   * mode:
> > I try mimic hugepages attribute, so mode have two states:
> >   * madvise  - old default behaviour
> >   * always [new] - allow ksm to get tasks vma and
> >try working on that.
> >   * seeker_sleep_millisecs:
> > Add pauses between imports tasks VMA
> >
> > For rate limiting proporses and tasklist locking time,
> > ksm seeker thread only import VMAs from one task per loop.
> >
> > Some numbers from different not madvised workloads.
> > Formulas:
> >   Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
> >   Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
> >   Memory used = free -h
> >
> >   * Name: My working laptop
> > Description: Many different chrome/electron apps + KDE
> > Ratio: 5%
> > Saved: ~100  MiB
> > Used:  ~2000 MiB
> >
> >   * Name: K8s test VM
> > Description: Some small random running docker images
> > Ratio: 40%
> > Saved: ~160 MiB
> > Used:  ~920 MiB
> >
> >   * Name: Ceph test VM
> > Description: Ceph Mon/OSD, some containers
> > Ratio: 20%
> > Saved: ~60 MiB
> > Used:  ~600 MiB
> >
> >   * Name: BareMetal K8s backend server
> > Description: Different server apps in containers C, Java, GO & etc
> > Ratio: 72%
> > Saved: ~5800 MiB
> > Used:  ~35.7 GiB
> >
> >   * Name: BareMetal K8s processing server
> > Description: Many instance of one CPU intensive application
> > Ratio: 55%
> > Saved: ~2600 MiB
> > Used:  ~28.0 GiB
> >
> >   * Name: BareMetal Ceph node
> > Description: Only OSD storage daemons running
> > Ratio: 2%
> > Saved: ~190 MiB
> > Used:  ~11.7 GiB
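As a side note, the bookkeeping behind those rows is simple arithmetic over three KSM sysfs counters. A small C sketch of the cover letter's formulas (the sample values in the test are hypothetical, picked to reproduce the laptop row):

```c
#include <stdio.h>

/* Counters exported by KSM in /sys/kernel/mm/ksm/:
 * pages_shared   - KSM pages currently in use,
 * pages_sharing  - page table entries pointing at them,
 * pages_unshared - unique pages scanned but not merged. */
struct ksm_stats {
	long shared;
	long sharing;
	long unshared;
};

/* Memory saved in MiB, assuming 4 KiB pages as in the cover letter. */
static double ksm_saved_mib(struct ksm_stats s)
{
	return (s.sharing - s.shared) * 4.0 / 1024.0;
}

/* The cover letter's "percentage ratio": merged mappings relative to
 * pages that could not be merged (with the implicit factor of 100). */
static double ksm_ratio_pct(struct ksm_stats s)
{
	return 100.0 * (s.sharing - s.shared) / s.unshared;
}
```

With 25600 extra mappings on top of 1000 shared pages and 512000 unshared pages, this yields ~100 MiB saved at a 5% ratio, matching the laptop entry.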
>
> Out of curiosity, have you compared these results with UKSM [1]?
>
> Thanks.
>
> --
>Oleksandr Natalenko (post-factum)
>
> [1] https://github.com/dolohow/uksm

Long story short:
I tried to get the UKSM code into the kernel, and I mostly know how UKSM works.
Yes, UKSM implements the logic in a different way, but UKSM will _always_ have
the same or worse numbers (saved pages) compared to KSM.

Why?

Both scan VMA pages and filter VMAs by some flags (the same set of flags),
so they both see the same subset of pages that can be deduplicated.
But UKSM also tries not to dedup a VMA's pages if some of them change
too frequently - thrashing.
Because of that UKSM will skip some pages, while KSM will try to dedup all
pages that did not change between scans.

I am skipping over UKSM's internal logic, which can of course do better in
terms of resource usage (different hash implementation, different page tree),
but that doesn't matter in this case (if we talk about saved memory).

The only thing UKSM has that KSM can't do:
UKSM can add VMA memory by itself, via hooks in mm.

KSM currently needs help from madvise() for that.
That is the reason why I wrote this patchset for KSM.
(I also wrote some info to Pavel Tatashin in the mail above.)

Thanks!


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Jann Horn
On Tue, Nov 13, 2018 at 12:40 PM Timofey Titovets
 wrote:
> By default KSM works only on memory that has been added by
> madvise().
>
> The only ways to get it working on other applications are:
>   * Use LD_PRELOAD and libraries
>   * Patch the kernel
>
> Let's use the kernel task list and add logic to import VMAs from tasks.
>
> That behaviour is controlled by new attributes:
>   * mode:
> I tried to mimic the hugepages attribute, so mode has two states:
>   * madvise  - old default behaviour
>   * always [new] - allow KSM to get tasks' VMAs and
>try working on them.

Please don't. And if you really have to for some reason, put some big
warnings on this, advising people that it's a security risk.

KSM is one of the favorite punching bags of side-channel and hardware
security researchers:

As a gigantic, problematic side channel:
http://staff.aist.go.jp/k.suzaki/EuroSec2011-suzaki.pdf
https://www.usenix.org/system/files/conference/woot15/woot15-paper-barresi.pdf
https://access.redhat.com/blogs/766093/posts/1976303
https://gruss.cc/files/dedup.pdf

In particular https://gruss.cc/files/dedup.pdf ("Practical Memory
Deduplication Attacks in Sandboxed JavaScript") shows that KSM makes
it possible to use malicious JavaScript to determine whether a given
page of memory exists elsewhere on your system.

And also as a way to target rowhammer-based faults:
https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
https://thisissecurity.stormshield.com/2017/10/19/attacking-co-hosted-vm-hacker-hammer-two-memory-modules/


[PATCH v4 2/8] clk: clk-max77686: Clean clkdev lookup leak and use devm

2018-11-13 Thread Matti Vaittinen
clk-max77686 never cleans up the clkdev lookup at remove. This can cause
an oops if clk-max77686 is removed and inserted again. Fix the leak by
using the new devm clkdev lookup registration. Also simplify the error
path by using the new devm_of_clk_add_parent_hw_provider().

Signed-off-by: Matti Vaittinen 
Reviewed-by: Krzysztof Kozlowski 
---
 drivers/clk/clk-max77686.c | 29 +++--
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/drivers/clk/clk-max77686.c b/drivers/clk/clk-max77686.c
index 22c937644c93..e65925d5e412 100644
--- a/drivers/clk/clk-max77686.c
+++ b/drivers/clk/clk-max77686.c
@@ -235,8 +235,9 @@ static int max77686_clk_probe(struct platform_device *pdev)
return ret;
}
 
-   ret = clk_hw_register_clkdev(&max_clk_data->hw,
-max_clk_data->clk_idata.name, NULL);
+   ret = devm_clk_hw_register_clkdev(dev, &max_clk_data->hw,
+ max_clk_data->clk_idata.name,
+ NULL);
if (ret < 0) {
dev_err(dev, "Failed to clkdev register: %d\n", ret);
return ret;
@@ -244,8 +245,9 @@ static int max77686_clk_probe(struct platform_device *pdev)
}
 
if (parent->of_node) {
-   ret = of_clk_add_hw_provider(parent->of_node, of_clk_max77686_get,
-drv_data);
+   ret = devm_of_clk_add_parent_hw_provider(dev,
+of_clk_max77686_get,
+drv_data);
 
if (ret < 0) {
dev_err(dev, "Failed to register OF clock provider: %d\n",
@@ -261,27 +263,11 @@ static int max77686_clk_probe(struct platform_device *pdev)
 1 << MAX77802_CLOCK_LOW_JITTER_SHIFT);
if (ret < 0) {
dev_err(dev, "Failed to config low-jitter: %d\n", ret);
-   goto remove_of_clk_provider;
+   return ret;
}
}
 
return 0;
-
-remove_of_clk_provider:
-   if (parent->of_node)
-   of_clk_del_provider(parent->of_node);
-
-   return ret;
-}
-
-static int max77686_clk_remove(struct platform_device *pdev)
-{
-   struct device *parent = pdev->dev.parent;
-
-   if (parent->of_node)
-   of_clk_del_provider(parent->of_node);
-
-   return 0;
 }
 
 static const struct platform_device_id max77686_clk_id[] = {
@@ -297,7 +283,6 @@ static struct platform_driver max77686_clk_driver = {
.name  = "max77686-clk",
},
.probe = max77686_clk_probe,
-   .remove = max77686_clk_remove,
.id_table = max77686_clk_id,
 };
 
-- 
2.14.3


-- 
Matti Vaittinen
ROHM Semiconductors

~~~ "I don't think so," said Rene Descartes.  Just then, he vanished ~~~
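The devm conversions in this series all rely on the same mechanism: release actions registered against a device run automatically, in reverse order, when the device goes away, so drivers no longer need remove-time cleanup. A toy userspace analogue of that idea (not the kernel's actual devres implementation, just an illustration):

```c
#include <stdlib.h>

/* Toy analogue of devres: actions registered against a "device" run
 * automatically, in reverse order, when the device is released. */
struct devres {
	void (*release)(void *data);
	void *data;
	struct devres *next;
};

struct toy_device {
	struct devres *res;
};

static int devm_add(struct toy_device *dev,
		    void (*release)(void *), void *data)
{
	struct devres *r = malloc(sizeof(*r));

	if (!r)
		return -1;
	r->release = release;
	r->data = data;
	r->next = dev->res;	/* push: newest first */
	dev->res = r;
	return 0;
}

/* Called on remove: no driver-specific cleanup code is needed. */
static void toy_device_release(struct toy_device *dev)
{
	while (dev->res) {
		struct devres *r = dev->res;

		dev->res = r->next;
		r->release(r->data);
		free(r);
	}
}
```

This is why the patches above can simply delete their .remove callbacks: the clkdev lookup and of_provider teardown happen via the registered release actions.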


[PATCH v4 3/8] clk: clk-st: avoid clkdev lookup leak at remove

2018-11-13 Thread Matti Vaittinen
Use devm-based clkdev lookup registration to avoid leaking lookup
structures.

Signed-off-by: Matti Vaittinen 
---
 drivers/clk/x86/clk-st.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/clk/x86/clk-st.c b/drivers/clk/x86/clk-st.c
index 3a0996f2d556..25d4b97aff9b 100644
--- a/drivers/clk/x86/clk-st.c
+++ b/drivers/clk/x86/clk-st.c
@@ -52,7 +52,8 @@ static int st_clk_probe(struct platform_device *pdev)
0, st_data->base + MISCCLKCNTL1, OSCCLKENB,
CLK_GATE_SET_TO_DISABLE, NULL);
 
-   clk_hw_register_clkdev(hws[ST_CLK_GATE], "oscout1", NULL);
+   devm_clk_hw_register_clkdev(&pdev->dev, hws[ST_CLK_GATE], "oscout1",
+   NULL);
 
return 0;
 }
-- 
2.14.3


-- 
Matti Vaittinen
ROHM Semiconductors

~~~ "I don't think so," said Rene Descartes.  Just then, he vanished ~~~


[PATCH v4 4/8] clk: clk-hi655x: Free of_provider at remove

2018-11-13 Thread Matti Vaittinen
Use the devm variant for of_provider registration so the provider is freed
at exit.

Signed-off-by: Matti Vaittinen 
---
 drivers/clk/clk-hi655x.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/clk/clk-hi655x.c b/drivers/clk/clk-hi655x.c
index 403a0188634a..394d0109104d 100644
--- a/drivers/clk/clk-hi655x.c
+++ b/drivers/clk/clk-hi655x.c
@@ -107,8 +107,8 @@ static int hi655x_clk_probe(struct platform_device *pdev)
if (ret)
return ret;
 
-   return of_clk_add_hw_provider(parent->of_node, of_clk_hw_simple_get,
-&hi655x_clk->clk_hw);
+   return devm_of_clk_add_parent_hw_provider(&pdev->dev,
+   of_clk_hw_simple_get, &hi655x_clk->clk_hw);
 }
 
 static struct platform_driver hi655x_clk_driver = {
-- 
2.14.3


-- 
Matti Vaittinen
ROHM Semiconductors

~~~ "I don't think so," said Rene Descartes.  Just then, he vanished ~~~


[PATCH v4 5/8] clk: rk808: use managed version of of_provider registration

2018-11-13 Thread Matti Vaittinen
Simplify clean-up for rk808 by using the managed version of of_provider
registration.

Signed-off-by: Matti Vaittinen 
---
 drivers/clk/clk-rk808.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/clk/clk-rk808.c b/drivers/clk/clk-rk808.c
index 6461f2820a5b..177340edaae5 100644
--- a/drivers/clk/clk-rk808.c
+++ b/drivers/clk/clk-rk808.c
@@ -138,23 +138,12 @@ static int rk808_clkout_probe(struct platform_device *pdev)
if (ret)
return ret;
 
-   return of_clk_add_hw_provider(node, of_clk_rk808_get, rk808_clkout);
-}
-
-static int rk808_clkout_remove(struct platform_device *pdev)
-{
-   struct rk808 *rk808 = dev_get_drvdata(pdev->dev.parent);
-   struct i2c_client *client = rk808->i2c;
-   struct device_node *node = client->dev.of_node;
-
-   of_clk_del_provider(node);
-
-   return 0;
+   return devm_of_clk_add_parent_hw_provider(&pdev->dev,
+   of_clk_rk808_get, rk808_clkout);
 }
 
 static struct platform_driver rk808_clkout_driver = {
.probe = rk808_clkout_probe,
-   .remove = rk808_clkout_remove,
.driver = {
.name   = "rk808-clkout",
},
-- 
2.14.3


-- 
Matti Vaittinen
ROHM Semiconductors

~~~ "I don't think so," said Rene Descartes.  Just then, he vanished ~~~


[PATCH v4 6/8] clk: clk-twl6040: Free of_provider at remove

2018-11-13 Thread Matti Vaittinen
Use the devm variant for of_provider registration so the provider is freed
at exit.

Signed-off-by: Matti Vaittinen 
Acked-by: Peter Ujfalusi 
---
 drivers/clk/clk-twl6040.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/clk/clk-twl6040.c b/drivers/clk/clk-twl6040.c
index 25dfe050ae9f..e9da09453eb2 100644
--- a/drivers/clk/clk-twl6040.c
+++ b/drivers/clk/clk-twl6040.c
@@ -108,9 +108,8 @@ static int twl6040_pdmclk_probe(struct platform_device *pdev)
 
platform_set_drvdata(pdev, clkdata);
 
-   return of_clk_add_hw_provider(pdev->dev.parent->of_node,
- of_clk_hw_simple_get,
- &clkdata->pdmclk_hw);
+   return devm_of_clk_add_parent_hw_provider(&pdev->dev,
+   of_clk_hw_simple_get, &clkdata->pdmclk_hw);
 }
 
 static struct platform_driver twl6040_pdmclk_driver = {
-- 
2.14.3


-- 
Matti Vaittinen
ROHM Semiconductors

~~~ "I don't think so," said Rene Descartes.  Just then, he vanished ~~~


[PATCH v4 7/8] clk: apcs-msm8916: simplify probe cleanup by using devm

2018-11-13 Thread Matti Vaittinen
Use the devm variant for of_provider registration.

Signed-off-by: Matti Vaittinen 
---
 drivers/clk/qcom/apcs-msm8916.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/clk/qcom/apcs-msm8916.c b/drivers/clk/qcom/apcs-msm8916.c
index b1cc8dbcd327..f4e0c136ab1a 100644
--- a/drivers/clk/qcom/apcs-msm8916.c
+++ b/drivers/clk/qcom/apcs-msm8916.c
@@ -96,8 +96,8 @@ static int qcom_apcs_msm8916_clk_probe(struct platform_device *pdev)
goto err;
}
 
-   ret = of_clk_add_hw_provider(parent->of_node, of_clk_hw_simple_get,
-&a53cc->clkr.hw);
+   ret = devm_of_clk_add_parent_hw_provider(dev, of_clk_hw_simple_get,
+&a53cc->clkr.hw);
if (ret) {
dev_err(dev, "failed to add clock provider: %d\n", ret);
goto err;
@@ -118,7 +118,6 @@ static int qcom_apcs_msm8916_clk_remove(struct platform_device *pdev)
struct device *parent = pdev->dev.parent;
 
clk_notifier_unregister(a53cc->pclk, &a53cc->clk_nb);
-   of_clk_del_provider(parent->of_node);
 
return 0;
 }
-- 
2.14.3


-- 
Matti Vaittinen
ROHM Semiconductors

~~~ "I don't think so," said Rene Descartes.  Just then, he vanished ~~~


[PATCH v4 8/8] clk: bd718x7: Initial support for ROHM bd71837/bd71847 PMIC clock

2018-11-13 Thread Matti Vaittinen
ROHM BD71837 and BD71847 contain a 32768 Hz clock gate. Support the clock
using the generic clock framework.

Signed-off-by: Matti Vaittinen 
---
 drivers/clk/Kconfig   |   7 +++
 drivers/clk/Makefile  |   1 +
 drivers/clk/clk-bd718x7.c | 131 ++
 3 files changed, 139 insertions(+)
 create mode 100644 drivers/clk/clk-bd718x7.c

diff --git a/drivers/clk/Kconfig b/drivers/clk/Kconfig
index 81cdb4eaca07..2dc12bf75b1b 100644
--- a/drivers/clk/Kconfig
+++ b/drivers/clk/Kconfig
@@ -283,6 +283,13 @@ config COMMON_CLK_STM32H7
---help---
  Support for stm32h7 SoC family clocks
 
+config COMMON_CLK_BD718XX
+   tristate "Clock driver for ROHM BD718x7 PMIC"
+   depends on MFD_ROHM_BD718XX
+   help
+ This driver supports ROHM BD71837 and ROHM BD71847
+ PMICs clock gates.
+
 source "drivers/clk/actions/Kconfig"
 source "drivers/clk/bcm/Kconfig"
 source "drivers/clk/hisilicon/Kconfig"
diff --git a/drivers/clk/Makefile b/drivers/clk/Makefile
index 72be7a38cff1..a47430b387db 100644
--- a/drivers/clk/Makefile
+++ b/drivers/clk/Makefile
@@ -21,6 +21,7 @@ endif
 obj-$(CONFIG_MACH_ASM9260) += clk-asm9260.o
 obj-$(CONFIG_COMMON_CLK_AXI_CLKGEN)+= clk-axi-clkgen.o
 obj-$(CONFIG_ARCH_AXXIA)   += clk-axm5516.o
+obj-$(CONFIG_COMMON_CLK_BD718XX)   += clk-bd718x7.o
 obj-$(CONFIG_COMMON_CLK_CDCE706)   += clk-cdce706.o
 obj-$(CONFIG_COMMON_CLK_CDCE925)   += clk-cdce925.o
 obj-$(CONFIG_ARCH_CLPS711X)+= clk-clps711x.o
diff --git a/drivers/clk/clk-bd718x7.c b/drivers/clk/clk-bd718x7.c
new file mode 100644
index ..df5f1068ce8e
--- /dev/null
+++ b/drivers/clk/clk-bd718x7.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2018 ROHM Semiconductors
+// bd71837.c  -- ROHM BD71837MWV clock driver
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct bd718xx_clk {
+   struct clk_hw hw;
+   u8 reg;
+   u8 mask;
+   struct platform_device *pdev;
+   struct bd718xx *mfd;
+};
+
+static int bd71837_clk_set(struct clk_hw *hw, int status)
+{
+   struct bd718xx_clk *c = container_of(hw, struct bd718xx_clk, hw);
+
+   return regmap_update_bits(c->mfd->regmap, c->reg, c->mask, status);
+}
+
+static void bd71837_clk_disable(struct clk_hw *hw)
+{
+   int rv;
+   struct bd718xx_clk *c = container_of(hw, struct bd718xx_clk, hw);
+
+   rv = bd71837_clk_set(hw, 0);
+   if (rv)
+   dev_dbg(&c->pdev->dev, "Failed to disable 32K clk (%d)\n", rv);
+}
+
+static int bd71837_clk_enable(struct clk_hw *hw)
+{
+   return bd71837_clk_set(hw, 1);
+}
+
+static int bd71837_clk_is_enabled(struct clk_hw *hw)
+{
+   int enabled;
+   int rval;
+   struct bd718xx_clk *c = container_of(hw, struct bd718xx_clk, hw);
+
+   rval = regmap_read(c->mfd->regmap, c->reg, &enabled);
+
+   if (rval)
+   return rval;
+
+   return enabled & c->mask;
+}
+
+static const struct clk_ops bd71837_clk_ops = {
+   .prepare = &bd71837_clk_enable,
+   .unprepare = &bd71837_clk_disable,
+   .is_prepared = &bd71837_clk_is_enabled,
+};
+
+static int bd71837_clk_probe(struct platform_device *pdev)
+{
+   struct bd718xx_clk *c;
+   int rval = -ENOMEM;
+   const char *parent_clk;
+   struct device *parent = pdev->dev.parent;
+   struct bd718xx *mfd = dev_get_drvdata(parent);
+   struct clk_init_data init = {
+   .name = "bd718xx-32k-out",
+   .ops = &bd71837_clk_ops,
+   };
+
+   c = devm_kzalloc(&pdev->dev, sizeof(*c), GFP_KERNEL);
+   if (!c)
+   return -ENOMEM;
+
+   init.num_parents = 1;
+   parent_clk = of_clk_get_parent_name(parent->of_node, 0);
+
+   init.parent_names = &parent_clk;
+   if (!parent_clk) {
+   dev_err(&pdev->dev, "No parent clk found\n");
+   return -EINVAL;
+   }
+
+   c->reg = BD718XX_REG_OUT32K;
+   c->mask = BD718XX_OUT32K_EN;
+   c->mfd = mfd;
+   c->pdev = pdev;
+   c->hw.init = &init;
+
+   of_property_read_string_index(parent->of_node,
+ "clock-output-names", 0, &init.name);
+
+   rval = devm_clk_hw_register(&pdev->dev, &c->hw);
+   if (!rval) {
+   rval = devm_clk_hw_register_clkdev(&pdev->dev,
+  &c->hw, init.name, NULL);
+   if (rval)
+   dev_warn(&pdev->dev, "Failed to register clkdev\n");
+   if (parent->of_node) {
+   rval = devm_of_clk_add_parent_hw_provider(&pdev->dev,
+of_clk_hw_simple_get, &c->hw);
+   if (rval)
+   dev_err(&pdev->dev,
+   "adding clk provider failed\n");
+   }
+   

Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
On Tue, Nov 13, 2018 at 14:57, Jann Horn  wrote:
>
> On Tue, Nov 13, 2018 at 12:40 PM Timofey Titovets
>  wrote:
> > By default KSM works only on memory that has been added by
> > madvise().
> >
> > The only ways to get it working on other applications are:
> >   * Use LD_PRELOAD and libraries
> >   * Patch the kernel
> >
> > Let's use the kernel task list and add logic to import VMAs from tasks.
> >
> > That behaviour is controlled by new attributes:
> >   * mode:
> > I tried to mimic the hugepages attribute, so mode has two states:
> >   * madvise  - old default behaviour
> >   * always [new] - allow KSM to get tasks' VMAs and
> >try working on them.
>
> Please don't. And if you really have to for some reason, put some big
> warnings on this, advising people that it's a security risk.
>
> KSM is one of the favorite punching bags of side-channel and hardware
> security researchers:
>
> As a gigantic, problematic side channel:
> http://staff.aist.go.jp/k.suzaki/EuroSec2011-suzaki.pdf
> https://www.usenix.org/system/files/conference/woot15/woot15-paper-barresi.pdf
> https://access.redhat.com/blogs/766093/posts/1976303
> https://gruss.cc/files/dedup.pdf
>
> In particular https://gruss.cc/files/dedup.pdf ("Practical Memory
> Deduplication Attacks in Sandboxed JavaScript") shows that KSM makes
> it possible to use malicious JavaScript to determine whether a given
> page of memory exists elsewhere on your system.
>
> And also as a way to target rowhammer-based faults:
> https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
> https://thisissecurity.stormshield.com/2017/10/19/attacking-co-hosted-vm-hacker-hammer-two-memory-modules/

I'm very sorry, I'm not a security specialist.
But if I understood correctly, KSM has those security issues _without_
my patch set.
Moreover, it's not only KSM that has this type of issue; any memory
deduplication has these problems.
Anyone who cares about security must decide for themselves which tools
they use and how they will defend against others.
They must also learn the tools they use and make their own decisions,
right?

So, if you really care about that problem, in general or only on the KSM
side, it is your initiative and your duty to warn people about it.

KSM has already existed for 10+ years. You know about the security
implications of using memory deduplication.
It is your duty to send patches to the documentation and add appropriate
warnings.

Sorry for being passive-aggressive;
I don't mean to hurt or humiliate anyone.

That's just my humble opinion, and my English is too limited
to put it more gently.

Thanks!


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Jann Horn
On Tue, Nov 13, 2018 at 1:59 PM Timofey Titovets
 wrote:
>
> On Tue, Nov 13, 2018 at 14:57, Jann Horn  wrote:
> >
> > On Tue, Nov 13, 2018 at 12:40 PM Timofey Titovets
> >  wrote:
> > > By default KSM works only on memory that has been added by
> > > madvise().
> > >
> > > The only ways to get it working on other applications are:
> > >   * Use LD_PRELOAD and libraries
> > >   * Patch the kernel
> > >
> > > Let's use the kernel task list and add logic to import VMAs from tasks.
> > >
> > > That behaviour is controlled by new attributes:
> > >   * mode:
> > > I tried to mimic the hugepages attribute, so mode has two states:
> > >   * madvise  - old default behaviour
> > >   * always [new] - allow KSM to get tasks' VMAs and
> > >try working on them.
> >
> > Please don't. And if you really have to for some reason, put some big
> > warnings on this, advising people that it's a security risk.
> >
> > KSM is one of the favorite punching bags of side-channel and hardware
> > security researchers:
> >
> > As a gigantic, problematic side channel:
> > http://staff.aist.go.jp/k.suzaki/EuroSec2011-suzaki.pdf
> > https://www.usenix.org/system/files/conference/woot15/woot15-paper-barresi.pdf
> > https://access.redhat.com/blogs/766093/posts/1976303
> > https://gruss.cc/files/dedup.pdf
> >
> > In particular https://gruss.cc/files/dedup.pdf ("Practical Memory
> > Deduplication Attacks in Sandboxed JavaScript") shows that KSM makes
> > it possible to use malicious JavaScript to determine whether a given
> > page of memory exists elsewhere on your system.
> >
> > And also as a way to target rowhammer-based faults:
> > https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
> > https://thisissecurity.stormshield.com/2017/10/19/attacking-co-hosted-vm-hacker-hammer-two-memory-modules/
>
> I'm very sorry, I'm not a security specialist.
> But if I understood correctly, KSM has those security issues _without_
> my patch set.

Yep. However, so far, it requires an application to explicitly opt in
to this behavior, so it's not all that bad. Your patch would remove
the requirement for application opt-in, which, in my opinion, makes
this way worse and reduces the number of applications for which this
is acceptable.

> Moreover, it's not only KSM that has this type of issue; any memory
> deduplication has these problems.

Yup.

> Anyone who cares about security must decide for themselves which tools
> they use and how they will defend against others.

> They must also learn the tools they use and make their own decisions,
> right?
>
> So, if you really care about that problem, in general or only on the KSM
> side, it is your initiative and your duty to warn people about it.
>
> KSM has already existed for 10+ years. You know about the security
> implications of using memory deduplication.
> It is your duty to send patches to the documentation and add appropriate
> warnings.

As far as I know, basically nobody is using KSM at this point. There
are blog posts from several cloud providers about these security risks
that explicitly state that they're not using memory deduplication.


[PATCH v4 0/3] Add driver for Synopsys DesignWare I3C master IP

2018-11-13 Thread Vitor Soares
This patch series is a proposal for the I3C master driver for the Synopsys IP.
It is to be applied on top of the I3C subsystem RFC V10 submitted by
Boris Brezillon.

Supported features:
  Regular CCC commands.
  I3C private transfers.
  I2C transfers.

Missing functionalities:
  Support DMA interface.
  Support for I3C_BUS_MODE_MIXED_SLOW.
  Hot-join.
  IBI.

Main change between v3 and v4:
- Minor fixes. They are described in each patch

Main change between v2 and v3:
- Minor fixes. They are described in each patch

Main change between v1 and v2:
- Add controller version on dt-binding
- The driver now calls writesl/readsl() instead readl/writel
- Rename some variables in the driver

Vitor Soares (3):
  i3c: master: Add driver for Synopsys DesignWare IP
  dt-binding: i3c: Document Synopsys DesignWare I3C
  MAINTAINERS: Add myself as the dw-i3c-master module maintainer

 .../devicetree/bindings/i3c/snps,dw-i3c-master.txt |   41 +
 MAINTAINERS|    6 +
 drivers/i3c/master/Kconfig |   14 +
 drivers/i3c/master/Makefile|1 +
 drivers/i3c/master/dw-i3c-master.c | 1216 
 5 files changed, 1278 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
 create mode 100644 drivers/i3c/master/dw-i3c-master.c

-- 
2.7.4




[PATCH v4 3/3] MAINTAINERS: Add myself as the dw-i3c-master module maintainer

2018-11-13 Thread Vitor Soares
Signed-off-by: Vitor Soares 
---
Change in v4:
- Change email soa...@synopsys.com to vitor.soa...@synopsys.com

Change in v3:
- Remove dummy characters

Change in v2:
- None

 MAINTAINERS | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 293c863..39910b3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6930,6 +6930,12 @@ F:   drivers/i3c/
 F: include/linux/i3c/
 F: include/dt-bindings/i3c/
 
+I3C DRIVER FOR SYNOPSYS DESIGNWARE
+M: Vitor Soares 
+S: Maintained
+F: Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
+F: drivers/i3c/master/dw*
+
 IA64 (Itanium) PLATFORM
 M: Tony Luck 
 M: Fenghua Yu 
-- 
2.7.4




[PATCH v4 2/3] dt-binding: i3c: Document Synopsys DesignWare I3C

2018-11-13 Thread Vitor Soares
Document Synopsys DesignWare I3C master module

Signed-off-by: Vitor Soares 
Reviewed-by: Rob Herring 
---
Changes in v4:
- Change email soa...@synopsys.com to vitor.soa...@synopsys.com
- Add Rob Herring R-b

Changes in v3:
- Remove dummy characters

Changes in v2:
- Address the changes in Documentation/devicetree/bindings/i3c/i3c.txt
- Add controller version on compatible string

 .../devicetree/bindings/i3c/snps,dw-i3c-master.txt | 41 ++
 1 file changed, 41 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt

diff --git a/Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt b/Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
new file mode 100644
index 000..5020eb7
--- /dev/null
+++ b/Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt
@@ -0,0 +1,41 @@
+Bindings for Synopsys DesignWare I3C master block
+=
+
+Required properties:
+
+- compatible: shall be "snps,dw-i3c-master-1.00a"
+- clocks: shall reference the core_clk
+- interrupts: the interrupt line connected to this I3C master
+- reg: Offset and length of I3C master registers
+
+Mandatory properties defined by the generic binding (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details):
+
+- #address-cells: shall be set to 3
+- #size-cells: shall be set to 0
+
+Optional properties defined by the generic binding (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details):
+
+- i2c-scl-hz
+- i3c-scl-hz
+
+I3C devices connected on the bus follow the generic description (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details).
+
+Example:
+
+   i3c-master@2000 {
+   compatible = "snps,dw-i3c-master-1.00a";
+   #address-cells = <3>;
+   #size-cells = <0>;
+   reg = <0x02000 0x1000>;
+   interrupts = <0>;
+   clocks = <&i3cclk>;
+
+   eeprom@57 {
+   compatible = "atmel,24c01";
+   reg = <0x57 0x0 0x10>;
+   pagesize = <0x8>;
+   };
+   };
-- 
2.7.4




[PATCH v4 1/3] i3c: master: Add driver for Synopsys DesignWare IP

2018-11-13 Thread Vitor Soares
Add driver for Synopsys DesignWare I3C master IP

Signed-off-by: Vitor Soares 
---
Change in v4:
- Change email soa...@synopsys.com to vitor.soa...@synopsys.com

Change in v3:
- Use struct_size() (suggested by Matthew)

Change in v2:
- Rename some variables
- Remove dw_i3c_master_dev_set_info()
- Ajust code to match the changes made of i3c subsystem
- Use readsl/writesl() to populate TX/RX buffers

 drivers/i3c/master/Kconfig |   14 +
 drivers/i3c/master/Makefile|1 +
 drivers/i3c/master/dw-i3c-master.c | 1216 
 3 files changed, 1231 insertions(+)
 create mode 100644 drivers/i3c/master/dw-i3c-master.c

diff --git a/drivers/i3c/master/Kconfig b/drivers/i3c/master/Kconfig
index 959b837..8ee1ce6 100644
--- a/drivers/i3c/master/Kconfig
+++ b/drivers/i3c/master/Kconfig
@@ -4,3 +4,17 @@ config CDNS_I3C_MASTER
depends on !(ALPHA || PARISC)
help
  Enable this driver if you want to support Cadence I3C master block.
+
+config DW_I3C_MASTER
+   tristate "Synopsys DesignWare I3C master driver"
+   depends on I3C
+   depends on !(ALPHA || PARISC)
+   # ALPHA and PARISC need {read,write}sl()
+   help
+ Support for Synopsys DesignWare MIPI I3C Controller.
+
+ For details please see
+ https://www.synopsys.com/dw/ipdir.php?ds=mipi_i3c
+
+ This driver can also be built as a module.  If so, the module
+ will be called dw-i3c-master.
diff --git a/drivers/i3c/master/Makefile b/drivers/i3c/master/Makefile
index 4c4304a..fc53939 100644
--- a/drivers/i3c/master/Makefile
+++ b/drivers/i3c/master/Makefile
@@ -1 +1,2 @@
 obj-$(CONFIG_CDNS_I3C_MASTER)  += i3c-master-cdns.o
+obj-$(CONFIG_DW_I3C_MASTER)+= dw-i3c-master.o
diff --git a/drivers/i3c/master/dw-i3c-master.c b/drivers/i3c/master/dw-i3c-master.c
new file mode 100644
index 000..71b80d3
--- /dev/null
+++ b/drivers/i3c/master/dw-i3c-master.c
@@ -0,0 +1,1216 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 Synopsys, Inc. and/or its affiliates.
+ *
+ * Author: Vitor Soares 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DEVICE_CTRL0x0
+#define DEV_CTRL_ENABLEBIT(31)
+#define DEV_CTRL_RESUMEBIT(30)
+#define DEV_CTRL_HOT_JOIN_NACK BIT(8)
+#define DEV_CTRL_I2C_SLAVE_PRESENT BIT(7)
+
+#define DEVICE_ADDR0x4
+#define DEV_ADDR_DYNAMIC_ADDR_VALIDBIT(31)
+#define DEV_ADDR_DYNAMIC(x)(((x) << 16) & GENMASK(22, 16))
+
+#define HW_CAPABILITY  0x8
+#define COMMAND_QUEUE_PORT 0xc
+#define COMMAND_PORT_TOC   BIT(30)
+#define COMMAND_PORT_READ_TRANSFER BIT(28)
+#define COMMAND_PORT_SDAP  BIT(27)
+#define COMMAND_PORT_ROC   BIT(26)
+#define COMMAND_PORT_SPEED(x)  (((x) << 21) & GENMASK(23, 21))
+#define COMMAND_PORT_DEV_INDEX(x)  (((x) << 16) & GENMASK(20, 16))
+#define COMMAND_PORT_CPBIT(15)
+#define COMMAND_PORT_CMD(x)(((x) << 7) & GENMASK(14, 7))
+#define COMMAND_PORT_TID(x)(((x) << 3) & GENMASK(6, 3))
+
+#define COMMAND_PORT_ARG_DATA_LEN(x)   (((x) << 16) & GENMASK(31, 16))
+#define COMMAND_PORT_ARG_DATA_LEN_MAX  65536
+#define COMMAND_PORT_TRANSFER_ARG  0x01
+
+#define COMMAND_PORT_SDA_DATA_BYTE_3(x)(((x) << 24) & GENMASK(31, 24))
+#define COMMAND_PORT_SDA_DATA_BYTE_2(x)(((x) << 16) & GENMASK(23, 16))
+#define COMMAND_PORT_SDA_DATA_BYTE_1(x)(((x) << 8) & GENMASK(15, 8))
+#define COMMAND_PORT_SDA_BYTE_STRB_3   BIT(5)
+#define COMMAND_PORT_SDA_BYTE_STRB_2   BIT(4)
+#define COMMAND_PORT_SDA_BYTE_STRB_1   BIT(3)
+#define COMMAND_PORT_SHORT_DATA_ARG0x02
+
+#define COMMAND_PORT_DEV_COUNT(x)  (((x) << 21) & GENMASK(25, 21))
+#define COMMAND_PORT_ADDR_ASSGN_CMD0x03
+
+#define RESPONSE_QUEUE_PORT0x10
+#define RESPONSE_PORT_ERR_STATUS(x)(((x) & GENMASK(31, 28)) >> 28)
+#define RESPONSE_NO_ERROR  0
+#define RESPONSE_ERROR_CRC 1
+#define RESPONSE_ERROR_PARITY  2
+#define RESPONSE_ERROR_FRAME   3
+#define RESPONSE_ERROR_IBA_NACK4
+#define RESPONSE_ERROR_ADDRESS_NACK5
+#define RESPONSE_ERROR_OVER_UNDER_FLOW 6
+#define RESPONSE_ERROR_TRANSF_ABORT8
+#define RESPONSE_ERROR_I2C_W_NACK_ERR  9
+#define RESPONSE_PORT_TID(x)   (((x) & GENMASK(27, 24)) >> 24)
+#define RESPONSE_PORT_DATA_LEN(x)  ((x) & GENMASK(15, 0))
+
+#define RX_TX_DATA_PORT0x14
+#define IBI_QUEUE_STATUS   0x18
+#define QUEUE_THLD_CTRL0x1c
+#define QUEUE_THLD_CTRL_RESP_BUF_MASK  GENMASK(15, 8)
+#define QUEUE_THLD_CTRL_RESP_BUF(x)(((x) - 1) << 8)
+
+#define DATA_BUFFER_THLD_CTRL  0x20
+#define D
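The field-encoder macros above all follow one pattern: shift the value into position, then clamp it with GENMASK(). A small userspace sketch (GENMASK() re-created locally for 32-bit registers; the two field macros are copied from the patch):

```c
#include <stdint.h>

/* Userspace re-creation of the kernel's GENMASK() for 32-bit registers,
 * plus two field encoders from the patch, showing the
 * "shift into place, then mask" pattern the register layout uses. */
#define GENMASK(h, l)	((~0U >> (31 - (h))) & (~0U << (l)))

#define COMMAND_PORT_SPEED(x)		(((x) << 21) & GENMASK(23, 21))
#define COMMAND_PORT_DEV_INDEX(x)	(((x) << 16) & GENMASK(20, 16))

static uint32_t encode_xfer_cmd(uint32_t speed, uint32_t dev_index)
{
	/* Values wider than the field are silently truncated by the mask. */
	return COMMAND_PORT_SPEED(speed) | COMMAND_PORT_DEV_INDEX(dev_index);
}
```

One consequence worth noting: an out-of-range value wraps silently (a speed of 9 encodes like a speed of 1), so callers must validate ranges themselves.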

Re: [PATCH v10 07/22] kasan: initialize shadow to 0xff for tag-based mode

2018-11-13 Thread Andrey Konovalov
On Wed, Nov 7, 2018 at 6:08 PM, Mark Rutland  wrote:
> On Tue, Nov 06, 2018 at 06:30:22PM +0100, Andrey Konovalov wrote:
>> A tag-based KASAN shadow memory cell contains a memory tag, that
>> corresponds to the tag in the top byte of the pointer, that points to that
>> memory. The native top byte value of kernel pointers is 0xff, so with
>> tag-based KASAN we need to initialize shadow memory to 0xff.
>>
>> Reviewed-by: Andrey Ryabinin 
>> Reviewed-by: Dmitry Vyukov 
>> Signed-off-by: Andrey Konovalov 
>> ---
>>  arch/arm64/mm/kasan_init.c | 16 ++--
>>  include/linux/kasan.h  |  8 
>>  mm/kasan/common.c  |  3 ++-
>>  3 files changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
>> index 63527e585aac..18ebc8994a7b 100644
>> --- a/arch/arm64/mm/kasan_init.c
>> +++ b/arch/arm64/mm/kasan_init.c
>> @@ -43,6 +43,15 @@ static phys_addr_t __init kasan_alloc_zeroed_page(int node)
>>   return __pa(p);
>>  }
>>
>> +static phys_addr_t __init kasan_alloc_raw_page(int node)
>> +{
>> + void *p = memblock_alloc_try_nid_raw(PAGE_SIZE, PAGE_SIZE,
>> + __pa(MAX_DMA_ADDRESS),
>> + MEMBLOCK_ALLOC_ACCESSIBLE,
>> + node);
>> + return __pa(p);
>> +}
>> +
>>  static pte_t *__init kasan_pte_offset(pmd_t *pmdp, unsigned long addr, int node,
>> bool early)
>>  {
>> @@ -88,7 +97,9 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
>>
>>   do {
>>   phys_addr_t page_phys = early ? __pa_symbol(kasan_zero_page)
>> -   : kasan_alloc_zeroed_page(node);
>> +   : kasan_alloc_raw_page(node);
>> + if (!early)
>> + memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
>>   next = addr + PAGE_SIZE;
>>   set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
>>   } while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
>> @@ -138,6 +149,7 @@ asmlinkage void __init kasan_early_init(void)
>>   KASAN_SHADOW_END - (1UL << (64 - KASAN_SHADOW_SCALE_SHIFT)));
>>   BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_START, PGDIR_SIZE));
>>   BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE));
>> +
>>   kasan_pgd_populate(KASAN_SHADOW_START, KASAN_SHADOW_END, NUMA_NO_NODE,
>>  true);
>>  }
>> @@ -234,7 +246,7 @@ void __init kasan_init(void)
>>   set_pte(&kasan_zero_pte[i],
>>   pfn_pte(sym_to_pfn(kasan_zero_page), PAGE_KERNEL_RO));
>>
>> - memset(kasan_zero_page, 0, PAGE_SIZE);
>> + memset(kasan_zero_page, KASAN_SHADOW_INIT, PAGE_SIZE);
>
> If this isn't going to contain zero, can we please have a preparatory
> patch renaming this to something which isn't misleading?
>
> Perhaps kasan_common_shadow_page?

Will rename to kasan_early_shadow_page in v11, thanks!

>
> Thanks,
> Mark.


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Oleksandr Natalenko

Hi.


Yep. However, so far, it requires an application to explicitly opt in
to this behavior, so it's not all that bad. Your patch would remove
the requirement for application opt-in, which, in my opinion, makes
this way worse and reduces the number of applications for which this
is acceptable.


The default is to maintain the old behaviour, so unless the explicit 
decision is made by the administrator, no extra risk is imposed.



As far as I know, basically nobody is using KSM at this point. There
are blog posts from several cloud providers about these security risks
that explicitly state that they're not using memory deduplication.


I tend to disagree here. Based on both what my company does and what 
UKSM users do, memory dedup is a desired option (note the word 
"option" here, not the default choice).


Thanks.

--
  Oleksandr Natalenko (post-factum)


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Igor Stoppa

Hello,
I've been studying v4 of the patch-set [1] that Nadav has been working on.
Incidentally, I think it would be useful to cc also the 
security/hardening ml.

The patch-set seems to be close to final, so I am resuming this discussion.

On 30/10/2018 19:06, Andy Lutomirski wrote:


I support the addition of a rare-write mechanism to the upstream kernel.  And I 
think that there is only one sane way to implement it: using an mm_struct. That 
mm_struct, just like any sane mm_struct, should only differ from init_mm in 
that it has extra mappings in the *user* region.


After reading the code, I see what you meant.
I think I can work with it.

But I have a couple of questions wrt the use of this mechanism, in the 
context of write rare.



1) mm_struct.

Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code 
(live patch?), which seems to happen sequentially and in a relatively 
standardized way, like replacing the NOPs specifically placed in the 
functions that need patching.


This is a bit different from the more generic write-rare case, applied 
to data.


As example, I have in mind a system where both IMA and SELinux are in use.

In this system, a file is accessed for the first time.

That would trigger 2 things:
- evaluation of the SELinux rules and probably update of the AVC cache
- IMA measurement and update of the measurements

Both of them could be write protected, meaning that they would both have 
to be modified through the write rare mechanism.


While the events, for 1 specific file, would be sequential, it's not 
difficult to imagine that multiple files could be accessed at the same time.


If the update of the data structures in both IMA and SELinux must use 
the same mm_struct, that would have to be somehow regulated and it would 
introduce an unnecessary (imho) dependency.


How about having one mm_struct for each writer (core or thread)?



2) Iiuc, the purpose of the 2 pages being remapped is that the target of 
the patch might spill across the page boundary, however if I deal with 
the modification of generic data, I shouldn't (shouldn't I?) assume that 
the data will not span across multiple pages.


If the data spans across multiple pages, in unknown amount, I suppose 
that I should not keep interrupts disabled for an unknown time, as it 
would hurt preemption.


What I thought, in my initial patch-set, was to iterate over each page 
that must be written to, in a loop, re-enabling interrupts in-between 
iterations, to give pending interrupts a chance to be served.


This would mean that the data being written to would not be consistent, 
but it's a problem that would have to be addressed anyways, since it can 
be still read by other cores, while the write is ongoing.



Is this a valid concern/approach?


--
igor


[1] https://lkml.org/lkml/2018/11/11/110


Re: [PATCH v10 12/22] kasan, arm64: fix up fault handling logic

2018-11-13 Thread Andrey Konovalov
On Thu, Nov 8, 2018 at 1:22 PM, Mark Rutland  wrote:
> On Tue, Nov 06, 2018 at 06:30:27PM +0100, Andrey Konovalov wrote:
>> show_pte in arm64 fault handling relies on the fact that the top byte of
>> a kernel pointer is 0xff, which isn't always the case with tag-based
>> KASAN.
>
> That's for the TTBR1 check, right?
>
> i.e. for the following to work:
>
> if (addr >= VA_START)
>
> ... we need the tag bits to be an extension of bit 55...
>
>>
>> This patch resets the top byte in show_pte.
>>
>> Reviewed-by: Andrey Ryabinin 
>> Reviewed-by: Dmitry Vyukov 
>> Signed-off-by: Andrey Konovalov 
>> ---
>>  arch/arm64/mm/fault.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 7d9571f4ae3d..d9a84d6f3343 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -32,6 +32,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include 
>> @@ -141,6 +142,8 @@ void show_pte(unsigned long addr)
>>   pgd_t *pgdp;
>>   pgd_t pgd;
>>
>> + addr = (unsigned long)kasan_reset_tag((void *)addr);
>
> ... but this ORs in (0xffUL << 56), which is not correct for addresses
> which aren't TTBR1 addresses to begin with, where bit 55 is clear, and
> throws away useful information.
>
> We could use untagged_addr() here, but that wouldn't be right for
> kernels which don't use TBI1, and we'd erroneously report addresses
> under the TTBR1 range as being in the TTBR1 range.
>
> I also see that the entry assembly for el{1,0}_{da,ia} clears the tag
> for EL0 addresses.
>
> So we could have:
>
> static inline bool is_ttbr0_addr(unsigned long addr)
> {
> /* entry assembly clears tags for TTBR0 addrs */
> return addr < TASK_SIZE_64;
> }
>
> static inline bool is_ttbr1_addr(unsigned long addr)
> {
> /* TTBR1 addresses may have a tag if HWKASAN is in use */
> return arch_kasan_reset_tag(addr) >= VA_START;
> }
>
> ... and use those in the conditionals, leaving the addr as-is for
> reporting purposes.

Actually it looks like 276e9327 ("arm64: entry: improve data abort
handling of tagged pointers") already takes care of both user and
kernel fault addresses and correctly removes tags from them. So I
think we need to drop this patch.


Re: [PATCH v15 23/23] x86/sgx: Driver documentation

2018-11-13 Thread Jarkko Sakkinen
On Thu, Nov 08, 2018 at 09:20:40PM +0200, Jarkko Sakkinen wrote:
> On Thu, Nov 08, 2018 at 04:39:42PM +0200, Jarkko Sakkinen wrote:
> > On Wed, Nov 07, 2018 at 09:09:37AM -0800, Dave Hansen wrote:
> > > On 11/7/18 8:30 AM, Jarkko Sakkinen wrote:
> > > >> Does this code run when I type "make kselftest"?  If not, I think we
> > > >> should rectify that.
> > > > No, it doesn't. It is just my backup for the non-SDK user space code
> > > > that I've made that I will use to fork my user space SGX projects in
> > > > the future.
> > > > 
> > > > I can work-out a selftest (and provide a new patch in the series) but
> > > > I'm still wondering what the enclave should do. I would suggest that
> > > > we start with an enclave that does just EEXIT and nothing else.
> > > 
> > > Yeah, that's a start.  But, a good selftest would include things like
> > > faults and error conditions.
> > 
> > Great. We can add more entry points to the enclave for different tests
> > but I'll start with a bare minimum. And yeah, whatever goes into the next
> > version I'll document the fault handling.
> 
> For the v17 I'll add exactly two test cases:
> 
> 1. EENTER/EEXIT
> 2. EENTER/exception
> 
> So that it will be easier to evaluate and demonstrate exception handling.
> 
> /Jarkko

Here is my test program now:

https://github.com/jsakkine-intel/sgx-selftest

It is ~1100 lines ATM. Next I'll deploy it to the kernel tree. It has
only (1) now but I'll add (2) too when I convert this to a kernel patch
(probably by doing sgx_call() with a NULL pointer).

/Jarkko


[PATCH] arm64: Make kpti command line options x86 compatible

2018-11-13 Thread Alexander Graf
I've already stumbled over 2 cases where people got confused about how to
disable kpti on AArch64. In both cases, they used existing x86_64 options
and just applied that to an AArch64 system, expecting it to work.

I think it makes a lot of sense to have compatible kernel command line
parameters whenever we can have them be compatible.

So this patch adds the pti= and nopti kernel command line options, mapping
them into the existing kpti= command line framework. It preserves the old
syntax to maintain compatibility with older command lines.

While at it, the patch also marks the respective options as dual-arch.

Reported-by: Richard Brown 
Signed-off-by: Alexander Graf 
---
 Documentation/admin-guide/kernel-parameters.txt |  6 +++---
 arch/arm64/kernel/cpufeature.c  | 15 ++-
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 81d1d5a74728..4a1c6bcfcdb5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3522,8 +3522,8 @@
pt. [PARIDE]
See Documentation/blockdev/paride.txt.
 
-   pti=[X86_64] Control Page Table Isolation of user and
-   kernel address spaces.  Disabling this feature
+   pti=[X86_64,ARM64] Control Page Table Isolation of user
+   and kernel address spaces.  Disabling this feature
removes hardening, but improves performance of
system calls and interrupts.
 
@@ -3534,7 +3534,7 @@
 
Not specifying this option is equivalent to pti=auto.
 
-   nopti   [X86_64]
+   nopti   [X86_64,ARM64]
Equivalent to pti=off
 
pty.legacy_count=
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index af50064dea51..12bb3b0470dd 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -978,13 +978,26 @@ static int __init parse_kpti(char *str)
bool enabled;
int ret = strtobool(str, &enabled);
 
-   if (ret)
+   if (ret) {
+   if (!strncmp(str, "auto", 4)) {
+   __kpti_forced = 0;
+   return 0;
+   }
return ret;
+   }
 
__kpti_forced = enabled ? 1 : -1;
return 0;
 }
 early_param("kpti", parse_kpti);
+early_param("pti", parse_kpti);
+
+static int __init handle_no_pti(char *p)
+{
+   __kpti_forced = -1;
+   return 0;
+}
+early_param("nopti", parse_no_pti);
 #endif /* CONFIG_UNMAP_KERNEL_AT_EL0 */
 
 #ifdef CONFIG_ARM64_HW_AFDBM
-- 
2.12.3



[PATCH v2] arm64: Make kpti command line options x86 compatible

2018-11-13 Thread Alexander Graf
I've already stumbled over 2 cases where people got confused about how to
disable kpti on AArch64. In both cases, they used existing x86_64 options
and just applied that to an AArch64 system, expecting it to work.

I think it makes a lot of sense to have compatible kernel command line
parameters whenever we can have them be compatible.

So this patch adds the pti= and nopti kernel command line options, mapping
them into the existing kpti= command line framework. It preserves the old
syntax to maintain compatibility with older command lines.

While at it, the patch also marks the respective options as dual-arch.

Reported-by: Richard Brown 
Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - Actually make it compile. Sorry for the sloppy v1.
---
 Documentation/admin-guide/kernel-parameters.txt |  6 +++---
 arch/arm64/kernel/cpufeature.c  | 20 +++-
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 81d1d5a74728..4a1c6bcfcdb5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3522,8 +3522,8 @@
pt. [PARIDE]
See Documentation/blockdev/paride.txt.
 
-   pti=[X86_64] Control Page Table Isolation of user and
-   kernel address spaces.  Disabling this feature
+   pti=[X86_64,ARM64] Control Page Table Isolation of user
+   and kernel address spaces.  Disabling this feature
removes hardening, but improves performance of
system calls and interrupts.
 
@@ -3534,7 +3534,7 @@
 
Not specifying this option is equivalent to pti=auto.
 
-   nopti   [X86_64]
+   nopti   [X86_64,ARM64]
Equivalent to pti=off
 
pty.legacy_count=
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index af50064dea51..a67b4b563a7c 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -978,13 +978,31 @@ static int __init parse_kpti(char *str)
bool enabled;
int ret = strtobool(str, &enabled);
 
-   if (ret)
+   if (ret) {
+   if (!strncmp(str, "auto", 4)) {
+   __kpti_forced = 0;
+   return 0;
+   }
return ret;
+   }
 
__kpti_forced = enabled ? 1 : -1;
return 0;
 }
 early_param("kpti", parse_kpti);
+
+static int __init parse_pti(char *str)
+{
+   return parse_kpti(str);
+}
+early_param("pti", parse_pti);
+
+static int __init parse_no_pti(char *p)
+{
+   __kpti_forced = -1;
+   return 0;
+}
+early_param("nopti", parse_no_pti);
 #endif /* CONFIG_UNMAP_KERNEL_AT_EL0 */
 
 #ifdef CONFIG_ARM64_HW_AFDBM
-- 
2.12.3



Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Oleksandr Natalenko

So,


…snip…
+static int ksm_seeker_thread(void *nothing)
+{
+   pid_t last_pid = 1;
+   pid_t curr_pid;
+   struct task_struct *task;
+
+   set_freezable();
+   set_user_nice(current, 5);
+
+   while (!kthread_should_stop()) {
+   wait_while_offlining();
+
+   try_to_freeze();
+
+   if (!ksm_mode_always()) {
+   wait_event_freezable(ksm_seeker_thread_wait,
+   ksm_mode_always() || kthread_should_stop());
+   continue;
+   }
+
+   /*
+* import one task's vma per run
+*/
+   read_lock(&tasklist_lock);
+
+   /* Try always get next task */
+   for_each_process(task) {
+   curr_pid = task_pid_nr(task);
+   if (curr_pid == last_pid) {
+   task = next_task(task);
+   break;
+   }
+
+   if (curr_pid > last_pid)
+   break;
+   }
+
+   last_pid = task_pid_nr(task);
+   ksm_import_task_vma(task);


This seems to be a bad idea. ksm_import_task_vma() may sleep with 
tasklist_lock being held. Thus, IIUC, you'll get this:


[ 1754.410322] BUG: scheduling while atomic: ksmd_seeker/50/0x0002
…
[ 1754.410444] Call Trace:
[ 1754.410455]  dump_stack+0x5c/0x80
[ 1754.410460]  __schedule_bug.cold.19+0x38/0x51
[ 1754.410464]  __schedule+0x11dc/0x2080
[ 1754.410483]  schedule+0x32/0xb0
[ 1754.410487]  rwsem_down_write_failed+0x15d/0x240
[ 1754.410496]  call_rwsem_down_write_failed+0x13/0x20
[ 1754.410499]  down_write+0x20/0x30
[ 1754.410502]  ksm_import_task_vma+0x22/0x70
[ 1754.410505]  ksm_seeker_thread+0x134/0x1c0
[ 1754.410512]  kthread+0x113/0x130
[ 1754.410518]  ret_from_fork+0x35/0x40

I think you may want to get a reference to task_struct before releasing 
tasklist_lock, and then put it after ksm_import_task_vma() does its job.



+   read_unlock(&tasklist_lock);
+
+   schedule_timeout_interruptible(
+   msecs_to_jiffies(ksm_thread_seeker_sleep_millisecs));
+   }
+   return 0;
+}
…snip…


--
  Oleksandr Natalenko (post-factum)


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
Tue, 13 Nov 2018 at 19:33, Oleksandr Natalenko :
>
> So,
>
> > …snip…
> > +static int ksm_seeker_thread(void *nothing)
> > +{
> > + pid_t last_pid = 1;
> > + pid_t curr_pid;
> > + struct task_struct *task;
> > +
> > + set_freezable();
> > + set_user_nice(current, 5);
> > +
> > + while (!kthread_should_stop()) {
> > + wait_while_offlining();
> > +
> > + try_to_freeze();
> > +
> > + if (!ksm_mode_always()) {
> > + wait_event_freezable(ksm_seeker_thread_wait,
> > + ksm_mode_always() || kthread_should_stop());
> > + continue;
> > + }
> > +
> > + /*
> > +  * import one task's vma per run
> > +  */
> > + read_lock(&tasklist_lock);
> > +
> > + /* Try always get next task */
> > + for_each_process(task) {
> > + curr_pid = task_pid_nr(task);
> > + if (curr_pid == last_pid) {
> > + task = next_task(task);
> > + break;
> > + }
> > +
> > + if (curr_pid > last_pid)
> > + break;
> > + }
> > +
> > + last_pid = task_pid_nr(task);
> > + ksm_import_task_vma(task);
>
> This seems to be a bad idea. ksm_import_task_vma() may sleep with
> tasklist_lock being held. Thus, IIUC, you'll get this:

Yep, that's one of the reasons why I moved the code out of the ksmd
thread; I haven't fully understood how to fix that properly.
But I misread the symptoms of the problem.

> [ 1754.410322] BUG: scheduling while atomic: ksmd_seeker/50/0x0002
> …
> [ 1754.410444] Call Trace:
> [ 1754.410455]  dump_stack+0x5c/0x80
> [ 1754.410460]  __schedule_bug.cold.19+0x38/0x51
> [ 1754.410464]  __schedule+0x11dc/0x2080
> [ 1754.410483]  schedule+0x32/0xb0
> [ 1754.410487]  rwsem_down_write_failed+0x15d/0x240
> [ 1754.410496]  call_rwsem_down_write_failed+0x13/0x20
> [ 1754.410499]  down_write+0x20/0x30
> [ 1754.410502]  ksm_import_task_vma+0x22/0x70
> [ 1754.410505]  ksm_seeker_thread+0x134/0x1c0
> [ 1754.410512]  kthread+0x113/0x130
> [ 1754.410518]  ret_from_fork+0x35/0x40
>
> I think you may want to get a reference to task_struct before releasing
> tasklist_lock, and then put it after ksm_import_task_vma() does its job.

Maybe I misunderstood something, but currently I do exactly that.

> > + read_unlock(&tasklist_lock);
> > +
> > + schedule_timeout_interruptible(
> > + msecs_to_jiffies(ksm_thread_seeker_sleep_millisecs));
> > + }
> > + return 0;
> > +}
> > …snip…
>
> --
>Oleksandr Natalenko (post-factum)


It's good that you hit that anyway (because I can't reproduce it currently).

You mean try something like this, right?

read_lock(&tasklist_lock);
  
  task_lock(task);
read_unlock(&tasklist_lock);
last_pid = task_pid_nr(task);
ksm_import_task_vma(task);
  task_unlock(task);

Thanks!


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Andy Lutomirski
On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa  wrote:
>
> Hello,
> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> Incidentally, I think it would be useful to cc also the
> security/hardening ml.
> The patch-set seems to be close to final, so I am resuming this discussion.
>
> On 30/10/2018 19:06, Andy Lutomirski wrote:
>
> > I support the addition of a rare-write mechanism to the upstream kernel.  
> > And I think that there is only one sane way to implement it: using an 
> > mm_struct. That mm_struct, just like any sane mm_struct, should only differ 
> > from init_mm in that it has extra mappings in the *user* region.
>
> After reading the code, I see what you meant.
> I think I can work with it.
>
> But I have a couple of questions wrt the use of this mechanism, in the
> context of write rare.
>
>
> 1) mm_struct.
>
> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> (live patch?), which seems to happen sequentially and in a relatively
> standardized way, like replacing the NOPs specifically placed in the
> functions that need patching.
>
> This is a bit different from the more generic write-rare case, applied
> to data.
>
> As example, I have in mind a system where both IMA and SELinux are in use.
>
> In this system, a file is accessed for the first time.
>
> That would trigger 2 things:
> - evaluation of the SELinux rules and probably update of the AVC cache
> - IMA measurement and update of the measurements
>
> Both of them could be write protected, meaning that they would both have
> to be modified through the write rare mechanism.
>
> While the events, for 1 specific file, would be sequential, it's not
> difficult to imagine that multiple files could be accessed at the same time.
>
> If the update of the data structures in both IMA and SELinux must use
> the same mm_struct, that would have to be somehow regulated and it would
> introduce an unnecessary (imho) dependency.
>
> How about having one mm_struct for each writer (core or thread)?
>

I don't think that helps anything.  I think the mm_struct used for
prmem (or rare_write or whatever you want to call it) should be
entirely abstracted away by an appropriate API, so neither SELinux nor
IMA need to be aware that there's an mm_struct involved.  It's also
entirely possible that some architectures won't even use an mm_struct
behind the scenes -- x86, for example, could have avoided it if there
were a kernel equivalent of PKRU.  Sadly, there isn't.

>
>
> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> the patch might spill across the page boundary, however if I deal with
> the modification of generic data, I shouldn't (shouldn't I?) assume that
> the data will not span across multiple pages.

The reason for the particular architecture of text_poke() is to avoid
memory allocation to get it working.  I think that prmem/rare_write
should have each rare-writable kernel address map to a unique user
address, possibly just by offsetting everything by a constant.  For
rare_write, you don't actually need it to work as such until fairly
late in boot, since the rare_writable data will just be writable early
on.

>
> If the data spans across multiple pages, in unknown amount, I suppose
> that I should not keep interrupts disabled for an unknown time, as it
> would hurt preemption.
>
> What I thought, in my initial patch-set, was to iterate over each page
> that must be written to, in a loop, re-enabling interrupts in-between
> iterations, to give pending interrupts a chance to be served.
>
> This would mean that the data being written to would not be consistent,
> but it's a problem that would have to be addressed anyways, since it can
> be still read by other cores, while the write is ongoing.

This probably makes sense, except that enabling and disabling
interrupts means you also need to restore the original mm_struct (most
likely), which is slow.  I don't think there's a generic way to check
whether an interrupt is pending without turning interrupts on.


Re: [PATCH] hwmon (ina3221) Add single-shot mode support

2018-11-13 Thread Guenter Roeck
On Mon, Nov 12, 2018 at 08:58:24PM -0800, Nicolin Chen wrote:
> Hi Guenter,
> 
> On Mon, Nov 12, 2018 at 08:32:48PM -0800, Guenter Roeck wrote:
> > On Mon, Nov 12, 2018 at 08:23:53PM -0800, Nicolin Chen wrote:
> > > INA3221 supports both continuous and single-shot modes. When
> > > running in the continuous mode, it keeps measuring the inputs
> > > and converting them to the data register even if there are no
> > > users reading the data out. In this use case, this could be a
> > > power waste.
> > > 
> > > So this patch adds a single-shot mode support so that ina3221
> > > could do measurement and conversion only if users trigger it,
> > > depending on the use case where it only needs to poll data in
> > > a lower frequency.
> > > 
> > > The change also exposes "mode" and "available_modes" nodes to
> > > allow users to switch between two operating modes.
> > > 
> > Lots and lots of complexity for little gain. Sorry, I don't see
> > the point of this change.
> 
> The chip is causing considerable power waste on battery-powered
> devices so we typically use it running in the single-shot mode.
> 

And you need to be able to do that with a sysfs attribute ?
Are you planning to have some code switching back and forth
between the modes ?

You'll need to provide a good rationale why this needs to be
runtime configurable.

Guenter

> Although the chip can now be powered down, we still need to
> occasionally poll it for power measurement and critical alerts,
> so single-shot mode is the best choice for us, considering that
> the power-down-and-up routine would be way heavier.
> 
> I could understand that you don't really like it, but it's some
> feature that we truly need. Do you have any suggestion to write
> the code that can make it more convincing to you?
> 
> Thanks
> Nicolin


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Oleksandr Natalenko

On 13.11.2018 18:10, Timofey Titovets wrote:

You mean try something like this, right?

read_lock(&tasklist_lock);
  
  task_lock(task);
read_unlock(&tasklist_lock);
last_pid = task_pid_nr(task);
ksm_import_task_vma(task);
  task_unlock(task);


No, task_lock() uses spin_lock() under the bonnet, so this will be the 
same.


Since the sole reason you have to lock/acquire/get a reference to 
task_struct here is to prevent it from disappearing, I was thinking 
about using get_task_struct(), which just increases atomic 
task_struct.usage value (IOW, takes a reference). I *hope* this will be 
enough to prevent task_struct from disappearing in the meantime.


Someone, correct me if I'm wrong.

--
  Oleksandr Natalenko (post-factum)


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Nadav Amit
From: Andy Lutomirski
Sent: November 13, 2018 at 5:16:09 PM GMT
> To: Igor Stoppa 
> Cc: Kees Cook , Peter Zijlstra , 
> Nadav Amit , Mimi Zohar , 
> Matthew Wilcox , Dave Chinner , 
> James Morris , Michal Hocko , Kernel 
> Hardening , linux-integrity 
> , LSM List 
> , Igor Stoppa 
> , Dave Hansen , Jonathan 
> Corbet , Laura Abbott , Randy Dunlap 
> , Mike Rapoport , open 
> list:DOCUMENTATION , LKML 
> , Thomas Gleixner 
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa  wrote:
>> Hello,
>> I've been studying v4 of the patch-set [1] that Nadav has been working on.
>> Incidentally, I think it would be useful to cc also the
>> security/hardening ml.
>> The patch-set seems to be close to final, so I am resuming this discussion.
>> 
>> On 30/10/2018 19:06, Andy Lutomirski wrote:
>> 
>>> I support the addition of a rare-write mechanism to the upstream kernel.  
>>> And I think that there is only one sane way to implement it: using an 
>>> mm_struct. That mm_struct, just like any sane mm_struct, should only differ 
>>> from init_mm in that it has extra mappings in the *user* region.
>> 
>> After reading the code, I see what you meant.
>> I think I can work with it.
>> 
>> But I have a couple of questions wrt the use of this mechanism, in the
>> context of write rare.
>> 
>> 
>> 1) mm_struct.
>> 
>> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
>> (live patch?), which seems to happen sequentially and in a relatively
>> standardized way, like replacing the NOPs specifically placed in the
>> functions that need patching.
>> 
>> This is a bit different from the more generic write-rare case, applied
>> to data.
>> 
>> As example, I have in mind a system where both IMA and SELinux are in use.
>> 
>> In this system, a file is accessed for the first time.
>> 
>> That would trigger 2 things:
>> - evaluation of the SELinux rules and probably update of the AVC cache
>> - IMA measurement and update of the measurements
>> 
>> Both of them could be write protected, meaning that they would both have
>> to be modified through the write rare mechanism.
>> 
>> While the events, for 1 specific file, would be sequential, it's not
>> difficult to imagine that multiple files could be accessed at the same time.
>> 
>> If the update of the data structures in both IMA and SELinux must use
>> the same mm_struct, that would have to be somehow regulated and it would
>> introduce an unnecessary (imho) dependency.
>> 
>> How about having one mm_struct for each writer (core or thread)?
> 
> I don't think that helps anything.  I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it) should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved.  It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU.  Sadly, there isn't.
> 
>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
> 
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working.  I think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant.  For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.
> 
>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>> 
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>> 
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
> 
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow.  I don't think there's a generic way to check
> whether an interrupt is pending without turning interrupts on.

I guess that enabling IRQs might break some hidden assumptions in the code,
but is there a fundamental reason that IRQs need to be disabled? use_mm()
got them enabled, although it is only suitable for kernel threads.



Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
Tue, 13 Nov 2018 at 20:27, Oleksandr Natalenko :
>
> On 13.11.2018 18:10, Timofey Titovets wrote:
> > You mean try something like this, right?
> >
> > read_lock(&tasklist_lock);
> >   
> >   task_lock(task);
> > read_unlock(&tasklist_lock);
> > last_pid = task_pid_nr(task);
> > ksm_import_task_vma(task);
> >   task_unlock(task);
>
> No, task_lock() uses spin_lock() under the bonnet, so this will be the
> same.
>
> Since the sole reason you have to lock/acquire/get a reference to
> task_struct here is to prevent it from disappearing, I was thinking
> about using get_task_struct(), which just increases atomic
> task_struct.usage value (IOW, takes a reference). I *hope* this will be
> enough to prevent task_struct from disappearing in the meantime.
>
> Someone, correct me if I'm wrong.

That's brilliant, I was just missing that API.
It must do exactly what I want.

Thanks!

> --
>Oleksandr Natalenko (post-factum)
>


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Andy Lutomirski
On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit  wrote:
>
> From: Andy Lutomirski
> Sent: November 13, 2018 at 5:16:09 PM GMT
> > To: Igor Stoppa 
> > Cc: Kees Cook , Peter Zijlstra 
> > , Nadav Amit , Mimi Zohar 
> > , Matthew Wilcox , Dave 
> > Chinner , James Morris , Michal 
> > Hocko , Kernel Hardening 
> > , linux-integrity 
> > , LSM List 
> > , Igor Stoppa 
> > , Dave Hansen , 
> > Jonathan Corbet , Laura Abbott , Randy 
> > Dunlap , Mike Rapoport , 
> > open list:DOCUMENTATION , LKML 
> > , Thomas Gleixner 
> > Subject: Re: [PATCH 10/17] prmem: documentation
> >
> >
> > On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa  wrote:
> >> Hello,
> >> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> >> Incidentally, I think it would be useful to cc also the
> >> security/hardening ml.
> >> The patch-set seems to be close to final, so I am resuming this discussion.
> >>
> >> On 30/10/2018 19:06, Andy Lutomirski wrote:
> >>
> >>> I support the addition of a rare-write mechanism to the upstream kernel.  
> >>> And I think that there is only one sane way to implement it: using an 
> >>> mm_struct. That mm_struct, just like any sane mm_struct, should only 
> >>> differ from init_mm in that it has extra mappings in the *user* region.
> >>
> >> After reading the code, I see what you meant.
> >> I think I can work with it.
> >>
> >> But I have a couple of questions wrt the use of this mechanism, in the
> >> context of write rare.
> >>
> >>
> >> 1) mm_struct.
> >>
> >> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> >> (live patch?), which seems to happen sequentially and in a relatively
> >> standardized way, like replacing the NOPs specifically placed in the
> >> functions that need patching.
> >>
> >> This is a bit different from the more generic write-rare case, applied
> >> to data.
> >>
> >> As example, I have in mind a system where both IMA and SELinux are in use.
> >>
> >> In this system, a file is accessed for the first time.
> >>
> >> That would trigger 2 things:
> >> - evaluation of the SELinux rules and probably update of the AVC cache
> >> - IMA measurement and update of the measurements
> >>
> >> Both of them could be write protected, meaning that they would both have
> >> to be modified through the write rare mechanism.
> >>
> >> While the events, for 1 specific file, would be sequential, it's not
> >> difficult to imagine that multiple files could be accessed at the same 
> >> time.
> >>
> >> If the update of the data structures in both IMA and SELinux must use
> >> the same mm_struct, that would have to be somehow regulated and it would
> >> introduce an unnecessary (imho) dependency.
> >>
> >> How about having one mm_struct for each writer (core or thread)?
> >
> > I don't think that helps anything.  I think the mm_struct used for
> > prmem (or rare_write or whatever you want to call it) should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved.  It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU.  Sadly, there isn't.
> >
> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working.  i think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant.  For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
> >
> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow.  I don't think there's a generic way to check
> > whether in interrupt is pending without turning interrupts on.
>
> I guess that enabling IRQs might break some hidden assumptions in the code,
> but is ther

Re: [PATCH 01/12] kernfs: add function to find kernfs_node without increasing ref counter

2018-11-13 Thread Paolo Valente



> Il giorno 12 nov 2018, alle ore 13:28, Greg Kroah-Hartman 
>  ha scritto:
> 
> On Mon, Nov 12, 2018 at 10:56:21AM +0100, Paolo Valente wrote:
>> From: Angelo Ruocco 
>> 
>> The kernfs pseudo file system doesn't export any function to only find
>> a node by name, without also getting a reference on it.
>> But in some cases it is useful to just locate a kernfs node, while
>> using it or not depends on some other condition.
>> 
>> This commit adds a function to just look for a node, without getting
>> a reference on it.
> 
> Eeek, that sounds really bad.  So you save off a pointer to something,
> and have no idea if that pointer now really is valid or not?  It can
> instantly disappear right afterwards.
> 

Hi Greg,
that function is invoked only in functions executed with cgroup_mutex
held.  This guarantees that nothing disappears or becomes
inconsistent.  That's why we decided to go for this optimization,
instead of doing useless gets&puts pairs.  Still, I'm not expert
enough to state whether it is impossible that, once we have defined
that function, it may then get used in some unsafe way.

So, I seem to see two options:
1) Add a comment on the function, saying that cgroup_mutex must be
   held while invoking it (I guess you won't like this one).
2) Do not define such a new function, and, in the other patches, use
   the already-available find_and_get.

Looking forward to your feedback (or of other knowledgeable people on
this issue) before proceeding to a rebased V2,
Paolo


> This feels wrong, what is the problem of having a properly reference
> counted object passed back to you that you have to create a dangerous
> function like this?
> 
> greg k-h



Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Pavel Tatashin
On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> Hi.
> 
> > Yep. However, so far, it requires an application to explicitly opt in
> > to this behavior, so it's not all that bad. Your patch would remove
> > the requirement for application opt-in, which, in my opinion, makes
> > this way worse and reduces the number of applications for which this
> > is acceptable.
> 
> The default is to maintain the old behaviour, so unless the explicit
> decision is made by the administrator, no extra risk is imposed.

The new interface would be more tolerable if it honored MADV_UNMERGEABLE:

KSM default on: merge everything except when MADV_UNMERGEABLE is
explicitly set.

KSM default off: merge only when MADV_MERGEABLE is set.

The proposed change won't honor MADV_UNMERGEABLE, meaning that
application programmers won't have a way to prevent sensitive data from
ever being merged. So, I think, we should keep allowing an explicit
opt-out option for applications.

> 
> > As far as I know, basically nobody is using KSM at this point. There
> > are blog posts from several cloud providers about these security risks
> > that explicitly state that they're not using memory deduplication.
> 
> I tend to disagree here. Based on both what my company does and what UKSM
> users do, memory dedup is a desired option (note "option" word here, not the
> default choice).

Lightweight containers are a use case for KSM: many VMs sharing the
same small kernel. KSM is used in production by large cloud vendors.

Thank you,
Pasha


[PATCHv11 0/8] Add Intel Stratix10 FPGA manager and service layer

2018-11-13 Thread richard . gong
From: Richard Gong 

This is the 11th submission of the Intel Stratix10 service layer and FPGA
manager driver patches. Starting from the 10th submission, the Stratix10
service layer driver .c file has moved to drivers/firmware and its header
files to include/linux/firmware/intel, and the Stratix10 service layer
document is now part of the other firmware interfaces documentation.

Stratix10 service layer patches have been reviewed internally by Alan Tull
and other colleagues at Intel.

Some features of the Intel Stratix10 SoC require a level of privilege
higher than the kernel is granted. Such secure features include
FPGA programming, remote status update, read and write the secure
registers. In terms of the ARMv8 architecture, the kernel runs at Exception
Level 1 (EL1), access to the features requires Exception Level 3 (EL3).

The Intel Stratix10 service layer provides kernel APIs for drivers to
request access to the secure features. The requests are queued and
processed one by one. ARM’s SMCCC is used to pass the execution of the
requests on to a secure monitor (EL3).

Later the Intel Stratix10 service layer driver will be extended to provide
services for QSPI, Crypto and warm reset.

v2: add patches for FPGA manager, FPGA manager binding, dts and defconfig
remove intel-service subdirectory and intel-service.h, move intel-smc.h
and intel-service.c to driver/misc subdirectory.
remove global variables.
change service layer driver be 'default n'.
correct SPDX markers.
add timeout for do..while() loop.
add kernel-doc for the functions and structs, correct multiline
comments.
replace kfifo_in and kfifo_out with kfifo_in_spinlocked and
kfifo_out_spinlocked.
rename struct intel_svc_data (at client header) to
intel_svc_client_msg.
rename struct intel_svc_private_mem to intel_svc_data.
other corrections/changes from Intel internal code reviews.
v3: change all exported functions with "intel_svc_" as the prefix.
increase timeout values for claiming back submitted buffer(s).
rename struct intel_command_reconfig_payload to
struct intel_svc_command_reconfig_payload.
add pr_err() to provide the error return value.
change to put fpga_mgr node under firmware/svc node.
change to FPGA manager to align the update of service client APIs, and
the update of fpga_mgr device node.
Other corrections/changes.
v4: s/intel/stratix10/ on some variables, structs, functions, and file names
intel-service.c -> stratix10-svc.c
intel-smc.h -> stratix10-smc.h
intel-service-client.h -> stratix10-svc-client.h.
remove non-kernel-doc formatting.
s/fpga-mgr@0/fpga-mgr/ to remove unit_address at fpga_mgr node.
add Rob's Reviewed-by.
add Richard's signed-off-by.
v5: add a new API statix10_svc_done() which is called by service client
when client request is completed or error occurs during request
process. Which allows service layer to free its resources.
remove dummy client from service layer client header and service layer
source file.
add Rob's Reviewed-by.
add a new file stratix10-svc.rst and add that to driver-api/index.rst.
kernel-doc fixes.
v6: replace kthread_create_on_cpu() with kthread_create_on_node().
extend stratix_svc_send() to support service client which doesn't use
memory allocated by service layer.
add S10_RECONFIG_TIMEOUT.
rename s/S10_BUF_TIMEOUT/S10_BUFFER_TIMEOUT/.
fix service layer and FPGA manager Klocwork errors.
v7: add remote status update client support.
s/pr_debug/dev_dbg, s/dev_info/dev_dbg.
add unlock buffer if s10_svc_send_msg() fails.
add release channel if fpga_mgr_create() fails.
handle invalid pointer at svc if the client passed an invalid name.
v8: move stratix10-smc.h to include/linux from driver/misc.
revert version 7 error code & smc function ID value changes at
stratix10-smc.h.
add a goto and common error handling at the end of fpga driver's.
probe function.
v9: remove a patch on defconfig for enable service layer and FPGA manager.
resolve a issue found at git-bisect test.
remove kernel-doc markups that aren't being built.
v10:move stratix10-smc.h and stratix10-svc-client.h to
include/linux/firmware/intel.
move stratix10-svc.c to drivers/firmware.
s/STRATIX10_SERVICE/INTEL_STRATIX10_SERVICE, fix a Klocwork error at
service layer driver.
s/stratix10_svc_command_reconfig_payload/
stratix10_svc_command_config_type.
add stratix10 service layer document to other firmware interface.
update path for the included header in Stratix10 FPGA manager driver.
v11:add Acked-by Moritz Fischer
add additional space at Kconfig file's help text
fix a type in commit message from patch #8

Alan Tull (3):
  dt-bindings: fpga: add Stratix10 SoC FPGA manager binding
  arm64: dts: stratix10: add fpga manager and region
  fpga: add intel stratix10 soc fpga manager driver

Richard Gong (5):
  dt-bindings, firmware: add Intel Stratix10 ser

[PATCHv11 2/8] arm64: dts: stratix10: add stratix10 service driver binding to base dtsi

2018-11-13 Thread richard . gong
From: Richard Gong 

Add Intel Stratix10 service layer to the device tree

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
Acked-by: Moritz Fischer 
---
v2: change to put service layer driver node under the firmware node
change compatible to "intel, stratix10-svc"
v3: no change
v4: s/service driver/stratix10 service driver/ in subject line
v5: no change
v6: add Moritz's Acked-by
v7: no change
v8: no change
v9: no change
v10: no change
v11: no change
---
 arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi | 21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
index fef7351..519b16e 100644
--- a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
+++ b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
@@ -24,6 +24,19 @@
#address-cells = <2>;
#size-cells = <2>;
 
+   reserved-memory {
+   #address-cells = <2>;
+   #size-cells = <2>;
+   ranges;
+
+   service_reserved: svcbuffer@0 {
+   compatible = "shared-dma-pool";
+   reg = <0x0 0x0 0x0 0x100>;
+   alignment = <0x1000>;
+   no-map;
+   };
+   };
+
cpus {
#address-cells = <1>;
#size-cells = <0>;
@@ -537,5 +550,13 @@
 
status = "disabled";
};
+
+   firmware {
+   svc {
+   compatible = "intel,stratix10-svc";
+   method = "smc";
+   memory-region = <&service_reserved>;
+   };
+   };
};
 };
-- 
2.7.4



[PATCHv11 1/8] dt-bindings, firmware: add Intel Stratix10 service layer binding

2018-11-13 Thread richard . gong
From: Richard Gong 

Add a device tree binding for the Intel Stratix10 service layer driver

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
Reviewed-by: Rob Herring 
Acked-by: Moritz Fischer 
---
v2: change to put service layer driver node under the firmware node
change compatible to "intel, stratix10-svc"
v3: no change
v4: add Rob's Reviewed-by
v5: no change
v6: add Moritz's Acked-by
v7: no change
v8: no change
v9: no change
v10: no change
v11: no change
---
 .../bindings/firmware/intel,stratix10-svc.txt  | 57 ++
 1 file changed, 57 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt

diff --git a/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
new file mode 100644
index 000..1fa6606
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
@@ -0,0 +1,57 @@
+Intel Service Layer Driver for Stratix10 SoC
+
+Intel Stratix10 SoC is composed of a 64 bit quad-core ARM Cortex A53 hard
+processor system (HPS) and Secure Device Manager (SDM). When the FPGA is
+configured from HPS, there needs to be a way for HPS to notify SDM the
+location and size of the configuration data. Then SDM will get the
+configuration data from that location and perform the FPGA configuration.
+
+To meet the whole system security needs and support virtual machine requesting
+communication with SDM, only the secure world of software (EL3, Exception
+Layer 3) can interface with SDM. All software entities running on other
+exception layers must channel through the EL3 software whenever it needs
+service from SDM.
+
+Intel Stratix10 service layer driver, running at privileged exception level
+(EL1, Exception Layer 1), interfaces with the service providers and provides
+the services for FPGA configuration, QSPI, Crypto and warm reset. Service layer
+driver also manages secure monitor call (SMC) to communicate with secure monitor
+code running in EL3.
+
+Required properties:
+---
+The svc node has the following mandatory properties, must be located under
+the firmware node.
+
+- compatible: "intel,stratix10-svc"
+- method: smc or hvc
+smc - Secure Monitor Call
+hvc - Hypervisor Call
+- memory-region:
+   phandle to the reserved memory node. See
+   Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+   for details
+
+Example:
+---
+
+   reserved-memory {
+#address-cells = <2>;
+#size-cells = <2>;
+ranges;
+
+service_reserved: svcbuffer@0 {
+compatible = "shared-dma-pool";
+reg = <0x0 0x0 0x0 0x100>;
+alignment = <0x1000>;
+no-map;
+};
+};
+
+   firmware {
+   svc {
+   compatible = "intel,stratix10-svc";
+   method = "smc";
+   memory-region = <&service_reserved>;
+   };
+   };
-- 
2.7.4



[PATCHv11 3/8] firmware: add Intel Stratix10 service layer driver

2018-11-13 Thread richard . gong
From: Richard Gong 

Some features of the Intel Stratix10 SoC require a level of privilege
higher than the kernel is granted. Such secure features include
FPGA programming. In terms of the ARMv8 architecture, the kernel runs
at Exception Level 1 (EL1), access to the features requires
Exception Level 3 (EL3).

The Intel Stratix10 SoC service layer provides an in kernel API for
drivers to request access to the secure features. The requests are queued
and processed one by one. ARM’s SMCCC is used to pass the execution
of the requests on to a secure monitor (EL3).

The header file stratix10-svc-client.h defines the interface between
service providers (FPGA manager is one of them) and service layer.

The header file stratix10-smc.h defines the secure monitor call (SMC)
message protocols used for service layer driver in normal world
(EL1) to communicate with secure monitor SW in secure monitor exception
level 3 (EL3).

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
---
v2: remove intel-service subdirectory and intel-service.h, move
intel-smc.h and intel-service.c to driver/misc subdirectory
correct SPDX markers
change service layer driver be 'default n'
remove global variables
add timeout for do..while() loop
add kernel-doc for the functions and structs, correct multiline comments
replace kfifo_in/kfifo_out with kfifo_in_spinlocked/kfifo_out_spinlocked
rename struct intel_svc_data (at client header) to intel_svc_client_msg
rename struct intel_svc_private_mem to intel_svc_data
Other corrections/changes from Intel internal code reviews
v3: change all exported functions with "intel_svc_" as the prefix
increase timeout values for claiming back submitted buffer(s)
rename struct intel_command_reconfig_payload to
struct intel_svc_command_reconfig_payload
add pr_err() to provide the error return value
other corrections/changes
v4: s/intel/stratix10/ on some variables, structs, functions, and file names
intel-service.c -> stratix10-svc.c
intel-smc.h -> stratix10-smc.h
intel-service-client.h -> stratix10-svc-client.h
remove non-kernel-doc formatting
v5: add a new API statix10_svc_done() which is called by service client
when client request is completed or error occurs during request
process. Which allows service layer to free its resources.
remove dummy client from service layer client header and service
layer source file.
kernel-doc fixes
v6: replace kthread_create_on_cpu() with kthread_create_on_node()
extend stratix_svc_send() to support service client which doesn't
use memory allocated by service layer
fix several Klocwork errors
v7: handle invalid pointer if the client passed an invalid name
v8: move stratix10-smc.h to include/linux
revert version 7 error code & smc function ID value changes
at stratix10-smc.h
v9: remove handle for COMMAND_RSU_UPDATE at stratix10-svc.c
remove RSU related definitions at stratix10-smc.h
v10: s/stratix10_svc_command_reconfig_payload/stratix10_svc_command_config_type
 move stratix10-smc.h and stratix10-svc-client.h to
 include/linux/firmware/intel
 move stratix10-svc.c to drivers/firmware
 s/STRATIX10_SERVICE/INTEL_STRATIX10_SERVICE
 fix a Klocwork error
v11: add additional space at Kconfig file's help text
---
 drivers/firmware/Kconfig   |   12 +
 drivers/firmware/Makefile  |1 +
 drivers/firmware/stratix10-svc.c   | 1013 
 include/linux/firmware/intel/stratix10-smc.h   |  265 +
 .../linux/firmware/intel/stratix10-svc-client.h|  201 
 5 files changed, 1492 insertions(+)
 create mode 100644 drivers/firmware/stratix10-svc.c
 create mode 100644 include/linux/firmware/intel/stratix10-smc.h
 create mode 100644 include/linux/firmware/intel/stratix10-svc-client.h

diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index 7273e50..f754578 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -216,6 +216,18 @@ config FW_CFG_SYSFS_CMDLINE
  WARNING: Using incorrect parameters (base address in particular)
  may crash your system.
 
+config INTEL_STRATIX10_SERVICE
+   tristate "Intel Stratix10 Service Layer"
+   depends on HAVE_ARM_SMCCC
+   default n
+   help
+ Intel Stratix10 service layer runs at privileged exception level,
+ interfaces with the service providers (FPGA manager is one of them)
+ and manages secure monitor call to communicate with secure monitor
+ software at secure monitor exception level.
+
+ Say Y here if you want Stratix10 service layer support.
+
 config QCOM_SCM
bool
depends on ARM || ARM64
diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index 3158dff..80feb63 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_DMI_SYSFS)   += dmi-s

[PATCHv11 7/8] Documentation: driver-api: add stratix10 service layer

2018-11-13 Thread richard . gong
From: Richard Gong 

Add Intel Stratix10 service layer driver document

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
---
v5: this patch is added in patch set version 5,
add new file stratix10-svc.rst
add stratix10-svc.rst to driver-api/index.rst
v6: no change
v7: no change
v8: no change
v9: no change
v10: add stratix10 service layer document to other firmware
 interface, and then remove stratix10-svc.rst file
v11: no change
---
 .../driver-api/firmware/other_interfaces.rst   | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/Documentation/driver-api/firmware/other_interfaces.rst b/Documentation/driver-api/firmware/other_interfaces.rst
index 36c47b1..a4ac54b 100644
--- a/Documentation/driver-api/firmware/other_interfaces.rst
+++ b/Documentation/driver-api/firmware/other_interfaces.rst
@@ -13,3 +13,33 @@ EDD Interfaces
 .. kernel-doc:: drivers/firmware/edd.c
:internal:
 
+Intel Stratix10 SoC Service Layer
+---------------------------------
+Some features of the Intel Stratix10 SoC require a level of privilege
+higher than the kernel is granted. Such secure features include
+FPGA programming. In terms of the ARMv8 architecture, the kernel runs
+at Exception Level 1 (EL1), access to the features requires
+Exception Level 3 (EL3).
+
+The Intel Stratix10 SoC service layer provides an in kernel API for
+drivers to request access to the secure features. The requests are queued
+and processed one by one. ARM’s SMCCC is used to pass the execution
+of the requests on to a secure monitor (EL3).
+
+.. kernel-doc:: include/linux/firmware/intel/stratix10-svc-client.h
+   :functions: stratix10_svc_command_code
+
+.. kernel-doc:: include/linux/firmware/intel/stratix10-svc-client.h
+   :functions: stratix10_svc_client_msg
+
+.. kernel-doc:: include/linux/firmware/intel/stratix10-svc-client.h
+   :functions: stratix10_svc_command_reconfig_payload
+
+.. kernel-doc:: include/linux/firmware/intel/stratix10-svc-client.h
+   :functions: stratix10_svc_cb_data
+
+.. kernel-doc:: include/linux/firmware/intel/stratix10-svc-client.h
+   :functions: stratix10_svc_client
+
+.. kernel-doc:: drivers/firmware/stratix10-svc.c
+   :export:
-- 
2.7.4



[PATCHv11 8/8] firmware: add remote status update client support

2018-11-13 Thread richard . gong
From: Richard Gong 

Extend Intel Stratix10 service layer to support the second service layer
client, Remote Status Update (RSU).

RSU is used to provide our customers with protection against loading bad
bitstreams onto their devices when those devices are booting from flash.

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
---
v7: this patch is added in patch set version 7
v8: no change
v9: add case for COMMAND_RSU_UPDATE at svc_thread_recv_status_ok() at
stratix10-svc.c file
add RSU related definitions at stratix10-smc.h file
v10: s/misc/firmware at commit header
v11: correct a typo in commit message
---
 drivers/firmware/stratix10-svc.c   | 35 +++-
 include/linux/firmware/intel/stratix10-smc.h   | 47 ++
 .../linux/firmware/intel/stratix10-svc-client.h| 20 -
 3 files changed, 98 insertions(+), 4 deletions(-)

diff --git a/drivers/firmware/stratix10-svc.c b/drivers/firmware/stratix10-svc.c
index 168f523..81f3182 100644
--- a/drivers/firmware/stratix10-svc.c
+++ b/drivers/firmware/stratix10-svc.c
@@ -34,7 +34,7 @@
  * timeout is set to 30 seconds (30 * 1000) at Intel Stratix10 SoC.
  */
 #define SVC_NUM_DATA_IN_FIFO   32
-#define SVC_NUM_CHANNEL1
+#define SVC_NUM_CHANNEL2
 #define FPGA_CONFIG_DATA_CLAIM_TIMEOUT_MS  200
 #define FPGA_CONFIG_STATUS_TIMEOUT_SEC 30
 
@@ -271,7 +271,7 @@ static void svc_thread_cmd_config_status(struct stratix10_svc_controller *ctrl,
  * @cb_data: pointer to callback data structure to service client
  * @res: result from SMC or HVC call
  *
- * Send back the correspond status to the service client (FPGA manager etc).
+ * Send back the correspond status to the service clients.
  */
 static void svc_thread_recv_status_ok(struct stratix10_svc_data *p_data,
  struct stratix10_svc_cb_data *cb_data,
@@ -295,6 +295,9 @@ static void svc_thread_recv_status_ok(struct stratix10_svc_data *p_data,
case COMMAND_RECONFIG_STATUS:
cb_data->status = BIT(SVC_STATUS_RECONFIG_COMPLETED);
break;
+   case COMMAND_RSU_UPDATE:
+   cb_data->status = BIT(SVC_STATUS_RSU_OK);
+   break;
default:
pr_warn("it shouldn't happen\n");
break;
@@ -373,6 +376,16 @@ static int svc_normal_to_secure_thread(void *data)
a1 = 0;
a2 = 0;
break;
+   case COMMAND_RSU_STATUS:
+   a0 = INTEL_SIP_SMC_RSU_STATUS;
+   a1 = 0;
+   a2 = 0;
+   break;
+   case COMMAND_RSU_UPDATE:
+   a0 = INTEL_SIP_SMC_RSU_UPDATE;
+   a1 = pdata->arg[0];
+   a2 = 0;
+   break;
default:
pr_warn("it shouldn't happen\n");
break;
@@ -389,6 +402,19 @@ static int svc_normal_to_secure_thread(void *data)
 (unsigned int)res.a1, (unsigned int)res.a2);
pr_debug(" res.a3=0x%016x\n", (unsigned int)res.a3);
 
+   if (pdata->command == COMMAND_RSU_STATUS) {
+   if (res.a0 == INTEL_SIP_SMC_RSU_ERROR)
+   cbdata->status = BIT(SVC_STATUS_RSU_ERROR);
+   else
+   cbdata->status = BIT(SVC_STATUS_RSU_OK);
+
+   cbdata->kaddr1 = &res;
+   cbdata->kaddr2 = NULL;
+   cbdata->kaddr3 = NULL;
+   pdata->chan->scl->receive_cb(pdata->chan->scl, cbdata);
+   continue;
+   }
+
switch (res.a0) {
case INTEL_SIP_SMC_STATUS_OK:
svc_thread_recv_status_ok(pdata, cbdata, res);
@@ -941,6 +967,11 @@ static int stratix10_svc_drv_probe(struct platform_device *pdev)
chans[0].name = SVC_CLIENT_FPGA;
spin_lock_init(&chans[0].lock);
 
+   chans[1].scl = NULL;
+   chans[1].ctrl = controller;
+   chans[1].name = SVC_CLIENT_RSU;
+   spin_lock_init(&chans[1].lock);
+
list_add_tail(&controller->node, &svc_ctrl);
platform_set_drvdata(pdev, controller);
 
diff --git a/include/linux/firmware/intel/stratix10-smc.h b/include/linux/firmware/intel/stratix10-smc.h
index a109e4c..5be5dab 100644
--- a/include/linux/firmware/intel/stratix10-smc.h
+++ b/include/linux/firmware/intel/stratix10-smc.h
@@ -67,6 +67,12 @@
  *
  * INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR:
  * There is error during the FPGA configuration process.
+ *
+ * INTEL_SIP_SMC_REG_ERROR:
+ * There is error during a read or write operation of the protected registers.
+ *
+ * INTEL_SIP_SMC_RSU_ERROR:
+ * There is error during a remote status update.
  */
 #define I

Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Nadav Amit
From: Andy Lutomirski
Sent: November 13, 2018 at 5:47:16 PM GMT
> To: Nadav Amit 
> Cc: Igor Stoppa , Kees Cook , 
> Peter Zijlstra , Mimi Zohar , 
> Matthew Wilcox , Dave Chinner , 
> James Morris , Michal Hocko , Kernel 
> Hardening , linux-integrity 
> , LSM List 
> , Igor Stoppa 
> , Dave Hansen , Jonathan 
> Corbet , Laura Abbott , Randy Dunlap 
> , Mike Rapoport , open 
> list:DOCUMENTATION , LKML 
> , Thomas Gleixner 
> Subject: Re: [PATCH 10/17] prmem: documentation
> 
> 
> On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit  wrote:
>> From: Andy Lutomirski
>> Sent: November 13, 2018 at 5:16:09 PM GMT
>>> To: Igor Stoppa 
>>> Cc: Kees Cook , Peter Zijlstra 
>>> , Nadav Amit , Mimi Zohar 
>>> , Matthew Wilcox , Dave 
>>> Chinner , James Morris , Michal 
>>> Hocko , Kernel Hardening 
>>> , linux-integrity 
>>> , LSM List 
>>> , Igor Stoppa 
>>> , Dave Hansen , 
>>> Jonathan Corbet , Laura Abbott , Randy 
>>> Dunlap , Mike Rapoport , 
>>> open list:DOCUMENTATION , LKML 
>>> , Thomas Gleixner 
>>> Subject: Re: [PATCH 10/17] prmem: documentation
>>> 
>>> 
>>> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa  wrote:
 Hello,
 I've been studying v4 of the patch-set [1] that Nadav has been working on.
 Incidentally, I think it would be useful to cc also the
 security/hardening ml.
 The patch-set seems to be close to final, so I am resuming this discussion.
 
 On 30/10/2018 19:06, Andy Lutomirski wrote:
 
> I support the addition of a rare-write mechanism to the upstream kernel.  
> And I think that there is only one sane way to implement it: using an 
> mm_struct. That mm_struct, just like any sane mm_struct, should only 
> differ from init_mm in that it has extra mappings in the *user* region.
 
 After reading the code, I see what you meant.
 I think I can work with it.
 
 But I have a couple of questions wrt the use of this mechanism, in the
 context of write rare.
 
 
 1) mm_struct.
 
 Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
 (live patch?), which seems to happen sequentially and in a relatively
 standardized way, like replacing the NOPs specifically placed in the
 functions that need patching.
 
 This is a bit different from the more generic write-rare case, applied
 to data.
 
 As example, I have in mind a system where both IMA and SELinux are in use.
 
 In this system, a file is accessed for the first time.
 
 That would trigger 2 things:
 - evaluation of the SELinux rules and probably update of the AVC cache
 - IMA measurement and update of the measurements
 
 Both of them could be write protected, meaning that they would both have
 to be modified through the write rare mechanism.
 
 While the events, for 1 specific file, would be sequential, it's not
 difficult to imagine that multiple files could be accessed at the same 
 time.
 
 If the update of the data structures in both IMA and SELinux must use
 the same mm_struct, that would have to be somehow regulated and it would
 introduce an unnecessary (imho) dependency.
 
 How about having one mm_struct for each writer (core or thread)?
>>> 
>>> I don't think that helps anything.  I think the mm_struct used for
>>> prmem (or rare_write or whatever you want to call it) should be
>>> entirely abstracted away by an appropriate API, so neither SELinux nor
>>> IMA need to be aware that there's an mm_struct involved.  It's also
>>> entirely possible that some architectures won't even use an mm_struct
>>> behind the scenes -- x86, for example, could have avoided it if there
>>> were a kernel equivalent of PKRU.  Sadly, there isn't.
>>> 
 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
 the patch might spill across the page boundary, however if I deal with
 the modification of generic data, I shouldn't (shouldn't I?) assume that
 the data will not span across multiple pages.
>>> 
>>> The reason for the particular architecture of text_poke() is to avoid
>>> memory allocation to get it working.  i think that prmem/rare_write
>>> should have each rare-writable kernel address map to a unique user
>>> address, possibly just by offsetting everything by a constant.  For
>>> rare_write, you don't actually need it to work as such until fairly
>>> late in boot, since the rare_writable data will just be writable early
>>> on.
>>> 
 If the data spans across multiple pages, in unknown amount, I suppose
 that I should not keep interrupts disabled for an unknown time, as it
 would hurt preemption.
 
 What I thought, in my initial patch-set, was to iterate over each page
 that must be written to, in a loop, re-enabling interrupts in-between
 iterations, to give pending interrupts a chance to be served.
 
 This would mean that the data being written to would not be consistent,
 but it's a problem that would have to be addressed anyways, since it can
 be still read by other cores, while the write is ongoing.

[PATCHv11 5/8] arm64: dts: stratix10: add fpga manager and region

2018-11-13 Thread richard . gong
From: Alan Tull 

Add the Stratix10 FPGA manager and a FPGA region to the
device tree.

Signed-off-by: Alan Tull 
Signed-off-by: Richard Gong 
---
v2: this patch is added in patch set version 2
v3: change to put fpga_mgr node under firmware/svc node
v4: s/fpga-mgr@0/fpga-mgr/ to remove unit_address
add Richard's signed-off-by
v5: no change
v6: no change
v7: no change
v8: no change
v9: no change
v10: no change
v11: no change
---
 arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
index 519b16e..a20df0d 100644
--- a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
+++ b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
@@ -106,6 +106,14 @@
interrupt-parent = <&intc>;
ranges = <0 0 0 0xffffffff>;
 
+   base_fpga_region {
+   #address-cells = <0x1>;
+   #size-cells = <0x1>;
+
+   compatible = "fpga-region";
+   fpga-mgr = <&fpga_mgr>;
+   };
+
clkmgr: clock-controller@ffd10000 {
compatible = "intel,stratix10-clkmgr";
reg = <0xffd10000 0x1000>;
@@ -556,6 +564,10 @@
compatible = "intel,stratix10-svc";
method = "smc";
memory-region = <&service_reserved>;
+
+   fpga_mgr: fpga-mgr {
+   compatible = "intel,stratix10-soc-fpga-mgr";
+   };
};
};
};
-- 
2.7.4



[PATCHv11 6/8] fpga: add intel stratix10 soc fpga manager driver

2018-11-13 Thread richard . gong
From: Alan Tull 

Add driver for reconfiguring Intel Stratix10 SoC FPGA devices.
This driver communicates through the Intel service layer driver
which does communication with privileged hardware (that does the
FPGA programming) through a secure mailbox.

Signed-off-by: Alan Tull 
Signed-off-by: Richard Gong 
Acked-by: Moritz Fischer 
---
v2: this patch is added in patch set version 2
v3: change to align to the update of service client APIs, and the
update of fpga_mgr device node
v4: changes to align with stratix10-svc-client API updates
add Richard's signed-off-by
v5: update to align changes at service layer to minimize service
layer thread usages
v6: add S10_RECONFIG_TIMEOUT
rename s/S10_BUF_TIMEOUT/S10_BUFFER_TIMEOUT/
fix klocwork errors
v7: s/pr_debug/dev_dbg, s/dev_info/dev_dbg
add unlock buffer if s10_svc_send_msg() fails
add release channel if fpga_mgr_create() fails
v8: add a goto and common error handling at the end of s10_probe
function
add Moritz's Acked-by
v9: remove kernel-doc markups that aren't being built
v10: s/STRATIX10_SERVICE/INTEL_STRATIX10_SERVICE
 s/stratix10_svc_command_reconfig_payload/stratix10_svc_command_config_type
 update path for the included header file stratix10-svc-client.h
v11: no change
---
 drivers/fpga/Kconfig |   6 +
 drivers/fpga/Makefile|   1 +
 drivers/fpga/stratix10-soc.c | 535 +++
 3 files changed, 542 insertions(+)
 create mode 100644 drivers/fpga/stratix10-soc.c

diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
index 1ebcef4..0bb7b5c 100644
--- a/drivers/fpga/Kconfig
+++ b/drivers/fpga/Kconfig
@@ -56,6 +56,12 @@ config FPGA_MGR_ZYNQ_FPGA
help
  FPGA manager driver support for Xilinx Zynq FPGAs.
 
+config FPGA_MGR_STRATIX10_SOC
+   tristate "Intel Stratix10 SoC FPGA Manager"
+   depends on (ARCH_STRATIX10 && INTEL_STRATIX10_SERVICE)
+   help
+ FPGA manager driver support for the Intel Stratix10 SoC.
+
 config FPGA_MGR_XILINX_SPI
tristate "Xilinx Configuration over Slave Serial (SPI)"
depends on SPI
diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
index 7a2d73b..c0dd4c8 100644
--- a/drivers/fpga/Makefile
+++ b/drivers/fpga/Makefile
@@ -13,6 +13,7 @@ obj-$(CONFIG_FPGA_MGR_ICE40_SPI)  += ice40-spi.o
 obj-$(CONFIG_FPGA_MGR_MACHXO2_SPI) += machxo2-spi.o
 obj-$(CONFIG_FPGA_MGR_SOCFPGA) += socfpga.o
 obj-$(CONFIG_FPGA_MGR_SOCFPGA_A10) += socfpga-a10.o
+obj-$(CONFIG_FPGA_MGR_STRATIX10_SOC)   += stratix10-soc.o
 obj-$(CONFIG_FPGA_MGR_TS73XX)  += ts73xx-fpga.o
 obj-$(CONFIG_FPGA_MGR_XILINX_SPI)  += xilinx-spi.o
 obj-$(CONFIG_FPGA_MGR_ZYNQ_FPGA)   += zynq-fpga.o
diff --git a/drivers/fpga/stratix10-soc.c b/drivers/fpga/stratix10-soc.c
new file mode 100644
index 000..a1a09e0
--- /dev/null
+++ b/drivers/fpga/stratix10-soc.c
@@ -0,0 +1,535 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FPGA Manager Driver for Intel Stratix10 SoC
+ *
+ *  Copyright (C) 2018 Intel Corporation
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * FPGA programming requires a higher level of privilege (EL3), per the SoC
+ * design.
+ */
+#define NUM_SVC_BUFS   4
+#define SVC_BUF_SIZE   SZ_512K
+
+/* Indicates buffer is in use if set */
+#define SVC_BUF_LOCK   0
+
+#define S10_BUFFER_TIMEOUT (msecs_to_jiffies(SVC_RECONFIG_BUFFER_TIMEOUT_MS))
+#define S10_RECONFIG_TIMEOUT (msecs_to_jiffies(SVC_RECONFIG_REQUEST_TIMEOUT_MS))
+
+/*
+ * struct s10_svc_buf
+ * buf:  virtual address of buf provided by service layer
+ * lock: locked if buffer is in use
+ */
+struct s10_svc_buf {
+   char *buf;
+   unsigned long lock;
+};
+
+struct s10_priv {
+   struct stratix10_svc_chan *chan;
+   struct stratix10_svc_client client;
+   struct completion status_return_completion;
+   struct s10_svc_buf svc_bufs[NUM_SVC_BUFS];
+   unsigned long status;
+};
+
+static int s10_svc_send_msg(struct s10_priv *priv,
+   enum stratix10_svc_command_code command,
+   void *payload, u32 payload_length)
+{
+   struct stratix10_svc_chan *chan = priv->chan;
+   struct device *dev = priv->client.dev;
+   struct stratix10_svc_client_msg msg;
+   int ret;
+
+   dev_dbg(dev, "%s cmd=%d payload=%p length=%d\n",
+   __func__, command, payload, payload_length);
+
+   msg.command = command;
+   msg.payload = payload;
+   msg.payload_length = payload_length;
+
+   ret = stratix10_svc_send(chan, &msg);
+   dev_dbg(dev, "stratix10_svc_send returned status %d\n", ret);
+
+   return ret;
+}
+
+/*
+ * Free buffers allocated from the service layer's pool that are not in use.
+ * Return true when all buffers are freed.
+ */
+static bool s10_free_buffers(struct fpga_manager *mgr)
+{
+   struct s10_priv *priv = mgr->priv;
+   uint num_free = 0;
+   uint i;
+
+   for (i = 0; i 

[PATCHv11 4/8] dt-bindings: fpga: add Stratix10 SoC FPGA manager binding

2018-11-13 Thread richard . gong
From: Alan Tull 

Add a Device Tree binding for the Intel Stratix10 SoC FPGA manager.

Signed-off-by: Alan Tull 
Signed-off-by: Richard Gong 
Reviewed-by: Rob Herring 
Acked-by: Moritz Fischer 
---
v2: this patch is added in patch set version 2
v3: change to put fpga_mgr node under firmware/svc node
v4: s/fpga-mgr@0/fpga-mgr/ to remove unit_address
add Richard's signed-off-by
v5: add Reviewed-by Rob Herring
v6: no change
v7: no change
v8: no change
v9: no change
v10: no change
v11: add Moritz Fischer's Acked-by
---
 .../bindings/fpga/intel-stratix10-soc-fpga-mgr.txt  | 17 +
 1 file changed, 17 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt

diff --git a/Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt b/Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt
new file mode 100644
index 000..6e03f79
--- /dev/null
+++ b/Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt
@@ -0,0 +1,17 @@
+Intel Stratix10 SoC FPGA Manager
+
+Required properties:
+The fpga_mgr node has the following mandatory property and must be
+located under the firmware/svc node.
+
+- compatible : should contain "intel,stratix10-soc-fpga-mgr"
+
+Example:
+
+   firmware {
+   svc {
+   fpga_mgr: fpga-mgr {
+   compatible = "intel,stratix10-soc-fpga-mgr";
+   };
+   };
+   };
-- 
2.7.4



Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
Tue, 13 Nov 2018 at 20:59, Pavel Tatashin :
>
> On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > Hi.
> >
> > > Yep. However, so far, it requires an application to explicitly opt in
> > > to this behavior, so it's not all that bad. Your patch would remove
> > > the requirement for application opt-in, which, in my opinion, makes
> > > this way worse and reduces the number of applications for which this
> > > is acceptable.
> >
> > The default is to maintain the old behaviour, so unless the explicit
> > decision is made by the administrator, no extra risk is imposed.
>
> The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
>
> KSM default on: merge everything except when MADV_UNMERGEABLE is
> excplicitly set.
>
> KSM default off: merge only when MADV_MERGEABLE is set.
>
> The proposed change won't honor MADV_UNMERGEABLE, meaning that
> application programmers won't have a way to prevent sensitive data from
> ever being merged. So, I think, we should keep allowing an explicit
> opt-out option for applications.
>

We simply do not have a VM/madvise flag for that currently.
The same as THP: all the logic is written with the assumption that we
have exactly two states, allow/disallow (more like "not allow").

If we try to add one, it would have to be something like
MADV_FORBID_* to disallow something completely.

The same holds for THP (currently some apps simply refuse to start if
THP is enabled, because there is no way to forbid THP).

Thanks.

> >
> > > As far as I know, basically nobody is using KSM at this point. There
> > > are blog posts from several cloud providers about these security risks
> > > that explicitly state that they're not using memory deduplication.
> >
> > I tend to disagree here. Based on both what my company does and what UKSM
> > users do, memory dedup is a desired option (note "option" word here, not the
> > default choice).
>
> Lightweight containers is a use case for KSM: when many VMs share the
> same small kernel. KSM is used in production by large cloud vendors.
>
> Thank you,
> Pasha
>


[PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
KSM by default works only on memory that has been added by
madvise().

Currently the only ways to get it working on other applications are:
  * Use LD_PRELOAD and libraries
  * Patch kernel

Let's use the kernel task list and add logic to import VMAs from tasks.

That behaviour is controlled by new attributes:
  * mode:
I tried to mimic the hugepages attribute, so mode has two states:
  * madvise  - old default behaviour
  * always [new] - allow KSM to get each task's VMAs and
   try working on them.
  * seeker_sleep_millisecs:
Add pauses between importing each task's VMAs

For rate-limiting purposes and to bound tasklist locking time, the
ksm seeker thread only imports VMAs from one task per loop.

Some numbers from different not madvised workloads.
Formulas:
  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
  Memory used = free -h

  * Name: My working laptop
Description: Many different chrome/electron apps + KDE
Ratio: 5%
Saved: ~100  MiB
Used:  ~2000 MiB

  * Name: K8s test VM
Description: Some small random running docker images
Ratio: 40%
Saved: ~160 MiB
Used:  ~920 MiB

  * Name: Ceph test VM
Description: Ceph Mon/OSD, some containers
Ratio: 20%
Saved: ~60 MiB
Used:  ~600 MiB

  * Name: BareMetal K8s backend server
Description: Different server apps in containers C, Java, GO & etc
Ratio: 72%
Saved: ~5800 MiB
Used:  ~35.7 GiB

  * Name: BareMetal K8s processing server
Description: Many instance of one CPU intensive application
Ratio: 55%
Saved: ~2600 MiB
Used:  ~28.0 GiB

  * Name: BareMetal Ceph node
Description: Only OSD storage daemons running
Ratio: 2%
Saved: ~190 MiB
Used:  ~11.7 GiB

Changes:
  v1 -> v2:
* Rebase on v4.19.1 (must also apply on 4.20-rc2+)
  v2 -> v3:
* Reformat patch description
* Rename mode normal to madvise
* Add some memory numbers
* Fix checkpatch.pl warnings
* Separate ksm vma seeker to another kthread
* Fix: "BUG: scheduling while atomic: ksmd"
  by move seeker to another thread

Signed-off-by: Timofey Titovets 
CC: Matthew Wilcox 
CC: linux...@kvack.org
CC: linux-doc@vger.kernel.org
---
 Documentation/admin-guide/mm/ksm.rst |  15 ++
 mm/ksm.c | 215 +++
 2 files changed, 198 insertions(+), 32 deletions(-)

diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 9303786632d1..7cffd47f9b38 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -116,6 +116,21 @@ run
 Default: 0 (must be changed to 1 to activate KSM, except if
 CONFIG_SYSFS is disabled)
 
+mode
+* set always to allow KSM to deduplicate the memory of every process
+* set madvise to use only madvised memory
+
+Default: madvise (deduplicate only madvised memory as in
+earlier releases)
+
+seeker_sleep_millisecs
+how many milliseconds the ksmd task seeker should sleep before trying
+another task.
+e.g. ``echo 1000 > /sys/kernel/mm/ksm/seeker_sleep_millisecs``
+
+Default: 1000 (chosen for rate limit purposes)
+
+
 use_zero_pages
 specifies whether empty pages (i.e. allocated pages that only
 contain zeroes) should be treated specially.  When set to 1,
diff --git a/mm/ksm.c b/mm/ksm.c
index 5b0894b45ee5..1a03b28b6288 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -273,6 +273,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Milliseconds ksmd seeker should sleep between runs */
+static unsigned int ksm_thread_seeker_sleep_millisecs = 1000;
+
 /* Checksum of an empty (zeroed) page */
 static unsigned int zero_checksum __read_mostly;
 
@@ -295,7 +298,12 @@ static int ksm_nr_node_ids = 1;
 static unsigned long ksm_run = KSM_RUN_STOP;
 static void wait_while_offlining(void);
 
+#define KSM_MODE_MADVISE 0
+#define KSM_MODE_ALWAYS 1
+static unsigned long ksm_mode = KSM_MODE_MADVISE;
+
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
+static DECLARE_WAIT_QUEUE_HEAD(ksm_seeker_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
 static DEFINE_SPINLOCK(ksm_mmlist_lock);
 
@@ -303,6 +311,11 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
sizeof(struct __struct), __alignof__(struct __struct),\
(__flags), NULL)
 
+static inline int ksm_mode_always(void)
+{
+   return (ksm_mode == KSM_MODE_ALWAYS);
+}
+
 static int __init ksm_slab_init(void)
 {
rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
@@ -2389,6 +2402,106 @@ static int ksmd_should_run(void)
return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
 }
 
+
+static int ksm_enter(struct mm_struct *mm, unsigned long *vm_flags)
+{
+   int err;
+
+   if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
+ 

[PATCH V4] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
KSM by default works only on memory that has been added by
madvise().

Currently the only ways to get it working on other applications are:
  * Use LD_PRELOAD and libraries
  * Patch the kernel

Let's use the kernel task list and add logic to import VMAs from tasks.

That behaviour is controlled by new attributes:
  * mode:
I tried to mimic the hugepages attribute, so mode has two states:
  * madvise  - old default behaviour
  * always [new] - allow KSM to get each task's VMAs and
   try working on them.
  * seeker_sleep_millisecs:
Add pauses between importing each task's VMAs

For rate-limiting purposes and to bound tasklist locking time, the
ksm seeker thread only imports VMAs from one task per loop.

Some numbers from different not madvised workloads.
Formulas:
  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
  Memory used = free -h

  * Name: My working laptop
Description: Many different chrome/electron apps + KDE
Ratio: 5%
Saved: ~100  MiB
Used:  ~2000 MiB

  * Name: K8s test VM
Description: Some small random running docker images
Ratio: 40%
Saved: ~160 MiB
Used:  ~920 MiB

  * Name: Ceph test VM
Description: Ceph Mon/OSD, some containers
Ratio: 20%
Saved: ~60 MiB
Used:  ~600 MiB

  * Name: BareMetal K8s backend server
Description: Different server apps in containers C, Java, GO & etc
Ratio: 72%
Saved: ~5800 MiB
Used:  ~35.7 GiB

  * Name: BareMetal K8s processing server
Description: Many instance of one CPU intensive application
Ratio: 55%
Saved: ~2600 MiB
Used:  ~28.0 GiB

  * Name: BareMetal Ceph node
Description: Only OSD storage daemons running
Ratio: 2%
Saved: ~190 MiB
Used:  ~11.7 GiB

Changes:
  v1 -> v2:
* Rebase on v4.19.1 (must also apply on 4.20-rc2+)
  v2 -> v3:
* Reformat patch description
* Rename mode normal to madvise
* Add some memory numbers
* Separate ksm vma seeker to another kthread
* Fix: "BUG: scheduling while atomic: ksmd"
  by move seeker to another thread
  v3 -> v4:
* Fix again "BUG: scheduling while atomic"
  by get()/put() task API
* Remove unused variable "error"

Signed-off-by: Timofey Titovets 
CC: Matthew Wilcox 
CC: Oleksandr Natalenko 
CC: Pavel Tatashin 
CC: linux...@kvack.org
CC: linux-doc@vger.kernel.org
---
 Documentation/admin-guide/mm/ksm.rst |  15 ++
 mm/ksm.c | 217 +++
 2 files changed, 200 insertions(+), 32 deletions(-)

diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 9303786632d1..7cffd47f9b38 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -116,6 +116,21 @@ run
 Default: 0 (must be changed to 1 to activate KSM, except if
 CONFIG_SYSFS is disabled)
 
+mode
+* set always to allow KSM to deduplicate the memory of every process
+* set madvise to use only madvised memory
+
+Default: madvise (deduplicate only madvised memory as in
+earlier releases)
+
+seeker_sleep_millisecs
+how many milliseconds the ksmd task seeker should sleep before trying
+another task.
+e.g. ``echo 1000 > /sys/kernel/mm/ksm/seeker_sleep_millisecs``
+
+Default: 1000 (chosen for rate limit purposes)
+
+
 use_zero_pages
 specifies whether empty pages (i.e. allocated pages that only
 contain zeroes) should be treated specially.  When set to 1,
diff --git a/mm/ksm.c b/mm/ksm.c
index 5b0894b45ee5..f087eefda8b2 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -273,6 +273,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Milliseconds ksmd seeker should sleep between runs */
+static unsigned int ksm_thread_seeker_sleep_millisecs = 1000;
+
 /* Checksum of an empty (zeroed) page */
 static unsigned int zero_checksum __read_mostly;
 
@@ -295,7 +298,12 @@ static int ksm_nr_node_ids = 1;
 static unsigned long ksm_run = KSM_RUN_STOP;
 static void wait_while_offlining(void);
 
+#define KSM_MODE_MADVISE 0
+#define KSM_MODE_ALWAYS 1
+static unsigned long ksm_mode = KSM_MODE_MADVISE;
+
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
+static DECLARE_WAIT_QUEUE_HEAD(ksm_seeker_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
 static DEFINE_SPINLOCK(ksm_mmlist_lock);
 
@@ -303,6 +311,11 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
sizeof(struct __struct), __alignof__(struct __struct),\
(__flags), NULL)
 
+static inline int ksm_mode_always(void)
+{
+   return (ksm_mode == KSM_MODE_ALWAYS);
+}
+
 static int __init ksm_slab_init(void)
 {
rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
@@ -2389,6 +2402,108 @@ static int ksmd_should_run(void)
return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
 }
 
+
+static int ksm_enter(struct

Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Igor Stoppa
On 13/11/2018 19:16, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa  wrote:

[...]

>> How about having one mm_struct for each writer (core or thread)?
>>
> 
> I don't think that helps anything.  I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it)

write_rare / rarely can be shortened to wr_, which is less confusing
than rare_write, since that would become rw_ and be easy to confuse
with R/W.

Any advice for better naming is welcome.

> should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved.

Yes, that is fine. In my proposal I was thinking about tying it to the
core/thread that performs the actual write.

The high level API could be something like:

wr_memcpy(void *src, void *dst, uint_t size)

>  It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU.  Sadly, there isn't.

The mm_struct - or whatever is the means to do the write on that
architecture - can be kept hidden from the API.

But the reason why I was proposing to have one mm_struct per writer is
that, iiuc, the secondary mapping is created in the secondary mm_struct
for each writer using it.

So the updating of IMA measurements would have, theoretically, also
write access to the SELinux AVC. Which I was trying to avoid.
And similarly any other write rare updater. Is this correct?

>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
> 
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working.  i think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant.  For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.

Yes, that is true. I think it's safe to assume, from an attack pattern,
that as long as user space is not started, the system can be considered
ok. Even user-space code run from initrd should be ok, since it can be
bundled (and signed) as a single binary with the kernel.

Modules loaded from a regular filesystem are a bit more risky, because
an attack might inject a rogue key in the key-ring and use it to load
malicious modules.

>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>>
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>>
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
> 
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow.  I don't think there's a generic way to check
> whether in interrupt is pending without turning interrupts on.

The only "excuse" I have is that write_rare is opt-in and is "rare".
Maybe the enabling/disabling of interrupts - and the consequent switch
of mm_struct - could be somehow tied to the latency configuration?

If preemption is disabled, the expectations on the system latency are
anyway more relaxed.

But I'm not sure how it would work against I/O.

--
igor


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Igor Stoppa
On 13/11/2018 19:47, Andy Lutomirski wrote:

> For general rare-writish stuff, I don't think we want IRQs running
> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.

Why would these be less sensitive?

But I see a big difference between my initial implementation and this one.

In my case, by using a shared mapping, visible to all cores, freezing
the core that is performing the write would have exposed the writable
mapping to a potential attack run from another core.

If the mapping is private to the core performing the write, even if it
is frozen, it's much harder to figure out what it had mapped and where,
from another core.

To access that mapping, the attack should be performed from the ISR, I
think.

--
igor


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Igor Stoppa
I forgot one sentence :-(

On 13/11/2018 20:31, Igor Stoppa wrote:
> On 13/11/2018 19:47, Andy Lutomirski wrote:
> 
>> For general rare-writish stuff, I don't think we want IRQs running
>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
> 
> Why would these be less sensitive?
> 
> But I see a big difference between my initial implementation and this one.
> 
> In my case, by using a shared mapping, visible to all cores, freezing
> the core that is performing the write would have exposed the writable
> mapping to a potential attack run from another core.
> 
> If the mapping is private to the core performing the write, even if it
> is frozen, it's much harder to figure out what it had mapped and where,
> from another core.
> 
> To access that mapping, the attack should be performed from the ISR, I
> think.

Unless the secondary mapping is also available to other cores, through
the shared mm_struct ?

--
igor


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Pavel Tatashin
On 18-11-13 21:17:42, Timofey Titovets wrote:
> Tue, 13 Nov 2018 at 20:59, Pavel Tatashin :
> >
> > On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > > Hi.
> > >
> > > > Yep. However, so far, it requires an application to explicitly opt in
> > > > to this behavior, so it's not all that bad. Your patch would remove
> > > > the requirement for application opt-in, which, in my opinion, makes
> > > > this way worse and reduces the number of applications for which this
> > > > is acceptable.
> > >
> > > The default is to maintain the old behaviour, so unless the explicit
> > > decision is made by the administrator, no extra risk is imposed.
> >
> > The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
> >
> > KSM default on: merge everything except when MADV_UNMERGEABLE is
> > excplicitly set.
> >
> > KSM default off: merge only when MADV_MERGEABLE is set.
> >
> > The proposed change won't honor MADV_UNMERGEABLE, meaning that
> > application programmers won't have a way to prevent sensitive data from
> > ever being merged. So, I think, we should keep allowing an explicit
> > opt-out option for applications.
> >
> 
> We simply do not have a VM/madvise flag for that currently.
> The same as THP: all the logic is written with the assumption that we
> have exactly two states, allow/disallow (more like "not allow").
> 
> If we try to add one, it would have to be something like
> MADV_FORBID_* to disallow something completely.

No need to add a new user flag MADV_FORBID; we should keep MADV_MERGEABLE
and MADV_UNMERGEABLE, but make them work so that when MADV_UNMERGEABLE is
set, memory indeed becomes always unmergeable regardless of the KSM mode
of operation.

To do the above in ksm_madvise(), a new state should be added, for example
instead of: 

case MADV_UNMERGEABLE:
*vm_flags &= ~VM_MERGEABLE;

A new flag should be used:
*vm_flags |= VM_UNMERGEABLE;

I think, without honoring MADV_UNMERGEABLE correctly, this patch won't
be accepted.

Pasha


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Andy Lutomirski
On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa  wrote:
>
> On 13/11/2018 19:16, Andy Lutomirski wrote:
>
> > should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved.
>
> Yes, that is fine. In my proposal I was thinking about tying it to the
> core/thread that performs the actual write.
>
> The high level API could be something like:
>
> wr_memcpy(void *src, void *dst, uint_t size)
>
> >  It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU.  Sadly, there isn't.
>
> The mm_struct - or whatever is the means to do the write on that
> architecture - can be kept hidden from the API.
>
> But the reason why I was proposing to have one mm_struct per writer is
> that, iiuc, the secondary mapping is created in the secondary mm_struct
> for each writer using it.
>
> So the updating of IMA measurements would have, theoretically, also
> write access to the SELinux AVC. Which I was trying to avoid.
> And similarly any other write rare updater. Is this correct?

If you call a wr_memcpy() function with the signature you suggested,
then you can overwrite any memory of this type.  Having a different
mm_struct under the hood makes no difference.  As far as I'm
concerned, for *dynamically allocated* rare-writable memory, you might
as well allocate all the paging structures at the same time, so the
mm_struct will always contain the mappings.  If there are serious bugs
in wr_memcpy() that cause it to write to the wrong place, we have
bigger problems.

I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
but that can wait for v2.  And it still doesn't obviously need
multiple mm_structs.

>
> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working.  i think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant.  For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
>
> Yes, that is true. I think it's safe to assume, from an attack pattern,
> that as long as user space is not started, the system can be considered
> ok. Even user-space code run from initrd should be ok, since it can be
> bundled (and signed) as a single binary with the kernel.
>
> Modules loaded from a regular filesystem are a bit more risky, because
> an attack might inject a rogue key in the key-ring and use it to load
> malicious modules.

If a malicious module is loaded, the game is over.

>
> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow.  I don't think there's a generic way to check
> > whether in interrupt is pending without turning interrupts on.
>
> The only "excuse" I have is that write_rare is opt-in and is "rare".
> Maybe the enabling/disabling of interrupts - and the consequent switch
> of mm_struct - could be somehow tied to the latency configuration?
>
> If preemption is disabled, the expectations on the system latency are
> anyway more relaxed.
>
> But I'm not sure how it would work against I/O.

I think it's entirely reasonable for the API to internally break up
very large memcpys.


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Andy Lutomirski
On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa  wrote:
>
> I forgot one sentence :-(
>
> On 13/11/2018 20:31, Igor Stoppa wrote:
> > On 13/11/2018 19:47, Andy Lutomirski wrote:
> >
> >> For general rare-writish stuff, I don't think we want IRQs running
> >> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
> >
> > Why would these be less sensitive?
> >
> > But I see a big difference between my initial implementation and this one.
> >
> > In my case, by using a shared mapping, visible to all cores, freezing
> > the core that is performing the write would have exposed the writable
> > mapping to a potential attack run from another core.
> >
> > If the mapping is private to the core performing the write, even if it
> > is frozen, it's much harder to figure out what it had mapped and where,
> > from another core.
> >
> > To access that mapping, the attack should be performed from the ISR, I
> > think.
>
> Unless the secondary mapping is also available to other cores, through
> the shared mm_struct ?
>

I don't think this matters much.  The other cores will only be able to
use that mapping when they're doing a rare write.


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Pavel Tatashin
> > Is it really necessary to have an extra thread in ksm just to add vma's
> > for scanning? Can we do it right from the scanner thread? Also, may be
> > it is better to add vma's at their creation time when KSM_MODE_ALWAYS is
> > enabled?
> >
> > Thank you,
> > Pasha
>
> Oh, that's a long story, and my English is too poor to describe everything;
> it is even hard to find the linux-mm conversation about this from several
> years ago.
>
> Anyway:
> In V2 I used the scanner thread to add VMAs, but I think the scanner did
> that at too high a rate,
> i.e. walking the task list and picking up a new task every 20ms, waiting
> on the write semaphore, to get the VMAs...
> Too high a rate for a task-list scanner; I think it's overkill.
>
> About adding VMAs at creation time:
> UKSM adds ksm_enter() hooks to the mm subsystem, and I ported that to KSM.
> But some mm people said they do not like adding KSM hooks to other
> subsystems, and want KSM to do that internally somehow.
>
> Frankly speaking, I didn't have enough knowledge and skill to do it
> another way at the time.
> They also suggested I look at THP for that logic, but I can't find how
> THP does it without hooks, or
> where THP actually scans memory.
>
> So, after all of that, I implemented it this way:
> in the first iteration as part of the KSM scan thread, and in the second
> as a separate thread,
> because that allows VMAs to be added in a fully independent way.

It still feels like the wrong direction. A new thread that adds random
VMAs to scan, with no way to optimize queue fairness, for example.
It should really be done at creation time: when a VMA is created it
should be added to the KSM scanning queue, or the KSM main scanner
thread should go through the VMA list in a coherent order.

The design of having a separate thread is bad. I plan in the future to
add per-node thread support to KSM, and this one odd thread will break
things: to which queue should this thread add a VMA if there are
multiple queues?

Thank you,
Pasha


Re: [PATCH v5 05/27] Documentation/x86: Add CET description

2018-11-13 Thread Borislav Petkov
On Thu, Oct 11, 2018 at 08:15:01AM -0700, Yu-cheng Yu wrote:
> Explain how CET works and the no_cet_shstk/no_cet_ibt kernel
> parameters.
> 
> Signed-off-by: Yu-cheng Yu 
> ---
>  .../admin-guide/kernel-parameters.txt |   6 +
>  Documentation/index.rst   |   1 +
>  Documentation/x86/index.rst   |  11 +
>  Documentation/x86/intel_cet.rst   | 266 ++
>  4 files changed, 284 insertions(+)
>  create mode 100644 Documentation/x86/index.rst
>  create mode 100644 Documentation/x86/intel_cet.rst

So this patch should probably come first in the series so that a reader
can know what to expect...

> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 92eb1f42240d..3854423f7c86 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2764,6 +2764,12 @@
>   noexec=on: enable non-executable mappings (default)
>   noexec=off: disable non-executable mappings
>  
> + no_cet_ibt  [X86-64] Disable indirect branch tracking for user-mode
> + applications
> +
> + no_cet_shstk[X86-64] Disable shadow stack support for user-mode
> + applications
> +
>   nosmap  [X86]
>   Disable SMAP (Supervisor Mode Access Prevention)
>   even if it is supported by processor.
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index 5db7e87c7cb1..1cdc139adb40 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst

Please integrate scripts/checkpatch.pl into your patch creation
workflow. Some of the warnings/errors *actually* make sense:

WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
#76: FILE: Documentation/x86/index.rst:1:
+===

WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
#93: FILE: Documentation/x86/intel_cet.rst:1:
+=

> @@ -104,6 +104,7 @@ implementation.
> :maxdepth: 2
>  
> sh/index
> +   x86/index
>  
>  Filesystem Documentation
>  
> diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
> new file mode 100644
> index ..9c34d8cbc8f0
> --- /dev/null
> +++ b/Documentation/x86/index.rst
> @@ -0,0 +1,11 @@
> +===
> +X86 Documentation
> +===
> +
> +Control Flow Enforcement
> +
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   intel_cet
> diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> new file mode 100644
> index ..946f4802a51f
> --- /dev/null
> +++ b/Documentation/x86/intel_cet.rst
> @@ -0,0 +1,266 @@
> +=
> +Control Flow Enforcement Technology (CET)
> +=
> +
> +[1] Overview
> +
> +
> +Control Flow Enforcement Technology (CET) provides protection against
> +return/jump-oriented programming (ROP) attacks.  It can be implemented
> +to protect both the kernel and applications.  In the first phase,
> +only the user-mode protection is implemented on the 64-bit kernel.

s/the//  is implemented in 64-bit mode.

> +However, 32-bit applications are supported under the compatibility
> +mode.

Drop "However":

"32-bit applications are, of course, supported in compatibility mode."

> +
> +CET includes shadow stack (SHSTK) and indirect branch tracking (IBT).

"CET introduces two mechanisms: a shadow stack and an indirect branch tracking mechanism."

> +The SHSTK is a secondary stack allocated from memory.  The processor

s/The//

> +automatically pushes/pops a secure copy to the SHSTK every return
> +address and,

that reads funny - pls reorganize. Also, what is a "secure copy"?

You mean a copy of every return address which software cannot access?

> by comparing the secure copy to the program stack copy,
> +verifies function returns are as intended. 

 ... have not been corrupted/modified."

> The IBT verifies all
> +indirect CALL/JMP targets are intended and marked by the compiler with
> +'ENDBR' op codes.

"opcode" - one word. And before you use "ENDBR" you need to explain it
above what it is.

/me reads further... encounters ENDBR's definition...

ah, ok, so you should say something like

"... and marked by the compiler with the ENDBR opcode (see below)."

> +
> +There are two kernel configuration options:
> +
> +INTEL_X86_SHADOW_STACK_USER, and
> +INTEL_X86_BRANCH_TRACKING_USER.
> +
> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
> +are required.  To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.
> +
> +There are two command-line options for disabling CET features:
> +
> +no_cet_shstk - disables SHSTK, and
> +   

Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Andy Lutomirski
On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa  wrote:
>
> On 13/11/2018 19:47, Andy Lutomirski wrote:
>
> > For general rare-writish stuff, I don't think we want IRQs running
> > with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>
> Why would these be less sensitive?

I'm not really saying they're less sensitive so much as that the
considerations are different.  I think the original rare-write code is
based on ideas from grsecurity, and it was intended to protect static
data like structs full of function pointers.   Those targets have some
different properties:

 - Static targets are at addresses that are much more guessable, so
they're easier targets for most attacks.  (Not spraying attacks like
the ones you're interested in, though.)

 - Static targets are higher value.  No offense to IMA or AVC, but
outright execution of shellcode, hijacking of control flow, or complete
disablement of core security features is higher impact than bypassing
SELinux or IMA.  Why would you bother corrupting the AVC if you could
instead just set enforcing=0?  (I suppose that corrupting the AVC is
less likely to be noticed by monitoring tools.)

 - Static targets are small.  This means that the interrupt latency
would be negligible, especially in comparison to the latency of
replacing the entire SELinux policy object.

Anyway, I'm not all that familiar with SELinux under the hood, but I'm
wondering if a different approach to things like the policy database
might be appropriate.  When the policy is changed, rather than
allocating rare-write memory and writing to it, what if we instead
allocated normal memory, wrote to it, write-protected it, and then
used the rare-write infrastructure to do a much smaller write to
replace the pointer?

Admittedly, this creates a window where another core could corrupt the
data as it's being written.  That may not matter so much if an
attacker can't force a policy update.  Alternatively, the update code
could re-verify the policy after write-protecting it, or there could
be a fancy API to allocate some temporarily-writable memory (by
creating a whole new mm_struct, mapping the memory writable just in
that mm_struct, and activating it) so that only the actual policy
loader could touch the memory.  But I'm mostly speculating here, since
I'm not familiar with the code in question.

Anyway, I tend to think that the right way to approach mainlining all
this is to first get the basic rare write support for static data into
place and then to build on that.  I think it's great that you're
pushing this effort, but doing this for SELinux and IMA is a bigger
project than doing it for static data, and it might make sense to do
it in bite-sized pieces.

Does any of this make sense?


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
Tue, 13 Nov 2018 at 21:35, Pavel Tatashin :
>
> On 18-11-13 21:17:42, Timofey Titovets wrote:
> > Tue, 13 Nov 2018 at 20:59, Pavel Tatashin :
> > >
> > > On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > > > Hi.
> > > >
> > > > > Yep. However, so far, it requires an application to explicitly opt in
> > > > > to this behavior, so it's not all that bad. Your patch would remove
> > > > > the requirement for application opt-in, which, in my opinion, makes
> > > > > this way worse and reduces the number of applications for which this
> > > > > is acceptable.
> > > >
> > > > The default is to maintain the old behaviour, so unless the explicit
> > > > decision is made by the administrator, no extra risk is imposed.
> > >
> > > The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
> > >
> > > KSM default on: merge everything except when MADV_UNMERGEABLE is
> > > explicitly set.
> > >
> > > KSM default off: merge only when MADV_MERGEABLE is set.
> > >
> > > The proposed change won't honor MADV_UNMERGEABLE, meaning that
> > > application programmers won't have a way to prevent sensitive data from
> > > ever being merged. So, I think, we should keep allowing an explicit
> > > opt-out option for applications.
> > >
> >
> > We just do not have a VM/madvise flag for that currently.
> > Same as THP.
> > Because all the logic is written with the assumption that we have exactly
> > 2 states: Allow/Disallow (more like "not allow").
> >
> > And if we try to add one, it must be something like
> > MADV_FORBID_* to disallow something completely.
>
> No need to add a new user flag MADV_FORBID; we should keep MADV_MERGEABLE
> and MADV_UNMERGEABLE, but make them work so that when MADV_UNMERGEABLE is
> set, memory indeed becomes always unmergeable regardless of the KSM mode
> of operation.
>
> To do the above in ksm_madvise(), a new state should be added, for example
> instead of:
>
> case MADV_UNMERGEABLE:
> *vm_flags &= ~VM_MERGEABLE;
>
> A new flag should be used:
> *vm_flags |= VM_UNMERGEABLE;
>
> I think, without honoring MADV_UNMERGEABLE correctly, this patch won't
> be accepted.
>
> Pasha
>

That would work, but we are out of bit space in vm_flags [1],
i.e. the first 32 bits are already defined, and the others are
accessible only on 64-bit machines.

1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Igor Stoppa



On 13/11/2018 20:35, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa  wrote:

[...]

>> The high level API could be something like:
>>
>> wr_memcpy(void *src, void *dst, uint_t size)

[...]

> If you call a wr_memcpy() function with the signature you suggested,
> then you can overwrite any memory of this type.  Having a different
> mm_struct under the hood makes no difference.  As far as I'm
> concerned, for *dynamically allocated* rare-writable memory, you might
> as well allocate all the paging structures at the same time, so the
> mm_struct will always contain the mappings.  If there are serious bugs
> in wr_memcpy() that cause it to write to the wrong place, we have
> bigger problems.

Beside bugs, I'm also thinking about possible vulnerability.
It might be overthinking, though.

I do not think it's possible to really protect against control flow
attacks, unless there is some support from the HW and/or the compiler.

What is left, are data-based attacks. In this case, it would be an
attacker using one existing wr_ invocation with doctored parameters.

However, there is always the objection that it would be possible to come
up with a "writing kit" for plowing through the page tables and
unprotect anything that might be of value.

Ideally, that should be the only type of data-based attack left.

In practice, it might just be an excess of paranoia from my side.

> I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
> but that can wait for v2.  And it still doesn't obviously need
> multiple mm_structs.

I left that out, for now, but yes, typing would play some role here.

[...]

> I think it's entirely reasonable for the API to internally break up
> very large memcpys.

The criteria for deciding if/how to break it down is not clear to me,
though. The single page was easy, but might be (probably is) too much.

--
igor


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Igor Stoppa
On 13/11/2018 20:36, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa  wrote:

[...]

>> Unless the secondary mapping is also available to other cores, through
>> the shared mm_struct ?
>>
> 
> I don't think this matters much.  The other cores will only be able to
> use that mapping when they're doing a rare write.

Which has now been promoted to target for attacks :-)

--
igor



Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Pavel Tatashin
On 18-11-13 21:54:13, Timofey Titovets wrote:
> Tue, 13 Nov 2018 at 21:35, Pavel Tatashin :
> >
> > On 18-11-13 21:17:42, Timofey Titovets wrote:
> > > Tue, 13 Nov 2018 at 20:59, Pavel Tatashin :
> > > >
> > > > On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > > > > Hi.
> > > > >
> > > > > > Yep. However, so far, it requires an application to explicitly opt 
> > > > > > in
> > > > > > to this behavior, so it's not all that bad. Your patch would remove
> > > > > > the requirement for application opt-in, which, in my opinion, makes
> > > > > > this way worse and reduces the number of applications for which this
> > > > > > is acceptable.
> > > > >
> > > > > The default is to maintain the old behaviour, so unless the explicit
> > > > > decision is made by the administrator, no extra risk is imposed.
> > > >
> > > > The new interface would be more tolerable if it honored 
> > > > MADV_UNMERGEABLE:
> > > >
> > > > KSM default on: merge everything except when MADV_UNMERGEABLE is
> > > > explicitly set.
> > > >
> > > > KSM default off: merge only when MADV_MERGEABLE is set.
> > > >
> > > > The proposed change won't honor MADV_UNMERGEABLE, meaning that
> > > > application programmers won't have a way to prevent sensitive data
> > > > from ever being merged. So, I think, we should keep allowing an
> > > > explicit opt-out option for applications.
> > > >
> > >
> > > We just do not have a VM/madvise flag for that currently.
> > > Same as THP.
> > > Because all the logic is written with the assumption that we have
> > > exactly 2 states: Allow/Disallow (more like "not allow").
> > >
> > > And if we try to add one, it must be something like
> > > MADV_FORBID_* to disallow something completely.
> >
> > No need to add a new user flag MADV_FORBID; we should keep
> > MADV_MERGEABLE and MADV_UNMERGEABLE, but make them work so that when
> > MADV_UNMERGEABLE is set, memory indeed becomes always unmergeable
> > regardless of the KSM mode of operation.
> >
> > To do the above in ksm_madvise(), a new state should be added, for example
> > instead of:
> >
> > case MADV_UNMERGEABLE:
> > *vm_flags &= ~VM_MERGEABLE;
> >
> > A new flag should be used:
> > *vm_flags |= VM_UNMERGEABLE;
> >
> > I think, without honoring MADV_UNMERGEABLE correctly, this patch won't
> > be accepted.
> >
> > Pasha
> >
> 
> That would work, but we are out of bit space in vm_flags [1],
> i.e. the first 32 bits are already defined, and the others are
> accessible only on 64-bit machines.

So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.

> 
> 1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219


Re: [PATCH 01/12] kernfs: add function to find kernfs_node without increasing ref counter

2018-11-13 Thread Greg Kroah-Hartman
On Tue, Nov 13, 2018 at 06:53:59PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 12 nov 2018, alle ore 13:28, Greg Kroah-Hartman 
> >  ha scritto:
> > 
> > On Mon, Nov 12, 2018 at 10:56:21AM +0100, Paolo Valente wrote:
> >> From: Angelo Ruocco 
> >> 
> >> The kernfs pseudo file system doesn't export any function to only find
> >> a node by name, without also getting a reference on it.
> >> But in some cases it is useful to just locate a kernfs node, while
> >> using it or not depends on some other condition.
> >> 
> >> This commit adds a function to just look for a node, without getting
> >> a reference on it.
> > 
> > Eeek, that sounds really bad.  So you save off a pointer to something,
> > and have no idea if that pointer now really is valid or not?  It can
> > instantly disappear right afterwards.
> > 
> 
> Hi Greg,
> that function is invoked only in functions executed with cgroup_mutex
> held.  This guarantees that nothing disappears or becomes
> inconsistent.  That's why we decided to go for this optimization,
> instead of doing useless get&put pairs.  Still, I'm not expert
> enough to state whether it is impossible that, once we have defined
> that function, it may then get used in some unsafe way.

I can guarantee once you define that function, it will be used in an
unsafe way :(

So just don't create it, use the put calls, it's fast and should never
be a performance issue.

> So, I seem to see two options:
> 1) Add a comment on the function, saying that cgroup_mutex must be
>held while invoking it (I guess you won't like this one).

Nope, do not create it.

> 2) Do not define such a new function, and, in the other patches, use
>the already-available find_and_get.

Yes, please do that.

thanks,

greg k-h


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Igor Stoppa
On 13/11/2018 20:48, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa  wrote:
>>
>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>
>>> For general rare-writish stuff, I don't think we want IRQs running
>>> with them mapped anywhere for write.  For AVC and IMA, I'm less sure.
>>
>> Why would these be less sensitive?
> 
> I'm not really saying they're less sensitive so much as that the
> considerations are different.  I think the original rare-write code is
> based on ideas from grsecurity, and it was intended to protect static
> data like structs full of function pointers.   Those targets have some
> different properties:
> 
>  - Static targets are at addresses that are much more guessable, so
> they're easier targets for most attacks.  (Not spraying attacks like
> the ones you're interested in, though.)
> 
>  - Static targets are higher value.  No offense to IMA or AVC, but
> outright execution of shellcode, hijacking of control flow, or complete
> disablement of core security features is higher impact than bypassing
> SELinux or IMA.  Why would you bother corrupting the AVC if you could
> instead just set enforcing=0?  (I suppose that corrupting the AVC is
> less likely to be noticed by monitoring tools.)
> 
>  - Static targets are small.  This means that the interrupt latency
> would be negligible, especially in comparison to the latency of
> replacing the entire SELinux policy object.

Your analysis is correct.
In my case, having already taken care of those, I was going *also* after
the next target in line.
Admittedly, flipping a bit located at a fixed offset is way easier than
spraying dynamically allocated data structures.

However, once the bit is not easily writable, the only options are to
either find another way to flip it (unprotect it or subvert something
that can write it) or to identify another target that is still writable.

AVC and policyDB fit the latter description.

> Anyway, I'm not all that familiar with SELinux under the hood, but I'm
> wondering if a different approach to things like the policy database
> might be appropriate.  When the policy is changed, rather than
> allocating rare-write memory and writing to it, what if we instead
> allocated normal memory, wrote to it, write-protected it, and then
> used the rare-write infrastructure to do a much smaller write to
> replace the pointer?

Actually, that's exactly what I did.

I did not want to overload this discussion, but since you brought it up,
I'm not sure write rare is enough.

* write_rare is for stuff that still changes, sometimes all the time, ex: AVC

* dynamic read only is for stuff that at some point should not be
modified anymore, but could still be destroyed. Ex: policyDB

I think it would be good to differentiate, at runtime, between the two,
to minimize the chance that a write_rare function is used against some
read_only data.

Releasing dynamically allocated protected memory is also a big topic.
In some cases it's allocated and released continuously, like in the AVC.
Maybe it can be optimized, or maybe it can be turned into an object
cache of protected objects.

But for releasing, it would be good, I think, to have a mechanism for
freeing all the memory in one loop, like having a pool containing all
the memory that was allocated for a specific use (ex: policyDB)

> Admittedly, this creates a window where another core could corrupt the
> data as it's being written.  That may not matter so much if an
> attacker can't force a policy update.  Alternatively, the update code
> could re-verify the policy after write-protecting it, or there could
> be a fancy API to allocate some temporarily-writable memory (by
> creating a whole new mm_struct, mapping the memory writable just in
> that mm_struct, and activating it) so that only the actual policy
> loader could touch the memory.  But I'm mostly speculating here, since
> I'm not familiar with the code in question.

They are all corner cases ... possible but unlikely.
Another, maybe more critical, one is that the policyDB is not available
at boot.
There is a window of opportunity, before it's loaded. But it could be
mitigated by loading a barebone set of rules, either from initrd or even
as "firmware".

> Anyway, I tend to think that the right way to approach mainlining all
> this is to first get the basic rare write support for static data into
> place and then to build on that.  I think it's great that you're
> pushing this effort, but doing this for SELinux and IMA is a bigger
> project than doing it for static data, and it might make sense to do
> it in bite-sized pieces.
> 
> Does any of this make sense?

Yes, sure.

I *have* to do SELinux, but I do not necessarily have to wait for the
final version to be merged upstream. And anyways Android is on a
different kernel.

However, I think both SELinux and IMA have a value in being sufficiently
complex cases to be used for validating the design as it evolves.

Each of them has static data that could be the first target.

Re: [PATCH 0/2] Docs/EDID: Fixed and improved EDID documentation

2018-11-13 Thread Jani Nikula
On Tue, 06 Nov 2018, Jonathan Corbet  wrote:
> On Mon, 5 Nov 2018 09:48:33 +0100
> Christoph Niedermaier  wrote:
>
>> A problem was found when EDID data sets for displays other
>> than the provided samples were generated. The patch series has
>> no effect on the provided samples that still match the data
>> used in drivers/gpu/drm/drm_edid_load.c.
>> The provided samples use small values for XOFFSET, XPULSE,
>> YOFFSET and YPULSE, where the error doesn't occur. This fix
>> corrects the use of that values in case of high values, because
>> the most significant bits were treated incorrectly.
>> 
>> The previous version made it necessary to first generate an
>> EDID data set without correct CRC and then to fix the CRC in
>> a second step. This patch series adds the CRC calculation to the
>> makefile in such a way that a correct EDID data set is generated
>> in a single build step.
>
> This seems reasonable, I guess; I've applied both.  It seems to me, though,
> that this stuff is in the wrong place.  Perhaps we should go one step
> further and move it to tools/ ?

And then the next step further would be to write a tool in a high level
language to generate the data rather than assemble the binary. Such a
tool would, of course, catch errors like the ones fixed by this patch.

BR,
Jani.


-- 
Jani Nikula, Intel Open Source Graphics Center


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Jann Horn
+cc Daniel Gruss

On Tue, Nov 13, 2018 at 6:59 PM Pavel Tatashin
 wrote:
> On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > Hi.
> >
> > > Yep. However, so far, it requires an application to explicitly opt in
> > > to this behavior, so it's not all that bad. Your patch would remove
> > > the requirement for application opt-in, which, in my opinion, makes
> > > this way worse and reduces the number of applications for which this
> > > is acceptable.
> >
> > The default is to maintain the old behaviour, so unless the explicit
> > decision is made by the administrator, no extra risk is imposed.
>
> The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
>
> KSM default on: merge everything except when MADV_UNMERGEABLE is
> explicitly set.
>
> KSM default off: merge only when MADV_MERGEABLE is set.
>
> The proposed change won't honor MADV_UNMERGEABLE, meaning that
> application programmers won't have a way to prevent sensitive data from
> ever being merged. So, I think, we should keep allowing an explicit
> opt-out option for applications.
>
> >
> > > As far as I know, basically nobody is using KSM at this point. There
> > > are blog posts from several cloud providers about these security risks
> > > that explicitly state that they're not using memory deduplication.
> >
> > I tend to disagree here. Based on both what my company does and what UKSM
> > users do, memory dedup is a desired option (note "option" word here, not the
> > default choice).
>
> Lightweight containers is a use case for KSM: when many VMs share the
> same small kernel. KSM is used in production by large cloud vendors.

Wait, what? Can you name specific ones? Nowadays, enabling KSM for
untrusted VMs seems like a terrible idea to me, security-wise.

Google says at 
:
"Compute Engine and Container Engine are not vulnerable to this kind
of attack, since they do not use KSM."

An AWS employee says at
:
"memory de-duplication is not enabled by Amazon EC2's hypervisor"

In my opinion, KSM is fundamentally insecure for systems hosting
multiple VMs that don't trust each other. I don't think anyone writes
cryptographic software under the assumption that an attacker will be
given the ability to query whether a given page of data exists
anywhere else on the system.


Re: [PATCH v5 05/27] Documentation/x86: Add CET description

2018-11-13 Thread Yu-cheng Yu
On Tue, 2018-11-13 at 19:43 +0100, Borislav Petkov wrote:
> On Thu, Oct 11, 2018 at 08:15:01AM -0700, Yu-cheng Yu wrote:
> > Explain how CET works and the no_cet_shstk/no_cet_ibt kernel
> > parameters.
> > 
> > Signed-off-by: Yu-cheng Yu 
> > ---
> >  .../admin-guide/kernel-parameters.txt |   6 +
> >  Documentation/index.rst   |   1 +
> >  Documentation/x86/index.rst   |  11 +
> >  Documentation/x86/intel_cet.rst   | 266 ++
> >  4 files changed, 284 insertions(+)
> >  create mode 100644 Documentation/x86/index.rst
> >  create mode 100644 Documentation/x86/intel_cet.rst
> 
> So this patch should probably come first in the series so that a reader
> can know what to expect...
> 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index 92eb1f42240d..3854423f7c86 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2764,6 +2764,12 @@
> > noexec=on: enable non-executable mappings (default)
> > noexec=off: disable non-executable mappings
> >  
> > +   no_cet_ibt  [X86-64] Disable indirect branch tracking for
> > user-mode
> > +   applications
> > +
> > +   no_cet_shstk[X86-64] Disable shadow stack support for user-
> > mode
> > +   applications
> > +
> > nosmap  [X86]
> > Disable SMAP (Supervisor Mode Access Prevention)
> > even if it is supported by processor.
> > diff --git a/Documentation/index.rst b/Documentation/index.rst
> > index 5db7e87c7cb1..1cdc139adb40 100644
> > --- a/Documentation/index.rst
> > +++ b/Documentation/index.rst
> 
> Please integrate scripts/checkpatch.pl into your patch creation
> workflow. Some of the warnings/errors *actually* make sense:
> 
> WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
> #76: FILE: Documentation/x86/index.rst:1:
> +===
> 
> WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
> #93: FILE: Documentation/x86/intel_cet.rst:1:
> +=
> 
> > @@ -104,6 +104,7 @@ implementation.
> > :maxdepth: 2
> >  
> > sh/index
> > +   x86/index
> >  
> >  Filesystem Documentation
> >  
> > diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
> > new file mode 100644
> > index ..9c34d8cbc8f0
> > --- /dev/null
> > +++ b/Documentation/x86/index.rst
> > @@ -0,0 +1,11 @@
> > +===
> > +X86 Documentation
> > +===
> > +
> > +Control Flow Enforcement
> > +
> > +
> > +.. toctree::
> > +   :maxdepth: 1
> > +
> > +   intel_cet
> > diff --git a/Documentation/x86/intel_cet.rst
> > b/Documentation/x86/intel_cet.rst
> > new file mode 100644
> > index ..946f4802a51f
> > --- /dev/null
> > +++ b/Documentation/x86/intel_cet.rst
> > @@ -0,0 +1,266 @@
> > +=
> > +Control Flow Enforcement Technology (CET)
> > +=
> > +
> > +[1] Overview
> > +
> > +
> > +Control Flow Enforcement Technology (CET) provides protection against
> > +return/jump-oriented programming (ROP) attacks.  It can be implemented
> > +to protect both the kernel and applications.  In the first phase,
> > +only the user-mode protection is implemented on the 64-bit kernel.
> 
> s/the//is implemented in 64-bit mode.
> 
> > +However, 32-bit applications are supported under the compatibility
> > +mode.
> 
> Drop "However":
> 
> "32-bit applications are, of course, supported in compatibility mode."
> 
> > +
> > +CET includes shadow stack (SHSTK) and indirect branch tracking (IBT).
> 
> "CET introduces two mechanisms: a shadow stack and an indirect branch tracking mechanism."
> 
> > +The SHSTK is a secondary stack allocated from memory.  The processor
> 
> s/The//
> 
> > +automatically pushes/pops a secure copy to the SHSTK every return
> > +address and,
> 
> that reads funny - pls reorganize. Also, what is a "secure copy"?
> 
> You mean a copy of every return address which software cannot access?
> 
> > by comparing the secure copy to the program stack copy,
> > +verifies function returns are as intended. 
> 
>... have not been corrupted/modified."
> 
> > The IBT verifies all
> > +indirect CALL/JMP targets are intended and marked by the compiler with
> > +'ENDBR' op codes.
> 
> "opcode" - one word. And before you use "ENDBR" you need to explain it
> above what it is.
> 
> /me reads further... encounters ENDBR's definition...
> 
> ah, ok, so you should say something like
> 
> "... and marked by the compiler with the ENDBR opcode (see below)."

I will work on it.  Thanks!

Yu-cheng


RE: [PATCH v7 01/13] arch/x86: Start renaming the rdt files to more generic names

2018-11-13 Thread Yu, Fenghua
> From: Borislav Petkov [mailto:b...@alien8.de]
>  Subject: [PATCH v7 01/13] arch/x86: Start renaming the rdt files to more
> generic names
> 
> I guess the subject prefix for all those should be "arch/resctrl:" or so now.

Is "x86/resctrl" a better subject prefix?

Thanks.

-Fenghua


Re: [PATCH v7 01/13] arch/x86: Start renaming the rdt files to more generic names

2018-11-13 Thread Borislav Petkov
On Tue, Nov 13, 2018 at 09:35:40PM +, Yu, Fenghua wrote:
> Is "x86/resctrl" a better subject prefix?

Doh, of course. Diffstat is all arch/x86/.

Thx.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


RE: [PATCH v7 11/13] arch/x86: Introduce QOS feature for AMD

2018-11-13 Thread Yu, Fenghua
> From: Moger, Babu [mailto:babu.mo...@amd.com]
> Subject: [PATCH v7 11/13] arch/x86: Introduce QOS feature for AMD
> The specification for this feature is available at
> https://developer.amd.com/wp-content/resources/56375.pdf

> +bool cbm_validate_amd(char *buf, u32 *data, struct rdt_resource *r) {
> + if (val > r->default_ctrl) {
> + rdt_last_cmd_puts("mask out of range\n");
> + return false;
> + }

If val is zero, then this closid cannot allocate any cache line.

I'm wondering: does that mean the tasks running with this closid directly 
access memory without cache? Is there any usage for this situation?

Thanks.

-Fenghua


Re: [PATCH v10 12/22] kasan, arm64: fix up fault handling logic

2018-11-13 Thread Mark Rutland
On Tue, Nov 13, 2018 at 04:01:27PM +0100, Andrey Konovalov wrote:
> On Thu, Nov 8, 2018 at 1:22 PM, Mark Rutland  wrote:
> > On Tue, Nov 06, 2018 at 06:30:27PM +0100, Andrey Konovalov wrote:
> >> show_pte in arm64 fault handling relies on the fact that the top byte of
> >> a kernel pointer is 0xff, which isn't always the case with tag-based
> >> KASAN.
> >
> > That's for the TTBR1 check, right?
> >
> > i.e. for the following to work:
> >
> > if (addr >= VA_START)
> >
> > ... we need the tag bits to be an extension of bit 55...
> >
> >>
> >> This patch resets the top byte in show_pte.
> >>
> >> Reviewed-by: Andrey Ryabinin 
> >> Reviewed-by: Dmitry Vyukov 
> >> Signed-off-by: Andrey Konovalov 
> >> ---
> >>  arch/arm64/mm/fault.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> >> index 7d9571f4ae3d..d9a84d6f3343 100644
> >> --- a/arch/arm64/mm/fault.c
> >> +++ b/arch/arm64/mm/fault.c
> >> @@ -32,6 +32,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>
> >>  #include 
> >>  #include 
> >> @@ -141,6 +142,8 @@ void show_pte(unsigned long addr)
> >>   pgd_t *pgdp;
> >>   pgd_t pgd;
> >>
> >> + addr = (unsigned long)kasan_reset_tag((void *)addr);
> >
> > ... but this ORs in (0xffUL << 56), which is not correct for addresses
> > which aren't TTBR1 addresses to begin with, where bit 55 is clear, and
> > throws away useful information.
> >
> > We could use untagged_addr() here, but that wouldn't be right for
> > kernels which don't use TBI1, and we'd erroneously report addresses
> > under the TTBR1 range as being in the TTBR1 range.
> >
> > I also see that the entry assembly for el{1,0}_{da,ia} clears the tag
> > for EL0 addresses.
> >
> > So we could have:
> >
> > static inline bool is_ttbr0_addr(unsigned long addr)
> > {
> > /* entry assembly clears tags for TTBR0 addrs */
> > return addr < TASK_SIZE_64;
> > }
> >
> > static inline bool is_ttbr1_addr(unsigned long addr)
> > {
> > /* TTBR1 addresses may have a tag if HWKASAN is in use */
> > return arch_kasan_reset_tag(addr) >= VA_START;
> > }
> >
> > ... and use those in the conditionals, leaving the addr as-is for
> > reporting purposes.
> 
> Actually it looks like 276e9327 ("arm64: entry: improve data abort
> handling of tagged pointers") already takes care of both user and
> kernel fault addresses and correctly removes tags from them. So I
> think we need to drop this patch.

The clear_address_tag macro added in that commit only removes tags from TTBR0
addresses, so that's not sufficient if the kernel uses tagged addresses
(which will be in the TTBR1 range).

Thanks,
Mark.


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Pavel Tatashin
> Wait, what? Can you name specific ones? Nowadays, enabling KSM for
> untrusted VMs seems like a terrible idea to me, security-wise.

Of course it is not used to share data among different
customers/tenants; as far as I know, it is used by Oracle Cloud to
merge identical pages in Clear Containers.

https://medium.com/cri-o/intel-clear-containers-and-cri-o-70824fb51811
One performance enhancing feature is the use of KSM, a recent KVM
optimized for memory sharing and boot speed. Another is the use of an
optimized Clear Containers mini-OS.

Pasha


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
Tue, 13 Nov 2018 at 22:17, Pavel Tatashin :
>
> On 18-11-13 21:54:13, Timofey Titovets wrote:
> > Tue, 13 Nov 2018 at 21:35, Pavel Tatashin :
> > >
> > > On 18-11-13 21:17:42, Timofey Titovets wrote:
> > > > Tue, 13 Nov 2018 at 20:59, Pavel Tatashin :
> > > > >
> > > > > On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > > > > > Hi.
> > > > > >
> > > > > > > Yep. However, so far, it requires an application to explicitly 
> > > > > > > opt in
> > > > > > > to this behavior, so it's not all that bad. Your patch would 
> > > > > > > remove
> > > > > > > the requirement for application opt-in, which, in my opinion, 
> > > > > > > makes
> > > > > > > this way worse and reduces the number of applications for which 
> > > > > > > this
> > > > > > > is acceptable.
> > > > > >
> > > > > > The default is to maintain the old behaviour, so unless the explicit
> > > > > > decision is made by the administrator, no extra risk is imposed.
> > > > >
> > > > > The new interface would be more tolerable if it honored 
> > > > > MADV_UNMERGEABLE:
> > > > >
> > > > > KSM default on: merge everything except when MADV_UNMERGEABLE is
> > > > > explicitly set.
> > > > >
> > > > > KSM default off: merge only when MADV_MERGEABLE is set.
> > > > >
> > > > > The proposed change won't honor MADV_UNMERGEABLE, meaning that
> > > > > application programmers won't have a way to prevent sensitive data
> > > > > from ever being merged. So, I think, we should keep allowing an
> > > > > explicit opt-out option for applications.
> > > > >
> > > >
> > > > We just do not have a VM/madvise flag for that currently.
> > > > Same as THP.
> > > > All the logic is written with the assumption that we have exactly
> > > > two states: allow/disallow (more precisely, "not allow").
> > > >
> > > > And if we try to add one, it would have to be something like
> > > > MADV_FORBID_* to disallow something completely.
> > >
> > > No need to add a new user flag MADV_FORBID; we should keep MADV_MERGEABLE
> > > and MADV_UNMERGEABLE, but make them work so that when MADV_UNMERGEABLE is
> > > set, memory indeed becomes always unmergeable regardless of ksm's mode
> > > of operation.
> > >
> > > To do the above in ksm_madvise(), a new state should be added, for example
> > > instead of:
> > >
> > > case MADV_UNMERGEABLE:
> > > *vm_flags &= ~VM_MERGEABLE;
> > >
> > > A new flag should be used:
> > > *vm_flags |= VM_UNMERGEABLE;
> > >
> > > I think, without honoring MADV_UNMERGEABLE correctly, this patch won't
> > > be accepted.
> > >
> > > Pasha
> > >
> >
> > That would work, but we are out of bit space in vm_flags [1],
> > i.e. the first 32 bits are already defined, and the others are
> > accessible only on 64-bit machines.
>
> So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.

With all due respect, for that type of change we need
an mm maintainer's opinion.

I just don't want to end up in a situation where, after touching other
subsystems, a maintainer simply refuses the work for some reason.

i.e. writing patches for upstream (from my point of view)
is more the art of communication and of making the resulting code
acceptable to the community, because code that is correct from an
engineering point of view can easily be refused just because someone
does not find it useful.

Thanks.

> >
> > 1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Pavel Tatashin
> > > That would work, but we are out of bit space in vm_flags [1],
> > > i.e. the first 32 bits are already defined, and the others are
> > > accessible only on 64-bit machines.
> >
> > So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.
> 
> With all due respect, for that type of change we need
> an mm maintainer's opinion.

As far as I understood, you already got directions from the maintainers
to do it similarly to the way THP is implemented, and THP uses two flags:

VM_HUGEPAGE and VM_NOHUGEPAGE, the same as I think ksm should do if we
honor MADV_UNMERGEABLE.

When VM_NOHUGEPAGE is set khugepaged ignores those VMAs.

There may be a way to add VM_UNMERGEABLE without extending the size of
vm_flags, but that would be a good start point in looking how to add a
new flag.

Again, you could simply enable this feature on 64-bit only.

Pasha


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
Tue, 13 Nov 2018 at 21:43, Pavel Tatashin :
>
> > > Is it really necessary to have an extra thread in ksm just to add vma's
> > > for scanning? Can we do it right from the scanner thread? Also, may be
> > > it is better to add vma's at their creation time when KSM_MODE_ALWAYS is
> > > enabled?
> > >
> > > Thank you,
> > > Pasha
> >
> > Oh, that's a long story, and my English is too bad to describe it all;
> > it's even hard to find the linux-mm conversation about this from several
> > years ago.
> >
> > Anyway, so:
> > In V2 I used the scanner thread to add VMAs, but I think the scanner
> > does that at too high a rate,
> > i.e. it walks the task list and gets a new task every 20ms, waits on
> > the write semaphore, gets the VMAs...
> > Too high a rate for a task-list scanner; I think it's overkill.
> >
> > About adding VMAs at creation time:
> > UKSM adds ksm_enter() hooks to the mm subsystem, and I ported that to KSM.
> > But some mm people said they don't like adding KSM hooks to other
> > subsystems, and want ksm to do that internally in some way.
> >
> > Frankly speaking, I didn't have enough knowledge and skill to do it
> > another way at the time.
> > They also suggested I look at THP for that logic, but I can't find how
> > THP does it without hooks, or where THP truly scans memory.
> >
> > So, after all of that, I implemented it this way:
> > in the first iteration as part of the ksm scan thread, and in the
> > second by a separate thread, because that allows adding VMAs in a
> > fully independent way.
>
> It still feels like the wrong direction: a new thread that adds random
> VMAs to scan, with no way to optimize queue fairness, for example.
> It should really be done at creation time: when a VMA is created it
> should be added to the KSM scanning queue, or the KSM main scanner
> thread should go through the VMA list in a coherent order.

How do you see queue fairness in that case?
i.e. if you are talking about moving from old VMAs to new VMAs,
IIRC there is no kernel-wide list of VMAs.

i.e. I do understand what exactly you don't like,
but for that we need to add hooks, as I already mentioned above
(and I already tried getting that into the kernel [1]).

So, as I wrote below, I need a maintainer's opinion on which
implementation the responsible mm person considers 'right'.

> The design of having a separate thread is bad. I plan in the future to
> add thread per node support to KSM, and this one odd thread won't
> break things, to which queue should this thread add VMA if there are
> multiple queues?

That will be interesting to see :)
But IMHO:
I think you will need to add some code to ksm_enter(),
because madvise() internally calls ksm_enter().

So ksm_enter() will decide which thread must process that;
it does not depend on the caller.
Thanks.

> Thank you,
> Pasha
>
- - -
1. https://lkml.org/lkml/2014/11/8/206


Re: [PATCH V3] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
Wed, 14 Nov 2018 at 01:53, Pavel Tatashin :
>
> > > > That would work, but we are out of bit space in vm_flags [1],
> > > > i.e. the first 32 bits are already defined, and the others are
> > > > accessible only on 64-bit machines.
> > >
> > > So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.
> >
> > With all due respect, for that type of change we need
> > an mm maintainer's opinion.
>
> As far as I understood, you already got directions from the maintainers
> to do it similarly to the way THP is implemented, and THP uses two flags:
>
> VM_HUGEPAGE and VM_NOHUGEPAGE, the same as I think ksm should do if we
> honor MADV_UNMERGEABLE.
>
> When VM_NOHUGEPAGE is set khugepaged ignores those VMAs.
>
> There may be a way to add VM_UNMERGEABLE without extending the size of
> vm_flags, but that would be a good start point in looking how to add a
> new flag.
>
> Again, you could simply enable this feature on 64-bit only.
>
> Pasha
>

Deal!
I will try it on 64-bit machines only.


[PATCH V5] KSM: allow dedup all tasks memory

2018-11-13 Thread Timofey Titovets
From: Timofey Titovets 

By default, ksm works only on memory that was added by
madvise().

The only ways to get it working on other applications are:
  * use LD_PRELOAD and libraries
  * patch the kernel

Let's use the kernel task list and add logic to import VMAs from tasks.

That behaviour is controlled by new attributes:
  * mode:
    I try to mimic the hugepages attribute, so mode has two states:
      * madvise  - old default behaviour
      * always [new] - allow ksm to get tasks' VMAs and
                       try working on them.
  * seeker_sleep_millisecs:
    adds pauses between importing tasks' VMAs

For rate-limiting purposes and to bound tasklist locking time,
the ksm seeker thread imports VMAs from only one task per loop.

For security purposes, add a VM_UNMERGEABLE flag that allows users
who really care about security to use MADV_UNMERGEABLE to forbid
the new ksm code from processing their VMAs.

Some numbers from different not madvised workloads.
Formulas:
  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
  Memory used = free -h

  * Name: My working laptop
Description: Many different chrome/electron apps + KDE
Ratio: 5%
Saved: ~100  MiB
Used:  ~2000 MiB

  * Name: K8s test VM
Description: Some small random running docker images
Ratio: 40%
Saved: ~160 MiB
Used:  ~920 MiB

  * Name: Ceph test VM
Description: Ceph Mon/OSD, some containers
Ratio: 20%
Saved: ~60 MiB
Used:  ~600 MiB

  * Name: BareMetal K8s backend server
Description: Different server apps in containers C, Java, GO & etc
Ratio: 72%
Saved: ~5800 MiB
Used:  ~35.7 GiB

  * Name: BareMetal K8s processing server
Description: Many instance of one CPU intensive application
Ratio: 55%
Saved: ~2600 MiB
Used:  ~28.0 GiB

  * Name: BareMetal Ceph node
Description: Only OSD storage daemons running
Ratio: 2%
Saved: ~190 MiB
Used:  ~11.7 GiB

Changes:
  v1 -> v2:
* Rebase on v4.19.1 (must also apply on 4.20-rc2+)
  v2 -> v3:
* Reformat patch description
* Rename mode normal to madvise
* Add some memory numbers
* Separate ksm vma seeker to another kthread
* Fix: "BUG: scheduling while atomic: ksmd"
  by move seeker to another thread
  v3 -> v4:
* Fix again "BUG: scheduling while atomic"
  by get()/put() task API
* Remove unused variable error
  v4 -> v5:
* This is now only available on 64-bit arches,
  because VM_UNMERGEABLE uses bit 37 in vm_flags
* Add a VM_UNMERGEABLE VMA flag to let users
  forbid ksm from doing anything with their VMAs

Signed-off-by: Timofey Titovets 
CC: Matthew Wilcox 
CC: Oleksandr Natalenko 
CC: Pavel Tatashin 
CC: linux...@kvack.org
CC: linux-doc@vger.kernel.org
---
 Documentation/admin-guide/mm/ksm.rst |  15 ++
 include/linux/mm.h   |   7 +
 include/trace/events/mmflags.h   |   7 +
 mm/ksm.c | 247 +++
 4 files changed, 245 insertions(+), 31 deletions(-)

diff --git a/Documentation/admin-guide/mm/ksm.rst 
b/Documentation/admin-guide/mm/ksm.rst
index 9303786632d1..7cffd47f9b38 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -116,6 +116,21 @@ run
 Default: 0 (must be changed to 1 to activate KSM, except if
 CONFIG_SYSFS is disabled)
 
+mode
+* set always to allow ksm to deduplicate the memory of every process
+* set madvise to use only madvised memory
+
+Default: madvise (deduplicate only madvised memory, as in
+earlier releases)
+
+seeker_sleep_millisecs
+how many milliseconds the ksmd task seeker should sleep before
+trying another task,
+e.g. ``echo 1000 > /sys/kernel/mm/ksm/seeker_sleep_millisecs``
+
+Default: 1000 (chosen for rate-limiting purposes)
+
+
 use_zero_pages
 specifies whether empty pages (i.e. allocated pages that only
 contain zeroes) should be treated specially.  When set to 1,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..3d8ee297d776 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -224,13 +224,20 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2 34  /* bit only usable on 64-bit 
architectures */
 #define VM_HIGH_ARCH_BIT_3 35  /* bit only usable on 64-bit 
architectures */
 #define VM_HIGH_ARCH_BIT_4 36  /* bit only usable on 64-bit 
architectures */
+#define VM_HIGH_ARCH_BIT_5 37  /* bit only usable on 64-bit 
architectures */
 #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#ifdef VM_HIGH_ARCH_5
+/* Forbid ksm from automatically picking up a VMA and trying to dedup it */
+#define VM_UNMERGEABLE VM_HI

Re: [PATCH] hwmon (ina3221) Add single-shot mode support

2018-11-13 Thread Nicolin Chen
Hi Guenter,

On Tue, Nov 13, 2018 at 09:21:02AM -0800, Guenter Roeck wrote:
> > > > INA3221 supports both continuous and single-shot modes. When
> > > > running in the continuous mode, it keeps measuring the inputs
> > > > and converting them to the data register even if there are no
> > > > users reading the data out. In this use case, this could be a
> > > > power waste.
> > > > 
> > > > So this patch adds a single-shot mode support so that ina3221
> > > > could do measurement and conversion only if users trigger it,
> > > > depending on the use case where it only needs to poll data in
> > > > a lower frequency.
> > > > 
> > > > The change also exposes "mode" and "available_modes" nodes to
> > > > allow users to switch between two operating modes.
> > > > 
> > > Lots and lots of complexity for little gain. Sorry, I don't see
> > > the point of this change.
> > 
> > The chip is causing considerable power waste on battery-powered
> > devices so we typically use it running in the single-shot mode.
> 
> And you need to be able to do that with a sysfs attribute ?
> Are you planning to have some code switching back and forth
> between the modes ?
>
> You'll need to provide a good rationale why this needs to be
> runtime configurable.

Honestly, our old downstream driver didn't expose it via sysfs.
Instead, it had a built-in "governor" to switch modes based on
the CPU hotplug state and cpufreq. However, the interface used
to register a CPU hotplug notification was already deprecated.
And I don't feel this governor is generic enough to be present
in the mainline code.

For me, it's not strictly necessary that this be a sysfs attribute.
I tried to add it merely because I cannot find a good criterion for
mode switching in a hwmon driver. So having an open sysfs node
may allow user space power daemon to decide its operating mode,
since it knows which power mode the system is running at: full
speed (charging/charged) or power saving (on-battery), and it
knows how often this exact service will poll the sensor data.

An alternative way (without the sysfs node), after looking at
other hwmon code, could be to have a timed polling thread and
read data using an update_interval value from ABI. This might
turn out to be more complicated as it'll also involve settings
of hardware averaging and conversion time. Above all, I cannot
figure out a good threshold of update_interval to switch modes.

If you can give some advice of a better implementation, that'd
be great.

Thanks
Nicolin


Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce

2018-11-13 Thread Kenneth Lee



On 2018/11/13 8:23 AM, Leon Romanovsky wrote:

On Mon, Nov 12, 2018 at 03:58:02PM +0800, Kenneth Lee wrote:

From: Kenneth Lee 

WarpDrive is a general accelerator framework for user applications to
access hardware without going through the kernel in the data path.

The kernel component that provides the kernel facility for drivers to
expose the user interface is called uacce. It is a short name for
"Unified/User-space-access-intended Accelerator Framework".

This patch adds a document explaining how it works.

+ RDMA and netdev folks

Sorry, to be late in the game, I don't see other patches, but from
the description below it seems like you are reinventing RDMA verbs
model. I have hard time to see the differences in the proposed
framework to already implemented in drivers/infiniband/* for the kernel
space and for the https://github.com/linux-rdma/rdma-core/ for the user
space parts.


Thanks Leon,

Yes, we tried to solve a similar problem to RDMA. We also learned a lot
from the existing RDMA code. But we had to make a new framework because
we cannot register accelerators such as AI operations, encryption or
compression with the RDMA framework :)


Another problem we tried to address is the way memory is pinned for DMA
operations. The RDMA way of pinning memory cannot avoid pages being lost
to copy-on-write while the memory is in use by the device. This may not
be important to the RDMA library, but it is important to accelerators.


Hope this can help the understanding.

Cheers



Hard NAK from RDMA side.

Thanks


Signed-off-by: Kenneth Lee 
---
  Documentation/warpdrive/warpdrive.rst   | 260 +++
  Documentation/warpdrive/wd-arch.svg | 764 
  Documentation/warpdrive/wd.svg  | 526 ++
  Documentation/warpdrive/wd_q_addr_space.svg | 359 +
  4 files changed, 1909 insertions(+)
  create mode 100644 Documentation/warpdrive/warpdrive.rst
  create mode 100644 Documentation/warpdrive/wd-arch.svg
  create mode 100644 Documentation/warpdrive/wd.svg
  create mode 100644 Documentation/warpdrive/wd_q_addr_space.svg

diff --git a/Documentation/warpdrive/warpdrive.rst 
b/Documentation/warpdrive/warpdrive.rst
new file mode 100644
index ..ef84d3a2d462
--- /dev/null
+++ b/Documentation/warpdrive/warpdrive.rst
@@ -0,0 +1,260 @@
+Introduction of WarpDrive
+=
+
+*WarpDrive* is a general accelerator framework for user applications to
+access hardware without going through the kernel in the data path.
+
+It can be used as a fast channel for accelerators, network adaptors or
+other hardware used by applications in user space.
+
+This may make some implementations simpler.  E.g. you can reuse most of the
+*netdev* driver in the kernel and just share some ring buffers with the user
+space driver for *DPDK* [4] or *ODP* [5]. Or you can combine the RSA
+accelerator with the *netdev* in user space as an https reverse proxy, etc.
+
+*WarpDrive* treats the hardware accelerator as a heterogeneous processor
+which can take particular load off the CPU:
+
+.. image:: wd.svg
+:alt: WarpDrive Concept
+
+The virtual concept, the queue, is used to manage the requests sent to the
+accelerator. The application sends requests to the queue by writing to some
+particular address, while the hardware takes the requests directly from the
+address and sends feedback accordingly.
+
+The format of the queue may differ from hardware to hardware. But the
+application need not make any system call for the communication.
+
+*WarpDrive* tries to create a shared virtual address space for all involved
+accelerators. Within this space, the requests sent to a queue can refer to any
+virtual address, which will be valid to the application and all involved
+accelerators.
+
+The name *WarpDrive* is simply a cool and general name meaning the framework
+makes the application faster. It includes a general user library, a kernel
+management module and drivers for the hardware. In the kernel, the management
+module is called *uacce*, meaning "Unified/User-space-access-intended
+Accelerator Framework".
+
+
+How does it work
+
+
+*WarpDrive* uses *mmap* and *IOMMU* to play the trick.
+
+*Uacce* creates a chrdev for the device registered to it. A "queue" will be
+created when the chrdev is opened. The application accesses the queue by
+mmapping different address regions of the queue file.
+
+The following figure demonstrates the queue file address space:
+
+.. image:: wd_q_addr_space.svg
+:alt: WarpDrive Queue Address Space
+
+The first region of the space, the device region, is used for the
+application to write requests to, or read answers from, the hardware.
+
+Normally, there can be two types of device regions, mmio and memory regions.
+It is recommended to use common memory for request/answer descriptors and use
+the mmio space for device notification, such as doorbell. But of course, this
+is all up to the interface designer.
+
+There can be two types of device memory re