[PATCH] thermal/drivers/step_wise: Fix temperature regulation misbehavior

2017-09-06 Thread Daniel Lezcano
There is a particular situation, when the cooling device is cpufreq and the heat
dissipation is not efficient enough, where the temperature increases little by
little until it reaches the critical threshold, leading to a SoC reset.

The behavior is reproducible on a hikey6220 with poor heat dissipation (e.g.
stacked with other boards).

Running a simple C program doing while(1); on each CPU of the SoC makes the
temperature reach the passive regulation trip point and then climb up to the
maximum allowed temperature, followed by a reset.

What is observed is a ping-pong between two CPU frequencies, 1.2GHz and 900MHz,
while the temperature continues to grow.

It appears the step wise governor calls get_target_state() the first time with
throttle set to true and the trend set to 'raising'. The code logically selects
the next state, so the cpu frequency decreases from 1.2GHz to 900MHz; so far so
good. The temperature decreases immediately but still stays greater than the
trip point, then get_target_state() is called again, this time with throttle
set to true *and* the trend set to 'dropping'. From there the algorithm assumes
it has to step down the state and the cpu frequency jumps back to 1.2GHz. But
the temperature is still higher than the trip point, so get_target_state() is
called with throttle=1 and trend='raising' again, we jump to 900MHz, then
get_target_state() is called with throttle=1 and trend='dropping', we jump to
1.2GHz, etc. ... but the temperature never stabilizes and continues to
increase.

Keeping next_target untouched when 'throttle' is true while the trend is
'dropping' fixes the issue.
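A userspace reduction of the fixed 'dropping' branch illustrates the behavior; this is a sketch, not the kernel code, and the state bounds and function name are illustrative only:

```c
#include <stdbool.h>
#include <limits.h>

#define THERMAL_NO_TARGET ULONG_MAX

/* Hypothetical reduction of the DROPPING branch of get_target_state():
 * with the fix, the current cooling state is kept while still throttling,
 * instead of stepping the cooling device down (which re-raises the CPU
 * frequency and causes the ping-pong described above). */
unsigned long dropping_next_target(unsigned long cur_state,
				   unsigned long lower,
				   unsigned long upper,
				   bool throttle)
{
	unsigned long next_target = cur_state;

	if (cur_state <= lower) {
		if (!throttle)
			next_target = THERMAL_NO_TARGET;
	} else {
		/* only step the cooling device down once we stop throttling */
		if (!throttle) {
			next_target = cur_state - 1;
			if (next_target > upper)
				next_target = upper;
		}
	}
	return next_target;
}
```

With throttle still true, the state (and hence the lowered CPU frequency) is retained instead of bouncing back.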

Signed-off-by: Daniel Lezcano 
---
 drivers/thermal/step_wise.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/thermal/step_wise.c b/drivers/thermal/step_wise.c
index be95826..a01259a 100644
--- a/drivers/thermal/step_wise.c
+++ b/drivers/thermal/step_wise.c
@@ -94,9 +94,11 @@ static unsigned long get_target_state(struct thermal_instance *instance,
 			if (!throttle)
 				next_target = THERMAL_NO_TARGET;
 		} else {
-			next_target = cur_state - 1;
-			if (next_target > instance->upper)
-				next_target = instance->upper;
+			if (!throttle) {
+				next_target = cur_state - 1;
+				if (next_target > instance->upper)
+					next_target = instance->upper;
+			}
 		}
 		break;
 	case THERMAL_TREND_DROP_FULL:
-- 
2.7.4



Re: [Intel-wired-lan] [PATCH net-next v3] e1000e: Be drop monitor friendly

2017-09-06 Thread Neftin, Sasha

On 8/26/2017 04:14, Florian Fainelli wrote:

e1000e_put_txbuf() can be called from the normal reclamation path as well
as when a DMA mapping failure occurs, so we need to differentiate these two
cases when freeing SKBs to be drop monitor friendly. e1000e_tx_hwtstamp_work()
and e1000_remove() process TX timestamped SKBs and those should not be
accounted as drops either.
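The pattern the patch introduces can be sketched in userspace; the skb type, the stub free functions and the counters below are stand-ins, not the real e1000e structures or the real dev_kfree_skb_any()/dev_consume_skb_any():

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace sketch of the patch's pattern: one free helper takes a 'drop'
 * flag, so error paths show up in the drop monitor while normal TX
 * completion does not. */
struct fake_skb { int len; };

static int dropped;   /* frees visible to the drop monitor */
static int consumed;  /* normal completions, invisible to it */

static void kfree_skb_stub(struct fake_skb *skb)   { (void)skb; dropped++; }
static void consume_skb_stub(struct fake_skb *skb) { (void)skb; consumed++; }

static void put_txbuf(struct fake_skb **skbp, bool drop)
{
	if (*skbp) {
		if (drop)
			kfree_skb_stub(*skbp);   /* e.g. DMA mapping failed */
		else
			consume_skb_stub(*skbp); /* normal reclamation */
		*skbp = NULL;                    /* guard against double free */
	}
}

/* Runs the three interesting cases and checks the resulting counters. */
bool put_txbuf_demo(void)
{
	struct fake_skb s = { .len = 64 };
	struct fake_skb *p = &s;

	put_txbuf(&p, false); /* TX completion: consumed */
	p = &s;
	put_txbuf(&p, true);  /* mapping error: dropped */
	put_txbuf(&p, true);  /* already NULL: nothing happens */

	return p == NULL && consumed == 1 && dropped == 1;
}
```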

Signed-off-by: Florian Fainelli 
---
Changes in v3:

- differentiate normal reclamation from TX DMA fragment mapping errors
- removed a few invalid dev_kfree_skb() replacements (those are already
   drop monitor friendly)

Changes in v2:

- make it compile

  drivers/net/ethernet/intel/e1000e/netdev.c | 18 +++++++++++-------
  1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 327dfe5bedc0..cfd21858c095 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1071,7 +1071,8 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
  }
  
  static void e1000_put_txbuf(struct e1000_ring *tx_ring,
-   struct e1000_buffer *buffer_info)
+   struct e1000_buffer *buffer_info,
+   bool drop)
  {
struct e1000_adapter *adapter = tx_ring->adapter;
  
@@ -1085,7 +1086,10 @@ static void e1000_put_txbuf(struct e1000_ring *tx_ring,
buffer_info->dma = 0;
}
if (buffer_info->skb) {
-   dev_kfree_skb_any(buffer_info->skb);
+   if (drop)
+   dev_kfree_skb_any(buffer_info->skb);
+   else
+   dev_consume_skb_any(buffer_info->skb);
buffer_info->skb = NULL;
}
buffer_info->time_stamp = 0;
@@ -1199,7 +1203,7 @@ static void e1000e_tx_hwtstamp_work(struct work_struct *work)
wmb(); /* force write prior to skb_tstamp_tx */
  
  		skb_tstamp_tx(skb, &shhwtstamps);

-   dev_kfree_skb_any(skb);
+   dev_consume_skb_any(skb);
} else if (time_after(jiffies, adapter->tx_hwtstamp_start
  + adapter->tx_timeout_factor * HZ)) {
dev_kfree_skb_any(adapter->tx_hwtstamp_skb);
@@ -1254,7 +1258,7 @@ static bool e1000_clean_tx_irq(struct e1000_ring *tx_ring)
}
}
  
-			e1000_put_txbuf(tx_ring, buffer_info);
+			e1000_put_txbuf(tx_ring, buffer_info, false);
tx_desc->upper.data = 0;
  
  			i++;

@@ -2421,7 +2425,7 @@ static void e1000_clean_tx_ring(struct e1000_ring *tx_ring)
  
  	for (i = 0; i < tx_ring->count; i++) {

buffer_info = &tx_ring->buffer_info[i];
-   e1000_put_txbuf(tx_ring, buffer_info);
+   e1000_put_txbuf(tx_ring, buffer_info, false);
}
  
  	netdev_reset_queue(adapter->netdev);

@@ -5614,7 +5618,7 @@ static int e1000_tx_map(struct e1000_ring *tx_ring, struct sk_buff *skb,
i += tx_ring->count;
i--;
buffer_info = &tx_ring->buffer_info[i];
-   e1000_put_txbuf(tx_ring, buffer_info);
+   e1000_put_txbuf(tx_ring, buffer_info, true);
}
  
  	return 0;

@@ -7411,7 +7415,7 @@ static void e1000_remove(struct pci_dev *pdev)
if (adapter->flags & FLAG_HAS_HW_TIMESTAMP) {
cancel_work_sync(&adapter->tx_hwtstamp_work);
if (adapter->tx_hwtstamp_skb) {
-   dev_kfree_skb_any(adapter->tx_hwtstamp_skb);
+   dev_consume_skb_any(adapter->tx_hwtstamp_skb);
adapter->tx_hwtstamp_skb = NULL;
}
}


I am OK with this patch.



Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-06 Thread Peter Zijlstra

Guys, please trim email.

On Tue, Sep 05, 2017 at 10:31:46AM -0400, Waiman Long wrote:
> For clarification, I was actually asking if you consider just adding one
> more jump label to skip it for Xen/KVM instead of making
> virt_spin_lock() a pv-op.

I don't understand. What performance are you worried about? Native will
now do "xor rax,rax; jnz some_cold_label", which is fairly trivial code.


4.13 on thinkpad x220: oops when writing to SD card

2017-09-06 Thread Seraphime Kirkovski
To: Adrian Hunter 
Cc: Shawn Lin , Pavel Machek ,
linux-...@vger.kernel.org,
kernel list 
Bcc: 
Subject: Re: 4.13 on thinkpad x220: oops when writing to SD card
Reply-To: 
In-Reply-To: <6689241f-a4d8-7a3e-9f0b-482b034e5...@intel.com>

Hi,

> > Seems 4.13-rc4 was already broken for that but unfortunately I didn't
> > reproduce that. So maybe Seraphime can do git-bisect as he said "I get
> > it everytime", for which I assume it could be easy for him to find out
> > the problematic commit?

I can reliably reproduce it, although sometimes it needs some more work.
For example, I couldn't trigger it while writing less than 1 gigabyte,
and sometimes I have to do it more than once. It helps if the machine is
doing something else in the meantime; I do kernel builds.

> Another unrelated issue with mmc_init_request() is that 
> mmc_exit_request()
> is not called if mmc_init_request() fails, which means mmc_init_request()
> must free anything it allocates when it fails.

Will try the patch and report back soon.


Re: [PATCH] blk-mq: Start to fix memory ordering...

2017-09-06 Thread Andrea Parri
On Mon, Sep 04, 2017 at 11:09:32AM +0200, Peter Zijlstra wrote:
> 
> Attempt to untangle the ordering in blk-mq. The patch introducing the
> single smp_mb__before_atomic() is obviously broken in that it doesn't
> clearly specify a pairing barrier and an obtained guarantee.
> 
> The comment is further misleading in that it hints that the
> deadline store and the COMPLETE store also need to be ordered, but
> AFAICT there is no such dependency. However what does appear to be
> important is the clear happening _after_ the store, and that worked by
> pure accident.
> 
> This clarifies blk_mq_start_request() -- we should not get there with
> STARTING set -- this simplifies the code and makes the barrier usage
> sane (the old code could be read to allow not having _any_ atomic after
> the barrier, in which case the barrier hasn't got anything to order). We
> then also introduce the missing pairing barrier for it.
> 
> And it documents the STARTING vs COMPLETE ordering. Although I've not
> been entirely successful in reverse engineering the blk-mq state
> machine so there might still be more funnies around timeout vs
> requeue.
> 
> If I got anything wrong, feel free to educate me by adding comments to
> clarify things ;-)
> 
> Cc: Alan Stern 
> Cc: Will Deacon 
> Cc: Ming Lei 
> Cc: Jens Axboe 
> Cc: Andrea Parri 
> Cc: Boqun Feng 
> Cc: "Paul E. McKenney" 
> Cc: Christoph Hellwig 
> Cc: Bart Van Assche 
> Fixes: 538b75341835 ("blk-mq: request deadline must be visible before marking 
> rq as started")
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  block/blk-mq.c |   48 +++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 37 insertions(+), 11 deletions(-)
> 
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -558,22 +558,29 @@ void blk_mq_start_request(struct request
>  
>   blk_add_timer(rq);
>  
> - /*
> -  * Ensure that ->deadline is visible before set the started
> -  * flag and clear the completed flag.
> -  */
> - smp_mb__before_atomic();
> + WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
>  
>   /*
>* Mark us as started and clear complete. Complete might have been
>* set if requeue raced with timeout, which then marked it as
>* complete. So be sure to clear complete again when we start
>* the request, otherwise we'll ignore the completion event.
> +  *
> +  * Ensure that ->deadline is visible before set STARTED, such that
> +  * blk_mq_check_expired() is guaranteed to observe our ->deadline
> +  * when it observes STARTED.
>*/
> - if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> - set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> - if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
> + smp_mb__before_atomic();

I am wondering whether we should be using smp_wmb() instead: this would
provide the above guarantee and save a full barrier on powerpc/arm64.


> + set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> + if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> + /*
> +  * Coherence order guarantees these consecutive stores to a
> +  * single variable propagate in the specified order. Thus the
> +  * clear_bit() is ordered _after_ the set bit. See
> +  * blk_mq_check_expired().
> +  */
>   clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);

It could be useful to stress that set_bit() and clear_bit() must "act" on
the same subword of the unsigned long (whatever "subword" means at this
level...) to rely on the coherence order (cf. alpha's implementation).


> + }
>  
>   if (q->dma_drain_size && blk_rq_bytes(rq)) {
>   /*
> @@ -744,11 +751,20 @@ static void blk_mq_check_expired(struct
>   struct request *rq, void *priv, bool reserved)
>  {
>   struct blk_mq_timeout_data *data = priv;
> + unsigned long deadline;
>  
>   if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>   return;
>  
>   /*
> +  * Ensures that if we see STARTED we must also see our
> +  * up-to-date deadline, see blk_mq_start_request().
> +  */
> + smp_rmb();
> +
> + deadline = READ_ONCE(rq->deaedline);
> +
> + /*
>* The rq being checked may have been freed and reallocated
>* out already here, we avoid this race by checking rq->deadline
>* and REQ_ATOM_COMPLETE flag together:
> @@ -761,10 +777,20 @@ static void blk_mq_check_expired(struct
>*   and clearing the flag in blk_mq_start_request(), so
>*   this rq won't be timed out too.
>*/
> - if (time_after_eq(jiffies, rq->deadline)) {
> - if (!blk_mark_rq_complete(rq))
> + if (time_after_eq(jiffies, deadline)) {
> + if (!blk_mark_rq_complete(rq)) {
> + /*
> +  * Relies on the implied MB from test_and_clear() to
> +  * order the COMPLETE load 
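The ->deadline/STARTED pairing discussed in this thread can be sketched with userspace C11 atomics; this is an analogy under the C11 memory model (release fence standing in for smp_wmb(), acquire fence for smp_rmb()), not the kernel primitives, and the deadline value is made up:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Writer publishes the deadline before setting STARTED; a reader that
 * observes STARTED is then guaranteed to observe the up-to-date deadline. */
static _Atomic unsigned long deadline;
static atomic_bool started;

static void *writer(void *arg)
{
	(void)arg;
	atomic_store_explicit(&deadline, 12345UL, memory_order_relaxed);
	/* release fence orders the deadline store before the STARTED store */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&started, true, memory_order_relaxed);
	return NULL;
}

/* Spins until STARTED is visible, then returns the observed deadline. */
unsigned long reader_observe(void)
{
	pthread_t t;
	unsigned long d;

	pthread_create(&t, NULL, writer, NULL);
	while (!atomic_load_explicit(&started, memory_order_relaxed))
		;
	/* acquire fence pairs with the writer's release fence */
	atomic_thread_fence(memory_order_acquire);
	d = atomic_load_explicit(&deadline, memory_order_relaxed);
	pthread_join(t, NULL);
	return d;
}
```

Under C11 fence semantics the reader can never see STARTED without also seeing the deadline store, which is the guarantee the kernel comment documents.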

RE: [PATCH] RM64: dts: ls208xa: Add iommu-map property for pci

2017-09-06 Thread Bharat Bhushan
Hi Robin,

> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, September 01, 2017 4:29 PM
> To: Bharat Bhushan ; Marc Zyngier
> ; robh...@kernel.org; Mark Rutland
> ; will.dea...@arm.com; o...@buserror.net; Gang
> Liu ; devicet...@vger.kernel.org; linux-arm-
> ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> catalin.mari...@arm.com
> Subject: Re: [PATCH] RM64: dts: ls208xa: Add iommu-map property for pci
> 
> On 01/09/17 11:13, Bharat Bhushan wrote:
> >
> >
> >> -Original Message- From: linux-kernel-ow...@vger.kernel.org
> >> [mailto:linux-kernel- ow...@vger.kernel.org] On Behalf Of Bharat
> >> Bhushan Sent: Thursday, August 31, 2017 4:53 PM To: Marc Zyngier
> >> ; robh...@kernel.org; Mark Rutland
> >> ; will.dea...@arm.com; o...@buserror.net;
> Gang
> >> Liu ; devicet...@vger.kernel.org;
> >> linux-arm-ker...@lists.infradead.org; linux- ker...@vger.kernel.org;
> >> catalin.mari...@arm.com Subject: RE:
> >> [PATCH] RM64: dts: ls208xa: Add iommu-map property for pci
> >>
> >>
> >>
> >>> -Original Message- From: Marc Zyngier
> >>> [mailto:marc.zyng...@arm.com] Sent: Thursday, August 31, 2017
> >>> 4:20 PM To: Bharat Bhushan ;
> >>> robh...@kernel.org;
> >> Mark
> >>> Rutland ; will.dea...@arm.com;
> >> o...@buserror.net;
> >>> Gang Liu ; devicet...@vger.kernel.org;
> >>> linux-arm-ker...@lists.infradead.org; linux- ker...@vger.kernel.org;
> >>> catalin.mari...@arm.com Subject: Re:
> >>> [PATCH] RM64: dts: ls208xa: Add iommu-map property for pci
> >>>
> >>> [Fixing Mark's address...]
> >>>
> >>> On 31/08/17 11:41, Bharat Bhushan wrote:
> 
> > -Original Message- From: Marc Zyngier
> > [mailto:marc.zyng...@arm.com] Sent: Thursday, August 31, 2017
> > 3:02 PM To: Bharat Bhushan ;
> > robh...@kernel.org; ark.rutl...@arm.com; will.dea...@arm.com;
> > o...@buserror.net; Gang
> >>> Liu
> > ; devicet...@vger.kernel.org; linux-arm-
> > ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> > catalin.mari...@arm.com Subject: Re: [PATCH] RM64: dts:
> > ls208xa: Add iommu-map property for pci
> >
> > On 31/08/17 10:23, Bharat Bhushan wrote:
> >> This patch adds iommu-map property for PCIe, which enables
> SMMU
> >> for these devices on LS208xA devices.
> >>
> >> Signed-off-by: Bharat Bhushan 
> >> ---
> >>  arch/arm64/boot/dts/freescale/fsl-ls208xa.dtsi | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/arch/arm64/boot/dts/freescale/fsl-ls208xa.dtsi
> >> b/arch/arm64/boot/dts/freescale/fsl-ls208xa.dtsi
> >> index 94cdd30..67cf605 100644
> >> --- a/arch/arm64/boot/dts/freescale/fsl-ls208xa.dtsi
> >> +++ b/arch/arm64/boot/dts/freescale/fsl-ls208xa.dtsi
> >> @@ -606,6 +606,7 @@
> >> 	num-lanes = <4>;
> >> 	bus-range = <0x0 0xff>;
> >> 	msi-parent = <&its>;
> >> +	iommu-map = <0 &smmu 0 1>;	/* This is fixed-up by u-boot */
> >
> > What does this do when your version of u-boot doesn't fill this in
> > for you?
> 
 Good question, frankly I have not thought of this case before.
 But if we pass length = 0 in the above property then no fixup will
 happen with older u-boot. In this case of_iommu_configure() will
 return NULL iommu ops and it will switch to swiotlb. Will that work?
> >>> I really don't like this. You rely on having invalid data in the DT,
> >>> and that seems just wrong.
> >>>
> >>> Why can't u-boot just generate that property, and we leave the DT
> >>> alone?
> >>
> >> We do not have smmu phandle allowing uboot to generate this.
> >>
> >>> Or even better, you provide the right information for the few boards
> >>> that are based on this SoC, not relying on u-boot for anything that
> >>> is in the kernel tree?
> >>
> >> On our platforms we have a h/w table which converts RID->Device-Id.
> >> I will check what will happen if that table is not initialized; can
> >> RID be equal to device-id in that case? If that is allowed then
> >> we can give the right information that will work without u-boot updating
> >> this property.
> >
> > U-boot uses a stream-id allocator and programs the h/w mapping table
> > (rid to sid mapping table). It also updates the iommu-map property
> > accordingly. But if u-boot does not update iommu-map then we cannot
> > have a valid foolproof solution, as stream-id allocation can change in
> > u-boot.
> >
> > So the other option, u-boot generating this entry, seems the correct
> > solution. This requires u-boot to know the iommu phandle, something
> > similar to "msi-parent" used for "msi-map". The device-tree binding
> > needs a change to add iommu-phandle/iommu-parent for this.
> 
> From what I know of this hardware, it's going to be rather difficult to
> concoct a DT which reflects the initial hardware state accurately *and* works
> correctly without updating u-boot in lockstep. IIRC, I believe the accurate
> description for an unp
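As background for the discussion above, an iommu-map entry of the form <rid-base iommu-phandle iommu-base length> maps a window of PCI requester IDs onto a window of stream IDs. The sketch below mirrors the generic of_map_rid() translation semantics; the window values in the demo map are made up for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* One parsed iommu-map entry (phandle omitted for the sketch). */
struct iommu_map_entry {
	uint32_t rid_base;   /* first requester ID covered */
	uint32_t iommu_base; /* stream ID the window starts at */
	uint32_t length;     /* number of IDs in the window */
};

/* Illustrative map: RIDs 0..7 -> stream IDs 0x100..0x107. */
static const struct iommu_map_entry demo_map[] = {
	{ .rid_base = 0x00, .iommu_base = 0x100, .length = 8 },
};

/* Returns the mapped stream ID, or -1 if no entry covers the RID
 * (the "no IOMMU translation, fall back" case discussed above). */
long map_rid(uint32_t rid)
{
	size_t i;

	for (i = 0; i < sizeof(demo_map) / sizeof(demo_map[0]); i++) {
		const struct iommu_map_entry *e = &demo_map[i];

		if (rid >= e->rid_base && rid - e->rid_base < e->length)
			return (long)e->iommu_base + (rid - e->rid_base);
	}
	return -1;
}
```

A zero-length entry, as proposed in the thread, covers no RID at all, so every lookup misses and no IOMMU translation is configured.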

Re: [PATCH] blk-mq: Start to fix memory ordering...

2017-09-06 Thread Peter Zijlstra
On Tue, Sep 05, 2017 at 02:51:25PM +, Bart Van Assche wrote:
> "deaedline" is a spelling error. Has this patch been tested?

Bah.. So I ran with a previous version for a while, but then redid the
whole patch (as it's mostly comments anyway) and clearly made a giant
mess of it.

I'll respin.


Re: [PATCH 1/9] ext4: remove duplicate extended attributes defs

2017-09-06 Thread Jan Kara
On Tue 05-09-17 16:35:33, Ross Zwisler wrote:
> The following commit:
> 
> commit 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR
> interface support")
> 
> added several defines related to extended attributes to ext4.h.  They were
> added within an #ifndef FS_IOC_FSGETXATTR block with the comment:
> 
> /* Until the uapi changes get merged for project quota... */
> 
> Those uapi changes were merged by this commit:
> 
> commit 334e580a6f97 ("fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR
> promotion")
> 
> so all the definitions needed by ext4 are available in
> include/uapi/linux/fs.h.  Remove the duplicates from ext4.h.
> 
> Signed-off-by: Ross Zwisler 
> Cc: Li Xi 
> Cc: Theodore Ts'o 
> Cc: Andreas Dilger 
> Cc: Jan Kara 
> Cc: Dave Chinner 

Yeah, good cleanup. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/ext4/ext4.h | 37 -------------------------------------
>  1 file changed, 37 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index a2bb7d2..c950278 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -644,43 +644,6 @@ enum {
>  #define EXT4_IOC_GET_ENCRYPTION_PWSALT   FS_IOC_GET_ENCRYPTION_PWSALT
>  #define EXT4_IOC_GET_ENCRYPTION_POLICY   FS_IOC_GET_ENCRYPTION_POLICY
>  
> -#ifndef FS_IOC_FSGETXATTR
> -/* Until the uapi changes get merged for project quota... */
> -
> -#define FS_IOC_FSGETXATTR_IOR('X', 31, struct fsxattr)
> -#define FS_IOC_FSSETXATTR_IOW('X', 32, struct fsxattr)
> -
> -/*
> - * Structure for FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR.
> - */
> -struct fsxattr {
> - __u32   fsx_xflags; /* xflags field value (get/set) */
> - __u32   fsx_extsize;/* extsize field value (get/set)*/
> - __u32   fsx_nextents;   /* nextents field value (get)   */
> - __u32   fsx_projid; /* project identifier (get/set) */
> - unsigned char   fsx_pad[12];
> -};
> -
> -/*
> - * Flags for the fsx_xflags field
> - */
> -#define FS_XFLAG_REALTIME0x0001  /* data in realtime volume */
> -#define FS_XFLAG_PREALLOC0x0002  /* preallocated file extents */
> -#define FS_XFLAG_IMMUTABLE   0x0008  /* file cannot be modified */
> -#define FS_XFLAG_APPEND  0x0010  /* all writes append */
> -#define FS_XFLAG_SYNC0x0020  /* all writes 
> synchronous */
> -#define FS_XFLAG_NOATIME 0x0040  /* do not update access time */
> -#define FS_XFLAG_NODUMP  0x0080  /* do not include in 
> backups */
> -#define FS_XFLAG_RTINHERIT   0x0100  /* create with rt bit set */
> -#define FS_XFLAG_PROJINHERIT 0x0200  /* create with parents projid */
> -#define FS_XFLAG_NOSYMLINKS  0x0400  /* disallow symlink creation */
> -#define FS_XFLAG_EXTSIZE 0x0800  /* extent size allocator hint */
> -#define FS_XFLAG_EXTSZINHERIT0x1000  /* inherit inode extent 
> size */
> -#define FS_XFLAG_NODEFRAG0x2000  /* do not defragment */
> -#define FS_XFLAG_FILESTREAM  0x4000  /* use filestream allocator */
> -#define FS_XFLAG_HASATTR 0x8000  /* no DIFLAG for this */
> -#endif /* !defined(FS_IOC_FSGETXATTR) */
> -
>  #define EXT4_IOC_FSGETXATTR  FS_IOC_FSGETXATTR
>  #define EXT4_IOC_FSSETXATTR  FS_IOC_FSSETXATTR
>  
> -- 
> 2.9.5
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] MIPS: mt7620: Rename uartlite to serial

2017-09-06 Thread John Crispin

Hi,


comments inline


On 01/09/17 16:53, Harvey Hunt wrote:

Previously, mt7620.c defined the clocks for uarts with the names
uartlite, uart1 and uart2. Rename them to serial{0,1,2} and update
the devicetree node names.

Signed-off-by: Harvey Hunt 
Cc: devicet...@vger.kernel.org
Cc: linux-m...@linux-mips.org
Cc: linux-kernel@vger.kernel.org
---
  arch/mips/boot/dts/ralink/mt7620a.dtsi |  2 +-
  arch/mips/boot/dts/ralink/mt7628a.dtsi |  6 +++---
  arch/mips/ralink/mt7620.c  | 14 +++++++-------
  3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/mips/boot/dts/ralink/mt7620a.dtsi b/arch/mips/boot/dts/ralink/mt7620a.dtsi
index 793c0c7..58bd002 100644
--- a/arch/mips/boot/dts/ralink/mt7620a.dtsi
+++ b/arch/mips/boot/dts/ralink/mt7620a.dtsi
@@ -45,7 +45,7 @@
reg = <0x300 0x100>;
};
  
-		uartlite@c00 {
+		serial0@c00 {
The uartlite is indeed not a full uart, having only rx/tx lines and
missing various other features, so I would prefer to keep it as is. You
cannot connect a modem to the port, for example, as that would require HW
handshake. Also, making these changes will break compatibility with
existing devicetrees.


John


compatible = "ralink,mt7620a-uart", "ralink,rt2880-uart", 
"ns16550a";
reg = <0xc00 0x100>;
  
diff --git a/arch/mips/boot/dts/ralink/mt7628a.dtsi b/arch/mips/boot/dts/ralink/mt7628a.dtsi

index 9ff7e8f..fe3fe9a 100644
--- a/arch/mips/boot/dts/ralink/mt7628a.dtsi
+++ b/arch/mips/boot/dts/ralink/mt7628a.dtsi
@@ -62,7 +62,7 @@
reg = <0x300 0x100>;
};
  
-		uart0: uartlite@c00 {
+		uart0: serial0@c00 {
compatible = "ns16550a";
reg = <0xc00 0x100>;
  
@@ -75,7 +75,7 @@

reg-shift = <2>;
};
  
-		uart1: uart1@d00 {
+		uart1: serial1@d00 {
compatible = "ns16550a";
reg = <0xd00 0x100>;
  
@@ -88,7 +88,7 @@

reg-shift = <2>;
};
  
-		uart2: uart2@e00 {
+		uart2: serial2@e00 {
compatible = "ns16550a";
reg = <0xe00 0x100>;
  
diff --git a/arch/mips/ralink/mt7620.c b/arch/mips/ralink/mt7620.c

index 9be8b08..f623ceb 100644
--- a/arch/mips/ralink/mt7620.c
+++ b/arch/mips/ralink/mt7620.c
@@ -54,7 +54,7 @@ static int dram_type;
  
  static struct rt2880_pmx_func i2c_grp[] =  { FUNC("i2c", 0, 1, 2) };

  static struct rt2880_pmx_func spi_grp[] = { FUNC("spi", 0, 3, 4) };
-static struct rt2880_pmx_func uartlite_grp[] = { FUNC("uartlite", 0, 15, 2) };
+static struct rt2880_pmx_func serial_grp[] = { FUNC("serial", 0, 15, 2) };
  static struct rt2880_pmx_func mdio_grp[] = {
FUNC("mdio", MT7620_GPIO_MODE_MDIO, 22, 2),
FUNC("refclk", MT7620_GPIO_MODE_MDIO_REFCLK, 22, 2),
@@ -92,7 +92,7 @@ static struct rt2880_pmx_group mt7620a_pinmux_data[] = {
GRP("uartf", uartf_grp, MT7620_GPIO_MODE_UART0_MASK,
MT7620_GPIO_MODE_UART0_SHIFT),
GRP("spi", spi_grp, 1, MT7620_GPIO_MODE_SPI),
-   GRP("uartlite", uartlite_grp, 1, MT7620_GPIO_MODE_UART1),
+   GRP("serial", serial_grp, 1, MT7620_GPIO_MODE_UART1),
GRP_G("wdt", wdt_grp, MT7620_GPIO_MODE_WDT_MASK,
MT7620_GPIO_MODE_WDT_GPIO, MT7620_GPIO_MODE_WDT_SHIFT),
GRP_G("mdio", mdio_grp, MT7620_GPIO_MODE_MDIO_MASK,
@@ -530,8 +530,8 @@ void __init ralink_clk_init(void)
periph_rate = MHZ(40);
pcmi2s_rate = MHZ(480);
  
-		ralink_clk_add("1d00.uartlite", periph_rate);
-		ralink_clk_add("1e00.uartlite", periph_rate);
+		ralink_clk_add("1d00.serial0", periph_rate);
+		ralink_clk_add("1e00.serial0", periph_rate);
} else {
cpu_pll_rate = mt7620_get_cpu_pll_rate(xtal_rate);
pll_rate = mt7620_get_pll_rate(xtal_rate, cpu_pll_rate);
@@ -566,9 +566,9 @@ void __init ralink_clk_init(void)
ralink_clk_add("1a00.i2s", pcmi2s_rate);
ralink_clk_add("1b00.spi", sys_rate);
ralink_clk_add("1b40.spi", sys_rate);
-   ralink_clk_add("1c00.uartlite", periph_rate);
-   ralink_clk_add("1d00.uart1", periph_rate);
-   ralink_clk_add("1e00.uart2", periph_rate);
+   ralink_clk_add("1c00.serial0", periph_rate);
+   ralink_clk_add("1d00.serial1", periph_rate);
+   ralink_clk_add("1e00.serial2", periph_rate);
ralink_clk_add("1018.wmac", xtal_rate);
  
  	if (IS_ENABLED(CONFIG_USB) && !is_mt76x8()) {




Re: [PATCH] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request

2017-09-06 Thread Vlastimil Babka
+CC linux-api

On 09/06/2017 06:35 AM, js1...@gmail.com wrote:
> From: Joonsoo Kim 
> 
> Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that
> important to reserve. When ZONE_MOVABLE is used, this problem would
> theoretically decrease the usable memory for GFP_HIGHUSER_MOVABLE
> allocation requests, which are mainly used for page cache and anon page
> allocation. So, fix it by setting
> sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM] to 0.
> 
> And, defining the sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES - 1
> entries makes the code complex. For example, on a highmem system the
> following reserve ratio is activated for the *NORMAL ZONE*, which could
> easily mislead people.
> 
>  #ifdef CONFIG_HIGHMEM
>  32
>  #endif
> 
> This patch also fixes this situation by defining the
> sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES entries and placing
> the "#ifdef"s in the right places.
> 
> Reviewed-by: Aneesh Kumar K.V 
> Acked-by: Vlastimil Babka 
> Signed-off-by: Joonsoo Kim 
> ---
>  Documentation/sysctl/vm.txt |  5 ++---
>  include/linux/mmzone.h  |  2 +-
>  mm/page_alloc.c | 25 ++---
>  3 files changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 9baf66a..e9059d3 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -336,8 +336,6 @@ The lowmem_reserve_ratio is an array. You can see them by 
> reading this file.
>  % cat /proc/sys/vm/lowmem_reserve_ratio
>  256 256 32
>  -
> -Note: # of this elements is one fewer than number of zones. Because the 
> highest
> -  zone's value is not necessary for following calculation.
>  
>  But, these values are not used directly. The kernel calculates # of 
> protection
>  pages for each zones from them. These are shown as array of protection pages
> @@ -388,7 +386,8 @@ As above expression, they are reciprocal number of ratio.
>  pages of higher zones on the node.
>  
>  If you would like to protect more pages, smaller values are effective.
> -The minimum value is 1 (1/1 -> 100%).
> +The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
> +disables protection of the pages.
>  
>  ==
>  
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 356a814..d549c4e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -890,7 +890,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, 
> int,
>   void __user *, size_t *, loff_t *);
>  int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
>   void __user *, size_t *, loff_t *);
> -extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
> +extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
>  int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
>   void __user *, size_t *, loff_t *);
>  int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0f34356..2a7f7e9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -203,17 +203,18 @@ static void __free_pages_ok(struct page *page, unsigned 
> int order);
>   * TBD: should special case ZONE_DMA32 machines here - in those we normally
>   * don't need any ZONE_NORMAL reservation
>   */
> -int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
> +int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
>  #ifdef CONFIG_ZONE_DMA
> -  256,
> + [ZONE_DMA] = 256,
>  #endif
>  #ifdef CONFIG_ZONE_DMA32
> -  256,
> + [ZONE_DMA32] = 256,
>  #endif
> + [ZONE_NORMAL] = 32,
>  #ifdef CONFIG_HIGHMEM
> -  32,
> + [ZONE_HIGHMEM] = 0,
>  #endif
> -  32,
> + [ZONE_MOVABLE] = 0,
>  };
>  
>  EXPORT_SYMBOL(totalram_pages);
> @@ -6921,13 +6922,15 @@ static void setup_per_zone_lowmem_reserve(void)
>   struct zone *lower_zone;
>  
>   idx--;
> -
> - if (sysctl_lowmem_reserve_ratio[idx] < 1)
> - sysctl_lowmem_reserve_ratio[idx] = 1;
> -
>   lower_zone = pgdat->node_zones + idx;
> - lower_zone->lowmem_reserve[j] = managed_pages /
> - sysctl_lowmem_reserve_ratio[idx];
> +
> + if (sysctl_lowmem_reserve_ratio[idx] < 1) {
> + sysctl_lowmem_reserve_ratio[idx] = 0;
> + lower_zone->lowmem_reserve[j] = 0;
> + } else {
> + lower_zone->lowmem_reserve[j] =
> + managed_pages / 
> sysctl_lowmem_reserve_ratio[idx];
> + }
>   managed_pages += lower_zone->managed_pages;
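The protection calculation in the quoted hunk can be illustrated with a userspace sketch; the zone sizes and ratios below are made up for the demonstration:

```c
/* Zone indices for the sketch: 0=DMA, 1=NORMAL, 2=HIGHMEM, 3=MOVABLE.
 * The protection of zone i against an allocation that may use zones up
 * to j is the sum of managed pages of zones (i, j] divided by
 * lowmem_reserve_ratio[i]; with this patch a ratio below 1 disables the
 * protection entirely instead of being clamped to 1 (i.e. 100%). */
#define DEMO_NR_ZONES 4

static const unsigned long demo_managed[DEMO_NR_ZONES] = {
	4096, 131072, 262144, 0
};
static const int demo_ratio[DEMO_NR_ZONES] = { 256, 32, 0, 0 };

unsigned long demo_protection(int i, int j)
{
	unsigned long pages = 0;
	int k;

	/* sum managed pages of the higher zones, as the docs describe */
	for (k = i + 1; k <= j; k++)
		pages += demo_managed[k];

	if (demo_ratio[i] < 1)
		return 0;	/* ratio < 1: protection disabled */
	return pages / demo_ratio[i];
}
```

With ZONE_HIGHMEM's ratio set to 0, no pages are reserved there, which is exactly the behavior change the patch argues for.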

Re: [PATCH v5 1/3] mfd: Add support for Cherry Trail Dollar Cove TI PMIC

2017-09-06 Thread Lee Jones
On Tue, 05 Sep 2017, Takashi Iwai wrote:

> On Tue, 05 Sep 2017 10:53:41 +0200,
> Lee Jones wrote:
> > 
> > On Tue, 05 Sep 2017, Takashi Iwai wrote:
> > 
> > > On Tue, 05 Sep 2017 10:10:49 +0200,
> > > Lee Jones wrote:
> > > > 
> > > > On Tue, 05 Sep 2017, Takashi Iwai wrote:
> > > > 
> > > > > On Tue, 05 Sep 2017 09:24:51 +0200,
> > > > > Lee Jones wrote:
> > > > > > 
> > > > > > On Mon, 04 Sep 2017, Takashi Iwai wrote:
> > > > > > 
> > > > > > > This patch adds the MFD driver for Dollar Cove (TI version) PMIC 
> > > > > > > with
> > > > > > > ACPI INT33F5 that is found on some Intel Cherry Trail devices.
> > > > > > > The driver is based on the original work by Intel, found at:
> > > > > > >   https://github.com/01org/ProductionKernelQuilts
> > > > > > > 
> > > > > > > This is a minimal version for adding the basic resources.  
> > > > > > > Currently,
> > > > > > > only ACPI PMIC opregion and the external power-button are used.
> > > > > > > 
> > > > > > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=193891
> > > > > > > Reviewed-by: Mika Westerberg 
> > > > > > > Reviewed-by: Andy Shevchenko 
> > > > > > > Signed-off-by: Takashi Iwai 
> > > > > > > ---
> > > > > > > v4->v5:
> > > > > > > * Minor coding-style fixes suggested by Lee
> > > > > > > * Put GPL text
> > > > > > > v3->v4:
> > > > > > > * no change for this patch
> > > > > > > v2->v3:
> > > > > > > * Rename dc_ti with chtdc_ti in all places
> > > > > > > * Driver/kconfig renames accordingly
> > > > > > > * Added acks by Andy and Mika
> > > > > > > v1->v2:
> > > > > > > * Minor cleanups as suggested by Andy
> > > > > > > 
> > > > > > >  drivers/mfd/Kconfig   |  13 +++
> > > > > > >  drivers/mfd/Makefile  |   1 +
> > > > > > >  drivers/mfd/intel_soc_pmic_chtdc_ti.c | 184 
> > > > > > > ++
> > > > > > >  3 files changed, 198 insertions(+)
> > > > > > >  create mode 100644 drivers/mfd/intel_soc_pmic_chtdc_ti.c
> > > > > > 
> > > > > > For my own reference:
> > > > > >   Acked-for-MFD-by: Lee Jones 
> > > > > 
> > > > > Thanks!
> > > > > 
> > > > > Now the question is how to deal with these.  It's nothing critical,
> > > > > so I'm OK with postponing to 4.15.  OTOH, it's really new
> > > > > device-specific stuff, thus it can't break anything else, and it'd be
> > > > > fairly safe to add for 4.14 although it's at a bit of a late stage.
> > > > 
> > > > Yes, you are over 2 weeks late for v4.14.  It will have to be v4.15.
> > > 
> > > OK, I'll ring your bells again once when 4.15 development is opened.
> > > 
> > > 
> > > > > IMO, it'd be great if you can carry all stuff through MFD tree; or
> > > > > create an immutable branch (again).  But how to handle it, when to do
> > > > > it, It's all up to you guys.
> > > > 
> > > > If there aren't any build dependencies between the patches, each of
> > > > the patches should be applied through their own trees.  What are the
> > > > build-time dependencies?  Are there any?
> > > 
> > > No, there is no strict build-time dependency.  It's just that I don't
> > > find it nice to have a commit for dead code, partly for testing
> > > purposes and partly for code consistency.  But if this makes
> > > maintenance easier, I'm happy with that, too, of course.
> > 
> > There won't be any dead code.  All of the subsystem trees are pulled
> > into -next [0] where the build bots can operate on the patches as a
> > whole.
> 
> But the merge order isn't guaranteed, i.e. at the point the other tree's
> commit for this new stuff lands, it's dead code until the MFD stuff is
> merged.  Imagine performing a git bisection, for example.  It's not
> about the whole tree, but about each commit.

Only *building* is relevant for bisection until the whole feature
lands.  No one is going to bisect the function of a feature until it
is present.  So long as there aren't any build-time dependencies,
we're good.

> And I won't be surprised if 0-day build bot gets a new feature to
> inspect the kconfig files, spot a dead kconfig entry and warn
> maintainers at each commit, too :)


0-day doesn't check for that, and static analysers only check releases.

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: printk: what is going on with additional newlines?

2017-09-06 Thread Petr Mladek
On Tue 2017-09-05 22:42:28, Sergey Senozhatsky wrote:
> On (09/05/17 14:21), Petr Mladek wrote:
> [..]
> > > that's why I want buffered printk to re-use the printk-safe buffer
> > > on that particular CPU [ if buffered printk will ever land ].
> > > printk-safe buffer is not allocated on stack, or kmalloc-ed for
> > > temp usage, and, more importantly, we flush it from panic().
> > > 
> > > and I'm not sure that lost messages due to missing panic flush()
> > > can really be an option even for a single cont line buffer. well,
> > > maybe it can. printk has a sort of guarantee that messages will
> > > be at some well known location when pr_foo or printk function
> > > returns. buffered printk kills it. and I don't want to have
> > > several "flavors" of printk. printk-safe buffer seems to be the
> > > way to preserve that guarantee.
> > 
> > But the well known locations would help only when they are flushed
> > in panic() or when a crashdump is created. They do not help
> > in other cases, especially where there is a sudden death.
> 
> if the system locked up and there is no panic()->flush_on_panic(),
> no console_unlock(), crashdump, no nothing - then even having
> messages in the logbuf is probably not really helpful. you can't
> reach them anyway :)
> so yes, I'm speaking here about the cases when we flush_on_panic()
> or/and generate crash dump.

Why are we so paranoid about a locked-up system when
discussing the console handling offload (printk kthread)?
And why should we be more relaxed when talking about pushing
messages from extra buffers?


> > There are many fears that printk offloading does not have enough
> > guarantees to actually happen. IMHO, there must be similar fears
> > that the messages in a temporary buffer will never get flushed.
> > 
> > And there are more risks with this approach:
> > 
> >   + soft-lockups caused by disabled preemption; we would
> > need this to stay on the same CPU and use the same buffer
> 
> well, yes. like any control path that disables IRQs there are
> rules to follow. so printk-safe based solution has limitations.
> I mentioned them probably every time I speak about printk-safe
> buffering. but those limitations come with a bonus - flush on
> panic and well known location of the messages.
> 
> one thing to notice is that
> printk-safe is usually faster than printk(), or at least as fast as
> the fastest printk() path, because, unlike printk, it does not
> spin on the logbuf lock; it does not console_trylock(), it does not
> do console_unlock().
> 
> 
> >   + broken preempt-count and missing message when one forgets
> > to close the buffered section or do it twice
> 
> yes, coding errors are possible.
> 
> 
> >   + lost messages because a per-CPU buffer size limitations
> 
> which is true for any type of buffers. including logbuf. and
> stack allocated buffers, any buffer. printk-safe buffer is at
> least much-much bigger than any stack allocated buffer.
> 
> 
> >   + races in printk_safe() that is not recursions safe
> >
> >   + not to say the problems mentioned by Linus as reply
> > to the Tetsuo's proposal, see
> > https://lkml.kernel.org/r/ca+55afx+5r-vfqfr7+ok9yrs2adq2ma4fz+s6ncywhy_-2m...@mail.gmail.com
> 
> like "limited in where you can actually expect buffering to happen"?
> 
> sure. it does not come for free, it's not all beautiful and shiny.

It is great that we see the risks and limitations.

> 
> [..]
> > I wonder if all this is worth the effort, complexity, and risk.
> > We are talking about cosmetic problems after all.
> 
> the thing about printk-safe buffering is that _mostly_ everything
> is already in the kernel. especially if we talk about single cont
> line buffering. just add public API printk_buffering_begin() and
> printk_buffering_end() that will __printk_safe_enter() and
> __printk_safe_exit(). and that's it. unless I'm missing something.
> 
> but I'm not super eager to have printk-safe based buffering.
> that's why I never posted a patch set. this approach has its
> limitations.

Ah, I am happy to read this. From the previous mails,
I got the feeling that you were eager to go this way.

I personally do not feel comfortable with taking all the risks
and limitations just to avoid mixed messages.

To be more precise, I am more and more pessimistic about
getting a safe buffer-based solution for multiple lines.

Well, it might make some sense for continuation lines. The
entire line should get printed within a few lines of code
and a limited time. Otherwise people could hardly expect
to see the pieces together. Then all the above risks and
limitations might be small and acceptable.


> > Well, what do you think about the extra printed information?
> > For example:
> > 
> >message
> > 
> > It looks straightforward to me. This information
> > might be helpful on its own, so it might be a
> > win-win solution.
> 
> hm... don't know. frankly, I never found PID useful. I mostly look
> at the serial logs postmortem. so lin

Re: [PATCH v5 1/3] mfd: Add support for Cherry Trail Dollar Cove TI PMIC

2017-09-06 Thread Lee Jones
On Tue, 05 Sep 2017, Rafael J. Wysocki wrote:

> On Tue, Sep 5, 2017 at 11:38 AM, Takashi Iwai  wrote:
> > On Tue, 05 Sep 2017 10:53:41 +0200,
> > Lee Jones wrote:
> >>
> >> On Tue, 05 Sep 2017, Takashi Iwai wrote:
> >>
> >> > On Tue, 05 Sep 2017 10:10:49 +0200,
> >> > Lee Jones wrote:
> >> > >
> >> > > On Tue, 05 Sep 2017, Takashi Iwai wrote:
> >> > >
> >> > > > On Tue, 05 Sep 2017 09:24:51 +0200,
> >> > > > Lee Jones wrote:
> >> > > > >
> >> > > > > On Mon, 04 Sep 2017, Takashi Iwai wrote:
> >> > > > >
> >> > > > > > This patch adds the MFD driver for Dollar Cove (TI version) PMIC 
> >> > > > > > with
> >> > > > > > ACPI INT33F5 that is found on some Intel Cherry Trail devices.
> >> > > > > > The driver is based on the original work by Intel, found at:
> >> > > > > >   https://github.com/01org/ProductionKernelQuilts
> >> > > > > >
> >> > > > > > This is a minimal version for adding the basic resources.  
> >> > > > > > Currently,
> >> > > > > > only ACPI PMIC opregion and the external power-button are used.
> >> > > > > >
> >> > > > > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=193891
> >> > > > > > Reviewed-by: Mika Westerberg 
> >> > > > > > Reviewed-by: Andy Shevchenko 
> >> > > > > > Signed-off-by: Takashi Iwai 
> >> > > > > > ---
> >> > > > > > v4->v5:
> >> > > > > > * Minor coding-style fixes suggested by Lee
> >> > > > > > * Put GPL text
> >> > > > > > v3->v4:
> >> > > > > > * no change for this patch
> >> > > > > > v2->v3:
> >> > > > > > * Rename dc_ti with chtdc_ti in all places
> >> > > > > > * Driver/kconfig renames accordingly
> >> > > > > > * Added acks by Andy and Mika
> >> > > > > > v1->v2:
> >> > > > > > * Minor cleanups as suggested by Andy
> >> > > > > >
> >> > > > > >  drivers/mfd/Kconfig   |  13 +++
> >> > > > > >  drivers/mfd/Makefile  |   1 +
> >> > > > > >  drivers/mfd/intel_soc_pmic_chtdc_ti.c | 184 
> >> > > > > > ++
> >> > > > > >  3 files changed, 198 insertions(+)
> >> > > > > >  create mode 100644 drivers/mfd/intel_soc_pmic_chtdc_ti.c
> >> > > > >
> >> > > > > For my own reference:
> >> > > > >   Acked-for-MFD-by: Lee Jones 
> >> > > >
> >> > > > Thanks!
> >> > > >
> >> > > > Now the question is how to deal with these.  It's nothing critical,
> >> > > > so I'm OK with postponing it to 4.15.  OTOH, it's really new
> >> > > > device-specific stuff, so it can't break anything else, and it'd be
> >> > > > fairly safe to add it for 4.14, although it's a bit late in the cycle.
> >> > >
> >> > > Yes, you are over 2 weeks late for v4.14.  It will have to be v4.15.
> >> >
> >> > OK, I'll ring your bell again once 4.15 development opens.
> >> >
> >> >
> >> > > > IMO, it'd be great if you could carry all of this through the MFD
> >> > > > tree, or create an immutable branch (again).  But how and when to
> >> > > > handle it is all up to you guys.
> >> > >
> >> > > If there aren't any build dependencies between the patches, each of
> >> > > the patches should be applied through their own trees.  What are the
> >> > > build-time dependencies?  Are there any?
> >> >
> >> > No, there is no strict build-time dependency.  It's just that I don't
> >> > think it's nice to have a commit for dead code, partly for testing
> >> > purposes and partly for code consistency.  But if this makes
> >> > maintenance easier, I'm happy with that too, of course.
> >>
> >> There won't be any dead code.  All of the subsystem trees are pulled
> >> into -next [0] where the build bots can operate on the patches as a
> >> whole.
> >
> > But the merge order isn't guaranteed, i.e. at the point the other tree's
> > commit for this new stuff lands, it's dead code until the MFD stuff is
> > merged.  Imagine performing a git bisection, for example.  It's not
> > about the whole tree, but about each commit.
> >
> > And I won't be surprised if 0-day build bot gets a new feature to
> > inspect the kconfig files, spot a dead kconfig entry and warn
> > maintainers at each commit, too :)
> 
> So I would prefer the whole series to go in via one tree in one go,
> because it is a series for a reason. :-)
> 
> The patches do depend on each other logically even though there may
> not be hard build-time dependencies between them.  It would be sort of
> good if the git history reflected that logical dependency.

We *never* do this.  Only build-time dependencies warrant the hassle
of immutable branches and cross-subsystem committing.  Patches should
be taken in via their own subsystems unless it would cause merge or
build issues if we did.

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


[PATCH -v2] blk-mq: Start to fix memory ordering...

2017-09-06 Thread Peter Zijlstra

Attempt to untangle the ordering in blk-mq. The patch introducing the
single smp_mb__before_atomic() is obviously broken in that it doesn't
clearly specify a pairing barrier and an obtained guarantee.

The comment is further misleading in that it hints that the
deadline store and the COMPLETE store also need to be ordered, but
AFAICT there is no such dependency. However what does appear to be
important is the clear happening _after_ the store, and that worked by
pure accident.

This clarifies blk_mq_start_request() -- we should not get there with
STARTING set -- this simplifies the code and makes the barrier usage
sane (the old code could be read to allow not having _any_ atomic after
the barrier, in which case the barrier hasn't got anything to order). We
then also introduce the missing pairing barrier for it.

Also down-grade the barrier to smp_wmb(), this is cheaper for
PowerPC/ARM and doesn't cost anything extra on x86.

And it documents the STARTING vs COMPLETE ordering. I've not
been entirely successful in reverse engineering the blk-mq state
machine, though, so there might still be more funnies around timeout
vs requeue.

If I got anything wrong, feel free to educate me by adding comments to
clarify things ;-)

Cc: Alan Stern 
Cc: Will Deacon 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Jens Axboe 
Cc: Andrea Parri 
Cc: Boqun Feng 
Cc: Bart Van Assche 
Cc: "Paul E. McKenney" 
Fixes: 538b75341835 ("blk-mq: request deadline must be visible before marking 
rq as started")
Signed-off-by: Peter Zijlstra (Intel) 
---
 - spelling; Andrea and Bart
 - compiles (urgh!)
 - smp_wmb(); Andrea


 block/blk-mq.c  | 52 
 block/blk-timeout.c |  2 +-
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..506a0f355117 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -558,22 +558,32 @@ void blk_mq_start_request(struct request *rq)
 
blk_add_timer(rq);
 
-   /*
-* Ensure that ->deadline is visible before set the started
-* flag and clear the completed flag.
-*/
-   smp_mb__before_atomic();
+   WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
 
/*
 * Mark us as started and clear complete. Complete might have been
 * set if requeue raced with timeout, which then marked it as
 * complete. So be sure to clear complete again when we start
 * the request, otherwise we'll ignore the completion event.
+*
+* Ensure that ->deadline is visible before we set STARTED, such that
+* blk_mq_check_expired() is guaranteed to observe our ->deadline when
+* it observes STARTED.
 */
-   if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
-   set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-   if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
+   smp_wmb();
+   set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+   if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
+   /*
+* Coherence order guarantees these consecutive stores to a
+* single variable propagate in the specified order. Thus the
+* clear_bit() is ordered _after_ the set bit. See
+* blk_mq_check_expired().
+*
+* (the bits must be part of the same byte for this to be
+* true).
+*/
clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
+   }
 
if (q->dma_drain_size && blk_rq_bytes(rq)) {
/*
@@ -744,11 +754,20 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx 
*hctx,
struct request *rq, void *priv, bool reserved)
 {
struct blk_mq_timeout_data *data = priv;
+   unsigned long deadline;
 
if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
return;
 
/*
+* Ensures that if we see STARTED we must also see our
+* up-to-date deadline, see blk_mq_start_request().
+*/
+   smp_rmb();
+
+   deadline = READ_ONCE(rq->deadline);
+
+   /*
 * The rq being checked may have been freed and reallocated
 * out already here, we avoid this race by checking rq->deadline
 * and REQ_ATOM_COMPLETE flag together:
@@ -761,11 +780,20 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx 
*hctx,
 *   and clearing the flag in blk_mq_start_request(), so
 *   this rq won't be timed out too.
 */
-   if (time_after_eq(jiffies, rq->deadline)) {
-   if (!blk_mark_rq_complete(rq))
+   if (time_after_eq(jiffies, deadline)) {
+   if (!blk_mark_rq_complete(rq)) {
+   /*
+* Again coherence order ensures that consecutive reads
+* from the same variable must be in that order. This
+* ensures that if w

Re: [PATCH v8 06/13] x86/apic: Mark the apic_intr_mode extern for sanity check cleanup

2017-09-06 Thread Baoquan He
On 09/06/17 at 01:41pm, Dou Liyang wrote:
> Hi Baoquan,
> 
> At 09/06/2017 01:26 PM, Baoquan He wrote:
> [...]
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 4f63afc..9f8479c 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1260,8 +1260,10 @@ static void __init smp_get_logical_apicid(void)
>  }
> 
>  /*
> - * Prepare for SMP bootup.  The MP table or ACPI has been read
> - * earlier.  Just do some sanity checking here and enable APIC mode.
> + * Prepare for SMP bootup.
> + *
> + * @max_cpus: configured maximum number of CPUs
> + * It is not used here, but other architectures also have this hook, so keep it.

Yeah, makes sense. Above words can be reconsidered.

>   */
>  void __init native_smp_prepare_cpus(unsigned int max_cpus)
>  {
> 
> Thanks,
>   dou.
> > > 
> > > static noinline void __init kernel_init_freeable(void)
> > > {
> > >   ...
> > >   smp_prepare_cpus(setup_max_cpus);
> > >   ...
> > > }
> > > 
> > > > 
> > > > apic_intr_mode_init();
> > > > 
> > > > -   switch (smp_sanity_check(max_cpus)) {
> > > > -   case SMP_NO_CONFIG:
> > > > -   disable_smp();
> > > > -   return;
> > > > -   case SMP_NO_APIC:
> > > > +   smp_sanity_check();
> > > > +
> > > > +   switch (apic_intr_mode) {
> > > > +   case APIC_PIC:
> > > > +   case APIC_VIRTUAL_WIRE_NO_CONFIG:
> > > > disable_smp();
> > > > return;
> > > > -   case SMP_FORCE_UP:
> > > > +   case APIC_SYMMETRIC_IO_NO_ROUTING:
> > > > disable_smp();
> > > > /* Setup local timer */
> > > > x86_init.timers.setup_percpu_clockev();
> > > > return;
> > > > -   case SMP_OK:
> > > > +   case APIC_VIRTUAL_WIRE:
> > > > +   case APIC_SYMMETRIC_IO:
> > > > break;
> > > > }
> > > > 
> > > > --
> > > > 2.5.5
> > > > 
> > > > 
> > > > 
> > 
> > 
> > 
> 
> 


Re: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU

2017-09-06 Thread Thomas Gleixner
On Wed, 6 Sep 2017, Yu Chen wrote:
> On Wed, Sep 06, 2017 at 12:57:41AM +0200, Thomas Gleixner wrote:
> > I have a hard time to figure out how the 133 vectors on CPU31 are now
> > magically fitting in the empty space on CPU0, which is 204 - 133 = 71. In
> > my limited understanding of math 133 is greater than 71, but your patch
> > might make that magically be wrong.
> >
> The problem is reproduced when the network cable is not plugged in,
> because this driver looks like this:
> 
> step 1. Reserve enough irq vectors and corresponding IRQs.
> step 2. If the network is activated, invoke request_irq() to
> register the handler.
> step 3. Invoke set_affinity() to spread the IRQs onto different
> CPUs, thus to spread the vectors too.
> 
> Here's my understanding of why spreading vectors might help in this
> special case:
> As step 2 will not get invoked, the IRQs of this driver
> have not been enabled, so in migrate_one_irq() these IRQs
> will not be considered because there is a check of
> irqd_is_started(d); thus there should only be 8 vectors
> allocated by this driver on CPU0, and 8 vectors left on
> CPU31. The 8 vectors on CPU31 will not be migrated
> to CPU0 either, so there is room for other 'valid' vectors
> to be migrated to CPU0.

Can you please spare me repeating your theories, as long as you don't have
hard facts to back them up? The network cable is changing the symptoms,
but the underlying root cause is definitely something different.

> # cat /sys/kernel/debug/irq/domains/*
> name:   VECTOR
>  size:   0
>  mapped: 388
>  flags:  0x0041

So we have 388 vectors mapped in total. And those are just device vectors
because system vectors are not accounted there.

> name:   IO-APIC-0
>  size:   24
>  mapped: 16

That's the legacy space

> name:   IO-APIC-1
>  size:   8
>  mapped: 2

> name:   IO-APIC-2
>  size:   8
>  mapped: 0

> name:   IO-APIC-3
>  size:   8
>  mapped: 0

> name:   IO-APIC-4
>  size:   8
>  mapped: 5

And a few GSIs: Total GSIs = 16 + 2 + 5 = 23

> name:   PCI-MSI-2
>  size:   0
>  mapped: 365

Plus 365 PCI-MSI vectors allocated.

>  flags:  0x0051
>  parent: VECTOR
> name:   VECTOR
>  size:   0
>  mapped: 388

Which nicely sums up to 388

> # ls /sys/kernel/debug/irq/irqs
> ls /sys/kernel/debug/irq/irqs
> 0  10   11  13  142  184  217  259  292  31  33   337  339
> 340  342  344  346  348  350  352  354  356  358  360  362
> 364  366  368  370  372  374  376  378  380  382  384  386
> 388  390  392  394  4  6   7  9  1  109  12  14  15   2
> 24   26   332  335  338  34   341  343  345  347  349
> 351  353  355  357  359  361  363  365  367  369  371  373
> 375  377  379  381  383  385  387  389  391  393  395  5
> 67  8

Those are all the interrupts which are active. That's a total of 89. Can you
explain where the delta of 299 vectors comes from?

299 allocated, vector mapped, but unused interrupts?

That's where your problem is, not in the vector spreading. You have a
massive leak.

> BTW, do we have sysfs to display how much vectors used on each CPUs?

Not yet.

Can you please apply the debug patch below, boot the machine and right
after login provide the output of

# cat /sys/kernel/debug/tracing/trace

Thanks,

tglx

8<---
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -372,6 +372,9 @@ int msi_domain_alloc_irqs(struct irq_dom
return ret;
}
 
+   trace_printk("dev: %s nvec %d virq %d\n",
+dev_name(dev), desc->nvec_used, virq);
+
for (i = 0; i < desc->nvec_used; i++)
irq_set_msi_desc_off(virq, i, desc);
}
@@ -419,6 +422,8 @@ void msi_domain_free_irqs(struct irq_dom
 * entry. If that's the case, don't do anything.
 */
if (desc->irq) {
+   trace_printk("dev: %s nvec %d virq %d\n",
+dev_name(dev), desc->nvec_used, desc->irq);
irq_domain_free_irqs(desc->irq, desc->nvec_used);
desc->irq = 0;
}


Re: [PATCH 1/2] mm/slub: wake up kswapd for initial high order allocation

2017-09-06 Thread Vlastimil Babka
On 09/06/2017 06:37 AM, js1...@gmail.com wrote:
> From: Joonsoo Kim 
> 
> slub uses a higher order allocation than it actually needs. In this case,
> we don't want to do direct reclaim to make such a high order page, since
> it causes a big latency for the user. Instead, we would like to fall back
> to the lower order allocation that it actually needs.
> 
> However, we also want to get this higher order page next time
> in order to get the best performance, and that is the role of
> background threads such as kswapd and kcompactd. To wake them up,
> we should not clear __GFP_KSWAPD_RECLAIM.
> 
> Contrary to this intention, the current code clears __GFP_KSWAPD_RECLAIM,
> so fix it. The unintended behaviour was introduced by Mel's commit
> 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
> stall-free defrag option") for the slub part. It removed a special case in
> __alloc_pages_slowpath() where including __GFP_THISNODE and lacking
> __GFP_DIRECT_RECLAIM effectively meant also lacking __GFP_KSWAPD_RECLAIM.
> However, slub doesn't use __GFP_THISNODE, so that is not the case here,
> and partially reverting this code in slub doesn't hurt Mel's intention.
> 
> Note that this patch does some cleanup, too:
> __GFP_NOFAIL was cleared twice, so remove one instance.
> 
> Signed-off-by: Joonsoo Kim 

Acked-by: Vlastimil Babka 

> ---
>  mm/slub.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 163352c..45f4a4b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1578,8 +1578,12 @@ static struct page *allocate_slab(struct kmem_cache 
> *s, gfp_t flags, int node)
>* so we fall-back to the minimum order allocation.
>*/
>   alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
> - if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > 
> oo_order(s->min))
> - alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & 
> ~(__GFP_RECLAIM|__GFP_NOFAIL);
> + if (oo_order(oo) > oo_order(s->min)) {
> + if (alloc_gfp & __GFP_DIRECT_RECLAIM) {
> + alloc_gfp |= __GFP_NOMEMALLOC;
> + alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> + }
> + }
>  
>   page = alloc_slab_page(s, alloc_gfp, node, oo);
>   if (unlikely(!page)) {
> 



[PATCH v4 0/3] arm: npcm: add basic support for Nuvoton BMCs

2017-09-06 Thread Brendan Higgins
Addressed comments from:
  - Joel: https://www.spinics.net/lists/arm-kernel/msg605137.html

Changes since previous update:
  - Added SMP and SMP_ON_UP
  - Changed dts to use phandle


Re: UBSAN: Undefined error in time.h signed integer overflow

2017-09-06 Thread Thomas Gleixner
On Tue, 5 Sep 2017, John Stultz wrote:
> On Tue, Sep 5, 2017 at 9:30 PM, Shankara Pailoor  wrote:
> > Hi,
> >
> > I encountered this bug while fuzzing linux kernel 4.13-rc7 with syzkaller.
> >
> > 
> > UBSAN: Undefined behaviour in ./include/linux/time.h:233:27
> > signed integer overflow:
> > 8391720337152500783 * 10 cannot be represented in type 'long int'
> > CPU: 0 PID: 31798 Comm: syz-executor2 Not tainted 4.13.0-rc7 #2
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:16 [inline]
> >  dump_stack+0xf7/0x1ae lib/dump_stack.c:52
> >  ubsan_epilogue+0x12/0x8f lib/ubsan.c:164
> >  handle_overflow+0x21e/0x292 lib/ubsan.c:195
> >  __ubsan_handle_mul_overflow+0x2a/0x3e lib/ubsan.c:219
> >  timespec_to_ns include/linux/time.h:233 [inline]
> >  posix_cpu_timer_set+0xb5c/0xf20 kernel/time/posix-cpu-timers.c:686
> >  do_timer_settime+0x1f4/0x390 kernel/time/posix-timers.c:890
> >  SYSC_timer_settime kernel/time/posix-timers.c:916 [inline]
> >  SyS_timer_settime+0xea/0x170 kernel/time/posix-timers.c:902
> >  entry_SYSCALL_64_fastpath+0x18/0xad
> > RIP: 0033:0x451e59
> > RSP: 002b:7fb62af4fc08 EFLAGS: 0216 ORIG_RAX: 00df
> > RAX: ffda RBX: 00718000 RCX: 00451e59
> > RDX: 20006000 RSI:  RDI: 
> > RBP: 0046 R08:  R09: 
> > R10: 20003fe0 R11: 0216 R12: 004be920
> > R13:  R14:  R15: 
> > 
> 
> Looks similar to the issue Thomas fixed here:
>https://patchwork.kernel.org/patch/9799827/
> 
> 
> Thomas: Should we change timespec_to_ns() to use the same transition
> internally rather than trying to track all the callers?

Probably.

Thanks,

tglx


[PATCH v4 3/3] MAINTAINERS: Add entry for the Nuvoton NPCM architecture

2017-09-06 Thread Brendan Higgins
Add maintainers and reviewers for the Nuvoton NPCM architecture.

Signed-off-by: Brendan Higgins 
Reviewed-by: Tomer Maimon 
Reviewed-by: Avi Fishman 
---
 MAINTAINERS | 13 +
 1 file changed, 13 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 44cb004c765d..67064bf11904 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1598,6 +1598,19 @@ F:   drivers/pinctrl/nomadik/
 F: drivers/i2c/busses/i2c-nomadik.c
 T: git 
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik.git
 
+ARM/NUVOTON NPCM ARCHITECTURE
+M: Avi Fishman 
+M: Tomer Maimon 
+R: Brendan Higgins 
+R: Rick Altherr 
+L: open...@lists.ozlabs.org (moderated for non-subscribers)
+S: Maintained
+F: arch/arm/mach-npcm/
+F: arch/arm/boot/dts/nuvoton-npcm*
+F: include/dt-bindings/clock/nuvoton,npcm7xx-clks.h
+F: drivers/*/*npcm*
+F: Documentation/*/*npcm*
+
 ARM/NUVOTON W90X900 ARM ARCHITECTURE
 M: Wan ZongShun 
 L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
-- 
2.14.1.581.gf28d330327-goog



[PATCH v4 2/3] arm: dts: add Nuvoton NPCM750 device tree

2017-09-06 Thread Brendan Higgins
Add a common device tree for all Nuvoton NPCM750 BMCs and a board
specific device tree for the NPCM750 (Poleg) evaluation board.

Signed-off-by: Brendan Higgins 
Reviewed-by: Tomer Maimon 
Reviewed-by: Avi Fishman 
Tested-by: Tomer Maimon 
Tested-by: Avi Fishman 
---
 .../arm/cpu-enable-method/nuvoton,npcm7xx-smp  |  42 +
 .../devicetree/bindings/arm/npcm/npcm.txt  |   6 +
 arch/arm/boot/dts/nuvoton-npcm750-evb.dts  |  57 +++
 arch/arm/boot/dts/nuvoton-npcm750.dtsi | 177 +
 include/dt-bindings/clock/nuvoton,npcm7xx-clks.h   |  39 +
 5 files changed, 321 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/arm/cpu-enable-method/nuvoton,npcm7xx-smp
 create mode 100644 Documentation/devicetree/bindings/arm/npcm/npcm.txt
 create mode 100644 arch/arm/boot/dts/nuvoton-npcm750-evb.dts
 create mode 100644 arch/arm/boot/dts/nuvoton-npcm750.dtsi
 create mode 100644 include/dt-bindings/clock/nuvoton,npcm7xx-clks.h

diff --git 
a/Documentation/devicetree/bindings/arm/cpu-enable-method/nuvoton,npcm7xx-smp 
b/Documentation/devicetree/bindings/arm/cpu-enable-method/nuvoton,npcm7xx-smp
new file mode 100644
index ..e81f85b400cf
--- /dev/null
+++ 
b/Documentation/devicetree/bindings/arm/cpu-enable-method/nuvoton,npcm7xx-smp
@@ -0,0 +1,42 @@
+=
+Secondary CPU enable-method "nuvoton,npcm7xx-smp" binding
+=
+
+To apply to all CPUs, a single "nuvoton,npcm7xx-smp" enable method should be
+defined in the "cpus" node.
+
+Enable method name:"nuvoton,npcm7xx-smp"
+Compatible machines:   "nuvoton,npcm750"
+Compatible CPUs:   "arm,cortex-a9"
+Related properties:(none)
+
+Note:
+This enable method needs valid nodes compatible with "arm,cortex-a9-scu" and
+"nuvoton,npcm750-gcr".
+
+Example:
+
+   cpus {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   enable-method = "nuvoton,npcm7xx-smp";
+
+   cpu@0 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a9";
+   clocks = <&clk NPCM7XX_CLK_CPU>;
+   clock-names = "clk_cpu";
+   reg = <0>;
+   next-level-cache = <&L2>;
+   };
+
+   cpu@1 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a9";
+   clocks = <&clk NPCM7XX_CLK_CPU>;
+   clock-names = "clk_cpu";
+   reg = <1>;
+   next-level-cache = <&L2>;
+   };
+   };
+
diff --git a/Documentation/devicetree/bindings/arm/npcm/npcm.txt 
b/Documentation/devicetree/bindings/arm/npcm/npcm.txt
new file mode 100644
index ..2d87d9ecea85
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/npcm/npcm.txt
@@ -0,0 +1,6 @@
+NPCM Platforms Device Tree Bindings
+---
+NPCM750 SoC
+Required root node properties:
+   - compatible = "nuvoton,npcm750";
+
diff --git a/arch/arm/boot/dts/nuvoton-npcm750-evb.dts 
b/arch/arm/boot/dts/nuvoton-npcm750-evb.dts
new file mode 100644
index ..e54a870d3ee0
--- /dev/null
+++ b/arch/arm/boot/dts/nuvoton-npcm750-evb.dts
@@ -0,0 +1,57 @@
+/*
+ * DTS file for all NPCM750 SoCs
+ *
+ * Copyright 2012 Tomer Maimon 
+ *
+ * The code contained herein is licensed under the GNU General Public
+ * License. You may obtain a copy of the GNU General Public License
+ * Version 2 or later at the following locations:
+ *
+ * http://www.opensource.org/licenses/gpl-license.html
+ * http://www.gnu.org/copyleft/gpl.html
+ */
+
+/dts-v1/;
+#include "nuvoton-npcm750.dtsi"
+
+/ {
+   model = "Nuvoton npcm750 Development Board (Device Tree)";
+   compatible = "nuvoton,npcm750";
+
+   chosen {
+   stdout-path = &serial3;
+   bootargs = "earlyprintk=serial,serial3,115200";
+   };
+
+   memory {
+   reg = <0 0x4000>;
+   };
+
+   cpus {
+   enable-method = "nuvoton,npcm7xx-smp";
+   };
+};
+
+&clk {
+   status = "okay";
+};
+
+&watchdog1 {
+   status = "okay";
+};
+
+&serial0 {
+   status = "okay";
+};
+
+&serial1 {
+   status = "okay";
+};
+
+&serial2 {
+   status = "okay";
+};
+
+&serial3 {
+   status = "okay";
+};
diff --git a/arch/arm/boot/dts/nuvoton-npcm750.dtsi 
b/arch/arm/boot/dts/nuvoton-npcm750.dtsi
new file mode 100644
index ..bca96b3ae9d3
--- /dev/null
+++ b/arch/arm/boot/dts/nuvoton-npcm750.dtsi
@@ -0,0 +1,177 @@
+/*
+ * DTSi file for the NPCM750 SoC
+ *
+ * Copyright 2012 Tomer Maimon 
+ *
+ * The code contained herein is licensed under the GNU General Public
+ * License. You may obtain a copy of the GNU General Public License
+ * Version 2 or later at the following locations:
+ *
+ * http://www.opensource.or

[PATCH v4 1/3] arm: npcm: add basic support for Nuvoton BMCs

2017-09-06 Thread Brendan Higgins
Adds basic support for the Nuvoton NPCM750 BMC.

Signed-off-by: Brendan Higgins 
---
 arch/arm/Kconfig |  2 +
 arch/arm/Makefile|  1 +
 arch/arm/mach-npcm/Kconfig   | 50 +
 arch/arm/mach-npcm/Makefile  |  3 ++
 arch/arm/mach-npcm/headsmp.S | 17 +
 arch/arm/mach-npcm/npcm7xx.c | 34 +
 arch/arm/mach-npcm/platsmp.c | 89 
 7 files changed, 196 insertions(+)
 create mode 100644 arch/arm/mach-npcm/Kconfig
 create mode 100644 arch/arm/mach-npcm/Makefile
 create mode 100644 arch/arm/mach-npcm/headsmp.S
 create mode 100644 arch/arm/mach-npcm/npcm7xx.c
 create mode 100644 arch/arm/mach-npcm/platsmp.c

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 61a0cb15067e..05543f1cfbde 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -782,6 +782,8 @@ source "arch/arm/mach-netx/Kconfig"
 
 source "arch/arm/mach-nomadik/Kconfig"
 
+source "arch/arm/mach-npcm/Kconfig"
+
 source "arch/arm/mach-nspire/Kconfig"
 
 source "arch/arm/plat-omap/Kconfig"
diff --git a/arch/arm/Makefile b/arch/arm/Makefile
index 47d3a1ab08d2..60ca50c7d762 100644
--- a/arch/arm/Makefile
+++ b/arch/arm/Makefile
@@ -191,6 +191,7 @@ machine-$(CONFIG_ARCH_MEDIATEK) += mediatek
 machine-$(CONFIG_ARCH_MXS) += mxs
 machine-$(CONFIG_ARCH_NETX)+= netx
 machine-$(CONFIG_ARCH_NOMADIK) += nomadik
+machine-$(CONFIG_ARCH_NPCM)+= npcm
 machine-$(CONFIG_ARCH_NSPIRE)  += nspire
 machine-$(CONFIG_ARCH_OXNAS)   += oxnas
 machine-$(CONFIG_ARCH_OMAP1)   += omap1
diff --git a/arch/arm/mach-npcm/Kconfig b/arch/arm/mach-npcm/Kconfig
new file mode 100644
index ..d47061855439
--- /dev/null
+++ b/arch/arm/mach-npcm/Kconfig
@@ -0,0 +1,50 @@
+menuconfig ARCH_NPCM
+   bool "Nuvoton NPCM Architecture"
+   select ARCH_REQUIRE_GPIOLIB
+   select USE_OF
+   select PINCTRL
+   select PINCTRL_NPCM7XX
+
+if ARCH_NPCM
+
+comment "NPCMX50 CPU type"
+
+config CPU_NPCM750
+   depends on ARCH_NPCM && ARCH_MULTI_V7
+   bool "Support for NPCM750 BMC CPU (Poleg)"
+   select CACHE_L2X0
+   select CPU_V7
+   select ARM_GIC
+   select HAVE_SMP
+   select SMP
+   select SMP_ON_UP
+   select HAVE_ARM_SCU
+   select HAVE_ARM_TWD if SMP
+   select ARM_ERRATA_458693
+   select ARM_ERRATA_720789
+   select ARM_ERRATA_742231
+   select ARM_ERRATA_754322
+   select ARM_ERRATA_764369
+   select ARM_ERRATA_794072
+   select PL310_ERRATA_588369
+   select PL310_ERRATA_727915
+   select USB_EHCI_ROOT_HUB_TT
+   select USB_ARCH_HAS_HCD
+   select USB_ARCH_HAS_EHCI
+   select USB_EHCI_HCD
+   select USB_ARCH_HAS_OHCI
+   select USB_OHCI_HCD
+   select USB
+   select FIQ
+   select CPU_USE_DOMAINS
+   select GENERIC_CLOCKEVENTS
+   select CLKDEV_LOOKUP
+   select COMMON_CLK if OF
+   select NPCM750_TIMER
+   select MFD_SYSCON
+   help
+ Support for NPCM750 BMC CPU (Poleg).
+
+ Nuvoton NPCM750 BMC based on the Cortex A9.
+
+endif
diff --git a/arch/arm/mach-npcm/Makefile b/arch/arm/mach-npcm/Makefile
new file mode 100644
index ..78416055b854
--- /dev/null
+++ b/arch/arm/mach-npcm/Makefile
@@ -0,0 +1,3 @@
+AFLAGS_headsmp.o   += -march=armv7-a
+
+obj-$(CONFIG_CPU_NPCM750)  += npcm7xx.o platsmp.o headsmp.o
diff --git a/arch/arm/mach-npcm/headsmp.S b/arch/arm/mach-npcm/headsmp.S
new file mode 100644
index ..9fccbbd49ed4
--- /dev/null
+++ b/arch/arm/mach-npcm/headsmp.S
@@ -0,0 +1,17 @@
+/*
+ * Copyright 2017 Google, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+
+ENTRY(npcm7xx_secondary_startup)
+   safe_svcmode_maskall r0
+
+   b   secondary_startup
+ENDPROC(npcm7xx_secondary_startup)
diff --git a/arch/arm/mach-npcm/npcm7xx.c b/arch/arm/mach-npcm/npcm7xx.c
new file mode 100644
index ..132e9d587857
--- /dev/null
+++ b/arch/arm/mach-npcm/npcm7xx.c
@@ -0,0 +1,34 @@
+/*
+ * Copyright (c) 2017 Nuvoton Technology corporation.
+ * Copyright 2017 Google, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NPCM7XX_AUX_VAL (L310_AUX_CTRL_INSTR_PREFETCH |   \
+L310_AUX_CTRL_DATA_PREFETCH | \
+L310_AUX_CTRL_NS_LOCKDOWN |   \
+L310_AUX_CTRL_CACHE_REPLACE_RR |  \
+L2C_AUX_CTRL_SHARED_OVE

Re: [PATCH 2/2] mm/slub: don't use reserved memory for optimistic try

2017-09-06 Thread Vlastimil Babka
On 09/06/2017 06:37 AM, js1...@gmail.com wrote:
> From: Joonsoo Kim 
> 
> High-order atomic allocation is hard to satisfy since we cannot
> reclaim anything in this context. So, we reserve the pageblock for
> this kind of request.
> 
> In slub, we try to allocate a higher-order page than is actually
> needed in order to get the best performance. If this optimistic try is
> used with GFP_ATOMIC, alloc_flags will be set to ALLOC_HARDER and
> the pageblock reserved for high-order atomic allocation would be used.
> Moreover, this request would reserve the MIGRATE_HIGHATOMIC pageblock,
> if it succeeds, to prepare for further requests. It would not be good
> to use a MIGRATE_HIGHATOMIC pageblock in terms of fragmentation
> management since it unconditionally sets a migratetype to the request's
> migratetype when unreserving the pageblock, without considering the
> migratetype of used pages in the pageblock.
> 
> This is not what we intend, so fix it by unconditionally masking
> out __GFP_ATOMIC in order to not set ALLOC_HARDER.
> 
> It is also undesirable to use reserved memory for the optimistic try,
> so mask out __GFP_HIGH. This patch also adds __GFP_NOMEMALLOC since
> we don't want to use the reserved memory for the optimistic try even if
> the user has the PF_MEMALLOC flag.
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  include/linux/gfp.h | 1 +
>  mm/page_alloc.c | 8 
>  mm/slub.c   | 6 ++
>  3 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index f780718..1f5658e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -568,6 +568,7 @@ extern gfp_t gfp_allowed_mask;
>  
>  /* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
>  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
> +gfp_t gfp_drop_reserves(gfp_t gfp_mask);
>  
>  extern void pm_restrict_gfp_mask(void);
>  extern void pm_restore_gfp_mask(void);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6dbc49e..0f34356 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3720,6 +3720,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>   return !!__gfp_pfmemalloc_flags(gfp_mask);
>  }
>  
> +gfp_t gfp_drop_reserves(gfp_t gfp_mask)
> +{
> + gfp_mask &= ~(__GFP_HIGH | __GFP_ATOMIC);
> + gfp_mask |= __GFP_NOMEMALLOC;
> +
> + return gfp_mask;
> +}
> +

I think it's wasteful to do a function call for this; an inline
definition in a header would be better (gfp_pfmemalloc_allowed() is
different, as it relies on the rather heavyweight __gfp_pfmemalloc_flags()).

>  /*
>   * Checks whether it makes sense to retry the reclaim to make a forward 
> progress
>   * for the given allocation request.
> diff --git a/mm/slub.c b/mm/slub.c
> index 45f4a4b..3d75d30 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1579,10 +1579,8 @@ static struct page *allocate_slab(struct kmem_cache 
> *s, gfp_t flags, int node)
>*/
>   alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
>   if (oo_order(oo) > oo_order(s->min)) {
> - if (alloc_gfp & __GFP_DIRECT_RECLAIM) {
> - alloc_gfp |= __GFP_NOMEMALLOC;
> - alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> - }
> + alloc_gfp = gfp_drop_reserves(alloc_gfp);
> + alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
>   }
>  
>   page = alloc_slab_page(s, alloc_gfp, node, oo);
> 




Re: [PATCH v6] vfio: platform: reset: Add Broadcom FlexRM reset module

2017-09-06 Thread Anup Patel
On Tue, Sep 5, 2017 at 9:27 PM, Alex Williamson
 wrote:
> On Mon, 4 Sep 2017 15:20:11 +0530
> Anup Patel  wrote:
>
>> Sorry for delayed response...
>>
>> On Tue, Aug 29, 2017 at 7:39 PM, Konrad Rzeszutek Wilk
>>  wrote:
>> > On Tue, Aug 29, 2017 at 09:34:46AM +0530, Anup Patel wrote:
>> >> This patch adds Broadcom FlexRM low-level reset for
>> >> VFIO platform.
>> >>
>> >
>> > Is there a document that explains and/or details the various
>> > registers?
>>
>> Yes, there is a document but it's not publicly accessible.
>>
>> >> It will do the following:
>> >> 1. Disable/Deactivate each FlexRM ring
>> >> 2. Flush each FlexRM ring
>> >>
>> >> The cleanup sequence for FlexRM rings is adapted from
>> >> Broadcom FlexRM mailbox driver.
>> >>
>> >> Signed-off-by: Anup Patel 
>> >> Reviewed-by: Oza Oza 
>> >> Reviewed-by: Scott Branden 
>> >> Reviewed-by: Eric Auger 
>> >> ---
>> >>  drivers/vfio/platform/reset/Kconfig|   9 ++
>> >>  drivers/vfio/platform/reset/Makefile   |   1 +
>> >>  .../vfio/platform/reset/vfio_platform_bcmflexrm.c  | 100 
>> >> +
>> >>  3 files changed, 110 insertions(+)
>> >>  create mode 100644 drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c
>> >>
>> >> diff --git a/drivers/vfio/platform/reset/Kconfig 
>> >> b/drivers/vfio/platform/reset/Kconfig
>> >> index 705..392e3c0 100644
>> >> --- a/drivers/vfio/platform/reset/Kconfig
>> >> +++ b/drivers/vfio/platform/reset/Kconfig
>> >> @@ -13,3 +13,12 @@ config VFIO_PLATFORM_AMDXGBE_RESET
>> >> Enables the VFIO platform driver to handle reset for AMD XGBE
>> >>
>> >> If you don't know what to do here, say N.
>> >> +
>> >> +config VFIO_PLATFORM_BCMFLEXRM_RESET
>> >> + tristate "VFIO support for Broadcom FlexRM reset"
>> >> + depends on VFIO_PLATFORM && (ARCH_BCM_IPROC || COMPILE_TEST)
>> >> + default ARCH_BCM_IPROC
>> >> + help
>> >> +   Enables the VFIO platform driver to handle reset for Broadcom 
>> >> FlexRM
>> >> +
>> >> +   If you don't know what to do here, say N.
>> >> diff --git a/drivers/vfio/platform/reset/Makefile 
>> >> b/drivers/vfio/platform/reset/Makefile
>> >> index 93f4e23..8d9874b 100644
>> >> --- a/drivers/vfio/platform/reset/Makefile
>> >> +++ b/drivers/vfio/platform/reset/Makefile
>> >> @@ -5,3 +5,4 @@ ccflags-y += -Idrivers/vfio/platform
>> >>
>> >>  obj-$(CONFIG_VFIO_PLATFORM_CALXEDAXGMAC_RESET) += 
>> >> vfio-platform-calxedaxgmac.o
>> >>  obj-$(CONFIG_VFIO_PLATFORM_AMDXGBE_RESET) += vfio-platform-amdxgbe.o
>> >> +obj-$(CONFIG_VFIO_PLATFORM_BCMFLEXRM_RESET) += vfio_platform_bcmflexrm.o
>> >> diff --git a/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c 
>> >> b/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c
>> >> new file mode 100644
>> >> index 000..966a813
>> >> --- /dev/null
>> >> +++ b/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c
>> >> @@ -0,0 +1,100 @@
>> >> +/*
>> >> + * Copyright (C) 2017 Broadcom
>> >> + *
>> >> + * This program is free software; you can redistribute it and/or modify
>> >> + * it under the terms of the GNU General Public License version 2 as
>> >> + * published by the Free Software Foundation.
>> >> + *
>> >> + * This program is distributed in the hope that it will be useful,
>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> >> + * GNU General Public License for more details.
>> >> + *
>> >> + * You should have received a copy of the GNU General Public License
>> >> + * along with this program.  If not, see .
>> >> + */
>> >> +
>> >> +/*
>> >> + * This driver provides reset support for Broadcom FlexRM ring manager
>> >> + * to VFIO platform.
>> >> + */
>> >> +
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +#include 
>> >> +
>> >> +#include "vfio_platform_private.h"
>> >> +
>> >> +/* FlexRM configuration */
>> >> +#define RING_REGS_SIZE   0x1
>> >> +#define RING_VER_MAGIC   0x76303031
>> >> +
>> >> +/* Per-Ring register offsets */
>> >> +#define RING_VER 0x000
>> >> +#define RING_CONTROL 0x034
>> >> +#define RING_FLUSH_DONE  0x038
>> >> +
>> >> +/* Register RING_CONTROL fields */
>> >> +#define CONTROL_FLUSH_SHIFT  5
>> >> +
>> >> +/* Register RING_FLUSH_DONE fields */
>> >> +#define FLUSH_DONE_MASK  0x1
>> >> +
>> >> +static int vfio_platform_bcmflexrm_shutdown(void __iomem *ring)
>> >> +{
>> >> + unsigned int timeout;
>> >> +
>> >> + /* Disable/inactivate ring */
>> >> + writel_relaxed(0x0, ring + RING_CONTROL);
>> >> +
>> >> + /* Flush ring with timeout of 1s */
>> >> + timeout = 1000;
>> >
>> > Perhaps a #define for this value?
>>
>> This magic value 1000 makes it explic

Re: [2/2] genirq: Warn when IRQ_NOAUTOEN is used with shared interrupts

2017-09-06 Thread Thomas Gleixner
On Tue, 5 Sep 2017, Paul Burton wrote:
> I'm currently attempting to clean up a hack that we have in the MIPS GIC 
> irqchip driver - we have some interrupts which are really per-CPU, but are 
> currently used with the regular non-per-CPU IRQ APIs. Please search for usage 
> of gic_all_vpes_local_irq_controller (or for the string "HACK") in drivers/
> irqchip/irq-mips-gic.c if you wish to find what I'm talking about. The 
> important details are that the interrupts in question are both per-CPU and on 
> many systems are shared (between the CPU timer, performance counters & fast 
> debug channel).
> 
> I have been attempting to move towards using the per-CPU APIs instead in 
> order 
> to remove this hack - ie. using setup_percpu_irq() & enable_percpu_irq() in 
> place of plain old setup_irq(). Unfortunately what I've run into is this:
> 
>   - Per-CPU interrupts get the IRQ_NOAUTOEN flag set by default, in
> irq_set_percpu_devid_flags(). I can see why this makes sense in the
> general case, since the alternative is setup_percpu_irq() enabling the
> interrupt on the CPU that calls it & leaving it disabled on others, which
> feels a little unclean.
> 
>   - Your warning above triggers when a shared interrupt has the IRQ_NOAUTOEN
> flag set. I can see why your warning makes sense if another driver has
> already enabled the shared interrupt, which would make IRQ_NOAUTOEN
> ineffective. I'm not sure I follow your comment above the warning though -
> it sounds like you're trying to describe something else?

> > +   /*
> > +* Shared interrupts do not go well with disabling
> > +* auto enable. The sharing interrupt might request
> > +* it while it's still disabled and then wait for
> > +* interrupts forever.
> > +*/

Assume the following:

   request_irq(X, handler1, NOAUTOEN|SHARED, dev1);

now the second device does:

   request_irq(X, handler2, SHARED, dev2):

which will see the first handler installed, so it won't run into the code
path which starts up the interrupt. That means as long as dev1 does not
explicitly enable the interrupt, dev2 will wait for it forever.

> For my interrupts which are both per-CPU & shared the combination of these 2 
> facts mean I end up triggering your warning. My current ideas include:
> 
>   - I could clear the IRQ_NOAUTOEN flag before calling setup_percpu_irq(). In
> my cases that should be fine - we call enable_percpu_irq() anyway, and
> would just enable the IRQ slightly earlier on the CPU which calls
> setup_percpu_irq() which wouldn't be a problem. It feels a bit yucky
> though.

What's the problem with IRQ_NOAUTOEN and do

   setup_percpu_irq();
   enable_percpu_irq();

on the boot CPU and then later call it when the secondary CPUs come up in
cpu bringup code or a hotplug state callback?

Thanks,

tglx


   


[PATCH v7] FlexRM support in VFIO platform

2017-09-06 Thread Anup Patel
This patchset primarily adds Broadcom FlexRM reset module for
VFIO platform driver.

The patches are based on Linux-4.13-rc3 and can also be
found at flexrm-vfio-v7 branch of
https://github.com/Broadcom/arm64-linux.git

Changes since v6:
 - Update the FlexRM ring flush sequence as suggested
   by HW folks
 - Shut down all FlexRM rings anyway even if the flush fails on
   any of them
 - Use dev_warn() instead of pr_warn()

Changes since v5:
 - Make kconfig option VFIO_PLATFORM_BCMFLEXRM_RESET
   default to ARCH_BCM_IPROC

Changes since v4:
 - Use "--timeout" instead of "timeout--" in
   vfio_platform_bcmflexrm_shutdown()

Changes since v3:
 - Improve "depends on" for Kconfig option
   VFIO_PLATFORM_BCMFLEXRM_RESET
 - Fix typo in pr_warn() called by
   vfio_platform_bcmflexrm_shutdown()
 - Return error from vfio_platform_bcmflexrm_shutdown()
   when FlexRM ring flush timeout happens

Changes since v2:
 - Remove PATCH1 because fixing VFIO no-IOMMU mode is
   a separate topic

Changes since v1:
 - Remove iommu_present() check in vfio_iommu_group_get()
 - Drop PATCH1-to-PATCH3 because IOMMU_CAP_BYPASS is not
   required
 - Move additional comments out of license header in
   vfio_platform_bcmflexrm.c

Anup Patel (1):
  vfio: platform: reset: Add Broadcom FlexRM reset module

 drivers/vfio/platform/reset/Kconfig|   9 ++
 drivers/vfio/platform/reset/Makefile   |   1 +
 .../vfio/platform/reset/vfio_platform_bcmflexrm.c  | 115 +
 3 files changed, 125 insertions(+)
 create mode 100644 drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c

-- 
2.7.4



[PATCH v7] vfio: platform: reset: Add Broadcom FlexRM reset module

2017-09-06 Thread Anup Patel
This patch adds Broadcom FlexRM low-level reset for
VFIO platform.

It will do the following:
1. Disable/Deactivate each FlexRM ring
2. Flush each FlexRM ring

The cleanup sequence for FlexRM rings is adapted from
Broadcom FlexRM mailbox driver.

Signed-off-by: Anup Patel 
Reviewed-by: Oza Oza 
Reviewed-by: Scott Branden 
Reviewed-by: Eric Auger 
---
 drivers/vfio/platform/reset/Kconfig|   9 ++
 drivers/vfio/platform/reset/Makefile   |   1 +
 .../vfio/platform/reset/vfio_platform_bcmflexrm.c  | 115 +
 3 files changed, 125 insertions(+)
 create mode 100644 drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c

diff --git a/drivers/vfio/platform/reset/Kconfig 
b/drivers/vfio/platform/reset/Kconfig
index 705..392e3c0 100644
--- a/drivers/vfio/platform/reset/Kconfig
+++ b/drivers/vfio/platform/reset/Kconfig
@@ -13,3 +13,12 @@ config VFIO_PLATFORM_AMDXGBE_RESET
  Enables the VFIO platform driver to handle reset for AMD XGBE
 
  If you don't know what to do here, say N.
+
+config VFIO_PLATFORM_BCMFLEXRM_RESET
+   tristate "VFIO support for Broadcom FlexRM reset"
+   depends on VFIO_PLATFORM && (ARCH_BCM_IPROC || COMPILE_TEST)
+   default ARCH_BCM_IPROC
+   help
+ Enables the VFIO platform driver to handle reset for Broadcom FlexRM
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/platform/reset/Makefile 
b/drivers/vfio/platform/reset/Makefile
index 93f4e23..8d9874b 100644
--- a/drivers/vfio/platform/reset/Makefile
+++ b/drivers/vfio/platform/reset/Makefile
@@ -5,3 +5,4 @@ ccflags-y += -Idrivers/vfio/platform
 
 obj-$(CONFIG_VFIO_PLATFORM_CALXEDAXGMAC_RESET) += vfio-platform-calxedaxgmac.o
 obj-$(CONFIG_VFIO_PLATFORM_AMDXGBE_RESET) += vfio-platform-amdxgbe.o
+obj-$(CONFIG_VFIO_PLATFORM_BCMFLEXRM_RESET) += vfio_platform_bcmflexrm.o
diff --git a/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c 
b/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c
new file mode 100644
index 000..5f89066
--- /dev/null
+++ b/drivers/vfio/platform/reset/vfio_platform_bcmflexrm.c
@@ -0,0 +1,115 @@
+/*
+ * Copyright (C) 2017 Broadcom
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see .
+ */
+
+/*
+ * This driver provides reset support for Broadcom FlexRM ring manager
+ * to VFIO platform.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vfio_platform_private.h"
+
+/* FlexRM configuration */
+#define RING_REGS_SIZE 0x1
+#define RING_VER_MAGIC 0x76303031
+
+/* Per-Ring register offsets */
+#define RING_VER   0x000
+#define RING_CONTROL   0x034
+#define RING_FLUSH_DONE0x038
+
+/* Register RING_CONTROL fields */
+#define CONTROL_FLUSH_SHIFT5
+
+/* Register RING_FLUSH_DONE fields */
+#define FLUSH_DONE_MASK0x1
+
+static int vfio_platform_bcmflexrm_shutdown(void __iomem *ring)
+{
+   unsigned int timeout;
+
+   /* Disable/inactivate ring */
+   writel_relaxed(0x0, ring + RING_CONTROL);
+
+   /* Set ring flush state */
+   timeout = 1000; /* timeout of 1s */
+   writel_relaxed(BIT(CONTROL_FLUSH_SHIFT), ring + RING_CONTROL);
+   do {
+   if (readl_relaxed(ring + RING_FLUSH_DONE) &
+   FLUSH_DONE_MASK)
+   break;
+   mdelay(1);
+   } while (--timeout);
+   if (!timeout)
+   return -ETIMEDOUT;
+
+   /* Clear ring flush state */
+   timeout = 1000; /* timeout of 1s */
+   writel_relaxed(0x0, ring + RING_CONTROL);
+   do {
+   if (!(readl_relaxed(ring + RING_FLUSH_DONE) &
+ FLUSH_DONE_MASK))
+   break;
+   mdelay(1);
+   } while (--timeout);
+   if (!timeout)
+   return -ETIMEDOUT;
+
+   return 0;
+}
+
+static int vfio_platform_bcmflexrm_reset(struct vfio_platform_device *vdev)
+{
+   void __iomem *ring;
+   int rc = 0, ret = 0, ring_num = 0;
+   struct vfio_platform_region *reg = &vdev->regions[0];
+
+   /* Map FlexRM ring registers if not mapped */
+   if (!reg->ioaddr) {
+   reg->ioaddr = ioremap_nocache(reg->addr, reg->size);
+   if (!reg->ioaddr)
+  

Re: [PATCH] arm64: KVM: VHE: save and restore some PSTATE bits

2017-09-06 Thread Marc Zyngier
On 05/09/17 19:58, gengdongjiu wrote:
> When exiting from a guest, some host PSTATE bits may be lost, such as
> PSTATE.PAN or PSTATE.UAO. This is because the host and hypervisor both
> run at EL2, so the host PSTATE value cannot be saved and restored via
> SPSR_EL2. So if the guest has changed PSTATE, the host continues with
> the wrong value the guest has set.
> 
> Signed-off-by: Dongjiu Geng 
> Signed-off-by: Haibin Zhang 
> ---
>  arch/arm64/include/asm/kvm_host.h |  8 +++
>  arch/arm64/include/asm/kvm_hyp.h  |  2 ++
>  arch/arm64/include/asm/sysreg.h   | 23 +++
>  arch/arm64/kvm/hyp/entry.S|  2 --
>  arch/arm64/kvm/hyp/switch.c   | 24 ++--
>  arch/arm64/kvm/hyp/sysreg-sr.c| 48 
> ---
>  6 files changed, 100 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index e923b58..cba7d3e 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -193,6 +193,12 @@ struct kvm_cpu_context {
>   };
>  };
>  
> +struct kvm_cpu_host_pstate {
> + u64 daif;
> + u64 uao;
> + u64 pan;
> +};

I love it. This is the most expensive way of saving/restoring a single
32bit value.

More seriously, please see the discussion between James and Christoffer
there[1]. I expect James to address the PAN/UAO states together with the
debug state in the next iteration of his patch.

Thanks,

M.

[1] https://www.spinics.net/lists/arm-kernel/msg599798.html
-- 
Jazz is not dead. It just smells funny...


Re: [tip:locking/core] locking/refcount: Create unchecked atomic_t implementation

2017-09-06 Thread Peter Zijlstra
On Tue, Sep 05, 2017 at 09:15:36PM +0300, Alexey Dobriyan wrote:

> It is not a policy, just a "grassroot" movement to give people good
> example and wait until someone is annoyed enough to mass convert
everything. :-)

Well, its a mess and I cleaned up the outliers.

> Worked for fs/proc/Makefile .

No, that file is still a mess; the below is what it would take to clean
up according to your preferred pattern.

So go spend an hour or so and write a script that cleans this all up and
convince Linus to run it. Otherwise I really can't be arsed with this
nonsense.

---
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 12c6922c913c..8c19ab50eb7f 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -5,29 +5,35 @@
 obj-y   += proc.o
 
 CFLAGS_task_mmu.o  += $(call cc-option,-Wno-override-init,)
+
 proc-y := nommu.o task_nommu.o
 proc-$(CONFIG_MMU) := task_mmu.o
 
-proc-y   += inode.o root.o base.o generic.o array.o \
-   fd.o
-proc-$(CONFIG_TTY)  += proc_tty.o
-proc-y += cmdline.o
-proc-y += consoles.o
-proc-y += cpuinfo.o
-proc-y += devices.o
-proc-y += interrupts.o
-proc-y += loadavg.o
-proc-y += meminfo.o
-proc-y += stat.o
-proc-y += uptime.o
-proc-y += version.o
-proc-y += softirqs.o
-proc-y += namespaces.o
-proc-y += self.o
-proc-y += thread_self.o
-proc-$(CONFIG_PROC_SYSCTL) += proc_sysctl.o
-proc-$(CONFIG_NET) += proc_net.o
-proc-$(CONFIG_PROC_KCORE)  += kcore.o
-proc-$(CONFIG_PROC_VMCORE) += vmcore.o
-proc-$(CONFIG_PRINTK)  += kmsg.o
-proc-$(CONFIG_PROC_PAGE_MONITOR)   += page.o
+proc-y += array.o
+proc-y += base.o
+proc-y += cmdline.o
+proc-y += consoles.o
+proc-y += cpuinfo.o
+proc-y += devices.o
+proc-y += fd.o
+proc-y += generic.o
+proc-y += inode.o
+proc-y += interrupts.o
+proc-y += loadavg.o
+proc-y += meminfo.o
+proc-y += namespaces.o
+proc-y += root.o
+proc-y += self.o
+proc-y += softirqs.o
+proc-y += stat.o
+proc-y += thread_self.o
+proc-y += uptime.o
+proc-y += version.o
+
+proc-$(CONFIG_NET)		+= proc_net.o
+proc-$(CONFIG_PRINTK)		+= kmsg.o
+proc-$(CONFIG_PROC_KCORE)	+= kcore.o
+proc-$(CONFIG_PROC_PAGE_MONITOR)+= page.o
+proc-$(CONFIG_PROC_SYSCTL)	+= proc_sysctl.o
+proc-$(CONFIG_PROC_VMCORE)	+= vmcore.o
+proc-$(CONFIG_TTY)		+= proc_tty.o


Re: [PATCH] mm, sparse: fix typo in online_mem_sections

2017-09-06 Thread Vlastimil Babka
On 09/04/2017 01:22 PM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> online_mem_sections accidentally marks online only the first section in
> the given range. This is a typo which hasn't been noticed because I
> haven't tested large 2GB blocks previously. All users of
> pfn_to_online_page would get confused on the rest of the pfn range
> in the block.
> 
> All we need to fix this is to use the iterator (pfn) rather than start_pfn.
> 
> Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have 
> holes")
> Cc: stable
> Signed-off-by: Michal Hocko 

Acked-by: Vlastimil Babka 

> ---
>  mm/sparse.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index a9783acf2bb9..83b3bf6461af 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -626,7 +626,7 @@ void online_mem_sections(unsigned long start_pfn, 
> unsigned long end_pfn)
>   unsigned long pfn;
>  
>   for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> - unsigned long section_nr = pfn_to_section_nr(start_pfn);
> + unsigned long section_nr = pfn_to_section_nr(pfn);
>   struct mem_section *ms;
>  
>   /* onlining code should never touch invalid ranges */
> 



Re: [PATCH v3 2/3] arm: dts: add Nuvoton NPCM750 device tree

2017-09-06 Thread Brendan Higgins
On Tue, Sep 5, 2017 at 10:23 PM, Joel Stanley  wrote:
> On Wed, Sep 6, 2017 at 10:00 AM, Brendan Higgins
>  wrote:
>> +++ b/Documentation/devicetree/bindings/arm/npcm/npcm.txt
>> @@ -0,0 +1,6 @@
>> +NPCM Platforms Device Tree Bindings
>> +---
>> +NPCM750 SoC
>> +Required root node properties:
>> +   - compatible = "nuvoton,npcm750";
>> +
>
> This is minimal. I assume there will be more content added as more
> support is added?

Yep, that's the plan. They have a number of similar BMCs, both those
based on different ARM cores and some with different peripheral sets,
so we will probably want to have different compat strings for those.

>
> Does it need it's own directory?

Not sure, I saw that some of the other architectures did it, some did
not. I don't feel strongly about it.

>
>
>> diff --git a/arch/arm/boot/dts/nuvoton-npcm750-evb.dts 
>> b/arch/arm/boot/dts/nuvoton-npcm750-evb.dts
>> new file mode 100644
>> index ..54df32cff21b
>> --- /dev/null
>> +++ b/arch/arm/boot/dts/nuvoton-npcm750-evb.dts
>
>> +
>> +/dts-v1/;
>> +#include "nuvoton-npcm750.dtsi"
>> +
>> +/ {
>> +   model = "Nuvoton npcm750 Development Board (Device Tree)";
>> +   compatible = "nuvoton,npcm750";
>> +
>> +   chosen {
>> +   stdout-path = &serial3;
>> +   bootargs = "earlyprintk=serial,serial3,115200";
>> +   };
>> +
>> +   memory {
>> +   reg = <0 0x4000>;
>> +   };
>> +
>> +   cpus {
>> +   enable-method = "nuvoton,npcm7xx-smp";
>> +   };
>> +
>> +   clk: clock-controller@f0801000 {
>> +   status = "okay";
>> +   };
>> +
>> +   apb {
>> +   watchdog1: watchdog@f0009000 {
>> +   status = "okay";
>> +   };
>
> You've already got the label for the node, is there a reason you
> don't use a phandle to set the status?

Addressed in v4.

>
> &watchdog1 {
>status = "okay";
> };
>
> Same with the serial nodes below.
>
>> +
>> +   serial0: serial0@f0001000 {
>> +   status = "okay";
>> +   };
>> +
>> +   serial1: serial1@f0002000 {
>> +   status = "okay";
>> +   };
>> +
>> +   serial2: serial2@f0003000 {
>> +   status = "okay";
>> +   };
>> +
>> +   serial3: serial3@f0004000 {
>> +   status = "okay";
>> +   };
>> +   };


4.13 on thinkpad x220: oops when writing to SD card

2017-09-06 Thread Seraphime Kirkovski
Hi,

> > Seems 4.13-rc4 was already broken for that but unfortunately I didn't
> > reproduce that. So maybe Seraphime can do git-bisect as he said "I get
> > it everytime" for which I assume it could be easy for him to find out
> > the problematic commit?

I can reliably reproduce it, although sometimes it needs some more work.
For example, I couldn't trigger it while writing less than 1 gigabyte,
and sometimes I have to do it more than once. It helps if the machine is
doing something else in the meantime, so I do kernel builds.

> Another unrelated issue with mmc_init_request() is that mmc_exit_request()
> is not called if mmc_init_request() fails, which means mmc_init_request()
> must free anything it allocates when it fails.

I've been running your patch for 45 minutes now; it looks like it fixes
the issue on 4.13 81a84ad3cb5711cec79.

P.S. Sorry about the formatting, have to fix my editor


Re: [PATCH v5 1/3] mfd: Add support for Cherry Trail Dollar Cove TI PMIC

2017-09-06 Thread Takashi Iwai
On Wed, 06 Sep 2017 09:54:44 +0200,
Lee Jones wrote:
> 
> On Tue, 05 Sep 2017, Takashi Iwai wrote:
> 
> > On Tue, 05 Sep 2017 10:53:41 +0200,
> > Lee Jones wrote:
> > > 
> > > On Tue, 05 Sep 2017, Takashi Iwai wrote:
> > > 
> > > > On Tue, 05 Sep 2017 10:10:49 +0200,
> > > > Lee Jones wrote:
> > > > > 
> > > > > On Tue, 05 Sep 2017, Takashi Iwai wrote:
> > > > > 
> > > > > > On Tue, 05 Sep 2017 09:24:51 +0200,
> > > > > > Lee Jones wrote:
> > > > > > > 
> > > > > > > On Mon, 04 Sep 2017, Takashi Iwai wrote:
> > > > > > > 
> > > > > > > > This patch adds the MFD driver for Dollar Cove (TI version) 
> > > > > > > > PMIC with
> > > > > > > > ACPI INT33F5 that is found on some Intel Cherry Trail devices.
> > > > > > > > The driver is based on the original work by Intel, found at:
> > > > > > > >   https://github.com/01org/ProductionKernelQuilts
> > > > > > > > 
> > > > > > > > This is a minimal version for adding the basic resources.  
> > > > > > > > Currently,
> > > > > > > > only ACPI PMIC opregion and the external power-button are used.
> > > > > > > > 
> > > > > > > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=193891
> > > > > > > > Reviewed-by: Mika Westerberg 
> > > > > > > > Reviewed-by: Andy Shevchenko 
> > > > > > > > Signed-off-by: Takashi Iwai 
> > > > > > > > ---
> > > > > > > > v4->v5:
> > > > > > > > * Minor coding-style fixes suggested by Lee
> > > > > > > > * Put GPL text
> > > > > > > > v3->v4:
> > > > > > > > * no change for this patch
> > > > > > > > v2->v3:
> > > > > > > > * Rename dc_ti with chtdc_ti in all places
> > > > > > > > * Driver/kconfig renames accordingly
> > > > > > > > * Added acks by Andy and Mika
> > > > > > > > v1->v2:
> > > > > > > > * Minor cleanups as suggested by Andy
> > > > > > > > 
> > > > > > > >  drivers/mfd/Kconfig   |  13 +++
> > > > > > > >  drivers/mfd/Makefile  |   1 +
> > > > > > > >  drivers/mfd/intel_soc_pmic_chtdc_ti.c | 184 
> > > > > > > > ++
> > > > > > > >  3 files changed, 198 insertions(+)
> > > > > > > >  create mode 100644 drivers/mfd/intel_soc_pmic_chtdc_ti.c
> > > > > > > 
> > > > > > > For my own reference:
> > > > > > >   Acked-for-MFD-by: Lee Jones 
> > > > > > 
> > > > > > Thanks!
> > > > > > 
> > > > > > Now the question is how to deal with these.  It's nothing critical,
> > > > > > so I'm OK with postponing to 4.15.  OTOH, it's really new
> > > > > > device-specific stuff, thus it can't break anything else, and it'd
> > > > > > be fairly safe to add it for 4.14, although it's at a bit late stage.
> > > > > 
> > > > > Yes, you are over 2 weeks late for v4.14.  It will have to be v4.15.
> > > > 
> > > > OK, I'll ring your bells again once when 4.15 development is opened.
> > > > 
> > > > 
> > > > > > IMO, it'd be great if you could carry all the stuff through the MFD
> > > > > > tree, or create an immutable branch (again).  But how and when to
> > > > > > handle it is all up to you guys.
> > > > > 
> > > > > If there aren't any build dependencies between the patches, each of
> > > > > the patches should be applied through their own trees.  What are the
> > > > > build-time dependencies?  Are there any?
> > > > 
> > > > No, there is no strict build-time dependency.  It's just that I don't
> > > > find it nice to have a commit for dead code, partly for testing
> > > > purposes and partly for code consistency.  But if this makes
> > > > maintenance easier, I'm happy with that, too, of course.
> > > 
> > > There won't be any dead code.  All of the subsystem trees are pulled
> > > into -next [0] where the build bots can operate on the patches as a
> > > whole.
> > 
> > But the merge order isn't guaranteed, i.e. at the commits in the other
> > trees for this new stuff, it's dead code until the MFD stuff is merged
> > beforehand.  E.g. imagine performing a git bisection.  It's not about
> > the whole tree, but about each commit.
> 
> Only *building* is relevant for bisection until the whole feature
> lands.

Why only building?

When merging through several trees, commits for the same series end up
completely scattered although they are loosely tied together.  This
sucks when you perform a git bisection, e.g. if you have an issue in the
middle of the patch series.  It still works, but it jumps unnecessarily
far away and back before reaching the point, and kconfig options appear /
disappear inconsistently (the dependent kconfig is gone in the middle).
And, this is about the release kernel (4.15 or whatever).

Basically, my complaint here comes with my user's hat on.  It *is*
indeed worse than a straight application of patches on some levels.
That's unavoidable if you do it that way.

OTOH, with my maintainer's hat on, I do agree that it often makes things
easier.  Weighing these merits and demerits, I find it acceptable, too.

> No one is going to bisect the function of a feature until it
> is present.  So long as there aren't any build-time dependencies

Re: Abysmal scheduler performance in Linus' tree?

2017-09-06 Thread Peter Zijlstra
On Tue, Sep 05, 2017 at 10:13:39PM -0700, Andy Lutomirski wrote:
> I'm running e7d0c41ecc2e372a81741a30894f556afec24315 from Linus' tree
> today, and I'm seeing abysmal scheduler performance.  Running make -j4
> ends up with all the tasks on CPU 3 most of the time (on my
> 4-logical-thread laptop).  taskset -c 0 whatever puts whatever on CPU
> 0, but plain while true; do true; done puts the infinite loop on CPU 3
> right along with the make -j4 tasks.
> 
> This is on Fedora 26, and I don't think I'm doing anything weird.
> systemd has enabled the cpu controller, but it doesn't seem to have
> configured anything or created any non-root cgroups.
> 
> Just a heads up.  I haven't tried to diagnose it at all.

"make O=defconfig-build -j80" results in:

%Cpu0  : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 88.7 us, 11.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 86.8 us, 13.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 89.7 us, 10.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 96.3 us,  3.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 95.3 us,  4.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 94.4 us,  5.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 91.7 us,  8.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 94.3 us,  5.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 : 96.2 us,  3.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 : 91.5 us,  8.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 : 90.6 us,  9.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 : 97.2 us,  2.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 : 89.7 us, 10.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 : 90.6 us,  9.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 : 93.4 us,  6.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 : 90.6 us,  9.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 : 94.4 us,  5.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 : 91.6 us,  8.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu25 : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 : 93.4 us,  5.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.9 si,  0.0 st
%Cpu27 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu29 : 88.8 us, 11.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu30 : 90.6 us,  9.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu31 : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu32 : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu33 : 93.4 us,  6.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu34 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu35 : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu36 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu37 : 97.2 us,  2.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu38 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu39 : 92.6 us,  7.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

Do you have a .config somewhere? Are you running with systemd? Is it
creating cpu cgroups?

Any specifics on your setup?


[PATCH 1/2] kokr/doc: Update memory-barriers.txt for read-to-write dependencies

2017-09-06 Thread SeongJae Park
This commit applies upstream change, commit 66ce3a4dcb9f ("doc: Update
memory-barriers.txt for read-to-write dependencies") to Korean
translation.

Signed-off-by: SeongJae Park 
---
 .../translations/ko_KR/memory-barriers.txt | 38 +-
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/Documentation/translations/ko_KR/memory-barriers.txt 
b/Documentation/translations/ko_KR/memory-barriers.txt
index bc80fc0e210f..bc0be1d3053f 100644
--- a/Documentation/translations/ko_KR/memory-barriers.txt
+++ b/Documentation/translations/ko_KR/memory-barriers.txt
@@ -617,7 +617,22 @@ RELEASE 부류의 것들도 존재합니다.  로드와 스토어를 모두 수
 이 변경은 앞의 처음 두가지 결과 중 하나만이 발생할 수 있고, 세번째의 결과는
 발생할 수 없도록 합니다.
 
-데이터 의존성 배리어는 의존적 쓰기에 대해서도 순서를 잡아줍니다:
+
+[!] 이 상당히 반직관적인 상황은 분리된 캐시를 가지는 기계들에서 가장 잘
+발생하는데, 예를 들면 한 캐시 뱅크는 짝수 번호의 캐시 라인들을 처리하고, 다른
+뱅크는 홀수 번호의 캐시 라인들을 처리하는 경우임을 알아두시기 바랍니다.  포인터
+P 는 짝수 번호 캐시 라인에 저장되어 있고, 변수 B 는 홀수 번호 캐시 라인에
+저장되어 있을 수 있습니다.  여기서 값을 읽어오는 CPU 의 캐시의 홀수 번호 처리
+뱅크는 열심히 일감을 처리중인 반면 홀수 번호 처리 뱅크는 할 일 없이 한가한
+중이라면 포인터 P (&B) 의 새로운 값과 변수 B 의 기존 값 (2) 를 볼 수 있습니다.
+
+
+의존적 쓰기들의 순서를 맞추는데에는 데이터 의존성 배리어가 필요치 않은데, 이는
+리눅스 커널이 지원하는 CPU 들은 (1) 쓰기가 정말로 일어날지, (2) 쓰기가 어디에
+이루어질지, 그리고 (3) 쓰여질 값을 확실히 알기 전까지는 쓰기를 수행하지 않기
+때문입니다.  하지만 "컨트롤 의존성" 섹션과
+Documentation/RCU/rcu_dereference.txt 파일을 주의 깊게 읽어 주시기 바랍니다:
+컴파일러는 매우 창의적인 많은 방법으로 종속성을 깰 수 있습니다.
 
CPU 1 CPU 2
===   ===
@@ -626,28 +641,19 @@ RELEASE 부류의 것들도 존재합니다.  로드와 스토어를 모두 수
<쓰기 배리어>
WRITE_ONCE(P, &B);
  Q = READ_ONCE(P);
- <데이터 의존성 배리어>
- *Q = 5;
+ WRITE_ONCE(*Q, 5);
 
-이 데이터 의존성 배리어는 Q 로의 읽기가 *Q 로의 스토어와 순서를 맞추게
-해줍니다.  이는 다음과 같은 결과를 막습니다:
+따라서, Q 로의 읽기와 *Q 로의 쓰기 사이에는 데이터 종속성 배리어가 필요치
+않습니다.  달리 말하면, 데이터 종속성 배리어가 없더라도 다음 결과는 생기지
+않습니다:
 
(Q == &B) && (B == 4)
 
 이런 패턴은 드물게 사용되어야 함을 알아 두시기 바랍니다.  무엇보다도, 의존성
 순서 규칙의 의도는 쓰기 작업을 -예방- 해서 그로 인해 발생하는 비싼 캐시 미스도
 없애려는 것입니다.  이 패턴은 드물게 발생하는 에러 조건 같은것들을 기록하는데
-사용될 수 있고, 이렇게 배리어를 사용해 순서를 지키게 함으로써 그런 기록이
-사라지는 것을 막습니다.
-
-
-[!] 상당히 비직관적인 이 상황은 분리된 캐시를 가진 기계, 예를 들어 한 캐시
-뱅크가 짝수번 캐시 라인을 처리하고 다른 뱅크는 홀수번 캐시 라인을 처리하는 기계
-등에서 가장 잘 발생합니다.  포인터 P 는 홀수 번호의 캐시 라인에 있고, 변수 B 는
-짝수 번호 캐시 라인에 있다고 생각해 봅시다.  그런 상태에서 읽기 작업을 하는 CPU
-의 짝수번 뱅크는 할 일이 쌓여 매우 바쁘지만 홀수번 뱅크는 할 일이 없어 아무
-일도 하지 않고  있었다면, 포인터 P 는 새 값 (&B) 을, 그리고 변수 B 는 옛날 값
-(2) 을 가지고 있는 상태가 보여질 수도 있습니다.
+사용될 수 있으며, CPU의 자연적인 순서 보장이 그런 기록들을 사라지지 않게
+해줍니다.
 
 
 데이터 의존성 배리어는 매우 중요한데, 예를 들어 RCU 시스템에서 그렇습니다.
-- 
2.13.0



[PATCH 2/2] kokr/memory-barriers.txt: Apply atomic_t.txt change

2017-09-06 Thread SeongJae Park
This commit applies memory-barriers.txt part of upstream change, commit
706eeb3e9c6f ("Documentation/locking/atomic: Add documents for new
atomic_t APIs") to Korean translation.

Signed-off-by: SeongJae Park 
---
 .../translations/ko_KR/memory-barriers.txt | 94 ++
 1 file changed, 7 insertions(+), 87 deletions(-)

diff --git a/Documentation/translations/ko_KR/memory-barriers.txt 
b/Documentation/translations/ko_KR/memory-barriers.txt
index bc0be1d3053f..a7a813258013 100644
--- a/Documentation/translations/ko_KR/memory-barriers.txt
+++ b/Documentation/translations/ko_KR/memory-barriers.txt
@@ -523,11 +523,11 @@ CPU 에게 기대할 수 있는 최소한의 보장사항 몇가지가 있습니
  즉, ACQUIRE 는 최소한의 "취득" 동작처럼, 그리고 RELEASE 는 최소한의 "공개"
  처럼 동작한다는 의미입니다.
 
-core-api/atomic_ops.rst 에서 설명되는 어토믹 오퍼레이션들 중에는 완전히
-순서잡힌 것들과 (배리어를 사용하지 않는) 완화된 순서의 것들 외에 ACQUIRE 와
-RELEASE 부류의 것들도 존재합니다.  로드와 스토어를 모두 수행하는 조합된 어토믹
-오퍼레이션에서, ACQUIRE 는 해당 오퍼레이션의 로드 부분에만 적용되고 RELEASE 는
-해당 오퍼레이션의 스토어 부분에만 적용됩니다.
+atomic_t.txt 에 설명된 어토믹 오퍼레이션들 중 일부는 완전히 순서잡힌 것들과
+(배리어를 사용하지 않는) 완화된 순서의 것들 외에 ACQUIRE 와 RELEASE 부류의
+것들도 존재합니다.  로드와 스토어를 모두 수행하는 조합된 어토믹 오퍼레이션에서,
+ACQUIRE 는 해당 오퍼레이션의 로드 부분에만 적용되고 RELEASE 는 해당
+오퍼레이션의 스토어 부분에만 적용됩니다.
 
 메모리 배리어들은 두 CPU 간, 또는 CPU 와 디바이스 간에 상호작용의 가능성이 있을
 때에만 필요합니다.  만약 어떤 코드에 그런 상호작용이 없을 것이 보장된다면, 해당
@@ -1854,8 +1854,7 @@ Mandatory 배리어들은 SMP 시스템에서도 UP 시스템에서도 SMP 효
  이 코드는 객체의 업데이트된 death 마크가 레퍼런스 카운터 감소 동작
  *전에* 보일 것을 보장합니다.
 
- 더 많은 정보를 위해선 Documentation/core-api/atomic_ops.rst 문서를 참고하세요.
- 어디서 이것들을 사용해야 할지 궁금하다면 "어토믹 오퍼레이션" 서브섹션을
+ 더 많은 정보를 위해선 Documentation/atomic_{t,bitops}.txt 문서를
  참고하세요.
 
 
@@ -2474,86 +2473,7 @@ _않습니다_.
 전체 메모리 배리어를 내포하고 또 일부는 내포하지 않지만, 커널에서 상당히
 의존적으로 사용하는 기능 중 하나입니다.
 
-메모리의 어떤 상태를 수정하고 해당 상태에 대한 (예전의 또는 최신의) 정보를
-리턴하는 어토믹 오퍼레이션은 모두 SMP-조건적 범용 메모리 배리어(smp_mb())를
-실제 오퍼레이션의 앞과 뒤에 내포합니다.  이런 오퍼레이션은 다음의 것들을
-포함합니다:
-
-   xchg();
-   atomic_xchg();  atomic_long_xchg();
-   atomic_inc_return();atomic_long_inc_return();
-   atomic_dec_return();atomic_long_dec_return();
-   atomic_add_return();atomic_long_add_return();
-   atomic_sub_return();atomic_long_sub_return();
-   atomic_inc_and_test();  atomic_long_inc_and_test();
-   atomic_dec_and_test();  atomic_long_dec_and_test();
-   atomic_sub_and_test();  atomic_long_sub_and_test();
-   atomic_add_negative();  atomic_long_add_negative();
-   test_and_set_bit();
-   test_and_clear_bit();
-   test_and_change_bit();
-
-   /* exchange 조건이 성공할 때 */
-   cmpxchg();
-   atomic_cmpxchg();   atomic_long_cmpxchg();
-   atomic_add_unless();atomic_long_add_unless();
-
-이것들은 메모리 배리어 효과가 필요한 ACQUIRE 부류와 RELEASE 부류 오퍼레이션들을
-구현할 때, 그리고 객체 해제를 위해 레퍼런스 카운터를 조정할 때, 암묵적 메모리
-배리어 효과가 필요한 곳 등에 사용됩니다.
-
-
-다음의 오퍼레이션들은 메모리 배리어를 내포하지 _않기_ 때문에 문제가 될 수
-있지만, RELEASE 부류의 오퍼레이션들과 같은 것들을 구현할 때 사용될 수도
-있습니다:
-
-   atomic_set();
-   set_bit();
-   clear_bit();
-   change_bit();
-
-이것들을 사용할 때에는 필요하다면 적절한 (예를 들면 smp_mb__before_atomic()
-같은) 메모리 배리어가 명시적으로 함께 사용되어야 합니다.
-
-
-아래의 것들도 메모리 배리어를 내포하지 _않기_ 때문에, 일부 환경에서는 (예를
-들면 smp_mb__before_atomic() 과 같은) 명시적인 메모리 배리어 사용이 필요합니다.
-
-   atomic_add();
-   atomic_sub();
-   atomic_inc();
-   atomic_dec();
-
-이것들이 통계 생성을 위해 사용된다면, 그리고 통계 데이터 사이에 관계가 존재하지
-않는다면 메모리 배리어는 필요치 않을 겁니다.
-
-객체의 수명을 관리하기 위해 레퍼런스 카운팅 목적으로 사용된다면, 레퍼런스
-카운터는 락으로 보호되는 섹션에서만 조정되거나 호출하는 쪽이 이미 충분한
-레퍼런스를 잡고 있을 것이기 때문에 메모리 배리어는 아마 필요 없을 겁니다.
-
-만약 어떤 락을 구성하기 위해 사용된다면, 락 관련 동작은 일반적으로 작업을 특정
-순서대로 진행해야 하므로 메모리 배리어가 필요할 수 있습니다.
-
-기본적으로, 각 사용처에서는 메모리 배리어가 필요한지 아닌지 충분히 고려해야
-합니다.
-
-아래의 오퍼레이션들은 특별한 락 관련 동작들입니다:
-
-   test_and_set_bit_lock();
-   clear_bit_unlock();
-   __clear_bit_unlock();
-
-이것들은 ACQUIRE 류와 RELEASE 류의 오퍼레이션들을 구현합니다.  락 관련 도구를
-구현할 때에는 이것들을 좀 더 선호하는 편이 나은데, 이것들의 구현은 많은
-아키텍쳐에서 최적화 될 수 있기 때문입니다.
-
-[!] 이런 상황에 사용할 수 있는 특수한 메모리 배리어 도구들이 있습니다만, 일부
-CPU 에서는 사용되는 어토믹 인스트럭션 자체에 메모리 배리어가 내포되어 있어서
-어토믹 오퍼레이션과 메모리 배리어를 함께 사용하는 게 불필요한 일이 될 수
-있는데, 그런 경우에 이 특수 메모리 배리어 도구들은 no-op 이 되어 실질적으로
-아무일도 하지 않습니다.
-
-더 많은 내용을 위해선 Documentation/core-api/atomic_ops.rst 를 참고하세요.
+더 많은 내용을 위해선 Documentation/atomic_t.txt 를 참고하세요.
 
 
 디바이스 액세스
-- 
2.13.0



[RFC tip/locking v2 00/13] lockdep: Support deadlock detection for recursive read locks

2017-09-06 Thread Boqun Feng
Hi Ingo and Peter,

This is V2 of recursive read lock support in lockdep.  I fixed several
bugs in V1 and also added irq inversion detection support for recursive
read locks.

V1: https://marc.info/?l=linux-kernel&m=150393341825453


As Peter pointed out:

https://marc.info/?l=linux-kernel&m=150349072023540

Lockdep currently has only limited support for recursive read locks;
the following deadlock case could not be detected:

	TASK 1			TASK 2
	======			======
	read_lock(A);
				lock(B);
	lock(B);
				write_lock(A);

I got some inspiration from Gautham R Shenoy:

https://lwn.net/Articles/332801/

, and came up with this series.

The basic idea is:

*   Add recursive read locks into the graph

*   Classify dependencies into --(RR)-->, --(NR)-->, --(RN)-->,
--(NN)-->, where R stands for recursive read lock, N stands for
other locks (i.e. non-recursive read locks and write locks).

*   Define strong dependency paths as the paths of dependencies that
don't have two adjacent dependencies of the form --(*R)--> and
--(R*)-->.

*   Extend __bfs() to only traverse on strong dependency paths.

*   If __bfs() finds a strong dependency circle, then a deadlock is
reported.

The whole series is based on current master branch of Linus' tree:

e7d0c41ecc2e ("Merge tag 'devprop-4.14-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")

, and I also put it at:

git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git arr-rfc-v2

The whole series consists of 13 patches:

1.  Do a clean up on the return value of __bfs() and its friends.

2.  Make __bfs() able to visit every dependency until a match is
found. The old version of __bfs() could only visit each lock
class once, and this is insufficient if we are going to add
recursive read locks into the dependency graph.

3.  Make lock state LOCK_*_READ stand for recursive read lock only
and LOCK_* stand for write lock and non-recursive read lock.

4-5 Extend __bfs() to be able to traverse the strong dependency
paths after recursive read locks are added into the graph.

6-8 Adjust check_redundant(), check_noncircular() and
check_irq_usage() to take recursive read locks into consideration.

9.  Finally add recursive read locks into the dependency graph.

10-11   Adjust lock cache chain key generation to take recursive read
locks into consideration, and provide a test case.

12-13   Add more test cases.

I passed all the lockdep selftest cases (including those I introduced),
and am now running it on one of my boxes; haven't shot my feet yet.

Test and comments are welcome!

Regards,
Boqun


[RFC tip/locking v2 01/13] lockdep: Demagic the return value of BFS

2017-09-06 Thread Boqun Feng
__bfs() could return four magic numbers:

1: search succeeds, but none match.
0: search succeeds, find one match.
-1: search fails because the cq is full.
-2: search fails because an invalid node is found.

This patch cleans things up by making an enum type for the return value
of __bfs() and its friends; this improves the readability of the code
and, further, could help if we want to extend the BFS.

Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 134 ---
 1 file changed, 79 insertions(+), 55 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 44c8d0d17170..df0b7e620659 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -996,21 +996,52 @@ static inline int get_lock_depth(struct lock_list *child)
}
return depth;
 }
+/*
+ * Return values of a bfs search:
+ *
+ * BFS_E* indicates an error
+ * BFS_R* indicates a result(match or not)
+ *
+ * BFS_EINVALIDNODE: Find a invalid node in the graph.
+ *
+ * BFS_EQUEUEFULL: The queue is full while doing the bfs.
+ *
+ * BFS_RMATCH: Find the matched node in the graph, and put that node into
+ *  *@target_entry.
+ *
+ * BFS_RNOMATCH: Haven't found the matched node and keep *@target_entry
+ *  _unchanged_.
+ */
+enum bfs_result {
+   BFS_EINVALIDNODE = -2,
+   BFS_EQUEUEFULL = -1,
+   BFS_RMATCH = 0,
+   BFS_RNOMATCH = 1,
+};
+
+/*
+ * bfs_result < 0 means error
+ */
+
+static inline bool bfs_error(enum bfs_result res)
+{
+   return res < 0;
+}
 
-static int __bfs(struct lock_list *source_entry,
-void *data,
-int (*match)(struct lock_list *entry, void *data),
-struct lock_list **target_entry,
-int forward)
+static enum bfs_result __bfs(struct lock_list *source_entry,
+void *data,
+int (*match)(struct lock_list *entry, void *data),
+struct lock_list **target_entry,
+int forward)
 {
struct lock_list *entry;
struct list_head *head;
struct circular_queue *cq = &lock_cq;
-   int ret = 1;
+   enum bfs_result ret = BFS_RNOMATCH;
 
if (match(source_entry, data)) {
*target_entry = source_entry;
-   ret = 0;
+   ret = BFS_RMATCH;
goto exit;
}
 
@@ -1031,7 +1062,7 @@ static int __bfs(struct lock_list *source_entry,
__cq_dequeue(cq, (unsigned long *)&lock);
 
if (!lock->class) {
-   ret = -2;
+   ret = BFS_EINVALIDNODE;
goto exit;
}
 
@@ -1048,12 +1079,12 @@ static int __bfs(struct lock_list *source_entry,
mark_lock_accessed(entry, lock);
if (match(entry, data)) {
*target_entry = entry;
-   ret = 0;
+   ret = BFS_RMATCH;
goto exit;
}
 
if (__cq_enqueue(cq, (unsigned long)entry)) {
-   ret = -1;
+   ret = BFS_EQUEUEFULL;
goto exit;
}
cq_depth = __cq_get_elem_count(cq);
@@ -1066,19 +1097,21 @@ static int __bfs(struct lock_list *source_entry,
return ret;
 }
 
-static inline int __bfs_forwards(struct lock_list *src_entry,
-   void *data,
-   int (*match)(struct lock_list *entry, void *data),
-   struct lock_list **target_entry)
+static inline enum bfs_result
+__bfs_forwards(struct lock_list *src_entry,
+  void *data,
+  int (*match)(struct lock_list *entry, void *data),
+  struct lock_list **target_entry)
 {
return __bfs(src_entry, data, match, target_entry, 1);
 
 }
 
-static inline int __bfs_backwards(struct lock_list *src_entry,
-   void *data,
-   int (*match)(struct lock_list *entry, void *data),
-   struct lock_list **target_entry)
+static inline enum bfs_result
+__bfs_backwards(struct lock_list *src_entry,
+   void *data,
+   int (*match)(struct lock_list *entry, void *data),
+   struct lock_list **target_entry)
 {
return __bfs(src_entry, data, match, target_entry, 0);
 
@@ -1335,13 +1368,13 @@ unsigned long lockdep_count_backward_deps(struct 
lock_class *class)
 
 /*
  * Prove that the dependency graph starting at <entry> can not
- * lead to <target>. Print an error and return 0 if it does.
+ * lead to <target>. Print an error and return BFS_RMATCH if it does.

Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > should be preserved by default and specific workloads should opt in for
> > the cgroup aware logic. Changing the global behavior depending on
> > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > wrong approach to take. I think we should instead go with 
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> > 
> > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > biggest_task which is the default. You are adding cgroup and the more I
> > think about the more I agree that it doesn't really make sense to try to
> > fit thew new semantic into the existing one (compare tasks to kill-all
> > memcgs). Just introduce a new strategy and define a new semantic from
> > scratch. Memcg priority and kill-all are a natural extension of this new
> > strategy. This will make the life easier and easier to understand by
> > users.
> 
> oom_kill_allocating_task is actually a really good example of why
> cgroup-awareness *should* be the new default.
> 
> Before we had the oom killer victim selection, we simply killed the
> faulting/allocating task. While a valid answer to the problem, it's
> not very fair or representative of what the user wants or intends.
> 
> Then we added code to kill the biggest offender instead, which should
> have been the case from the start and was hence made the new default.
> The oom_kill_allocating_task was added on the off-chance that there
> might be setups who, for historical reasons, rely on the old behavior.
> But our default was chosen based on what behavior is fair, expected,
> and most reflective of the user's intentions.

I am not sure this is how things evolved actually.  This is way before
my time so my git log interpretation might be imprecise.  We have had
the oom_badness heuristic since out_of_memory was introduced, and
oom_kill_allocating_task was introduced much later because of large
boxes with zillions of tasks (SGI, I suspect) which took too long to
select a victim, so David added this heuristic.
 
> The cgroup-awareness in the OOM killer is exactly the same thing. It
> should have been the default from the beginning, because the user
> configures a group of tasks to be an interdependent, terminal unit of
> memory consumption, and it's undesirable for the OOM killer to ignore
> this intention and compare members across these boundaries.

I would agree if that was true in general. I can completely see how the
cgroup awareness is useful in e.g. containerized environments (especially
with kill-all enabled) but memcgs are used in a large variety of
usecases and I cannot really say all of them really demand the new
semantic. Say I have a workload which doesn't want to see reclaim
interference from others on the same machine. Why should I kill a
process from that particular memcg just because it is the largest one
when there is a memory hog/leak outside of this memcg?

From my point of view the safest (in the sense of least surprise) way
is to go with opt-in for the new heuristic.  I am pretty sure all who
would benefit from the new behavior will enable it while others will not
regress in unexpected ways.

We can talk about the way _how_ to control these oom strategies, of
course. But I would be really reluctant to change the default which is
used for years and people got used to it.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] pidmap(2)

2017-09-06 Thread Thomas Gleixner
On Tue, 5 Sep 2017, Randy Dunlap wrote:
> On 09/05/17 15:53, Andrew Morton wrote:
> > On Tue, 5 Sep 2017 22:05:00 +0300 Alexey Dobriyan  
> > wrote:
> > 
> >> Implement system call for bulk retrieving of pids in binary form.
> >>
> >> Using /proc is slower than necessary: 3 syscalls + another 3 for each 
> >> thread +
> >> converting with atoi().
> >>
> >> /proc may be not mounted especially in containers. Natural extension of
> >> hidepid=2 efforts is to not mount /proc at all.
> >>
> >> It could be used by programs like ps, top or CRIU. Speed increase will
> >> become more drastic once combined with bulk retrieval of process 
> >> statistics.
> > 
> > The patches are performance optimizations, but their changelogs contain
> > no performance measurements!
> > 
> > Demonstration of some compelling real-world performance benefits would
> > help things along a lot.
> > 
> 
> also, I expect that the tiny kernel people will want kconfig options for
> these syscalls.

And of course that stuff wants the corresponding man pages written up.

Thanks,

tglx


[RFC tip/locking v2 02/13] lockdep: Make __bfs() visit every dependency rather than every class until a match

2017-09-06 Thread Boqun Feng
Currently, __bfs() will do a breadth-first search in the dependency
graph and visit each lock class in the graph exactly once, so for
example, in the following graph:

	A -> B
	|    ^
	|    |
	+--> C

a __bfs() call starting at A will visit B through dependency A -> B and
visit C through dependency A -> C, and that's it.

This is OK for now, as we only have strong dependencies in the
dependency graph, so whenever there is a traverse path from A to B in
__bfs(), it means A has a strong dependency to B (IOW, B depends on A),
so there is no need to visit all dependencies in the graph.

However, as we are going to add recursive-read locks into the dependency
graph, not all paths will mean strong dependencies afterwards; in the
same example above, dependency A -> B may be a weak dependency while
traversing A -> C -> B may be a strong dependency path.  So now we need
to visit each dependency in the graph in __bfs() until a match is found.

The solution is simple:

We used to mark lock_class::lockdep_dependency_gen_id to indicate that a
class has been visited in __bfs(); now we change the semantics a little
bit: we now mark lock_class::lockdep_dependency_gen_id to indicate that
all the dependencies in its locks_{after,before} have been visited in
the __bfs() (note we only take one direction in a __bfs() search).  In
this way, each dependency is guaranteed to be visited until we find a
match.

Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 61 +++-
 1 file changed, 34 insertions(+), 27 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index df0b7e620659..5dbedcc571fd 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -960,24 +960,20 @@ static inline unsigned int  __cq_get_elem_count(struct 
circular_queue *cq)
return (cq->rear - cq->front) & CQ_MASK;
 }
 
-static inline void mark_lock_accessed(struct lock_list *lock,
-   struct lock_list *parent)
+static inline void mark_lock_list_accessed(struct lock_class *class)
 {
-   unsigned long nr;
+   class->dep_gen_id = lockdep_dependency_gen_id;
+}
 
-   nr = lock - list_entries;
-   WARN_ON(nr >= nr_list_entries); /* Out-of-bounds, input fail */
+static inline void visit_lock_entry(struct lock_list *lock,
+   struct lock_list *parent)
+{
lock->parent = parent;
-   lock->class->dep_gen_id = lockdep_dependency_gen_id;
 }
 
-static inline unsigned long lock_accessed(struct lock_list *lock)
+static inline unsigned long lock_list_accessed(struct lock_class *class)
 {
-   unsigned long nr;
-
-   nr = lock - list_entries;
-   WARN_ON(nr >= nr_list_entries); /* Out-of-bounds, input fail */
-   return lock->class->dep_gen_id == lockdep_dependency_gen_id;
+   return class->dep_gen_id == lockdep_dependency_gen_id;
 }
 
 static inline struct lock_list *get_lock_parent(struct lock_list *child)
@@ -1066,6 +1062,18 @@ static enum bfs_result __bfs(struct lock_list 
*source_entry,
goto exit;
}
 
+   /*
+* If we have visited all the dependencies from this @lock to
+* others(iow, if we have visited all lock_list entries in
+* @lock->class->locks_{after,before}, we skip, otherwise go
+* and visit all the dependencies in the list and mark this
+* list accessed.
+*/
+   if (lock_list_accessed(lock->class))
+   continue;
+   else
+   mark_lock_list_accessed(lock->class);
+
if (forward)
head = &lock->class->locks_after;
else
@@ -1074,23 +1082,22 @@ static enum bfs_result __bfs(struct lock_list 
*source_entry,
DEBUG_LOCKS_WARN_ON(!irqs_disabled());
 
list_for_each_entry_rcu(entry, head, entry) {
-   if (!lock_accessed(entry)) {
-   unsigned int cq_depth;
-   mark_lock_accessed(entry, lock);
-   if (match(entry, data)) {
-   *target_entry = entry;
-   ret = BFS_RMATCH;
-   goto exit;
-   }
+   unsigned int cq_depth;
 
-   if (__cq_enqueue(cq, (unsigned long)entry)) {
-   ret = BFS_EQUEUEFULL;
-   goto exit;
-   }
-   cq_depth = __cq_get_elem_count(cq);
-   if (max_bfs_queue_depth < cq_depth)
-   max_bfs_queue_depth = cq_depth;
+   visit_lock_entry(entry, lock);
+

[RFC tip/locking v2 07/13] lockdep: Support deadlock detection for recursive read in check_noncircular()

2017-09-06 Thread Boqun Feng
Currently, lockdep has only limited support for deadlock detection with
recursive read locks.

The basic idea of the detection is:

Since we made __bfs() able to traverse only the strong dependency paths,
we report a circular deadlock if we find a circle made of a strong
dependency path.

Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index d9959f25247a..8a09b1a02342 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1345,6 +1345,14 @@ static inline int hlock_equal(struct lock_list *entry, 
void *data)
   (hlock->read == 2 || !entry->is_rr);
 }
 
+static inline int hlock_conflict(struct lock_list *entry, void *data)
+{
+   struct held_lock *hlock = (struct held_lock *)data;
+
+   return hlock_class(hlock) == entry->class &&
+  (hlock->read != 2 || !entry->is_rr);
+}
+
 static noinline int print_circular_bug(struct lock_list *this,
struct lock_list *target,
struct held_lock *check_src,
@@ -1459,18 +1467,18 @@ unsigned long lockdep_count_backward_deps(struct 
lock_class *class)
 }
 
 /*
- * Prove that the dependency graph starting at  can not
+ * Prove that the dependency graph starting at  can not
  * lead to . Print an error and return BFS_RMATCH if it does.
  */
 static noinline enum bfs_result
-check_noncircular(struct lock_list *root, struct lock_class *target,
+check_noncircular(struct lock_list *root, struct held_lock *target,
  struct lock_list **target_entry)
 {
enum bfs_result result;
 
debug_atomic_inc(nr_cyclic_checks);
 
-   result = __bfs_forwards(root, target, class_equal, target_entry);
+   result = __bfs_forwards(root, target, hlock_conflict, target_entry);
 
return result;
 }
-- 
2.14.1



[RFC tip/locking v2 04/13] lockdep: Introduce lock_list::dep

2017-09-06 Thread Boqun Feng
To add recursive read locks into the dependency graph, we need to store
the types of dependencies for the BFS later. There are four kinds of
dependencies:

*   Non-recursive -> Non-recursive dependencies (NN)
e.g. write_lock(prev) -> write_lock(next)

*   Recursive -> Non-recursive dependencies (RN)
e.g. read_lock(prev) -> write_lock(next);

*   Non-recursive -> Recursive dependencies (NR)
e.g. write_lock(prev) -> read_lock(next);

*   Recursive -> Recursive dependencies (RR)
e.g. read_lock(prev) -> read_lock(next);

Given a pair of locks, all four kinds of dependencies could exist
between them, so we use 4 bits for the presence of each kind (stored in
lock_list::dep).

Signed-off-by: Boqun Feng 
---
 include/linux/lockdep.h  |  2 ++
 kernel/locking/lockdep.c | 46 --
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index bfa8e0b0d6f1..35b3fc0d6793 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -192,6 +192,8 @@ struct lock_list {
struct lock_class   *class;
struct stack_trace  trace;
int distance;
+   /* bitmap of different dependencies from head to this */
+   u16 dep;
 
/*
 * The parent field is used to implement breadth-first search, and the
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 86ef7ea9f79f..1220dab7a506 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -884,6 +884,7 @@ static int add_lock_to_list(struct lock_class *this, struct list_head *head,
return 0;
 
entry->class = this;
+   entry->dep = dep;
entry->distance = distance;
entry->trace = *trace;
/*
@@ -1024,6 +1025,33 @@ static inline bool bfs_error(enum bfs_result res)
return res < 0;
 }
 
+#define DEP_NN_BIT 0
+#define DEP_RN_BIT 1
+#define DEP_NR_BIT 2
+#define DEP_RR_BIT 3
+
+#define DEP_NN_MASK (1U << (DEP_NN_BIT))
+#define DEP_RN_MASK (1U << (DEP_RN_BIT))
+#define DEP_NR_MASK (1U << (DEP_NR_BIT))
+#define DEP_RR_MASK (1U << (DEP_RR_BIT))
+
+static inline unsigned int __calc_dep_bit(int prev, int next)
+{
+   if (prev == 2 && next != 2)
+   return DEP_RN_BIT;
+   if (prev != 2 && next == 2)
+   return DEP_NR_BIT;
+   if (prev == 2 && next == 2)
+   return DEP_RR_BIT;
+   else
+   return DEP_NN_BIT;
+}
+
+static inline unsigned int calc_dep(int prev, int next)
+{
+   return 1U << __calc_dep_bit(prev, next);
+}
+
 static enum bfs_result __bfs(struct lock_list *source_entry,
 void *data,
 int (*match)(struct lock_list *entry, void *data),
@@ -1951,6 +1979,16 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
if (entry->class == hlock_class(next)) {
if (distance == 1)
entry->distance = 1;
+   entry->dep |= calc_dep(prev->read, next->read);
+   }
+   }
+
+   /* Also, update the reverse dependency in @next's ->locks_before list */
+   list_for_each_entry(entry, &hlock_class(next)->locks_before, entry) {
+   if (entry->class == hlock_class(prev)) {
+   if (distance == 1)
+   entry->distance = 1;
+   entry->dep |= calc_dep(next->read, prev->read);
return 1;
}
}
@@ -1978,14 +2016,18 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
 */
ret = add_lock_to_list(hlock_class(next),
   &hlock_class(prev)->locks_after,
-  next->acquire_ip, distance, trace);
+  next->acquire_ip, distance,
+  calc_dep(prev->read, next->read),
+  trace);
 
if (!ret)
return 0;
 
ret = add_lock_to_list(hlock_class(prev),
   &hlock_class(next)->locks_before,
-  next->acquire_ip, distance, trace);
+  next->acquire_ip, distance,
+  calc_dep(next->read, prev->read),
+  trace);
if (!ret)
return 0;
 
-- 
2.14.1



[RFC tip/locking v2 06/13] lockdep: Adjust check_redundant() for recursive read change

2017-09-06 Thread Boqun Feng
As we have four kinds of dependencies now, check_redundant() should only
report redundant if we have a dependency path which is equal to or
_stronger_ than the current dependency. For example, if in
check_prev_add() we have:

prev->read == 2 && next->read != 2

, we should only report redundant if we find a path like:

prev--(RN)-->--(*N)-->next

and if we have:

prev->read == 2 && next->read == 2

, we could report redundant if we find a path like:

prev--(RN)-->--(*N)-->next

or

prev--(RN)-->--(*R)-->next

To do so, we need to pass the recursive-read status of next into
check_redundant(). This patch changes the parameter of check_redundant()
and the match function to achieve this.

Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3c43af353ea0..d9959f25247a 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1337,9 +1337,12 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
return 0;
 }
 
-static inline int class_equal(struct lock_list *entry, void *data)
+static inline int hlock_equal(struct lock_list *entry, void *data)
 {
-   return entry->class == data;
+   struct held_lock *hlock = (struct held_lock *)data;
+
+   return hlock_class(hlock) == entry->class &&
+  (hlock->read == 2 || !entry->is_rr);
 }
 
 static noinline int print_circular_bug(struct lock_list *this,
@@ -1473,14 +1476,14 @@ check_noncircular(struct lock_list *root, struct lock_class *target,
 }
 
 static noinline enum bfs_result
-check_redundant(struct lock_list *root, struct lock_class *target,
+check_redundant(struct lock_list *root, struct held_lock *target,
struct lock_list **target_entry)
 {
enum bfs_result result;
 
debug_atomic_inc(nr_redundant_checks);
 
-   result = __bfs_forwards(root, target, class_equal, target_entry);
+   result = __bfs_forwards(root, target, hlock_equal, target_entry);
 
return result;
 }
@@ -2047,7 +2050,7 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
 * Is the  ->  link redundant?
 */
bfs_init_root(&this, prev);
-   ret = check_redundant(&this, hlock_class(next), &target_entry);
+   ret = check_redundant(&this, next, &target_entry);
if (ret == BFS_RMATCH) {
debug_atomic_inc(nr_redundant);
return 2;
-- 
2.14.1



[RFC tip/locking v2 03/13] lockdep: Change the meanings of LOCK_{USED_IN, ENABLED}_*_{READ}

2017-09-06 Thread Boqun Feng
We have three types of lock acquisitions: write, non-recursive read and
recursive read. Write and non-recursive read acquisitions are no
different from the viewpoint of deadlock detection, because a write
acquisition of the corresponding lock on an independent CPU or task
makes a non-recursive read lock effectively a write lock. So we can
treat them as the same type in lockdep (for example, we can call both
of them non-recursive locks).

As to irq lock inversion detection (safe->unsafe deadlock detection), we
used to distinguish write locks from read locks (non-recursive and
recursive). That classification can be improved, because a non-recursive
read lock behaves the same as a write lock, so this patch changes the
meanings of LOCK_{USED_IN, ENABLED}_*_{READ}.

old:
LOCK_* : stands for write lock
LOCK_*_READ: stands for read lock(non-recursive and recursive)
new:
LOCK_* : stands for non-recursive(write lock and non-recursive
read lock)
LOCK_*_READ: stands for recursive read lock

(The names themselves should also be changed to avoid confusion, as
should the related printks; however, this patch is kept minimal for RFC
purposes. I can either add the necessary changes in later versions or
leave them as TODOs.)

Such a change is needed for a future improvement on recursive read
related irq inversion deadlock detection.

Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 5dbedcc571fd..86ef7ea9f79f 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3091,7 +3091,7 @@ static int mark_irqflags(struct task_struct *curr, struct held_lock *hlock)
 * mark the lock as used in these contexts:
 */
if (!hlock->trylock) {
-   if (hlock->read) {
+   if (hlock->read == 2) {
if (curr->hardirq_context)
if (!mark_lock(curr, hlock,
LOCK_USED_IN_HARDIRQ_READ))
@@ -3110,7 +3110,7 @@ static int mark_irqflags(struct task_struct *curr, struct held_lock *hlock)
}
}
if (!hlock->hardirqs_off) {
-   if (hlock->read) {
+   if (hlock->read == 2) {
if (!mark_lock(curr, hlock,
LOCK_ENABLED_HARDIRQ_READ))
return 0;
-- 
2.14.1



Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
[...]
> > Hmm. The changelog says "By default, it will look for the biggest leaf
> > cgroup, and kill the largest task inside." But you are accumulating
> > oom_score up the hierarchy and so parents will have higher score than
> > the layer of their children and the larger the sub-hierarchy the more
> > biased it will become. Say you have
> >        root
> >        /  \
> >       A    D
> >      / \
> >     B   C
> > 
> > B (5), C(15) thus A(20) and D(20). Unless I am missing something we are
> > going to go down A path and then chose C even though D is the largest
> > leaf group, right?
> 
> You're right, changelog is not accurate, I'll fix it.
> The behavior is correct, IMO.

Please explain why. This is really non-intuitive semantics. Why should
larger hierarchies be punished more than shallow ones? I would
completely agree if the whole hierarchy were a killable entity (i.e.
A were kill-all).
 
[...]
> > I do not understand why do we have to handle root cgroup specially here.
> > select_victim_memcg already iterates all memcgs in the oom hierarchy
> > (including root) so if the root memcg is the largest one then we
> > should simply consider it no?
> 
> We don't have necessary stats for the root cgroup, so we can't calculate
> it's oom_score.

We used to charge pages to the root memcg as well so we might resurrect
that idea. In any case this is something that could be hidden in
memcg_oom_badness rather than special cased somewhere else.

> > You are skipping root there because of
> > memcg_has_children but I suspect this and the whole accumulate up the
> > hierarchy approach just makes the whole thing more complex than necessary. 
> > With
> > "tasks only in leafs" cgroup policy we should only see any pages on LRUs
> > on the global root memcg and leaf cgroups. The same applies to memcg
> > stats. So why cannot we simply do the tree walk, calculate
> > badness/check the priority and select the largest memcg in one go?
> 
> We have to traverse from top to bottom to make priority-based decision,
> but size-based oom_score is calculated as sum of descending leaf cgroup 
> scores.
> 
> For example:
>        root
>        /  \
>       A    D
>      / \
>     B   C
> A and D have same priorities, B has larger priority than C.
> 
> In this case we need to calculate size-based score for A, which requires
> summing oom_score of the sub-tree (B an C), despite we don't need it
> for choosing between B and C.
> 
> Maybe I don't see it, but I don't know how to implement it more optimal.

I have to think about the priority based oom killing some more to be
honest. Do we really want to allow setting a priority to non-leaf
memcgs? How are you going to manage the whole tree consistency? Say your
above example has prio(A) < prio(D) && prio(C) > prio(D). Your current
implementation would kill D, right? Isn't that counterintuitive
behavior again? If anything we should have prio(A) = max(tree_prio(A)). Again
I could understand comparing priorities only on killable entities.
-- 
Michal Hocko
SUSE Labs


[RFC tip/locking v2 05/13] lockdep: Extend __bfs() to work with multiple kinds of dependencies

2017-09-06 Thread Boqun Feng
Now we have four kinds of dependencies in the dependency graph, and not
all the paths carry strong dependencies. For example:

Given lock A, B, C, if we have:

	CPU1		CPU2
	====		====
	write_lock(A);	read_lock(B);
	read_lock(B);	write_lock(C);

then we have the dependencies A--(NR)-->B and B--(RN)-->C (NR and
RN indicate the dependency kinds). A actually doesn't have a
strong dependency to C (IOW, C doesn't depend on A). To see this,
let's say we have a third CPU3 doing:

	CPU3
	====
write_lock(C);
write_lock(A);

, this is not a deadlock. However, if we change the read_lock()
on CPU2 to a write_lock(), then it's a deadlock.

So A--(NR)-->B--(RN)-->C is not a strong dependency but
A--(NR)-->B--(NN)-->C is a strong dependency.

We can generalize this as: If a path of dependencies doesn't have two
adjacent dependencies as (*R)--L-->(R*), where L is some lock, it is a
strong dependency path, otherwise it's not.

Now our mission is to make __bfs() traverse only the strong dependency
paths, which is simple: we record whether we have --(*R)--> at the
current tail of the path in lock_list::is_rr, and whenever we pick a
dependency during the traversal, we 1) make sure we don't pick a
--(R*)--> dependency if our current tail is --(*R)--> and 2) greedily
pick a --(*N)--> as hard as possible.

With this extension for __bfs(), we now only need to initialize the root
of __bfs() properly (with a correct ->is_rr). To do so, we introduce
some helper functions, which also clean up the __bfs() root
initialization code a little.

Signed-off-by: Boqun Feng 
---
 include/linux/lockdep.h  |  2 ++
 kernel/locking/lockdep.c | 85 +---
 2 files changed, 68 insertions(+), 19 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 35b3fc0d6793..68cbe7e8399a 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -194,6 +194,8 @@ struct lock_list {
int distance;
/* bitmap of different dependencies from head to this */
u16 dep;
+   /* used by BFS to record whether this is picked as a recursive read */
+   u16 is_rr;
 
/*
 * The parent field is used to implement breadth-first search, and the
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 1220dab7a506..3c43af353ea0 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -871,7 +871,7 @@ static struct lock_list *alloc_list_entry(void)
  * Add a new dependency to the head of the list:
  */
 static int add_lock_to_list(struct lock_class *this, struct list_head *head,
-   unsigned long ip, int distance,
+   unsigned long ip, int distance, unsigned int dep,
struct stack_trace *trace)
 {
struct lock_list *entry;
@@ -1052,6 +1052,54 @@ static inline unsigned int calc_dep(int prev, int next)
return 1U << __calc_dep_bit(prev, next);
 }
 
+/*
+ * return -1 if no proper dependency could be picked
+ * return 0 if a * --> N dependency could be picked
+ * return 1 if only a * --> R dependency could be picked
+ *
+ * N: non-recursive lock
+ * R: recursive read lock
+ */
+static inline int pick_dep(u16 is_rr, u16 cap_dep)
+{
+   if (is_rr) { /* could only pick N --> */
+   if (cap_dep & DEP_NN_MASK)
+   return 0;
+   else if (cap_dep & DEP_NR_MASK)
+   return 1;
+   else
+   return -1;
+   } else {
+   if (cap_dep & DEP_NN_MASK || cap_dep & DEP_RN_MASK)
+   return 0;
+   else
+   return 1;
+   }
+}
+
+/*
+ * Initialize a lock_list entry @lock belonging to @class as the root for a BFS
+ * search.
+ */
+static inline void __bfs_init_root(struct lock_list *lock,
+  struct lock_class *class)
+{
+   lock->class = class;
+   lock->parent = NULL;
+   lock->is_rr = 0;
+}
+
+/*
+ * Initialize a lock_list entry @lock based on a lock acquisition @hlock as the
+ * root for a BFS search.
+ */
+static inline void bfs_init_root(struct lock_list *lock,
+struct held_lock *hlock)
+{
+   __bfs_init_root(lock, hlock_class(hlock));
+   lock->is_rr = (hlock->read == 2);
+}
+
 static enum bfs_result __bfs(struct lock_list *source_entry,
 void *data,
 int (*match)(struct lock_list *entry, void *data),
@@ -1062,6 +1110,7 @@ static enum bfs_result __bfs(struct lock_list *source_entry,
struct list_head *head;
struct circular_queue *cq = &lock_cq;
enum 

[RFC tip/locking v2 10/13] lockdep/selftest: Add a R-L/L-W test case specific to chain cache behavior

2017-09-06 Thread Boqun Feng
Since our chain cache doesn't distinguish read/write locks, even though
we can detect a read-lock/lock-write deadlock in check_noncircular(), we
can still be fooled if a read-lock/lock-read case (which is not a
deadlock) comes first.

So introduce this test case to exercise the chain cache behavior
specifically when detecting recursive read lock related deadlocks.

Signed-off-by: Boqun Feng 
---
 lib/locking-selftest.c | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index cd0b5c964bd0..1f794bb441a9 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -394,6 +394,49 @@ static void rwsem_ABBA1(void)
MU(Y1); // should fail
 }
 
+/*
+ * read_lock(A)
+ * spin_lock(B)
+ * spin_lock(B)
+ * write_lock(A)
+ *
+ * This test case is aimed at poking whether the chain cache prevents us from
+ * detecting a read-lock/lock-write deadlock: if the chain cache doesn't differ
+ * read/write locks, the following case may happen
+ *
+ * { read_lock(A)->lock(B) dependency exists }
+ *
+ * P0:
+ * lock(B);
+ * read_lock(A);
+ *
+ * { Not a deadlock, B -> A is added in the chain cache }
+ *
+ * P1:
+ * lock(B);
+ * write_lock(A);
+ *
+ * { B->A found in chain cache, not reported as a deadlock }
+ *
+ */
+static void rlock_chaincache_ABBA1(void)
+{
+   RL(X1);
+   L(Y1);
+   U(Y1);
+   RU(X1);
+
+   L(Y1);
+   RL(X1);
+   RU(X1);
+   U(Y1);
+
+   L(Y1);
+   WL(X1);
+   WU(X1);
+   U(Y1); // should fail
+}
+
 /*
  * read_lock(A)
  * spin_lock(B)
@@ -2052,6 +2095,10 @@ void locking_selftest(void)
pr_cont(" |");
dotest(rwsem_ABBA3, FAILURE, LOCKTYPE_RWSEM);
 
+   print_testname("chain cached mixed R-L/L-W ABBA");
+   pr_cont(" |");
+   dotest(rlock_chaincache_ABBA1, FAILURE, LOCKTYPE_RWLOCK);
+
printk("  
--\n");
 
/*
-- 
2.14.1



[RFC tip/locking v2 08/13] lockdep: Fix recursive read lock related safe->unsafe detection

2017-09-06 Thread Boqun Feng
There are four cases for recursive read lock related deadlocks:

(--(X..Y)--> means a strong dependency path starts with a --(X*)-->
dependency and ends with a --(*Y)-- dependency.)

1.  An irq-safe lock L1 has a dependency --(*..*)--> to an
irq-unsafe lock L2.

2.  An irq-read-safe lock L1 has a dependency --(N..*)--> to an
irq-unsafe lock L2.

3.  An irq-safe lock L1 has a dependency --(*..N)--> to an
irq-read-unsafe lock L2.

4.  An irq-read-safe lock L1 has a dependency --(N..N)--> to an
irq-read-unsafe lock L2.

The current check_usage() only checks 1) and 2), so this patch adds
checks for 3) and 4) and makes sure that when find_usage_{back,for}wards
finds an irq-read-{,un}safe lock, the traversal path ends at a
dependency --(*N)-->. Note that when we search backwards, --(*N)-->
indicates a real dependency --(N*)-->.

Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 8a09b1a02342..6659b91b70f2 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1505,7 +1505,14 @@ check_redundant(struct lock_list *root, struct held_lock *target,
 
 static inline int usage_match(struct lock_list *entry, void *bit)
 {
-   return entry->class->usage_mask & (1 << (enum lock_usage_bit)bit);
+   enum lock_usage_bit ub = (enum lock_usage_bit)bit;
+
+
+   if (ub & 1)
+   return entry->class->usage_mask & (1 << ub) &&
+  !entry->is_rr;
+   else
+   return entry->class->usage_mask & (1 << ub);
 }
 
 
@@ -1816,6 +1823,10 @@ static int check_irq_usage(struct task_struct *curr, struct held_lock *prev,
   exclusive_bit(bit), state_name(bit)))
return 0;
 
+   if (!check_usage(curr, prev, next, bit,
+  exclusive_bit(bit) + 1, state_name(bit)))
+   return 0;
+
bit++; /* _READ */
 
/*
@@ -1828,6 +1839,10 @@ static int check_irq_usage(struct task_struct *curr, struct held_lock *prev,
   exclusive_bit(bit), state_name(bit)))
return 0;
 
+   if (!check_usage(curr, prev, next, bit,
+  exclusive_bit(bit) + 1, state_name(bit)))
+   return 0;
+
return 1;
 }
 
-- 
2.14.1



[RFC tip/locking v2 13/13] lockdep/selftest: Add more recursive read related test cases

2017-09-06 Thread Boqun Feng
Add these four test cases:

1.
	CPU1			CPU2			CPU3
	====			====			====
	write_lock(X);
				write_lock(Y);
							write_lock(Z);
	read_lock(Y);
				read_lock(Z);
							read_lock(X);

	deadlock.

2.
	CPU1			CPU2			CPU3
	====			====			====
	write_lock(X);
				read_lock(Y);
							write_lock(Z);
	write_lock(Y);
				read_lock(Z);
							read_lock(X);

	deadlock.

3.
	CPU1			CPU2			CPU3
	====			====			====
	write_lock(X);
				read_lock(Y);
							read_lock(Z);
	write_lock(Y);
				read_lock(Z);
							write_lock(X);

	not deadlock.

4.
	CPU1			CPU2			CPU3
	====			====			====
	write_lock(X);
				read_lock(Y);
							write_lock(Z);
	read_lock(Y);
				read_lock(Z);
							write_lock(X);

	not deadlock.

These selftest cases are valuable for developing support for recursive
read related deadlock detection.

Signed-off-by: Boqun Feng 
---
 lib/locking-selftest.c | 161 +
 1 file changed, 161 insertions(+)

diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index cbdcec6a776e..0fe16f06ed00 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -1032,6 +1032,133 @@ GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_soft_wlock)
 #undef E2
 #undef E3
 
+/*
+ * write-read / write-read / write-read deadlock even if read is recursive
+ */
+
+#define E1()   \
+   \
+   WL(X1); \
+   RL(Y1); \
+   RU(Y1); \
+   WU(X1);
+
+#define E2()   \
+   \
+   WL(Y1); \
+   RL(Z1); \
+   RU(Z1); \
+   WU(Y1);
+
+#define E3()   \
+   \
+   WL(Z1); \
+   RL(X1); \
+   RU(X1); \
+   WU(Z1);
+
+#include "locking-selftest-rlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(W1R2_W2R3_W3R1)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * write-write / read-read / write-read deadlock even if read is recursive
+ */
+
+#define E1()   \
+   \
+   WL(X1); \
+   WL(Y1); \
+   WU(Y1); \
+   WU(X1);
+
+#define E2()   \
+   \
+   RL(Y1); \
+   RL(Z1); \
+   RU(Z1); \
+   RU(Y1);
+
+#define E3()   \
+   \
+   WL(Z1); \
+   RL(X1); \
+   RU(X1); \
+   WU(Z1);
+
+#include "locking-selftest-rlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(W1W2_R2R3_W3R1)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * write-write / read-read / read-write is not deadlock when read is recursive
+ */
+
+#define E1()   \
+   \
+   WL(X1); \
+   WL(Y1); \
+   WU(Y1); \
+   WU(X1);
+
+#define E2()   \
+   \
+   RL(Y1); \
+   RL(Z1); \
+   RU(Z1); \
+   RU(Y1);
+
+#define E3()   \
+   \
+   RL(Z1); \
+   WL(X1); \
+   WU(X1); \
+   RU(Z1);
+
+#include "locking-selftest-rlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(W1R2_R2R3_W3W1)
+
+#undef E1
+#undef E2
+#undef E3
+
+/*
+ * write-read / read-read / write-write is not deadlock when read is recursive
+ */
+
+#define E1()   \
+   \
+   WL(X1); \
+   RL(Y1); \
+   RU(Y1); \
+   WU(X1);
+
+#define E2() 

[RFC tip/locking v2 11/13] lockdep: Take read/write status in consideration when generate chainkey

2017-09-06 Thread Boqun Feng
Currently, the chainkey of a lock chain is a hash of the class_idx of
all the held locks; the read/write status is not taken into
consideration while generating the chainkey. This can result in a
problem if we have:

P1()
{
read_lock(B);
lock(A);
}

P2()
{
lock(A);
read_lock(B);
}

P3()
{
lock(A);
write_lock(B);
}

, and P1(), P2(), P3() run one by one. When P2() runs, the lock chain
A -> B is detected not to be a deadlock, so it is added to the chain
cache; then when P3() runs, even though it is a deadlock, we can miss it
because of the chain cache hit. This can be confirmed by the selftest
case "chain cached mixed R-L/L-W ".

To resolve this, we use an "hlock_id" to generate the chainkey; an
hlock_id is a tuple (hlock->class_idx, hlock->read), which fits in a u16
type. With this, the chainkeys are different if two lock sequences have
the same locks but different read/write status.

Besides, since we use "hlock_id"s to generate chainkeys, the
chain_hlocks array now stores "hlock_id"s rather than lock_class
indexes.

Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 60 ++--
 1 file changed, 38 insertions(+), 22 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b4f72a3380c3..9f7a02315c6a 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -311,6 +311,21 @@ static struct hlist_head classhash_table[CLASSHASH_SIZE];
 
 static struct hlist_head chainhash_table[CHAINHASH_SIZE];
 
+/*
+ * the id  chain_hlocks
+ */
+static inline u16 hlock_id(struct held_lock *hlock)
+{
+   BUILD_BUG_ON(MAX_LOCKDEP_KEYS_BITS + 2 > 16);
+
+   return (hlock->class_idx | (hlock->read << MAX_LOCKDEP_KEYS_BITS));
+}
+
+static inline unsigned int chain_hlock_class_idx(u16 hlock_id)
+{
+   return hlock_id & MAX_LOCKDEP_KEYS;
+}
+
 /*
  * The hash key of the lock dependency chains is a hash itself too:
  * it's a hash of all locks taken up to that lock, including that lock.
@@ -2212,7 +2227,10 @@ static u16 chain_hlocks[MAX_LOCKDEP_CHAIN_HLOCKS];
 
 struct lock_class *lock_chain_get_class(struct lock_chain *chain, int i)
 {
-   return lock_classes + chain_hlocks[chain->base + i];
+   u16 chain_hlock = chain_hlocks[chain->base + i];
+   unsigned int class_idx = chain_hlock_class_idx(chain_hlock);
+
+   return lock_classes + class_idx - 1;
 }
 
 /*
@@ -2238,12 +2256,12 @@ static inline int get_first_held_lock(struct task_struct *curr,
 /*
  * Returns the next chain_key iteration
  */
-static u64 print_chain_key_iteration(int class_idx, u64 chain_key)
+static u64 print_chain_key_iteration(u16 hlock_id, u64 chain_key)
 {
-   u64 new_chain_key = iterate_chain_key(chain_key, class_idx);
+   u64 new_chain_key = iterate_chain_key(chain_key, hlock_id);
 
-   printk(" class_idx:%d -> chain_key:%016Lx",
-   class_idx,
+   printk(" hlock_id:%d -> chain_key:%016Lx",
+   (unsigned int)hlock_id,
(unsigned long long)new_chain_key);
return new_chain_key;
 }
@@ -2259,12 +2277,12 @@ print_chain_keys_held_locks(struct task_struct *curr, struct held_lock *hlock_ne
printk("depth: %u\n", depth + 1);
for (i = get_first_held_lock(curr, hlock_next); i < depth; i++) {
hlock = curr->held_locks + i;
-   chain_key = print_chain_key_iteration(hlock->class_idx, chain_key);
+   chain_key = print_chain_key_iteration(hlock_id(hlock), chain_key);
 
print_lock(hlock);
}
 
-   print_chain_key_iteration(hlock_next->class_idx, chain_key);
+   print_chain_key_iteration(hlock_id(hlock_next), chain_key);
print_lock(hlock_next);
 }
 
@@ -2272,14 +2290,14 @@ static void print_chain_keys_chain(struct lock_chain *chain)
 {
int i;
u64 chain_key = 0;
-   int class_id;
+   u16 hlock_id;
 
printk("depth: %u\n", chain->depth);
for (i = 0; i < chain->depth; i++) {
-   class_id = chain_hlocks[chain->base + i];
-   chain_key = print_chain_key_iteration(class_id + 1, chain_key);
+   hlock_id = chain_hlocks[chain->base + i];
+   chain_key = print_chain_key_iteration(hlock_id, chain_key);
 
-   print_lock_name(lock_classes + class_id);
+   print_lock_name(lock_classes + chain_hlock_class_idx(hlock_id) - 1);
printk("\n");
}
 }
@@ -2328,7 +2346,7 @@ static int check_no_collision(struct task_struct *curr,
}
 
for (j = 0; j < chain->depth - 1; j++, i++) {
-   id = curr->held_locks[i].class_idx - 1;
+   id = hlock_id(&curr->held_locks[i]);
 
if (DEBUG_LOCKS_WARN_ON(chain_hlocks[chain->base + j] != id)) {
print_c

[RFC tip/locking v2 12/13] lockdep/selftest: Unleash irq_read_recursion2 and add more

2017-09-06 Thread Boqun Feng
Now that we can handle recursive read related irq inversion deadlocks
correctly, uncomment irq_read_recursion2 and add more testcases.

Signed-off-by: Boqun Feng 
---
 lib/locking-selftest.c | 59 --
 1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index 1f794bb441a9..cbdcec6a776e 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -1051,20 +1051,28 @@ GENERATE_PERMUTATIONS_3_EVENTS(irq_inversion_soft_wlock)
 #define E3()   \
\
IRQ_ENTER();\
-   RL(A);  \
+   LOCK(A);\
L(B);   \
U(B);   \
-   RU(A);  \
+   UNLOCK(A);  \
IRQ_EXIT();
 
 /*
- * Generate 12 testcases:
+ * Generate 24 testcases:
  */
 #include "locking-selftest-hardirq.h"
-GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_hard)
+#include "locking-selftest-rlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_hard_rlock)
+
+#include "locking-selftest-wlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_hard_wlock)
 
 #include "locking-selftest-softirq.h"
-GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_soft)
+#include "locking-selftest-rlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_soft_rlock)
+
+#include "locking-selftest-wlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_soft_wlock)
 
 #undef E1
 #undef E2
@@ -1078,8 +1086,8 @@ GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_soft)
\
IRQ_DISABLE();  \
L(B);   \
-   WL(A);  \
-   WU(A);  \
+   LOCK(A);\
+   UNLOCK(A);  \
U(B);   \
IRQ_ENABLE();
 
@@ -1096,13 +1104,21 @@ GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion_soft)
IRQ_EXIT();
 
 /*
- * Generate 12 testcases:
+ * Generate 24 testcases:
  */
 #include "locking-selftest-hardirq.h"
-// GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_hard)
+#include "locking-selftest-rlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_hard_rlock)
+
+#include "locking-selftest-wlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_hard_wlock)
 
 #include "locking-selftest-softirq.h"
-// GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_soft)
+#include "locking-selftest-rlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_soft_rlock)
+
+#include "locking-selftest-wlock.h"
+GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_soft_wlock)
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define I_SPINLOCK(x) lockdep_reset_lock(&lock_##x.dep_map)
@@ -1255,6 +1271,25 @@ static inline void print_testname(const char *testname)
dotest(name##_rlock_##nr, SUCCESS, LOCKTYPE_RWLOCK);\
pr_cont("\n");
 
+#define DO_TESTCASE_2RW(desc, name, nr)\
+   print_testname(desc"/"#nr); \
+   pr_cont("  |"); \
+   dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK);\
+   dotest(name##_rlock_##nr, SUCCESS, LOCKTYPE_RWLOCK);\
+   pr_cont("\n");
+
+#define DO_TESTCASE_2x2RW(desc, name, nr)  \
+   DO_TESTCASE_2RW("hard-"desc, name##_hard, nr)   \
+   DO_TESTCASE_2RW("soft-"desc, name##_soft, nr)   \
+
+#define DO_TESTCASE_6x2x2RW(desc, name)\
+   DO_TESTCASE_2x2RW(desc, name, 123); \
+   DO_TESTCASE_2x2RW(desc, name, 132); \
+   DO_TESTCASE_2x2RW(desc, name, 213); \
+   DO_TESTCASE_2x2RW(desc, name, 231); \
+   DO_TESTCASE_2x2RW(desc, name, 312); \
+   DO_TESTCASE_2x2RW(desc, name, 321);
+
 #define DO_TESTCASE_6(desc, name)  \
print_testname(desc);   \
dotest(name##_spin, FAILURE, LOCKTYPE_SPIN);\
@@ -2111,8 +2146,8 @@ void locking_selftest(void)
DO_TESTCASE_6x6("safe-A + unsafe-B #2", irqsafe4);
DO_TESTCASE_6x6RW("irq lock-inversion", irq_inversion);
 
-   DO_TESTCASE_6x2("irq read-recursion", irq_read_recursion);
-// DO_TESTCASE_6x2B("irq read-recursion #2", irq_read_recursion2);
+   DO_TESTCASE_6x2x2RW("irq read-recursion", irq_read_recursion);
+   DO_TESTCASE_6x2x2RW("irq read-recursion #2", irq_read_recursion2);
 
ww_tests();
 
-- 
2.14.1



Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
[...]
> > > @@ -810,6 +810,9 @@ static void __oom_kill_process(struct task_struct 
> > > *victim)
> > >   struct mm_struct *mm;
> > >   bool can_oom_reap = true;
> > >  
> > > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
> > > + return;
> > > +
> > 
> > This will leak a reference to the victim AFACS
> 
> Good catch!
> I didn't fix this after moving reference dropping into __oom_kill_process().
> Fixed.

Btw. didn't you want to check
victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

here as well? Maybe I've missed something but you can still kill a task
which is oom disabled, which I thought we agreed is the wrong thing to
do.
-- 
Michal Hocko
SUSE Labs


[RFC tip/locking v2 09/13] lockdep: Add recursive read locks into dependency graph

2017-09-06 Thread Boqun Feng
Since we have all the fundamentals to handle recursive read locks, we
now add them to the dependency graph.

Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 16 +++-
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 6659b91b70f2..b4f72a3380c3 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2033,16 +2033,6 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
if (!check_prev_add_irq(curr, prev, next))
return 0;
 
-   /*
-* For recursive read-locks we do all the dependency checks,
-* but we dont store read-triggered dependencies (only
-* write-triggered dependencies). This ensures that only the
-* write-side dependencies matter, and that if for example a
-* write-lock never takes any other locks, then the reads are
-* equivalent to a NOP.
-*/
-   if (next->read == 2 || prev->read == 2)
-   return 1;
/*
 * Is the  ->  dependency already present?
 *
@@ -2164,7 +2154,7 @@ check_prevs_add(struct task_struct *curr, struct held_lock *next)
 * Only non-recursive-read entries get new dependencies
 * added:
 */
-   if (hlock->read != 2 && hlock->check) {
+   if (hlock->check) {
int ret = check_prev_add(curr, hlock, next,
 distance, &trace, save);
if (!ret)
@@ -4965,7 +4955,7 @@ static inline struct lock_class *xlock_class(struct cross_lock *xlock)
  */
 static inline int depend_before(struct held_lock *hlock)
 {
-   return hlock->read != 2 && hlock->check && !hlock->trylock;
+   return hlock->check && !hlock->trylock;
 }
 
 /*
@@ -4973,7 +4963,7 @@ static inline int depend_before(struct held_lock *hlock)
  */
 static inline int depend_after(struct held_lock *hlock)
 {
-   return hlock->read != 2 && hlock->check;
+   return hlock->check;
 }
 
 /*
-- 
2.14.1



Re: a competition when some threads acquire futex

2017-09-06 Thread Thomas Gleixner
On Wed, 6 Sep 2017, chengjian (D) wrote:

> > > diff --git a/kernel/futex.c b/kernel/futex.c
> > > index 3d38eaf..0b2d17a 100644
> > > --- a/kernel/futex.c
> > > +++ b/kernel/futex.c
> > > @@ -1545,6 +1545,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32
> > > uval,
> > > struct futex_pi_state *pi_
> > >   spin_unlock(&hb->lock);
> > >   wake_up_q(&wake_q);
> > > +_cond_resched( );
> > 

> I wrote _cond_resched( ) in futex_wake( ), which will be called to wake
> up waiters when the process releases the futex.
> 
> 
> But the patch produced by git format-patch displayed it in wake_futex_pi( ).

Ok. Still that patch has issues.

1) It's white space damaged. Please use TAB not spaces for
   indentation. checkpatch.pl would have told you.

2) Why are you using _cond_resched() instead of plain cond_resched().

   cond_resched() is what you want to use.

Thanks,

tglx


Re: [PATCH 2/2] tracing: Add support for critical section events

2017-09-06 Thread Peter Zijlstra
On Tue, Sep 05, 2017 at 09:35:11AM -0700, Joel Fernandes wrote:
> On Mon, Sep 4, 2017 at 11:52 PM, Peter Zijlstra  wrote:
> > On Mon, Sep 04, 2017 at 08:26:13PM -0700, Joel Fernandes wrote:
> >
> >> Apologies, I meant (without the "off"):
> >>
> >> subsystem: atomic_section
> >> events:
> >>   irqs_disable
> >>   irqs_enable
> >>   preempt_disable
> >>   preempt_enable
> >>
> >> and additionally (similar to what my patch does):
> >>   preemptirq_enable
> >>   preemptirq_disable
> >>
> >
> > What do you need the last for?
> 
> For the last 2 events above, 'disable' means either preempt or irq got
> disabled, and 'enable' means *both* preempt and irq are enabled (after
> either one of them was disabled).
> 
> This has the advantage of not generating events when we're already in
> an atomic section: for example, acquiring spin locks in an interrupt
> handler might increase the preempt count and generate 'preempt_disable'
> events, but not 'preemptirq_disable' events.
> This has the effect of reducing the spam in the traces when all we
> care about is being in an atomic section or not. These events happen a
> lot so to conserve space in the trace buffer, the user may want to
> just enable the latter 2 events. Does that sound Ok to you?

Hurm,... how about placing a filter on the other 4, such that we only
emit the event on 0<->x state transitions? IIRC tracing already has
filter bits and eBPF bits that allow something like that.

That avoids having to introduce more tracepoints and gets you the same
results.
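The 0<->x transition idea can be modeled in a few lines. This is a sketch of the filtering *semantics* only, not of the tracing filter syntax; the event names are taken from Joel's list above:

```python
def transition_filter(events):
    """Keep only events where the disable-nesting depth crosses 0 <-> nonzero.

    `events` is a list of names ending in `_disable` or `_enable`; this models
    the semantics wanted from preemptirq_{disable,enable} using a filter on
    the existing four events instead of two new tracepoints.
    """
    depth = 0
    kept = []
    for name in events:
        if name.endswith('_disable'):
            if depth == 0:          # entering an atomic section
                kept.append(name)
            depth += 1
        elif name.endswith('_enable'):
            depth -= 1
            if depth == 0:          # fully preemptible/interruptible again
                kept.append(name)
    return kept
```

Inner disable/enable pairs taken inside an already-atomic section are dropped, which is exactly the trace-buffer saving argued for above.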


Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

2017-09-06 Thread Michal Hocko
On Tue 05-09-17 20:16:09, Roman Gushchin wrote:
> On Tue, Sep 05, 2017 at 05:12:51PM +0200, Michal Hocko wrote:
[...]
> > > Then we should probably hide the corresponding
> > > cgroup interface (oom_group and oom_priority knobs) by default,
> > > and it feels like an unnecessary complication that is overall against
> > > the cgroup v2 interface design.
> > 
> > Why? If we care enough, we could simply return EINVAL when those knobs
> > are written while the corresponding strategy is not used.
> 
> It doesn't look like a nice default interface.

I do not have a strong opinion on this. A printk_once could explain why
the knob is ignored and instruct the admin how to enable the feature
completely.
 
> > > > I think we should instead go with
> > > > oom_strategy=[alloc_task,biggest_task,cgroup]
> > > 
> > > It would be a really nice interface; although I've no idea how to 
> > > implement it:
> > > "alloc_task" is an existing sysctl, which we have to preserve;
> > 
> > I would argue that we should simply deprecate and later drop the sysctl.
> > I _strongly_ doubt anybody is using this. If someone is, it is not that
> > hard to change the kernel command line rather than select the sysctl.
> 
> I agree. And if so, why do we need a new interface for a useless feature?

Well, I won't be opposed to just deprecating the sysctl and only adding a
"real" kill-allocate strategy if somebody explicitly asks for it.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] sound: soc: fsl: Do not set DAI sysclk when it is equal to system freq

2017-09-06 Thread Łukasz Majewski

Hi Fabio,


On Tue, Sep 5, 2017 at 6:13 PM, Łukasz Majewski  wrote:


&i2c1 {
 clock-frequency = <40>;
 pinctrl-names = "default";
 pinctrl-0 = <&pinctrl_i2c1>;
 status = "okay";

 codec: tfa9879@6C {
 #sound-dai-cells = <0>;
 compatible = "tfa9879";


This codec seems to be missing device tree support. Don't you need something
like this?

--- a/sound/soc/codecs/tfa9879.c
+++ b/sound/soc/codecs/tfa9879.c
@@ -312,9 +312,15 @@ static const struct i2c_device_id tfa9879_i2c_id[] = {
  };
  MODULE_DEVICE_TABLE(i2c, tfa9879_i2c_id);

+static const struct of_device_id tfa9879_of_match[] = {
+   { .compatible = "nxp,tfa9879", },
+   { }
+};
+
  static struct i2c_driver tfa9879_i2c_driver = {
 .driver = {
 .name = "tfa9879",
+   .of_match_table = tfa9879_of_match,
 },
 .probe = tfa9879_i2c_probe,
 .remove = tfa9879_i2c_remove,




Maybe it should be added, but the driver is probed via I2C "node" in dts.

I took the codec (tfa9879) driver as-is.


--
Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: w...@denx.de


Re: [RFC tip/locking v2 00/13] lockdep: Support deadlock detection for recursive read locks

2017-09-06 Thread Boqun Feng
On Wed, Sep 06, 2017 at 04:28:11PM +0800, Boqun Feng wrote:
> Hi Ingo and Peter,
> 
> This is V2 for recursive read lock support in lockdep. I fix several
> bugs in V1 and also add irq inversion detection support for recursive
> read locks.
> 
> V1: https://marc.info/?l=linux-kernel&m=150393341825453
> 
> 
> As Peter pointed out:
> 
>   https://marc.info/?l=linux-kernel&m=150349072023540
> 
> Lockdep currently has limited support for recursive read locks; the
> following deadlock case could not be detected:
> 
>   read_lock(A);
>   lock(B);
>   lock(B);
>   write_lock(A);
> 
> I got some inspiration from Gautham R Shenoy:
> 
>   https://lwn.net/Articles/332801/
> 
> , and came up with this series.
> 
> The basic idea is:
> 
> * Add recursive read locks into the graph
> 
> * Classify dependencies into --(RR)-->, --(NR)-->, --(RN)-->,
>   --(NN)-->, where R stands for recursive read lock and N stands for
>   other locks (i.e. non-recursive read locks and write locks).
> 
> * Define strong dependency paths as paths of dependencies that don't
>   have two adjacent dependencies of the form --(*R)--> and --(R*)-->.
> 
> * Extend __bfs() to only traverse on strong dependency paths.
> 
> * If __bfs() finds a strong dependency circle, then a deadlock is
>   reported.
> 
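The strong-path rule above can be sketched as a toy cycle detector. This is an editorializing model of the stated algorithm, not the lockdep implementation (which extends __bfs() in kernel/locking/lockdep.c):

```python
from collections import deque

def has_strong_cycle(deps):
    """deps: list of (src, dst, kind) with kind in {'NN', 'NR', 'RN', 'RR'};
    kind[0]/kind[1] say whether src/dst were taken as recursive reads.

    A strong path never chains --(*R)--> directly into --(R*)-->; a strong
    dependency circle means a reportable deadlock.
    """
    for first in deps:
        queue = deque([(first, {id(first)})])
        while queue:
            (src, dst, kind), seen = queue.popleft()
            for dep in deps:
                s2, d2, kind2 = dep
                if s2 != dst or id(dep) in seen:
                    continue
                if kind[1] == 'R' and kind2[0] == 'R':
                    continue  # weak junction: not part of a strong path
                # Closing the circle must also form a strong junction.
                if d2 == first[0] and not (kind2[1] == 'R' and first[2][0] == 'R'):
                    return True
                queue.append((dep, seen | {id(dep)}))
    return False
```

With A --(RN)--> B from one task and B --(NN)--> A from another (the example above), a strong circle exists; two tasks taking only recursive read locks (A --(RR)--> B and B --(RR)--> A) cannot deadlock, and the model agrees.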
> The whole series is based on current master branch of Linus' tree:
> 
>   e7d0c41ecc2e ("Merge tag 'devprop-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")
> 
> , and I also put it at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git arr-rfc-v2

Hmm.. we should revert

d82fed752942 ("locking/lockdep/selftests: Fix mixed read-write ABBA tests")

for testing, as it is a workaround for the previously limited support for
recursive read locks.

I put a branch with that reverted at:

git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git arr-rfc-v2a

Selftests passed for that branch; now running them for longer.

Regards,
Boqun




Re: [PATCH] blk-mq: Start to fix memory ordering...

2017-09-06 Thread Peter Zijlstra
On Wed, Sep 06, 2017 at 09:13:04AM +0200, Andrea Parri wrote:
> > +   smp_mb__before_atomic();
> 
> I am wondering whether we should be using smp_wmb() instead: this would
> provide the above guarantee and save a full barrier on powerpc/arm64.

Right, did that.

> > +   set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> > +   if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> > +   /*
> > +* Coherence order guarantees these consecutive stores to a
> > +* single variable propagate in the specified order. Thus the
> > +* clear_bit() is ordered _after_ the set bit. See
> > +* blk_mq_check_expired().
> > +*/
> > clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
> 
> It could be useful to stress that set_bit(), clear_bit()  must "act" on
> the same subword of the unsigned long (whatever "subword" means at this
> level...) to rely on the coherence order (c.f., alpha's implementation).

As I wrote to your initial reply (which I now saw was private) I went
through the architectures again and found h8300 to use byte ops to
implement all the bitops.

So subword here means byte :/

The last time we looked at this was for PG_waiter and back then I think
we settled on u32 (with Alpha for example using 32bit ll/sc ops). Linus
moved PG_waiters to the same byte though, so that is all fine.

  b91e1302ad9b ("mm: optimize PageWaiters bit use for unlock_page()")

> > +   if (time_after_eq(jiffies, deadline)) {
> > +   if (!blk_mark_rq_complete(rq)) {
> > +   /*
> > +* Relies on the implied MB from test_and_clear() to
> > +* order the COMPLETE load against the STARTED load.
> > +* Orders against the coherence order in
> > +* blk_mq_start_request().
> 
> I understand "from test_and_set_bit()" (in blk_mark_rq_complete()) and
> that the cycle of interest is:
> 
>/* in blk_mq_start_request() */
>[STORE STARTED bit = 1 into atomic_flags]
>   -->co [STORE COMPLETE bit = 0 into atomic_flags]
>  /* in blk_mq_check_expired() */
>  -->rf [LOAD COMPLETE bit = 0 from atomic_flags]
> -->po-loc [LOAD STARTED bit = 0 from atomic_flags]
>/* in blk_mq_start_request() again */
>-->fr [STORE STARTED bit = 1 into atomic_flags]
> 
>(N.B. Assume all accesses happen to/from the same subword.)
> 
> This cycle being forbidden by the "coherence check", I'd say we do not
> need to rely on the MB mentioned by the comment; what am I missing?

Nothing, I forgot about the read-after-read thing but did spot the MB.
Either one suffices to guarantee the order we need. It just needs to be
documented as being relied upon.


Re: [PATCH] genirq: Provide IRQCHIP_ONESHOT_EDGE_SAFE

2017-09-06 Thread Thomas Gleixner
On Tue, 1 Aug 2017, Benjamin Herrenschmidt wrote:

> On Mon, 2017-07-31 at 21:33 +0200, Thomas Gleixner wrote:
> > If an interrupt chip is marked IRQCHIP_ONESHOT_SAFE it signals that the
> > interrupt chip does not require the ONESHOT mode for threaded
> > interrupts. Unfortunately this is applied independent of the interrupt type
> > (edge/level).
> > 
> > The powerpc XPIC wants this functionality only for edge type
> > interrupts. Provide a new flag which provides the same functionality
> > restricted to edge type interrupts.
> 
> Thanks ! I'll test that asap (got pulled into another emergency so it
> might take a day or two).

So that emergency is still on? A day or two was about 35 days ago :)


Re: [PATCH] genirq/msi: fix populating multiple interrupts

2017-09-06 Thread Marc Zyngier
Hi John,

On 05/09/17 18:12, John Keeping wrote:
> Use the correct variable to set up each interrupt in turn rather than
> configuring the first interrupt "nvec" times.

Thanks for addressing this. I think this bug deserves a slightly better
write-up. How about something like:


When allocating the interrupts routed via a wire-to-MSI bridge, we
iterate over the MSI descriptors to build the hierarchy, but fail to use
the descriptor interrupt number, and instead use the base number,
generating the wrong IRQ domain mappings.

The fix is to use the MSI descriptor interrupt number when setting up
the interrupt instead of the base interrupt for the allocation range.

The only saving grace is that although the MSI descriptors are allocated
in bulk, the wired interrupts are only allocated one by one (so
desc->irq == virq) and the bug goes unnoticed.
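The shape of the bug is easy to model. The class and function names below are hypothetical stand-ins, not the kernel API:

```python
class MsiDesc:
    """Stand-in for struct msi_desc; only the interrupt number matters here."""
    def __init__(self, irq):
        self.irq = irq

def populated_irqs_buggy(virq, descs):
    # Bug: configures the *base* interrupt once per descriptor.
    return [virq for _ in descs]

def populated_irqs_fixed(virq, descs):
    # Fix: use each descriptor's own interrupt number.
    return [d.irq for d in descs]
```

For a bulk allocation of three descriptors starting at virq 10, the buggy loop maps 10 three times; with a single wired interrupt (desc->irq == virq) both versions agree, which is why the bug went unnoticed.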


> Signed-off-by: John Keeping 
> ---
>  kernel/irq/msi.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index 48eadf416c24..3fa4bd59f569 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -315,11 +315,12 @@ int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
>  
>   ops->set_desc(arg, desc);
>   /* Assumes the domain mutex is held! */
> - ret = irq_domain_alloc_irqs_hierarchy(domain, virq, 1, arg);
> + ret = irq_domain_alloc_irqs_hierarchy(domain, desc->irq, 1,
> +   arg);
>   if (ret)
>   break;
>  
> - irq_set_msi_desc_off(virq, 0, desc);
> + irq_set_msi_desc_off(desc->irq, 0, desc);
>   }
>  
>   if (ret) {
> 

Fixes: 2145ac9310b60 ("genirq/msi: Add msi_domain_populate_irqs")
Cc: sta...@vger.kernel.org #v4.5+
Reviewed-by: Marc Zyngier 

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH 1/2] pidmap(2)

2017-09-06 Thread Alexey Dobriyan
On 9/6/17, Andrew Morton  wrote:
> On Tue, 5 Sep 2017 22:05:00 +0300 Alexey Dobriyan  wrote:
>
>> Implement a system call for bulk retrieval of pids in binary form.
>>
>> Using /proc is slower than necessary: 3 syscalls + another 3 for each
>> thread + converting with atoi().
>>
>> /proc may not be mounted, especially in containers. A natural extension
>> of the hidepid=2 efforts is to not mount /proc at all.
>>
>> It could be used by programs like ps, top or CRIU. Speed increase will
>> become more drastic once combined with bulk retrieval of process
>> statistics.
>
> The patches are performance optimizations, but their changelogs contain
> no performance measurements!
>
> Demonstration of some compelling real-world performance benefits would
> help things along a lot.

I forgot the sheet with numbers at work. :^)
They're very embarrassing for /proc.

pidmap:
N=1<<16 times
~130 processes (~250 task_structs) on a regular desktop system
opendir + readdir + closedir /proc + the same for every /proc/$PID/task
(roughly what htop(1) does) vs pidmap

/proc 16.80+-0.73%
pidmap 0.06+-0.31%

fdmap:
N=1<<22 times
4 opened descriptors (0, 1, 2, 3)
opendir+readdir+closedir /proc/self/fd (lsof(1)) vs fdmap

/proc 8.31+-0.37%
fdmap 0.32+-0.72%

Currently the performance improvements may not be huge or even visible.
That's because programs like ps/top/lsof _have_ to use /proc to retrieve
other information. If combined with bulk taskstats-ish retrieval
interfaces, they should run circles around /proc.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>

void f(void)
{
	DIR *d;
	struct dirent *de;

	d = opendir("/proc/");
	while ((de = readdir(d))) {
		if ('1' <= de->d_name[0] && de->d_name[0] <= '9') {
			int pid = atoi(de->d_name);
			char buf[32];
			DIR *dt;
			struct dirent *dte;

			snprintf(buf, sizeof(buf), "/proc/%d/task", pid);
			dt = opendir(buf);
			readdir(dt);
			readdir(dt);
			while ((dte = readdir(dt))) {
int tid = atoi(dte->d_name);
asm volatile ("" :: "g" (&tid) : "memory");
			}
			closedir(dt);
		}
	}
	closedir(d);
}

static inline long sys_pidmap(int *pid, unsigned int n, int start)
{
	register long r10 asm ("r10") = 0;
	long rv;
	asm volatile (
		"syscall"
		: "=a" (rv)
		: "0" (333), "D" (pid), "S" (n), "d" (start), "r" (r10)
		: "rcx", "r11", "cc", "memory"
	);
	return rv;
}

void g(void)
{
	int pid[1024];

	sys_pidmap(pid, sizeof(pid)/sizeof(pid[0]), 0);
}

int main(void)
{
	unsigned int i;

//	for (i = 0; i < (1<<16); i++)
//		f();

	for (i = 0; i < (1<<16); i++)
		g();

	return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <unistd.h>

void f(void)
{
	DIR *d;
	struct dirent *de;

	d = opendir("/proc/self/fd");
	while ((de = readdir(d))) {
		int fd = atoi(de->d_name);
		asm volatile ("" :: "g" (&fd) : "memory");
	}
	closedir(d);
}

static inline long sys_fdmap(int pid, int *fd, unsigned int n, int start)
{
	register long r10 asm ("r10") = start;
	register long r8 asm ("r8") = 0;
	long rv;
	asm volatile (
		"syscall"
		: "=a" (rv)
		: "0" (334), "D" (pid), "S" (fd), "d" (n), "r" (r10), "r" (r8)
		: "rcx", "r11", "cc", "memory"
	);
	return rv;
}

void g(void)
{
	int fd[1024];

	sys_fdmap(0, fd, sizeof(fd)/sizeof(fd[0]), 0);
}

int main(void)
{
	unsigned int i;

//	for (i = 0; i < (1<<22); i++)
//		f();

	dup(0);
	for (i = 0; i < (1<<22); i++)
		g();

	return 0;
}


Re: a competition when some threads acquire futex

2017-09-06 Thread Peter Zijlstra
On Wed, Sep 06, 2017 at 10:36:29AM +0200, Thomas Gleixner wrote:
> On Wed, 6 Sep 2017, chengjian (D) wrote:
> 
> > > > diff --git a/kernel/futex.c b/kernel/futex.c
> > > > index 3d38eaf..0b2d17a 100644
> > > > --- a/kernel/futex.c
> > > > +++ b/kernel/futex.c
> > > > @@ -1545,6 +1545,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32
> > > > uval,
> > > > struct futex_pi_state *pi_
> > > > spin_unlock(&hb->lock);
> > > > wake_up_q(&wake_q);
> > > > +_cond_resched( );
> > > 
> 
> > I wrote _cond_resched( ) in futex_wake( ), which will be called to wake
> > up waiters when the process releases the futex.
> > 
> > 
> > But the patch produced by git format-patch displayed it in wake_futex_pi( ).
> 
> Ok. Still that patch has issues.
> 
> 1) It's white space damaged. Please use TAB not spaces for
>indentation. checkpatch.pl would have told you.
> 
> 2) Why are you using _cond_resched() instead of plain cond_resched().
> 
>cond_resched() is what you want to use.

Right, but even if it was a coherent patch, I'm not sure it makes sense.

futex_wait() / futex_wake() don't make ordering guarantees and in
general you don't get to have wakeup preemption if you don't run a
PREEMPT kernel.

So what makes this wakeup so special? Any changelog would need to have a
convincing argument.


[PATCH] libata: zpodd: make arrays cdb static, reduces object code size

2017-09-06 Thread Colin King
From: Colin Ian King 

Don't populate the arrays cdb on the stack, instead make them static.
Makes the object code smaller by 230 bytes:

Before:
   text    data   bss    dec    hex  filename
   3797     240     0   4037    fc5  drivers/ata/libata-zpodd.o

After:
   text    data   bss    dec    hex  filename
   3407     400     0   3807    edf  drivers/ata/libata-zpodd.o

Signed-off-by: Colin Ian King 
---
 drivers/ata/libata-zpodd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/ata/libata-zpodd.c b/drivers/ata/libata-zpodd.c
index 8a01d09ac4db..23a62e4015d0 100644
--- a/drivers/ata/libata-zpodd.c
+++ b/drivers/ata/libata-zpodd.c
@@ -34,7 +34,7 @@ struct zpodd {
 static int eject_tray(struct ata_device *dev)
 {
struct ata_taskfile tf;
-   const char cdb[] = {  GPCMD_START_STOP_UNIT,
+   static const char cdb[] = {  GPCMD_START_STOP_UNIT,
0, 0, 0,
0x02, /* LoEj */
0, 0, 0, 0, 0, 0, 0,
@@ -55,7 +55,7 @@ static enum odd_mech_type zpodd_get_mech_type(struct ata_device *dev)
unsigned int ret;
struct rm_feature_desc *desc = (void *)(buf + 8);
struct ata_taskfile tf;
-   char cdb[] = {  GPCMD_GET_CONFIGURATION,
+   static const char cdb[] = {  GPCMD_GET_CONFIGURATION,
2,  /* only 1 feature descriptor requested */
0, 3,   /* 3, removable medium feature */
0, 0, 0,/* reserved */
-- 
2.14.1



Re: [PATCH 3/6] hid: make device_attribute const

2017-09-06 Thread Jiri Kosina
On Mon, 21 Aug 2017, Bhumika Goyal wrote:

> Make this const as it is only passed as an argument to the functions
> device_create_file and device_remove_file, and the corresponding
> arguments are of type const.
> Done using Coccinelle
> 
> Signed-off-by: Bhumika Goyal 
> ---
>  drivers/hid/hid-core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/hid/hid-core.c b/drivers/hid/hid-core.c
> index 9bc9116..24e929c 100644
> --- a/drivers/hid/hid-core.c
> +++ b/drivers/hid/hid-core.c
> @@ -1662,7 +1662,7 @@ static bool hid_hiddev(struct hid_device *hdev)
>   .size = HID_MAX_DESCRIPTOR_SIZE,
>  };
>  
> -static struct device_attribute dev_attr_country = {
> +static const struct device_attribute dev_attr_country = {
>   .attr = { .name = "country", .mode = 0444 },
>   .show = show_country,

Applied, thanks.

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH] HID: hid-lg: make array cbuf static const to shrink object code size

2017-09-06 Thread Jiri Kosina
On Fri, 25 Aug 2017, Colin King wrote:

> From: Colin Ian King 
> 
> Don't populate array cbuf on the stack, instead make it static.
> Makes the object code smaller by over 110 bytes:
> 
> Before:
>    text   data  bss    dec   hex  filename
>   15096   3504  128  18728  4928  drivers/hid/hid-lg.o
> 
> After:
>    text   data  bss    dec   hex  filename
>   14884   3600  128  18612  48b4  drivers/hid/hid-lg.o
> 
> Signed-off-by: Colin Ian King 

Applied, thanks.

-- 
Jiri Kosina
SUSE Labs



Re: Abysmal scheduler performance in Linus' tree?

2017-09-06 Thread Andy Lutomirski


> On Sep 6, 2017, at 1:25 AM, Peter Zijlstra  wrote:
> 
>> On Tue, Sep 05, 2017 at 10:13:39PM -0700, Andy Lutomirski wrote:
>> I'm running e7d0c41ecc2e372a81741a30894f556afec24315 from Linus' tree
>> today, and I'm seeing abysmal scheduler performance.  Running make -j4
>> ends up with all the tasks on CPU 3 most of the time (on my
>> 4-logical-thread laptop).  taskset -c 0 whatever puts whatever on CPU
>> 0, but plain while true; do true; done puts the infinite loop on CPU 3
>> right along with the make -j4 tasks.
>> 
>> This is on Fedora 26, and I don't think I'm doing anything weird.
>> systemd has enabled the cpu controller, but it doesn't seem to have
>> configured anything or created any non-root cgroups.
>> 
>> Just a heads up.  I haven't tried to diagnose it at all.
> 
> "make O=defconfig-build -j80" results in:
> 
> %Cpu0  : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu1  : 88.7 us, 11.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu2  : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu3  : 86.8 us, 13.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu4  : 89.7 us, 10.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu5  : 96.3 us,  3.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu6  : 95.3 us,  4.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu7  : 94.4 us,  5.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu8  : 91.7 us,  8.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu9  : 94.3 us,  5.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu10 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu11 : 96.2 us,  3.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu12 : 91.5 us,  8.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu13 : 90.6 us,  9.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu14 : 97.2 us,  2.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu15 : 89.7 us, 10.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu16 : 90.6 us,  9.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu17 : 93.4 us,  6.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu18 : 90.6 us,  9.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu19 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu20 : 94.4 us,  5.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu21 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu22 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu23 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu24 : 91.6 us,  8.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu25 : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu26 : 93.4 us,  5.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.9 si,  0.0 st
> %Cpu27 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu28 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu29 : 88.8 us, 11.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu30 : 90.6 us,  9.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu31 : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu32 : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu33 : 93.4 us,  6.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu34 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu35 : 93.5 us,  6.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu36 : 90.7 us,  9.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu37 : 97.2 us,  2.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu38 : 92.5 us,  7.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> %Cpu39 : 92.6 us,  7.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> 
> Do you have a .config somewhere?

I'll attach tomorrow.  I'll also test in a VM.

> Are you running with the systemd? Is it
> creating cpu cgroups?

Yes systemd, no cgroups.

> 
> Any specifics on your setup?

On further fiddling, I only see this after a suspend and resume cycle.

Re: [PATCH] powerpc/powernv: Clear LPCR[PECE1] via stop-api only for deep state offline

2017-09-06 Thread pavrampu

On 2017-08-31 17:17, Gautham R. Shenoy wrote:

From: "Gautham R. Shenoy" 

commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via
stop-api only on Hotplug") clears the PECE1 bit of the LPCR via
stop-api during CPU-Hotplug to prevent wakeup due to a decrementer on
an offlined CPU which is in a deep stop state.

In the case where the stop-api support is found to be lacking, the
commit 785a12afdb4a ("powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT
states when stop-api fails") disables deep states that lose hypervisor
context. Thus in this case, the offlined CPU will be put to some
shallow idle state.

However, we currently unconditionally clear the PECE1 in LPCR via
stop-api during CPU-Hotplug even when deep states are disabled due to
stop-api failure.

Fix this by clearing PECE1 of LPCR via stop-api during CPU-Hotplug
*only* when the offlined CPU will be put to a deep state that loses
hypervisor context.

Fixes: 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug")

Reported-by: Pavithra Prakash 
Signed-off-by: Gautham R. Shenoy 


Tested-by: Pavithra Prakash 


---
 arch/powerpc/platforms/powernv/idle.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 9f59041..23f8fba 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -393,7 +393,13 @@ static void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val)
u64 pir = get_hard_smp_processor_id(cpu);

mtspr(SPRN_LPCR, lpcr_val);
-   opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);
+
+   /*
+* Program the LPCR via stop-api only if the deepest stop state
+* can lose hypervisor context.
+*/
+   if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT)
+   opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);
 }

 /*




RE: [PATCH V2 0/3] Use mm_struct and switch_mm() instead of manually

2017-09-06 Thread Prakhya, Sai Praneeth


> -Original Message-
> From: Sai Praneeth Prakhya [mailto:sai.praneeth.prak...@intel.com]
> Sent: Tuesday, September 5, 2017 7:43 PM
> To: Bhupesh Sharma 
> Cc: linux-...@vger.kernel.org; linux-kernel@vger.kernel.org; Matt Fleming
> ; Ard Biesheuvel ;
> j...@suse.com; Borislav Petkov ; Luck, Tony
> ; l...@kernel.org; m...@redhat.com; Neri, Ricardo
> ; Shankar, Ravi V 
> Subject: Re: [PATCH V2 0/3] Use mm_struct and switch_mm() instead of
> manually
> 
> On Tue, 2017-09-05 at 19:21 -0700, Sai Praneeth Prakhya wrote:
> > > I get a similar crash on Qemu with linus's master branch and the V2
> > > applied on top of it. Here are the details of my test environment:
> > >
> > > 1. I use the OVMF (EDK2) EFI firmware to launch the kernel:
> > > edk2.git/ovmf-x64
> > >
> > > 2. I used linus's master branch (HEAD - commit:
> > > b1b6f83ac938d176742c85757960dec2cf10e468) and applied your v2 on top
> > > of the same.
> > >
> > > 3. I use the following qemu command line to launch the test:
> > >
> > > # /usr/local/bin/qemu-system-x86_64 --version QEMU emulator version
> > > 2.9.50 (v2.9.0-526-g76d20ea) Copyright (c) 2003-2017 Fabrice Bellard
> > > and the QEMU Project developers
> > >
> > > # /usr/local/bin/qemu-system-x86_64 -enable-kvm  -net nic -net tap
> > > -m $MEMSIZE -nographic -drive
> > > file=$DISK_IMAGE,if=virtio,format=qcow2
> > > -vga std -boot c -cpu host -kernel $KERNEL -append
> > > "crashkernel=$CRASH_MEMSIZE console=ttyS0,115200n81"  -initrd
> > > $INITRAMFS -bios $OVMF_FW_PATH
> > >
> > > And here is the crash log:
> > >
> > > [0.006054] general protection fault:  [#1] SMP
> > > [0.006459] Modules linked in:
> > > [0.006711] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3
> > > [0.007000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > BIOS 0.0.0 02/06/2015
> > > [0.007000] task: 81e0f480 task.stack: 81e0
> > > [0.007000] RIP: 0010:switch_mm_irqs_off+0x1bc/0x440
> > > [0.007000] RSP: :81e03d80 EFLAGS: 00010086
> > > [0.007000] RAX: 80007d084000 RBX:  RCX:
> 77ff8000
> > > [0.007000] RDX: 7d084000 RSI: 8000 RDI:
> 00019a00
> > > [0.007000] RBP: 81e03dc0 R08:  R09:
> 88007d085000
> > > [0.007000] R10: 81e03dd8 R11: 7d095063 R12:
> 81e5c6a0
> > > [0.007000] R13: 81ed4f40 R14: 0030 R15:
> 0001
> > > [0.007000] FS:  () GS:88007d40()
> > > knlGS:
> > > [0.007000] CS:  0010 DS:  ES:  CR0: 80050033
> > > [0.007000] CR2: 88007d754000 CR3: 0220a000 CR4:
> 000406b0
> > > [0.007000] Call Trace:
> > > [0.007000]  switch_mm+0xd/0x20
> > > [0.007000]  ? switch_mm+0xd/0x20
> > > [0.007000]  efi_switch_mm+0x3e/0x4a
> > > [0.007000]  efi_call_phys_prolog+0x28/0x1ac
> > > [0.007000]  efi_enter_virtual_mode+0x35a/0x48f
> > > [0.007000]  start_kernel+0x332/0x3b8
> > > [0.007000]  x86_64_start_reservations+0x2a/0x2c
> > > [0.007000]  x86_64_start_kernel+0x178/0x18b
> > > [0.007000]  secondary_startup_64+0xa5/0xa5
> > > [0.007000]  ? secondary_startup_64+0xa5/0xa5
> > > [0.007000] Code: 00 00 00 80 49 03 55 50 0f 82 7f 02 00 00 48 b9
> > > 00 00 00 80 ff 77 00 00 48 be 00 00 00 00 00 00 00 80 48 01 ca 48 09
> > > f0 48 09 d0 <0f> 22 d8 0f 1f 44 00 00 e9 47 ff ff ff 65 8b 05 b8 87
> > > fb 7e 89
> > > [0.007000] RIP: switch_mm_irqs_off+0x1bc/0x440 RSP: 81e03d80
> > > [0.007000] ---[ end trace bfa55bf4e4765255 ]---
> > > [0.007000] Kernel panic - not syncing: Attempted to kill the idle 
> > > task!
> > > [0.007000] ---[ end Kernel panic - not syncing: Attempted to kill
> > > the idle task!
> > >
> > > 4. Note though that if I use the EFI_MIXED mode (i.e. 32-bit ovmf
> > > firmware and 64-bit x86 kernel) with your patches, the primary
> > > kernel boots fine on Qemu:
> > >
> > > ovmf firmware used in this case - edk2.git/ovmf-ia32
> > >
> > > 5. Also, if I append 'efi=old_map' to the bootargs (for the failing
> > > case in point 3 above), I see the primary kernel boots fine on Qemu
> > > as well.
> > >
> > > Regards,
> > > Bhupesh
> >
> > Hi Bhupesh,
> >
> > Thanks a lot for the detailed explanation. They are helpful to
> > reproduce the issue quickly. From my initial debug, I think that AMD
> > SME + efi_mm_struct patches + -cpu host (in qemu) are required to
> > reproduce the issue on qemu.
> >
> > I have tried the following combinations (all tests are on qemu):
> > On Linus's tree:
> > 1. With  SME and  efi_mm and  -cpu host -> panics
> > 2. With  SME and  efi_mm and !-cpu host -> boots
> > 3. With  SME and !efi_mm and  -cpu host -> boots
> > 4. With  SME and !efi_mm and !-cpu host -> boots
> > 5. With !SME and  efi_mm and  -cpu host -> boots
> > 6. With !SME and  efi_mm and !-cpu host -> boots
> > 7. With !SME and !efi_

Re: [PATCH] HID: multitouch: support buttons and trackpoint on Lenovo X1 Tab Gen2

2017-09-06 Thread Jiri Kosina
On Fri, 25 Aug 2017, Pavel Tatashin wrote:

> On the 2nd generation Lenovo Tablet only the clickpad is working; the
> trackpoint and the three mouse buttons do not work.
> 
> hid_multitouch must export all inputs in order to get the trackpoint and
> buttons to function.
> 
> Signed-off-by: Pavel Tatashin 

Applied to for-4.14/upstream-fixes. Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: [RFC v2 6/6] platform/x86: intel_pmc_ipc: Use generic Intel IPC device calls

2017-09-06 Thread Andy Shevchenko
On Wed, Sep 6, 2017 at 8:27 AM, Kuppuswamy, Sathyanarayanan
 wrote:
> On 9/5/2017 12:38 AM, Lee Jones wrote:
>> On Sat, 02 Sep 2017, sathyanarayanan.kuppusw...@linux.intel.com wrote:

>> I'm a bit concerned by the API.
>
> This is not a new change. Even before refactoring this driver, we had been
> using a u32 variable to pass the DPTR and SPTR addresses.
>>
>> Any reason why you're not using
>> pointers for addresses?
>
> I think the main reason is that this is the expected format defined by the
> SCU/PMC spec. According to the spec document, the SPTR and DPTR registers are
> used to program the 32-bit SRAM address from which the PMC/SCU firmware can
> read/write the data of an IPC command. If we are not using SPTR or DPTR, we
> need to leave the value at zero.

That's an effect; the cause is that the system controllers used in
Intel SoCs are 32-bit microprocessors, and they can't address more.
That's why the SRAM is placed in the 32-bit address space, and thus
the SCU protocol defines fixed-width parameters for addresses. It
means we can't use any address outside of the 4G space,
iow 32-bit wide.

>> pointers, you should be using NULL, instead of 0.


-- 
With Best Regards,
Andy Shevchenko


Re: [PATCH v3] Input: goodix: Add support for capacitive home button

2017-09-06 Thread Bastien Nocera
Hey,

On Tue, 2017-06-20 at 18:08 +0200, Bastien Nocera wrote:
> From: "Sergei A. Trusov" 
> 
> On some x86 tablets with a Goodix touchscreen, the Windows logo on
> the
> front is a capacitive home button. Touching this button results in a
> touch
> with bit 4 of the first byte set, while only the lower 4 bits (0-3)
> are
> used to indicate the number of touches.
> 
> Report a KEY_LEFTMETA press when this happens.
> 
> Note that the hardware might support more than one button, in which
> case the "id" byte of coor_data would identify the button in
> question.
> This is not implemented as we don't have access to hardware with
> multiple buttons.
> 
> Signed-off-by: Sergei A. Trusov 
> Acked-by: Bastien Nocera 

Can we please get this merged? I didn't receive any reviews on it, and
it's been sitting there for 2 months...


Re: [PATCH v8 01/13] x86/apic: Construct a selector for the interrupt delivery mode

2017-09-06 Thread Baoquan He
On 09/06/17 at 12:18pm, Dou Liyang wrote:
> > > +static int __init apic_intr_mode_select(void)
> > > +{
> > > + /* Check kernel option */
> > > + if (disable_apic) {
> > > + pr_info("APIC disabled via kernel command line\n");
> > > + return APIC_PIC;
> > > + }
> > > +
> > > + /* Check BIOS */
> > > +#ifdef CONFIG_X86_64
> > > + /* On 64-bit, the APIC must be integrated, Check local APIC only */
> > > + if (!boot_cpu_has(X86_FEATURE_APIC)) {
> > > + disable_apic = 1;
> > > + pr_info("APIC disabled by BIOS\n");
> > > + return APIC_PIC;
> > > + }
> > > +#else
> > > + /*
> > > +  * On 32-bit, check whether there is a separate chip or integrated
> 
> Change the comment to:
> 
> On 32-bit, the APIC may be a separate chip (82489DX) or an integrated chip.
> If the BSP doesn't have the APIC feature, we can be sure there is no
> integrated chip, but cannot be sure there is no separate chip. So check
> both situations when the BSP doesn't have the APIC feature.
> 
> > > +  * APIC
> > > +  */
> > > +
> > > + /* Has a separate chip ? */
> 
> If there is also no SMP configuration, we can be sure there is no
> separate chip. Switch the interrupt delivery mode to APIC_PIC directly.
> 
> > > + if (!boot_cpu_has(X86_FEATURE_APIC) && !smp_found_config) {

Here, the most confusing thing to me is the '!smp_found_config'. Why
does 'smp_found_config' have anything to do with the APIC being separate
or integrated?

From the code, 'smp_found_config = 1' is set when processing the ACPI
MADT, or in smp_scan_config(). Do you have any finding that, if there is
no SMP config, a separate APIC chip must not exist?

Just out of curiosity: I know this is copied from APIC_init_uniprocessor(),
but I don't understand the logic clearly.

> > > + disable_apic = 1;
> > > +
> > > + return APIC_PIC;
> > > + }
> > > +
> > Here you said several times you are checking if APIC is integrated, but
> > you always check boot_cpu_has(X86_FEATURE_APIC), and you also check
> > smp_found_config in above case. Can you make the comment match the code?
> > 
> 
> Yes.
> 
> > E.g if (!boot_cpu_has(X86_FEATURE_APIC)), cpu doesn't support lapic,
> > just return, you can remove the CONFIG_X86_64 check to make it a common
> > check. And we have lapic_is_integrated() to check if lapic is integrated.
> > 
> I am sorry my comment may have confused you. Our target is checking whether
> the BIOS supports the APIC, no matter what type (separate/integrated) it is.
> 
> The new logic 1) as you said may look like:
> 
>   if (!boot_cpu_has(X86_FEATURE_APIC))
>   return ...
>   if (lapic_is_integrated())
>   return ...
> 
> Here we miss (!boot_cpu_has(X86_FEATURE_APIC) && smp_found_config) for
> a separate chip.
> 
> > Besides, we are saying lapic is integrated with ioapic in a single chip,
> > right? I found MP-Spec mention it. If yes, could you add more words to
> 
> Yes, the 82489DX: it was a discrete chip that functioned as both a local
> and an I/O APIC.
> 
> > make it more specific and precise? Then people can get the exact
> 
> Indeed, I will. Please see the modification of comments
> 
> > information from the comment and code.
> > 
> > Thanks
> > Baoquan
> > 
> > > + /* Has a local APIC ? */
> 
> Sanity check if the BIOS pretends there is one local APIC.
> 
> 
> Thanks,
>   dou.
> 
> > > + if (!boot_cpu_has(X86_FEATURE_APIC) &&
> > > + APIC_INTEGRATED(boot_cpu_apic_version)) {
> > > + disable_apic = 1;
> > > + pr_err(FW_BUG "Local APIC %d not detected, force emulation\n",
> > > +boot_cpu_physical_apicid);
> > > +
> > > + return APIC_PIC;
> > > + }
> > > +#endif
> > > +
> > > + /* Check MP table or ACPI MADT configuration */
> > > + if (!smp_found_config) {
> > > + disable_ioapic_support();
> > > +
> > > + if (!acpi_lapic)
> > > + pr_info("APIC: ACPI MADT or MP tables are not 
> > > detected\n");
> > > +
> > > + return APIC_VIRTUAL_WIRE;
> > > + }
> > > +
> > > + return APIC_SYMMETRIC_IO;
> > > +}
> > > +
> > >  /*
> > >   * An initial setup of the virtual wire mode.
> > >   */
> > > --
> > > 2.5.5
> > > 
> > > 
> > > 
> > 
> > 
> > 
> 
> 


Re: [PATCH v3] Input: goodix: Add support for capacitive home button

2017-09-06 Thread Jiri Kosina
On Wed, 6 Sep 2017, Bastien Nocera wrote:

> Hey,
> 
> On Tue, 2017-06-20 at 18:08 +0200, Bastien Nocera wrote:
> > From: "Sergei A. Trusov" 
> > 
> > On some x86 tablets with a Goodix touchscreen, the Windows logo on
> > the
> > front is a capacitive home button. Touching this button results in a
> > touch
> > with bit 4 of the first byte set, while only the lower 4 bits (0-3)
> > are
> > used to indicate the number of touches.
> > 
> > Report a KEY_LEFTMETA press when this happens.
> > 
> > Note that the hardware might support more than one button, in which
> > case the "id" byte of coor_data would identify the button in
> > question.
> > This is not implemented as we don't have access to hardware with
> > multiple buttons.
> > 
> > Signed-off-by: Sergei A. Trusov 
> > Acked-by: Bastien Nocera 
> 
> Can we please get this merged? I didn't receive any reviews on it, and
> it's been sitting there for 2 months...

Let's CC Dmitry.

-- 
Jiri Kosina
SUSE Labs



Re: Abysmal scheduler performance in Linus' tree?

2017-09-06 Thread Peter Zijlstra
On Wed, Sep 06, 2017 at 01:59:14AM -0700, Andy Lutomirski wrote:
> > Any specifics on your setup?
> 
> On further fiddling, I only see this after a suspend and resume cycle.

Ah, ok. That's not something I otherwise test. Let's see if I can force
this brick of mine through a suspend-resume cycle :-)


Re: [PATCH 1/2] pidmap(2)

2017-09-06 Thread Alexey Dobriyan
On 9/6/17, Randy Dunlap  wrote:
> On 09/05/17 15:53, Andrew Morton wrote:
>> On Tue, 5 Sep 2017 22:05:00 +0300 Alexey Dobriyan 
>> wrote:
>>
>>> Implement a system call for bulk retrieving of pids in binary form.
>>>
>>> Using /proc is slower than necessary: 3 syscalls + another 3 for each
>>> thread + converting with atoi().
>>>
>>> /proc may not be mounted, especially in containers. A natural extension
>>> of the hidepid=2 efforts is to not mount /proc at all.
>>>
>>> It could be used by programs like ps, top or CRIU. The speed increase
>>> will become more drastic once combined with bulk retrieval of process
>>> statistics.
>>
>> The patches are performance optimizations, but their changelogs contain
>> no performance measurements!
>>
>> Demonstration of some compelling real-world performance benefits would
>> help things along a lot.
>>
>
> also, I expect that the tiny kernel people will want kconfig options for
> these syscalls.

We'll add it, but the question is whether it is a good idea. Ideally these
system calls should be mandatory and /proc optional.

$ size kernel/pidmap.o fs/fdmap.o
   text    data     bss     dec     hex filename
    560       0       0     560     230 kernel/pidmap.o
    617       0       0     617     269 fs/fdmap.o


Re: [PATCH] HID: asus: Add support for Fn keys on Asus ROG G752

2017-09-06 Thread Jiri Kosina
On Sun, 20 Aug 2017, Maxime Bellengé wrote:

> This patch adds support for the Fn keys on the Asus ROG G752 laptop.
> The report descriptor is broken, so I fixed it.
> 
> Tested on an Asus G752VT.
> 
> Signed-off-by: Maxime Bellengé 

Looks good to me, but the patch is whitespace damaged (it adds only 
spaces, no tabs).

Could you please fix that up and resend (with Benjamin's Reviewed-by 
added)?

Thanks,

-- 
Jiri Kosina
SUSE Labs



[GIT PULL] MMC for v.4.14

2017-09-06 Thread Ulf Hansson
Hi Linus,

Here's the PR for MMC for v4.14. Details about the highlights are as usual found
in the signed tag.

Please pull this in!

Kind regards
Ulf Hansson


The following changes since commit 99c14fc360dbbb583a03ab985551b12b5c5ca4f1:

  mmc: sdhci-xenon: add set_power callback (2017-08-30 14:11:47 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc.git tags/mmc-v4.14

for you to fetch changes up to c16a854e4463078aedad601fac76341760a66dd1:

  mmc: renesas_sdhi: Add r8a7743/5 support (2017-09-01 15:31:01 +0200)


MMC core:
 - Continue to refactor the mmc block code to prepare for blkmq
 - Move mmc block debugfs into block module
 - Next step for eMMC CMDQ by adding a new mmc host interface for it
 - Move Kconfig option MMC_DEBUG from core to host
 - Some additional minor improvements

MMC host:
 - Declare structs as const when applicable
 - Explicitly request exclusive reset control when applicable
 - Improve some error paths and other various cleanups
 - sdhci: Preparations to support SDHCI OMAP
 - sdhci: Improve some PM related code
 - sdhci: Re-factoring and modernizations
 - sdhci-xenon: Add runtime PM and system sleep support
 - sdhci-xenon: Add support for eMMC HS400 Enhanced Strobe
 - sdhci-cadence: Add system sleep support
 - sdhci-of-at91: Improve system sleep support
 - dw_mmc: Add support for Hisilicon hi3660
 - sunxi: Add support for A83T eMMC
 - sunxi: Add support for DDR52 mode
 - meson-gx: Add support for UHS-I SD-cards
 - meson-gx: Cleanups and improvements
 - tmio: Fix CMD12 (STOP) handling
 - tmio: Cleanups and improvements
 - renesas_sdhi: Add r8a7743/5 support
 - renesas-sdhi: Add support for R-Car Gen3 SDHI DMAC
 - renesas_sdhi: Cleanups and improvements


Addy Ke (1):
  mmc: dw_mmc: introduce timer for broken command transfer over scheme

Adrian Hunter (7):
  mmc: core: Remove unused MMC_CAP2_PACKED_CMD
  mmc: core: Add mmc_retune_hold_now()
  mmc: core: Add members to mmc_request and mmc_data for CQE's
  mmc: host: Add CQE interface
  mmc: core: Turn off CQE before sending commands
  mmc: sdhci: Tidy reading 136-bit responses
  mmc: core: Move mmc_start_areq() declaration

Andy Shevchenko (2):
  sdhci: acpi: Use new method to get ACPI companion
  sdhci: pci: Fix up power if device has ACPI companion

Arnd Bergmann (1):
  mmc: test: reduce stack usage in mmc_test_nonblock_transfer

Arvind Yadav (7):
  mmc: sdhci-st: Handle return value of clk_prepare_enable
  mmc: omap_hsmmc: constify dev_pm_ops structures
  mmc: host: via-sdmmc: constify pci_device_id.
  mmc: wmt-sdmmc: Handle return value of clk_prepare_enable
  mmc: mxcmmc: Handle return value of clk_prepare_enable
  mmc: vub300: constify usb_device_id
  mmc: mmci: constify amba_id

Axel Lin (1):
  mmc: cavium-octeon: Convert to use module_platform_driver

Biju Das (2):
  mmc: renesas_sdhi: Add r8a7743/5 support
  mmc: renesas_sdhi: Add r8a7743/5 support

Chaotian Jing (1):
  mmc: mediatek: add ops->get_cd() support

Chen-Yu Tsai (8):
  clk: sunxi-ng: Add interface to query or configure MMC timing modes.
  clk: sunxi-ng: Add MP_MMC clocks that support MMC timing modes switching
  clk: sunxi-ng: a83t: Support new timing mode for mmc2 clock
  mmc: sunxi: Support controllers that can use both old and new timings
  mmc: sunxi: Support MMC DDR52 transfer mode with new timing mode
  mmc: sunxi: Add support for A83T eMMC (MMC2)
  mmc: sunxi: Fix NULL pointer reference on clk_delays
  mmc: sunxi: Fix clock rate passed to sunxi_mmc_clk_set_phase

Chris Paterson (1):
  dt-bindings: mmc: sh_mmcif: Document r8a7743 DT bindings

Colin Ian King (1):
  mmc: rtsx_usb_sdmmc: make array 'width' static const

Fabrizio Castro (1):
  dt-bindings: mmc: sh_mmcif: Document r8a7745 DT bindings

Gustavo A. R. Silva (2):
  mmc: android-goldfish: remove logically dead code in goldfish_mmc_irq()
  mmc: mxcmmc: fix error return code in mxcmci_probe()

Hu Ziji (2):
  mmc: sdhci-xenon: Add Xenon SDHCI specific system-level PM support
  mmc: sdhci-xenon: Support HS400 Enhanced Strobe feature

Ian Molton (1):
  MMC: Remove HIGHMEM dependency from mmc-spi driver

Icenowy Zheng (1):
  mmc: sunxi: fix support for new timings mode only SoCs

Ivan Mikhaylov (1):
  mmc: sdhci-st: add FSP2(ppc476fpe) into depends for sdhci-st

Jean-Francois Dagenais (1):
  mmc: sdhci-of-arasan: use io functions from sdhci.h

Jerome Brunet (17):
  mmc: meson-gx: fix mux mask definition
  mmc: meson-gx: remove CLK_DIVIDER_ALLOW_ZERO clock flag
  mmc: meson-gx: clean up some constants
  mmc: meson-gx: initialize sane clk default before clock register
  mmc: meson-gx: cfg init overwrite values
  mmc: meson-gx: rework set_ios f

Re: [PATCH v5 1/3] mfd: Add support for Cherry Trail Dollar Cove TI PMIC

2017-09-06 Thread Lee Jones
On Wed, 06 Sep 2017, Takashi Iwai wrote:

> On Wed, 06 Sep 2017 09:54:44 +0200,
> Lee Jones wrote:
> > 
> > On Tue, 05 Sep 2017, Takashi Iwai wrote:
> > 
> > > On Tue, 05 Sep 2017 10:53:41 +0200,
> > > Lee Jones wrote:
> > > > 
> > > > On Tue, 05 Sep 2017, Takashi Iwai wrote:
> > > > 
> > > > > On Tue, 05 Sep 2017 10:10:49 +0200,
> > > > > Lee Jones wrote:
> > > > > > 
> > > > > > On Tue, 05 Sep 2017, Takashi Iwai wrote:
> > > > > > 
> > > > > > > On Tue, 05 Sep 2017 09:24:51 +0200,
> > > > > > > Lee Jones wrote:
> > > > > > > > 
> > > > > > > > On Mon, 04 Sep 2017, Takashi Iwai wrote:
> > > > > > > > 
> > > > > > > > > This patch adds the MFD driver for Dollar Cove (TI version) 
> > > > > > > > > PMIC with
> > > > > > > > > ACPI INT33F5 that is found on some Intel Cherry Trail devices.
> > > > > > > > > The driver is based on the original work by Intel, found at:
> > > > > > > > >   https://github.com/01org/ProductionKernelQuilts
> > > > > > > > > 
> > > > > > > > > This is a minimal version for adding the basic resources.  
> > > > > > > > > Currently,
> > > > > > > > > only ACPI PMIC opregion and the external power-button are 
> > > > > > > > > used.
> > > > > > > > > 
> > > > > > > > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=193891
> > > > > > > > > Reviewed-by: Mika Westerberg 
> > > > > > > > > Reviewed-by: Andy Shevchenko 
> > > > > > > > > Signed-off-by: Takashi Iwai 
> > > > > > > > > ---
> > > > > > > > > v4->v5:
> > > > > > > > > * Minor coding-style fixes suggested by Lee
> > > > > > > > > * Put GPL text
> > > > > > > > > v3->v4:
> > > > > > > > > * no change for this patch
> > > > > > > > > v2->v3:
> > > > > > > > > * Rename dc_ti with chtdc_ti in all places
> > > > > > > > > * Driver/kconfig renames accordingly
> > > > > > > > > * Added acks by Andy and Mika
> > > > > > > > > v1->v2:
> > > > > > > > > * Minor cleanups as suggested by Andy
> > > > > > > > > 
> > > > > > > > >  drivers/mfd/Kconfig   |  13 +++
> > > > > > > > >  drivers/mfd/Makefile  |   1 +
> > > > > > > > >  drivers/mfd/intel_soc_pmic_chtdc_ti.c | 184 ++
> > > > > > > > >  3 files changed, 198 insertions(+)
> > > > > > > > >  create mode 100644 drivers/mfd/intel_soc_pmic_chtdc_ti.c
> > > > > > > > 
> > > > > > > > For my own reference:
> > > > > > > >   Acked-for-MFD-by: Lee Jones 
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > 
> > > > > > > Now the question is how to deal with these.  It's no critical 
> > > > > > > things,
> > > > > > > so I'm OK to postpone for 4.15.  OTOH, it's really a new
> > > > > > > device-specific stuff, thus it can't break anything else, and 
> > > > > > > it'd be
> > > > > > > fairly safe to add it for 4.14 although it's at a bit late stage.
> > > > > > 
> > > > > > Yes, you are over 2 weeks late for v4.14.  It will have to be v4.15.
> > > > > 
> > > > > OK, I'll ring your bells again once when 4.15 development is opened.
> > > > > 
> > > > > 
> > > > > > > IMO, it'd be great if you can carry all stuff through MFD tree; or
> > > > > > > create an immutable branch (again).  But how to handle it, when 
> > > > > > > to do
> > > > > > > it, It's all up to you guys.
> > > > > > 
> > > > > > If there aren't any build dependencies between the patches, each of
> > > > > > the patches should be applied through their own trees.  What are the
> > > > > > build-time dependencies?  Are there any?
> > > > > 
> > > > > No, there is no strict build-time dependency.  It's just that I don't
> > > > > see it nice to have a commit for a dead code, partly for testing
> > > > > purpose and partly for code consistency.  But if this makes
> > > > > maintenance easier, I'm happy with that, too, of course.
> > > > 
> > > > There won't be any dead code.  All of the subsystem trees are pulled
> > > > into -next [0] where the build bots can operate on the patches as a
> > > > whole.
> > > 
> > > But the merge order isn't guaranteed, i.e. at the commit in the other
> > > tree for this new stuff, it's dead code without the MFD stuff merged
> > > beforehand.  E.g. imagine performing a git bisection.  It's not
> > > about the whole tree, but about each commit.
> > 
> > Only *building* is relevant for bisection until the whole feature
> > lands.
> 
> Why only building?
> 
> When merging through several trees, commits for the same series are
> scattered completely although they are softly tied.  This sucks when
> you perform git bisection, e.g. if you have an issue in the middle of
> the patch series.  It still works, but it jumps unnecessarily far
> away and back before reaching the point, and the kconfig appears /
> disappears inconsistently (the dependent kconfig is gone in the middle).
> And this is about the release kernel (4.15 or whatever).

Think about how bisection works.  You state a good commit and a bad
one.  The good commit will be when the feature last worked, which will
not be until the feature has fully landed

Re: [PATCH] HID: multitouch: Support ALPS PTP stick with pid 0x120A

2017-09-06 Thread Jiri Kosina
On Thu, 10 Aug 2017, Shrirang Bagul wrote:

> This patch adds ALPS PTP sticks with pid/device id 0x120A to the list of
> devices supported by hid-multitouch.
> 
> Signed-off-by: Shrirang Bagul 

Applied to for-4.14/upstream-fixes. Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: a competition when some threads acquire futex

2017-09-06 Thread Peter Zijlstra
On Wed, Sep 06, 2017 at 10:56:08AM +0200, Peter Zijlstra wrote:

> Right, but even if it was a coherent patch, I'm not sure it makes sense.
> 
> futex_wait() / futex_wake() don't make ordering guarantees and in
> general you don't get to have wakeup preemption if you don't run a
> PREEMPT kernel.
> 
> So what makes this wakeup so special? Any changelog would need to have a
> convincing argument.

Also, even on !PREEMPT, if that wakeup sets NEED_RESCHED, the return to
userspace after futex_wake() should reschedule.


So I'm really not getting it.


Re: [PATCH v3 51/59] KVM: arm/arm64: GICv4: Add doorbell interrupt handling

2017-09-06 Thread Shannon Zhao


On 2017/8/1 1:26, Marc Zyngier wrote:
> When a vPE is not running, a VLPI being made pending results in a
> doorbell interrupt being delivered. Let's handle this interrupt
> and update the pending_last flag that indicates that VLPIs are
> pending. The corresponding vcpu is also kicked into action.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  virt/kvm/arm/vgic/vgic-v4.c | 34 ++
>  1 file changed, 34 insertions(+)
> 
> diff --git a/virt/kvm/arm/vgic/vgic-v4.c b/virt/kvm/arm/vgic/vgic-v4.c
> index 534d3051a078..6af3cde6d7d4 100644
> --- a/virt/kvm/arm/vgic/vgic-v4.c
> +++ b/virt/kvm/arm/vgic/vgic-v4.c
> @@ -21,6 +21,19 @@
>  
>  #include "vgic.h"
>  
> +static irqreturn_t vgic_v4_doorbell_handler(int irq, void *info)
> +{
> + struct kvm_vcpu *vcpu = info;
> +
> + if (!kvm_vgic_vcpu_pending_irq(vcpu)) {
> + vcpu->arch.vgic_cpu.vgic_v3.its_vpe.pending_last = true;
> + kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
> + kvm_vcpu_kick(vcpu);
> + }
> +
> + return IRQ_HANDLED;
> +}
> +
>  int vgic_v4_init(struct kvm *kvm)
>  {
>   struct vgic_dist *dist = &kvm->arch.vgic;
> @@ -57,16 +70,37 @@ int vgic_v4_init(struct kvm *kvm)
>   return ret;
>   }
>  
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + int irq = dist->its_vm.vpes[i]->irq;
> +
> + ret = request_irq(irq, vgic_v4_doorbell_handler,
> +   0, "vcpu", vcpu);
> + if (ret) {
> + kvm_err("failed to allocate vcpu IRQ%d\n", irq);
> + dist->its_vm.nr_vpes = i;
This overwrites nr_vpes, while its_alloc_vcpu_irqs uses
kvm->online_vcpus to allocate the irqs; if this fails, its_free_vcpu_irqs
uses the overwritten nr_vpes rather than kvm->online_vcpus to free the
irqs. So there will be a memory leak on the error path.

> + break;
> + }
> + }
> +
> + if (ret)
> + vgic_v4_teardown(kvm);
> +
>   return ret;
>  }
>  
>  void vgic_v4_teardown(struct kvm *kvm)
>  {
>   struct its_vm *its_vm = &kvm->arch.vgic.its_vm;
> + int i;
>  
>   if (!its_vm->vpes)
>   return;
>  
> + for (i = 0; i < its_vm->nr_vpes; i++) {
> + struct kvm_vcpu *vcpu = kvm_get_vcpu(kvm, i);
> + free_irq(its_vm->vpes[i]->irq, vcpu);
> + }
> +
>   its_free_vcpu_irqs(its_vm);
>   kfree(its_vm->vpes);
>   its_vm->nr_vpes = 0;
> 

Thanks,
-- 
Shannon



Re: [PATCH 6/10] ixgbe: Use ARRAY_SIZE macro

2017-09-06 Thread Thomas Meyer
On Tue, Sep 05, 2017 at 02:22:05PM -0700, David Miller wrote:
> From: Joe Perches 
> Date: Tue, 05 Sep 2017 13:01:18 -0700
> 
> > On Tue, 2017-09-05 at 21:45 +0200, Thomas Meyer wrote:
> >> On Tue, Sep 05, 2017 at 11:50:44AM -0700, David Miller wrote:
> >> > From: Thomas Meyer 
> >> > Date: Sun, 03 Sep 2017 14:19:31 +0200
> >> > 
> >> > > Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
> >> > > yourself.
> >> > > Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
> >> > > 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\/\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)/ARRAY_SIZE(\1)/g'
> >> > > and manual check/verification.
> >> > > 
> >> > > Signed-off-by: Thomas Meyer 
> >> > 
> >> > This should be submitted to the Intel ethernet driver maintainers.
> >> 
> >> Hi,
> >> 
> >> My script checks the output of the get_maintainer script and only sends to
> >> "open list" entries.
> >> 
> >> The intel-wired-...@lists.osuosl.org list is moderated, so that's why the
> >> patch wasn't sent there.
> >> 
> >> Strangely, the lists nouv...@lists.freedesktop.org and
> >> intel-gvt-...@lists.freedesktop.org appear as open lists in the MAINTAINERS
> >> file but seem to also be moderated lists... At least I got some reply that
> >> my message awaits approval. Maybe an update to the MAINTAINERS file is
> >> missing here?
> >> 
> >> I may drop the above check in my script and send to all mailing lists that
> >> get_maintainer.pl returns.
> > 
> > There's a difference between moderated and subscriber-only
> > entries in MAINTAINERS.
> > 
> > get_maintainers will by default list moderated lists and
> > not show subscriber-only lists unless using the -s switch.
> 
> Furthermore, nothing prevented you from CC:'ing the maintainer,
> Jeff Kirscher.

Hi,

That's the other condition in my script: I only send to the
"maintainer" role from the output of get_maintainer.pl. But Mr Jeff
Kirscher is only listed as a supporter...

Anyway, I did bounce the email to him.

with kind regards
thomas


[PATCH v13 0/5] Replace PCI pool by DMA pool API

2017-09-06 Thread Romain Perier
The current PCI pool API is a set of simple macro functions directly
expanded to the appropriate DMA pool functions. The prototypes are almost
the same and, semantically, they are very similar. I propose to use the
DMA pool API directly and get rid of the old API.

This set of patches replaces the old API with the DMA pool API
and removes the defines.

Changes in v13:
- Rebased series onto next-20170906
- Added a new commit for the hinic ethernet driver
- Remove previously merged patches

Changes in v12:
- Rebased series onto next-20170822

Changes in v11:
- Rebased series onto next-20170809
- Removed patches 08-14, these have been merged.

Changes in v10:
- Rebased series onto next-20170706
- I have fixed and improved patch "scsi: megaraid: Replace PCI pool old API"

Changes in v9:
- Rebased series onto next-20170522
- I have fixed and improved the patch for lpfc driver

Changes in v8:
- Rebased series onto next-20170428

Changes in v7:
- Rebased series onto next-20170416
- Added Acked-by, Tested-by and Reviewed-by tags

Changes in v6:
- Fixed an issue reported by kbuild test robot about changes in DAC960
- Removed patches 15/19,16/19,17/19,18/19. They have been merged by Greg
- Added Acked-by Tags

Changes in v5:
- Re-worded the cover letter (remove sentence about checkpatch.pl)
- Rebased series onto next-20170308
- Fix typos in commit message
- Added Acked-by Tags

Changes in v4:
- Rebased series onto next-20170301
- Removed patch 20/20: checks done by checkpatch.pl, no longer required.
  Thanks to Peter and Joe for their feedbacks.
- Added Reviewed-by tags

Changes in v3:
- Rebased series onto next-20170224
- Fix checkpatch.pl reports for patch 11/20 and patch 12/20
- Remove prefix RFC

Changes in v2:
- Introduced patch 18/20
- Fixed cosmetic issues: spaces before braces, lines over 80 characters
- Removed some of the check for NULL pointers before calling dma_pool_destroy
- Improved the regexp in checkpatch for pci_pool, thanks to Joe Perches
- Added Tested-by and Acked-by tags


Romain Perier (5):
  block: DAC960: Replace PCI pool old API
  dmaengine: pch_dma: Replace PCI pool old API
  net: e100: Replace PCI pool old API
  hinic: Replace PCI pool old API
  PCI: Remove PCI pool macro functions

 drivers/block/DAC960.c| 38 +++
 drivers/block/DAC960.h|  4 +--
 drivers/dma/pch_dma.c | 12 +++
 drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c | 10 +++---
 drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h |  2 +-
 drivers/net/ethernet/intel/e100.c | 12 +++
 include/linux/pci.h   |  9 --
 7 files changed, 38 insertions(+), 49 deletions(-)

-- 
2.11.0



[PATCH v13 5/5] PCI: Remove PCI pool macro functions

2017-09-06 Thread Romain Perier
Now that all the drivers use the DMA pool API, we can remove the
PCI pool macro functions.

Signed-off-by: Romain Perier 
Reviewed-by: Peter Senna Tschudin 
---
 include/linux/pci.h | 9 -
 1 file changed, 9 deletions(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index f68c58a93dd0..89dfc277a6c6 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1322,15 +1322,6 @@ int pci_set_vga_state(struct pci_dev *pdev, bool decode,
 #include 
 #include 
 
-#define	pci_pool dma_pool
-#define pci_pool_create(name, pdev, size, align, allocation) \
-		dma_pool_create(name, &pdev->dev, size, align, allocation)
-#define	pci_pool_destroy(pool) dma_pool_destroy(pool)
-#define	pci_pool_alloc(pool, flags, handle) dma_pool_alloc(pool, flags, handle)
-#define	pci_pool_zalloc(pool, flags, handle) \
-		dma_pool_zalloc(pool, flags, handle)
-#define	pci_pool_free(pool, vaddr, addr) dma_pool_free(pool, vaddr, addr)
-
 struct msix_entry {
u32 vector; /* kernel uses to write allocated vector */
u16 entry;  /* driver uses to specify entry, OS writes */
-- 
2.11.0



[PATCH v13 2/5] dmaengine: pch_dma: Replace PCI pool old API

2017-09-06 Thread Romain Perier
The PCI pool API is deprecated. This commit replaces the old PCI pool
API calls with the appropriate DMA pool API functions.

Signed-off-by: Romain Perier 
Acked-by: Peter Senna Tschudin 
Tested-by: Peter Senna Tschudin 
---
 drivers/dma/pch_dma.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/dma/pch_dma.c b/drivers/dma/pch_dma.c
index f9028e9d0dfc..afd8f27bda96 100644
--- a/drivers/dma/pch_dma.c
+++ b/drivers/dma/pch_dma.c
@@ -123,7 +123,7 @@ struct pch_dma_chan {
 struct pch_dma {
struct dma_device   dma;
void __iomem *membase;
-   struct pci_pool *pool;
+   struct dma_pool *pool;
struct pch_dma_regs regs;
struct pch_dma_desc_regs ch_regs[MAX_CHAN_NR];
struct pch_dma_chan channels[MAX_CHAN_NR];
@@ -437,7 +437,7 @@ static struct pch_dma_desc *pdc_alloc_desc(struct dma_chan *chan, gfp_t flags)
struct pch_dma *pd = to_pd(chan->device);
dma_addr_t addr;
 
-   desc = pci_pool_zalloc(pd->pool, flags, &addr);
+   desc = dma_pool_zalloc(pd->pool, flags, &addr);
if (desc) {
INIT_LIST_HEAD(&desc->tx_list);
dma_async_tx_descriptor_init(&desc->txd, chan);
@@ -549,7 +549,7 @@ static void pd_free_chan_resources(struct dma_chan *chan)
spin_unlock_irq(&pd_chan->lock);
 
list_for_each_entry_safe(desc, _d, &tmp_list, desc_node)
-   pci_pool_free(pd->pool, desc, desc->txd.phys);
+   dma_pool_free(pd->pool, desc, desc->txd.phys);
 
pdc_enable_irq(chan, 0);
 }
@@ -880,7 +880,7 @@ static int pch_dma_probe(struct pci_dev *pdev,
goto err_iounmap;
}
 
-   pd->pool = pci_pool_create("pch_dma_desc_pool", pdev,
+   pd->pool = dma_pool_create("pch_dma_desc_pool", &pdev->dev,
   sizeof(struct pch_dma_desc), 4, 0);
if (!pd->pool) {
dev_err(&pdev->dev, "Failed to alloc DMA descriptors\n");
@@ -931,7 +931,7 @@ static int pch_dma_probe(struct pci_dev *pdev,
return 0;
 
 err_free_pool:
-   pci_pool_destroy(pd->pool);
+   dma_pool_destroy(pd->pool);
 err_free_irq:
free_irq(pdev->irq, pd);
 err_iounmap:
@@ -963,7 +963,7 @@ static void pch_dma_remove(struct pci_dev *pdev)
tasklet_kill(&pd_chan->tasklet);
}
 
-   pci_pool_destroy(pd->pool);
+   dma_pool_destroy(pd->pool);
pci_iounmap(pdev, pd->membase);
pci_release_regions(pdev);
pci_disable_device(pdev);
-- 
2.11.0



[PATCH v13 4/5] hinic: Replace PCI pool old API

2017-09-06 Thread Romain Perier
The PCI pool API is deprecated. This commit replaces the old PCI pool
API calls with the appropriate DMA pool API functions.

Signed-off-by: Romain Perier 
---
 drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c | 10 +-
 drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c
index 7d95f0866fb0..28a81ac97af5 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c
@@ -143,7 +143,7 @@ int hinic_alloc_cmdq_buf(struct hinic_cmdqs *cmdqs,
struct hinic_hwif *hwif = cmdqs->hwif;
struct pci_dev *pdev = hwif->pdev;
 
-   cmdq_buf->buf = pci_pool_alloc(cmdqs->cmdq_buf_pool, GFP_KERNEL,
+   cmdq_buf->buf = dma_pool_alloc(cmdqs->cmdq_buf_pool, GFP_KERNEL,
   &cmdq_buf->dma_addr);
if (!cmdq_buf->buf) {
dev_err(&pdev->dev, "Failed to allocate cmd from the pool\n");
@@ -161,7 +161,7 @@ int hinic_alloc_cmdq_buf(struct hinic_cmdqs *cmdqs,
 void hinic_free_cmdq_buf(struct hinic_cmdqs *cmdqs,
 struct hinic_cmdq_buf *cmdq_buf)
 {
-   pci_pool_free(cmdqs->cmdq_buf_pool, cmdq_buf->buf, cmdq_buf->dma_addr);
+   dma_pool_free(cmdqs->cmdq_buf_pool, cmdq_buf->buf, cmdq_buf->dma_addr);
 }
 
 static unsigned int cmdq_wqe_size_from_bdlen(enum bufdesc_len len)
@@ -875,7 +875,7 @@ int hinic_init_cmdqs(struct hinic_cmdqs *cmdqs, struct hinic_hwif *hwif,
int err;
 
cmdqs->hwif = hwif;
-   cmdqs->cmdq_buf_pool = pci_pool_create("hinic_cmdq", pdev,
+   cmdqs->cmdq_buf_pool = dma_pool_create("hinic_cmdq", &pdev->dev,
   HINIC_CMDQ_BUF_SIZE,
   HINIC_CMDQ_BUF_SIZE, 0);
if (!cmdqs->cmdq_buf_pool)
@@ -916,7 +916,7 @@ int hinic_init_cmdqs(struct hinic_cmdqs *cmdqs, struct hinic_hwif *hwif,
devm_kfree(&pdev->dev, cmdqs->saved_wqs);
 
 err_saved_wqs:
-   pci_pool_destroy(cmdqs->cmdq_buf_pool);
+   dma_pool_destroy(cmdqs->cmdq_buf_pool);
return err;
 }
 
@@ -942,5 +942,5 @@ void hinic_free_cmdqs(struct hinic_cmdqs *cmdqs)
 
devm_kfree(&pdev->dev, cmdqs->saved_wqs);
 
-   pci_pool_destroy(cmdqs->cmdq_buf_pool);
+   dma_pool_destroy(cmdqs->cmdq_buf_pool);
 }
diff --git a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h
index b35583400cb6..23f8d39eab68 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h
+++ b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h
@@ -157,7 +157,7 @@ struct hinic_cmdq {
 struct hinic_cmdqs {
struct hinic_hwif   *hwif;
 
-   struct pci_pool *cmdq_buf_pool;
+   struct dma_pool *cmdq_buf_pool;
 
struct hinic_wq *saved_wqs;
 
-- 
2.11.0



[PATCH v13 3/5] net: e100: Replace PCI pool old API

2017-09-06 Thread Romain Perier
The PCI pool API is deprecated. This commit replaces the old
PCI pool API calls with their DMA pool API equivalents.

Signed-off-by: Romain Perier 
Acked-by: Peter Senna Tschudin 
Acked-by: Jeff Kirsher 
Tested-by: Peter Senna Tschudin 
---
 drivers/net/ethernet/intel/e100.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e100.c b/drivers/net/ethernet/intel/e100.c
index 4d10270ddf8f..d1470d30351c 100644
--- a/drivers/net/ethernet/intel/e100.c
+++ b/drivers/net/ethernet/intel/e100.c
@@ -607,7 +607,7 @@ struct nic {
struct mem *mem;
dma_addr_t dma_addr;
 
-   struct pci_pool *cbs_pool;
+   struct dma_pool *cbs_pool;
dma_addr_t cbs_dma_addr;
u8 adaptive_ifs;
u8 tx_threshold;
@@ -1892,7 +1892,7 @@ static void e100_clean_cbs(struct nic *nic)
nic->cb_to_clean = nic->cb_to_clean->next;
nic->cbs_avail++;
}
-   pci_pool_free(nic->cbs_pool, nic->cbs, nic->cbs_dma_addr);
+   dma_pool_free(nic->cbs_pool, nic->cbs, nic->cbs_dma_addr);
nic->cbs = NULL;
nic->cbs_avail = 0;
}
@@ -1910,7 +1910,7 @@ static int e100_alloc_cbs(struct nic *nic)
nic->cb_to_use = nic->cb_to_send = nic->cb_to_clean = NULL;
nic->cbs_avail = 0;
 
-   nic->cbs = pci_pool_alloc(nic->cbs_pool, GFP_KERNEL,
+   nic->cbs = dma_pool_alloc(nic->cbs_pool, GFP_KERNEL,
  &nic->cbs_dma_addr);
if (!nic->cbs)
return -ENOMEM;
@@ -2961,8 +2961,8 @@ static int e100_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
netif_err(nic, probe, nic->netdev, "Cannot register net device, aborting\n");
goto err_out_free;
}
-   nic->cbs_pool = pci_pool_create(netdev->name,
-  nic->pdev,
+   nic->cbs_pool = dma_pool_create(netdev->name,
+  &nic->pdev->dev,
   nic->params.cbs.max * sizeof(struct cb),
   sizeof(u32),
   0);
@@ -3002,7 +3002,7 @@ static void e100_remove(struct pci_dev *pdev)
unregister_netdev(netdev);
e100_free(nic);
pci_iounmap(pdev, nic->csr);
-   pci_pool_destroy(nic->cbs_pool);
+   dma_pool_destroy(nic->cbs_pool);
free_netdev(netdev);
pci_release_regions(pdev);
pci_disable_device(pdev);
-- 
2.11.0



Re: a competition when some threads acquire futex

2017-09-06 Thread chengjian (D)

On 2017/9/6 16:36, Thomas Gleixner wrote:

Ok. Still that patch has issues.

1) It's white space damaged. Please use TAB not spaces for
indentation. checkpatch.pl would have told you.

2) Why are you using _cond_resched() instead of plain cond_resched()?

cond_resched() is what you want to use.



Hi Thomas.

I am very sorry about the issues in my patch.
I am a kernel newbie and I had set expandtab in my .vimrc,
which converts tab characters to spaces.
I have fixed it and will check my patches before sending next time.


Thanks.
   Cheng Jian



[PATCH v13 1/5] block: DAC960: Replace PCI pool old API

2017-09-06 Thread Romain Perier
The PCI pool API is deprecated. This commit replaces the old
PCI pool API calls with their DMA pool API equivalents.

Signed-off-by: Romain Perier 
Acked-by: Peter Senna Tschudin 
Tested-by: Peter Senna Tschudin 
---
 drivers/block/DAC960.c | 38 ++
 drivers/block/DAC960.h |  4 ++--
 2 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/drivers/block/DAC960.c b/drivers/block/DAC960.c
index 255591ab3716..2a8950ee382c 100644
--- a/drivers/block/DAC960.c
+++ b/drivers/block/DAC960.c
@@ -268,17 +268,17 @@ static bool DAC960_CreateAuxiliaryStructures(DAC960_Controller_T *Controller)
   void *AllocationPointer = NULL;
   void *ScatterGatherCPU = NULL;
   dma_addr_t ScatterGatherDMA;
-  struct pci_pool *ScatterGatherPool;
+  struct dma_pool *ScatterGatherPool;
   void *RequestSenseCPU = NULL;
   dma_addr_t RequestSenseDMA;
-  struct pci_pool *RequestSensePool = NULL;
+  struct dma_pool *RequestSensePool = NULL;
 
   if (Controller->FirmwareType == DAC960_V1_Controller)
 {
   CommandAllocationLength = offsetof(DAC960_Command_T, V1.EndMarker);
   CommandAllocationGroupSize = DAC960_V1_CommandAllocationGroupSize;
-  ScatterGatherPool = pci_pool_create("DAC960_V1_ScatterGather",
-   Controller->PCIDevice,
+  ScatterGatherPool = dma_pool_create("DAC960_V1_ScatterGather",
+   &Controller->PCIDevice->dev,
DAC960_V1_ScatterGatherLimit * sizeof(DAC960_V1_ScatterGatherSegment_T),
sizeof(DAC960_V1_ScatterGatherSegment_T), 0);
   if (ScatterGatherPool == NULL)
@@ -290,18 +290,18 @@ static bool DAC960_CreateAuxiliaryStructures(DAC960_Controller_T *Controller)
 {
   CommandAllocationLength = offsetof(DAC960_Command_T, V2.EndMarker);
   CommandAllocationGroupSize = DAC960_V2_CommandAllocationGroupSize;
-  ScatterGatherPool = pci_pool_create("DAC960_V2_ScatterGather",
-   Controller->PCIDevice,
+  ScatterGatherPool = dma_pool_create("DAC960_V2_ScatterGather",
+   &Controller->PCIDevice->dev,
DAC960_V2_ScatterGatherLimit * sizeof(DAC960_V2_ScatterGatherSegment_T),
sizeof(DAC960_V2_ScatterGatherSegment_T), 0);
   if (ScatterGatherPool == NULL)
return DAC960_Failure(Controller,
"AUXILIARY STRUCTURE CREATION (SG)");
-  RequestSensePool = pci_pool_create("DAC960_V2_RequestSense",
-   Controller->PCIDevice, sizeof(DAC960_SCSI_RequestSense_T),
+  RequestSensePool = dma_pool_create("DAC960_V2_RequestSense",
+   &Controller->PCIDevice->dev, sizeof(DAC960_SCSI_RequestSense_T),
sizeof(int), 0);
   if (RequestSensePool == NULL) {
-   pci_pool_destroy(ScatterGatherPool);
+   dma_pool_destroy(ScatterGatherPool);
return DAC960_Failure(Controller,
"AUXILIARY STRUCTURE CREATION (SG)");
   }
@@ -335,16 +335,16 @@ static bool DAC960_CreateAuxiliaryStructures(DAC960_Controller_T *Controller)
   Command->Next = Controller->FreeCommands;
   Controller->FreeCommands = Command;
   Controller->Commands[CommandIdentifier-1] = Command;
-  ScatterGatherCPU = pci_pool_alloc(ScatterGatherPool, GFP_ATOMIC,
+  ScatterGatherCPU = dma_pool_alloc(ScatterGatherPool, GFP_ATOMIC,
&ScatterGatherDMA);
   if (ScatterGatherCPU == NULL)
  return DAC960_Failure(Controller, "AUXILIARY STRUCTURE CREATION");
 
   if (RequestSensePool != NULL) {
- RequestSenseCPU = pci_pool_alloc(RequestSensePool, GFP_ATOMIC,
+ RequestSenseCPU = dma_pool_alloc(RequestSensePool, GFP_ATOMIC,
&RequestSenseDMA);
  if (RequestSenseCPU == NULL) {
-pci_pool_free(ScatterGatherPool, ScatterGatherCPU,
+dma_pool_free(ScatterGatherPool, ScatterGatherCPU,
 ScatterGatherDMA);
return DAC960_Failure(Controller,
"AUXILIARY STRUCTURE CREATION");
@@ -379,8 +379,8 @@ static bool DAC960_CreateAuxiliaryStructures(DAC960_Controller_T *Controller)
 static void DAC960_DestroyAuxiliaryStructures(DAC960_Controller_T *Controller)
 {
   int i;
-  struct pci_pool *ScatterGatherPool = Controller->ScatterGatherPool;
-  struct pci_pool *RequestSensePool = NULL;
+  struct dma_pool *ScatterGatherPool = Controller->ScatterGatherPool;
+  struct dma_pool *RequestSensePool = NULL;
   void *ScatterGatherCPU;
   dma_addr_t ScatterGatherDMA;
   void *RequestSenseCPU;
@@ -411,9 +411,9 @@ static void DAC960_DestroyAuxiliaryStructures(DAC960_Controller_T *Controller)
  RequestSenseDMA = Command->V2.RequestSenseDMA;
   }
   if (ScatterGatherCPU != NULL)
-  pci_pool_free(ScatterGatherPool, ScatterGatherCPU, ScatterGatherDMA);
+  dma_pool_free(ScatterGatherPool, ScatterGatherCPU, ScatterGatherDMA);
  

[PATCH v4 05/11] libsas: Use dynamic alloced work to avoid sas event lost

2017-09-06 Thread Jason Yan
Currently libsas hotplug works are static: every sas event type has its own
static work, and the LLDD driver queues the hotplug work into shost->work_q.
If the LLDD driver posts a burst of hotplug events to libsas, the hotplug
events may be left pending in the workqueue like

shost->work_q
new work[PORTE_BYTES_DMAED] --> |[PHYE_LOSS_OF_SIGNAL][PORTE_BYTES_DMAED] -> processing
|<--- wait for worker to process --->|

In this case, when a new PORTE_BYTES_DMAED event arrives, libsas tries to
queue it to shost->work_q, but since that work is already pending, the event
is lost. Finally, libsas deletes the related sas port and sas devices, while
the LLDD driver expects libsas to add the sas port and devices (per the last
sas event).

This patch uses dynamically allocated works to avoid this issue.

Signed-off-by: Yijing Wang 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
Signed-off-by: Jason Yan 
---
 drivers/scsi/libsas/sas_event.c| 75 +-
 drivers/scsi/libsas/sas_init.c | 27 --
 drivers/scsi/libsas/sas_internal.h |  6 +++
 drivers/scsi/libsas/sas_phy.c  | 44 +-
 drivers/scsi/libsas/sas_port.c | 18 -
 include/scsi/libsas.h  | 16 +---
 6 files changed, 115 insertions(+), 71 deletions(-)

diff --git a/drivers/scsi/libsas/sas_event.c b/drivers/scsi/libsas/sas_event.c
index 3e225ef..35412d9 100644
--- a/drivers/scsi/libsas/sas_event.c
+++ b/drivers/scsi/libsas/sas_event.c
@@ -29,7 +29,8 @@
 
 int sas_queue_work(struct sas_ha_struct *ha, struct sas_work *sw)
 {
-   int rc = 0;
+   /* it's added to the defer_q when draining so return succeed */
+   int rc = 1;
 
if (!test_bit(SAS_HA_REGISTERED, &ha->state))
return 0;
@@ -44,19 +45,15 @@ int sas_queue_work(struct sas_ha_struct *ha, struct sas_work *sw)
return rc;
 }
 
-static int sas_queue_event(int event, unsigned long *pending,
-   struct sas_work *work,
+static int sas_queue_event(int event, struct sas_work *work,
struct sas_ha_struct *ha)
 {
-   int rc = 0;
+   unsigned long flags;
+   int rc;
 
-   if (!test_and_set_bit(event, pending)) {
-   unsigned long flags;
-
-   spin_lock_irqsave(&ha->lock, flags);
-   rc = sas_queue_work(ha, work);
-   spin_unlock_irqrestore(&ha->lock, flags);
-   }
+   spin_lock_irqsave(&ha->lock, flags);
+   rc = sas_queue_work(ha, work);
+   spin_unlock_irqrestore(&ha->lock, flags);
 
return rc;
 }
@@ -66,6 +63,7 @@ void __sas_drain_work(struct sas_ha_struct *ha)
 {
struct workqueue_struct *wq = ha->core.shost->work_q;
struct sas_work *sw, *_sw;
+   int ret;
 
set_bit(SAS_HA_DRAINING, &ha->state);
/* flush submitters */
@@ -78,7 +76,11 @@ void __sas_drain_work(struct sas_ha_struct *ha)
clear_bit(SAS_HA_DRAINING, &ha->state);
list_for_each_entry_safe(sw, _sw, &ha->defer_q, drain_node) {
list_del_init(&sw->drain_node);
-   sas_queue_work(ha, sw);
+   ret = sas_queue_work(ha, sw);
+   if (ret != 1) {
+   struct asd_sas_event *ev = to_asd_sas_event(&sw->work);
+   sas_free_event(ev);
+   }
}
spin_unlock_irq(&ha->lock);
 }
@@ -119,29 +121,68 @@ void sas_enable_revalidation(struct sas_ha_struct *ha)
if (!test_and_clear_bit(ev, &d->pending))
continue;
 
-   sas_queue_event(ev, &d->pending, &d->disc_work[ev].work, ha);
+   sas_queue_event(ev, &d->disc_work[ev].work, ha);
}
mutex_unlock(&ha->disco_mutex);
 }
 
+
+static void sas_port_event_worker(struct work_struct *work)
+{
+   struct asd_sas_event *ev = to_asd_sas_event(work);
+
+   sas_port_event_fns[ev->event](work);
+   sas_free_event(ev);
+}
+
+static void sas_phy_event_worker(struct work_struct *work)
+{
+   struct asd_sas_event *ev = to_asd_sas_event(work);
+
+   sas_phy_event_fns[ev->event](work);
+   sas_free_event(ev);
+}
+
 static int sas_notify_port_event(struct asd_sas_phy *phy, enum port_event event)
 {
+   struct asd_sas_event *ev;
struct sas_ha_struct *ha = phy->ha;
+   int ret;
 
BUG_ON(event >= PORT_NUM_EVENTS);
 
-   return sas_queue_event(event, &phy->port_events_pending,
-  &phy->port_events[event].work, ha);
+   ev = sas_alloc_event(phy);
+   if (!ev)
+   return -ENOMEM;
+
+   INIT_SAS_EVENT(ev, sas_port_event_worker, phy, event);
+
+   ret = sas_queue_event(event, &ev->work, ha);
+   if (ret != 1)
+   sas_free_event(ev);
+
+   return ret;
 }
 
 int sas_notify_phy_event(struct asd_sas_phy *phy, enum phy_event event)
 {
+   struct asd_sas_event *ev;
struct sas_ha_s

[PATCH v4 04/11] libsas: rename notify_port_event() for consistency

2017-09-06 Thread Jason Yan
Rename notify_port_event() to sas_notify_port_event() for consistency
with sas_notify_phy_event().

Signed-off-by: Jason Yan 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
---
 drivers/scsi/libsas/sas_event.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libsas/sas_event.c b/drivers/scsi/libsas/sas_event.c
index 70c4653..3e225ef 100644
--- a/drivers/scsi/libsas/sas_event.c
+++ b/drivers/scsi/libsas/sas_event.c
@@ -124,7 +124,7 @@ void sas_enable_revalidation(struct sas_ha_struct *ha)
mutex_unlock(&ha->disco_mutex);
 }
 
-static int notify_port_event(struct asd_sas_phy *phy, enum port_event event)
+static int sas_notify_port_event(struct asd_sas_phy *phy, enum port_event event)
 {
struct sas_ha_struct *ha = phy->ha;
 
@@ -146,7 +146,7 @@ int sas_notify_phy_event(struct asd_sas_phy *phy, enum phy_event event)
 
 int sas_init_events(struct sas_ha_struct *sas_ha)
 {
-   sas_ha->notify_port_event = notify_port_event;
+   sas_ha->notify_port_event = sas_notify_port_event;
sas_ha->notify_phy_event = sas_notify_phy_event;
 
return 0;
-- 
2.5.0



[PATCH v4 08/11] libsas: Use new workqueue to run sas event and disco event

2017-09-06 Thread Jason Yan
Currently all libsas works are queued to the scsi host workqueue,
including the sas event works posted by the LLDD and the sas discovery
works, and a sas hotplug flow may be divided into several works.
E.g. when libsas receives a PORTE_BYTES_DMAED event, we currently
process it with the following steps:
sas_form_port  --- run in work in shost workq
sas_discover_domain  --- run in another work in shost workq
...
sas_probe_devices  --- run in new work in shost workq
We found that during hot-add of a device, libsas may need to run several
works in the same workqueue to add the device to the system. The process
is not atomic, so it may be interrupted by other sas event works, such as
PHYE_LOSS_OF_SIGNAL.

This patch is preparation for executing libsas sas events synchronously.
We need different workqueues to run the sas events and the disco events;
otherwise a work would block waiting for another chained work in the same
workqueue.

Signed-off-by: Yijing Wang 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
Signed-off-by: Jason Yan 
---
 drivers/scsi/libsas/sas_discover.c |  2 +-
 drivers/scsi/libsas/sas_event.c|  4 ++--
 drivers/scsi/libsas/sas_init.c | 18 ++
 include/scsi/libsas.h  |  3 +++
 4 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index 60de662..14f714d 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -534,7 +534,7 @@ static void sas_chain_work(struct sas_ha_struct *ha, struct sas_work *sw)
 * workqueue, or known to be submitted from a context that is
 * not racing against draining
 */
-   scsi_queue_work(ha->core.shost, &sw->work);
+   queue_work(ha->disco_q, &sw->work);
 }
 
 static void sas_chain_event(int event, unsigned long *pending,
diff --git a/drivers/scsi/libsas/sas_event.c b/drivers/scsi/libsas/sas_event.c
index 35412d9..c120657 100644
--- a/drivers/scsi/libsas/sas_event.c
+++ b/drivers/scsi/libsas/sas_event.c
@@ -40,7 +40,7 @@ int sas_queue_work(struct sas_ha_struct *ha, struct sas_work *sw)
if (list_empty(&sw->drain_node))
list_add(&sw->drain_node, &ha->defer_q);
} else
-   rc = scsi_queue_work(ha->core.shost, &sw->work);
+   rc = queue_work(ha->event_q, &sw->work);
 
return rc;
 }
@@ -61,7 +61,7 @@ static int sas_queue_event(int event, struct sas_work *work,
 
 void __sas_drain_work(struct sas_ha_struct *ha)
 {
-   struct workqueue_struct *wq = ha->core.shost->work_q;
+   struct workqueue_struct *wq = ha->event_q;
struct sas_work *sw, *_sw;
int ret;
 
diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index e2d98a8..b49c46f 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -109,6 +109,7 @@ void sas_hash_addr(u8 *hashed, const u8 *sas_addr)
 
 int sas_register_ha(struct sas_ha_struct *sas_ha)
 {
+   char name[64];
int error = 0;
 
mutex_init(&sas_ha->disco_mutex);
@@ -142,10 +143,24 @@ int sas_register_ha(struct sas_ha_struct *sas_ha)
goto Undo_ports;
}
 
+   error = -ENOMEM;
+   snprintf(name, sizeof(name), "%s_event_q", dev_name(sas_ha->dev));
+   sas_ha->event_q = create_singlethread_workqueue(name);
+   if (!sas_ha->event_q)
+   goto Undo_ports;
+
+   snprintf(name, sizeof(name), "%s_disco_q", dev_name(sas_ha->dev));
+   sas_ha->disco_q = create_singlethread_workqueue(name);
+   if (!sas_ha->disco_q)
+   goto Undo_event_q;
+
INIT_LIST_HEAD(&sas_ha->eh_done_q);
INIT_LIST_HEAD(&sas_ha->eh_ata_q);
 
return 0;
+
+Undo_event_q:
+   destroy_workqueue(sas_ha->event_q);
 Undo_ports:
sas_unregister_ports(sas_ha);
 Undo_phys:
@@ -176,6 +191,9 @@ int sas_unregister_ha(struct sas_ha_struct *sas_ha)
__sas_drain_work(sas_ha);
mutex_unlock(&sas_ha->drain_mutex);
 
+   destroy_workqueue(sas_ha->disco_q);
+   destroy_workqueue(sas_ha->event_q);
+
return 0;
 }
 
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 08c1c32..d1ab157 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -388,6 +388,9 @@ struct sas_ha_struct {
struct device *dev;   /* should be set */
struct module *lldd_module; /* should be set */
 
+   struct workqueue_struct *event_q;
+   struct workqueue_struct *disco_q;
+
u8 *sas_addr; /* must be set */
u8 hashed_sas_addr[HASHED_SAS_ADDR_SIZE];
 
-- 
2.5.0



[PATCH v4 03/11] libsas: remove unused port_gone_completion and DISCE_PORT_GONE

2017-09-06 Thread Jason Yan
Nothing uses port_gone_completion in struct asd_sas_port or
DISCE_PORT_GONE in enum discover_event; clean them out.

Signed-off-by: Jason Yan 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
---
 include/scsi/libsas.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index ccf3b48..99f32b5 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -81,7 +81,6 @@ enum phy_event {
 enum discover_event {
DISCE_DISCOVER_DOMAIN   = 0U,
DISCE_REVALIDATE_DOMAIN,
-   DISCE_PORT_GONE,
DISCE_PROBE,
DISCE_SUSPEND,
DISCE_RESUME,
@@ -256,8 +255,6 @@ struct sas_discovery {
 /* The port struct is Class:RW, driver:RO */
 struct asd_sas_port {
 /* private: */
-   struct completion port_gone_completion;
-
struct sas_discovery disc;
struct domain_device *port_dev;
spinlock_t dev_list_lock;
-- 
2.5.0



[PATCH v4 11/11] libsas: add event to defer list tail instead of head when draining

2017-09-06 Thread Jason Yan
From: chenxiang 

Events are added to the defer_q list when SAS_HA_DRAINING is set in
ha->state, and they are processed after the workqueue is drained.

Those events are added to the head of the list, but they are scanned
one by one from head to tail, which causes them to be processed in the
reverse order of their arrival. So change list_add to list_add_tail in
sas_queue_work().

Signed-off-by: chenxiang 
Signed-off-by: Jason Yan 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
---
 drivers/scsi/libsas/sas_event.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/libsas/sas_event.c b/drivers/scsi/libsas/sas_event.c
index c120657..b124198 100644
--- a/drivers/scsi/libsas/sas_event.c
+++ b/drivers/scsi/libsas/sas_event.c
@@ -38,7 +38,7 @@ int sas_queue_work(struct sas_ha_struct *ha, struct sas_work *sw)
if (test_bit(SAS_HA_DRAINING, &ha->state)) {
/* add it to the defer list, if not already pending */
if (list_empty(&sw->drain_node))
-   list_add(&sw->drain_node, &ha->defer_q);
+   list_add_tail(&sw->drain_node, &ha->defer_q);
} else
rc = queue_work(ha->event_q, &sw->work);
 
-- 
2.5.0



[PATCH v4 10/11] libsas: direct call probe and destruct

2017-09-06 Thread Jason Yan
Commit 87c8331f ("[SCSI] libsas: prevent domain rediscovery competing
with ata error handling") introduced the disco mutex to prevent rediscovery
from competing with ata error handling, and put the whole revalidation
inside the mutex. But the rphy add/remove needs to wait for the error
handling, which also grabs the disco mutex. This may lead to a deadlock.
So the probe and destruct events were introduced to do the rphy add/remove
asynchronously, outside the lock.

The asynchronously processed workers make the whole discovery process
non-atomic, so other events may interrupt it. For example, if a
loss-of-signal event is inserted before the probe event,
sas_deform_port() is called and the port is deleted.

And sas_port_delete() may run before the destruct event, even though
port-x:x is the top parent of the end device or expander. This leads to
a kernel WARNING such as:

[   82.042979] sysfs group 'power' not found for kobject 'phy-1:0:22'
[   82.042983] [ cut here ]
[   82.042986] WARNING: CPU: 54 PID: 1714 at fs/sysfs/group.c:237
sysfs_remove_group+0x94/0xa0
[   82.043059] Call trace:
[   82.043082] [] sysfs_remove_group+0x94/0xa0
[   82.043085] [] dpm_sysfs_remove+0x60/0x70
[   82.043086] [] device_del+0x138/0x308
[   82.043089] [] sas_phy_delete+0x38/0x60
[   82.043091] [] do_sas_phy_delete+0x6c/0x80
[   82.043093] [] device_for_each_child+0x58/0xa0
[   82.043095] [] sas_remove_children+0x40/0x50
[   82.043100] [] sas_destruct_devices+0x64/0xa0
[   82.043102] [] process_one_work+0x1fc/0x4b0
[   82.043104] [] worker_thread+0x50/0x490
[   82.043105] [] kthread+0xfc/0x128
[   82.043107] [] ret_from_fork+0x10/0x50

Make probe and destruct direct calls in the disco and revalidate functions,
but put them outside the lock. The whole discovery or revalidation then
cannot be interrupted by other events, and the DISCE_PROBE and
DISCE_DESTRUCT events are deleted as a result of the direct calls.

Introduce a new list to destruct the sas_ports and do the port deletion
after the destruct. This ensures the right order of destroying the sysfs
kobjects and fixes the warning above.

Signed-off-by: Jason Yan 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
CC: Dan Williams 
---
 drivers/scsi/libsas/sas_ata.c  |  1 -
 drivers/scsi/libsas/sas_discover.c | 34 --
 drivers/scsi/libsas/sas_expander.c |  2 +-
 drivers/scsi/libsas/sas_internal.h |  1 +
 drivers/scsi/libsas/sas_port.c |  3 +++
 include/scsi/libsas.h  |  3 +--
 include/scsi/scsi_transport_sas.h  |  1 +
 7 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 87f5e694..dbe8c5e 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -729,7 +729,6 @@ int sas_discover_sata(struct domain_device *dev)
if (res)
return res;
 
-   sas_discover_event(dev->port, DISCE_PROBE);
return 0;
 }
 
diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index 14f714d..d5f5b58 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -212,13 +212,9 @@ void sas_notify_lldd_dev_gone(struct domain_device *dev)
}
 }
 
-static void sas_probe_devices(struct work_struct *work)
+static void sas_probe_devices(struct asd_sas_port *port)
 {
struct domain_device *dev, *n;
-   struct sas_discovery_event *ev = to_sas_discovery_event(work);
-   struct asd_sas_port *port = ev->port;
-
-   clear_bit(DISCE_PROBE, &port->disc.pending);
 
/* devices must be domain members before link recovery and probe */
list_for_each_entry(dev, &port->disco_list, disco_list_node) {
@@ -294,7 +290,6 @@ int sas_discover_end_dev(struct domain_device *dev)
res = sas_notify_lldd_dev_found(dev);
if (res)
return res;
-   sas_discover_event(dev->port, DISCE_PROBE);
 
return 0;
 }
@@ -353,13 +348,9 @@ static void sas_unregister_common_dev(struct asd_sas_port *port, struct domain_d
sas_put_device(dev);
 }
 
-static void sas_destruct_devices(struct work_struct *work)
+void sas_destruct_devices(struct asd_sas_port *port)
 {
struct domain_device *dev, *n;
-   struct sas_discovery_event *ev = to_sas_discovery_event(work);
-   struct asd_sas_port *port = ev->port;
-
-   clear_bit(DISCE_DESTRUCT, &port->disc.pending);
 
list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) {
list_del_init(&dev->disco_list_node);
@@ -370,6 +361,16 @@ static void sas_destruct_devices(struct work_struct *work)
}
 }
 
+void sas_destruct_ports(struct asd_sas_port *port)
+{
+   struct sas_port *sas_port, *p;
+
list_for_each_entry_safe(sas_port, p, &port->sas_port_del_list, del_list) {
+   list_del_init(&sas_port->del_list);
+   sas_port_delete(sas_port);

[PATCH v4 06/11] libsas: shut down the PHY if events reached the threshold

2017-09-06 Thread Jason Yan
If the PHY bursts too many events, we will allocate a lot of events for
the worker, which may lead to memory exhaustion.

Dan Williams suggested shutting down the PHY if the events reach a
threshold, because in this case the PHY may have gone into some erroneous
state. Users can re-enable the PHY via sysfs if they want.

We cannot use a fixed memory pool, because if we ran out of events, the
shutdown event and the loss-of-signal event would be lost too. The events
still need to be allocated and processed in this case.

Suggested-by: Dan Williams 
Signed-off-by: Jason Yan 
CC: John Garry 
CC: Johannes Thumshirn 
CC: Ewan Milne 
CC: Christoph Hellwig 
CC: Tomas Henzl 
---
 drivers/scsi/libsas/sas_init.c | 21 -
 drivers/scsi/libsas/sas_phy.c  | 31 ++-
 include/scsi/libsas.h  |  6 ++
 3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index 85c278a..b1e03d5 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -122,6 +122,8 @@ int sas_register_ha(struct sas_ha_struct *sas_ha)
INIT_LIST_HEAD(&sas_ha->defer_q);
INIT_LIST_HEAD(&sas_ha->eh_dev_q);
 
+   sas_ha->event_thres = SAS_PHY_SHUTDOWN_THRES;
+
error = sas_register_phys(sas_ha);
if (error) {
printk(KERN_NOTICE "couldn't register sas phys:%d\n", error);
@@ -556,14 +558,31 @@ EXPORT_SYMBOL_GPL(sas_domain_attach_transport);
 
 struct asd_sas_event *sas_alloc_event(struct asd_sas_phy *phy)
 {
+   struct asd_sas_event *event;
gfp_t flags = in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;
 
-   return kmem_cache_zalloc(sas_event_cache, flags);
+   event = kmem_cache_zalloc(sas_event_cache, flags);
+   if (!event)
+   return NULL;
+
+   atomic_inc(&phy->event_nr);
+   if (atomic_read(&phy->event_nr) > phy->ha->event_thres &&
+   !phy->in_shutdown) {
+   phy->in_shutdown = 1;
+   sas_printk("The phy%02d bursting events, shut it down.\n",
+  phy->id);
+   sas_notify_phy_event(phy, PHYE_SHUTDOWN);
+   }
+
+   return event;
 }
 
 void sas_free_event(struct asd_sas_event *event)
 {
+   struct asd_sas_phy *phy = event->phy;
+
kmem_cache_free(sas_event_cache, event);
+   atomic_dec(&phy->event_nr);
 }
 
 /* -- SAS Class register/unregister -- */
diff --git a/drivers/scsi/libsas/sas_phy.c b/drivers/scsi/libsas/sas_phy.c
index 59f8292..3df1eec 100644
--- a/drivers/scsi/libsas/sas_phy.c
+++ b/drivers/scsi/libsas/sas_phy.c
@@ -35,6 +35,7 @@ static void sas_phye_loss_of_signal(struct work_struct *work)
struct asd_sas_event *ev = to_asd_sas_event(work);
struct asd_sas_phy *phy = ev->phy;
 
+   phy->in_shutdown = 0;
phy->error = 0;
sas_deform_port(phy, 1);
 }
@@ -44,6 +45,7 @@ static void sas_phye_oob_done(struct work_struct *work)
struct asd_sas_event *ev = to_asd_sas_event(work);
struct asd_sas_phy *phy = ev->phy;
 
+   phy->in_shutdown = 0;
phy->error = 0;
 }
 
@@ -105,6 +107,32 @@ static void sas_phye_resume_timeout(struct work_struct *work)
 }
 
 
+static void sas_phye_shutdown(struct work_struct *work)
+{
+   struct asd_sas_event *ev = to_asd_sas_event(work);
+   struct asd_sas_phy *phy = ev->phy;
+   struct sas_ha_struct *sas_ha = phy->ha;
+   struct sas_internal *i =
+   to_sas_internal(sas_ha->core.shost->transportt);
+
+   if (phy->enabled && i->dft->lldd_control_phy) {
+   int ret;
+
+   phy->error = 0;
+   phy->enabled = 0;
+   ret = i->dft->lldd_control_phy(phy, PHY_FUNC_DISABLE, NULL);
+   if (ret)
+   sas_printk("lldd disable phy%02d returned %d\n",
+   phy->id, ret);
+
+   } else if (!i->dft->lldd_control_phy)
+   sas_printk("lldd does not support phy%02d control\n", phy->id);
+   else
+   sas_printk("phy%02d is not enabled, cannot shutdown\n",
+   phy->id);
+
+}
+
 /* -- Phy class registration -- */
 
 int sas_register_phys(struct sas_ha_struct *sas_ha)
@@ -116,6 +144,7 @@ int sas_register_phys(struct sas_ha_struct *sas_ha)
struct asd_sas_phy *phy = sas_ha->sas_phy[i];
 
phy->error = 0;
+   atomic_set(&phy->event_nr, 0);
INIT_LIST_HEAD(&phy->port_phy_el);
 
phy->port = NULL;
@@ -151,5 +180,5 @@ const work_func_t sas_phy_event_fns[PHY_NUM_EVENTS] = {
[PHYE_OOB_ERROR] = sas_phye_oob_error,
[PHYE_SPINUP_HOLD] = sas_phye_spinup_hold,
[PHYE_RESUME_TIMEOUT] = sas_phye_resume_timeout,
-
+   [PHYE_SHUTDOWN] = sas_phye_shutdown,
 };
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index c80321b..2fa0b13 100644
--- a/include/scsi/libsas.h
+++ 

[PATCH v4 00/11] Enhance libsas hotplug feature

2017-09-06 Thread Jason Yan
Hello all, Yijing Wang handed over this topic to me. We are working
on it the last two months. We have tested the patchset for a long
time. Here is the new version.

Now the libsas hotplug has some issues, Dan Williams report
a similar bug here before
https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg39187.html

The issues we have found:
1. If the LLDD reports a burst of phy-up/phy-down SAS events, some events
   may be lost because an identical SAS event is already pending; as a
   result the libsas topology may end up differing from the hardware.
2. On receiving a phy-down SAS event, libsas calls sas_deform_port to
   remove devices: it first deletes the SAS port, then puts a destruction
   discovery event in a new work item and queues it at the tail of the
   workqueue. Once the SAS port is deleted, its child devices are deleted
   too, so when the destruction work runs it finds the target device has
   already been removed and reports a sysfs warning.
3. Since a hotplug process is divided into several work items, a phy-up
   SAS event can be inserted between the phy-down work items, e.g.
     destruction work ---> PORTE_BYTES_DMAED (sas_form_port) ---> PHYE_LOSS_OF_SIGNAL
   so the hot-remove flow is broken by the PORTE_BYTES_DMAED event, which
   is not what we expect, and issues occur.
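The first issue above can be modeled in a few lines: with one statically
embedded work item per event type, a second identical event arriving while
the first is still pending is silently dropped, whereas allocating one work
item per event keeps them all. This is a toy sketch, not libsas code; the
names (queue_static_event, queue_dynamic_event, struct phy_state) are
illustrative only.

```c
#include <stdbool.h>

struct phy_state {
	bool event_pending;	/* models an already-queued work_struct */
	int  queued;		/* events actually accepted */
};

/* Old scheme: one static work item, so duplicate events are lost. */
static bool queue_static_event(struct phy_state *phy)
{
	if (phy->event_pending)
		return false;	/* identical event pending: dropped */
	phy->event_pending = true;
	phy->queued++;
	return true;
}

/* New scheme: a fresh work item is allocated per event, none are lost. */
static bool queue_dynamic_event(struct phy_state *phy)
{
	phy->queued++;		/* every event gets its own work item */
	return true;
}
```

With a burst of five events before the worker runs, the static scheme
accepts only the first one while the dynamic scheme accepts all five,
which is the topology-drift scenario described in issue 1.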

v3->v4:
- get rid of the unused ha event and do some cleanup
- use dynamically allocated work items and support shutting down the phy
  if the active event count reaches the threshold
- use flush_workqueue instead of wait-completion to process discovery
  events synchronously
- directly call the probe and destruct functions
- other small code improvements
v2->v3: some code improvements suggested by Johannes and John,
split v2 patch 2 into several small patches.
v1->v2: some code improvements suggested by John Garry
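The "shut down the phy if events reached the threshold" change can be
sketched as follows: count in-flight events per phy and, once the count
passes a threshold, disable the phy instead of queueing more work. This is
a simplified model, not the kernel code; the real series uses an atomic_t
event_nr on asd_sas_phy and queues a PHYE_SHUTDOWN event, while the names
here (toy_phy, toy_phy_notify_event, PHY_EVENT_THRESHOLD) and the threshold
value are assumptions for illustration.

```c
#include <stdbool.h>

#define PHY_EVENT_THRESHOLD 32	/* illustrative default */

struct toy_phy {
	int  event_nr;		/* in-flight events (atomic_t in the kernel) */
	bool enabled;
	bool shutdown_requested;
};

/* Called when the LLDD notifies an event for this phy. */
static bool toy_phy_notify_event(struct toy_phy *phy)
{
	if (!phy->enabled)
		return false;	/* phy already shut down: drop the event */
	if (++phy->event_nr > PHY_EVENT_THRESHOLD) {
		/* too many in-flight events: model a PHYE_SHUTDOWN request */
		phy->enabled = false;
		phy->shutdown_requested = true;
		return false;
	}
	return true;		/* event queued normally */
}

/* Called when a queued event has been processed. */
static void toy_phy_event_done(struct toy_phy *phy)
{
	phy->event_nr--;
}
```

A flood of events from a flaky link thus self-limits: once the threshold
is crossed the phy is disabled and further events are dropped, matching
the sas_phye_shutdown path added in the series.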

Jason Yan (10):
  libsas: kill useless ha_event and do some cleanup
  libsas: remove the numbering for each event enum
  libsas: remove unused port_gone_completion and DISCE_PORT_GONE
  libsas: rename notify_port_event() for consistency
  libsas: Use dynamic alloced work to avoid sas event lost
  libsas: shut down the PHY if events reached the threshold
  libsas: make the event threshold configurable
  libsas: Use new workqueue to run sas event and disco event
  libsas: use flush_workqueue to process disco events
synchronously
  libsas: direct call probe and destruct

chenxiang (1):
  libsas: add event to defer list tail instead of head when draining

 drivers/scsi/aic94xx/aic94xx_hwi.c|   3 -
 drivers/scsi/hisi_sas/hisi_sas_main.c |   7 ++-
 drivers/scsi/libsas/sas_ata.c |   1 -
 drivers/scsi/libsas/sas_discover.c|  36 +++-
 drivers/scsi/libsas/sas_dump.c|  10 
 drivers/scsi/libsas/sas_dump.h|   1 -
 drivers/scsi/libsas/sas_event.c   |  97 +++-
 drivers/scsi/libsas/sas_expander.c|   2 +-
 drivers/scsi/libsas/sas_init.c| 101 +-
 drivers/scsi/libsas/sas_internal.h|   7 +++
 drivers/scsi/libsas/sas_phy.c |  73 
 drivers/scsi/libsas/sas_port.c|  25 +
 include/scsi/libsas.h |  81 ---
 include/scsi/scsi_transport_sas.h |   1 +
 14 files changed, 270 insertions(+), 175 deletions(-)

-- 
2.5.0


