Re: [PATCH v3] drm: bridge: synopsys/dw-hdmi: Enable cec clock

2017-11-24 Thread Archit Taneja

Hi,

On 11/20/2017 06:00 PM, Hans Verkuil wrote:

I didn't see this merged for 4.15; is it too late to include it?
All other changes needed to get CEC to work on rk3288 and rk3399 are merged.


Sorry for the late reply. I was out last week.

Dave recently sent the second pull request for 4.15, so I think it would be
hard to get it in the merge window now. We could target it for the 4.15-rcs
since it is preventing the feature from working. Is it possible to rephrase
the commit message a bit so that it's clear that we need it for CEC to work?

Thanks,
Archit



Regards,

Hans

On 10/26/2017 08:19 PM, Pierre-Hugues Husson wrote:

The documentation already mentions the optional "cec" clock, but
currently the driver doesn't enable it.

Changes:
v3:
- Drop useless braces

v2:
- Separate ENOENT errors from others
- Propagate other errors (especially -EPROBE_DEFER)

Signed-off-by: Pierre-Hugues Husson 
---
  drivers/gpu/drm/bridge/synopsys/dw-hdmi.c | 25 +
  1 file changed, 25 insertions(+)

diff --git a/drivers/gpu/drm/bridge/synopsys/dw-hdmi.c 
b/drivers/gpu/drm/bridge/synopsys/dw-hdmi.c
index bf14214fa464..d82b9747a979 100644
--- a/drivers/gpu/drm/bridge/synopsys/dw-hdmi.c
+++ b/drivers/gpu/drm/bridge/synopsys/dw-hdmi.c
@@ -138,6 +138,7 @@ struct dw_hdmi {
struct device *dev;
struct clk *isfr_clk;
struct clk *iahb_clk;
+   struct clk *cec_clk;
struct dw_hdmi_i2c *i2c;
  
  	struct hdmi_data_info hdmi_data;

@@ -2382,6 +2383,26 @@ __dw_hdmi_probe(struct platform_device *pdev,
goto err_isfr;
}
  
+	hdmi->cec_clk = devm_clk_get(hdmi->dev, "cec");

+   if (PTR_ERR(hdmi->cec_clk) == -ENOENT) {
+   hdmi->cec_clk = NULL;
+   } else if (IS_ERR(hdmi->cec_clk)) {
+   ret = PTR_ERR(hdmi->cec_clk);
+   if (ret != -EPROBE_DEFER)
+   dev_err(hdmi->dev, "Cannot get HDMI cec clock: %d\n",
+   ret);
+
+   hdmi->cec_clk = NULL;
+   goto err_iahb;
+   } else {
+   ret = clk_prepare_enable(hdmi->cec_clk);
+   if (ret) {
+   dev_err(hdmi->dev, "Cannot enable HDMI cec clock: %d\n",
+   ret);
+   goto err_iahb;
+   }
+   }
+
/* Product and revision IDs */
hdmi->version = (hdmi_readb(hdmi, HDMI_DESIGN_ID) << 8)
  | (hdmi_readb(hdmi, HDMI_REVISION_ID) << 0);
@@ -2518,6 +2539,8 @@ __dw_hdmi_probe(struct platform_device *pdev,
cec_notifier_put(hdmi->cec_notifier);
  
  	clk_disable_unprepare(hdmi->iahb_clk);

+   if (hdmi->cec_clk)
+   clk_disable_unprepare(hdmi->cec_clk);
  err_isfr:
clk_disable_unprepare(hdmi->isfr_clk);
  err_res:
@@ -2541,6 +2564,8 @@ static void __dw_hdmi_remove(struct dw_hdmi *hdmi)
  
  	clk_disable_unprepare(hdmi->iahb_clk);

clk_disable_unprepare(hdmi->isfr_clk);
+   if (hdmi->cec_clk)
+   clk_disable_unprepare(hdmi->cec_clk);
  
  	if (hdmi->i2c)

i2c_del_adapter(&hdmi->i2c->adap);
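
As an aside: on kernels that provide devm_clk_get_optional() (an API added
well after this patch was posted), the ENOENT special case above can be
folded away. A sketch only, not part of this patch -- devm_clk_get_optional()
returns NULL rather than ERR_PTR(-ENOENT) when the clock is absent, and the
clk API treats a NULL clock as a no-op:

hdmi->cec_clk = devm_clk_get_optional(hdmi->dev, "cec");
if (IS_ERR(hdmi->cec_clk)) {
	ret = PTR_ERR(hdmi->cec_clk);
	if (ret != -EPROBE_DEFER)
		dev_err(hdmi->dev, "Cannot get HDMI cec clock: %d\n", ret);
	goto err_iahb;
}

/* clk_prepare_enable(NULL) is a no-op and returns 0 */
ret = clk_prepare_enable(hdmi->cec_clk);
if (ret) {
	dev_err(hdmi->dev, "Cannot enable HDMI cec clock: %d\n", ret);
	goto err_iahb;
}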





--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH] mm,madvise: bugfix of madvise systemcall infinite loop under special circumstances.

2017-11-24 Thread Michal Hocko
On Fri 24-11-17 10:27:57, guoxuenan wrote:
> From: chenjie 
> 
> The madvise() system call supports a set of "conventional" advice values;
> the MADV_WILLNEED value will trigger an infinite loop under direct
> access mode (DAX). In DAX mode, the function madvise_vma() returns
> directly without updating the pointer [prev].
> 
> For example:
> Special circumstances:
> 1. init [ start < vma->vm_start < vma->vm_end < end ]
> 2. madvise_vma() using the MADV_WILLNEED parameter;
> madvise_vma() -> madvise_willneed() -> returns 0 without updating [prev]
> 
> ===
> in function SYSCALL_DEFINE3(madvise,...)
> 
> for (;;)
> {
> //[first loop: start = vma->vm_start < vma->vm_end < end ]
> // update [ start = vma->vm_start | end ]
> 
> con0: if (start >= end) //always false;
>   goto out;
>   tmp = vma->vm_end;
> 
> //does not update [prev] and always returns 0;
>   error = madvise_willneed();
> 
> con1: if (error) //always false;
>   goto out;
> 
> //[ vma->vm_start < start = vma->vm_end < end ]
> // update [ start = tmp ]
> 
> con2: if (start >= end) //always false;
>   goto out;
> 
> //because pointer [prev] did not change, [vma] stays as it was;
>   update [ vma = prev->vm_next ]
> }
> 
> ===
> After the first cycle it will always keep
> [ vma->vm_start < start = vma->vm_end < end ].
> Since the loop exit conditions (con{0,1,2}) are never met, the
> program is stuck in an infinite loop.

Are you sure? Have you tested this? I might be missing something because
the madvise code is a bit of a mess, but AFAICS the prev pointer (updated
or not) will allow us to advance:
	if (prev)
		vma = prev->vm_next;
	else	/* madvise_remove dropped mmap_sem */
		vma = find_vma(current->mm, start);
Note that start is vma->vm_end and find_vma will find a vma whose
vm_end > addr.

So either I am missing something or this code has actually never worked
for DAX/XIP, which I find rather suspicious.
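
To make the disagreement easier to follow, here is a small userspace toy
model of the syscall loop (a sketch only -- the vma list, madvise_one() and
the FIX_PREV toggle are invented for illustration; this is not the kernel
code). With FIX_PREV=0, a prev left stale by the DAX willneed path keeps
pointing one vma back, so vma = prev->vm_next lands on the same vma forever;
with FIX_PREV=1 (modeling the proposed *prev = vma) the loop terminates:

#include <stdio.h>

#define FIX_PREV 0	/* set to 1 to model the proposed "*prev = vma" fix */

struct vma { unsigned long vm_start, vm_end; int dax; struct vma *vm_next; };

static struct vma v2 = { 20, 30, 0, NULL };
static struct vma v1 = { 10, 20, 1, &v2 };	/* the "DAX" vma */
static struct vma v0 = {  0, 10, 0, &v1 };

static struct vma *find_vma(unsigned long addr)
{
	struct vma *v;

	for (v = &v0; v; v = v->vm_next)
		if (addr < v->vm_end)
			return v;
	return NULL;
}

/* stands in for madvise_vma(); the "DAX" path skips the *prev update */
static int madvise_one(struct vma *vma, struct vma **prev)
{
	if (!vma->dax || FIX_PREV)
		*prev = vma;
	return 0;
}

int main(void)
{
	unsigned long start = 5, end = 25, tmp;
	struct vma *vma = find_vma(start), *prev = NULL;
	int iter;

	for (iter = 0; iter < 6 && vma; iter++) {	/* cap instead of for (;;) */
		if (start < vma->vm_start)
			start = vma->vm_start;
		if (start >= end)
			break;
		tmp = vma->vm_end < end ? vma->vm_end : end;
		if (madvise_one(vma, &prev))
			break;
		start = tmp;
		if (start >= end)
			break;
		vma = prev ? prev->vm_next : find_vma(start);
	}
	printf("%s\n", iter < 6 ? "terminated" : "stuck on the same vma");
	return 0;
}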
 
> Signed-off-by: chenjie 
> Signed-off-by: guoxuenan 
> ---
>  mm/madvise.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 21261ff..c355fee 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -294,6 +294,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  #endif
>  
>   if (IS_DAX(file_inode(file))) {
> + *prev = vma;
>   /* no bad return value, but ignore advice */
>   return 0;
>   }
> -- 
> 2.9.5
> 

-- 
Michal Hocko
SUSE Labs


Re: VMs freezing when host is running 4.14

2017-11-24 Thread Marc Haber
On Thu, Nov 23, 2017 at 06:26:36PM +0200, Liran Alon wrote:
> If there is no nested guest so no. My fix here probably won't help.

I can confirm that I am not running nested virt; the host is running
directly on the APU. I also have three other machines that are running
flawlessly with 4.14, and another virtualization host, a "real" server
with a somewhat dated AMD Opteron 1389, that has the same issue. The
machine that first showed the issue is Intel, so this is not a
vendor-specific issue.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: [PATCH] ASoC: amd: added error checks in dma driver

2017-11-24 Thread Guenter Roeck
On Fri, Nov 24, 2017 at 3:07 AM, Mukunda,Vijendar
 wrote:
>
>
>
> On Thursday 23 November 2017 10:59 PM, Mark Brown wrote:
>>
>> On Thu, Nov 23, 2017 at 08:59:43AM -0800, Guenter Roeck wrote:
>>>
>>> On Thu, Nov 23, 2017 at 8:30 AM, Vijendar Mukunda
>>>  wrote:

 added error checks in acp dma driver
 Signed-off-by: Vijendar Mukunda 
 Signed-off-by: Akshu Agrawal 
 Signed-off-by: Guenter Roeck 
>>>
>>> This is inappropriate.
>>
>> Specifically: if Guenter wasn't involved in writing or forwarding the
>> patch he shouldn't have a signoff in there, and if you're the one
>> sending the mail you should be the last person in the chain of signoffs.
>> Please see SubmittingPatches for details of what a signoff means and why
>> they're important.
>
>
>   This patch was implemented on top of changes implemented by Guenter.
>   There is a separate thread - "RE: [PATCH] ASoC: amd: Add error checking
>   to probe function" - in which Guenter posted changes.

That was my patch. This is yours.

Guenter

>   Got it, apologies. Will post the changes as v2.
>


Re: [PATCH 1/3] lockdep: Apply crossrelease to PG_locked locks

2017-11-24 Thread Michal Hocko
On Fri 24-11-17 12:02:36, Byungchul Park wrote:
> On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> > On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > > for each struct page. So you are doubling the size. Who is going to
> > > > enable this config option? You are moving this to page_ext in a later
> > > > patch which is a good step but it doesn't go far enough because this
> > > > still consumes those resources. Is there any problem to make this
> > > > kernel command line controllable? Something we do for page_owner for
> > > > example?
> > > 
> > > Sure. I will add it.
> > > 
> > > > Also it would be really great if you could give us some measurements of
> > > > the runtime overhead. I do not expect it to be very large but this is
> > > 
> > > The major overhead would come from the amount of additional memory
> > > consumption for 'lockdep_map's.
> > 
> > yes
> > 
> > > Do you want me to measure the overhead by the additional memory
> > > consumption?
> > > 
> > > Or do you expect another overhead?
> > 
> > I would also be interested in how much impact this has on performance. I
> > do not expect it to be too large, but it would be good to have some
> > numbers for a cache-cold parallel kbuild or other heavy page-lock
> > workloads.
> 
> Hello Michal,
> 
> I measured 'cache cold parallel kbuild' on my qemu machine. The results
> vary a lot so I cannot be sure, but I think there's no meaningful
> difference between before and after applying crossrelease to page locks.
> 
> Actually, I expect little overhead in lock_page() and unlock_page() even
> after applying crossrelease to page locks; I only expect a bit of
> overhead from the additional memory consumption for 'lockdep_map's per page.
> 
> I ran the following instructions within "QEMU x86_64 4GB memory 4 cpus":
> 
>make clean
>echo 3 > drop_caches
>time make -j4

Maybe FS people will help you find a more representative workload. E.g. a
linear cache-cold file read should be good as well. Maybe there are some
tests in fstests (or whatever they call xfstests these days).

> The results are:
> 
># w/o page lock tracking
> 
>At the 1st try,
>real 5m28.105s
>user 17m52.716s
>sys  3m8.871s
> 
>At the 2nd try,
>real 5m27.023s
>user 17m50.134s
>sys  3m9.289s
> 
>At the 3rd try,
>real 5m22.837s
>user 17m34.514s
>sys  3m8.097s
> 
># w/ page lock tracking
> 
>At the 1st try,
>real 5m18.158s
>user 17m18.200s
>sys  3m8.639s
> 
>At the 2nd try,
>real 5m19.329s
>user 17m19.982s
>sys  3m8.345s
> 
>At the 3rd try,
>real 5m19.626s
>user 17m21.363s
>sys  3m9.869s
> 
> I think there's no meaningful difference on my small machine.

Yeah, this doesn't seem to indicate anything. Maybe moving the build to
shmem to rule out IO cost would tell more. But as I've said previously, I
do not really expect this to be very visible. It was more a matter of my
curiosity than an acceptance requirement. I think it is much more
important to make this runtime configurable, because almost nobody is
going to enable the feature if it is only build time. The cost is just
too high.
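
For reference, the kind of boot-time switch being asked for is just a few
lines; a sketch with invented names, modeled on the early_param() pattern
that page_owner uses:

static bool page_lock_tracking_enabled;

static int __init early_page_lock_tracking_param(char *buf)
{
	/* accepts "on"/"off", "1"/"0", "y"/"n" */
	return kstrtobool(buf, &page_lock_tracking_enabled);
}
early_param("page_lock_tracking", early_page_lock_tracking_param);

The per-page lockdep_map (or its page_ext equivalent) would then only be
initialized when the flag is set, so the memory cost is paid only by users
who opt in on the command line.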

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/4] ASoC: wm2000: Delete an error message for a failed memory allocation in wm2000_i2c_probe()

2017-11-24 Thread Charles Keepax
On Fri, Nov 24, 2017 at 08:36:17AM +0100, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Thu, 23 Nov 2017 22:28:00 +0100
> 
> Omit an extra message for a memory allocation failure in this function.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---

Acked-by: Charles Keepax 

Thanks,
Charles


Linux 3.18.84

2017-11-24 Thread Greg KH
I'm announcing the release of the 3.18.84 kernel.

All users of the 3.18 kernel series must upgrade.

The updated 3.18.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git 
linux-3.18.y
and can be browsed at the normal kernel.org git web browser:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary

thanks,

greg k-h



 Makefile  |2 +-
 drivers/char/ipmi/ipmi_msghandler.c   |   10 ++
 drivers/net/ethernet/fealnx.c |6 +++---
 fs/coda/upcall.c  |3 +--
 fs/ocfs2/file.c   |9 +++--
 include/linux/skbuff.h|7 +++
 net/8021q/vlan.c  |6 +++---
 net/core/skbuff.c |1 +
 net/dccp/ipv6.c   |7 +++
 net/ipv4/tcp_output.c |9 ++---
 net/ipv6/tcp_ipv6.c   |2 ++
 net/netlink/af_netlink.c  |   17 +++--
 net/netlink/af_netlink.h  |1 +
 net/sctp/ipv6.c   |2 ++
 net/sctp/socket.c |4 
 security/integrity/ima/ima_appraise.c |3 +++
 16 files changed, 61 insertions(+), 28 deletions(-)

Cong Wang (1):
  vlan: fix a use-after-free in vlan_device_event()

Corey Minyard (1):
  ipmi: fix unsigned long underflow

Eric Dumazet (1):
  tcp: do not mangle skb->cb[] in tcp_make_synack()

Eric W. Biederman (1):
  net/sctp: Always set scope_id in sctp_inet6_skb_msgname

Greg Kroah-Hartman (1):
  Linux 3.18.84

Huacai Chen (1):
  fealnx: Fix building error on MIPS

Jan Harkes (1):
  coda: fix 'kernel memory exposure attempt' in fsync

Jason A. Donenfeld (1):
  af_netlink: ensure that NLMSG_DONE never fails in dumps

Roberto Sassu (1):
  ima: do not update security.ima if appraisal status is not INTEGRITY_PASS

WANG Cong (1):
  ipv6/dccp: do not inherit ipv6_mc_list from parent

Xin Long (1):
  sctp: do not peel off an assoc from one netns to another one

Ye Yin (1):
  netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed

alex chen (1):
  ocfs2: should wait dio before inode lock in ocfs2_setattr()





Re: Linux 3.18.84

2017-11-24 Thread Greg KH
diff --git a/Makefile b/Makefile
index 8a1e51e5b0cf..107b5778b864 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 VERSION = 3
 PATCHLEVEL = 18
-SUBLEVEL = 83
+SUBLEVEL = 84
 EXTRAVERSION =
 NAME = Diseased Newt
 
diff --git a/drivers/char/ipmi/ipmi_msghandler.c 
b/drivers/char/ipmi/ipmi_msghandler.c
index f816211f062f..63164ff66bb4 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4010,7 +4010,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struct ipmi_recv_msg 
*recv_msg,
 }
 
 static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
  int slot, unsigned long *flags,
  unsigned int *waiting_msgs)
 {
@@ -4023,8 +4024,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
if (!ent->inuse)
return;
 
-   ent->timeout -= timeout_period;
-   if (ent->timeout > 0) {
+   if (timeout_period < ent->timeout) {
+   ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4091,7 +4092,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
}
 }
 
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+unsigned long timeout_period)
 {
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
diff --git a/drivers/net/ethernet/fealnx.c b/drivers/net/ethernet/fealnx.c
index b1b9ebafb354..a3b2e23921bf 100644
--- a/drivers/net/ethernet/fealnx.c
+++ b/drivers/net/ethernet/fealnx.c
@@ -257,8 +257,8 @@ enum rx_desc_status_bits {
RXFSD = 0x0800, /* first descriptor */
RXLSD = 0x0400, /* last descriptor */
ErrorSummary = 0x80,/* error summary */
-   RUNT = 0x40,/* runt packet received */
-   LONG = 0x20,/* long packet received */
+   RUNTPKT = 0x40, /* runt packet received */
+   LONGPKT = 0x20, /* long packet received */
FAE = 0x10, /* frame align error */
CRC = 0x08, /* crc error */
RXER = 0x04,/* receive error */
@@ -1633,7 +1633,7 @@ static int netdev_rx(struct net_device *dev)
   dev->name, rx_status);
 
dev->stats.rx_errors++; /* end of a packet. */
-   if (rx_status & (LONG | RUNT))
+   if (rx_status & (LONGPKT | RUNTPKT))
dev->stats.rx_length_errors++;
if (rx_status & RXER)
dev->stats.rx_frame_errors++;
diff --git a/fs/coda/upcall.c b/fs/coda/upcall.c
index 5bb6e27298a4..21dbff85829a 100644
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb, struct CodaFid *fid)
UPARG(CODA_FSYNC);
 
inp->coda_fsync.VFid = *fid;
-   error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
-   &outsize, inp);
+   error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
 
CODA_FREE(inp, insize);
return error;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 2adcb9876e91..6c6fa10a82ca 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1151,6 +1151,13 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr 
*attr)
dquot_initialize(inode);
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+   /*
+* Here we should wait dio to finish before inode lock
+* to avoid a deadlock between ocfs2_setattr() and
+* ocfs2_dio_end_io_write()
+*/
+   inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1170,8 +1177,6 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr 
*attr)
if (status)
goto bail_unlock;
 
-   inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2ff757f2d3a3..e5ba0236047e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3117,6 +3117,13 @@ static inline void nf_reset_trace(struct sk_buff *skb)
 #endif
 }
 
+static inline void ipvs_reset(struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_IP_VS)

Re: Linux 4.4.101

2017-11-24 Thread Greg KH
diff --git a/Makefile b/Makefile
index 91dd7832f499..0d7b050427ed 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 VERSION = 4
 PATCHLEVEL = 4
-SUBLEVEL = 100
+SUBLEVEL = 101
 EXTRAVERSION =
 NAME = Blurry Fish Butt
 
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 210826d5bba5..9119722eb347 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -64,8 +64,7 @@ static void dump_mem(const char *lvl, const char *str, 
unsigned long bottom,
 
/*
 * We need to switch to kernel mode so that we can use __get_user
-* to safely read from kernel space.  Note that we now dump the
-* code first, just in case the backtrace kills us.
+* to safely read from kernel space.
 */
fs = get_fs();
set_fs(KERNEL_DS);
@@ -111,21 +110,12 @@ static void dump_backtrace_entry(unsigned long where)
print_ip_sym(where);
 }
 
-static void dump_instr(const char *lvl, struct pt_regs *regs)
+static void __dump_instr(const char *lvl, struct pt_regs *regs)
 {
unsigned long addr = instruction_pointer(regs);
-   mm_segment_t fs;
char str[sizeof(" ") * 5 + 2 + 1], *p = str;
int i;
 
-   /*
-* We need to switch to kernel mode so that we can use __get_user
-* to safely read from kernel space.  Note that we now dump the
-* code first, just in case the backtrace kills us.
-*/
-   fs = get_fs();
-   set_fs(KERNEL_DS);
-
for (i = -4; i < 1; i++) {
unsigned int val, bad;
 
@@ -139,8 +129,18 @@ static void dump_instr(const char *lvl, struct pt_regs 
*regs)
}
}
printk("%sCode: %s\n", lvl, str);
+}
 
-   set_fs(fs);
+static void dump_instr(const char *lvl, struct pt_regs *regs)
+{
+   if (!user_mode(regs)) {
+   mm_segment_t fs = get_fs();
+   set_fs(KERNEL_DS);
+   __dump_instr(lvl, regs);
+   set_fs(fs);
+   } else {
+   __dump_instr(lvl, regs);
+   }
 }
 
 static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
diff --git a/drivers/char/ipmi/ipmi_msghandler.c 
b/drivers/char/ipmi/ipmi_msghandler.c
index 25372dc381d4..5cb5e8ff0224 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4029,7 +4029,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struct ipmi_recv_msg 
*recv_msg,
 }
 
 static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
  int slot, unsigned long *flags,
  unsigned int *waiting_msgs)
 {
@@ -4042,8 +4043,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
if (!ent->inuse)
return;
 
-   ent->timeout -= timeout_period;
-   if (ent->timeout > 0) {
+   if (timeout_period < ent->timeout) {
+   ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4109,7 +4110,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
}
 }
 
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+unsigned long timeout_period)
 {
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5dca77e0ffed..2cb34b0f3856 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3166,7 +3166,7 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff 
*skb)
hash ^= (hash >> 16);
hash ^= (hash >> 8);
 
-   return hash;
+   return hash >> 1;
 }
 
 /*-- Device entry points */
diff --git a/drivers/net/ethernet/fealnx.c b/drivers/net/ethernet/fealnx.c
index b1b9ebafb354..a3b2e23921bf 100644
--- a/drivers/net/ethernet/fealnx.c
+++ b/drivers/net/ethernet/fealnx.c
@@ -257,8 +257,8 @@ enum rx_desc_status_bits {
RXFSD = 0x0800, /* first descriptor */
RXLSD = 0x0400, /* last descriptor */
ErrorSummary = 0x80,/* error summary */
-   RUNT = 0x40,/* runt packet received */
-   LONG = 0x20,/* long packet received */
+   RUNTPKT = 0x40, /* runt packet received */
+   LONGPKT = 0x20, /* long packet received */
FAE = 0x10, /* frame align error */
CRC = 0x08, /* crc error */
RXER = 0x04,/* receive error */
@@ -1633,7 +1633,7 @@ static int netdev_rx(struct net_device *dev)
   

Linux 4.4.101

2017-11-24 Thread Greg KH
I'm announcing the release of the 4.4.101 kernel.

All users of the 4.4 kernel series must upgrade.

The updated 4.4.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git 
linux-4.4.y
and can be browsed at the normal kernel.org git web browser:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary

thanks,

greg k-h



 Makefile  |2 -
 arch/arm64/kernel/traps.c |   26 ++--
 drivers/char/ipmi/ipmi_msghandler.c   |   10 ---
 drivers/net/bonding/bond_main.c   |2 -
 drivers/net/ethernet/fealnx.c |6 ++--
 drivers/nvme/host/pci.c   |2 -
 drivers/tty/serial/omap-serial.c  |2 -
 fs/coda/upcall.c  |3 --
 fs/ocfs2/file.c   |9 +--
 include/linux/mmzone.h|3 +-
 include/linux/page_idle.h |   43 --
 include/linux/skbuff.h|7 +
 mm/debug-pagealloc.c  |6 
 mm/page_alloc.c   |   33 ++
 mm/page_ext.c |4 ---
 mm/page_owner.c   |   16 
 mm/pagewalk.c |6 +++-
 mm/vmstat.c   |2 +
 net/8021q/vlan.c  |6 ++--
 net/core/skbuff.c |1 
 net/ipv4/tcp_output.c |9 +--
 net/netlink/af_netlink.c  |   17 -
 net/netlink/af_netlink.h  |1 
 net/sctp/ipv6.c   |2 +
 net/sctp/socket.c |4 +++
 security/integrity/ima/ima_appraise.c |3 ++
 26 files changed, 159 insertions(+), 66 deletions(-)

Cong Wang (1):
  vlan: fix a use-after-free in vlan_device_event()

Corey Minyard (1):
  ipmi: fix unsigned long underflow

Eric Dumazet (1):
  tcp: do not mangle skb->cb[] in tcp_make_synack()

Eric W. Biederman (1):
  net/sctp: Always set scope_id in sctp_inet6_skb_msgname

Greg Kroah-Hartman (1):
  Linux 4.4.101

Hangbin Liu (1):
  bonding: discard lowest hash bit for 802.3ad layer3+4

Huacai Chen (1):
  fealnx: Fix building error on MIPS

Jaewon Kim (1):
  mm/page_ext.c: check if page_ext is not prepared

Jan Harkes (1):
  coda: fix 'kernel memory exposure attempt' in fsync

Jann Horn (1):
  mm/pagewalk.c: report holes in hugetlb ranges

Jason A. Donenfeld (1):
  af_netlink: ensure that NLMSG_DONE never fails in dumps

Keith Busch (1):
  nvme: Fix memory order on async queue deletion

Lukas Wunner (1):
  serial: omap: Fix EFR write on RTS deassertion

Mark Rutland (1):
  arm64: fix dump_instr when PAN and UAO are in use

Pavel Tatashin (1):
  mm/page_alloc.c: broken deferred calculation

Roberto Sassu (1):
  ima: do not update security.ima if appraisal status is not INTEGRITY_PASS

Xin Long (1):
  sctp: do not peel off an assoc from one netns to another one

Yang Shi (1):
  mm: check the return value of lookup_page_ext for all call sites

Ye Yin (1):
  netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed

alex chen (1):
  ocfs2: should wait dio before inode lock in ocfs2_setattr()





Re: Linux 4.9.65

2017-11-24 Thread Greg KH
diff --git a/Makefile b/Makefile
index d29cace0da6d..87a641515e9c 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 VERSION = 4
 PATCHLEVEL = 9
-SUBLEVEL = 64
+SUBLEVEL = 65
 EXTRAVERSION =
 NAME = Roaring Lionus
 
diff --git a/crypto/dh.c b/crypto/dh.c
index 9d19360e7189..99e20fc63cc9 100644
--- a/crypto/dh.c
+++ b/crypto/dh.c
@@ -21,19 +21,12 @@ struct dh_ctx {
MPI xa;
 };
 
-static inline void dh_clear_params(struct dh_ctx *ctx)
+static void dh_clear_ctx(struct dh_ctx *ctx)
 {
mpi_free(ctx->p);
mpi_free(ctx->g);
-   ctx->p = NULL;
-   ctx->g = NULL;
-}
-
-static void dh_free_ctx(struct dh_ctx *ctx)
-{
-   dh_clear_params(ctx);
mpi_free(ctx->xa);
-   ctx->xa = NULL;
+   memset(ctx, 0, sizeof(*ctx));
 }
 
 /*
@@ -71,10 +64,8 @@ static int dh_set_params(struct dh_ctx *ctx, struct dh 
*params)
return -EINVAL;
 
ctx->g = mpi_read_raw_data(params->g, params->g_size);
-   if (!ctx->g) {
-   mpi_free(ctx->p);
+   if (!ctx->g)
return -EINVAL;
-   }
 
return 0;
 }
@@ -84,19 +75,24 @@ static int dh_set_secret(struct crypto_kpp *tfm, void *buf, 
unsigned int len)
struct dh_ctx *ctx = dh_get_ctx(tfm);
struct dh params;
 
+   /* Free the old MPI key if any */
+   dh_clear_ctx(ctx);
+
if (crypto_dh_decode_key(buf, len, ¶ms) < 0)
-   return -EINVAL;
+   goto err_clear_ctx;
 
if (dh_set_params(ctx, ¶ms) < 0)
-   return -EINVAL;
+   goto err_clear_ctx;
 
ctx->xa = mpi_read_raw_data(params.key, params.key_size);
-   if (!ctx->xa) {
-   dh_clear_params(ctx);
-   return -EINVAL;
-   }
+   if (!ctx->xa)
+   goto err_clear_ctx;
 
return 0;
+
+err_clear_ctx:
+   dh_clear_ctx(ctx);
+   return -EINVAL;
 }
 
 static int dh_compute_value(struct kpp_request *req)
@@ -154,7 +150,7 @@ static void dh_exit_tfm(struct crypto_kpp *tfm)
 {
struct dh_ctx *ctx = dh_get_ctx(tfm);
 
-   dh_free_ctx(ctx);
+   dh_clear_ctx(ctx);
 }
 
 static struct kpp_alg dh = {
diff --git a/drivers/char/ipmi/ipmi_msghandler.c 
b/drivers/char/ipmi/ipmi_msghandler.c
index 172a9dc06ec9..5d509ccf1299 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4029,7 +4029,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struct ipmi_recv_msg 
*recv_msg,
 }
 
 static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
  int slot, unsigned long *flags,
  unsigned int *waiting_msgs)
 {
@@ -4042,8 +4043,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
if (!ent->inuse)
return;
 
-   ent->timeout -= timeout_period;
-   if (ent->timeout > 0) {
+   if (timeout_period < ent->timeout) {
+   ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4109,7 +4110,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
}
 }
 
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+unsigned long timeout_period)
 {
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
diff --git a/drivers/dma/dmatest.c b/drivers/dma/dmatest.c
index cf76fc6149e5..fbb75514dfb4 100644
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -666,6 +666,7 @@ static int dmatest_func(void *data)
 * free it this time?" dancing.  For now, just
 * leave it dangling.
 */
+   WARN(1, "dmatest: Kernel stack may be corrupted!!\n");
dmaengine_unmap_put(um);
result("test timed out", total_tests, src_off, dst_off,
   len, 0);
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5fa36ebc0640..63d61c084815 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3217,7 +3217,7 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff 
*skb)
hash ^= (hash >> 16);
hash ^= (hash >> 8);
 
-   return hash;
+   return hash >> 1;
 }
 
 /*-- Device entry points */
diff --git a/drivers/net/ethernet/fealnx.c b/drivers/net/ethernet/fealnx.c
index c08bd763172a..a300ed48a7d8 100644
--- a/drivers/net/ethernet/fealnx.c
+++ b/drivers/net/ethernet/fealnx.c
@@ -257,8 +257,8 @@ enum rx_desc_status_bits {
	RXFSD = 0x0800, /* first descriptor */

Linux 4.9.65

2017-11-24 Thread Greg KH
I'm announcing the release of the 4.9.65 kernel.

All users of the 4.9 kernel series must upgrade.

The updated 4.9.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git 
linux-4.9.y
and can be browsed at the normal kernel.org git web browser:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary

thanks,

greg k-h



 Makefile  |2 +-
 crypto/dh.c   |   34 +++---
 drivers/char/ipmi/ipmi_msghandler.c   |   10 ++
 drivers/dma/dmatest.c |1 +
 drivers/net/bonding/bond_main.c   |2 +-
 drivers/net/ethernet/fealnx.c |6 +++---
 drivers/net/usb/asix_devices.c|4 ++--
 drivers/net/usb/cdc_ether.c   |2 +-
 drivers/net/usb/qmi_wwan.c|3 ++-
 drivers/net/vrf.c |2 +-
 drivers/tty/serial/8250/8250_fintek.c |3 +++
 drivers/tty/serial/omap-serial.c  |2 +-
 fs/coda/upcall.c  |3 +--
 fs/ocfs2/dlm/dlmrecovery.c|1 +
 fs/ocfs2/file.c   |9 +++--
 include/linux/mmzone.h|3 ++-
 include/linux/skbuff.h|7 +++
 mm/page_alloc.c   |   27 ++-
 mm/pagewalk.c |6 +-
 net/8021q/vlan.c  |6 +++---
 net/core/skbuff.c |1 +
 net/ipv4/tcp_nv.c |2 +-
 net/ipv4/tcp_output.c |9 ++---
 net/netlink/af_netlink.c  |   17 +++--
 net/netlink/af_netlink.h  |1 +
 net/sctp/ipv6.c   |5 +++--
 net/sctp/socket.c |4 
 security/integrity/ima/ima_appraise.c |3 +++
 28 files changed, 107 insertions(+), 68 deletions(-)

Adam Wallis (1):
  dmaengine: dmatest: warn user when dma test times out

Andrey Konovalov (1):
  net: usb: asix: fill null-ptr-deref in asix_suspend

Bjørn Mork (2):
  net: cdc_ether: fix divide by 0 on bad descriptors
  net: qmi_wwan: fix divide by 0 on bad descriptors

Changwei Ge (1):
  ocfs2: fix cluster hang after a node dies

Cong Wang (1):
  vlan: fix a use-after-free in vlan_device_event()

Corey Minyard (1):
  ipmi: fix unsigned long underflow

Eric Biggers (1):
  crypto: dh - Fix double free of ctx->p

Eric Dumazet (1):
  tcp: do not mangle skb->cb[] in tcp_make_synack()

Eric W. Biederman (1):
  net/sctp: Always set scope_id in sctp_inet6_skb_msgname

Greg Kroah-Hartman (1):
  Linux 4.9.65

Hangbin Liu (1):
  bonding: discard lowest hash bit for 802.3ad layer3+4

Huacai Chen (1):
  fealnx: Fix building error on MIPS

Jan Harkes (1):
  coda: fix 'kernel memory exposure attempt' in fsync

Jann Horn (1):
  mm/pagewalk.c: report holes in hugetlb ranges

Jason A. Donenfeld (1):
  af_netlink: ensure that NLMSG_DONE never fails in dumps

Jeff Barnhill (1):
  net: vrf: correct FRA_L3MDEV encode type

Ji-Ze Hong (Peter Hong) (1):
  serial: 8250_fintek: Fix finding base_port with activated SuperIO

Konstantin Khlebnikov (1):
  tcp_nv: fix division by zero in tcpnv_acked()

Kristian Evensen (1):
  qmi_wwan: Add missing skb_reset_mac_header-call

Lukas Wunner (1):
  serial: omap: Fix EFR write on RTS deassertion

Pavel Tatashin (1):
  mm/page_alloc.c: broken deferred calculation

Roberto Sassu (1):
  ima: do not update security.ima if appraisal status is not INTEGRITY_PASS

Tudor-Dan Ambarus (1):
  crypto: dh - fix memleak in setkey

Xin Long (1):
  sctp: do not peel off an assoc from one netns to another one

Ye Yin (1):
  netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed

alex chen (1):
  ocfs2: should wait dio before inode lock in ocfs2_setattr()





Re: Linux 4.13.16

2017-11-24 Thread Greg KH
diff --git a/Makefile b/Makefile
index 3bd5d9d148d3..bc9a897e0431 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 VERSION = 4
 PATCHLEVEL = 13
-SUBLEVEL = 15
+SUBLEVEL = 16
 EXTRAVERSION =
 NAME = Fearless Coyote
 
diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c 
b/arch/x86/kernel/cpu/intel_cacheinfo.c
index c55fb2cb2acc..24f749324c0f 100644
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -811,7 +811,24 @@ static int __cache_amd_cpumap_setup(unsigned int cpu, int 
index,
struct cacheinfo *this_leaf;
int i, sibling;
 
-   if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+   /*
+* For L3, always use the pre-calculated cpu_llc_shared_mask
+* to derive shared_cpu_map.
+*/
+   if (index == 3) {
+   for_each_cpu(i, cpu_llc_shared_mask(cpu)) {
+   this_cpu_ci = get_cpu_cacheinfo(i);
+   if (!this_cpu_ci->info_list)
+   continue;
+   this_leaf = this_cpu_ci->info_list + index;
+   for_each_cpu(sibling, cpu_llc_shared_mask(cpu)) {
+   if (!cpu_online(sibling))
+   continue;
+   cpumask_set_cpu(sibling,
+   &this_leaf->shared_cpu_map);
+   }
+   }
+   } else if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
unsigned int apicid, nshared, first, last;
 
this_leaf = this_cpu_ci->info_list + index;
@@ -839,19 +856,6 @@ static int __cache_amd_cpumap_setup(unsigned int cpu, int 
index,
&this_leaf->shared_cpu_map);
}
}
-   } else if (index == 3) {
-   for_each_cpu(i, cpu_llc_shared_mask(cpu)) {
-   this_cpu_ci = get_cpu_cacheinfo(i);
-   if (!this_cpu_ci->info_list)
-   continue;
-   this_leaf = this_cpu_ci->info_list + index;
-   for_each_cpu(sibling, cpu_llc_shared_mask(cpu)) {
-   if (!cpu_online(sibling))
-   continue;
-   cpumask_set_cpu(sibling,
-   &this_leaf->shared_cpu_map);
-   }
-   }
} else
return 0;
 
diff --git a/drivers/char/ipmi/ipmi_msghandler.c 
b/drivers/char/ipmi/ipmi_msghandler.c
index 810b138f5897..c82d9fd2f05a 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4030,7 +4030,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struct ipmi_recv_msg 
*recv_msg,
 }
 
 static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
  int slot, unsigned long *flags,
  unsigned int *waiting_msgs)
 {
@@ -4043,8 +4044,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
if (!ent->inuse)
return;
 
-   ent->timeout -= timeout_period;
-   if (ent->timeout > 0) {
+   if (timeout_period < ent->timeout) {
+   ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4110,7 +4111,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
}
 }
 
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+unsigned long timeout_period)
 {
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
diff --git a/drivers/char/tpm/tpm-dev-common.c 
b/drivers/char/tpm/tpm-dev-common.c
index 610638a80383..461bf0b8a094 100644
--- a/drivers/char/tpm/tpm-dev-common.c
+++ b/drivers/char/tpm/tpm-dev-common.c
@@ -110,6 +110,12 @@ ssize_t tpm_common_write(struct file *file, const char 
__user *buf,
return -EFAULT;
}
 
+   if (in_size < 6 ||
+   in_size < be32_to_cpu(*((__be32 *) (priv->data_buffer + 2)))) {
+   mutex_unlock(&priv->buffer_mutex);
+   return -EINVAL;
+   }
+
/* atomic tpm command send and result receive. We only hold the ops
 * lock during this period so that the tpm can be unregistered even if
 * the char dev is held open.
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c99dc59d729b..76e8054bfc4e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3253,7 +3253,7 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)

Linux 4.13.16

2017-11-24 Thread Greg KH
---
Please note, this is the LAST 4.13.y kernel to be released, it is now
end-of-life.

Move to 4.14.y now.
---

I'm announcing the release of the 4.13.16 kernel.

All users of the 4.13 kernel series must upgrade.

The updated 4.13.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git 
linux-4.13.y
and can be browsed at the normal kernel.org git web browser:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary

thanks,

greg k-h



 Makefile|2 -
 arch/x86/kernel/cpu/intel_cacheinfo.c   |   32 +---
 drivers/char/ipmi/ipmi_msghandler.c |   10 ---
 drivers/char/tpm/tpm-dev-common.c   |6 
 drivers/net/bonding/bond_main.c |2 -
 drivers/net/ethernet/broadcom/bcmsysport.c  |   10 ---
 drivers/net/ethernet/fealnx.c   |6 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |   12 +++--
 drivers/net/ethernet/mellanox/mlx5/core/main.c  |7 +
 drivers/net/usb/asix_devices.c  |4 +--
 drivers/net/usb/cdc_ether.c |2 -
 drivers/net/usb/cdc_ncm.c   |4 +--
 drivers/net/usb/qmi_wwan.c  |3 +-
 drivers/net/vrf.c   |2 -
 drivers/net/vxlan.c |   31 +--
 drivers/tty/serial/8250/8250_fintek.c   |3 ++
 drivers/tty/serial/omap-serial.c|2 -
 fs/coda/upcall.c|3 --
 fs/ocfs2/dlm/dlmrecovery.c  |1 
 fs/ocfs2/file.c |9 +-
 include/linux/mmzone.h  |3 +-
 include/linux/skbuff.h  |7 +
 kernel/rcu/tree_plugin.h|2 -
 mm/page_alloc.c |   27 +---
 mm/page_ext.c   |4 ---
 mm/pagewalk.c   |6 +++-
 net/8021q/vlan.c|6 ++--
 net/core/skbuff.c   |1 
 net/ipv4/tcp_input.c|3 --
 net/ipv4/tcp_nv.c   |2 -
 net/ipv4/tcp_offload.c  |   12 +++--
 net/ipv4/tcp_output.c   |9 +-
 net/l2tp/l2tp_ip.c  |   24 ++
 net/l2tp/l2tp_ip6.c |   24 ++
 net/netlink/af_netlink.c|   17 
 net/netlink/af_netlink.h|1 
 net/sctp/ipv6.c |5 ++-
 net/sctp/socket.c   |4 +++
 security/integrity/ima/ima_appraise.c   |3 ++
 39 files changed, 178 insertions(+), 133 deletions(-)

Alexander Steffen (1):
  tpm-dev-common: Reject too short writes

Andrey Konovalov (1):
  net: usb: asix: fill null-ptr-deref in asix_suspend

Bjørn Mork (3):
  net: cdc_ether: fix divide by 0 on bad descriptors
  net: qmi_wwan: fix divide by 0 on bad descriptors
  net: cdc_ncm: GetNtbFormat endian fix

Changwei Ge (1):
  ocfs2: fix cluster hang after a node dies

Cong Wang (1):
  vlan: fix a use-after-free in vlan_device_event()

Corey Minyard (1):
  ipmi: fix unsigned long underflow

Eric Dumazet (2):
  tcp: do not mangle skb->cb[] in tcp_make_synack()
  tcp: gso: avoid refcount_t warning from tcp_gso_segment()

Eric W. Biederman (1):
  net/sctp: Always set scope_id in sctp_inet6_skb_msgname

Florian Fainelli (1):
  net: systemport: Correct IPG length settings

Greg Kroah-Hartman (1):
  Linux 4.13.16

Guillaume Nault (1):
  l2tp: don't use l2tp_tunnel_find() in l2tp_ip and l2tp_ip6

Hangbin Liu (1):
  bonding: discard lowest hash bit for 802.3ad layer3+4

Huacai Chen (1):
  fealnx: Fix building error on MIPS

Huy Nguyen (1):
  net/mlx5: Cancel health poll before sending panic teardown command

Inbar Karmy (1):
  net/mlx5e: Set page to null in case dma mapping fails

Jaewon Kim (1):
  mm/page_ext.c: check if page_ext is not prepared

Jan Harkes (1):
  coda: fix 'kernel memory exposure attempt' in fsync

Jann Horn (1):
  mm/pagewalk.c: report holes in hugetlb ranges

Jason A. Donenfeld (1):
  af_netlink: ensure that NLMSG_DONE never fails in dumps

Jeff Barnhill (1):
  net: vrf: correct FRA_L3MDEV encode type

Ji-Ze Hong (Peter Hong) (1):
  serial: 8250_fintek: Fix finding base_port with activated SuperIO

Konstantin Khlebnikov (1):
  tcp_nv: fix division by zero in tcpnv_acked()

Kristian Evensen (1):
  qmi_wwan: Add missing skb_reset_mac_header-call

Lukas Wunner (1):
  serial: omap: Fix EFR write on RTS deassertion

Linux 4.14.2

2017-11-24 Thread Greg KH
I'm announcing the release of the 4.14.2 kernel.

All users of the 4.14 kernel series must upgrade.

The updated 4.14.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git 
linux-4.14.y
and can be browsed at the normal kernel.org git web browser:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary

thanks,

greg k-h



 Makefile  |2 +-
 block/bio.c   |1 +
 drivers/char/ipmi/ipmi_msghandler.c   |   10 ++
 drivers/char/ipmi/ipmi_si_intf.c  |   33 +++--
 drivers/char/tpm/tpm-dev-common.c |6 ++
 drivers/net/ethernet/fealnx.c |6 +++---
 drivers/net/usb/cdc_ncm.c |4 ++--
 drivers/net/vxlan.c   |   31 +--
 drivers/tty/serial/8250/8250_fintek.c |3 +++
 drivers/tty/serial/omap-serial.c  |2 +-
 fs/coda/upcall.c  |3 +--
 fs/ocfs2/dlm/dlmrecovery.c|1 +
 fs/ocfs2/file.c   |9 +++--
 include/linux/mmzone.h|3 ++-
 kernel/rcu/tree_plugin.h  |2 +-
 mm/page_alloc.c   |   27 ++-
 mm/page_ext.c |4 
 mm/pagewalk.c |6 +-
 net/netlink/af_netlink.c  |   17 +++--
 net/netlink/af_netlink.h  |1 +
 net/sctp/ipv6.c   |5 +++--
 security/integrity/ima/ima_appraise.c |3 +++
 22 files changed, 112 insertions(+), 67 deletions(-)

Alexander Steffen (1):
  tpm-dev-common: Reject too short writes

Bjørn Mork (1):
  net: cdc_ncm: GetNtbFormat endian fix

Changwei Ge (1):
  ocfs2: fix cluster hang after a node dies

Corey Minyard (2):
  ipmi: fix unsigned long underflow
  ipmi: Prefer ACPI system interfaces over SMBIOS ones

Eric W. Biederman (1):
  net/sctp: Always set scope_id in sctp_inet6_skb_msgname

Greg Kroah-Hartman (1):
  Linux 4.14.2

Huacai Chen (1):
  fealnx: Fix building error on MIPS

Jaewon Kim (1):
  mm/page_ext.c: check if page_ext is not prepared

Jan Harkes (1):
  coda: fix 'kernel memory exposure attempt' in fsync

Jann Horn (1):
  mm/pagewalk.c: report holes in hugetlb ranges

Jason A. Donenfeld (1):
  af_netlink: ensure that NLMSG_DONE never fails in dumps

Ji-Ze Hong (Peter Hong) (1):
  serial: 8250_fintek: Fix finding base_port with activated SuperIO

Lukas Wunner (1):
  serial: omap: Fix EFR write on RTS deassertion

Michael Lyle (1):
  bio: ensure __bio_clone_fast copies bi_partno

Neeraj Upadhyay (1):
  rcu: Fix up pending cbs check in rcu_prepare_for_idle

Pavel Tatashin (1):
  mm/page_alloc.c: broken deferred calculation

Roberto Sassu (1):
  ima: do not update security.ima if appraisal status is not INTEGRITY_PASS

Xin Long (1):
  vxlan: fix the issue that neigh proxy blocks all icmpv6 packets

alex chen (1):
  ocfs2: should wait dio before inode lock in ocfs2_setattr()





Re: Linux 4.14.2

2017-11-24 Thread Greg KH
diff --git a/Makefile b/Makefile
index 01f9df1af256..75d89dc2b94a 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 VERSION = 4
 PATCHLEVEL = 14
-SUBLEVEL = 1
+SUBLEVEL = 2
 EXTRAVERSION =
 NAME = Petit Gorille
 
diff --git a/block/bio.c b/block/bio.c
index 101c2a9b5481..33fa6b4af312 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -597,6 +597,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 * so we don't set nor calculate new physical/hw segment counts here
 */
bio->bi_disk = bio_src->bi_disk;
+   bio->bi_partno = bio_src->bi_partno;
bio_set_flag(bio, BIO_CLONED);
bio->bi_opf = bio_src->bi_opf;
bio->bi_write_hint = bio_src->bi_write_hint;
diff --git a/drivers/char/ipmi/ipmi_msghandler.c 
b/drivers/char/ipmi/ipmi_msghandler.c
index 810b138f5897..c82d9fd2f05a 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4030,7 +4030,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struct ipmi_recv_msg 
*recv_msg,
 }
 
 static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
  int slot, unsigned long *flags,
  unsigned int *waiting_msgs)
 {
@@ -4043,8 +4044,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
if (!ent->inuse)
return;
 
-   ent->timeout -= timeout_period;
-   if (ent->timeout > 0) {
+   if (timeout_period < ent->timeout) {
+   ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4110,7 +4111,8 @@ static void check_msg_timeout(ipmi_smi_t intf, struct 
seq_table *ent,
}
 }
 
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+unsigned long timeout_period)
 {
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 36f47e8d06a3..bc3984ffe867 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -3424,7 +3424,7 @@ static inline void wait_for_timer_and_thread(struct 
smi_info *smi_info)
del_timer_sync(&smi_info->si_timer);
 }
 
-static int is_new_interface(struct smi_info *info)
+static struct smi_info *find_dup_si(struct smi_info *info)
 {
struct smi_info *e;
 
@@ -3439,24 +3439,36 @@ static int is_new_interface(struct smi_info *info)
 */
if (info->slave_addr && !e->slave_addr)
e->slave_addr = info->slave_addr;
-   return 0;
+   return e;
}
}
 
-   return 1;
+   return NULL;
 }
 
 static int add_smi(struct smi_info *new_smi)
 {
int rv = 0;
+   struct smi_info *dup;
 
mutex_lock(&smi_infos_lock);
-   if (!is_new_interface(new_smi)) {
-   pr_info(PFX "%s-specified %s state machine: duplicate\n",
-   ipmi_addr_src_to_str(new_smi->addr_source),
-   si_to_str[new_smi->si_type]);
-   rv = -EBUSY;
-   goto out_err;
+   dup = find_dup_si(new_smi);
+   if (dup) {
+   if (new_smi->addr_source == SI_ACPI &&
+   dup->addr_source == SI_SMBIOS) {
+   /* We prefer ACPI over SMBIOS. */
+   dev_info(dup->dev,
+"Removing SMBIOS-specified %s state machine in 
favor of ACPI\n",
+si_to_str[new_smi->si_type]);
+   cleanup_one_si(dup);
+   } else {
+   dev_info(new_smi->dev,
+"%s-specified %s state machine: duplicate\n",
+ipmi_addr_src_to_str(new_smi->addr_source),
+si_to_str[new_smi->si_type]);
+   rv = -EBUSY;
+   goto out_err;
+   }
}
 
pr_info(PFX "Adding %s-specified %s state machine\n",
@@ -3865,7 +3877,8 @@ static void cleanup_one_si(struct smi_info *to_clean)
poll(to_clean);
schedule_timeout_uninterruptible(1);
}
-   disable_si_irq(to_clean, false);
+   if (to_clean->handlers)
+   disable_si_irq(to_clean, false);
while (to_clean->curr_msg || (to_clean->si_state != SI_NORMAL)) {
poll(to_clean);
schedule_timeout_uninterruptible(1);
diff --git a/drivers/char/tpm/tpm-dev-common.c 
b/drivers/char/tpm/tpm-dev-comm

[PATCH v2] ARM64: crypto: do not call crypto_unregister_skcipher twice on error

2017-11-24 Thread Corentin Labbe
When a cipher fails to register in aes_init(), the error path goes through
aes_exit() and then crypto_unregister_skciphers().
Since aes_exit() also calls crypto_unregister_skciphers(), this triggers a
"refcount_t: underflow; use-after-free" warning.

Signed-off-by: Corentin Labbe 
---
Changes since v1:
- Instead of duplicating the code from aes_exit() minus
crypto_unregister_skciphers(), simply call it and return afterwards,
as suggested by Ard Biesheuvel
 arch/arm64/crypto/aes-glue.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 998ba519a026..2fa850e86aa8 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -665,6 +665,7 @@ static int __init aes_init(void)
 
 unregister_simds:
aes_exit();
+   return err;
 unregister_ciphers:
crypto_unregister_skciphers(aes_algs, ARRAY_SIZE(aes_algs));
return err;
-- 
2.13.6



Re: [PATCH v2 13/13] nubus: Add support for the driver model

2017-11-24 Thread Greg Kroah-Hartman
On Fri, Nov 24, 2017 at 10:40:20AM +1100, Finn Thain wrote:
> On Thu, 23 Nov 2017, Greg Kroah-Hartman wrote:
> 
> > On Thu, Nov 23, 2017 at 11:24:38AM +1100, Finn Thain wrote:
> > > On Mon, 20 Nov 2017, I wrote:
> > > 
> > > > > You need to free up the memory allocated, and I don't see that 
> > > > > happening here ... The kernel should yell at you ...
> > > 
> > > > 
> > > > 	WARN(1, KERN_ERR "Device '%s' does not have a release() "
> > > > 		"function, it is broken and must be fixed.\n",
> > > > 		dev_name(dev));
> > > > 
> > > > This won't fire unless device_del() is called, right?
> > > 
> > > Sorry, I should have written, "This won't fire unless 
> > > device_unregister() is called, right?" -- though I guess it could be 
> > > any call to put_device().
> > > 
> > > If need be I can add code to cleanly tear down the bus devices and the 
> > > associated linked lists and procfs structures, just prior to kernel 
> > > termination, as a kernel exitcall. But I don't see this pattern in 
> > > use.
> > 
> > When the kernel shuts down, no, the devices are not removed.
> > 
> > But what happens when the bus code is unloaded if it is built as a 
> > module?  The devices will be removed then.  Or they should be.
> > 
> 
> This bus driver is not a module.

It can not be built as a module ever?

> > So please implement the remove device code path,
> 
> OK.
> 
> > just because some other busses are buggy that way does not mean you need 
> > to duplicate their incorrect behavior.
> > 
> 
> Actually, I think the bug is in porting.txt, when it says "Optionally, the 
> bus driver may set the device's name and release fields."

Yes, it's not required for a bus, but _someone_ has to set the device
release function.  If it's not the bus, it can be the class, or the
device type, otherwise you will trip the warning message in
device_release() when the device finally gets freed.
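
Concretely, the minimal arrangement is for the bus to point dev->release at
a function that frees the containing structure; a sketch with hypothetical
names (assuming the board structure is kmalloc'd at scan time):

#include <linux/device.h>
#include <linux/slab.h>

struct nubus_board {
	struct device dev;
	/* ... bus-specific fields ... */
};

/* called by the driver core once the last reference is dropped */
static void nubus_device_release(struct device *dev)
{
	struct nubus_board *board = container_of(dev, struct nubus_board, dev);

	kfree(board);
}

static int nubus_register_board(struct nubus_board *board)
{
	board->dev.release = nubus_device_release;
	return device_register(&board->dev);
}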

thanks,

greg k-h


Re: [kernel-hardening] [PATCH v3 1/2] Protected FIFOs and regular files

2017-11-24 Thread Salvatore Mesoraca
2017-11-23 23:43 GMT+01:00 Tobin C. Harding :
> On Wed, Nov 22, 2017 at 09:01:45AM +0100, Salvatore Mesoraca wrote:
>
> Please take these comments in all humility, my English is a long way
> from perfect. These are English grammar comments only. If this is viewed
> as trivial please stop reading now and ignore.

Any help is always greatly appreciated!
And I like your proposed changes, they sound better to me too.
Thank you for your time,

Salvatore


Re: [PATCH] r8152: disable rx checksum offload on Dell TB dock

2017-11-24 Thread Greg KH
On Fri, Nov 24, 2017 at 11:44:02AM +0800, Kai Heng Feng wrote:
> 
> 
> > On 23 Nov 2017, at 5:24 PM, Greg KH  wrote:
> > 
> > On Thu, Nov 23, 2017 at 04:53:41PM +0800, Kai Heng Feng wrote:
> >> 
> >> What I want to do here is to find this connection:
> >> Realtek r8153 <-> SMSC hub (USB ID: 0424:5537) <-> 
> >> ASMedia XHCI controller (PCI ID: 1b21:1142).
> >> 
> >> Is there a safer way to do this?
> > 
> > Nope!  You can't do that at all from within a USB driver, sorry.  As you
> > really should not care at all :)
> 
> Got it :)
> 
> The r8153 in the Dell TB dock has version information, RTL_VER_05.
> We can use it to check for the workaround, but many working RTL_VER_05
> devices will also be affected.
> Do you think it’s an acceptable compromise?

I think all of the users of this device that is working just fine for
them would not like that to happen :(

> >> I have an r8153 <-> USB 3.0 dongle which works just fine. I can’t find any 
> >> information to differentiate them. Hence I want to use the connection to
> >> identify if r8153 is on a Dell TB dock.
> > 
> > Are you sure there is nothing different in the version or release number
> > of the device?  'lsusb -v' shows the exact same information for both
> > devices?
> 
> Yes. I attached `lsusb -v` for r8153 on Dell TB dock, on a RJ45 <-> USB 3.0 
> dongle,
> and on a RJ45 <-> USB Type-C dongle.

The bcdDevice is different between the dock device and the "real"
device, why not use that?

> >> Yes. From what I know, ASMedia is working on it, but not sure how long it
> >> will take. In the meantime, I’d like to workaround this issue for the 
> >> users.
> > 
> > Again, it's a host controller bug, it should be fixed there, don't try
> > to paper over the real issue in different individual drivers.
> > 
> > I think I've seen various patches on the linux-usb list for this
> > controller already, have you tried them?
> 
> Yes. These patches are all in mainline Linux now.

Then there is still a bug.  Who at ASMedia is working on this, and have they
posted anything to the linux-usb mailing list about it?

> >> Actually no.
> >> I just plugged the r8153 dongle into the same hub; surprisingly, the issue
> >> doesn’t happen in this scenario.
> > 
> > Then something seems to be wrong with the device itself, as that would
> > be the same exact electrical/logical path, right?
> 
> I have no idea why the externally plugged one doesn’t have this issue.
> Maybe it’s related to how it’s wired inside the Dell TB dock...

Maybe.  Have you tried using usbmon to see what the data streams are for
the two devices and where they have problems and diverge?  Is the dock
device doing something different in response to something from the host
that the "real" device does not do?

thanks,

greg k-h


Re: [PATCH] MAINTAINERS: change maintainer for Rockchip drm drivers

2017-11-24 Thread Heiko Stuebner
Hi Daniel,

[somehow my email address seems to have gotten lost, so I
only saw this by chance on the list itself now.
I've also re-added Sandy to the recipients]

Am Montag, 20. November 2017, 08:48:48 CET schrieb Daniel Vetter:
> On Mon, Nov 13, 2017 at 06:15:31PM +0800, Mark Yao wrote:
> > For personal reasons, Mark Yao will leave Rockchip and
> > cannot continue to maintain drm/rockchip; Sandy Huang
> > will take over drm/rockchip.
> > 
> > Cc: Sandy Huang 
> > Cc: Heiko Stuebner 
> > 
> > Signed-off-by: Mark Yao 
> 
> Since rockchip is in drm-misc that means we need a new maintainer who also
> has drm-misc commit rights. Sandy doesn't yet, so if Sandy is going to be
> the new maintainer we need to fix that.
> 
> Also, Heiko, are you interested in becoming co-maintainer? With commit
> rights and all.

I always feel somewhat inadequate judging the fast-paced drm stuff, as in
the past, once I got my head wrapped around something, drm had always
somehow moved another mile already ;-) .

But somewhere I read that the drm pace for big changes is supposed to slow
down a bit now that atomic modesetting is done in a lot of places, so we
could give co-maintainership for the Rockchip drm a try - with Sandy
wearing the actual hat for big changes though ;-) .


Heiko


> -Daniel
> 
> > ---
> >  MAINTAINERS | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 0d77f22..31bf080 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -4627,7 +4627,7 @@ F:
> > Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt
> >  F: Documentation/devicetree/bindings/display/renesas,du.txt
> >  
> >  DRM DRIVERS FOR ROCKCHIP
> > -M: Mark Yao 
> > +M: Sandy Huang 
> >  L: dri-de...@lists.freedesktop.org
> >  S: Maintained
> >  F: drivers/gpu/drm/rockchip/
> 
> 




RE: [PATCH 4/4] dm: convert table_device.count from atomic_t to refcount_t

2017-11-24 Thread Reshetova, Elena

> On 20.10.2017 at 09:37, Elena Reshetova wrote:
> > atomic_t variables are currently used to implement reference
> > counters with the following properties:
> >   - counter is initialized to 1 using atomic_set()
> >   - a resource is freed upon counter reaching zero
> >   - once counter reaches zero, its further
> > increments aren't allowed
> >   - counter schema uses basic atomic operations
> > (set, inc, inc_not_zero, dec_and_test, etc.)
> >
> > Such atomic variables should be converted to a newly provided
> > refcount_t type and API that prevents accidental counter overflows
> > and underflows. This is important since overflows and underflows
> > can lead to use-after-free situation and be exploitable.
> >
> > The variable table_device.count is used as pure reference counter.
> > Convert it to refcount_t and fix up the operations.
> >
> > Suggested-by: Kees Cook 
> > Reviewed-by: David Windsor 
> > Reviewed-by: Hans Liljestrand 
> > Signed-off-by: Elena Reshetova 
> > ---
> >   drivers/md/dm.c | 12 +++-
> >   1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 4be8532..be12f3f 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -24,6 +24,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >
> >   #define DM_MSG_PREFIX "core"
> >
> > @@ -98,7 +99,7 @@ struct dm_md_mempools {
> >
> >   struct table_device {
> > struct list_head list;
> > -   atomic_t count;
> > +   refcount_t count;
> > struct dm_dev dm_dev;
> >   };
> >
> > @@ -685,10 +686,11 @@ int dm_get_table_device(struct mapped_device *md,
> dev_t dev, fmode_t mode,
> >
> > format_dev_t(td->dm_dev.name, dev);
> >
> > -   atomic_set(&td->count, 0);
> > +   refcount_set(&td->count, 1);
> > list_add(&td->list, &md->table_devices);
> > +   } else {
> > +   refcount_inc(&td->count);
> > }
> > -   atomic_inc(&td->count);
> > mutex_unlock(&md->table_devices_lock);
> >
> 
> 
> NACK
> 
> This patch (2a0b4682e09d76466f7b8f5e347ae2ff02f033af) currently breaks
> accounting of opened devices.
> 
> I.e. a multisegment device (a target with 3 segments) is not properly accounted.

Could you please explain what exactly happens (i.e. a missing/wrong
increment?) or provide an error dump?
By looking at the code, I don't see where the change in the reference
counting could have caused this.

Best Regards,
Elena.

> 
> 
> The patch needs reworking, and users of 'dm' on a 4.15-rc0 kernel should
> rather switch back to 4.14 ATM, as it's unclear which other parts can be
> affected.
> 
> Zdenek
> 
> > *result = &td->dm_dev;
> > @@ -701,7 +703,7 @@ void dm_put_table_device(struct mapped_device *md,
> struct dm_dev *d)
> > struct table_device *td = container_of(d, struct table_device,
> dm_dev);
> >
> > mutex_lock(&md->table_devices_lock);
> > -   if (atomic_dec_and_test(&td->count)) {
> > +   if (refcount_dec_and_test(&td->count)) {
> > close_table_device(td, md);
> > list_del(&td->list);
> > kfree(td);
> > @@ -718,7 +720,7 @@ static void free_table_devices(struct list_head 
> > *devices)
> > struct table_device *td = list_entry(tmp, struct
> table_device, list);
> >
> > DMWARN("dm_destroy: %s still exists with %d
> references",
> > -  td->dm_dev.name, atomic_read(&td->count));
> > +  td->dm_dev.name, refcount_read(&td->count));
> > kfree(td);
> > }
> >   }
> >



Re: [PATCH] r8152: disable rx checksum offload on Dell TB dock

2017-11-24 Thread Greg KH
On Fri, Nov 24, 2017 at 09:28:05AM +0100, Greg KH wrote:
> On Fri, Nov 24, 2017 at 11:44:02AM +0800, Kai Heng Feng wrote:
> > 
> > 
> > > On 23 Nov 2017, at 5:24 PM, Greg KH  wrote:
> > > 
> > > On Thu, Nov 23, 2017 at 04:53:41PM +0800, Kai Heng Feng wrote:
> > >> 
> > >> What I want to do here is to find this connection:
> > >> Realtek r8153 <-> SMSC hub (USB ID: 0424:5537) <->
> > >> ASMedia XHCI controller (PCI ID: 1b21:1142).
> > >> 
> > >> Is there a safer way to do this?
> > > 
> > > Nope!  You can't do that at all from within a USB driver, sorry.  As you
> > > really should not care at all :)
> > 
> > Got it :)
> > 
> > The r8153 in Dell TB dock has version information, RTL_VER_05.
> > We can use it to check for workaround, but many working RTL_VER_05 devices
> > will also be affected.
> > Do you think it’s an acceptable compromise?
> 
> I think all of the users of this device that is working just fine for
> them would not like that to happen :(
> 
> > >> I have a r8153 <-> USB 3.0 dongle which work just fine. I can’t find any 
> > >> information to differentiate them. Hence I want to use the connection to
> > >> identify if r8153 is on a Dell TB dock.
> > > 
> > > Are you sure there is nothing different in the version or release number
> > > of the device?  'lsusb -v' shows the exact same information for both
> > > devices?
> > 
> > Yes. I attached `lsusb -v` for r8153 on Dell TB dock, on a RJ45 <-> USB 3.0 
> > dongle,
> > and on a RJ45 <-> USB Type-C dongle.
> 
> The bcdDevice is different between the dock device and the "real"
> device, why not use that?

Also the MAC address is different, can you just trigger off of Dell's
MAC address space instead of the address space of the dongle device?

thanks,

greg k-h


Re: [PATCH 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-24 Thread Peter Zijlstra
On Thu, Nov 23, 2017 at 10:31:29PM -0800, Alexei Starovoitov wrote:
> unfortunately 32-bit is more screwed than it seems:
> 
> $ cat align.c
> #include <stdio.h>
> 
> struct S {
>   unsigned long long a;
> } s;
> 
> struct U {
>   unsigned long long a;
> } u;
> 
> int main()
> {
> printf("%d, %d\n", sizeof(unsigned long long),
>__alignof__(unsigned long long));
> printf("%d, %d\n", sizeof(s), __alignof__(s));
> printf("%d, %d\n", sizeof(u), __alignof__(u));
> }
> $ gcc -m32 align.c
> $ ./a.out
> 8, 8
> 8, 4
> 8, 4

*blink* how is that even correct? I understood the spec to say the
alignment of composite types should be the max alignment of any of its
member types (otherwise it cannot guarantee the alignment of its
members).

> so we have to use __aligned_u64 in uapi.

Ideally yes, but effectively it most often doesn't matter.
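
As an illustration (a minimal sketch, not from the thread) of what
__aligned_u64 buys on ILP32, given that uapi defines it as a u64 carrying
an explicit aligned(8) attribute:

	#include <stdio.h>

	/* Plain u64 member: the struct's alignment drops to 4 with gcc -m32. */
	struct plain   { unsigned long long a; };
	/* __aligned_u64-style member: alignment stays 8 even on ILP32. */
	struct aligned { unsigned long long a __attribute__((aligned(8))); };

	int main(void)
	{
		printf("%zu %zu\n", __alignof__(struct plain),
		       __alignof__(struct aligned));	/* "4 8" with -m32 */
		return 0;
	}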

> Otherwise, yes, we could have used config1 and config2 to pass pointers
> to the kernel, but since they're defined as __u64 already we cannot
> change them and have to do this ugly dance around 'config' field.

I don't understand the reasoning why you cannot use them. Even if they
are not naturally aligned on x86_32, why would it matter?

x86_32 needs two loads in any case, but there is no concurrency, so
split loads are not a problem. Add to that that 'intptr_t' on ILP32
is in fact only a single u32 and thus the other u32 will always be 0.

So yes, alignment is screwy, but I really don't see who cares and why it
would matter in practise.

Please explain.


Re: [PATCH v3 2/2] Protected O_CREAT open in sticky directories

2017-11-24 Thread Salvatore Mesoraca
2017-11-22 14:22 GMT+01:00 Matthew Wilcox :
> On Wed, Nov 22, 2017 at 09:01:46AM +0100, Salvatore Mesoraca wrote:
>> +An O_CREAT open missing the O_EXCL flag in a sticky directory is,
>> +often, a bug or a symptom of the fact that the program is not
>> +using appropriate procedures to access sticky directories.
>> +This protection allows detecting and possibly blocking these unsafe
>> +open invocations, even if the files don't exist yet.
>> +It should be noted, though, that sometimes it's OK to open a file
>> +with O_CREAT and without O_EXCL (e.g. shared lock files based
>> +on flock()); for this reason values above 2 should be set
>> +with care.
>> +
>> +When set to "0" the protection is disabled.
>> +
>> +When set to "1", notify about O_CREAT open missing the O_EXCL flag
>> +in world writable sticky directories.
>> +
>> +When set to "2", notify about O_CREAT open missing the O_EXCL flag
>> +in world or group writable sticky directories.
>> +
>> +When set to "3", block O_CREAT open missing the O_EXCL flag
>> +in world writable sticky directories and notify (but don't block)
>> +in group writable sticky directories.
>> +
>> +When set to "4", block O_CREAT open missing the O_EXCL flag
>> +in world writable and group writable sticky directories.
>
> This seems insufficiently flexible.  For example, there is no way for me
> to specify that I want to block O_CREAT without O_EXCL in world-writable,
> but not be notified about O_CREAT without O_EXCL in group-writable.

I understand your concern; I did it like this because I wanted to keep the
interface as simple as possible. But, maybe, more flexibility is better.

> And maybe I want to be notified that blocking has happened?

I didn't write it explicitly in the doc, but you will always be notified
that blocking has happened. On the other hand you don't have any way to
suppress those notifications.

> Why not make it bits?  So:
>
> 0 => notify in world
> 1 => block in world
> 2 => notify in group
> 3 => block in group
>
> So you'd have the following meaningful values:
>
>  0 - permit all (your option 0)
>  1 - notify world; permit group (your option 1)
>  2 - block world; permit group
>  3 - block,notify world; permit group
>  4 - permit world; notify group (?)
>  5 - notify world; notify group (your option 2)
>  6 - block world; notify group (your option 3)
>  7 - block,notify world; notify group
>  8 - permit world; block group (?)
>  9 - notify world; block group (?)
> 10 - block world; block group (your option 4)
> 11 - block,notify world; block group
> 12 - permit world; block, notify group (?)
> 13 - notify world; block, notify group (?)
> 14 - block world; block, notify group
> 15 - block, notify world; block, notify group
>
> Some of these don't make a lot of sense (marked with ?), but I don't see
> the harm in permitting a sysadmin to do something that seems nonsensical
> to me.

I like your idea of using "bits" this way, even if it will allow sysadmins
to set values that don't make too much sense.
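
As an illustration, the proposed layout in code form (the macro names here
are made up, not part of any patch):

	/* Hypothetical names for the four bits proposed above. */
	#define SPD_NOTIFY_WORLD	(1 << 0)
	#define SPD_BLOCK_WORLD		(1 << 1)
	#define SPD_NOTIFY_GROUP	(1 << 2)
	#define SPD_BLOCK_GROUP		(1 << 3)

	/* e.g. sysctl value 6 == SPD_BLOCK_WORLD | SPD_NOTIFY_GROUP, which is
	 * "block world; notify group" in the table above. */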
Thank you very much for your suggestions,

Salvatore


[PATCH v2 1/1] Input: ims-pcu - fix typo in an error log

2017-11-24 Thread Zhen Lei
1. change "to" to "too". 
2. move ")" to the front of "\n", which discovered by Joe Perches.

Signed-off-by: Zhen Lei 
Reviewed-by: Joe Perches 
---
 drivers/input/misc/ims-pcu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/input/misc/ims-pcu.c b/drivers/input/misc/ims-pcu.c
index ae47312..253ae8e 100644
--- a/drivers/input/misc/ims-pcu.c
+++ b/drivers/input/misc/ims-pcu.c
@@ -1651,7 +1651,7 @@ static void ims_pcu_buffers_free(struct ims_pcu *pcu)
return union_desc;

dev_err(&intf->dev,
-   "Union descriptor to short (%d vs %zd\n)",
+   "Union descriptor too short (%d vs %zd)\n",
union_desc->bLength, sizeof(*union_desc));
return NULL;
}
--
1.8.3




Re: [PATCH v3 2/2] Protected O_CREAT open in sticky directories

2017-11-24 Thread Salvatore Mesoraca
2017-11-22 17:51 GMT+01:00 Alan Cox :
> On Wed, 22 Nov 2017 09:01:46 +0100
> Salvatore Mesoraca  wrote:
>
>> Disallows O_CREAT open missing the O_EXCL flag, in world or
>> group writable directories, even if the file doesn't exist yet.
>> With few exceptions (e.g. shared lock files based on flock())
>
> Enough exceptions to make it a bad idea.
>
> Firstly if you care this much *stop* having shared writable directories.
> We have namespaces, you don't need them. You can give every user their
> own /tmp etc.
>
> The rest of this only make sense on a per application and directory basis
> because there are valid use cases, and that means it wants to be part of
> an existing LSM security module where you've got the context required and
> you can attach it to a specific directory and/or process.

I think that this feature should be intended more as a "debugging" feature
than as a "security" one.
When the feature implemented in the first patch is enabled, this restriction
doesn't improve security at all and it's not supposed to do it.
The first patch blocks attacks that exploit some unsafe usage of sticky
directories.
This patch, instead, doesn't block actual attacks: it detects (and maybe
blocks) the bad code that can be exploited for the attacks blocked by #1,
even if no one is attacking you at that moment.
This looks like a useful feature to me, even if you already use more
sophisticated security apparatus like LSMs or namespaces, because it makes
easy to find real vulnerabilities in software: the commit message of
patch #1 has a short list of some CVEs that this feature can detect.
Also, being just a sysctl away, it's within anyone's reach.
Probably the "debugging" goal wasn't clear from my previous message,
I'm sorry for the misunderstanding.
Thank you very much for your time,

Salvatore


Re: [kernel-hardening] [PATCH v3 2/2] Protected O_CREAT open in sticky directories

2017-11-24 Thread Salvatore Mesoraca
2017-11-23 23:57 GMT+01:00 Tobin C. Harding :
> On Wed, Nov 22, 2017 at 09:01:46AM +0100, Salvatore Mesoraca wrote:
>
> Same caveat about this being English language comments only as for patch
> 1/2. Please ignore if this is too trivial. My grammar is a long way from
> perfect, especially please feel free to ignore my placement of commas,
> they are often wrong.

As I wrote before: any help is always welcome.
Thank you,

Salvatore


Re: [PATCH 2/4] ASoC: wm2000: One function call less in wm2000_i2c_probe() after error detection

2017-11-24 Thread Charles Keepax
On Fri, Nov 24, 2017 at 08:37:41AM +0100, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Fri, 24 Nov 2017 07:45:59 +0100
> 
> The release_firmware() function was called in a few cases by the
> wm2000_i2c_probe() function during error handling even if
> the passed variable contained a null pointer.
> 
> * Adjust jump targets according to the Linux coding style convention.
> 
> * Delete the label "out" and an initialisation for the variable "fw"
>   at the beginning which became unnecessary with this refactoring.
> 
> Signed-off-by: Markus Elfring 
> ---

Acked-by: Charles Keepax 

Thanks,
Charles


Re: [PATCH 4/4] ASoC: wm2000: Improve a size determination in wm2000_i2c_probe()

2017-11-24 Thread Charles Keepax
On Fri, Nov 24, 2017 at 08:40:22AM +0100, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Fri, 24 Nov 2017 08:18:14 +0100
> 
> Replace the specification of a data structure by a pointer dereference
> as the parameter for the operator "sizeof" to make the corresponding size
> determination a bit safer according to the Linux coding style convention.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---

Acked-by: Charles Keepax 

Thanks,
Charles


Re: [PATCH 3/4] ASoC: wm2000: Fix a typo in a comment line

2017-11-24 Thread Charles Keepax
On Fri, Nov 24, 2017 at 08:39:02AM +0100, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Fri, 24 Nov 2017 08:02:57 +0100
> 
> Delete a duplicate character in a word of this description.
> 
> Signed-off-by: Markus Elfring 
> ---

Acked-by: Charles Keepax 

Thanks,
Charles


Re: [PATCH] Add slowpath enter/exit trace events

2017-11-24 Thread peter enderborg
On 11/23/2017 03:01 PM, Michal Hocko wrote:
> I am not sure adding a probe on a production system will fly in many
> cases. A static tracepoint would be much easier in that case. But I
> agree there are other means to accomplish the same thing. My main point
> was to have an easy out-of-the-box way to check latencies. But that is
> not something I would really insist on.
>
In Android, tracefs (or part of it) is the way for the system to control what
developers can access in the Linux system on commercial devices.  So it is very
much used on production systems; it is even a requirement from Google to be
certified as Android.  Things like dmesg are not.  However, this probe is at the
moment not in that scope.

My point is that you need to condense the information as much as possible,
while still keeping it useful, before making the effort to copy it to
userspace.  And for this, trace events are very useful for small systems,
since the cost is very low for events where no one is listening.
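
For completeness, this is roughly what such a static tracepoint looks like
(the event name and fields below are made up for illustration; this is not
the patch under discussion):

	/* In a trace header under include/trace/events/; sketch only. */
	TRACE_EVENT(slowpath_enter,

		TP_PROTO(unsigned int gfp_mask, int order),

		TP_ARGS(gfp_mask, order),

		TP_STRUCT__entry(
			__field(unsigned int, gfp_mask)
			__field(int, order)
		),

		TP_fast_assign(
			__entry->gfp_mask = gfp_mask;
			__entry->order = order;
		),

		TP_printk("gfp_mask=0x%x order=%d",
			  __entry->gfp_mask, __entry->order)
	);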



Re: [PATCH v3] drm: bridge: synopsys/dw-hdmi: Enable cec clock

2017-11-24 Thread Hans Verkuil
On 11/24/2017 09:04 AM, Archit Taneja wrote:
> Hi,
> 
> On 11/20/2017 06:00 PM, Hans Verkuil wrote:
>> I didn't see this merged for 4.15, is it too late to include this?
>> All other changes needed to get CEC to work on rk3288 and rk3399 are all 
>> merged.
> 
> Sorry for the late reply. I was out last week.
> 
> Dave recently sent the second pull request for 4.15, so I think it would be 
> hard to get it
> in the merge window now. We could target it for the 4.15-rcs since it is 
> preventing the
> feature from working. Is it possible to rephrase the commit message a bit so 
> that it's clear
> that we need it for CEC to work?

While it is not my patch I would propose something like this:

"Support the "cec" optional clock. The documentation already mentions "cec"
optional clock and it is used by several boards, but currently the driver
doesn't enable it, thus preventing cec from working on those boards.

And even worse: a /dev/cecX device will appear for those boards, but it
won't be functioning without configuring this clock."

I hadn't realized that last point until I started thinking about it,
but this patch is really needed.

Regards,

Hans

> 
> Thanks,
> Archit
> 
>>
>> Regards,
>>
>>  Hans
>>
>> On 10/26/2017 08:19 PM, Pierre-Hugues Husson wrote:
>>> The documentation already mentions "cec" optional clock, but
>>> currently the driver doesn't enable it.
>>>
>>> Changes:
>>> v3:
>>> - Drop useless braces
>>>
>>> v2:
>>> - Separate ENOENT errors from others
>>> - Propagate other errors (especially -EPROBE_DEFER)
>>>
>>> Signed-off-by: Pierre-Hugues Husson 
>>> ---
>>>   drivers/gpu/drm/bridge/synopsys/dw-hdmi.c | 25 +
>>>   1 file changed, 25 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/bridge/synopsys/dw-hdmi.c 
>>> b/drivers/gpu/drm/bridge/synopsys/dw-hdmi.c
>>> index bf14214fa464..d82b9747a979 100644
>>> --- a/drivers/gpu/drm/bridge/synopsys/dw-hdmi.c
>>> +++ b/drivers/gpu/drm/bridge/synopsys/dw-hdmi.c
>>> @@ -138,6 +138,7 @@ struct dw_hdmi {
>>> struct device *dev;
>>> struct clk *isfr_clk;
>>> struct clk *iahb_clk;
>>> +   struct clk *cec_clk;
>>> struct dw_hdmi_i2c *i2c;
>>>   
>>> struct hdmi_data_info hdmi_data;
>>> @@ -2382,6 +2383,26 @@ __dw_hdmi_probe(struct platform_device *pdev,
>>> goto err_isfr;
>>> }
>>>   
>>> +   hdmi->cec_clk = devm_clk_get(hdmi->dev, "cec");
>>> +   if (PTR_ERR(hdmi->cec_clk) == -ENOENT) {
>>> +   hdmi->cec_clk = NULL;
>>> +   } else if (IS_ERR(hdmi->cec_clk)) {
>>> +   ret = PTR_ERR(hdmi->cec_clk);
>>> +   if (ret != -EPROBE_DEFER)
>>> +   dev_err(hdmi->dev, "Cannot get HDMI cec clock: %d\n",
>>> +   ret);
>>> +
>>> +   hdmi->cec_clk = NULL;
>>> +   goto err_iahb;
>>> +   } else {
>>> +   ret = clk_prepare_enable(hdmi->cec_clk);
>>> +   if (ret) {
>>> +   dev_err(hdmi->dev, "Cannot enable HDMI cec clock: %d\n",
>>> +   ret);
>>> +   goto err_iahb;
>>> +   }
>>> +   }
>>> +
>>> /* Product and revision IDs */
>>> hdmi->version = (hdmi_readb(hdmi, HDMI_DESIGN_ID) << 8)
>>>   | (hdmi_readb(hdmi, HDMI_REVISION_ID) << 0);
>>> @@ -2518,6 +2539,8 @@ __dw_hdmi_probe(struct platform_device *pdev,
>>> cec_notifier_put(hdmi->cec_notifier);
>>>   
>>> clk_disable_unprepare(hdmi->iahb_clk);
>>> +   if (hdmi->cec_clk)
>>> +   clk_disable_unprepare(hdmi->cec_clk);
>>>   err_isfr:
>>> clk_disable_unprepare(hdmi->isfr_clk);
>>>   err_res:
>>> @@ -2541,6 +2564,8 @@ static void __dw_hdmi_remove(struct dw_hdmi *hdmi)
>>>   
>>> clk_disable_unprepare(hdmi->iahb_clk);
>>> clk_disable_unprepare(hdmi->isfr_clk);
>>> +   if (hdmi->cec_clk)
>>> +   clk_disable_unprepare(hdmi->cec_clk);
>>>   
>>> if (hdmi->i2c)
>>> i2c_del_adapter(&hdmi->i2c->adap);
>>>
>>
> 



[PATCH] x86/xen: support early interrupts in xen pv guests

2017-11-24 Thread Juergen Gross
Add early interrupt handlers activated by idt_setup_early_handler() to
the handlers supported by Xen pv guests. This will allow for early
WARN() calls not crashing the guest.

Suggested-by: Andy Lutomirski 
Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/segment.h | 12 
 arch/x86/mm/extable.c  |  4 +++-
 arch/x86/xen/enlighten_pv.c| 37 -
 arch/x86/xen/xen-asm_64.S  | 14 ++
 4 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index b20f9d623f9c..8f09012b92e7 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -236,11 +236,23 @@
  */
 #define EARLY_IDT_HANDLER_SIZE 9
 
+/*
+ * xen_early_idt_handler_array is for Xen pv guests: for each entry in
+ * early_idt_handler_array it contains a prequel in the form of
+ * pop %rcx; pop %r11; jmp early_idt_handler_array[i]; summing up to
+ * max 8 bytes.
+ */
+#define XEN_EARLY_IDT_HANDLER_SIZE 8
+
 #ifndef __ASSEMBLY__
 
 extern const char 
early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
 extern void early_ignore_irq(void);
 
+#if defined(CONFIG_X86_64) && defined(CONFIG_XEN_PV)
+extern const char 
xen_early_idt_handler_array[NUM_EXCEPTION_VECTORS][XEN_EARLY_IDT_HANDLER_SIZE];
+#endif
+
 /*
  * Load a segment. Fall back on loading the zero segment if something goes
  * wrong.  This variant assumes that loading zero fully clears the segment.
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 3321b446b66c..88754bfd425f 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -1,6 +1,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -212,8 +213,9 @@ void __init early_fixup_exception(struct pt_regs *regs, int 
trapnr)
 * Old CPUs leave the high bits of CS on the stack
 * undefined.  I'm not sure which CPUs do this, but at least
 * the 486 DX works this way.
+* Xen pv domains are not using the default __KERNEL_CS.
 */
-   if (regs->cs != __KERNEL_CS)
+   if (!xen_pv_domain() && regs->cs != __KERNEL_CS)
goto fail;
 
/*
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 5b2b3f3f6531..f2414c6c5e7c 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -622,7 +622,7 @@ static struct trap_array_entry trap_array[] = {
{ simd_coprocessor_error,  xen_simd_coprocessor_error,  false },
 };
 
-static bool get_trap_addr(void **addr, unsigned int ist)
+static bool __ref get_trap_addr(void **addr, unsigned int ist)
 {
unsigned int nr;
bool ist_okay = false;
@@ -644,6 +644,14 @@ static bool get_trap_addr(void **addr, unsigned int ist)
}
}
 
+   if (nr == ARRAY_SIZE(trap_array) &&
+   *addr >= (void *)early_idt_handler_array[0] &&
+   *addr < (void *)early_idt_handler_array[NUM_EXCEPTION_VECTORS]) {
+   nr = (*addr - (void *)early_idt_handler_array[0]) /
+EARLY_IDT_HANDLER_SIZE;
+   *addr = (void *)xen_early_idt_handler_array[nr];
+   }
+
if (WARN_ON(ist != 0 && !ist_okay))
return false;
 
@@ -1262,6 +1270,21 @@ asmlinkage __visible void __init xen_start_kernel(void)
xen_setup_gdt(0);
 
xen_init_irq_ops();
+
+   /* Let's presume PV guests always boot on vCPU with id 0. */
+   per_cpu(xen_vcpu_id, 0) = 0;
+
+   /*
+* Setup xen_vcpu early because idt_setup_early_handler needs it for
+* local_irq_disable(), irqs_disabled().
+*
+* Don't do the full vcpu_info placement stuff until we have
+* the cpu_possible_mask and a non-dummy shared_info.
+*/
+   xen_vcpu_info_reset(0);
+
+   idt_setup_early_handler();
+
xen_init_capabilities();
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -1295,18 +1318,6 @@ asmlinkage __visible void __init xen_start_kernel(void)
 */
acpi_numa = -1;
 #endif
-   /* Let's presume PV guests always boot on vCPU with id 0. */
-   per_cpu(xen_vcpu_id, 0) = 0;
-
-   /*
-* Setup xen_vcpu early because start_kernel needs it for
-* local_irq_disable(), irqs_disabled().
-*
-* Don't do the full vcpu_info placement stuff until we have
-* the cpu_possible_mask and a non-dummy shared_info.
-*/
-   xen_vcpu_info_reset(0);
-
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_pv, xen_cpu_dead_pv));
 
local_irq_disable();
diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index 8a10c9a9e2b5..417b339e5c8e 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -15,6 +15,7 @@
 
 #include 
 
+#include 
 #include 
 
 .macro xen_pv_trap name
@@ -54,6 +55,19 @@ xen_pv_trap entry_INT80_compat
 #endif
 xen_pv_trap hypervisor_callback
 
+   __INIT
+ENTRY(xen_early_idt_hand

Re: [PATCH] ASoC: amd: added error checks in dma driver

2017-11-24 Thread Mukunda,Vijendar



On Friday 24 November 2017 01:41 PM, Guenter Roeck wrote:

On Fri, Nov 24, 2017 at 3:07 AM, Mukunda,Vijendar
 wrote:



On Thursday 23 November 2017 10:59 PM, Mark Brown wrote:

On Thu, Nov 23, 2017 at 08:59:43AM -0800, Guenter Roeck wrote:

On Thu, Nov 23, 2017 at 8:30 AM, Vijendar Mukunda
 wrote:

added error checks in acp dma driver
Signed-off-by: Vijendar Mukunda 
Signed-off-by: Akshu Agrawal 
Signed-off-by: Guenter Roeck 

This is inappropriate.

Specifically: if Guenter wasn't involved in writing or forwarding the
patch he shouldn't have a signoff in there, and if you're the one
sending the mail you should be the last person in the chain of signoffs.
Please see SubmittingPatches for details of what a signoff means and why
they're important.


   This patch was written on top of changes implemented by Guenter.
   There is a separate thread - RE: [PATCH] ASoC: amd: Add error checking
   to probe function in which Guenter posted changes.

That was my patch. This is yours.

Guenter

Got it, let your patch go as it is. Will submit a fresh patch for the
additional error checks in the acp dma driver.



   Got it, apologies. Will post the changes as a v2 version.





[tip:sched/urgent] sched/debug: Fix task state recording/printout

2017-11-24 Thread tip-bot for Thomas Gleixner
Commit-ID:  3f5fe9fef5b2da06b6319fab8123056da5217c3f
Gitweb: https://git.kernel.org/tip/3f5fe9fef5b2da06b6319fab8123056da5217c3f
Author: Thomas Gleixner 
AuthorDate: Wed, 22 Nov 2017 13:05:48 +0100
Committer:  Ingo Molnar 
CommitDate: Fri, 24 Nov 2017 08:39:12 +0100

sched/debug: Fix task state recording/printout

The recent conversion of the task state recording to use task_state_index()
broke the sched_switch tracepoint task state output.

task_state_index() surprisingly returns an index (0-7), which is then
printed with __print_flags() applying bitmasks. That doesn't really work and
results in weird states like 'prev_state=t' instead of 'prev_state=I'.

Use TASK_REPORT_MAX instead of TASK_STATE_MAX to report preemption. Build a
bitmask from the return value of task_state_index() and store it in
entry->prev_state, which makes __print_flags() work as expected.
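
A worked example of the difference (illustrative, not part of the patch):

	/* task_state_index() returns a small index (0-7), but __print_flags()
	 * matches bitmasks, so the index must be one-hot encoded first: */
	unsigned int idx = 7;			/* e.g. the report index for "I" */
	unsigned long state = 1UL << idx;	/* 0x80, matches { 0x80, "I" } below */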

Signed-off-by: Thomas Gleixner 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: sta...@vger.kernel.org
Fixes: efb40f588b43 ("sched/tracing: Fix trace_sched_switch task-state 
printing")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1711221304180.1751@nanos
Signed-off-by: Ingo Molnar 
---
 include/trace/events/sched.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 306b31d..bc01e06 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -116,9 +116,9 @@ static inline long __trace_sched_switch_state(bool preempt, 
struct task_struct *
 * RUNNING (we will not have dequeued if state != RUNNING).
 */
if (preempt)
-   return TASK_STATE_MAX;
+   return TASK_REPORT_MAX;
 
-   return task_state_index(p);
+   return 1 << task_state_index(p);
 }
 #endif /* CREATE_TRACE_POINTS */
 
@@ -164,7 +164,7 @@ TRACE_EVENT(sched_switch,
{ 0x40, "P" }, { 0x80, "I" }) :
  "R",
 
-   __entry->prev_state & TASK_STATE_MAX ? "+" : "",
+   __entry->prev_state & TASK_REPORT_MAX ? "+" : "",
__entry->next_comm, __entry->next_pid, __entry->next_prio)
 );
 


Re: [PATCH] x86/xen: support early interrupts in xen pv guests

2017-11-24 Thread Juergen Gross
Sorry, Andy, forgot to Cc: you...

On 24/11/17 09:42, Juergen Gross wrote:
> Add early interrupt handlers activated by idt_setup_early_handler() to
> the handlers supported by Xen pv guests. This will allow for early
> WARN() calls not crashing the guest.
> 
> Suggested-by: Andy Lutomirski 
> Signed-off-by: Juergen Gross 
> ---
>  arch/x86/include/asm/segment.h | 12 
>  arch/x86/mm/extable.c  |  4 +++-
>  arch/x86/xen/enlighten_pv.c| 37 -
>  arch/x86/xen/xen-asm_64.S  | 14 ++
>  4 files changed, 53 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
> index b20f9d623f9c..8f09012b92e7 100644
> --- a/arch/x86/include/asm/segment.h
> +++ b/arch/x86/include/asm/segment.h
> @@ -236,11 +236,23 @@
>   */
>  #define EARLY_IDT_HANDLER_SIZE 9
>  
> +/*
> + * xen_early_idt_handler_array is for Xen pv guests: for each entry in
> + * early_idt_handler_array it contains a prequel in the form of
> + * pop %rcx; pop %r11; jmp early_idt_handler_array[i]; summing up to
> + * max 8 bytes.
> + */
> +#define XEN_EARLY_IDT_HANDLER_SIZE 8
> +
>  #ifndef __ASSEMBLY__
>  
>  extern const char 
> early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
>  extern void early_ignore_irq(void);
>  
> +#if defined(CONFIG_X86_64) && defined(CONFIG_XEN_PV)
> +extern const char 
> xen_early_idt_handler_array[NUM_EXCEPTION_VECTORS][XEN_EARLY_IDT_HANDLER_SIZE];
> +#endif
> +
>  /*
>   * Load a segment. Fall back on loading the zero segment if something goes
>   * wrong.  This variant assumes that loading zero fully clears the segment.
> diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
> index 3321b446b66c..88754bfd425f 100644
> --- a/arch/x86/mm/extable.c
> +++ b/arch/x86/mm/extable.c
> @@ -1,6 +1,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -212,8 +213,9 @@ void __init early_fixup_exception(struct pt_regs *regs, 
> int trapnr)
>* Old CPUs leave the high bits of CS on the stack
>* undefined.  I'm not sure which CPUs do this, but at least
>* the 486 DX works this way.
> +  * Xen pv domains are not using the default __KERNEL_CS.
>*/
> - if (regs->cs != __KERNEL_CS)
> + if (!xen_pv_domain() && regs->cs != __KERNEL_CS)
>   goto fail;
>  
>   /*
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 5b2b3f3f6531..f2414c6c5e7c 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -622,7 +622,7 @@ static struct trap_array_entry trap_array[] = {
>   { simd_coprocessor_error,  xen_simd_coprocessor_error,  false },
>  };
>  
> -static bool get_trap_addr(void **addr, unsigned int ist)
> +static bool __ref get_trap_addr(void **addr, unsigned int ist)
>  {
>   unsigned int nr;
>   bool ist_okay = false;
> @@ -644,6 +644,14 @@ static bool get_trap_addr(void **addr, unsigned int ist)
>   }
>   }
>  
> + if (nr == ARRAY_SIZE(trap_array) &&
> + *addr >= (void *)early_idt_handler_array[0] &&
> + *addr < (void *)early_idt_handler_array[NUM_EXCEPTION_VECTORS]) {
> + nr = (*addr - (void *)early_idt_handler_array[0]) /
> +  EARLY_IDT_HANDLER_SIZE;
> + *addr = (void *)xen_early_idt_handler_array[nr];
> + }
> +
>   if (WARN_ON(ist != 0 && !ist_okay))
>   return false;
>  
> @@ -1262,6 +1270,21 @@ asmlinkage __visible void __init xen_start_kernel(void)
>   xen_setup_gdt(0);
>  
>   xen_init_irq_ops();
> +
> + /* Let's presume PV guests always boot on vCPU with id 0. */
> + per_cpu(xen_vcpu_id, 0) = 0;
> +
> + /*
> +  * Setup xen_vcpu early because idt_setup_early_handler needs it for
> +  * local_irq_disable(), irqs_disabled().
> +  *
> +  * Don't do the full vcpu_info placement stuff until we have
> +  * the cpu_possible_mask and a non-dummy shared_info.
> +  */
> + xen_vcpu_info_reset(0);
> +
> + idt_setup_early_handler();
> +
>   xen_init_capabilities();
>  
>  #ifdef CONFIG_X86_LOCAL_APIC
> @@ -1295,18 +1318,6 @@ asmlinkage __visible void __init xen_start_kernel(void)
>*/
>   acpi_numa = -1;
>  #endif
> - /* Let's presume PV guests always boot on vCPU with id 0. */
> - per_cpu(xen_vcpu_id, 0) = 0;
> -
> - /*
> -  * Setup xen_vcpu early because start_kernel needs it for
> -  * local_irq_disable(), irqs_disabled().
> -  *
> -  * Don't do the full vcpu_info placement stuff until we have
> -  * the cpu_possible_mask and a non-dummy shared_info.
> -  */
> - xen_vcpu_info_reset(0);
> -
>   WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_pv, xen_cpu_dead_pv));
>  
>   local_irq_disable();
> diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
> index 8a10c9a9e2b5..417b339e5c8e 100644
> --- a/arch/x86/xen/xen-asm_6

Re: [RFC PATCH 0/2] mm: introduce MAP_FIXED_SAFE

2017-11-24 Thread Michal Hocko
Are there any more concerns? So far the biggest one was the name. The
other, which suggests a flag as a modifier, has hopefully been sorted out.
Is there anything more before we can consider this for merging? Well,
except for the man page update, which I will prepare of course. Can we
target this for 4.16?
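
From the caller's side, the intended semantics look like this (a sketch
using the flag name as proposed in this series; the constant is not in any
released headers yet):

	#include <errno.h>
	#include <stddef.h>
	#include <sys/mman.h>

	/* Enforce the hint, but fail with ENOMEM instead of clobbering an
	 * existing mapping. */
	static void *map_exactly_here(void *hint, size_t len)
	{
		void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_SAFE,
			       -1, 0);
		if (p == MAP_FAILED && errno == ENOMEM)
			return NULL;	/* range occupied: nothing clobbered */
		return p;
	}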

On Thu 16-11-17 13:14:38, Michal Hocko wrote:
> [Ups, managed to screw the subject - fix it]
> 
> On Thu 16-11-17 11:18:58, Michal Hocko wrote:
> > Hi,
> > this has started as a follow up discussion [1][2] resulting in the
> > runtime failure caused by hardening patch [3] which removes MAP_FIXED
> > from the elf loader because MAP_FIXED is inherently dangerous as it
> > might silently clobber and existing underlying mapping (e.g. stack). The
> > reason for the failure is that some architectures enforce an alignment
> > for the given address hint without MAP_FIXED used (e.g. for shared or
> > file backed mappings).
> > 
> > One way around this would be excluding those archs which do alignment
> > tricks from the hardening [4]. The patch is really trivial but it has
> > been objected, rightfully so, that this screams for a more generic
> > solution. We basically want a non-destructive MAP_FIXED.
> > 
> > The first patch introduced MAP_FIXED_SAFE which enforces the given
> > address but unlike MAP_FIXED it fails with ENOMEM if the given range
> > conflicts with an existing one. The flag is introduced as a completely
> > new flag rather than a MAP_FIXED extension because of the backward
> > compatibility. We really want a never-clobber semantic even on older
> > kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
> > flags evaluation because we do not EINVAL on unknown flags. On those
> > kernels we would simply use the traditional hint based semantic so the
> > caller can still get a different address (which sucks) but at least not
> > silently corrupt an existing mapping. I do not see a good way around
> > that. Except we won't export expose the new semantic to the userspace at
> > all. It seems there are users who would like to have something like that
> > [5], though. Atomic address range probing in the multithreaded programs
> > sounds like an interesting thing to me as well, although I do not have
> > any specific usecase in mind.
> > 
> > The second patch simply replaces MAP_FIXED use in elf loader by
> > MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
> > follow. Actually real MAP_FIXED usages should be docummented properly
> > and they should be more of an exception.
> > 
> > Does anybody see any fundamental reasons why this is a wrong approach?
> > 
> > Diffstat says
> >  arch/alpha/include/uapi/asm/mman.h   |  2 ++
> >  arch/metag/kernel/process.c  |  6 +-
> >  arch/mips/include/uapi/asm/mman.h|  2 ++
> >  arch/parisc/include/uapi/asm/mman.h  |  2 ++
> >  arch/powerpc/include/uapi/asm/mman.h |  1 +
> >  arch/sparc/include/uapi/asm/mman.h   |  1 +
> >  arch/tile/include/uapi/asm/mman.h|  1 +
> >  arch/xtensa/include/uapi/asm/mman.h  |  2 ++
> >  fs/binfmt_elf.c  | 12 
> >  include/uapi/asm-generic/mman.h  |  1 +
> >  mm/mmap.c| 11 +++
> >  11 files changed, 36 insertions(+), 5 deletions(-)
> > 
> > [1] http://lkml.kernel.org/r/20171107162217.382cd...@canb.auug.org.au
> > [2] http://lkml.kernel.org/r/1510048229.12079.7.ca...@abdul.in.ibm.com
> > [3] http://lkml.kernel.org/r/20171023082608.6167-1-mho...@kernel.org
> > [4] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk...@dhcp22.suse.cz
> > [5] http://lkml.kernel.org/r/87efp1w7vy@concordia.ellerman.id.au
> > 
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majord...@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] r8152: disable rx checksum offload on Dell TB dock

2017-11-24 Thread Kai Heng Feng


> On 24 Nov 2017, at 4:28 PM, Greg KH  wrote:
> 
> The bcdDevice is different between the dock device and the "real"
> device, why not use that?

Yea, I’ll poke around and see if bcdDevice alone can be a good predicate.

> Then there is still a bug.  Who as ASMedia is working on this, have they
> posted anything to the linux-usb mailing list about it?

I think they are doing this internally. I’ll advise them to ask questions
here if they encounter any problem.

> Maybe.  Have you tried using usbmon to see what the data streams are for
> the two devices and where they have problems and diverge?  Is the dock
> device doing something different in response to something from the host
> that the "real" device does not do?

No I haven’t.
Not really sure how to debug network packets over USB. I’ll do some research
on the topic.

Kai-Heng


[PATCH] atm: nicstar: use the setup_timer helper

2017-11-24 Thread Colin King
From: Colin Ian King 

Replace init_timer and two explicit assignments with the setup_timer
helper.

Signed-off-by: Colin Ian King 
---
 drivers/atm/nicstar.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/atm/nicstar.c b/drivers/atm/nicstar.c
index a9702836cbae..335447ed0ba4 100644
--- a/drivers/atm/nicstar.c
+++ b/drivers/atm/nicstar.c
@@ -284,10 +284,8 @@ static int __init nicstar_init(void)
XPRINTK("nicstar: nicstar_init() returned.\n");
 
if (!error) {
-   init_timer(&ns_timer);
+   setup_timer(&ns_timer, ns_poll, 0UL);
ns_timer.expires = jiffies + NS_POLL_PERIOD;
-   ns_timer.data = 0UL;
-   ns_timer.function = ns_poll;
add_timer(&ns_timer);
}
 
-- 
2.14.1



Re: [PATCH] r8152: disable rx checksum offload on Dell TB dock

2017-11-24 Thread Kai Heng Feng

> Also the MAC address is different, can you just trigger off of Dell's
> MAC address space instead of the address space of the dongle device?

A really good idea, never thought of this. Thanks for the hint :)
Still, I need to ask Dell folks to get all the answers.
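
For illustration, such a check could look like the sketch below (the OUI
used here is a placeholder, not a verified Dell prefix):

	/* Key the workaround off the adapter's MAC OUI instead of the USB
	 * topology.  0xaa:0xbb:0xcc is a stand-in, NOT a real Dell OUI. */
	static bool mac_is_dell_dock(const u8 *mac)
	{
		return mac[0] == 0xaa && mac[1] == 0xbb && mac[2] == 0xcc;
	}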

Kai-Heng



Re: [PATCH 1/3] scsi: arcmsr: Add driver module parameter msi_enable

2017-11-24 Thread Ching Huang
On Fri, 2017-11-24 at 04:45 +0800, Ching Huang wrote:
> Hello Dan,
> 
> On Thu, 2017-11-23 at 13:44 +0300, Dan Carpenter wrote:
> > On Thu, Nov 23, 2017 at 09:27:19AM +0800, Ching Huang wrote:
> > > From: Ching Huang 
> > > 
> > > Add module parameter msi_enable to give a chance to disable MSI interrupts
> > > if they do not work properly.
> > > 
> > > Signed-off-by: Ching Huang 
> > > ---
> > > 
> > > diff -uprN a/drivers/scsi/arcmsr/arcmsr_hba.c 
> > > b/drivers/scsi/arcmsr/arcmsr_hba.c
> > > --- a/drivers/scsi/arcmsr/arcmsr_hba.c2017-11-23 14:29:26.0 
> > > +0800
> > > +++ b/drivers/scsi/arcmsr/arcmsr_hba.c2017-11-23 16:02:28.0 
> > > +0800
> > > @@ -75,6 +75,10 @@ MODULE_DESCRIPTION("Areca ARC11xx/12xx/1
> > >  MODULE_LICENSE("Dual BSD/GPL");
> > >  MODULE_VERSION(ARCMSR_DRIVER_VERSION);
> > >  
> > > +static int msi_enable = 1;
> > > +module_param(msi_enable, int, S_IRUGO);
> >  ^^^
> > checkpatch.pl will complain that this should be 0444
> S_IRUGO's value is 00444, defined in <linux/stat.h>.
> So it will not be an issue.
> > 
> > > +MODULE_PARM_DESC(msi_enable, " Enable MSI interrupt(0 ~ 1), 
> > > msi_enable=1(enable), =0(disable)");
> >  ^
> > Remove the extra space
> OK
> > 
> > > +
> > >  static int host_can_queue = ARCMSR_DEFAULT_OUTSTANDING_CMD;
> > >  module_param(host_can_queue, int, S_IRUGO);
> > >  MODULE_PARM_DESC(host_can_queue, " adapter queue depth(32 ~ 1024), 
> > > default is 128");
> > > @@ -831,11 +835,15 @@ arcmsr_request_irq(struct pci_dev *pdev,
> > >   pr_info("arcmsr%d: msi-x enabled\n", acb->host->host_no);
> > >   flags = 0;
> > >   } else {
> > > - nvec = pci_alloc_irq_vectors(pdev, 1, 1,
> > > - PCI_IRQ_MSI | PCI_IRQ_LEGACY);
> > > + if (msi_enable == 1)
> > > + nvec = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI);
> > > + else
> > > + nvec = pci_alloc_irq_vectors(pdev, 1, 1, 
> > > PCI_IRQ_LEGACY);
> > >   if (nvec < 1)
> > >   return FAILED;
> > 
> > I feel like we should try PCI_IRQ_MSI then if it fails we could fall
> > back to PCI_IRQ_LEGACY.  Originally, it worked like this and now it just
> > fails unless you toggle the module param.  It's a regression.
> update as below
> ---
>   unsigned int irq_flag;
>   irq_flag = PCI_IRQ_LEGACY;
>   if (msi_enable == 1)
>   irq_flag |= PCI_IRQ_MSI;
>   nvec = pci_alloc_irq_vectors(pdev, 1, 1, irq_flag);
> > >  
> > > + if (msi_enable == 1)
> > > + pr_info("arcmsr%d: msi enabled\n", acb->host->host_no);
> > 
> > This printk could be improved.  Use dev_info(&pdev->dev, for a start.
> > I know that the other prints don't use this, but we could use it one
> > time then slowly add more users until more are using dev_info() than
> > pr_info() and then someone will decide to clean up the old users.
> update as below
> ---
>   if (msi_enable == 1)
>   dev_info(&pdev->dev, "msi enabled\n");
> 
> > 
> > regards,
> > dan carpenter
> > 
> 
Dan,

This new patch applies to mkp/scsi.git 4.16/scsi-queue
---

diff -uprN a/drivers/scsi/arcmsr/arcmsr_hba.c b/drivers/scsi/arcmsr/arcmsr_hba.c
--- a/drivers/scsi/arcmsr/arcmsr_hba.c  2017-11-23 14:29:26.0 +0800
+++ b/drivers/scsi/arcmsr/arcmsr_hba.c  2017-11-24 15:16:20.0 +0800
@@ -75,6 +75,10 @@ MODULE_DESCRIPTION("Areca ARC11xx/12xx/1
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_VERSION(ARCMSR_DRIVER_VERSION);
 
+static int msi_enable = 1;
+module_param(msi_enable, int, S_IRUGO);
+MODULE_PARM_DESC(msi_enable, "Enable MSI interrupt(0 ~ 1), 
msi_enable=1(enable), =0(disable)");
+
 static int host_can_queue = ARCMSR_DEFAULT_OUTSTANDING_CMD;
 module_param(host_can_queue, int, S_IRUGO);
 MODULE_PARM_DESC(host_can_queue, " adapter queue depth(32 ~ 1024), default is 
128");
@@ -831,11 +835,17 @@ arcmsr_request_irq(struct pci_dev *pdev,
pr_info("arcmsr%d: msi-x enabled\n", acb->host->host_no);
flags = 0;
} else {
-   nvec = pci_alloc_irq_vectors(pdev, 1, 1,
-   PCI_IRQ_MSI | PCI_IRQ_LEGACY);
+   if (msi_enable == 1) {
+   nvec = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI);
+   if (nvec == 1) {
+   dev_info(&pdev->dev, "msi enabled\n");
+   goto msi_int1;
+   }
+   }
+   nvec = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_LEGACY);
if (nvec < 1)
return FAILED;
-
+msi_int1:
flags = IRQF_SHARED;
}
 




Re: Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time]

2017-11-24 Thread Sagar Arun Kamble



On 11/24/2017 12:29 AM, Thomas Gleixner wrote:

On Thu, 23 Nov 2017, Sagar Arun Kamble wrote:

We needed inputs on possible optimization that can be done to
timecounter/cyclecounter structures/usage.
This mail is in response to review of patch
https://patchwork.freedesktop.org/patch/188448/.

As per Chris's observation below, about a dozen timecounter users in the
kernel have the structures below defined individually:

spinlock_t lock;
struct cyclecounter cc;
struct timecounter tc;

Can we move lock and cc to tc? That way it will be convenient.
Also it will allow unifying the locking/overflow watchdog handling across all
drivers.

Looks like none of the timecounter usage sites has a real need to separate
timecounter and cyclecounter.


Yes. Will share patch for this change.


The lock is a different question. The locking of the various drivers
differs and I have no idea how you want to handle that. Just sticking the
lock into the data structure and then not making use of it in the
timecounter code, leaving it to the call sites, does not make sense.


Most of the locks are held around timecounter_read. In some instances the
lock is held when the cyclecounter is updated standalone, or updated along
with timecounter calls.
I was thinking that if we move the lock into the timecounter functions,
drivers would just have to do locking around their own operations on the
cyclecounter. But another problem I see is that there are variations of the
locking calls, like lock_irqsave, lock_bh, write_lock_irqsave (some using
rwlock_t). Should all this locking be left to the driver only, then?
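
For clarity, the shape being discussed would be roughly the following (a
sketch of the proposal, not the current kernel API, where the timecounter
still holds a driver-owned cyclecounter pointer):

	/* Sketch: cyclecounter embedded in the timecounter instead of being
	 * a separate, driver-owned struct.  Whether the lock belongs here
	 * too is the open question above. */
	struct timecounter {
		struct cyclecounter cc;	/* was: const struct cyclecounter *cc */
		u64 cycle_last;
		u64 nsec;
		u64 mask;
		u64 frac;
	};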


Thanks,

tglx


Thanks
Sagar



Re: [PATCH v2 2/2] s390/virtio: add BSD license to virtio-ccw

2017-11-24 Thread Cornelia Huck
On Fri, 24 Nov 2017 07:21:09 +0200
"Michael S. Tsirkin"  wrote:

> The original intent of the virtio header relicensing
> from 2008 was to make sure anyone can implement compatible
> devices/drivers. The virtio-ccw was omitted by mistake.
> 
> We have an ack from the only contributor as well as the
> maintainer from IBM, so it's not too late to fix that.
> 
> Make it dual-licensed with GPLv2, as the whole kernel is GPL2.
> 
> Acked-by: Christian Borntraeger 
> Acked-by: Cornelia Huck 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  arch/s390/include/uapi/asm/virtio-ccw.h | 32 +++-
>  1 file changed, 27 insertions(+), 5 deletions(-)

As noted by Thomas, patch 1 is not needed anymore. Who will merge this
one?


[PATCH 04/43] x86/gdt: Put per-cpu GDT remaps in ascending order

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

We currently have CPU 0's GDT at the top of the GDT range and
higher-numbered CPUs at lower addresses.  This happens because the
fixmap is upside down (index 0 is the top of the fixmap).

Flip it so that GDTs are in ascending order by virtual address.
This will simplify a future patch that will generalize the GDT
remap to contain multiple pages.
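
The ordering follows from the fixmap address formula, fix_to_virt(idx) ==
FIXADDR_TOP - (idx << PAGE_SHIFT): a larger index means a lower virtual
address, so FIX_GDT_REMAP_BEGIN + cpu gave addresses descending with the
CPU number, while FIX_GDT_REMAP_END - cpu gives ascending ones.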

Signed-off-by: Andy Lutomirski 
Reviewed-by: Borislav Petkov 
Reviewed-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/3966a6edf6fd45deca4cf52a9b9276402499dda9.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/desc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 4011cb03ef08..95cd95eb7285 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -63,7 +63,7 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 /* Get the fixmap index for a specific processor */
 static inline unsigned int get_cpu_gdt_ro_index(int cpu)
 {
-   return FIX_GDT_REMAP_BEGIN + cpu;
+   return FIX_GDT_REMAP_END - cpu;
 }
 
 /* Provide the fixmap address of the remapped GDT */
-- 
2.14.1



[PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

When we start using an entry trampoline, a #GP from userspace will
be delivered on the entry stack, not on the task stack.  Fix the
espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
assuming that pt_regs + 1 == SP0.  This won't change anything
without an entry stack, but it will make the code continue to work
when an entry stack is added.

Signed-off-by: Andy Lutomirski 
Reviewed-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/b1ef4136616c6bd2a75d1fd2736d1d54437d65a8.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/traps.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2008dd0f8ccb..1bd43f044c62 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -359,7 +359,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, 
long error_code)
regs->cs == __KERNEL_CS &&
regs->ip == (unsigned long)native_irq_return_iret)
{
-   struct pt_regs *normal_regs = task_pt_regs(current);
+   struct pt_regs *normal_regs =
+   (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 
1;
 
/* Fake a #GP(0) from userspace. */
memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
@@ -390,7 +391,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, 
long error_code)
 *
 *   Processors update CR2 whenever a page fault is detected. If a
 *   second page fault occurs while an earlier page fault is being
-*   deliv- ered, the faulting linear address of the second fault will
+*   delivered, the faulting linear address of the second fault will
 *   overwrite the contents of CR2 (replacing the previous
 *   address). These updates to CR2 occur even if the page fault
 *   results in a double fault or occurs during the delivery of a
-- 
2.14.1



[PATCH 21/43] x86/mm/kaiser: Disable global pages by default with KAISER

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

Global pages stay in the TLB across context switches.  Since all contexts
share the same kernel mapping, these mappings are marked as global pages
so kernel entries in the TLB are not flushed out on a context switch.

But, even having these entries in the TLB opens up something that an
attacker can use [1].

That means that even when KAISER switches page tables on return to user
space the global pages would stay in the TLB cache.

Disable global pages so that kernel TLB entries can be flushed before
returning to user space. This way, all accesses to kernel addresses from
userspace result in a TLB miss independent of the existence of a kernel
mapping.

Replace _PAGE_GLOBAL by __PAGE_KERNEL_GLOBAL and keep _PAGE_GLOBAL
available so that it can still be used for a few selected kernel mappings
which must be visible to userspace, when KAISER is enabled, like the
entry/exit code and data.

1. The double-page-fault attack:
   http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf

Signed-off-by: Dave Hansen 
Reviewed-by: Borislav Petkov 
Reviewed-by: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003441.63ddf...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/pgtable_types.h | 14 +-
 arch/x86/mm/pageattr.c   | 16 
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 9e9b05fc4860..1fc2f22b9002 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -180,8 +180,20 @@ enum page_cache_mode {
 #define PAGE_READONLY_EXEC __pgprot(_PAGE_PRESENT | _PAGE_USER |   \
 _PAGE_ACCESSED)
 
+/*
+ * Disable global pages for anything using the default
+ * __PAGE_KERNEL* macros.  PGE will still be enabled
+ * and _PAGE_GLOBAL may still be used carefully.
+ */
+#ifdef CONFIG_KAISER
+#define __PAGE_KERNEL_GLOBAL   0
+#else
+#define __PAGE_KERNEL_GLOBAL   _PAGE_GLOBAL
+#endif
+
 #define __PAGE_KERNEL_EXEC \
-   (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL)
+   (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED |  \
+__PAGE_KERNEL_GLOBAL)
 #define __PAGE_KERNEL  (__PAGE_KERNEL_EXEC | _PAGE_NX)
 
 #define __PAGE_KERNEL_RO   (__PAGE_KERNEL & ~_PAGE_RW)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 3fe68483463c..ffe584fa1f5e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -585,9 +585,9 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 * for the ancient hardware that doesn't support it.
 */
if (pgprot_val(req_prot) & _PAGE_PRESENT)
-   pgprot_val(req_prot) |= _PAGE_PSE | _PAGE_GLOBAL;
+   pgprot_val(req_prot) |= _PAGE_PSE | __PAGE_KERNEL_GLOBAL;
else
-   pgprot_val(req_prot) &= ~(_PAGE_PSE | _PAGE_GLOBAL);
+   pgprot_val(req_prot) &= ~(_PAGE_PSE | __PAGE_KERNEL_GLOBAL);
 
req_prot = canon_pgprot(req_prot);
 
@@ -705,9 +705,9 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, 
unsigned long address,
 * for the ancient hardware that doesn't support it.
 */
if (pgprot_val(ref_prot) & _PAGE_PRESENT)
-   pgprot_val(ref_prot) |= _PAGE_GLOBAL;
+   pgprot_val(ref_prot) |= __PAGE_KERNEL_GLOBAL;
else
-   pgprot_val(ref_prot) &= ~_PAGE_GLOBAL;
+   pgprot_val(ref_prot) &= ~__PAGE_KERNEL_GLOBAL;
 
/*
 * Get the target pfn from the original entry:
@@ -938,9 +938,9 @@ static void populate_pte(struct cpa_data *cpa,
 * support it.
 */
if (pgprot_val(pgprot) & _PAGE_PRESENT)
-   pgprot_val(pgprot) |= _PAGE_GLOBAL;
+   pgprot_val(pgprot) |= __PAGE_KERNEL_GLOBAL;
else
-   pgprot_val(pgprot) &= ~_PAGE_GLOBAL;
+   pgprot_val(pgprot) &= ~__PAGE_KERNEL_GLOBAL;
 
pgprot = canon_pgprot(pgprot);
 
@@ -1242,9 +1242,9 @@ static int __change_page_attr(struct cpa_data *cpa, int 
primary)
 * support it.
 */
if (pgprot_val(new_prot) & _PAGE_PRESENT)
-   pgprot_val(new_prot) |= _PAGE_GLOBAL;
+   pgprot_val(new_prot) |= __PAGE_KERNEL_GLOBAL;
else
-   pgprot_val(new_prot) &= ~_PAGE_GLOBAL;
+   pgprot_val(new_prot) &= ~__PAGE_KERNEL_GLOBAL;
 
/*
 * We need to keep the pfn from the existing PTE,
-- 
2.14.

[PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version

2017-11-24 Thread Ingo Molnar
This is a linear series of patches of the latest entry-stack plus Kaiser
bits from Andy Lutomirski (v3 series from today) and Dave Hansen
(kaiser-414-tipwip-20171123 version), on top of latest tip:x86/urgent 
(12a78d43de76),
plus fixes - for easier review.

The code should be the latest posted by Andy and Dave.

Any bugs caused by mis-merges, mis-backmerges or mis-fixes are mine.

Thanks,

Ingo

Andy Lutomirski (19):
  x86/entry/64: Allocate and enable the SYSENTER stack
  x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  x86/gdt: Put per-cpu GDT remaps in ascending order
  x86/fixmap: Generalize the GDT fixmap mechanism
  x86/kasan/64: Teach KASAN about the cpu_entry_area
  x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  x86/dumpstack: Handle stack overflow on all stacks
  x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
  x86/entry: Remap the TSS into the cpu entry area
  x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
  x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  x86/entry/64: Use a percpu trampoline stack for IDT entries
  x86/entry/64: Return to userspace from the trampoline stack
  x86/entry/64: Create a percpu SYSCALL entry trampoline
  x86/irq: Remove an old outdated comment about context tracking races
  x86/irq/64: Print the offending IP in the stack overflow warning
  x86/entry/64: Move the IST stacks into cpu_entry_area
  x86/entry/64: Remove the SYSENTER stack canary
  x86/entry: Clean up SYSENTER_stack code

Dave Hansen (22):
  x86/mm/kaiser: Disable global pages by default with KAISER
  x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  x86/mm/kaiser: Introduce user-mapped per-cpu areas
  x86/mm/kaiser: Mark per-cpu data structures required for entry/exit
  x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  x86/mm/kaiser: Allow NX poison to be set in p4d/pgd
  x86/mm/kaiser: Make sure static PGDs are 8k in size
  x86/mm/kaiser: Map CPU entry area
  x86/mm/kaiser: Map dynamically-allocated LDTs
  x86/mm/kaiser: Map espfix structures
  x86/mm/kaiser: Map entry stack variable
  x86/mm: Move CR3 construction functions
  x86/mm: Remove hard-coded ASID limit checks
  x86/mm: Put mmu-to-h/w ASID translation in one place
  x86/mm: Allow flushing for future ASID switches
  x86/mm/kaiser: Use PCID feature to make user and kernel switches faster
  x86/mm/kaiser: Disable native VSYSCALL
  x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime
  x86/mm/kaiser: Add a function to check for KAISER being enabled
  x86/mm/kaiser: Un-poison PGDs at runtime
  x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  x86/mm/kaiser: Add Kconfig

Hugh Dickins (1):
  x86/mm/kaiser: Map virtually-addressed performance monitoring buffers

Masami Hiramatsu (1):
  x86/decoder: Add new TEST instruction pattern

 Documentation/x86/kaiser.txt| 162 
 arch/x86/Kconfig|   8 +
 arch/x86/boot/compressed/pagetable.c|   6 +
 arch/x86/entry/calling.h|  89 
 arch/x86/entry/entry_32.S   |   6 +-
 arch/x86/entry/entry_64.S   | 215 --
 arch/x86/entry/entry_64_compat.S|  39 +-
 arch/x86/events/intel/ds.c  |  49 ++-
 arch/x86/include/asm/cpufeatures.h  |   1 +
 arch/x86/include/asm/desc.h |  13 +-
 arch/x86/include/asm/fixmap.h   |  58 ++-
 arch/x86/include/asm/kaiser.h   |  68 +++
 arch/x86/include/asm/mmu_context.h  |  29 +-
 arch/x86/include/asm/pgtable.h  |  19 +-
 arch/x86/include/asm/pgtable_64.h   | 146 +++
 arch/x86/include/asm/pgtable_types.h|  25 +-
 arch/x86/include/asm/processor.h|  49 ++-
 arch/x86/include/asm/stacktrace.h   |   3 +
 arch/x86/include/asm/switch_to.h|   2 +-
 arch/x86/include/asm/thread_info.h  |   2 +-
 arch/x86/include/asm/tlbflush.h | 208 --
 arch/x86/include/asm/traps.h|   1 -
 arch/x86/include/uapi/asm/processor-flags.h |   3 +-
 arch/x86/kernel/asm-offsets.c   |   7 +
 arch/x86/kernel/asm-offsets_32.c|   5 -
 arch/x86/kernel/asm-offsets_64.c|   1 +
 arch/x86/kernel/cpu/common.c| 139 +--
 arch/x86/kernel/doublefault.c   |  36 +-
 arch/x86/kernel/dumpstack.c |  42 +-
 arch/x86/kernel/dumpstack_32.c  |   6 +
 arch/x86/kernel/dumpstack_64.c  |   6 +
 arch/x86/kernel/espfix_64.c |  27 +-
 arch/x86/kernel/head_64.S   |  30 +-
 arch/x86/kernel/irq.c   |  12 -
 arch/x86/kernel/irq_64.c|   4 +-
 arch/x86/kernel/ldt.c   |  25 +-
 arch/x86/kernel/process.c   |  15 +-
 arch/x86/kernel/process_64.c|   3 +-
 arch/x86/kernel/tra

[PATCH 29/43] x86/mm/kaiser: Map dynamically-allocated LDTs

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

Normally, a process has a NULL mm->context.ldt.  But, there is a
syscall for a process to set a new one.  If a process does that,
the LDT must be mapped into the user page tables, just like the
default copy.

The original KAISER patch missed this case.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003455.27539...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/ldt.c | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 1c1eae961340..d6ab1144fdbf 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -57,11 +58,21 @@ static void flush_ldt(void *__mm)
refresh_ldt_segments();
 }
 
+static void __free_ldt_struct(struct ldt_struct *ldt)
+{
+   if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
+   vfree_atomic(ldt->entries);
+   else
+   free_page((unsigned long)ldt->entries);
+   kfree(ldt);
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
struct ldt_struct *new_ldt;
unsigned int alloc_size;
+   int ret;
 
if (num_entries > LDT_ENTRIES)
return NULL;
@@ -89,6 +100,12 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
return NULL;
}
 
+   ret = kaiser_add_mapping((unsigned long)new_ldt->entries, alloc_size,
+__PAGE_KERNEL | _PAGE_GLOBAL);
+   if (ret) {
+   __free_ldt_struct(new_ldt);
+   return NULL;
+   }
new_ldt->nr_entries = num_entries;
return new_ldt;
 }
@@ -115,12 +132,10 @@ static void free_ldt_struct(struct ldt_struct *ldt)
if (likely(!ldt))
return;
 
+   kaiser_remove_mapping((unsigned long)ldt->entries,
+ ldt->nr_entries * LDT_ENTRY_SIZE);
paravirt_free_ldt(ldt->entries, ldt->nr_entries);
-   if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
-   vfree_atomic(ldt->entries);
-   else
-   free_page((unsigned long)ldt->entries);
-   kfree(ldt);
+   __free_ldt_struct(ldt);
 }
 
 /*
-- 
2.14.1



[PATCH 30/43] x86/mm/kaiser: Map espfix structures

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

There is some rather arcane code to help when an IRET returns
to 16-bit segments.  It is referred to as the "espfix" code.
This consists of a few per-cpu variables:

espfix_stack: tells us where the stack is allocated
  (the bottom)
espfix_waddr: tells us to where %rsp may be pointed
  (the top)

These are in addition to the stack itself.  All three things must
be mapped for the espfix code to function.

Note: the espfix code runs with a kernel GSBASE, but user
(shadow) page tables.  A switch to the kernel page tables could
be performed instead of mapping these structures, but mapping
them is simpler and less likely to break the assembly.  To switch
over to the kernel copy, additional temporary storage would be
required which is in short supply in this context.

The original KAISER patch missed this case.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003457.eb854...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/espfix_64.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 4780dba2cc59..8bb116d73aaa 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -33,6 +33,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -41,7 +42,6 @@
 #include 
 #include 
 #include 
-#include 
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -61,8 +61,8 @@
 #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_stack);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_waddr);
 
 /* Initialization mutex - should this be a spinlock? */
 static DEFINE_MUTEX(espfix_init_mutex);
@@ -225,4 +225,10 @@ void init_espfix_ap(int cpu)
per_cpu(espfix_stack, cpu) = addr;
per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
  + (addr & ~PAGE_MASK);
+   /*
+* _PAGE_GLOBAL is not really required.  This is not a hot
+* path, but we do it here for consistency.
+*/
+   kaiser_add_mapping((unsigned long)stack_page, PAGE_SIZE,
+   __PAGE_KERNEL | _PAGE_GLOBAL);
 }
-- 
2.14.1



[PATCH 43/43] x86/mm/kaiser: Add Kconfig

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

PARAVIRT generally requires that the kernel not manage its own page
tables.  It also means that the hypervisor and kernel must agree
wholeheartedly about what format the page tables are in and what
they contain.  KAISER, unfortunately, changes the rules and they
can not be used together.

I've seen conflicting feedback from maintainers lately about whether
they want the Kconfig magic to go first or last in a patch series.
It's going last here because the partially-applied series leads to
kernels that can not boot in a bunch of cases.  I did a run through
the entire series with CONFIG_KAISER=y to look for build errors,
though.

Note from Hugh Dickins on why it depends on SMP:

It is absurd that KAISER should depend on SMP, but
apparently nobody has tried a UP build before: which
breaks on implicit declaration of function
'per_cpu_offset' in arch/x86/mm/kaiser.c.

Now, you would expect that to be trivially fixed up; but
looking at the System.map when that block is #ifdef'ed
out of kaiser_init(), I see that in a UP build
__per_cpu_user_mapped_end is precisely at
__per_cpu_user_mapped_start, and the items carefully
gathered into that section for user-mapping on SMP,
dispersed elsewhere on UP.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003524.88c90...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 security/Kconfig | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/security/Kconfig b/security/Kconfig
index e8e449444e65..99b530d0dd9e 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -54,6 +54,16 @@ config SECURITY_NETWORK
  implement socket and networking access controls.
  If you are unsure how to answer this question, answer N.
 
+config KAISER
+   bool "Remove the kernel mapping in user mode"
+   depends on X86_64 && SMP && !PARAVIRT
+   help
+ This feature reduces the number of hardware side channels by
+ ensuring that the majority of kernel addresses are not mapped
+ into userspace.
+
+ See Documentation/x86/kaiser.txt for more details.
+
 config SECURITY_INFINIBAND
bool "Infiniband Security Hooks"
depends on SECURITY && INFINIBAND
-- 
2.14.1



[PATCH 31/43] x86/mm/kaiser: Map entry stack variable

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

There are times where the kernel is entered but there is no
safe stack, like at SYSCALL entry.  To obtain a safe stack, we
have to clobber %rsp and store the clobbered value in
'rsp_scratch'.

Map this to userspace to allow us to do this stack switch before
the CR3 switch.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003459.c0ff1...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/process_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index bafe65b08697..9a0220aa2bf9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -59,7 +59,7 @@
 #include 
 #endif
 
-__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
+__visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, rsp_scratch);
 
 /* Prints also some state that isn't saved in the pt_regs */
 void __show_regs(struct pt_regs *regs, int all)
-- 
2.14.1



[PATCH 40/43] x86/mm/kaiser: Add a function to check for KAISER being enabled

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

Currently, all of the checks for KAISER are compile-time checks.

Runtime checks are needed for turning it on/off at runtime.

Add a function to do that.
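
For example, a caller can then guard KAISER-specific work at runtime
(a sketch for illustration; kaiser_poison_pgd() comes from a later
patch in this series, not from this one):

	if (kaiser_active())
		kaiser_poison_pgd(pgd);	/* only poison while KAISER is on */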

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003518.b7d81...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/kaiser.h | 5 +
 include/linux/kaiser.h| 5 +
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
index 040cb096d29d..35f12a8a7071 100644
--- a/arch/x86/include/asm/kaiser.h
+++ b/arch/x86/include/asm/kaiser.h
@@ -56,6 +56,11 @@ extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
  */
 extern void kaiser_init(void);
 
+static inline bool kaiser_active(void)
+{
+   extern int kaiser_enabled;
+   return kaiser_enabled;
+}
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
index 77db4230a0dd..a3d28d00d555 100644
--- a/include/linux/kaiser.h
+++ b/include/linux/kaiser.h
@@ -28,5 +28,10 @@ static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
 static inline void kaiser_add_mapping_cpu_entry(int cpu)
 {
 }
+
+static inline bool kaiser_active(void)
+{
+   return 0;
+}
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
-- 
2.14.1



[PATCH 41/43] x86/mm/kaiser: Un-poison PGDs at runtime

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

With KAISER, kernel PGDs that map userspace are "poisoned" with
the NX bit.  This ensures that if a kernel->user CR3 switch is
missed, userspace crashes instead of running in an unhardened
state.
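
What "poisoning" means for a single entry, as a sketch (the real
helpers are the kaiser_poison_pgd()/kaiser_unpoison_pgd() additions
below):

	pgd_t entry = *pgdp;
	if (entry.pgd & _PAGE_PRESENT)
		entry.pgd |= _PAGE_NX;	/* user fetches through it now fault */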

This code will be needed in a moment when KAISER is turned
on and off at runtime.

Note that an __ASSEMBLY__ #ifdef is now required since kaiser.h
is indirectly included into assembly.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003521.a90ac...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/pgtable_64.h | 16 +++-
 arch/x86/mm/kaiser.c  | 38 ++
 include/linux/kaiser.h|  3 ++-
 3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index c239839e92bd..89bde2091af1 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_PGTABLE_64_H
 
 #include 
+#include 
 #include 
 
 #ifndef __ASSEMBLY__
@@ -199,6 +200,18 @@ static inline bool pgd_userspace_access(pgd_t pgd)
return pgd.pgd & _PAGE_USER;
 }
 
+static inline void kaiser_poison_pgd(pgd_t *pgd)
+{
+   if (pgd->pgd & _PAGE_PRESENT)
+   pgd->pgd |= _PAGE_NX;
+}
+
+static inline void kaiser_unpoison_pgd(pgd_t *pgd)
+{
+   if (pgd->pgd & _PAGE_PRESENT)
+   pgd->pgd &= ~_PAGE_NX;
+}
+
 /*
  * Take a PGD location (pgdp) and a pgd value that needs
  * to be set there.  Populates the shadow and returns
@@ -222,7 +235,8 @@ static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
 * wrong CR3 value, userspace will crash
 * instead of running.
 */
-   pgd.pgd |= _PAGE_NX;
+   if (kaiser_active())
+   kaiser_poison_pgd(&pgd);
}
} else if (pgd_userspace_access(*pgdp)) {
/*
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 968d5b62d597..06966b111280 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -501,6 +501,9 @@ static ssize_t kaiser_enabled_write_file(struct file *file,
if (enable > 1)
return -EINVAL;
 
+   if (kaiser_enabled == enable)
+   return count;
+
WRITE_ONCE(kaiser_enabled, enable);
return count;
 }
@@ -518,3 +521,38 @@ static int __init create_kaiser_enabled(void)
return 0;
 }
 late_initcall(create_kaiser_enabled);
+
+enum poison {
+   KAISER_POISON,
+   KAISER_UNPOISON
+};
+void kaiser_poison_pgd_page(pgd_t *pgd_page, enum poison do_poison)
+{
+   int i = 0;
+
+   for (i = 0; i < PTRS_PER_PGD; i++) {
+   pgd_t *pgd = &pgd_page[i];
+
+   /* Stop once we hit kernel addresses: */
+   if (!pgdp_maps_userspace(pgd))
+   break;
+
+   if (do_poison == KAISER_POISON)
+   kaiser_poison_pgd(pgd);
+   else
+   kaiser_unpoison_pgd(pgd);
+   }
+
+}
+
+void kaiser_poison_pgds(enum poison do_poison)
+{
+   struct page *page;
+
+   spin_lock(&pgd_lock);
+   list_for_each_entry(page, &pgd_list, lru) {
+   pgd_t *pgd = (pgd_t *)page_address(page);
+   kaiser_poison_pgd_page(pgd, do_poison);
+   }
+   spin_unlock(&pgd_lock);
+}
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
index a3d28d00d555..83d465599646 100644
--- a/include/linux/kaiser.h
+++ b/include/linux/kaiser.h
@@ -4,7 +4,7 @@
 #ifdef CONFIG_KAISER
 #include 
 #else
-
+#ifndef __ASSEMBLY__
 /*
  * These stubs are used whenever CONFIG_KAISER is off, which
  * includes architectures that support KAISER, but have it
@@ -33,5 +33,6 @@ static inline bool kaiser_active(void)
 {
return 0;
 }
+#endif /* __ASSEMBLY__ */
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
-- 
2.14.1



[PATCH 33/43] x86/mm: Move CR3 construction functions

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

For flushing the TLB, the ASID which has been programmed into the
hardware must be known.  That differs from what is in 'cpu_tlbstate'.

Add functions to transform the 'cpu_tlbstate' values into the one
programmed into the hardware (CR3).

It's not easy to include mmu_context.h into tlbflush.h, so just move
the CR3 building over to tlbflush.h.
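
For reference, the value build_cr3() produces (a worked sketch,
assuming PCID is available):

	/*
	 * bits 63..12: physical address of the page-aligned PGD
	 * bits 11..0:  hardware PCID, which is the mmu ASID plus 1
	 */
	unsigned long cr3 = __sme_pa(pgd) | (asid + 1);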

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003502.cc87b...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/mmu_context.h | 29 +
 arch/x86/include/asm/tlbflush.h| 27 +++
 arch/x86/mm/tlb.c  |  8 
 3 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6d16d15d09a0..5e1a1ecb65c6 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -281,33 +281,6 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
return __pkru_allows_pkey(vma_pkey(vma), write);
 }
 
-/*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
-
-static inline unsigned long build_cr3(struct mm_struct *mm, u16 asid)
-{
-   if (static_cpu_has(X86_FEATURE_PCID)) {
-   VM_WARN_ON_ONCE(asid > 4094);
-   return __sme_pa(mm->pgd) | (asid + 1);
-   } else {
-   VM_WARN_ON_ONCE(asid != 0);
-   return __sme_pa(mm->pgd);
-   }
-}
-
-static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
-{
-   VM_WARN_ON_ONCE(asid > 4094);
-   return __sme_pa(mm->pgd) | (asid + 1) | CR3_NOFLUSH;
-}
-
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
@@ -317,7 +290,7 @@ static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
  */
 static inline unsigned long __get_current_cr3_fast(void)
 {
-   unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm),
+   unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 
/* For now, be very restrictive about when this can be called. */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 509046cfa5ce..df28f1a61afa 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -75,6 +75,33 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
return new_tlb_gen;
 }
 
+/*
+ * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+ * bits.  This serves two purposes.  It prevents a nasty situation in
+ * which PCID-unaware code saves CR3, loads some other value (with PCID
+ * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+ * the saved ASID was nonzero.  It also means that any bugs involving
+ * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+ * deterministically.
+ */
+struct pgd_t;
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+{
+   if (static_cpu_has(X86_FEATURE_PCID)) {
+   VM_WARN_ON_ONCE(asid > 4094);
+   return __sme_pa(pgd) | (asid + 1);
+   } else {
+   VM_WARN_ON_ONCE(asid != 0);
+   return __sme_pa(pgd);
+   }
+}
+
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+{
+   VM_WARN_ON_ONCE(asid > 4094);
+   return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include 
 #else
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3118392cdf75..e629dbda01a0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -128,7 +128,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 * isn't free.
 */
 #ifdef CONFIG_DEBUG_VM
-   if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev, prev_asid))) {
+   if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
/*
 * If we were to BUG here, we'd be very likely to kill
 * the system so hard that we don't see the call trace.
@@ -195,7 +195,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm

[PATCH 32/43] x86/mm/kaiser: Map virtually-addressed performance monitoring buffers

2017-11-24 Thread Ingo Molnar
From: Hugh Dickins 

The BTS and PEBS buffers both have their virtual addresses
programmed into the hardware.  This means that any access to them
is performed via the page tables.  The times that the hardware
accesses these are entirely dependent on how the performance
monitoring hardware events are set up.  In other words, there is
no way for the kernel to tell when the hardware might access
these buffers.

To avoid perf crashes, place 'debug_store' in the user-mapped
per-cpu area instead of dynamically allocating.  Also use the
page allocator plus kaiser_add_mapping() to keep the BTS and PEBS
buffers user-mapped (that is, present in the user mapping, though
visible only to kernel and hardware).  The PEBS fixup buffer does
not need this treatment.

The need for a user-mapped struct debug_store showed up before doing
any conscious perf testing: in a couple of kernel paging oopses on
Westmere, implicating the debug_store offset of the per-cpu area.

Signed-off-by: Hugh Dickins 
Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003500.7ec0d...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 

x86/mm: Fix Kaiser build on 32-bit, backmerge to: x86/mm/kaiser: Map 
virtually-addressed performance monitoring buffers

Signed-off-by: Ingo Molnar 
---
 arch/x86/events/intel/ds.c | 49 ++
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 3674a4b6f8bd..61388b01962d 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -3,11 +3,15 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 
 #include "../perf_event.h"
 
+static
+DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store, cpu_debug_store);
+
 /* The size of a BTS record in bytes: */
 #define BTS_RECORD_SIZE24
 
@@ -279,6 +283,31 @@ void fini_debug_store_on_cpu(int cpu)
 
 static DEFINE_PER_CPU(void *, insn_buffer);
 
+static void *dsalloc(size_t size, gfp_t flags, int node)
+{
+   unsigned int order = get_order(size);
+   struct page *page;
+   unsigned long addr;
+
+   page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+   if (!page)
+   return NULL;
+   addr = (unsigned long)page_address(page);
+   if (kaiser_add_mapping(addr, size, __PAGE_KERNEL | _PAGE_GLOBAL) < 0) {
+   __free_pages(page, order);
+   addr = 0;
+   }
+   return (void *)addr;
+}
+
+static void dsfree(const void *buffer, size_t size)
+{
+   if (!buffer)
+   return;
+   kaiser_remove_mapping((unsigned long)buffer, size);
+   free_pages((unsigned long)buffer, get_order(size));
+}
+
 static int alloc_pebs_buffer(int cpu)
 {
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
@@ -289,7 +318,7 @@ static int alloc_pebs_buffer(int cpu)
if (!x86_pmu.pebs)
return 0;
 
-   buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
+   buffer = dsalloc(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
if (unlikely(!buffer))
return -ENOMEM;
 
@@ -300,7 +329,7 @@ static int alloc_pebs_buffer(int cpu)
if (x86_pmu.intel_cap.pebs_format < 2) {
ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
if (!ibuffer) {
-   kfree(buffer);
+   dsfree(buffer, x86_pmu.pebs_buffer_size);
return -ENOMEM;
}
per_cpu(insn_buffer, cpu) = ibuffer;
@@ -326,7 +355,8 @@ static void release_pebs_buffer(int cpu)
kfree(per_cpu(insn_buffer, cpu));
per_cpu(insn_buffer, cpu) = NULL;
 
-   kfree((void *)(unsigned long)ds->pebs_buffer_base);
+   dsfree((void *)(unsigned long)ds->pebs_buffer_base,
+   x86_pmu.pebs_buffer_size);
ds->pebs_buffer_base = 0;
 }
 
@@ -340,7 +370,7 @@ static int alloc_bts_buffer(int cpu)
if (!x86_pmu.bts)
return 0;
 
-   buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+   buffer = dsalloc(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
if (unlikely(!buffer)) {
WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
return -ENOMEM;
@@ -366,19 +396,15 @@ static void release_bts_buffer(int cpu)
if (!ds || !x86_pmu.bts)
return;
 
-   kfree((void *)(unsigned long)ds->bts_buffer_base);
+   dsfree((void *)(unsigned long)ds->bts_buffer_base, BTS_BUFFER_SIZE);
ds->bts_buffer_base = 0;
 }
 
 static int alloc_ds_buffer(int cpu)
 {
-   int no

[PATCH 34/43] x86/mm: Remove hard-coded ASID limit checks

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

First, it's nice to remove the magic numbers.

Second, KAISER is going to consume half of the available ASID
space.  The space is currently unused, but add a comment to spell
out this new restriction.
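
Spelling the arithmetic out, the new macros reproduce the old
hard-coded limit:

	CR3_AVAIL_ASID_BITS = 12 - 0 = 12
	MAX_ASID_AVAILABLE  = (1 << 12) - 2 = 4094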

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003504.57edb...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/tlbflush.h | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index df28f1a61afa..3101581c5da0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -75,6 +75,19 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
return new_tlb_gen;
 }
 
+/* There are 12 bits of space for ASIDS in CR3 */
+#define CR3_HW_ASID_BITS 12
+/* When enabled, KAISER consumes a single bit for user/kernel switches */
+#define KAISER_CONSUMED_ASID_BITS 0
+
+#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - KAISER_CONSUMED_ASID_BITS)
+/*
+ * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid.  -1 below
+ * to account for them being zero-based.  Another -1 is because ASID 0
+ * is reserved for use by non-PCID-aware users.
+ */
+#define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_ASID_BITS) - 2)
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
	if (static_cpu_has(X86_FEATURE_PCID)) {
-   VM_WARN_ON_ONCE(asid > 4094);
+   VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
return __sme_pa(pgd) | (asid + 1);
} else {
VM_WARN_ON_ONCE(asid != 0);
@@ -98,7 +111,7 @@ static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
-   VM_WARN_ON_ONCE(asid > 4094);
+   VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
 }
 
-- 
2.14.1



[PATCH 35/43] x86/mm: Put mmu-to-h/w ASID translation in one place

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

There are effectively two ASID types:
1. The one stored in the mmu_context that goes from 0->5
2. The one programmed into the hardware that goes from 1->6

This consolidates the places where we convert between the two
(by doing +1) into a single location, which gives us a nice spot
to comment.  KAISER will also need to, given an ASID, know which
hardware ASID to flush for the userspace mapping.
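
For instance (illustration only):

	u16 hw_asid = kern_asid(0);	/* mmu ASID 0 -> hardware PCID 1 */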

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003506.67e81...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/tlbflush.h | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3101581c5da0..24b27eb5904c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -88,21 +88,26 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
  */
 #define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_ASID_BITS) - 2)
 
+static inline u16 kern_asid(u16 asid)
+{
+   VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+   /*
+* If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+* bits.  This serves two purposes.  It prevents a nasty situation in
+* which PCID-unaware code saves CR3, loads some other value (with PCID
+* == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+* the saved ASID was nonzero.  It also means that any bugs involving
+* loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+* deterministically.
+*/
+   return asid + 1;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
if (static_cpu_has(X86_FEATURE_PCID)) {
-   VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-   return __sme_pa(pgd) | (asid + 1);
+   return __sme_pa(pgd) | kern_asid(asid);
} else {
VM_WARN_ON_ONCE(asid != 0);
return __sme_pa(pgd);
@@ -112,7 +117,8 @@ static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-   return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+   VM_WARN_ON_ONCE(!this_cpu_has(X86_FEATURE_PCID));
+   return __sme_pa(pgd) | kern_asid(asid) | CR3_NOFLUSH;
 }
 
 #ifdef CONFIG_PARAVIRT
-- 
2.14.1



[PATCH 36/43] x86/mm: Allow flushing for future ASID switches

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

If changing the page tables in such a way that an invalidation of
all contexts (aka. PCIDs / ASIDs) is required, they can be
actively invalidated by:

 1. INVPCID for each PCID (works for single pages too).
 2. Load CR3 with each PCID without the NOFLUSH bit set
 3. Load CR3 with the NOFLUSH bit set for each and do
INVLPG for each address.

But, none of these are really feasible since there are ~6 ASIDs (12 with
KAISER) at the time that invalidation is required.  Instead of
actively invalidating them, invalidate the *current* context and
also mark the cpu_tlbstate _quickly_ to indicate future invalidation
to be required.

At the next context-switch, look for this indicator
('all_other_ctxs_invalid' being set) and invalidate all of the
cpu_tlbstate.ctxs[] entries.

This ensures that any future context switches will do a full flush
of the TLB, picking up the previous changes.
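
A sketch of the consumer side at context-switch time (illustration
only; the real logic lives in clear_non_loaded_ctxs() below):

	if (this_cpu_read(cpu_tlbstate.all_other_ctxs_invalid)) {
		clear_non_loaded_ctxs();	/* zap ctx_id of non-loaded slots */
		this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, false);
	}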

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003507.e8c32...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/tlbflush.h | 47 -
 arch/x86/mm/tlb.c   | 35 ++
 2 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 24b27eb5904c..bb5ba71038ee 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -184,6 +184,17 @@ struct tlb_state {
 */
bool is_lazy;
 
+   /*
+* If set we changed the page tables in such a way that we
+* needed an invalidation of all contexts (aka. PCIDs / ASIDs).
+* This tells us to go invalidate all the non-loaded ctxs[]
+* on the next context switch.
+*
+* The current ctx was kept up-to-date as it ran and does not
+* need to be invalidated.
+*/
+   bool all_other_ctxs_invalid;
+
/*
 * Access to this CR4 shadow and to H/W CR4 is protected by
 * disabling interrupts when modifying either one.
@@ -261,6 +272,19 @@ static inline unsigned long cr4_read_shadow(void)
return this_cpu_read(cpu_tlbstate.cr4);
 }
 
+static inline void tlb_flush_shared_nonglobals(void)
+{
+   /*
+* With global pages, all of the shared kernel page tables
+* are set as _PAGE_GLOBAL.  We have no shared nonglobals
+* and nothing to do here.
+*/
+   if (IS_ENABLED(CONFIG_X86_GLOBAL_PAGES))
+   return;
+
+   this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, true);
+}
+
 /*
  * Save some of cr4 feature set we're using (e.g.  Pentium 4MB
  * enable and PPro Global page enable), so that any CPU's that boot
@@ -290,6 +314,10 @@ static inline void __native_flush_tlb(void)
preempt_disable();
native_write_cr3(__native_read_cr3());
preempt_enable();
+   /*
+* Does not need tlb_flush_shared_nonglobals() since the CR3 write
+* without PCIDs flushes all non-globals.
+*/
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -335,24 +363,23 @@ static inline void __native_flush_tlb_single(unsigned long addr)
 
 static inline void __flush_tlb_all(void)
 {
-   if (boot_cpu_has(X86_FEATURE_PGE))
+   if (boot_cpu_has(X86_FEATURE_PGE)) {
__flush_tlb_global();
-   else
+   } else {
__flush_tlb();
-
-   /*
-* Note: if we somehow had PCID but not PGE, then this wouldn't work --
-* we'd end up flushing kernel translations for the current ASID but
-* we might fail to flush kernel translations for other cached ASIDs.
-*
-* To avoid this issue, we force PCID off if PGE is off.
-*/
+   tlb_flush_shared_nonglobals();
+   }
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
 {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
__flush_tlb_single(addr);
+   /*
+* Invalidate other address spaces inaccessible to single-page
+* invalidation:
+*/
+   tlb_flush_shared_nonglobals();
 }
 
 #define TLB_FLUSH_ALL  -1UL
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e629dbda01a0..81941f1690fa 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,38 @@
  * Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+/*
+ * We get here when we do something requiring a TLB invalidation
+ * but could not go invalidate all of the contexts.  We do the
+ * necessary invalidation by clearing out the 'ctx_id' which
+ * forces a TLB flush when the context is loaded.
+ */
+void clear_non_loaded_ctxs(void)
+

[PATCH 38/43] x86/mm/kaiser: Disable native VSYSCALL

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

The KAISER code attempts to "poison" the user portion of the kernel page
tables.  It detects entries that it wants to poison in two
ways:
 * Looking for addresses >= PAGE_OFFSET
 * Looking for entries without _PAGE_USER set

But, to allow the _PAGE_USER check to work, it must never be set on
init_mm entries, and an earlier patch in this series ensured that it
will never be set.

The VDSO is at an address >= PAGE_OFFSET and it is also mapped by init_mm.
Because of the earlier, KAISER-enforced restriction, _PAGE_USER is never
set which makes the VDSO unreadable to userspace.

This makes the "NATIVE" case totally unusable since userspace can not
even see the memory any more.  Disable it whenever KAISER is enabled.

Also add some help text about how KAISER might affect the emulation
case as well.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003513.10cad...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/Kconfig | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 09dcc94c4484..d23cd2902b10 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2249,6 +2249,9 @@ choice
 
config LEGACY_VSYSCALL_NATIVE
bool "Native"
+   # The VSYSCALL page comes from the kernel page tables
+   # and is not available when KAISER is enabled.
+   depends on !KAISER
help
  Actual executable code is located in the fixed vsyscall
  address mapping, implementing time() efficiently. Since
@@ -2266,6 +2269,11 @@ choice
  exploits. This configuration is recommended when userspace
  still uses the vsyscall area.
 
+ When KAISER is enabled, the vsyscall area will become
+ unreadable.  This emulation option still works, but KAISER
+ will make it harder to do things like trace code using the
+ emulation.
+
config LEGACY_VSYSCALL_NONE
bool "None"
help
-- 
2.14.1



[PATCH 37/43] x86/mm/kaiser: Use PCID feature to make user and kernel switches faster

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

Short summary: Use x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls.  Speed them up.  Makes context switches
and TLB flushing slower.

Background:

KAISER keeps two copies of the page tables.  Switches between the
copies are performed by writing to the CR3 register.  But, CR3
was really designed for context switches and writes to it also
flush the entire TLB (modulo global pages).  This TLB flush
increases the cost of interrupts and context switches.  For
syscall-heavy microbenchmarks it can cut the rate of syscalls by
2/3.

The kernel recently gained support for an Intel CPU feature
called Process Context IDentifiers (PCID) thanks to Andy
Lutomirski.  This feature is intended to allow you to switch
between contexts without flushing the TLB.

Implementation:

PCIDs can be used to avoid flushing the TLB at kernel entry/exit.
This speeds up both interrupts and syscalls.

First, the kernel and userspace must be assigned different ASIDs.
On entry from userspace, move over to the kernel page tables
*and* ASID.  On exit, restore the user page tables and ASID.
Fortunately, the ASID is programmed via CR3, which is already
being used to switch between the user and kernel page tables.
This gives us convenient, one-stop shopping.

The CR3 write which is used to switch between processes provides
all the TLB flushing normally required at context switch time.
But, with KAISER, that CR3 write only flushes the current
(kernel) ASID.  An extra TLB flush operation is now required in
order to flush the user ASID.  This new instruction (INVPCID) is
probably ~100 cycles, but this is done with the assumption that
the time lost in context switches is more than made up for by
lower cost of interrupts and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs.  However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", or called fourth-generation Core).  This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3 and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use.  TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive.  This is because
   there is no way to flush a kernel address that might be loaded
   in *EVERY* PCID.  Right now, there are "only" ~12 of these per-cpu,
   but that's too painful to use the MOV-to-CR3 to flush them.  That
   leaves only the MOV-to-CR4.
2. Every userspace flush (even for a single page) requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for
  the kernel ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the
  flush done for each.  For instance, what is currently a
  single instruction without KAISER:

invpcid_flush_one(current_pcid, addr);

  becomes this with KAISER:

invpcid_flush_one(current_kern_pcid, addr);
invpcid_flush_one(current_user_pcid, addr);

  and this without INVPCID:

__native_flush_tlb_single(addr);
write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
__native_flush_tlb_single(addr);
write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, fully disable PCIDs with KAISER when INVPCID is not
available.  This is fixable, but it's an optimization that can be
performed later.
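
A sketch of that setup-time decision (illustration only; the exact
check in this patch's arch/x86/mm/init.c changes may be shaped
differently):

	if (IS_ENABLED(CONFIG_KAISER) && !boot_cpu_has(X86_FEATURE_INVPCID))
		setup_clear_cpu_cap(X86_FEATURE_PCID);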

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER.  The first way they can be used
is as "TLB preservation across context-switch", which is what
Andy Lutomirksi's 4.14 PCID code does.  They can also be used as
a "KAISER syscall/interrupt accelerator".  If we just use them to
speed up syscall/interrupts (and ignore the context-switch TLB
preservation), then the deficiency of not having INVPCID
becomes much less onerous.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003509.ec42d...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/calling.h|  25 +++--
 arch/x86/entry/entry_64.S   |   1 +
 arch/x86/include/asm/cpufeatures.h  |   1 +
 arch/x86/include/asm/pgtable_types.h|  11 +++
 arch/x86/include/asm/tlbflush.h | 137 +++-
 arch/x86/include/uapi/asm/processor-flags.h |   3 +-
 arch/x86/kvm/x86.c  |   3 +-
 arch/x86/mm/init.c  |  75 ++-
 arch/x86/mm/tlb.c   

[PATCH 39/43] x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

This will be used in a few patches.  Right now, it's not wired up
to do anything useful.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003517.8eab7...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/mm/kaiser.c | 48 
 1 file changed, 48 insertions(+)

diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 4665dd724efb..968d5b62d597 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -470,3 +471,50 @@ void kaiser_remove_mapping(unsigned long start, unsigned long size)
 */
__native_flush_tlb_global();
 }
+
+int kaiser_enabled = 1;
+static ssize_t kaiser_enabled_read_file(struct file *file, char __user *user_buf,
+size_t count, loff_t *ppos)
+{
+   char buf[32];
+   unsigned int len;
+
+   len = sprintf(buf, "%d\n", kaiser_enabled);
+   return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t kaiser_enabled_write_file(struct file *file,
+const char __user *user_buf, size_t count, loff_t *ppos)
+{
+   char buf[32];
+   ssize_t len;
+   unsigned int enable;
+
+   len = min(count, sizeof(buf) - 1);
+   if (copy_from_user(buf, user_buf, len))
+   return -EFAULT;
+
+   buf[len] = '\0';
+   if (kstrtoint(buf, 0, &enable))
+   return -EINVAL;
+
+   if (enable > 1)
+   return -EINVAL;
+
+   WRITE_ONCE(kaiser_enabled, enable);
+   return count;
+}
+
+static const struct file_operations fops_kaiser_enabled = {
+   .read = kaiser_enabled_read_file,
+   .write = kaiser_enabled_write_file,
+   .llseek = default_llseek,
+};
+
+static int __init create_kaiser_enabled(void)
+{
+   debugfs_create_file("kaiser-enabled", S_IRUSR | S_IWUSR,
+   arch_debugfs_dir, NULL, &fops_kaiser_enabled);
+   return 0;
+}
+late_initcall(create_kaiser_enabled);
-- 
2.14.1



[PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

The KAISER CR3 switches are expensive for many reasons.  Not all systems
benefit from the protection provided by KAISER.  Some of them can not
pay the high performance cost.

This patch adds a debugfs file.  To disable KAISER, you do:

echo 0 > /sys/kernel/debug/x86/kaiser-enabled

and to re-enable it, you can:

echo 1 > /sys/kernel/debug/x86/kaiser-enabled

This is a *minimal* implementation.  There are certainly plenty of
optimizations that can be done on top of this by using ALTERNATIVES
among other things.

This does, however, completely remove all the KAISER-based CR3 writes.
This permits a paravirtualized system that can not tolerate CR3
writes to theoretically survive with CONFIG_KAISER=y, albeit with
/sys/kernel/debug/x86/kaiser-enabled=0.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003523.28ffb...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/calling.h | 12 +
 arch/x86/mm/kaiser.c | 70 +---
 2 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 66af80514197..89ccf7ae0e23 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -209,19 +209,29 @@ For 32-bit we have the following conventions - kernel is built with
orq $(KAISER_SWITCH_MASK), \reg
 .endm
 
+.macro JUMP_IF_KAISER_OFF  label
+   testq   $1, kaiser_asm_do_switch
+   jz  \label
+.endm
+
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+   JUMP_IF_KAISER_OFF  .Lswitch_done_\@
mov %cr3, \scratch_reg
ADJUST_KERNEL_CR3 \scratch_reg
mov \scratch_reg, %cr3
+.Lswitch_done_\@:
 .endm
 
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
+   JUMP_IF_KAISER_OFF  .Lswitch_done_\@
mov %cr3, \scratch_reg
ADJUST_USER_CR3 \scratch_reg
mov \scratch_reg, %cr3
+.Lswitch_done_\@:
 .endm
 
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+   JUMP_IF_KAISER_OFF  .Ldone_\@
movq%cr3, %r\scratch_reg
movq%r\scratch_reg, \save_reg
/*
@@ -244,11 +254,13 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 
 .macro RESTORE_CR3 save_reg:req
+   JUMP_IF_KAISER_OFF  .Ldone_\@
/*
 * The CR3 write could be avoided when not changing its value,
 * but would require a CR3 read *and* a scratch register.
 */
movq\save_reg, %cr3
+.Ldone_\@:
 .endm
 
 #else /* CONFIG_KAISER=n: */
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 06966b111280..1eb27b410556 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -43,6 +43,9 @@
 
 #define KAISER_WALK_ATOMIC  0x1
 
+__aligned(PAGE_SIZE)
+unsigned long kaiser_asm_do_switch[PAGE_SIZE/sizeof(unsigned long)] = { 1 };
+
 /*
  * At runtime, the only things we map are some things for CPU
  * hotplug, and stacks for new processes.  No two CPUs will ever
@@ -395,6 +398,9 @@ void __init kaiser_init(void)
 
kaiser_init_all_pgds();
 
+   kaiser_add_user_map_early(&kaiser_asm_do_switch, PAGE_SIZE,
+ __PAGE_KERNEL | _PAGE_GLOBAL);
+
for_each_possible_cpu(cpu) {
void *percpu_vaddr = __per_cpu_user_mapped_start +
 per_cpu_offset(cpu);
@@ -483,6 +489,56 @@ static ssize_t kaiser_enabled_read_file(struct file *file, char __user *user_buf
return simple_read_from_buffer(user_buf, count, ppos, buf, len);
 }
 
+enum poison {
+   KAISER_POISON,
+   KAISER_UNPOISON
+};
+void kaiser_poison_pgds(enum poison do_poison);
+
+void kaiser_do_disable(void)
+{
+   /* Make sure the kernel PGDs are usable by userspace: */
+   kaiser_poison_pgds(KAISER_UNPOISON);
+
+   /*
+* Make sure all the CPUs have the poison clear in their TLBs.
+* This also functions as a barrier to ensure that everyone
+* sees the unpoisoned PGDs.
+*/
+   flush_tlb_all();
+
+   /* Tell the assembly code to stop switching CR3. */
+   kaiser_asm_do_switch[0] = 0;
+
+   /*
+* Make sure everybody does an interrupt.  This means that
+* they have gone through a SWITCH_TO_KERNEL_CR3 amd are no
+* longer running on the userspace CR3.  If we did not do
+* this, we might have CPUs running on the shadow page tables
+* that then enter the kernel and think they do *not* need to
+* switch.
+*/
+   flush_tlb_all();
+}
+
+void kaiser_do_enable(void)
+{
+   /* Tell the assembly code to start switchi

[PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

This is largely code from Andy Lutomirski.  I fixed a few bugs
in it, and added a few SWITCH_TO_* spots.

KAISER needs to switch to a different CR3 value when it enters
the kernel and switch back when it exits.  This essentially
needs to be done before leaving assembly code.

This is extra challenging because the switching context is
tricky: the registers that can be clobbered can vary.  It is also
hard to store things on the stack because there is an established
ABI (ptregs) or the stack is entirely unsafe to use.

This patch establishes a set of macros that allow changing to
the user and kernel CR3 values.
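
The underlying address math, sketched in C for illustration (the
assembly macros do the equivalent with andq/orq on a scratch
register):

	/* The two PGD halves share one 8k allocation; bit 12 of CR3 selects: */
	kernel_cr3 = cr3 & ~(1UL << PAGE_SHIFT);	/* clear the "KAISER bit" */
	user_cr3   = cr3 |  (1UL << PAGE_SHIFT);	/* set it */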

Interactions with SWAPGS: previous versions of the KAISER code
relied on having per-cpu scratch space to save/restore a register
that can be used for the CR3 MOV.  The %GS register is used to
index into our per-cpu space, so SWAPGS *had* to be done before
the CR3 switch.  That scratch space is gone now, but the semantic
that SWAPGS must be done before the CR3 MOV is retained.  This is
good to keep because it is not that hard to do and it allows us
to do things like add per-cpu debugging information to help us
figure out what goes wrong sometimes.

What this does in the NMI code is worth pointing out.  NMIs
can interrupt *any* context and they can also be nested with
NMIs interrupting other NMIs.  The comments below
".Lnmi_from_kernel" explain the format of the stack during this
situation.  Changing the format of this stack is not a fun
exercise: I tried.  Instead of storing the old CR3 value on the
stack, this patch depends on the *regular* register save/restore
mechanism and then uses %r14 to keep CR3 during the NMI.  It is
callee-saved and will not be clobbered by the C NMI handlers that
get called.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003442.2d047...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/calling.h | 65 
 arch/x86/entry/entry_64.S| 44 +--
 arch/x86/entry/entry_64_compat.S | 32 +++-
 3 files changed, 137 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 3fd8bc560fae..e1650da01323 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include 
 #include 
+#include 
 
 /*
 
@@ -187,6 +188,70 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
+#ifdef CONFIG_KAISER
+
+/* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
+#define KAISER_SWITCH_MASK (1 << PAGE_SHIFT)

	js	1f				/* negative -> in kernel */
SWAPGS
xorl%ebx, %ebx
-1: ret
+
+1:
+   SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
+
+   ret
 END(paranoid_entry)
 
 /*
@@ -1261,6 +1286,7 @@ ENTRY(paranoid_exit)
testl   %ebx, %ebx  /* swapgs needed? */
jnz .Lparanoid_exit_no_swapgs
TRACE_IRQS_IRETQ
+   RESTORE_CR3 %r14
SWAPGS_UNSAFE_STACK
jmp .Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1289,6 +1315,9 @@ ENTRY(error_entry)
 */
SWAPGS
 
+   /* We have user CR3.  Change to kernel CR3. */
+   SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+
 .Lerror_entry_from_usermode_after_

[PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

This patch creates a new kind of per-cpu data that is mapped and
can be used no matter which copy of the page tables is active.
Users of this new section will be forthcoming.
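
Usage mirrors the existing per-cpu macros; a hypothetical example
('example_scratch' is not a variable added by this series):

	DEFINE_PER_CPU_USER_MAPPED(unsigned long, example_scratch);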

Thanks to Hugh Dickins for cleanups to this code.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003444.196cb...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 include/asm-generic/vmlinux.lds.h |  7 +++
 include/linux/percpu-defs.h   | 30 ++
 2 files changed, 37 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index bdcd1caae092..e12168936d3f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -826,7 +826,14 @@
  */
 #define PERCPU_INPUT(cacheline)
\
VMLINUX_SYMBOL(__per_cpu_start) = .;\
+   VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .;\
*(.data..percpu..first) \
+   . = ALIGN(cacheline);   \
+   *(.data..percpu..user_mapped)   \
+   *(.data..percpu..user_mapped..shared_aligned)   \
+   . = ALIGN(PAGE_SIZE);   \
+   *(.data..percpu..user_mapped..page_aligned) \
+   VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .;  \
. = ALIGN(PAGE_SIZE);   \
*(.data..percpu..page_aligned)  \
. = ALIGN(cacheline);   \
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index 2d2096ba1cfe..752513674295 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -35,6 +35,12 @@
 
 #endif
 
+#ifdef CONFIG_KAISER
+#define USER_MAPPED_SECTION "..user_mapped"
+#else
+#define USER_MAPPED_SECTION ""
+#endif
+
 /*
  * Base implementations of per-CPU variable declarations and definitions, where
  * the section in which the variable is to be placed is provided by the
@@ -115,6 +121,12 @@
 #define DEFINE_PER_CPU(type, name) \
DEFINE_PER_CPU_SECTION(type, name, "")
 
+#define DECLARE_PER_CPU_USER_MAPPED(type, name)			\
+   DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
+#define DEFINE_PER_CPU_USER_MAPPED(type, name) \
+   DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
 /*
  * Declaration/definition used for per-CPU variables that must come first in
  * the set of variables.
@@ -144,6 +156,14 @@
DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \
cacheline_aligned_in_smp
 
+#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name) \
+   DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+   cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)  \
+   DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+   cacheline_aligned_in_smp
+
 #define DECLARE_PER_CPU_ALIGNED(type, name)\
DECLARE_PER_CPU_SECTION(type, name, PER_CPU_ALIGNED_SECTION)\
cacheline_aligned
@@ -162,6 +182,16 @@
 #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name)			\
DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")\
__aligned(PAGE_SIZE)
+/*
+ * Declaration/definition used for per-CPU variables that must be page
+ * aligned and need to be mapped in user mode.
+ */
+#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)   \
+   DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+   __aligned(PAGE_SIZE)
+
+#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)\
+   DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+   __aligned(PAGE_SIZE)
 
 /*
  * Declaration/definition used for per-CPU variables that must be read mostly.
-- 
2.14.1



[PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

Here's a short summary of the things mapped to userspace:
 * The gdt_page's virtual address is pointed to by the LGDT instruction.
   It is needed to define the segments.  Deeply required by CPU to run.
 * cpu_tss tells the CPU, among other things, where the new stacks are
   after user<->kernel transitions.  Needed by the CPU to make ring
   transitions.
 * exception_stacks are needed at interrupt and exception entry
   so that there is storage for, among other things, some temporary
   space to permit clobbering a register to load the kernel CR3.

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003445.df9ea...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/desc.h  | 2 +-
 arch/x86/include/asm/processor.h | 2 +-
 arch/x86/kernel/cpu/common.c | 4 ++--
 arch/x86/kernel/process.c| 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index aab4fe9f49f8..300090d1c209 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -46,7 +46,7 @@ struct gdt_page {
struct desc_struct gdt[GDT_ENTRIES];
 } __attribute__((aligned(PAGE_SIZE)));
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page);
+DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);
 
 /* Provide the original GDT */
 static inline struct desc_struct *get_cpu_gdt_rw(unsigned int cpu)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 54f3ee3bc8a0..83dd7c97ba5d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -359,7 +359,7 @@ struct tss_struct {
unsigned long   io_bitmap[IO_BITMAP_LONGS + 1];
 } __aligned(PAGE_SIZE);
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f9c7e6852874..3b6920c9fef7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -98,7 +98,7 @@ static const struct cpu_dev default_cpu = {
 
 static const struct cpu_dev *this_cpu = &default_cpu;
 
-DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page) = { .gdt = {
 #ifdef CONFIG_X86_64
/*
 * We need valid kernel segments for data and code in long mode too
@@ -515,7 +515,7 @@ static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
  [DEBUG_STACK - 1] = DEBUG_STKSZ
 };
 
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(char, exception_stacks
[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
 #endif
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 6a04287f222b..9365b4f965e0 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -47,7 +47,7 @@
  * section. Since TSS's are completely CPU-local, we want them
  * on exact cacheline boundaries, to eliminate cacheline ping-pong.
  */
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss) = {
.x86_tss = {
/*
 * .sp0 is only used when entering ring 0 from a lower
-- 
2.14.1



[PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

These patches are based on work from a team at Graz University of
Technology: https://github.com/IAIK/KAISER .  This work would not have
been possible without their work as a starting point.

KAISER is a countermeasure against side channel attacks against kernel
virtual memory.  It leaves the existing page tables largely alone and
refers to them as the "kernel page tables.  It adds a "shadow" pgd for
every process which is intended for use when running userspace.  The
shadow pgd maps all the same user memory as the "kernel" copy, but
only maps a minimal set of kernel memory.

Whenever entering the kernel (syscalls, interrupts, exceptions), the
pgd is switched to the "kernel" copy.  When switching back to user
mode, the shadow pgd is used.

The minimalistic kernel page tables try to map only what is needed to
enter/exit the kernel such as the entry/exit functions themselves and
the interrupt descriptors (IDT).

=== Page Table Poisoning ===

KAISER has two copies of the page tables: one for the kernel and
one for when running in userspace.  There is also a kernel
portion of each of the page tables: the part that *maps* the
kernel.

The kernel portion is relatively static and uses pre-populated
PGDs.  Nobody ever calls set_pgd() on the kernel portion during
normal operation.

The userspace portion of the page tables is updated frequently as
userspace pages are mapped and page table pages are allocated.
These updates of the userspace *portion* of the tables need to be
reflected into both the kernel and user/shadow copies.
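
As a rough model of that mirroring (illustrative only; the table sizes, the
split index, and the helpers below are invented, not the patch's code):

#include <stdbool.h>
#include <stdio.h>

#define PTRS_PER_PGD     512
#define KERNEL_PGD_START 256   /* entries >= this map the kernel */

static unsigned long kernel_pgd[PTRS_PER_PGD];  /* "kernel" copy      */
static unsigned long shadow_pgd[PTRS_PER_PGD];  /* "user/shadow" copy */

static bool is_user_index(int idx)
{
	return idx < KERNEL_PGD_START;
}

/* a set_pgd()-like update: user entries are reflected into both copies,
 * kernel entries deliberately never reach the shadow copy */
static void set_pgd_model(int idx, unsigned long val)
{
	kernel_pgd[idx] = val;
	if (is_user_index(idx))
		shadow_pgd[idx] = val;
}

int main(void)
{
	set_pgd_model(3, 0x1000);    /* user mapping: mirrored         */
	set_pgd_model(300, 0x2000);  /* kernel mapping: shadow stays 0 */
	printf("shadow[3]=%#lx shadow[300]=%#lx\n",
	       shadow_pgd[3], shadow_pgd[300]);
	return 0;
}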

The original KAISER patches did this by effectively looking at the
address that is being updated.  If it is 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003447.1db39...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 Documentation/x86/kaiser.txt | 162 +
 arch/x86/boot/compressed/pagetable.c |   6 +
 arch/x86/entry/calling.h |   1 +
 arch/x86/include/asm/kaiser.h|  57 +
 arch/x86/include/asm/pgtable.h   |   5 +
 arch/x86/include/asm/pgtable_64.h| 132 +++
 arch/x86/kernel/espfix_64.c  |  17 ++
 arch/x86/kernel/head_64.S|  14 +-
 arch/x86/mm/Makefile |   1 +
 arch/x86/mm/kaiser.c | 441 +++
 arch/x86/mm/pageattr.c   |   2 +-
 arch/x86/mm/pgtable.c|  16 +-
 include/linux/kaiser.h   |  29 +++
 init/main.c  |   3 +
 kernel/fork.c|   1 +
 15 files changed, 881 insertions(+), 6 deletions(-)

diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
new file mode 100644
index ..745c4be39b92
--- /dev/null
+++ b/Documentation/x86/kaiser.txt
@@ -0,0 +1,162 @@
+Overview
+
+
+KAISER is a countermeasure against attacks on kernel address
+information.  There are at least three existing, published,
+approaches using the shared user/kernel mapping and hardware features
+to defeat KASLR.  One approach referenced in the paper locates the
+kernel by observing differences in page fault timing between
+present-but-inaccessible kernel pages and non-present pages.
+
+When the kernel is entered via syscalls, interrupts or exceptions,
+page tables are switched to the full "kernel" copy.  When the
+system switches back to user mode, the user/shadow copy is used.
+
+The minimalistic kernel portion of the user page tables tries to
+map only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor
+table (IDT).  There are a few unnecessary things that get mapped
+such as the first C function when entering an interrupt (see
+comments in kaiser.c).
+
+This helps to ensure that side-channel attacks that leverage the
+paging structures do not function when KAISER is enabled.  It can be
+enabled by setting CONFIG_KAISER=y.
+
+Page Table Management
+=
+
+When KAISER is enabled, the kernel manages two sets of page
+tables.  The first copy is very similar to what would be present
+for a kernel without KAISER.  This includes a complete mapping of
+userspace that the kernel can use for things like copy_to_user().
+
+The second (shadow) is used when running userspace and mirrors the
+mapping of userspace present in the kernel copy.  It maps only
+the kernel data needed to enter and exit the kernel.
+
+The shadow is populated by the kaiser_add_*() functions.  Only
+kernel data which has been explicitly mapped will appear in the
+shadow copy.  These calls are rare at runtime.
+
+For a new userspace mapping, the kernel makes the entries in its
+page tables like normal.  The o

[PATCH 27/43] x86/mm/kaiser: Make sure static PGDs are 8k in size

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

A few PGDs come out of the kernel binary instead of being
allocated dynamically.  Before this patch, they are all
8k-aligned, but they must also be 8k in *size*.

The original KAISER patch did not do this.  It probably just
lucked out that it did not trample over data after the last PGD.
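
The layout this buys can be sketched in plain C; the alignment trick below is
why 8k alignment plus 8k size matters, and user_pgd_of() is an invented
helper for the sketch, not kernel API:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* 8k-aligned AND 8k-sized, like NEXT_PGD_PAGE plus the new .fill padding:
 * kernel copy in the first 4k, user/shadow copy in the second 4k */
static _Alignas(8192) unsigned long pgd_pair[2 * 4096 / sizeof(unsigned long)];

/* with this layout the user copy is found by setting bit 12 of the
 * kernel copy's address */
static unsigned long *user_pgd_of(unsigned long *kernel_pgd)
{
	return (unsigned long *)((uintptr_t)kernel_pgd | PAGE_SIZE);
}

int main(void)
{
	printf("kernel pgd at %p, user pgd at %p\n",
	       (void *)pgd_pair, (void *)user_pgd_of(pgd_pair));
	return 0;
}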

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003450.76492...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/head_64.S | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 43d1cffd1fcf..58087ab1782e 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -342,11 +342,24 @@ GLOBAL(early_recursion_flag)
 GLOBAL(name)
 
 #ifdef CONFIG_KAISER
+/*
+ * Each PGD needs to be 8k long and 8k aligned.  We do not
+ * ever go out to userspace with these, so we do not
+ * strictly *need* the second page, but this allows us to
+ * have a single set_pgd() implementation that does not
+ * need to worry about whether it has 4k or 8k to work
+ * with.
+ *
+ * This ensures PGDs are 8k long:
+ */
+#define KAISER_USER_PGD_FILL   512
+/* This ensures they are 8k-aligned: */
 #define NEXT_PGD_PAGE(name) \
.balign 2 * PAGE_SIZE; \
 GLOBAL(name)
 #else
 #define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#define KAISER_USER_PGD_FILL   0
 #endif
 
 /* Automate the creation of 1 to 1 mapping pmd entries */
@@ -365,6 +378,7 @@ NEXT_PGD_PAGE(early_top_pgt)
 #else
.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #endif
+   .fill   KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(early_dynamic_pgts)
.fill   512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -379,6 +393,7 @@ NEXT_PGD_PAGE(init_top_pgt)
.orginit_top_pgt + PGD_START_KERNEL*8, 0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+   .fill   KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(level3_ident_pgt)
.quad   level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -391,6 +406,7 @@ NEXT_PAGE(level2_ident_pgt)
 #else
 NEXT_PGD_PAGE(init_top_pgt)
.fill   512,8,0
+   .fill   KAISER_USER_PGD_FILL,8,0
 #endif
 
 #ifdef CONFIG_X86_5LEVEL
-- 
2.14.1



[PATCH 26/43] x86/mm/kaiser: Allow NX poison to be set in p4d/pgd

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

The user portion of the kernel page tables use the NX bit to
poison them for userspace.  But, that trips the p4d/pgd_bad()
checks.  Make sure it does not do that.
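
A standalone model of the relaxed check (the flag values are x86-like
constants reused here purely for illustration):

#include <stdio.h>

#define _PAGE_NX      (1ULL << 63)
#define _PAGE_USER    (1ULL << 2)
#define _KERNPG_TABLE 0x63ULL   /* PRESENT|RW|ACCESSED|DIRTY, x86-like */

static int pgd_bad_model(unsigned long long flags, int kaiser_enabled)
{
	unsigned long long ignore = _PAGE_USER;

	if (kaiser_enabled)
		ignore |= _PAGE_NX;   /* NX poison is expected, not "bad" */

	return (flags & ~ignore) != _KERNPG_TABLE;
}

int main(void)
{
	unsigned long long poisoned = _KERNPG_TABLE | _PAGE_USER | _PAGE_NX;

	printf("without KAISER: bad=%d, with KAISER: bad=%d\n",
	       pgd_bad_model(poisoned, 0), pgd_bad_model(poisoned, 1));
	return 0;
}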

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003448.c6ab3...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/pgtable.h | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index d3901124143f..9cceaf6c0405 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -846,7 +846,12 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 
 static inline int p4d_bad(p4d_t p4d)
 {
-   return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+   unsigned long ignore_flags = _KERNPG_TABLE | _PAGE_USER;
+
+   if (IS_ENABLED(CONFIG_KAISER))
+   ignore_flags |= _PAGE_NX;
+
+   return (p4d_flags(p4d) & ~ignore_flags) != 0;
 }
 #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
 
@@ -880,7 +885,12 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 
 static inline int pgd_bad(pgd_t pgd)
 {
-   return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE;
+   unsigned long ignore_flags = _PAGE_USER;
+
+   if (IS_ENABLED(CONFIG_KAISER))
+   ignore_flags |= _PAGE_NX;
+
+   return (pgd_flags(pgd) & ~ignore_flags) != _KERNPG_TABLE;
 }
 
 static inline int pgd_none(pgd_t pgd)
-- 
2.14.1



[PATCH 28/43] x86/mm/kaiser: Map CPU entry area

2017-11-24 Thread Ingo Molnar
From: Dave Hansen 

There is now a special 'struct cpu_entry_area' that contains all
of the data needed to enter the kernel.  It's mapped in the fixmap
area and contains:

 * The GDT (hardware segment descriptor table)
 * The TSS (task state structure that points the hardware
   to the various stacks, and contains the entry stack).
 * The entry trampoline code itself
 * The exception stacks (aka the IST stacks)

Signed-off-by: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Daniel Gruss 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Josh Poimboeuf 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Michael Schwarz 
Cc: Moritz Lipp 
Cc: Peter Zijlstra 
Cc: Richard Fellner 
Cc: Thomas Gleixner 
Cc: linux...@kvack.org
Link: http://lkml.kernel.org/r/20171123003453.d4cb3...@viggo.jf.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/kaiser.h |  6 ++
 arch/x86/kernel/cpu/common.c  |  4 
 arch/x86/mm/kaiser.c  | 31 +++
 include/linux/kaiser.h|  3 +++
 4 files changed, 44 insertions(+)

diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
index 3c2cc71b4058..040cb096d29d 100644
--- a/arch/x86/include/asm/kaiser.h
+++ b/arch/x86/include/asm/kaiser.h
@@ -33,6 +33,12 @@
 extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
  unsigned long flags);
 
+/**
+ *  kaiser_add_mapping_cpu_entry - map the cpu entry area
+ *  @cpu: the CPU for which the entry area is being mapped
+ */
+extern void kaiser_add_mapping_cpu_entry(int cpu);
+
 /**
  *  kaiser_remove_mapping - remove a kernel mapping from the userpage tables
  *  @addr: the start address of the range
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3b6920c9fef7..d6bcf397b00d 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -584,6 +585,9 @@ static inline void setup_cpu_entry_area(int cpu)
__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
+   /* CPU 0's mapping is done in kaiser_init() */
+   if (cpu)
+   kaiser_add_mapping_cpu_entry(cpu);
 }
 
 /* Load the original GDT from the per-cpu structure */
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 7f7561e9971d..4665dd724efb 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -353,6 +353,26 @@ static void __init kaiser_init_all_pgds(void)
WARN_ON(__ret); \
 } while (0)
 
+void kaiser_add_mapping_cpu_entry(int cpu)
+{
+   kaiser_add_user_map_early(get_cpu_gdt_ro(cpu), PAGE_SIZE,
+ __PAGE_KERNEL_RO);
+
+   /* includes the entry stack */
+   kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->tss,
+ sizeof(get_cpu_entry_area(cpu)->tss),
+ __PAGE_KERNEL | _PAGE_GLOBAL);
+
+   /* Entry code, so needs to be EXEC */
+   kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->entry_trampoline,
+ sizeof(get_cpu_entry_area(cpu)->entry_trampoline),
+ __PAGE_KERNEL_EXEC | _PAGE_GLOBAL);
+
+   kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->exception_stacks,
+ sizeof(get_cpu_entry_area(cpu)->exception_stacks),
+ __PAGE_KERNEL | _PAGE_GLOBAL);
+}
+
 extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
 /*
  * If anything in here fails, we will likely die on one of the
@@ -390,6 +410,17 @@ void __init kaiser_init(void)
kaiser_add_user_map_early((void *)idt_descr.address,
  sizeof(gate_desc) * NR_VECTORS,
  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
+
+   /*
+* We delay CPU 0's mappings because these structures are
+* created before the page allocator is up.  Deferring it
+* until here lets us use the plain page allocator
+* unconditionally in the page table code above.
+*
+* This is OK because kaiser_init() is called long before
+* we ever run userspace and need the KAISER mappings.
+*/
+   kaiser_add_mapping_cpu_entry(0);
 }
 
 int kaiser_add_mapping(unsigned long addr, unsigned long size,
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
index 0fd800efa95c..77db4230a0dd 100644
--- a/include/linux/kaiser.h
+++ b/include/linux/kaiser.h
@@ -25,5 +25,8 @@ static inline int kaiser_add_mapping(unsigned long addr, 
unsigned long size,
return 0;
 }
 
+static inline void kaiser_add_mapping_cpu_entry(int cpu)
+{
+}
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
-- 
2.14.1



[PATCH 20/43] x86/entry: Clean up SYSENTER_stack code

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

The existing code was a mess, mainly because C arrays are nasty.
Turn SYSENTER_stack into a struct, add a helper to find it, and do
all the obvious cleanups this enables.
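
One concrete payoff: with a struct, the begin/end/size computations that the
unwinder and asm-offsets code need fall out of plain sizeof() and address-of,
with no ((struct tss_struct *)0)->member tricks.  A tiny standalone
illustration (a model, not the kernel's code):

#include <stdio.h>

struct SYSENTER_stack_model {
	unsigned long words[64];
};

int main(void)
{
	static struct SYSENTER_stack_model s;

	void *begin = &s;                      /* natural base address     */
	void *end   = (char *)&s + sizeof(s);  /* natural size computation */

	printf("stack spans %p..%p, %zu bytes\n", begin, end, sizeof(s));
	return 0;
}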

Signed-off-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/38ff640712c9b591b32de24a080daf13afaba234.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/entry_32.S|  4 ++--
 arch/x86/entry/entry_64.S|  2 +-
 arch/x86/include/asm/fixmap.h|  5 +
 arch/x86/include/asm/processor.h |  6 +-
 arch/x86/kernel/asm-offsets.c|  6 ++
 arch/x86/kernel/cpu/common.c | 14 +++---
 arch/x86/kernel/dumpstack.c  |  7 +++
 7 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 0ab316c46806..3629bcbf85a2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -942,7 +942,7 @@ ENTRY(debug)
 
/* Are we currently on the SYSENTER stack? */
movlPER_CPU_VAR(cpu_entry_area), %ecx
-   addl$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+   addl$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
subl%eax, %ecx  /* ecx = (end of SYSENTER_stack) - esp */
cmpl$SIZEOF_SYSENTER_stack, %ecx
jb  .Ldebug_from_sysenter_stack
@@ -986,7 +986,7 @@ ENTRY(nmi)
 
/* Are we currently on the SYSENTER stack? */
movlPER_CPU_VAR(cpu_entry_area), %ecx
-   addl$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+   addl$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
subl%eax, %ecx  /* ecx = (end of SYSENTER_stack) - esp */
cmpl$SIZEOF_SYSENTER_stack, %ecx
jb  .Lnmi_from_sysenter_stack
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0cde243b7542..34e3110b0876 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -158,7 +158,7 @@ END(native_usergs_sysret64)
_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
 
 /* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + \
SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
 
 ENTRY(entry_SYSCALL_64_trampoline)
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 15cf010225c9..ceb04ab0a642 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -234,5 +234,10 @@ static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
 }
 
+static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
+{
+   return &get_cpu_entry_area((cpu))->tss.SYSENTER_stack;
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7743aedb82ea..54f3ee3bc8a0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -332,12 +332,16 @@ struct x86_hw_tss {
#define IO_BITMAP_OFFSET   (offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET   0x8000
 
+struct SYSENTER_stack {
+   unsigned long   words[64];
+};
+
 struct tss_struct {
/*
 * Space for the temporary SYSENTER stack, used for SYSENTER
 * and the entry trampoline as well.
 */
-   unsigned long   SYSENTER_stack[64];
+   struct SYSENTER_stack   SYSENTER_stack;
 
/*
 * The fixed hardware portion.  This must not cross a page boundary
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 61b1af88ac07..46c0995344aa 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -94,10 +94,8 @@ void common(void) {
BLANK();
DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
 
-   /* Offset from cpu_tss to SYSENTER_stack */
-   OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-   /* Size of SYSENTER_stack */
-   DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+   OFFSET(TSS_STRUCT_SYSENTER_stack, tss_struct, SYSENTER_stack);
+   DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
 
/* Layout info for cpu_entry_area */
OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6b949e6ea0f9..f9c7e6852874 100644
--- a/arch

[PATCH 13/43] x86/entry/64: Use a percpu trampoline stack for IDT entries

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

Historically, IDT entries from usermode have always gone directly
to the running task's kernel stack.  Rearrange it so that we enter on
a percpu trampoline stack and then manually switch to the task's stack.
This touches a couple of extra cachelines, but it gives us a chance
to run some code before we touch the kernel stack.

The asm isn't exactly beautiful, but I think that fully refactoring
it can wait.
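
A userspace model of the hand-off may make the flow clearer.  All names and
sizes below are invented for the sketch, and the real switch happens in asm,
not via memcpy():

#include <stdio.h>
#include <string.h>

struct iret_frame { unsigned long ip, cs, flags, sp, ss; };

static unsigned long trampoline_stack[64];   /* small, per-cpu */
static unsigned long task_stack[1024];       /* big, per-task  */

static void handler(struct iret_frame *f)
{
	printf("handler sees user ip=%#lx on the task stack\n", f->ip);
}

int main(void)
{
	/* 1. the hardware frame lands at the top of the trampoline stack */
	struct iret_frame *hw = (struct iret_frame *)&trampoline_stack[64] - 1;
	*hw = (struct iret_frame){ .ip = 0x400000, .cs = 0x33 };

	/* 2. entry code copies it to the task stack, then switches stacks */
	struct iret_frame *sw = (struct iret_frame *)&task_stack[1024] - 1;
	memcpy(sw, hw, sizeof(*sw));

	handler(sw);   /* 3. everything after this runs on the task stack */
	return 0;
}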

Signed-off-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/fa3958723a1a85baeaf309c735b775841205800e.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/entry_64.S| 67 ++--
 arch/x86/entry/entry_64_compat.S |  5 ++-
 arch/x86/include/asm/switch_to.h |  2 +-
 arch/x86/include/asm/traps.h |  1 -
 arch/x86/kernel/cpu/common.c |  6 ++--
 arch/x86/kernel/traps.c  | 18 +--
 6 files changed, 68 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f81d50d7ceac..7d47199f405f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -563,6 +563,13 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
.macro interrupt func
cld
+
+   testb   $3, CS-ORIG_RAX(%rsp)
+   jz  1f
+   SWAPGS
+   callswitch_to_thread_stack
+1:
+
ALLOC_PT_GPREGS_ON_STACK
SAVE_C_REGS
SAVE_EXTRA_REGS
@@ -572,12 +579,8 @@ END(irq_entries_start)
jz  1f
 
/*
-* IRQ from user mode.  Switch to kernel gsbase and inform context
-* tracking that we're in kernel mode.
-*/
-   SWAPGS
-
-   /*
+* IRQ from user mode.
+*
 * We need to tell lockdep that IRQs are off.  We can't do this until
 * we fix gsbase, and we should do it before enter_from_user_mode
 * (which can take locks).  Since TRACE_IRQS_OFF is idempotent,
@@ -831,6 +834,32 @@ apicinterrupt IRQ_WORK_VECTOR  
irq_work_interrupt  smp_irq_work_interrupt
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
 
+/*
+ * Switch to the thread stack.  This is called with the IRET frame and
+ * orig_ax on the stack.  (That is, RDI..R12 are not on the stack and
+ * space has not been allocated for them.)
+ */
+ENTRY(switch_to_thread_stack)
+   UNWIND_HINT_FUNC
+
+   pushq   %rdi
+   movq%rsp, %rdi
+   movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp
+   UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
+
+   pushq   7*8(%rdi)   /* regs->ss */
+   pushq   6*8(%rdi)   /* regs->rsp */
+   pushq   5*8(%rdi)   /* regs->eflags */
+   pushq   4*8(%rdi)   /* regs->cs */
+   pushq   3*8(%rdi)   /* regs->ip */
+   pushq   2*8(%rdi)   /* regs->orig_ax */
+   pushq   8(%rdi) /* return address */
+   UNWIND_HINT_FUNC
+
+   movq(%rdi), %rdi
+   ret
+END(switch_to_thread_stack)
+
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -848,11 +877,12 @@ ENTRY(\sym)
 
ALLOC_PT_GPREGS_ON_STACK
 
-   .if \paranoid
-   .if \paranoid == 1
+   .if \paranoid < 2
	testb   $3, CS(%rsp)/* If coming from userspace, switch stacks */
-   jnz 1f
+   jnz .Lfrom_usermode_switch_stack_\@
.endif
+
+   .if \paranoid
callparanoid_entry
.else
callerror_entry
@@ -894,20 +924,15 @@ ENTRY(\sym)
jmp error_exit
.endif
 
-   .if \paranoid == 1
+   .if \paranoid < 2
/*
-* Paranoid entry from userspace.  Switch stacks and treat it
+* Entry from userspace.  Switch stacks and treat it
 * as a normal entry.  This means that paranoid handlers
 * run in real process context if user_mode(regs).
 */
-1:
+.Lfrom_usermode_switch_stack_\@:
callerror_entry
 
-
-   movq%rsp, %rdi  /* pt_regs pointer */
-   callsync_regs
-   movq%rax, %rsp  /* switch stack */
-
movq%rsp, %rdi  /* pt_regs pointer */
 
.if \has_error_code
@@ -1170,6 +1195,14 @@ ENTRY(error_entry)
SWAPGS
 
 .Lerror_entry_from_usermode_after_swapgs:
+   /* Put us onto the real thread stack. */
+   popq%r12/* save return addr in %12 */
+   movq%rsp, %rdi  /* arg0 = pt_regs pointer */
+   callsync_regs
+   movq%rax, %rsp  /* switch stack */
+   ENCODE_FRAME_POINTER
+ 

[PATCH 14/43] x86/entry/64: Return to userspace from the trampoline stack

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

By itself, this is useless.  It gives us the ability to run some final
code before exit that cannot run on the kernel stack.  This could
include a CR3 switch a la KAISER or some kernel stack erasing, for
example.  (Or even weird things like *changing* which kernel stack
gets used as an ASLR-strengthening mechanism.)

The SYSRET32 path is not covered yet.  It could be in the future or
we could just ignore it and force the slow path if needed.

Signed-off-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/d350017000eed20922c3b2711a2d9229dc809256.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/entry_64.S | 55 +++
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7d47199f405f..426b8c669d6a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -330,8 +330,24 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
popq%rsi/* skip rcx */
popq%rdx
popq%rsi
+
+   /*
+* Now all regs are restored except RSP and RDI.
+* Save old stack pointer and switch to trampoline stack.
+*/
+   movq%rsp, %rdi
+   movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+   pushq   RSP-RDI(%rdi)   /* RSP */
+   pushq   (%rdi)  /* RDI */
+
+   /*
+* We are on the trampoline stack.  All regs except RDI are live.
+* We can do future final exit work right here.
+*/
+
popq%rdi
-   movqRSP-ORIG_RAX(%rsp), %rsp
+   popq%rsp
USERGS_SYSRET64
 END(entry_SYSCALL_64)
 
@@ -633,10 +649,41 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
ud2
 1:
 #endif
-   SWAPGS
POP_EXTRA_REGS
-   POP_C_REGS
-   addq$8, %rsp/* skip regs->orig_ax */
+   popq%r11
+   popq%r10
+   popq%r9
+   popq%r8
+   popq%rax
+   popq%rcx
+   popq%rdx
+   popq%rsi
+
+   /*
+* The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
+* Save old stack pointer and switch to trampoline stack.
+*/
+   movq%rsp, %rdi
+   movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+   /* Copy the IRET frame to the trampoline stack. */
+   pushq   6*8(%rdi)   /* SS */
+   pushq   5*8(%rdi)   /* RSP */
+   pushq   4*8(%rdi)   /* EFLAGS */
+   pushq   3*8(%rdi)   /* CS */
+   pushq   2*8(%rdi)   /* RIP */
+
+   /* Push user RDI on the trampoline stack. */
+   pushq   (%rdi)
+
+   /*
+* We are on the trampoline stack.  All regs except RDI are live.
+* We can do future final exit work right here.
+*/
+
+   /* Restore RDI. */
+   popq%rdi
+   SWAPGS
INTERRUPT_RETURN
 
 
-- 
2.14.1



[PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KAISER-like pagetable switching, this is problematic.  Without a
scratch register, switching CR3 is impossible, so %gs-based percpu
memory would need to be mapped in the user pagetables.  Doing that
without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

XXX: Whenever we settle how KAISER gets turned on and off, we should do
the same to this.
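
The arithmetic behind the trick can be shown standalone: as long as each
CPU's trampoline copy sits at a fixed offset inside that CPU's entry area,
the distance from the (per-CPU) RIP to the (per-CPU) data is the same
constant everywhere.  The layout constants below are invented for the sketch:

#include <stdio.h>

#define ENTRY_AREA_SIZE 0x5000UL
#define TRAMPOLINE_OFF  0x4000UL   /* trampoline copy within the area */
#define SCRATCH_OFF     0x0ff8UL   /* scratch slot within the area    */

int main(void)
{
	for (int cpu = 0; cpu < 3; cpu++) {
		unsigned long area = 0xffffff0000000000UL + cpu * ENTRY_AREA_SIZE;
		unsigned long rip  = area + TRAMPOLINE_OFF;  /* where we execute */
		unsigned long data = area + SCRATCH_OFF;     /* what we address  */

		/* identical on every CPU => usable as an assembly-time constant */
		printf("cpu%d: data - rip = %ld\n", cpu, (long)(data - rip));
	}
	return 0;
}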

Signed-off-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/entry_64.S | 48 +++
 arch/x86/include/asm/fixmap.h |  2 ++
 arch/x86/kernel/asm-offsets.c |  1 +
 arch/x86/kernel/cpu/common.c  | 12 ++-
 arch/x86/kernel/vmlinux.lds.S | 10 +
 5 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 426b8c669d6a..0cde243b7542 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -140,6 +140,54 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+   .pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * .Lentry_trampoline(%rip) refers to the start of the (remapped) entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+   _entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+   SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+   UNWIND_HINT_EMPTY
+   swapgs
+
+   /* Stash the user RSP. */
+   movq%rsp, RSP_SCRATCH
+
+   /* Load the top of the task stack into RSP */
+   movqCPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+   /* Start building the simulated IRET frame. */
+   pushq   $__USER_DS  /* pt_regs->ss */
+   pushq   RSP_SCRATCH /* pt_regs->sp */
+   pushq   %r11/* pt_regs->flags */
+   pushq   $__USER_CS  /* pt_regs->cs */
+   pushq   %rcx/* pt_regs->ip */
+
+   /*
+* x86 lacks a near absolute jump, and we can't jump to the real
+* entry text with a relative jump, so we fake it using retq.
+*/
+   pushq   $entry_SYSCALL_64_after_hwframe
+   retq
+END(entry_SYSCALL_64_trampoline)
+
+   .popsection
+
 ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 3a42da14c2cb..7eb1b5490395 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -58,6 +58,8 @@ struct cpu_entry_area {
 * of the TSS region.
 */
struct tss_struct tss;
+
+   char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 55858b277cf6..61b1af88ac07 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
/

[PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

Now that the SYSENTER stack has a guard page, there's no need for a
canary to detect overflow after the fact.

Signed-off-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/be3179c0a38c392fa44ebeb7dd89391ff5c010c3.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/processor.h | 1 -
 arch/x86/kernel/dumpstack.c  | 3 +--
 arch/x86/kernel/process.c| 1 -
 arch/x86/kernel/traps.c  | 7 ---
 4 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3a09e5571a92..7743aedb82ea 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -337,7 +337,6 @@ struct tss_struct {
 * Space for the temporary SYSENTER stack, used for SYSENTER
 * and the entry trampoline as well.
 */
-   unsigned long   SYSENTER_stack_canary;
unsigned long   SYSENTER_stack[64];
 
/*
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index bb61919c9335..9ce5fcf7d14d 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -48,8 +48,7 @@ bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
int cpu = smp_processor_id();
struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
-   /* Treat the canary as part of the stack for unwinding purposes. */
-   void *begin = &tss->SYSENTER_stack_canary;
+   void *begin = &tss->SYSENTER_stack;
void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
 
if ((void *)stack < begin || (void *)stack >= end)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 86e83762e3b3..6a04287f222b 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -81,7 +81,6 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
  */
.io_bitmap  = { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-   .SYSENTER_stack_canary  = STACK_END_MAGIC,
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cbc4272bb9dd..19475dbff068 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -801,13 +801,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
debug_stack_usage_dec();
 
 exit:
-   /*
-* This is the most likely code path that involves non-trivial use
-* of the SYSENTER stack.  Check that we haven't overrun it.
-*/
-   WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
-"Overran or corrupted SYSENTER stack\n");
-
ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.14.1



[PATCH 17/43] x86/irq/64: Print the offending IP in the stack overflow warning

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

In case something goes wrong with the unwind (not unlikely in case of
overflow), print the offending IP at which we detected the overflow.

Signed-off-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/6fcf700cc5ee884fb739b67d1246ab4185c41409.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/irq_64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 020efbf5786b..d86e344f5b3d 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -57,10 +57,10 @@ static inline void stack_overflow_check(struct pt_regs *regs)
if (regs->sp >= estack_top && regs->sp <= estack_bottom)
return;
 
-   WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx)\n",
+   WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx,ip:%pF)\n",
current->comm, curbase, regs->sp,
irq_stack_top, irq_stack_bottom,
-   estack_top, estack_bottom);
+   estack_top, estack_bottom, (void *)regs->ip);
 
if (sysctl_panic_on_stackoverflow)
panic("low stack detected by irq handler - check messages\n");
-- 
2.14.1



[PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

That race has been fixed and code cleaned up for a while now.

Signed-off-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/12e75976dbbb7ece2b0a64238f1d3892dfed1e16.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/irq.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 49cfd9fe7589..68e1867cca80 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -219,18 +219,6 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
/* high bit used in ret_from_ code  */
unsigned vector = ~regs->orig_ax;
 
-   /*
-* NB: Unlike exception entries, IRQ entries do not reliably
-* handle context tracking in the low-level entry code.  This is
-* because syscall entries execute briefly with IRQs on before
-* updating context tracking state, so we can take an IRQ from
-* kernel mode with CONTEXT_USER.  The low-level entry code only
-* updates the context if we came from user mode, so we won't
-* switch to CONTEXT_KERNEL.  We'll fix that once the syscall
-* code is cleaned up enough that we can cleanly defer enabling
-* IRQs.
-*/
-
entering_irq();
 
/* entering_irq() tells RCU that we're not quiescent.  Check it. */
-- 
2.14.1



[PATCH 11/43] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

On 64-bit kernels, we used to assume that TSS.sp0 was the current
top of stack.  With the addition of an entry trampoline, this will
no longer be the case.  Store the current top of stack in TSS.sp1,
which is otherwise unused but shares the same cacheline.
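
A minimal model of the resulting convention (the field layout mimics
x86_hw_tss, but this is a sketch, not kernel code):

#include <stdio.h>

struct hw_tss_model {
	unsigned long sp0;  /* points at the trampoline stack after this series */
	unsigned long sp1;  /* repurposed: the current task's real top of stack */
};

static struct hw_tss_model cpu_tss_model;

/* what __switch_to() starts doing in this patch */
static void context_switch_to(unsigned long task_stack_top)
{
	cpu_tss_model.sp1 = task_stack_top;
}

int main(void)
{
	context_switch_to(0xffffc90000004000UL);
	printf("current_top_of_stack() -> %#lx\n", cpu_tss_model.sp1);
	return 0;
}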

Signed-off-by: Andy Lutomirski 
Reviewed-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/f56634c746a2926eb7bae61e7b80ed51a1940769.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/processor.h   | 18 +-
 arch/x86/include/asm/thread_info.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c   |  1 +
 arch/x86/kernel/process.c  | 10 ++
 arch/x86/kernel/process_64.c   |  1 +
 5 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 48d44fae3d27..3a09e5571a92 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -305,7 +305,13 @@ struct x86_hw_tss {
 struct x86_hw_tss {
u32 reserved1;
u64 sp0;
+
+   /*
+* We store cpu_current_top_of_stack in sp1 so it's always accessible.
+* Linux does not use ring 1, so sp1 is not otherwise needed.
+*/
u64 sp1;
+
u64 sp2;
u64 reserved2;
u64 ist[7];
@@ -364,6 +370,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 #ifdef CONFIG_X86_32
 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
+#else
+#define cpu_current_top_of_stack cpu_tss.x86_tss.sp1
 #endif
 
 /*
@@ -535,12 +543,12 @@ static inline void native_swapgs(void)
 
 static inline unsigned long current_top_of_stack(void)
 {
-#ifdef CONFIG_X86_64
-   return this_cpu_read_stable(cpu_tss.x86_tss.sp0);
-#else
-   /* sp0 on x86_32 is special in and around vm86 mode. */
+   /*
+*  We can't read directly from tss.sp0: sp0 on x86_32 is special in
+*  and around vm86 mode and sp0 on x86_64 is special because of the
+*  entry trampoline.
+*/
return this_cpu_read_stable(cpu_current_top_of_stack);
-#endif
 }
 
 static inline bool on_thread_stack(void)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 70f425947dc5..44a04999791e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -207,7 +207,7 @@ static inline int arch_within_stack_frames(const void * const stack,
 #else /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
-# define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
+# define cpu_current_top_of_stack (cpu_tss + TSS_sp1)
 #endif
 
 #endif
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 630212fa9b9d..ad649a8a74a0 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -63,6 +63,7 @@ int main(void)
 
OFFSET(TSS_ist, tss_struct, x86_tss.ist);
OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+   OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
BLANK();
 
 #ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 35d674157fda..86e83762e3b3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -56,6 +56,16 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 * Poison it.
 */
.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
+
+#ifdef CONFIG_X86_64
+   /*
+* .sp1 is cpu_current_top_of_stack.  The init task never
+* runs user code, but cpu_current_top_of_stack should still
+* be well defined before the first context switch.
+*/
+   .sp1 = TOP_OF_INIT_STACK,
+#endif
+
 #ifdef CONFIG_X86_32
.ss0 = __KERNEL_DS,
.ss1 = __KERNEL_CS,
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index eeeb34f85c25..bafe65b08697 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -462,6 +462,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 * Switch the PDA and FPU contexts.
 */
this_cpu_write(current_task, next_p);
+   this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
/* Reload sp0. */
update_sp0(next_p);
-- 
2.14.1



[PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

The IST stacks are needed when an IST exception occurs and are
accessed before any kernel code at all runs.  Move them into
cpu_entry_area.
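
The carve-up of that area into individual IST stacks, which cpu_init()
performs in the diff below, can be modelled standalone (GNU C range
initializers, as the kernel itself uses; the sizes mirror the 4k/8k split,
the other names are invented):

#include <stdio.h>

#define N_EXCEPTION_STACKS 4
#define EXCEPTION_STKSZ    4096
#define DEBUG_STKSZ        8192
#define DEBUG_STACK        2   /* which vector gets the big stack (model) */

static const unsigned int sizes[N_EXCEPTION_STACKS] = {
	[0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STKSZ,
	[DEBUG_STACK - 1]              = DEBUG_STKSZ,
};

static char stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];

int main(void)
{
	char *estack = stacks;

	for (int v = 0; v < N_EXCEPTION_STACKS; v++) {
		estack += sizes[v];  /* x86 stacks grow down: ist points at the end */
		printf("tss.ist[%d] -> stacks + %ld\n", v, (long)(estack - stacks));
	}
	return 0;
}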

Signed-off-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/0ffddccdc0ce1953f950a553142662cf68258fb7.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/fixmap.h | 10 ++
 arch/x86/kernel/cpu/common.c  | 40 +---
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 7eb1b5490395..15cf010225c9 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -60,6 +60,16 @@ struct cpu_entry_area {
struct tss_struct tss;
 
char entry_trampoline[PAGE_SIZE];
+
+#ifdef CONFIG_X86_64
+   /*
+* Exception stacks used for IST entries.
+*
+* In the future, this should have a separate slot for each stack
+* with guard pages between them.
+*/
+   char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
+#endif
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5a05db084659..6b949e6ea0f9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -503,6 +503,22 @@ static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgpr
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Special IST stacks which the CPU switches to when it calls
+ * an IST-marked descriptor entry. Up to 7 stacks (hardware
+ * limit), all of them are 4K, except the debug stack which
+ * is 8K.
+ */
+static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
+ [0 ... N_EXCEPTION_STACKS - 1]= EXCEPTION_STKSZ,
+ [DEBUG_STACK - 1] = DEBUG_STKSZ
+};
+
+static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+   [(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -557,6 +573,14 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 #ifdef CONFIG_X86_64
+   BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
+   BUILD_BUG_ON(sizeof(exception_stacks) !=
+sizeof(((struct cpu_entry_area *)0)->exception_stacks));
+   set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
+   &per_cpu(exception_stacks, cpu),
+   sizeof(exception_stacks) / PAGE_SIZE,
+   PAGE_KERNEL);
+
__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
@@ -1407,20 +1431,6 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
-/*
- * Special IST stacks which the CPU switches to when it calls
- * an IST-marked descriptor entry. Up to 7 stacks (hardware
- * limit), all of them are 4K, except the debug stack which
- * is 8K.
- */
-static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
- [0 ... N_EXCEPTION_STACKS - 1]= EXCEPTION_STKSZ,
- [DEBUG_STACK - 1] = DEBUG_STKSZ
-};
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-   [(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
@@ -1626,7 +1636,7 @@ void cpu_init(void)
 * set up and load the per-CPU TSS
 */
if (!oist->ist[0]) {
-   char *estacks = per_cpu(exception_stacks, cpu);
+   char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
 
for (v = 0; v < N_EXCEPTION_STACKS; v++) {
estacks += exception_stack_sizes[v];
-- 
2.14.1



[PATCH 05/43] x86/fixmap: Generalize the GDT fixmap mechanism

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

Currently, the GDT is an ad-hoc array of pages, one per CPU, in the
fixmap.  Generalize it to be an array of a new struct cpu_entry_area
so that we can cleanly add new things to it.
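
Since fixmap slots are allocated downward from the top of the address space,
the index math introduced here runs backwards too.  A standalone sketch of
that math with invented constants:

#include <stdio.h>

#define CPU_ENTRY_AREA_PAGES 2
#define FIX_BOTTOM           1000   /* stand-in for FIX_CPU_ENTRY_AREA_BOTTOM */

/* fixmap indices count down, so higher index means lower virtual address */
static int page_index(int cpu, int page)
{
	return FIX_BOTTOM - cpu * CPU_ENTRY_AREA_PAGES - page;
}

int main(void)
{
	for (int cpu = 0; cpu < 2; cpu++)
		for (int page = 0; page < CPU_ENTRY_AREA_PAGES; page++)
			printf("cpu %d, page %d -> fixmap index %d\n",
			       cpu, page, page_index(cpu, page));
	return 0;
}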

Signed-off-by: Andy Lutomirski 
Reviewed-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/22571d77ba1f3c714df9fa37db9a58218bc17597.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/desc.h   |  9 +
 arch/x86/include/asm/fixmap.h | 34 --
 arch/x86/kernel/cpu/common.c  | 14 +++---
 arch/x86/xen/mmu_pv.c |  2 +-
 4 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 95cd95eb7285..194ffab00ebe 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -60,17 +60,10 @@ static inline struct desc_struct *get_current_gdt_rw(void)
return this_cpu_ptr(&gdt_page)->gdt;
 }
 
-/* Get the fixmap index for a specific processor */
-static inline unsigned int get_cpu_gdt_ro_index(int cpu)
-{
-   return FIX_GDT_REMAP_END - cpu;
-}
-
 /* Provide the fixmap address of the remapped GDT */
 static inline struct desc_struct *get_cpu_gdt_ro(int cpu)
 {
-   unsigned int idx = get_cpu_gdt_ro_index(cpu);
-   return (struct desc_struct *)__fix_to_virt(idx);
+   return (struct desc_struct *)&get_cpu_entry_area(cpu)->gdt;
 }
 
 /* Provide the current read-only GDT */
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..0f4c92f02968 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -44,6 +44,16 @@ extern unsigned long __FIXADDR_TOP;
 PAGE_SIZE)
 #endif
 
+/*
+ * cpu_entry_area is a percpu region in the fixmap that contains things
+ * needed by the CPU and early entry/exit code.  Real types aren't used
+ * for all fields here to avoid circular header dependencies.
+ */
+struct cpu_entry_area {
+   char gdt[PAGE_SIZE];
+};
+
+#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -101,8 +111,8 @@ enum fixed_addresses {
FIX_LNW_VRTC,
 #endif
/* Fixmap entries to remap the GDTs, one per processor. */
-   FIX_GDT_REMAP_BEGIN,
-   FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
+   FIX_CPU_ENTRY_AREA_TOP,
+   FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
 
__end_of_permanent_fixed_addresses,
 
@@ -185,5 +195,25 @@ void __init *early_memremap_decrypted_wp(resource_size_t phys_addr,
 void __early_set_fixmap(enum fixed_addresses idx,
phys_addr_t phys, pgprot_t flags);
 
+static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
+{
+   BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
+
+   return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
+}
+
+#define __get_cpu_entry_area_offset_index(cpu, offset) ({  \
+   BUILD_BUG_ON(offset % PAGE_SIZE != 0);  \
+   __get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);   \
+   })
+
+#define get_cpu_entry_area_index(cpu, field)   \
+   __get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
+
+static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
+{
+   return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ccb5f66c4e5b..c0fb3eb37ee0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,12 +490,12 @@ void load_percpu_segment(int cpu)
load_stack_canary_segment();
 }
 
-/* Setup the fixmap mapping only once per-processor */
-static inline void setup_fixmap_gdt(int cpu)
+/* Setup the fixmap mappings only once per-processor */
+static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
/* On 64-bit systems, we use a read-only fixmap GDT. */
-   pgprot_t prot = PAGE_KERNEL_RO;
+   pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
/*
 * On native 32-bit systems, the GDT cannot be read-only because
@@ -506,11 +506,11 @@ static inline void setup_fixmap_gdt(int cpu)
 * On Xen PV, the GDT must be read-only because the hypervisor requires
 * it.
 */
-   pgprot_t prot = boot_cpu_has(X86_FEATURE_XENPV) ?
+   pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
PAGE_KERNEL_RO : PAGE_KERNEL;
 #endif
 
-   __set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_p

[PATCH 07/43] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

A future patch will move SYSENTER_stack to the beginning of cpu_tss
to help detect overflow.  Before this can happen, fix several code
paths that hardcode assumptions about the old layout.

Signed-off-by: Andy Lutomirski 
Reviewed-by: Borislav Petkov 
Reviewed-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/d40a2c5ae4539d64090849a374f3169ec492f4e2.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/desc.h  |  2 +-
 arch/x86/include/asm/processor.h |  4 ++--
 arch/x86/kernel/cpu/common.c |  8 
 arch/x86/kernel/doublefault.c| 36 +---
 arch/x86/power/cpu.c | 13 +++--
 5 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 194ffab00ebe..aab4fe9f49f8 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -178,7 +178,7 @@ static inline void set_tssldt_descriptor(void *d, unsigned long addr,
 #endif
 }
 
-static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
+static inline void __set_tss_desc(unsigned cpu, unsigned int entry, struct x86_hw_tss *addr)
 {
struct desc_struct *d = get_cpu_gdt_rw(cpu);
tss_desc tss;
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 504a3bb4d5f0..c24456429c7d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -163,7 +163,7 @@ enum cpuid_regs_idx {
 extern struct cpuinfo_x86  boot_cpu_data;
 extern struct cpuinfo_x86  new_cpu_data;
 
-extern struct tss_struct   doublefault_tss;
+extern struct x86_hw_tss   doublefault_tss;
 extern __u32   cpu_caps_cleared[NCAPINTS];
 extern __u32   cpu_caps_set[NCAPINTS];
 
@@ -323,7 +323,7 @@ struct x86_hw_tss {
 #define IO_BITMAP_BITS 65536
 #define IO_BITMAP_BYTES(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS(IO_BITMAP_BYTES/sizeof(long))
-#define IO_BITMAP_OFFSET   offsetof(struct tss_struct, io_bitmap)
+#define IO_BITMAP_OFFSET   (offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET   0x8000
 
 struct tss_struct {
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c0fb3eb37ee0..62cdc10a7d94 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1582,7 +1582,7 @@ void cpu_init(void)
}
}
 
-   t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+   t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
/*
 * <= is required because the CPU will access up to
@@ -1601,7 +1601,7 @@ void cpu_init(void)
 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 * task never enters user mode.
 */
-   set_tss_desc(cpu, t);
+   set_tss_desc(cpu, &t->x86_tss);
load_TR_desc();
 
load_mm_ldt(&init_mm);
@@ -1659,12 +1659,12 @@ void cpu_init(void)
 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 * task never enters user mode.
 */
-   set_tss_desc(cpu, t);
+   set_tss_desc(cpu, &t->x86_tss);
load_TR_desc();
 
load_mm_ldt(&init_mm);
 
-   t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+   t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 #ifdef CONFIG_DOUBLEFAULT
/* Set up doublefault TSS pointer in the GDT */
diff --git a/arch/x86/kernel/doublefault.c b/arch/x86/kernel/doublefault.c
index 0e662c55ae90..0b8cedb20d6d 100644
--- a/arch/x86/kernel/doublefault.c
+++ b/arch/x86/kernel/doublefault.c
@@ -50,25 +50,23 @@ static void doublefault_fn(void)
cpu_relax();
 }
 
-struct tss_struct doublefault_tss __cacheline_aligned = {
-   .x86_tss = {
-   .sp0= STACK_START,
-   .ss0= __KERNEL_DS,
-   .ldt= 0,
-   .io_bitmap_base = INVALID_IO_BITMAP_OFFSET,
-
-   .ip = (unsigned long) doublefault_fn,
-   /* 0x2 bit is always set */
-   .flags  = X86_EFLAGS_SF | 0x2,
-   .sp = STACK_START,
-   .es = __USER_DS,
-   .cs = __KERNEL_CS,
-   .ss = __KERNEL_DS,
-   .ds = __USER_DS,
-   .fs = __KERNEL_PERCPU,
-
-   .__cr3  = __pa_nodebug(swapper_pg_dir),
-   }
+struct x86_hw_tss doublefault_tss __cacheline_aligned = {
+   .sp0= STACK_START,
+   .ss0= __KERN

[PATCH 06/43] x86/kasan/64: Teach KASAN about the cpu_entry_area

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

The cpu_entry_area will contain stacks.  Make sure that KASAN has
appropriate shadow mappings for them.

Signed-off-by: Andy Lutomirski 
Cc: Alexander Potapenko 
Cc: Andrey Ryabinin 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: Dmitry Vyukov 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: kasan-...@googlegroups.com
Link: http://lkml.kernel.org/r/8407adf9126440d6467dade88fdb3e3b75fc1019.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/mm/kasan_init_64.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 99dfed6dfef8..54561dce742e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -277,6 +277,7 @@ void __init kasan_early_init(void)
 void __init kasan_init(void)
 {
int i;
+   void *cpu_entry_area_begin, *cpu_entry_area_end;
 
 #ifdef CONFIG_KASAN_INLINE
register_die_notifier(&kasan_die_notifier);
@@ -329,8 +330,18 @@ void __init kasan_init(void)
  (unsigned long)kasan_mem_to_shadow(_end),
  early_pfn_to_nid(__pa(_stext)));
 
+   cpu_entry_area_begin = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM));
+   cpu_entry_area_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
+
	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
-   (void *)KASAN_SHADOW_END);
+  kasan_mem_to_shadow(cpu_entry_area_begin));
+
+   kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(cpu_entry_area_begin),
+ (unsigned long)kasan_mem_to_shadow(cpu_entry_area_end),
+ 0);
+
+   kasan_populate_zero_shadow(kasan_mem_to_shadow(cpu_entry_area_end),
+  (void *)KASAN_SHADOW_END);
 
load_cr3(init_top_pgt);
__flush_tlb_all();
-- 
2.14.1



[PATCH 08/43] x86/dumpstack: Handle stack overflow on all stacks

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

We currently special-case stack overflow on the task stack.  We're
going to start putting special stacks in the fixmap with a custom
layout, so they'll have guard pages, too.  Teach the unwinder to be
able to unwind an overflow of any of the stacks.
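
The recovery step is just page-rounding the bad stack pointer and retrying,
as a quick standalone sketch shows (the addresses are invented):

#include <stdio.h>

#define PAGE_SIZE     4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	/* a stack pointer that landed inside a guard page */
	unsigned long bad_sp = 0xffffc90000003f80UL;

	/* round up to the next page and let get_stack_info() try again */
	printf("retry unwind at %#lx\n", PAGE_ALIGN(bad_sp));
	return 0;
}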

Signed-off-by: Andy Lutomirski 
Reviewed-by: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/5454bb325cb30a70457a47b50f22317be65eba7d.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/dumpstack.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 5e7d10e8ca25..a8aa70c05489 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -90,24 +90,28 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 * - task stack
 * - interrupt stack
 * - HW exception stacks (double fault, nmi, debug, mce)
+* - SYSENTER stack
 *
-* x86-32 can have up to three stacks:
+* x86-32 can have up to four stacks:
 * - task stack
 * - softirq stack
 * - hardirq stack
+* - SYSENTER stack
 */
	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
const char *stack_name;
 
-   /*
-* If we overflowed the task stack into a guard page, jump back
-* to the bottom of the usable stack.
-*/
-   if (task_stack_page(task) - (void *)stack < PAGE_SIZE)
-   stack = task_stack_page(task);
-
-   if (get_stack_info(stack, task, &stack_info, &visit_mask))
-   break;
+   if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
+   /*
+* We weren't on a valid stack.  It's possible that
+* we overflowed a valid stack into a guard page.
+* See if the next page up is valid so that we can
+* generate some kind of backtrace if this happens.
+*/
+   stack = (unsigned long *)PAGE_ALIGN((unsigned long)stack);
+   if (get_stack_info(stack, task, &stack_info, &visit_mask))
+   break;
+   }
 
stack_name = stack_type_name(stack_info.type);
if (stack_name)
-- 
2.14.1



[PATCH 09/43] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

SYSENTER_stack should have reliable overflow detection, which
means that it needs to be at the bottom of a page, not the top.
Move it to the beginning of struct tss_struct and page-align it.

Also add an assertion to make sure that the fixed hardware TSS
doesn't cross a page boundary.

Signed-off-by: Andy Lutomirski 
Reviewed-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/8de9901e7c3a6aa8fac95b37b9c7b96f1900f11a.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/processor.h | 21 -
 arch/x86/kernel/cpu/common.c | 21 +
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c24456429c7d..48d44fae3d27 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -328,7 +328,16 @@ struct x86_hw_tss {
 
 struct tss_struct {
/*
-* The hardware state:
+* Space for the temporary SYSENTER stack, used for SYSENTER
+* and the entry trampoline as well.
+*/
+   unsigned long   SYSENTER_stack_canary;
+   unsigned long   SYSENTER_stack[64];
+
+   /*
+* The fixed hardware portion.  This must not cross a page boundary
+* at risk of violating the SDM's advice and potentially triggering
+* errata.
 */
struct x86_hw_tss   x86_tss;
 
@@ -339,15 +348,9 @@ struct tss_struct {
 * be within the limit.
 */
unsigned long   io_bitmap[IO_BITMAP_LONGS + 1];
+} __aligned(PAGE_SIZE);
 
-   /*
-* Space for the temporary SYSENTER stack.
-*/
-   unsigned long   SYSENTER_stack_canary;
-   unsigned long   SYSENTER_stack[64];
-} cacheline_aligned;
-
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 62cdc10a7d94..d173f6013467 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -511,6 +511,27 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+
+   /*
+* The Intel SDM says (Volume 3, 7.2.1):
+*
+*  Avoid placing a page boundary in the part of the TSS that the
+*  processor reads during a task switch (the first 104 bytes). The
+*  processor may not correctly perform address translations if a
+*  boundary occurs in this area. During a task switch, the processor
+*  reads and writes into the first 104 bytes of each TSS (using
+*  contiguous physical addresses beginning with the physical address
+*  of the first byte of the TSS). So, after TSS access begins, if
+*  part of the 104 bytes is not physically contiguous, the processor
+*  will access incorrect information without generating a page-fault
+*  exception.
+*
+* There are also a lot of errata involving the TSS spanning a page
+* boundary.  Assert that we're not doing that.
+*/
+   BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
+ offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+
 }
 
 /* Load the original GDT from the per-cpu structure */
-- 
2.14.1



[PATCH 10/43] x86/entry: Remap the TSS into the cpu entry area

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

This has a secondary purpose: it puts the entry stack into a region
with a well-controlled layout.  A subsequent patch will take
advantage of this to streamline the SYSCALL entry code to be able to
find it more easily.

Signed-off-by: Andy Lutomirski 
Reviewed-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/cdcba7e1e82122461b3ca36bb3ef6713ba605e35.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/entry_32.S |  6 --
 arch/x86/include/asm/fixmap.h |  7 +++
 arch/x86/kernel/asm-offsets.c |  3 +++
 arch/x86/kernel/cpu/common.c  | 38 --
 arch/x86/kernel/dumpstack.c   |  3 ++-
 arch/x86/power/cpu.c  | 11 ++-
 6 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 4838037f97f6..0ab316c46806 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -941,7 +941,8 @@ ENTRY(debug)
movl%esp, %eax  # pt_regs pointer
 
/* Are we currently on the SYSENTER stack? */
-   PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+   movlPER_CPU_VAR(cpu_entry_area), %ecx
+   addl$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
subl%eax, %ecx  /* ecx = (end of SYSENTER_stack) - esp */
cmpl$SIZEOF_SYSENTER_stack, %ecx
jb  .Ldebug_from_sysenter_stack
@@ -984,7 +985,8 @@ ENTRY(nmi)
movl%esp, %eax  # pt_regs pointer
 
/* Are we currently on the SYSENTER stack? */
-   PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+   movlPER_CPU_VAR(cpu_entry_area), %ecx
+   addl$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
subl%eax, %ecx  /* ecx = (end of SYSENTER_stack) - esp */
cmpl$SIZEOF_SYSENTER_stack, %ecx
jb  .Lnmi_from_sysenter_stack
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 0f4c92f02968..3a42da14c2cb 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -51,6 +51,13 @@ extern unsigned long __FIXADDR_TOP;
  */
 struct cpu_entry_area {
char gdt[PAGE_SIZE];
+
+   /*
+* The GDT is just below cpu_tss and thus serves (on x86_64) as
+* a read-only guard page for the SYSENTER stack at the bottom
+* of the TSS region.
+*/
+   struct tss_struct tss;
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index b275863128eb..55858b277cf6 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -98,4 +98,7 @@ void common(void) {
OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
/* Size of SYSENTER_stack */
	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+
+   /* Layout info for cpu_entry_area */
+   OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d173f6013467..c67742df569a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,6 +490,19 @@ void load_percpu_segment(int cpu)
load_stack_canary_segment();
 }
 
+static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgprot_t prot)
+{
+   int i;
+
+   for (i = 0; i < pages; i++)
+   __set_fixmap(fixmap_index - i, per_cpu_ptr_to_phys(ptr + i*PAGE_SIZE), prot);
+}
+
+#ifdef CONFIG_X86_32
+/* The 32-bit entry code needs to find cpu_entry_area. */
+DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -531,7 +544,15 @@ static inline void setup_cpu_entry_area(int cpu)
 */
BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
  offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+   BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
+   set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
+   &per_cpu(cpu_tss, cpu),
+   sizeof(struct tss_struct) / PAGE_SIZE,
+   PAGE_KERNEL);
 
+#ifdef CONFIG_X86_32
+   this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1282,7 +1303,8 @@ void enable_sep_cpu(void)
wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
 
wrmsr(MSR_IA32_SYSENTER_ESP,
- (unsigned long)tss +

[PATCH 02/43] x86/entry/64: Allocate and enable the SYSENTER stack

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

This will simplify future changes that want scratch variables early in
the SYSENTER handler -- they'll be able to spill registers to the
stack.  It also lets us get rid of a SWAPGS_UNSAFE_STACK user.

This does not depend on CONFIG_IA32_EMULATION because we'll want the
stack space even without IA32 emulation.

As far as I can tell, the reason that this wasn't done from day 1 is
that we use IST for #DB and #BP, which is IMO rather nasty and causes
a lot more problems than it solves.  But, since #DB uses IST, we don't
actually need a real stack for SYSENTER (because SYSENTER with TF set
will invoke #DB on the IST stack rather than the SYSENTER stack).
I want to remove IST usage from these vectors some day, and this patch
is a prerequisite for that as well.

Signed-off-by: Andy Lutomirski 
Reviewed-by: Thomas Gleixner 
Reviewed-by: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/c37d6e68a73e1b5b1203e0e95b488fa8092b3cfb.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/entry_64_compat.S | 2 +-
 arch/x86/include/asm/processor.h | 3 ---
 arch/x86/kernel/asm-offsets.c| 5 +
 arch/x86/kernel/asm-offsets_32.c | 5 -
 arch/x86/kernel/cpu/common.c | 4 +++-
 arch/x86/kernel/process.c| 2 --
 arch/x86/kernel/traps.c  | 3 +--
 7 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 568e130d932c..dcc6987f9bae 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -48,7 +48,7 @@
  */
 ENTRY(entry_SYSENTER_compat)
/* Interrupts are off on entry. */
-   SWAPGS_UNSAFE_STACK
+   SWAPGS
movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
/*
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index cc16fa882e3e..504a3bb4d5f0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -340,14 +340,11 @@ struct tss_struct {
 */
unsigned long   io_bitmap[IO_BITMAP_LONGS + 1];
 
-#ifdef CONFIG_X86_32
/*
 * Space for the temporary SYSENTER stack.
 */
unsigned long   SYSENTER_stack_canary;
unsigned long   SYSENTER_stack[64];
-#endif
-
 } cacheline_aligned;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 8ea78275480d..b275863128eb 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -93,4 +93,9 @@ void common(void) {
 
BLANK();
DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+
+   /* Offset from cpu_tss to SYSENTER_stack */
+   OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
+   /* Size of SYSENTER_stack */
+   DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
 }
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index dedf428b20b6..52ce4ea16e53 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -50,11 +50,6 @@ void foo(void)
DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
   offsetofend(struct tss_struct, SYSENTER_stack));
 
-   /* Offset from cpu_tss to SYSENTER_stack */
-   OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-   /* Size of SYSENTER_stack */
-   DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
-
 #ifdef CONFIG_CC_STACKPROTECTOR
BLANK();
OFFSET(stack_canary_offset, stack_canary, canary);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa998ca8aa5a..ccb5f66c4e5b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1386,7 +1386,9 @@ void syscall_init(void)
 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 */
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-   wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+   wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+   (unsigned long)this_cpu_ptr(&cpu_tss) +
+   offsetofend(struct tss_struct, SYSENTER_stack));
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 97fb3e5737f5..35d674157fda 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -71,9 +71,7 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
  */
.io_bitmap  = { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-#ifdef CONFIG_X86_32
	.SYSENTER_stack_canary  = STACK_END_MAGIC,

[PATCH 01/43] x86/decoder: Add new TEST instruction pattern

2017-11-24 Thread Ingo Molnar
From: Masami Hiramatsu 

The kbuild test robot reported this build warning:

  Warning: arch/x86/tools/test_get_len found difference at :8103dd2c

  Warning: 8103dd82: f6 09 d8 testb $0xd8,(%rcx)
  Warning: objdump says 3 bytes, but insn_get_length() says 2
  Warning: decoded and checked 1569014 instructions with 1 warnings

This sequence seems to be a new instruction not in the opcode map in the Intel 
SDM.

The instruction sequence is "F6 09 d8", which means Group3 (F6), a
ModR/M byte with MOD(00) REG(001) RM(001), and the immediate 0xd8.
Intel SDM vol. 2, A.4, Table A-6 says the table index in the group is
"Encoding of Bits 5,4,3 of the ModR/M Byte (bits 2,1,0 in parenthesis)".

In that table, the opcodes are listed by the REG-bits index as:

  000          001          010   011   100          101          110          111
  TEST Ib/Iz   (undefined)  NOT   NEG   MUL AL/rAX   IMUL AL/rAX  DIV AL/rAX   IDIV AL/rAX

So, it seems TEST Ib is assigned to 001.
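
For reference, the group index is simply the REG field of the ModR/M
byte. A minimal C sketch (an illustration added here, not part of the
patch) decoding the reported ModR/M byte 0x09:

#include <stdio.h>

/* Decode the ModR/M byte (0x09) of the reported sequence "F6 09 d8". */
int main(void)
{
	unsigned char modrm = 0x09;

	unsigned int mod = (modrm >> 6) & 0x3;	/* 00  -> memory operand */
	unsigned int reg = (modrm >> 3) & 0x7;	/* 001 -> group index 1: TEST Ib */
	unsigned int rm  = modrm & 0x7;		/* 001 -> (%rcx) */

	printf("mod=%u reg=%u rm=%u\n", mod, reg, rm);	/* mod=0 reg=1 rm=1 */
	return 0;
}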

Add the new pattern.

Greg Kroah-Hartman 
Reported-by: kbuild test robot 
Signed-off-by: Masami Hiramatsu 
Cc: 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/lib/x86-opcode-map.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 12e377184ee4..c4d55919fac1 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -896,7 +896,7 @@ EndTable
 
 GrpTable: Grp3_1
 0: TEST Eb,Ib
-1:
+1: TEST Eb,Ib
 2: NOT Eb
 3: NEG Eb
 4: MUL AL,Eb
-- 
2.14.1



[PATCH 03/43] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack

2017-11-24 Thread Ingo Molnar
From: Andy Lutomirski 

get_stack_info() doesn't currently know about the SYSENTER stack, so
unwinding will fail if we entered the kernel on the SYSENTER stack
and haven't fully switched off.  Teach get_stack_info() about the
SYSENTER stack.

With future patches applied that run part of the entry code on the
SYSENTER stack and introduce an intentional BUG(), I would get:

PANIC: double fault, error_code: 0x0
...
RIP: 0010:do_error_trap+0x33/0x1c0
...
Call Trace:
Code: ...

With this patch, I get:

PANIC: double fault, error_code: 0x0
...
Call Trace:
 
 ? async_page_fault+0x36/0x60
 ? invalid_op+0x22/0x40
 ? async_page_fault+0x36/0x60
 ? sync_regs+0x3c/0x40
 ? sync_regs+0x2e/0x40
 ? error_entry+0x6c/0xd0
 ? async_page_fault+0x36/0x60
 
Code: ...

Signed-off-by: Andy Lutomirski 
Reviewed-by: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/c32ce8b363e27fa9b4a4773297d5b4b0f4b39e94.1511497875.git.l...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/stacktrace.h |  3 +++
 arch/x86/kernel/dumpstack.c   | 19 +++
 arch/x86/kernel/dumpstack_32.c|  6 ++
 arch/x86/kernel/dumpstack_64.c|  6 ++
 4 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 8da111b3c342..f8062bfd43a0 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -16,6 +16,7 @@ enum stack_type {
STACK_TYPE_TASK,
STACK_TYPE_IRQ,
STACK_TYPE_SOFTIRQ,
+   STACK_TYPE_SYSENTER,
STACK_TYPE_EXCEPTION,
STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
 };
@@ -28,6 +29,8 @@ struct stack_info {
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
   struct stack_info *info);
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info);
+
 int get_stack_info(unsigned long *stack, struct task_struct *task,
   struct stack_info *info, unsigned long *visit_mask);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index f13b4c00a5de..5e7d10e8ca25 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -43,6 +43,25 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
return true;
 }
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
+{
+   struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+
+   /* Treat the canary as part of the stack for unwinding purposes. */
+   void *begin = &tss->SYSENTER_stack_canary;
+   void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+
+   if ((void *)stack < begin || (void *)stack >= end)
+   return false;
+
+   info->type  = STACK_TYPE_SYSENTER;
+   info->begin = begin;
+   info->end   = end;
+   info->next_sp   = NULL;
+
+   return true;
+}
+
 static void printk_stack_address(unsigned long address, int reliable,
 char *log_lvl)
 {
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index daefae83a3aa..5ff13a6b3680 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -26,6 +26,9 @@ const char *stack_type_name(enum stack_type type)
if (type == STACK_TYPE_SOFTIRQ)
return "SOFTIRQ";
 
+   if (type == STACK_TYPE_SYSENTER)
+   return "SYSENTER";
+
return NULL;
 }
 
@@ -93,6 +96,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
if (task != current)
goto unknown;
 
+   if (in_sysenter_stack(stack, info))
+   goto recursion_check;
+
if (in_hardirq_stack(stack, info))
goto recursion_check;
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 88ce2ffdb110..abc828f8c297 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -37,6 +37,9 @@ const char *stack_type_name(enum stack_type type)
if (type == STACK_TYPE_IRQ)
return "IRQ";
 
+   if (type == STACK_TYPE_SYSENTER)
+   return "SYSENTER";
+
if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
return exception_stack_names[type - STACK_TYPE_EXCEPTION];
 
@@ -115,6 +118,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
if (in_irq_stack(stack, info))
goto recursion_check;
 
+   if (in_sysenter_stack(stack, info))
+   goto recursion_check;
+
goto unknown;
 
 recursion_check:
-- 
2.14.1



Re: [GIT PULL] Second batch of KVM changes for Linux 4.15

2017-11-24 Thread Paolo Bonzini
On 24/11/2017 04:50, Linus Torvalds wrote:
> On Mon, Nov 20, 2017 at 2:06 PM, Paolo Bonzini  wrote:
>>
>> I am not including the host side of AMD SEV, because it wouldn't have gotten
>> enough time in linux-next even with a "regular-length" merge window.  It
>> will be in 4.16.
> 
> So I pulled it, but then checked,
> 
> None of this was in linux-next 20171117 either,
> 
> So I unpulled it,

The ARM parts certainly were.

UMIP emulation wasn't, because I asked the x86 maintainers several times
to help with a topic branch, but they didn't.

Everything else is bugfixes and confined to arch/x86/kvm.

Anyway, will submit again for 4.16.

Paolo



Re: [PATCH 1/3] lockdep: Apply crossrelease to PG_locked locks

2017-11-24 Thread Jan Kara
On Fri 24-11-17 09:11:49, Michal Hocko wrote:
> On Fri 24-11-17 12:02:36, Byungchul Park wrote:
> > On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> > > On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > > > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > > > for each struct page. So you are doubling the size. Who is going to
> > > > > enable this config option? You are moving this to page_ext in a later
> > > > > patch which is a good step but it doesn't go far enough because this
> > > > > still consumes those resources. Is there any problem to make this
> > > > > kernel command line controllable? Something we do for page_owner for
> > > > > example?
> > > > 
> > > > Sure. I will add it.
> > > > 
> > > > > Also it would be really great if you could give us some measures about
> > > > > the runtime overhead. I do not expect it to be very large but this is
> > > > 
> > > > The major overhead would come from the amount of additional memory
> > > > consumption for 'lockdep_map's.
> > > 
> > > yes
> > > 
> > > > Do you want me to measure the overhead by the additional memory
> > > > consumption?
> > > > 
> > > > Or do you expect another overhead?
> > > 
> > > I would be also interested how much impact this has on performance. I do
> > > not expect it would be too large but having some numbers for cache cold
> > > parallel kbuild or other heavy page lock workloads.
> > 
> > Hello Michal,
> > 
> > I measured 'cache cold parallel kbuild' on my qemu machine. The result
> > varies too much to be conclusive, but I think there's no meaningful
> > difference between before and after applying crossrelease to page locks.
> > 
> > Actually, I expect little overhead in lock_page() and unlock_page() even
> > after applying crossrelease to page locks; I only expect a bit of overhead
> > from the additional memory consumption for 'lockdep_map's per page.
> > 
> > I run the following instructions within "QEMU x86_64 4GB memory 4 cpus":
> > 
> >make clean
> >echo 3 > drop_caches
> >time make -j4
> 
> Maybe FS people will help you find a more representative workload. E.g.
> linear cache cold file read should be good as well. Maybe there are some
> tests in fstests (or how they call xfstests these days).

So a relatively good test of page handling costs is to mmap a cache-hot file
and measure the time to fault in all the pages in the mapping. That way IO and
the filesystem stay out of the way and you measure only page table lookup,
page handling (taking the page reference and locking the page), and
instantiation of the new PTE. Of these, page handling is actually the
significant part.
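
For illustration, a minimal sketch of such a measurement (an example
added here, not Jan's code; it assumes an already cache-hot file named
"testfile" in the current directory):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Time how long it takes to fault in every page of a cache-hot file. */
int main(void)
{
	struct stat st;
	struct timespec t0, t1;
	volatile char sum = 0;
	long pagesize = sysconf(_SC_PAGESIZE);
	int fd = open("testfile", O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0)
		exit(1);

	char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		exit(1);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (off_t off = 0; off < st.st_size; off += pagesize)
		sum += map[off];	/* touch one byte per page -> one minor fault */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("faulted %ld pages in %.3f ms\n",
	       (long)(st.st_size / pagesize),
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);

	munmap(map, st.st_size);
	close(fd);
	return 0;
}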

Honza
-- 
Jan Kara 
SUSE Labs, CR


BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

2017-11-24 Thread Jesús Rodríguez Acosta
Hi all:

I got a kernel panic after executing "systemctl reboot".

Nov 24 09:02:06 desktop systemd[713]: Stopped target Timers.
Nov 24 09:02:06 desktop systemd[713]: Reached target Shutdown.
Nov 24 09:02:06 desktop systemd[713]: Starting Exit the Session...
Nov 24 09:02:06 desktop kernel: [ cut here ]
Nov 24 09:02:06 desktop kernel: WARNING: CPU: 3 PID: 1 at
kernel/fork.c:414 __put_task_struct+0xd4/0x130
Nov 24 09:02:06 desktop kernel: Modules linked in: macvtap macvlan
vhost_net vhost tap algif_skcipher xt_CHECKSUM iptable_mangle
ipt_MASQUER
Nov 24 09:02:06 desktop kernel:  ptp snd_pcm mei crc_itu_t pps_core
button nfsd sch_fq_codel auth_rpcgss nfs_acl lockd nct6775 grace
hwmon_v
Nov 24 09:02:06 desktop kernel: CPU: 3 PID: 1 Comm: systemd Not
tainted 4.14.1-gentoo_pure-x64-op #3
Nov 24 09:02:06 desktop kernel: Hardware name: System manufacturer
System Product Name/P9X79 WS, BIOS 4802 06/02/2015
Nov 24 09:02:06 desktop kernel: task: 88081bef task.stack:
c9008000
Nov 24 09:02:06 desktop kernel: RIP: 0010:__put_task_struct+0xd4/0x130
Nov 24 09:02:06 desktop kernel: RSP: 0018:c900bdd8 EFLAGS: 00010246
Nov 24 09:02:06 desktop kernel: RAX:  RBX:
8807fa3476c0 RCX: 0001
Nov 24 09:02:06 desktop kernel: RDX: c900be38 RSI:
8807fa3476c0 RDI: 8807fa3476c0
Nov 24 09:02:06 desktop kernel: RBP: c900bf28 R08:
1000 R09: 0007
Nov 24 09:02:06 desktop kernel: R10: 55928ae63660 R11:
880815825006 R12: 880814701900
Nov 24 09:02:06 desktop kernel: R13: 1000 R14:
8807fa3476c0 R15: 880814d79f80
Nov 24 09:02:06 desktop kernel: FS:  7f1b5d6d88c0()
GS:88083fcc() knlGS:
Nov 24 09:02:06 desktop kernel: CS:  0010 DS:  ES:  CR0:
80050033
Nov 24 09:02:06 desktop kernel: CR2: 55928ae79088 CR3:
000819990006 CR4: 001626e0
Nov 24 09:02:06 desktop kernel: Call Trace:
Nov 24 09:02:06 desktop kernel:  css_task_iter_next+0x61/0x70
Nov 24 09:02:06 desktop kernel:  kernfs_seq_next+0x1e/0x50
Nov 24 09:02:06 desktop kernel:  ? cgroup_procs_show+0x21/0x30
Nov 24 09:02:06 desktop kernel:  seq_read+0x2b8/0x380
Nov 24 09:02:06 desktop kernel:  __vfs_read+0x1e/0x120
Nov 24 09:02:06 desktop kernel:  vfs_read+0x8c/0x130
Nov 24 09:02:06 desktop kernel:  SyS_read+0x3d/0x90
Nov 24 09:02:06 desktop kernel:  entry_SYSCALL_64_fastpath+0x13/0x94
Nov 24 09:02:06 desktop kernel: RIP: 0033:0x7f1b5ba9a1ad
Nov 24 09:02:06 desktop kernel: RSP: 002b:7ffcdf3da830 EFLAGS:
0293 ORIG_RAX: 
Nov 24 09:02:06 desktop kernel: RAX: ffda RBX:
005d RCX: 7f1b5ba9a1ad
Nov 24 09:02:06 desktop kernel: RDX: 1000 RSI:
55928ae6ab40 RDI: 005d
Nov 24 09:02:06 desktop kernel: RBP: 7f1b5d6d8730 R08:
1010 R09: 7f1b5d6d88c0
Nov 24 09:02:06 desktop kernel: R10: 55928ae63660 R11:
0293 R12: 
Nov 24 09:02:06 desktop kernel: R13: 0001 R14:
 R15: 0003
Nov 24 09:02:06 desktop kernel: Code: 48 85 d2 74 06 f0 ff 4a 4c 74 34
48 8b 3d 1d 24 09 01 48 89 ee e8 cd 42 10 00 48 89 df 5b 5d e9 f3 fe
Nov 24 09:02:06 desktop kernel: ---[ end trace 47a5e18e047da55e ]---
Nov 24 09:02:06 desktop kernel: BUG: unable to handle kernel NULL
pointer dereference at 0018
Nov 24 09:02:06 desktop kernel: IP: exit_creds+0x16/0x40
Nov 24 09:02:06 desktop kernel: PGD 0 P4D 0
Nov 24 09:02:06 desktop kernel: Oops: 0002 [#1] PREEMPT SMP
Nov 24 09:02:06 desktop kernel: Modules linked in: macvtap macvlan
vhost_net vhost tap algif_skcipher xt_CHECKSUM iptable_mangle
ipt_MASQUER
Nov 24 09:02:06 desktop kernel:  ptp snd_pcm mei crc_itu_t pps_core
button nfsd sch_fq_codel auth_rpcgss nfs_acl lockd nct6775 grace
hwmon_v
Nov 24 09:02:06 desktop kernel: CPU: 3 PID: 1 Comm: systemd Tainted: G
   W   4.14.1-gentoo_pure-x64-op #3
Nov 24 09:02:06 desktop kernel: Hardware name: System manufacturer
System Product Name/P9X79 WS, BIOS 4802 06/02/2015
Nov 24 09:02:06 desktop kernel: task: 88081bef task.stack:
c9008000
Nov 24 09:02:06 desktop kernel: RIP: 0010:exit_creds+0x16/0x40
Nov 24 09:02:06 desktop kernel: RSP: 0018:c900bdc8 EFLAGS: 00010246
Nov 24 09:02:06 desktop kernel: RAX: 0001 RBX:
8807fa3476c0 RCX: 
Nov 24 09:02:06 desktop kernel: RDX: 55ed9557 RSI:
 RDI: 0018
Nov 24 09:02:06 desktop kernel: RBP: c900bf28 R08:
1000 R09: 0007
Nov 24 09:02:06 desktop kernel: R10: 55928ae63660 R11:
880815825006 R12: 880814701900
Nov 24 09:02:06 desktop kernel: R13: 1000 R14:
8807fa3476c0 R15: 880814d79f80
Nov 24 09:02:06 desktop kernel: FS:  7f1b5d6d88c0()
GS:88083fcc() knlGS:
Nov 24 09:02:06 desktop kernel: CS:  0010 DS:  ES:  CR0:
80050033
No

Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64

2017-11-24 Thread Andrea Reale
Hi Arun,


On Fri 24 Nov 2017, 11:25, Arun KS wrote:
> On Thu, Nov 23, 2017 at 4:43 PM, Maciej Bielski
>  wrote:
>> [ ...]
> > Introduces memory hotplug functionality (hot-add) for arm64.
> > @@ -615,6 +616,44 @@ void __init paging_init(void)
> >   SWAPPER_DIR_SIZE - PAGE_SIZE);
> >  }
> >
> > +#ifdef CONFIG_MEMORY_HOTPLUG
> > +
> > +/*
> > + * hotplug_paging() is used by memory hotplug to build new page tables
> > + * for hot added memory.
> > + */
> > +
> > +struct mem_range {
> > +   phys_addr_t base;
> > +   phys_addr_t size;
> > +};
> > +
> > +static int __hotplug_paging(void *data)
> > +{
> > +   int flags = 0;
> > +   struct mem_range *section = data;
> > +
> > +   if (debug_pagealloc_enabled())
> > +   flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > +
> > +   __create_pgd_mapping(swapper_pg_dir, section->base,
> > +   __phys_to_virt(section->base), section->size,
> > +   PAGE_KERNEL, pgd_pgtable_alloc, flags);
> 
> Hello Andrea,
> 
> __hotplug_paging runs on stop_machine context.
> cpu stop callbacks must not sleep.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/stop_machine.c?h=v4.14#n479
> 
> __create_pgd_mapping uses pgd_pgtable_alloc, which does
> __get_free_page(PGALLOC_GFP)
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/mm/mmu.c?h=v4.14#n342
> 
> PGALLOC_GFP has GFP_KERNEL, which in turn has __GFP_RECLAIM
> 
> #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
> #define GFP_KERNEL  (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> 
> Now, prepare_alloc_pages() called by __alloc_pages_nodemask checks for
> 
> might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v4.14#n4150
> 
> and then BUG()

Well spotted, thanks for reporting the problem. One possible solution
would be to revert to building the updated page tables on a copy of the
pgdir (as was done in v1 of this patchset) and then replacing swapper
atomically with stop_machine.

Actually, I am not sure stop_machine is strictly needed
if we modify the swapper pgdir live: for example, in x86_64's
kernel_physical_mapping_init, atomicity is ensured by spin-locking on
init_mm.page_table_lock.
https://elixir.free-electrons.com/linux/v4.14/source/arch/x86/mm/init_64.c#L684
I'll spend some time investigating who else could be working
concurrently on the swapper pgdir.

Any suggestion or pointer is very welcome.
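
For illustration only, a rough sketch of a sleepable-lock variant (a
hypothetical alternative, not a tested patch); it assumes that
__create_pgd_mapping() is otherwise safe to run concurrently against
the live swapper pgdir, which is exactly the open question above:

/* Hypothetical sketch: serialize live updates to swapper_pg_dir with a
 * sleepable lock so that pgd_pgtable_alloc() may keep using GFP_KERNEL.
 * x86_64 instead takes init_mm.page_table_lock around just the
 * population steps; neither scheme is verified safe for arm64 here.
 */
static DEFINE_MUTEX(hotplug_paging_lock);

inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
{
	int flags = 0;

	if (debug_pagealloc_enabled())
		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

	mutex_lock(&hotplug_paging_lock);
	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
			     size, PAGE_KERNEL, pgd_pgtable_alloc, flags);
	mutex_unlock(&hotplug_paging_lock);
}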

Thanks,
Andrea

> I was testing on 4.4 kernel, but cross checked with 4.14 as well.
> 
> Regards,
> Arun
> 
> 
> > +
> > +   return 0;
> > +}
> > +
> > +inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> > +{
> > +   struct mem_range section = {
> > +   .base = start,
> > +   .size = size,
> > +   };
> > +
> > +   stop_machine(__hotplug_paging, &section, NULL);
> > +}
> > +#endif /* CONFIG_MEMORY_HOTPLUG */
> > +
> >  /*
> >   * Check whether a kernel address is valid (derived from arch/x86/).
> >   */
> > --
> > 2.7.4
> >
> 



Re: [PATCH 1/1] stackdepot: interface to check entries and size of stackdepot.

2017-11-24 Thread Maninder Singh
Hi Michal,
  
> On Wed 22-11-17 16:17:41, Maninder Singh wrote:
> > This patch provides an interface to check all the stack entries
> > saved in stackdepot so far, as well as the memory consumed by stackdepot.
> > 
> > 1) Take current depot_index and offset to calculate end address for one
> > iteration of (/sys/kernel/debug/depot_stack/depot_entries).
> > 
> > 2) Fill end marker in every slab to point its end, and then use it while
> > traversing all the slabs of stackdepot.
> > 
> > "debugfs code inspired from page_onwer's way of printing BT"
> > 
> > checked on ARM and x86_64.
> > $cat /sys/kernel/debug/depot_stack/depot_size
> > Memory consumed by Stackdepot:208 KB
> > 
> > $ cat /sys/kernel/debug/depot_stack/depot_entries
> > stack count 1 backtrace
> >  init_page_owner+0x1e/0x210
> >  start_kernel+0x310/0x3cd
> >  secondary_startup_64+0xa5/0xb0
> >  0x
>  
> Why do we need this? Who is going to use this information and what for?
> I haven't looked at the code but just the diffstat looks like this
> should better have a _very_ good justification to be considered for
> merging. To be honest with you I have hard time imagine how this can be
> useful other than debugging stack depot...

This interface can be used for multiple reasons:

1) For debugging stackdepot itself, for sure.
2) For checking all the unique allocation paths in the system.
3) To check whether any invalid stacks are coming in and inflating
stackdepot memory.
(https://lkml.org/lkml/2017/10/11/353)

Although this needs to be taken care of in ARM, as the maintainer
replied, with the help of this interface it was quite easy to check,
and we added a workaround to save memory.

4) To check, at any point in time, the memory currently consumed by
stackdepot.
5) To check the number of entries in stackdepot, in order to decide the
stackdepot hash table size for different systems.
   For fewer entries the hash table size can be reduced from 4MB.

Thanks
Maninder Singh


[PATCH] perf test: Fix test 21 for s390x

2017-11-24 Thread Thomas Richter
Test case 21 (Number of exit events of a simple workload) fails
on s390x. The reason is the invalid sample frequency supplied for
this test. On s390x the minimum sample frequency is much higher,
as shown in the output of /proc/service_levels:

[root@s35lp76 linux-devel]# cat /proc/service_levels
CPU-MF: Counter facility: version=3.5 authorization=002f
CPU-MF: Sampling facility: min_rate=18228 max_rate=170650536 cpu_speed=5208
...
[root@s35lp76 linux-devel]#

Supply a safe sample frequency value for s390x to fix this.
The value will be adjusted by the s390x CPUMF frequency
conversion function to a value well below the
sysctl kernel.perf_event_max_sample_rate value.

Signed-off-by: Thomas Richter 
Reviewed-by: Hendrik Brueckner 
---
 tools/perf/tests/task-exit.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/tests/task-exit.c b/tools/perf/tests/task-exit.c
index bc4a7344e274..1073670dd1f2 100644
--- a/tools/perf/tests/task-exit.c
+++ b/tools/perf/tests/task-exit.c
@@ -84,7 +84,11 @@ int test__task_exit(struct test *test __maybe_unused, int subtest __maybe_unused
 
evsel = perf_evlist__first(evlist);
evsel->attr.task = 1;
+#ifdef __s390x__
+   evsel->attr.sample_freq = 100;
+#else
evsel->attr.sample_freq = 1;
+#endif
evsel->attr.inherit = 0;
evsel->attr.watermark = 0;
evsel->attr.wakeup_events = 1;
-- 
2.13.4



Re: Fwd: Why qemu with kvm enabled can boot kernel even if identity page map is not set correctly?

2017-11-24 Thread Paolo Bonzini
On 22/11/2017 03:58, 丁飞 wrote:
> -- Forwarded message --
> From: 丁飞 
> Date: Wed, Nov 22, 2017 at 12:58 AM
> Subject: Why qemu with kvm enabled can boot kernel even if identity
> page map is not set correctly?
> To: k...@vger.kernel.org
> 
> 
> Hi, KVM developers. Firstly, sorry if it's the wrong place to ask such
> a question!
> 
> In the early stages of the boot process, the kernel needs identity-mapped
> pages set up when switching the GDT
> [https://github.com/torvalds/linux/blob/ed30b147e1f6e396e70a52dbb6c7d66befedd786/arch/x86/kernel/head_64.S#L133-L137]
> as the code here
> [https://github.com/torvalds/linux/blob/ed30b147e1f6e396e70a52dbb6c7d66befedd786/arch/x86/kernel/head64.c#L98-L138]
> implies. That's why the first few entries
> of early_dynamic_pgts are set to map the kernel text range [_text, _end].
> After discussing the role of the first few entries of early_dynamic_pgts,
> we deleted them
> [https://github.com/torvalds/linux/blob/ed30b147e1f6e396e70a52dbb6c7d66befedd786/arch/x86/kernel/head64.c#L98-L138]
> and recompiled the kernel, then tested it on qemu.
> 
> Without the '-enable-kvm' option the kernel won't boot, as we expected; but
> with KVM enabled, the kernel boots and everything runs well, much to our
> surprise.
> 
> So I guess something is done under the hood by KVM that doesn't obey
> the rules of how a real physical machine behaves.
> 
> I've set up a debug environment in which the kernel with the misconfigured
> page tables runs inside qemu, which is nested inside VMware Workstation with
> EPT enabled, with gdb on the host to debug KVM in the host kernel.
> 
> Without any luck, I've spent a whole day trying to catch what is
> happening inside KVM; I still can't figure out the point at which it
> jumps through the broken page table.
> It seems that the code just jumps randomly.
> 
> Can anyone confirm what we've observed? Is it designed to be like that?
> Any details or explanation would be really appreciated!

I'm sorry, I don't know.  There are many differences in TLB behavior
between emulation and real hardware; those could be the culprit.

Paolo


Re: powerpc/vas, export chip_to_vas_id()

2017-11-24 Thread Michael Ellerman
On Mon, 2017-11-20 at 19:12:48 UTC, Sukadev Bhattiprolu wrote:
> From 958f8db089f4b89407fc4b89bccd3eaef585aa96 Mon Sep 17 00:00:00 2001
> From: Sukadev Bhattiprolu 
> Date: Mon, 20 Nov 2017 12:53:15 -0600
> Subject: [PATCH 1/1] powerpc/vas, export chip_to_vas_id()
> 
> Export the symbol chip_to_vas_id() to fix a build failure when
> CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV=m.
> 
> Reported-by: Haren Myneni 
> Reported-by: Josh Boyer 
> Signed-off-by: Sukadev Bhattiprolu 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/62b49c42107efd973edffc75f4f874

cheers


[PATCH] perf annotate: Fix unnecessary memory allocation for s390x

2017-11-24 Thread Thomas Richter
This patch fixes a bug introduced with commit d9f8dfa9baf9
("perf annotate s390: Implement jump types for perf annotate").

Perf annotate displays annotated assembler output by reading the
output of the objdump command and parsing the disassembled lines. For
each mnemonic shown, this function sequence is executed:

  disasm_line__new()
  |
  +--> disasm_line__init_ins()
   |
   +--> ins__find()
|
+--> arch->associate_instruction_ops()

The s390x-specific function assigned to the function pointer
associate_instruction_ops is s390__associate_ins_ops(). This
function checks for supported mnemonics and assigns a NULL
pointer to unsupported ones. However, even the NULL pointer is
added to the architecture-dependent instruction array.

This leads to an extremely large architecture instruction array
(due to the array resize logic in arch__grow_instructions()).
Depending on the objdump output being parsed, the array can end
up with several tens of thousands of elements.

This patch checks whether a mnemonic is supported and adds only
supported ones to the architecture instruction array. The array
no longer contains elements with NULL pointers.

Before the patch (with some debug printf output):
[root@s35lp76 perf]# time ./perf annotate --stdio > /tmp/xxxbb

real8m49.679s
user7m13.008s
sys 0m1.649s
[root@s35lp76 perf]# fgrep '__ins__find sorted:1 nr_instructions:' /tmp/xxxbb | tail -1
__ins__find sorted:1 nr_instructions:87433 ins:0x341583c0
[root@s35lp76 perf]#

The number of different s390x branch/jump/call/return instructions
entered into the array is 87433.

After the patch (with some debug printf output):

[root@s35lp76 perf]# time ./perf annotate --stdio > /tmp/xxxaa

real1m24.553s
user0m0.587s
sys 0m1.530s
[root@s35lp76 perf]# fgrep '__ins__find sorted:1 nr_instructions:' /tmp/xxxaa | tail -1
__ins__find sorted:1 nr_instructions:56 ins:0x3f406570
[root@s35lp76 perf]#

The number of different s390x branch/jump/call/return instructions
entered into the array is 56, which is sensible.

Signed-off-by: Thomas Richter 
Reviewed-by: Hendrik Brueckner 
---
 tools/perf/arch/s390/annotate/instructions.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/arch/s390/annotate/instructions.c b/tools/perf/arch/s390/annotate/instructions.c
index c9a81673e8aa..89f0b6c00e3f 100644
--- a/tools/perf/arch/s390/annotate/instructions.c
+++ b/tools/perf/arch/s390/annotate/instructions.c
@@ -16,7 +16,8 @@ static struct ins_ops *s390__associate_ins_ops(struct arch *arch, const char *name)
if (!strcmp(name, "br"))
ops = &ret_ops;
 
-   arch__associate_ins_ops(arch, name, ops);
+   if (ops)
+   arch__associate_ins_ops(arch, name, ops);
return ops;
 }
 
-- 
2.13.4



Re: [GIT PULL] UBI/UBIFS updates for 4.15-rc1

2017-11-24 Thread Richard Weinberger
Linus,

On Friday, 24 November 2017 at 04:41:37 CET, Linus Torvalds wrote:
> On Thu, Nov 23, 2017 at 4:37 AM, Richard Weinberger  wrote:
> >   git://git.infradead.org/linux-ubifs.git tags/upstream-4.15-rc1
> 
> Similarly to the arch/um case, none of this seems to have been in
> linux-next, and is sent late in the merge window, so I'm skipping it.

It has been in linux-next since next-20171120.
I'm sorry for being late, will do a better job next time.

If you don't mind I'll send you a PR with bug fixes for 4.15-rc2.
Same for UML.

Thanks,
//richard



Re: [v2] powerpc: fix boot on BOOK3S_32 with CONFIG_STRICT_KERNEL_RWX

2017-11-24 Thread Michael Ellerman
On Tue, 2017-11-21 at 14:28:20 UTC, Christophe Leroy wrote:
> On powerpc32, patch_instruction() is called by apply_feature_fixups()
> which is called from early_init()
> 
> There is the following note in front of early_init():
>  * Note that the kernel may be running at an address which is different
>  * from the address that it was linked at, so we must use RELOC/PTRRELOC
>  * to access static data (including strings).  -- paulus
> 
> Therefore, slab_is_available() cannot be called yet, and
> text_poke_area must be addressed with PTRRELOC()
> 
> Fixes: 37bc3e5fd764f ("powerpc/lib/code-patching: Use alternate map
> for patch_instruction()")
> Reported-by: Meelis Roos 
> Cc: Balbir Singh 
> Signed-off-by: Christophe Leroy 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/252eb55816a6f69ef9464cad303cdb

cheers


Re: [PATCH v7 3/5] bpf: add a bpf_override_function helper

2017-11-24 Thread Daniel Borkmann
On 11/22/2017 10:23 PM, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Error injection is sloppy and very ad-hoc.  BPF could fill this niche
> perfectly with its kprobe functionality.  We could make sure errors are
> only triggered in specific call chains that we care about with very
> specific situations.  Accomplish this with the bpf_override_function
> helper.  This will modify the probed caller's return value to the
> specified value and set the PC to an override function that simply
> returns, bypassing the originally probed function.  This gives us a nice
> clean way to implement systematic error injection for all of our code
> paths.
> 
> Acked-by: Alexei Starovoitov 
> Acked-by: Ingo Molnar 
> Signed-off-by: Josef Bacik 

Series looks good to me as well; BPF bits:

Acked-by: Daniel Borkmann 

