Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables

2017-10-03 Thread Santosh Shilimkar

Hi Dave,

On 10/2/2017 1:30 PM, Santosh Shilimkar wrote:

On 10/1/2017 10:56 PM, David Miller wrote:

From: David Miller 

[...]



Actually, reverted, this breaks the build.

net/rds/rdma_transport.c:38:10: fatal error: ib.h: No such file or 
directory

  #include "ib.h"

Although I can't see how in the world this patch is causing such
an error.


Weird indeed. Will sort this out with Avinash. Thanks Dave.


I tried few build combinations on net-next but couldn't
reproduce the build failure. AFAIU, the build error can only
happen if for some reason the ib.h file got deleted
accidentally. I did delete ib.h file and saw below error

  CC [M]  net/rds/rdma_transport.o
net/rds/rdma_transport.c:38:16: error: ib.h: No such file or directory

Could it be the case for some reason the ib.h file got
deleted or mangled while applying the $subject patch ?

Regards,
Santosh


Re: [PATCH 1/6] f2fs: support issuing/waiting discard in range

2017-10-03 Thread Chao Yu
Hi Jaegeuk,

On 2017/10/4 4:16, Jaegeuk Kim wrote:
> On 09/19, Chao Yu wrote:
>> From: Chao Yu 
>>
>> Fstrim intends to trim invalid blocks of filesystem only with specified
>> range and granularity, but actually, it will issue all previous cached
>> discard commands which may be out-of-range and be with unmatched
>> granularity, it's unneeded.
>>
>> In order to fix above issues, this patch introduces new helps to support
>> to issue and wait discard in range and adds a new fstrim_list for tracking
>> in-flight discard from ->fstrim.
>>
>> Signed-off-by: Chao Yu 
>> ---
>>  fs/f2fs/f2fs.h|   1 +
>>  fs/f2fs/segment.c | 125 
>> +-
>>  2 files changed, 106 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> index 89b4927dcd79..cb13c7df6971 100644
>> --- a/fs/f2fs/f2fs.h
>> +++ b/fs/f2fs/f2fs.h
>> @@ -249,6 +249,7 @@ struct discard_cmd_control {
>>  struct list_head pend_list[MAX_PLIST_NUM];/* store pending entries */
>>  unsigned char pend_list_tag[MAX_PLIST_NUM];/* tag for pending entries */
>>  struct list_head wait_list; /* store on-flushing entries */
>> +struct list_head fstrim_list;   /* in-flight discard from 
>> fstrim */
>>  wait_queue_head_t discard_wait_queue;   /* waiting queue for wake-up */
>>  unsigned int discard_wake;  /* to wake up discard thread */
>>  struct mutex cmd_lock;
>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>> index dedf0209d820..088936c61905 100644
>> --- a/fs/f2fs/segment.c
>> +++ b/fs/f2fs/segment.c
>> @@ -845,9 +845,11 @@ void __check_sit_bitmap(struct f2fs_sb_info *sbi,
>>  
>>  /* this function is copied from blkdev_issue_discard from block/blk-lib.c */
>>  static void __submit_discard_cmd(struct f2fs_sb_info *sbi,
>> -struct discard_cmd *dc)
>> +struct discard_cmd *dc, bool fstrim)
>>  {
>>  struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
>> +struct list_head *wait_list = fstrim ? &(dcc->fstrim_list) :
>> +&(dcc->wait_list);
>>  struct bio *bio = NULL;
>>  
>>  if (dc->state != D_PREP)
>> @@ -869,7 +871,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info 
>> *sbi,
>>  bio->bi_end_io = f2fs_submit_discard_endio;
>>  bio->bi_opf |= REQ_SYNC;
>>  submit_bio(bio);
>> -list_move_tail(&dc->list, &dcc->wait_list);
>> +list_move_tail(&dc->list, wait_list);
>>  __check_sit_bitmap(sbi, dc->start, dc->start + dc->len);
>>  
>>  f2fs_update_iostat(sbi, FS_DISCARD, 1);
>> @@ -1054,6 +1056,68 @@ static int __queue_discard_cmd(struct f2fs_sb_info 
>> *sbi,
>>  return 0;
>>  }
>>  
>> +static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi,
>> +unsigned int start, unsigned int end,
>> +unsigned int granularity)
>> +{
>> +struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
>> +struct discard_cmd *prev_dc = NULL, *next_dc = NULL;
>> +struct rb_node **insert_p = NULL, *insert_parent = NULL;
>> +struct discard_cmd *dc;
>> +struct blk_plug plug;
>> +int issued;
>> +
>> +next:
>> +issued = 0;
>> +
>> +mutex_lock(&dcc->cmd_lock);
>> +f2fs_bug_on(sbi, !__check_rb_tree_consistence(sbi, &dcc->root));
>> +
>> +dc = (struct discard_cmd *)__lookup_rb_tree_ret(&dcc->root,
>> +NULL, start,
>> +(struct rb_entry **)&prev_dc,
>> +(struct rb_entry **)&next_dc,
>> +&insert_p, &insert_parent, true);
>> +if (!dc)
>> +dc = next_dc;
>> +
>> +blk_start_plug(&plug);
>> +
>> +while (dc && dc->lstart <= end) {
>> +struct rb_node *node;
>> +
>> +if (dc->len < granularity)
>> +goto skip;
>> +
>> +if (dc->state != D_PREP) {
>> +list_move_tail(&dc->list, &dcc->fstrim_list);
>> +goto skip;
>> +}
>> +
>> +__submit_discard_cmd(sbi, dc, true);
>> +
>> +if (++issued >= DISCARD_ISSUE_RATE) {
>> +start = dc->lstart + dc->len;
>> +
>> +blk_finish_plug(&plug);
>> +mutex_unlock(&dcc->cmd_lock);
>> +
>> +schedule();
>> +
>> +goto next;
>> +}
>> +skip:
>> +node = rb_next(&dc->rb_node);
>> +dc = rb_entry_safe(node, struct discard_cmd, rb_node);
>> +
>> +if (fatal_signal_pending(current))
>> +break;
>> +}
>> +
>> +blk_finish_plug(&plug);
>> +mutex_unlock(&dcc->cmd_lock);
>> +}
>> +
>>  static int __issue_disca

Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Nadav Amit
Jerome Glisse  wrote:

> On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> 
>> I'd like some more explanation about the inner working of "that new
>> user" as per comment above.
>> 
>> It would be enough to drop mmu_notifier_invalidate_range from above
>> without adding it to the filebacked case. The above gives higher prio
>> to the hypothetical and uncertain future case, than to the current
>> real filebacked case that doesn't need ->invalidate_range inside the
>> PT lock, or do you see something that might already need such
>> ->invalidate_range?
> 
> No i don't see any new user today that might need such invalidate but
> i was trying to be extra cautious as i have a tendency to assume that
> someone might do a patch that use try_to_unmap() without going through
> all the comments in the function and thus possibly using it in a an
> unexpected way from mmu_notifier callback point of view. I am fine
> with putting the burden on new user to get it right and adding an
> extra warning in the function description to try to warn people in a
> sensible way.

I must be missing something. After the PTE is changed, but before the
secondary TLB notification/invalidation - What prevents another thread from
changing the mappings (e.g., using munmap/mmap), and setting a new page
at that PTE?

Wouldn’t it end with the page being mapped without a secondary TLB flush in
between?

Nadav



Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 03, 2017 at 11:15:07PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, October 3, 2017 8:59:00 PM CEST Rafael J. Wysocki wrote:
> > On Tuesday, October 3, 2017 8:53:13 PM CEST Luis R. Rodriguez wrote:
> > > Now that all filesystems which used to rely on kthread
> > > freezing have been converted to filesystem freeze/thawing
> > > we can remove the kernel kthread freezer.
> > > 
> > > Signed-off-by: Luis R. Rodriguez 
> > 
> > I like this one. :-)
> 
> However, suspend_freeze/thaw_processes() require some more work.
> 
> In particular, the freezing of workqueues is being removed here
> without a replacement.

Hrm, where do you see that? freeze_workqueues_busy() is still called on
try_to_freeze_tasks(). Likewise thaw_processes() also calls thaw_workqueues().
I did forget to nuke pm_nosig_freezing though.

Also as I have noted a few times now we have yet to determine if we can remove
all freezer calls on kthreads without causing a regression. Granted I'm being
overly careful here, I still would not be surprised to learn about a stupid use
case where this will be hard to remove now.

Only once this is done should this patch be considered. This will take time. So
I'd like more general consensus on long term approach:

  1) first address each fs to use its freezer calls on susend/hibernate / and 
thaw
 on resume. While at it, remove freezer API calls on their respective
 kthreads.
  2) In parallel address removing freezer API calls on non-IO kthreads, these
 should be trivial, but who knows what surprises lurk here
  3) Lookup for kthreads which seem to generate IO -- address / review if
 removal of the freezer API can be done somehow with a quescing. This
 is currently for example being done on SCSI / md.
  4) Only after all the above is done should we consider this patch or some
 form of it.

  Luis


Re: [f2fs-dev] [PATCH 1/6] f2fs: support issuing/waiting discard in range

2017-10-03 Thread Chao Yu
Hi Jaegeuk,

On 2017/10/4 4:52, Jaegeuk Kim wrote:
> Hi Chao,
> 
> Could you please rebase the following patches on top of dev-test?
> 
> f2fs-support-issuing-waiting-discard-in-range
> f2fs-wrap-discard-policy
> f2fs-split-discard-policy
> f2fs-reduce-cmd_lock-coverage-in-__issue_discard_cmd
> f2fs-trace-f2fs_remove_discard
> f2fs-give-up-CP_TRIMMED_FLAG-if-it-drops-discards
> 
> It seems we need to rearrange some of patches for better review.

No problem, let me rebase them. :)

Thanks,

> 
> Thanks,
> 
> On 10/03, Jaegeuk Kim wrote:
>> On 09/19, Chao Yu wrote:
>>> From: Chao Yu 
>>>
>>> Fstrim intends to trim invalid blocks of filesystem only with specified
>>> range and granularity, but actually, it will issue all previous cached
>>> discard commands which may be out-of-range and be with unmatched
>>> granularity, it's unneeded.
>>>
>>> In order to fix above issues, this patch introduces new helps to support
>>> to issue and wait discard in range and adds a new fstrim_list for tracking
>>> in-flight discard from ->fstrim.
>>>
>>> Signed-off-by: Chao Yu 
>>> ---
>>>  fs/f2fs/f2fs.h|   1 +
>>>  fs/f2fs/segment.c | 125 
>>> +-
>>>  2 files changed, 106 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>>> index 89b4927dcd79..cb13c7df6971 100644
>>> --- a/fs/f2fs/f2fs.h
>>> +++ b/fs/f2fs/f2fs.h
>>> @@ -249,6 +249,7 @@ struct discard_cmd_control {
>>> struct list_head pend_list[MAX_PLIST_NUM];/* store pending entries */
>>> unsigned char pend_list_tag[MAX_PLIST_NUM];/* tag for pending entries */
>>> struct list_head wait_list; /* store on-flushing entries */
>>> +   struct list_head fstrim_list;   /* in-flight discard from 
>>> fstrim */
>>> wait_queue_head_t discard_wait_queue;   /* waiting queue for wake-up */
>>> unsigned int discard_wake;  /* to wake up discard thread */
>>> struct mutex cmd_lock;
>>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>>> index dedf0209d820..088936c61905 100644
>>> --- a/fs/f2fs/segment.c
>>> +++ b/fs/f2fs/segment.c
>>> @@ -845,9 +845,11 @@ void __check_sit_bitmap(struct f2fs_sb_info *sbi,
>>>  
>>>  /* this function is copied from blkdev_issue_discard from block/blk-lib.c 
>>> */
>>>  static void __submit_discard_cmd(struct f2fs_sb_info *sbi,
>>> -   struct discard_cmd *dc)
>>> +   struct discard_cmd *dc, bool fstrim)
>>>  {
>>> struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
>>> +   struct list_head *wait_list = fstrim ? &(dcc->fstrim_list) :
>>> +   &(dcc->wait_list);
>>> struct bio *bio = NULL;
>>>  
>>> if (dc->state != D_PREP)
>>> @@ -869,7 +871,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info 
>>> *sbi,
>>> bio->bi_end_io = f2fs_submit_discard_endio;
>>> bio->bi_opf |= REQ_SYNC;
>>> submit_bio(bio);
>>> -   list_move_tail(&dc->list, &dcc->wait_list);
>>> +   list_move_tail(&dc->list, wait_list);
>>> __check_sit_bitmap(sbi, dc->start, dc->start + dc->len);
>>>  
>>> f2fs_update_iostat(sbi, FS_DISCARD, 1);
>>> @@ -1054,6 +1056,68 @@ static int __queue_discard_cmd(struct f2fs_sb_info 
>>> *sbi,
>>> return 0;
>>>  }
>>>  
>>> +static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi,
>>> +   unsigned int start, unsigned int end,
>>> +   unsigned int granularity)
>>> +{
>>> +   struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
>>> +   struct discard_cmd *prev_dc = NULL, *next_dc = NULL;
>>> +   struct rb_node **insert_p = NULL, *insert_parent = NULL;
>>> +   struct discard_cmd *dc;
>>> +   struct blk_plug plug;
>>> +   int issued;
>>> +
>>> +next:
>>> +   issued = 0;
>>> +
>>> +   mutex_lock(&dcc->cmd_lock);
>>> +   f2fs_bug_on(sbi, !__check_rb_tree_consistence(sbi, &dcc->root));
>>> +
>>> +   dc = (struct discard_cmd *)__lookup_rb_tree_ret(&dcc->root,
>>> +   NULL, start,
>>> +   (struct rb_entry **)&prev_dc,
>>> +   (struct rb_entry **)&next_dc,
>>> +   &insert_p, &insert_parent, true);
>>> +   if (!dc)
>>> +   dc = next_dc;
>>> +
>>> +   blk_start_plug(&plug);
>>> +
>>> +   while (dc && dc->lstart <= end) {
>>> +   struct rb_node *node;
>>> +
>>> +   if (dc->len < granularity)
>>> +   goto skip;
>>> +
>>> +   if (dc->state != D_PREP) {
>>> +   list_move_tail(&dc->list, &dcc->fstrim_list);
>>> +   goto skip;
>>> +   }
>>> +
>>> +   __submit_discard_cmd(sbi, dc, true);
>>> +
>>> +   if (++issued >= DISCARD_ISSUE_RATE) {
>>> +   start = dc->lstart + dc-

Re: [f2fs-dev] [PATCH v4] f2fs: introduce discard_granularity sysfs entry

2017-10-03 Thread Chao Yu
Hi Ju Hyung, Jaegeuk,

On 2017/10/4 4:17, Jaegeuk Kim wrote:
> On 10/04, Ju Hyung Park wrote:
>> Hi Chao.
>>
>> Yep, that patch seems to have fixed it.
>> Doing "while true; do fstrim -v /; done" while "rm -rf"ing a 2GB
>> kbuild directory
>> (with lots of small .o files and stuff) ended flawlessly.

Thanks for the test. ;)

>>
>> I hope to see this patch merged with next 4.14 merge cycle.
> 
> Cool! I'll merge this patch and submit for 4.14. :)

Thanks for the quick merging. ;)

Thanks,

> 
> Thanks,
> 
>>
>> Thanks :)
>>
>> On Tue, Oct 3, 2017 at 12:59 AM, Chao Yu  wrote:
>>> Hi Park,
>>>
>>> Thanks for the report, could have a try with below patch:
>>>
>>> From 5fa30e8cdcb93f210e25142c48a884be383c6121 Mon Sep 17 00:00:00 2001
>>> From: Chao Yu 
>>> Date: Mon, 2 Oct 2017 02:50:16 +0800
>>> Subject: [PATCH] f2fs: fix potential panic during fstrim
>>>
>>> As Ju Hyung Park reported:
>>>
>>> "When 'fstrim' is called for manual trim, a BUG() can be triggered
>>> randomly with this patch.
>>>
>>> I'm seeing this issue on both x86 Desktop and arm64 Android phone.
>>>
>>> On x86 Desktop, this was caused during Ubuntu boot-up. I have a
>>> cronjob installed which calls 'fstrim -v /' during boot. On arm64
>>> Android, this was caused during GC looping with 1ms gc_min_sleep_time
>>> & gc_max_sleep_time."
>>>
>>> Root cause of this issue is that f2fs_wait_discard_bios can only be
>>> used by f2fs_put_super, because during put_super there must be no
>>> other referrers, so it can ignore discard entry's reference count
>>> when removing the entry, otherwise in other caller we will hit bug_on
>>> in __remove_discard_cmd as there may be other issuer added reference
>>> count in discard entry.
>>>
>>> Thread AThread B
>>> - issue_discard_thread
>>> - f2fs_ioc_fitrim
>>>  - f2fs_trim_fs
>>>   - f2fs_wait_discard_bios
>>>- __issue_discard_cmd
>>> - __submit_discard_cmd
>>>  - __wait_discard_cmd
>>>   - dc->ref++
>>>   - __wait_one_discard_bio
>>>- __wait_discard_cmd
>>> - __remove_discard_cmd
>>>  - f2fs_bug_on(sbi, dc->ref)
>>>
>>> Fixes: 969d1b180d987c2be02de890d0fff0f66a0e80de
>>> Reported-by: Ju Hyung Park 
>>> Signed-off-by: Chao Yu 
>>> ---
>>>  fs/f2fs/f2fs.h| 2 +-
>>>  fs/f2fs/segment.c | 6 +++---
>>>  fs/f2fs/super.c   | 2 +-
>>>  3 files changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>>> index 9a7c90386947..4b4a72f392be 100644
>>> --- a/fs/f2fs/f2fs.h
>>> +++ b/fs/f2fs/f2fs.h
>>> @@ -2525,7 +2525,7 @@ void invalidate_blocks(struct f2fs_sb_info *sbi, 
>>> block_t addr);
>>>  bool is_checkpointed_data(struct f2fs_sb_info *sbi, block_t blkaddr);
>>>  void refresh_sit_entry(struct f2fs_sb_info *sbi, block_t old, block_t new);
>>>  void stop_discard_thread(struct f2fs_sb_info *sbi);
>>> -void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi);
>>> +void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi, bool umount);
>>>  void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control 
>>> *cpc);
>>>  void release_discard_addrs(struct f2fs_sb_info *sbi);
>>>  int npages_for_summary_flush(struct f2fs_sb_info *sbi, bool for_ra);
>>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>>> index dedf0209d820..e28245b7e44e 100644
>>> --- a/fs/f2fs/segment.c
>>> +++ b/fs/f2fs/segment.c
>>> @@ -1210,11 +1210,11 @@ void stop_discard_thread(struct f2fs_sb_info *sbi)
>>>  }
>>>
>>>  /* This comes from f2fs_put_super and f2fs_trim_fs */
>>> -void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi)
>>> +void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi, bool umount)
>>>  {
>>> __issue_discard_cmd(sbi, false);
>>> __drop_discard_cmd(sbi);
>>> -   __wait_discard_cmd(sbi, false);
>>> +   __wait_discard_cmd(sbi, !umount);
>>>  }
>>>
>>>  static void mark_discard_range_all(struct f2fs_sb_info *sbi)
>>> @@ -2244,7 +2244,7 @@ int f2fs_trim_fs(struct f2fs_sb_info *sbi, struct 
>>> fstrim_range *range)
>>> }
>>> /* It's time to issue all the filed discards */
>>> mark_discard_range_all(sbi);
>>> -   f2fs_wait_discard_bios(sbi);
>>> +   f2fs_wait_discard_bios(sbi, false);
>>>  out:
>>> range->len = F2FS_BLK_TO_BYTES(cpc.trimmed);
>>> return err;
>>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>>> index 89f61eb3d167..933c3d529e65 100644
>>> --- a/fs/f2fs/super.c
>>> +++ b/fs/f2fs/super.c
>>> @@ -801,7 +801,7 @@ static void f2fs_put_super(struct super_block *sb)
>>> }
>>>
>>> /* be sure to wait for any on-going discard commands */
>>> -   f2fs_wait_discard_bios(sbi);
>>> +   f2fs_wait_discard_bios(sbi, true);
>>>
>>> if (f2fs_discard_en(sbi) && !sbi->discard_blks) {
>>> struct cp_control cpc = {
>>> --
>>> 2.14.1.145.gb3622a4ee
>>>
>>> On 2017/10/2 3:29, Ju Hy

[PATCH net] RDS: IB: Initialize max_items based on underlying device attributes

2017-10-03 Thread Avinash Repaka
Use max_1m_mrs/max_8k_mrs while setting max_items, as the former
variables are set based on the underlying device attricutes.

Signed-off-by: Avinash Repaka 
---
 net/rds/ib_rdma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 9a3c54e..e678699 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -601,11 +601,11 @@ struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct 
rds_ib_device *rds_ibdev,
if (pool_type == RDS_IB_MR_1M_POOL) {
/* +1 allows for unaligned MRs */
pool->fmr_attr.max_pages = RDS_MR_1M_MSG_SIZE + 1;
-   pool->max_items = RDS_MR_1M_POOL_SIZE;
+   pool->max_items = rds_ibdev->max_1m_mrs;
} else {
/* pool_type == RDS_IB_MR_8K_POOL */
pool->fmr_attr.max_pages = RDS_MR_8K_MSG_SIZE + 1;
-   pool->max_items = RDS_MR_8K_POOL_SIZE;
+   pool->max_items = rds_ibdev->max_8k_mrs;
}
 
pool->max_free_pinned = pool->max_items * pool->fmr_attr.max_pages / 4;
-- 
2.4.11



Re: [RFC v3 7/7] platform/x86: intel_scu_ipc: Use generic Intel IPC device calls

2017-10-03 Thread sathyanarayanan kuppuswamy

Hi Guenter,


On 09/28/2017 06:35 AM, Guenter Roeck wrote:

On 09/28/2017 05:55 AM, Alexandre Belloni wrote:
On 04/09/2017 at 22:37:27 -0700, 
sathyanarayanan.kuppusw...@linux.intel.com wrote:
From: Kuppuswamy Sathyanarayanan 



Removed redundant IPC helper functions and refactored the driver to use
generic IPC device driver APIs.

This patch also cleans-up SCU IPC user drivers to use APIs provided
by generic IPC driver.

Signed-off-by: Kuppuswamy Sathyanarayanan 



For the RTC part:
Acked-by: Alexandre Belloni 
 >

---
  arch/x86/include/asm/intel_scu_ipc.h    |  23 +-
  arch/x86/platform/intel-mid/intel-mid.c |  12 +-
  drivers/platform/x86/Kconfig    |   1 +
  drivers/platform/x86/intel_scu_ipc.c    | 483 
+---

  drivers/rtc/rtc-mrst.c  |  15 +-
  drivers/watchdog/intel-mid_wdt.c    |  12 +-
  drivers/watchdog/intel_scu_watchdog.c   |  17 +-
  7 files changed, 251 insertions(+), 312 deletions(-)


I think it would help in general to mention at least in the 
description which drivers are touched.

I will add this info in my next patch version.


I see calls to intel_ipc_dev_get() but not associated put calls. 
Missing the context, it may
well be that those are not necessary, but then how does the ipc code 
know that the device

reference is no longer needed ?
I also noticed this issue. Initially I thought we don't need to deal 
with ref count because in most
cases IPC client drivers are compiled as built-in drivers ( because 
these APIs are required during
early boot process to send IPC commands to ARC processor) and hence will 
have life-time until the

device is powered-off.

But after looking into Kconfig params, I noticed that at least 
intel_pmc_ipc and intel_punit_ipc drivers can be
compiled as modules. And hence we will have dangling pointer reference 
issue if we don't maintain

proper refcount for device references.

So, I have already fixed this problem by adding intel_ipc_dev_put() and  
devm_intel_ipc_dev_get() APIs.


This fix will be available in next version.


Guenter



diff --git a/arch/x86/include/asm/intel_scu_ipc.h 
b/arch/x86/include/asm/intel_scu_ipc.h

index 81d3d87..5842534 100644
--- a/arch/x86/include/asm/intel_scu_ipc.h
+++ b/arch/x86/include/asm/intel_scu_ipc.h
@@ -14,9 +14,19 @@
  #define IPCMSG_COLD_BOOT    0xF3
    #define IPCMSG_VRTC    0xFA /* Set vRTC device */
-    /* Command id associated with message IPCMSG_VRTC */
-    #define IPC_CMD_VRTC_SETTIME  1 /* Set time */
-    #define IPC_CMD_VRTC_SETALARM 2 /* Set alarm */
+
+/* Command id associated with message IPCMSG_VRTC */
+#define IPC_CMD_VRTC_SETTIME  1 /* Set time */
+#define IPC_CMD_VRTC_SETALARM 2 /* Set alarm */
+
+#define INTEL_SCU_IPC_DEV    "intel_scu_ipc"
+#define SCU_PARAM_LEN    2
+
+static inline void scu_cmd_init(u32 *cmd, u32 param1, u32 param2)
+{
+    cmd[0] = param1;
+    cmd[1] = param2;
+}
    /* Read single register */
  int intel_scu_ipc_ioread8(u16 addr, u8 *data);
@@ -45,13 +55,6 @@ int intel_scu_ipc_writev(u16 *addr, u8 *data, int 
len);

  /* Update single register based on the mask */
  int intel_scu_ipc_update_register(u16 addr, u8 data, u8 mask);
  -/* Issue commands to the SCU with or without data */
-int intel_scu_ipc_simple_command(int cmd, int sub);
-int intel_scu_ipc_command(int cmd, int sub, u32 *in, int inlen,
-  u32 *out, int outlen);
-int intel_scu_ipc_raw_command(int cmd, int sub, u8 *in, int inlen,
-  u32 *out, int outlen, u32 dptr, u32 sptr);
-
  /* I2C control api */
  int intel_scu_ipc_i2c_cntrl(u32 addr, u32 *data);
  diff --git a/arch/x86/platform/intel-mid/intel-mid.c 
b/arch/x86/platform/intel-mid/intel-mid.c

index 12a2725..27541e9 100644
--- a/arch/x86/platform/intel-mid/intel-mid.c
+++ b/arch/x86/platform/intel-mid/intel-mid.c
@@ -22,6 +22,7 @@
  #include 
  #include 
  #include 
+#include 
    #include 
  #include 
@@ -68,18 +69,23 @@ static void *(*get_intel_mid_ops[])(void) = 
INTEL_MID_OPS_INIT;

  enum intel_mid_cpu_type __intel_mid_cpu_chip;
  EXPORT_SYMBOL_GPL(__intel_mid_cpu_chip);
  +static struct intel_ipc_dev *ipc_dev;
+
  static void intel_mid_power_off(void)
  {
+    u32 cmds[SCU_PARAM_LEN] = {IPCMSG_COLD_OFF, 1};
  /* Shut down South Complex via PWRMU */
  intel_mid_pwr_power_off();
    /* Only for Tangier, the rest will ignore this command */
-    intel_scu_ipc_simple_command(IPCMSG_COLD_OFF, 1);
+    ipc_dev_simple_cmd(ipc_dev, cmds, SCU_PARAM_LEN);
  };
    static void intel_mid_reboot(void)
  {
-    intel_scu_ipc_simple_command(IPCMSG_COLD_BOOT, 0);
+    u32 cmds[SCU_PARAM_LEN] = {IPCMSG_COLD_BOOT, 0};
+
+    ipc_dev_simple_cmd(ipc_dev, cmds, SCU_PARAM_LEN);
  }
    static unsigned long __init intel_mid_calibrate_tsc(void)
@@ -206,6 +212,8 @@ void __init x86_intel_mid_early_setup(void)
  x86_init.mpparse.find_smp_config = x86_init_noop;
  x86_init.mpparse.get_smp_config = x86_init_uint_noop;
  set_bit(MP_BUS_ISA,

Re: [RFC v3 1/7] platform/x86: intel_pmc_ipc: Use devm_* calls in driver probe function

2017-10-03 Thread sathyanarayanan kuppuswamy

Hi,


On 10/01/2017 07:34 AM, Andy Shevchenko wrote:

On Tue, Sep 5, 2017 at 8:37 AM,
 wrote:

From: Kuppuswamy Sathyanarayanan 

This patch cleans up unnecessary free/alloc calls in ipc_plat_probe(),
ipc_pci_probe() and ipc_plat_get_res() functions by using devm_*
calls.

This patch also adds proper error handling for failure cases in
ipc_pci_probe() function.


Pushed (this patch only) to my review queue with slight style changes, thanks.

Thanks Andy. Will rebase my rest of the patches on top of this patch.



Signed-off-by: Kuppuswamy Sathyanarayanan 

---
  drivers/platform/x86/intel_pmc_ipc.c | 104 ---
  1 file changed, 34 insertions(+), 70 deletions(-)

Changes since v2:
  * Used pcim_* device managed functions.

Changes since v1:
  * Merged devm_* related changes into a single function.
  * Instead of removing free_irq, use devm_free_irq function.

diff --git a/drivers/platform/x86/intel_pmc_ipc.c 
b/drivers/platform/x86/intel_pmc_ipc.c
index bb792a5..5eef649 100644
--- a/drivers/platform/x86/intel_pmc_ipc.c
+++ b/drivers/platform/x86/intel_pmc_ipc.c
@@ -480,52 +480,39 @@ static irqreturn_t ioc(int irq, void *dev_id)

  static int ipc_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
  {
-   resource_size_t pci_resource;
 int ret;
-   int len;
+   struct intel_pmc_ipc_dev *pmc = &ipcdev;

-   ipcdev.dev = &pci_dev_get(pdev)->dev;
-   ipcdev.irq_mode = IPC_TRIGGER_MODE_IRQ;
+   /* Only one PMC is supported */
+   if (pmc->dev)
+   return -EBUSY;

-   ret = pci_enable_device(pdev);
+   pmc->irq_mode = IPC_TRIGGER_MODE_IRQ;
+
+   ret = pcim_enable_device(pdev);
 if (ret)
 return ret;

-   ret = pci_request_regions(pdev, "intel_pmc_ipc");
+   ret = pcim_iomap_regions(pdev, 1 << 0, pci_name(pdev));
 if (ret)
 return ret;

-   pci_resource = pci_resource_start(pdev, 0);
-   len = pci_resource_len(pdev, 0);
-   if (!pci_resource || !len) {
-   dev_err(&pdev->dev, "Failed to get resource\n");
-   return -ENOMEM;
-   }
+   init_completion(&pmc->cmd_complete);

-   init_completion(&ipcdev.cmd_complete);
+   pmc->ipc_base =  pcim_iomap_table(pdev)[0];

-   if (request_irq(pdev->irq, ioc, 0, "intel_pmc_ipc", &ipcdev)) {
+   ret = devm_request_irq(&pdev->dev, pdev->irq, ioc, 0, "intel_pmc_ipc",
+   pmc);
+   if (ret) {
 dev_err(&pdev->dev, "Failed to request irq\n");
-   return -EBUSY;
+   return ret;
 }

-   ipcdev.ipc_base = ioremap_nocache(pci_resource, len);
-   if (!ipcdev.ipc_base) {
-   dev_err(&pdev->dev, "Failed to ioremap ipc base\n");
-   free_irq(pdev->irq, &ipcdev);
-   ret = -ENOMEM;
-   }
+   pmc->dev = &pdev->dev;

-   return ret;
-}
+   pci_set_drvdata(pdev, pmc);

-static void ipc_pci_remove(struct pci_dev *pdev)
-{
-   free_irq(pdev->irq, &ipcdev);
-   pci_release_regions(pdev);
-   pci_dev_put(pdev);
-   iounmap(ipcdev.ipc_base);
-   ipcdev.dev = NULL;
+   return 0;
  }

  static const struct pci_device_id ipc_pci_ids[] = {
@@ -540,7 +527,6 @@ static struct pci_driver ipc_pci_driver = {
 .name = "intel_pmc_ipc",
 .id_table = ipc_pci_ids,
 .probe = ipc_pci_probe,
-   .remove = ipc_pci_remove,
  };

  static ssize_t intel_pmc_ipc_simple_cmd_store(struct device *dev,
@@ -849,18 +835,16 @@ static int ipc_plat_get_res(struct platform_device *pdev)
 dev_err(&pdev->dev, "Failed to get ipc resource\n");
 return -ENXIO;
 }
-   size = PLAT_RESOURCE_IPC_SIZE + PLAT_RESOURCE_GCR_SIZE;
+   res->end = (res->start + PLAT_RESOURCE_IPC_SIZE +
+   PLAT_RESOURCE_GCR_SIZE - 1);

-   if (!request_mem_region(res->start, size, pdev->name)) {
-   dev_err(&pdev->dev, "Failed to request ipc resource\n");
-   return -EBUSY;
-   }
-   addr = ioremap_nocache(res->start, size);
-   if (!addr) {
-   dev_err(&pdev->dev, "I/O memory remapping failed\n");
-   release_mem_region(res->start, size);
-   return -ENOMEM;
+   addr = devm_ioremap_resource(&pdev->dev, res);
+   if (IS_ERR(addr)) {
+   dev_err(&pdev->dev,
+   "PMC I/O memory remapping failed\n");
+   return PTR_ERR(addr);
 }
+
 ipcdev.ipc_base = addr;

 ipcdev.gcr_mem_base = addr + PLAT_RESOURCE_GCR_OFFSET;
@@ -917,7 +901,6 @@ MODULE_DEVICE_TABLE(acpi, ipc_acpi_ids);

  static int ipc_plat_probe(struct platform_device *pdev)
  {
-   struct resource *res;
 int ret;

 ipcdev.dev = &pdev->dev;
@@ -939,61 +922,42 @@ static int ipc_plat_probe(struct platform_device *pdev)
 ret = ipc_create_pmc_devices();
 if (ret) {
 

Re: [RFC v3 2/7] platform/x86: intel_pmc_ipc: Use MFD framework to create dependent devices

2017-10-03 Thread sathyanarayanan kuppuswamy

Hi Andy,


On 10/01/2017 07:44 AM, Andy Shevchenko wrote:

On Tue, Sep 5, 2017 at 8:37 AM,
 wrote:

From: Kuppuswamy Sathyanarayanan 

Currently, we have lot of repetitive code in dependent device resource
allocation and device creation handling code. This logic can be improved if
we use MFD framework for dependent device creation. This patch adds this
support.

+   punit_cell.id = -1;

I will remove this line in next version.

+   return devm_mfd_add_devices(&pdev->dev, PLATFORM_DEVID_AUTO,
+   &punit_cell, 1, NULL, 0, NULL);

IIRC you don't need to file cell ID in case of DEVID_AUTO.
I am planning to use DEVID_NONE here to match the current behavior. 
Unless you have some concerns?




--
Sathyanarayanan Kuppuswamy
Linux kernel developer



Re: [RFC 5/5] pm: remove kernel thread freezing

2017-10-03 Thread Bart Van Assche
On Wed, 2017-10-04 at 02:47 +0200, Luis R. Rodriguez wrote:
>   3) Lookup for kthreads which seem to generate IO -- address / review if
>  removal of the freezer API can be done somehow with a quescing. This
>  is currently for example being done on SCSI / md.
>  4) Only after all the above is done should we consider this patch or some
> form of it.

After having given this more thought, I think we should omit these last two
steps. Modifying the md driver such that it does not submit I/O requests while
processes are frozen requires either to use the freezer API or to open-code it.
I think there is general agreement in the kernel community that open-coding a
single mechanism in multiple drivers is wrong. Does this mean that instead of
removing the freezer API we should keep it and review all its users instead?

Bart.


Re: d82fed7529 ("locking/lockdep/selftests: Fix mixed read-write .."): BUG: -1 unexpected failures (out of 262) - debugging disabled! |

2017-10-03 Thread Fengguang Wu

On Wed, Aug 30, 2017 at 09:41:59AM +0200, Peter Zijlstra wrote:

On Wed, Aug 30, 2017 at 08:29:47AM +0200, Peter Zijlstra wrote:

On Wed, Aug 30, 2017 at 11:37:21AM +0800, kernel test robot wrote:

> [0.004000] 
-
> [0.004000] BUG:  -1 unexpected failures (out of 262) - debugging 
disabled! |
> [0.004000] 
-

lol.. however did that happen.. /me goes look.


Ah, I think this should cure..


It works!

Tested-by: Fengguang Wu 


---
diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index cd0b5c964bd0..2b827b8a1d8c 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -2031,11 +2031,13 @@ void locking_selftest(void)
print_testname("mixed read-lock/lock-write ABBA");
pr_cont(" |");
dotest(rlock_ABBA1, FAILURE, LOCKTYPE_RWLOCK);
+#ifdef CONFIG_PROVE_LOCKING
/*
 * Lockdep does indeed fail here, but there's nothing we can do about
 * that now.  Don't kill lockdep for it.
 */
unexpected_testcase_failures--;
+#endif

pr_cont(" |");
dotest(rwsem_ABBA1, FAILURE, LOCKTYPE_RWSEM);


Re: [PATCH 02/18] tracing/filter: use ARRAY_SIZE

2017-10-03 Thread Steven Rostedt
On Sun,  1 Oct 2017 15:30:40 -0400
Jérémy Lefaure  wrote:

> It is useless to re-invent the ARRAY_SIZE macro so let's use it instead
> of DATA_CNT.
> 
> Found with Coccinelle with the following semantic patch:
> @r depends on (org || report)@
> type T;
> T[] E;
> position p;
> @@
> (
>  (sizeof(E)@p /sizeof(*E))
> |
>  (sizeof(E)@p /sizeof(E[...]))
> |
>  (sizeof(E)@p /sizeof(T))
> )
> 
> Signed-off-by: Jérémy Lefaure 
> ---
>  kernel/trace/trace_events_filter.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/trace/trace_events_filter.c 
> b/kernel/trace/trace_events_filter.c
> index 61e7f0678d33..02d0f378dc5c 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "trace.h"
>  #include "trace_output.h"
> @@ -2376,8 +2377,6 @@ static struct test_filter_data_t {
>  #undef YES
>  #undef NO
>  
> -#define DATA_CNT (sizeof(test_filter_data)/sizeof(struct test_filter_data_t))

Just change DATA_CNT to be:

 #define DATA_CNT ARRAY_SIZE(test_filter_data)

and don't change the rest.

-- Steve

> -
>  static int test_pred_visited;
>  
>  static int test_pred_visited_fn(struct filter_pred *pred, void *event)
> @@ -2417,7 +2416,7 @@ static __init int ftrace_test_event_filter(void)
>  
>   printk(KERN_INFO "Testing ftrace filter: ");
>  
> - for (i = 0; i < DATA_CNT; i++) {
> + for (i = 0; i < ARRAY_SIZE(test_filter_data); i++) {
>   struct event_filter *filter = NULL;
>   struct test_filter_data_t *d = &test_filter_data[i];
>   int err;
> @@ -2463,7 +2462,7 @@ static __init int ftrace_test_event_filter(void)
>   }
>   }
>  
> - if (i == DATA_CNT)
> + if (i == ARRAY_SIZE(test_filter_data))
>   printk(KERN_CONT "OK\n");
>  
>   return 0;



Re: [RFC v3 0/7] PMC/PUNIT IPC driver cleanup

2017-10-03 Thread sathyanarayanan kuppuswamy

Hi Andy,

Thanks for the review. I really appreciate you taking the time to review 
such big chunk of code changes.



On 10/01/2017 07:46 AM, Andy Shevchenko wrote:

On Tue, Sep 5, 2017 at 8:37 AM,
 wrote:

From: Kuppuswamy Sathyanarayanan 

Hi All,

Currently intel_pmc_ipc.c, intel_punit_ipc.c, intel_scu_ipc.c drivers 
implements the same IPC features.
This code duplication could be avoided if we implement the IPC driver as a 
generic library and let custom
device drivers use API provided by generic driver. This patchset mainly 
addresses this issue.

Along with above code duplication issue, This patchset also addresses following 
issues in intel_pmc_ipc and
intel_punit_ipc drivers.

1. Intel_pmc_ipc.c driver does not use any resource managed (devm_*) calls.
2. In Intel_pmc_ipc.c driver, dependent devices like PUNIT, Telemetry and iTCO 
are created manually and uses lot of redundant buffer code.
3. Global variable is used to store the IPC device structure and it is used 
across all functions in intel_pmc_ipc.c and intel_punit_ipc.c.

More info on Intel IPC device library:
-

A generic Intel IPC class driver has been implemented and all common IPC helper 
functions has been moved to this driver. It exposes APIs to create IPC device 
channel, send raw IPC command and simple IPC commands. It also creates device 
attribute to send IPC command from user space.

API for creating a new IPC channel device is,

struct intel_ipc_dev *devm_intel_ipc_dev_create(struct device *dev, const char 
*devname, struct intel_ipc_dev_cfg *cfg, struct intel_ipc_dev_ops *ops)

The IPC channel drivers (PUNIT/PMC/SCU) when creating a new device can 
configure their device params like register mapping, irq, irq-mode, channel 
type,etc  using intel_ipc_dev_cfg and intel_ipc_dev_ops arguments. After a new 
IPC channel device is created, IPC users can use the generic APIs to make IPC 
calls.

For example, after using this new model, IPC call to PMC device will look like,

pmc_ipc_dev = intel_ipc_dev_get(INTEL_PMC_IPC_DEV);
ipc_dev_raw_cmd(pmc_ipc_dev, cmd, PMC_PARAM_LEN, (u32 *)ipc_in, 1, NULL, 0, 0, 
0);

I am still testing the driver in different products. But posted it to get some 
early comments. I also welcome any PMC/PUNIT driver users to check these 
patches in their product.

I have applied first patch from the series with some small changes.
Please rebase the rest on top of my review branch (review-andy).

Will do. I will soon send v4 version.

Before sending new version, address comments that others and me did.

Yes.




--
Sathyanarayanan Kuppuswamy
Linux kernel developer



Re: [RFC v3 4/7] platform: x86: Add generic Intel IPC driver

2017-10-03 Thread sathyanarayanan kuppuswamy

Hi,


On 10/01/2017 07:59 AM, Andy Shevchenko wrote:

On Tue, Sep 5, 2017 at 8:37 AM,
 wrote:

From: Kuppuswamy Sathyanarayanan 

Currently intel_scu_ipc.c, intel_pmc_ipc.c and intel_punit_ipc.c
redundantly implements the same IPC features and has lot of code
duplication between them. This driver addresses this issue by grouping
the common IPC functionalities under the same driver.
+static const char *ipc_dev_err_string(struct intel_ipc_dev *ipc_dev,
+   int error)
+{
+   switch (error) {
+   case IPC_DEV_ERR_NONE:
+   return "No error";
+   case IPC_DEV_ERR_CMD_NOT_SUPPORTED:
+   return "Command not-supported/Invalid";
+   case IPC_DEV_ERR_CMD_NOT_SERVICED:
+   return "Command not-serviced/Invalid param";
+   case IPC_DEV_ERR_UNABLE_TO_SERVICE:
+   return "Unable-to-service/Cmd-timeout";
+   case IPC_DEV_ERR_CMD_INVALID:
+   return "Command-invalid/Cmd-locked";
+   case IPC_DEV_ERR_CMD_FAILED:
+   return "Command-failed/Invalid-VR-id";
+   case IPC_DEV_ERR_EMSECURITY:
+   return "Invalid Battery/VR-Error";
+   case IPC_DEV_ERR_UNSIGNEDKERNEL:
+   return "Unsigned kernel";
+   default:
+   return "Unknown Command";
+   };
+}

Since it's continuous list you can define an array of messages like

const char * const *errors = {
  [..._ERR_...] = "",
  ...

};

Also you can use enum in the header and define _ERR_MAX there.
Thus, code would be transformed to

if (error < _ERR_MAX)
  return errors[error];

return "Unknown Command";

Thanks will fix it in next version.




--
Sathyanarayanan Kuppuswamy
Linux kernel developer



Re: [PATCH net] RDS: IB: Initialize max_items based on underlying device attributes

2017-10-03 Thread Santosh Shilimkar

Hi Avinash,

On 10/3/2017 5:50 PM, Avinash Repaka wrote:

Use max_1m_mrs/max_8k_mrs while setting max_items, as the former
variables are set based on the underlying device attricutes.


s/attricutes/attributes


Signed-off-by: Avinash Repaka 
---
  net/rds/ib_rdma.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)


This patch is fine but it will be nice if you can collect
these changes in series as you are trying to update the
FRWR support. Like this patch and other cleanup patch
you posted yesterday.

With that log fixed,
Acked-by: Santosh Shilimkar 



[PATCH v2 5/6] f2fs: trace f2fs_remove_discard

2017-10-03 Thread Chao Yu
From: Chao Yu 

This patch adds tracepoint to trace f2fs_remove_discard.

Signed-off-by: Chao Yu 
---
v2:
- rebase on last codes of Jaegeuk's dev-test branch.
 fs/f2fs/segment.c   | 2 ++
 include/trace/events/f2fs.h | 7 +++
 2 files changed, 9 insertions(+)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index ba433828c170..4a108321233d 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -822,6 +822,8 @@ static void __remove_discard_cmd(struct f2fs_sb_info *sbi,
 {
struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
 
+   trace_f2fs_remove_discard(dc->bdev, dc->start, dc->len);
+
f2fs_bug_on(sbi, dc->ref);
 
if (dc->error == -EOPNOTSUPP)
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index 5d216f7fb05a..0e7a31694ff5 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -1286,6 +1286,13 @@ DEFINE_EVENT(f2fs_discard, f2fs_issue_discard,
TP_ARGS(dev, blkstart, blklen)
 );
 
+DEFINE_EVENT(f2fs_discard, f2fs_remove_discard,
+
+   TP_PROTO(struct block_device *dev, block_t blkstart, block_t blklen),
+
+   TP_ARGS(dev, blkstart, blklen)
+);
+
 TRACE_EVENT(f2fs_issue_reset_zone,
 
TP_PROTO(struct block_device *dev, block_t blkstart),
-- 
2.14.1.145.gb3622a4ee



[PATCH v2 4/6] f2fs: reduce cmd_lock coverage in __issue_discard_cmd

2017-10-03 Thread Chao Yu
From: Chao Yu 

__submit_discard_cmd may lead long latency due to exhaustion of I/O
request resource in block layer, so issuing all discard under cmd_lock
may lead to hangtask, in order to avoid that, let's reduce it's coverage.

Signed-off-by: Chao Yu 
---
v2:
- rebase on last codes of Jaegeuk's dev-test branch.
 fs/f2fs/segment.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index e8fcc1eb4e0a..ba433828c170 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1158,14 +1158,14 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi,
int i, iter = 0, issued = 0;
bool io_interrupted = false;
 
-   mutex_lock(&dcc->cmd_lock);
-   f2fs_bug_on(sbi,
-   !__check_rb_tree_consistence(sbi, &dcc->root));
-   blk_start_plug(&plug);
for (i = MAX_PLIST_NUM - 1; i >= 0; i--) {
if (i + 1 < dpolicy->granularity)
break;
pend_list = &dcc->pend_list[i];
+
+   mutex_lock(&dcc->cmd_lock);
+   f2fs_bug_on(sbi, !__check_rb_tree_consistence(sbi, &dcc->root));
+   blk_start_plug(&plug);
list_for_each_entry_safe(dc, tmp, pend_list, list) {
f2fs_bug_on(sbi, dc->state != D_PREP);
 
@@ -1179,12 +1179,14 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi,
issued++;
 skip:
if (++iter >= dpolicy->max_requests)
-   goto out;
+   break;
}
+   blk_finish_plug(&plug);
+   mutex_unlock(&dcc->cmd_lock);
+
+   if (iter >= dpolicy->max_requests)
+   break;
}
-out:
-   blk_finish_plug(&plug);
-   mutex_unlock(&dcc->cmd_lock);
 
if (!issued && io_interrupted)
issued = -1;
-- 
2.14.1.145.gb3622a4ee



[PATCH v2 6/6] f2fs: give up CP_TRIMMED_FLAG if it drops discards

2017-10-03 Thread Chao Yu
From: Chao Yu 

In ->umount, once we drop remained discard entries, we should not
set CP_TRIMMED_FLAG with another checkpoint.

Signed-off-by: Chao Yu 
---
v2:
- rebase on last codes of Jaegeuk's dev-test branch.
 fs/f2fs/f2fs.h|  2 +-
 fs/f2fs/segment.c | 15 +++
 fs/f2fs/super.c   |  5 +++--
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index f274805e231d..c85f49c41003 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2565,7 +2565,7 @@ void init_discard_policy(struct discard_policy *dpolicy, 
int discard_type,
unsigned int granularity);
 void refresh_sit_entry(struct f2fs_sb_info *sbi, block_t old, block_t new);
 void stop_discard_thread(struct f2fs_sb_info *sbi);
-void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi);
+bool f2fs_wait_discard_bios(struct f2fs_sb_info *sbi);
 void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc);
 void release_discard_addrs(struct f2fs_sb_info *sbi);
 int npages_for_summary_flush(struct f2fs_sb_info *sbi, bool for_ra);
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 4a108321233d..bfbcff8339c5 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1196,12 +1196,13 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi,
return issued;
 }
 
-static void __drop_discard_cmd(struct f2fs_sb_info *sbi)
+static bool __drop_discard_cmd(struct f2fs_sb_info *sbi)
 {
struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
struct list_head *pend_list;
struct discard_cmd *dc, *tmp;
int i;
+   bool dropped = false;
 
mutex_lock(&dcc->cmd_lock);
for (i = MAX_PLIST_NUM - 1; i >= 0; i--) {
@@ -1209,9 +1210,12 @@ static void __drop_discard_cmd(struct f2fs_sb_info *sbi)
list_for_each_entry_safe(dc, tmp, pend_list, list) {
f2fs_bug_on(sbi, dc->state != D_PREP);
__remove_discard_cmd(sbi, dc);
+   dropped = true;
}
}
mutex_unlock(&dcc->cmd_lock);
+
+   return dropped;
 }
 
 static void __wait_one_discard_bio(struct f2fs_sb_info *sbi,
@@ -1306,15 +1310,18 @@ void stop_discard_thread(struct f2fs_sb_info *sbi)
 }
 
 /* This comes from f2fs_put_super */
-void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi)
+bool f2fs_wait_discard_bios(struct f2fs_sb_info *sbi)
 {
struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
struct discard_policy dpolicy;
+   bool dropped;
 
init_discard_policy(&dpolicy, DPOLICY_UMOUNT, dcc->discard_granularity);
__issue_discard_cmd(sbi, &dpolicy);
-   __drop_discard_cmd(sbi);
+   dropped = __drop_discard_cmd(sbi);
__wait_all_discard_cmd(sbi, &dpolicy);
+
+   return dropped;
 }
 
 static int issue_discard_thread(void *data)
@@ -1659,7 +1666,7 @@ void init_discard_policy(struct discard_policy *dpolicy,
dpolicy->max_interval = DEF_MAX_DISCARD_ISSUE_TIME;
dpolicy->max_requests = DEF_MAX_DISCARD_REQUEST;
dpolicy->io_aware_gran = MAX_PLIST_NUM;
-   dpolicy->io_aware = false;
+   dpolicy->io_aware = true;
} else if (discard_type == DPOLICY_FSTRIM) {
dpolicy->max_requests = DEF_MAX_DISCARD_REQUEST;
dpolicy->io_aware_gran = MAX_PLIST_NUM;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index a13269d1a1f0..1d68c18a487b 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -807,6 +807,7 @@ static void f2fs_put_super(struct super_block *sb)
 {
struct f2fs_sb_info *sbi = F2FS_SB(sb);
int i;
+   bool dropped;
 
f2fs_quota_off_umount(sb);
 
@@ -827,9 +828,9 @@ static void f2fs_put_super(struct super_block *sb)
}
 
/* be sure to wait for any on-going discard commands */
-   f2fs_wait_discard_bios(sbi, true);
+   dropped = f2fs_wait_discard_bios(sbi);
 
-   if (f2fs_discard_en(sbi) && !sbi->discard_blks) {
+   if (f2fs_discard_en(sbi) && !sbi->discard_blks && !dropped) {
struct cp_control cpc = {
.reason = CP_UMOUNT | CP_TRIMMED,
};
-- 
2.14.1.145.gb3622a4ee



[PATCH v2 2/6] f2fs: wrap discard policy

2017-10-03 Thread Chao Yu
From: Chao Yu 

This patch wraps scattered optional parameters into discard policy as
below, later, with it we expect that we can adjust these parameters with
proper strategy in different scenario.

struct discard_policy {
unsigned int min_interval;  /* used for candidates exist */
unsigned int max_interval;  /* used for candidates not exist */
unsigned int max_requests;  /* # of discards issued per round */
unsigned int io_aware_gran; /* minimum granularity discard not be 
aware of I/O */
bool io_aware;  /* issue discard in idle time */
bool sync;  /* submit discard with REQ_SYNC flag */
};

This patch doesn't change any logic of codes.

Signed-off-by: Chao Yu 
---
v2:
- rebase on last codes of Jaegeuk's dev-test branch.
 fs/f2fs/f2fs.h| 12 +++-
 fs/f2fs/segment.c | 38 +-
 2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 6d51e8342513..67dd6a317384 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -149,7 +149,7 @@ enum {
 #define BATCHED_TRIM_BLOCKS(sbi)   \
(BATCHED_TRIM_SEGMENTS(sbi) << (sbi)->log_blocks_per_seg)
 #define MAX_DISCARD_BLOCKS(sbi)BLKS_PER_SEC(sbi)
-#define DISCARD_ISSUE_RATE 8
+#define DEF_MAX_DISCARD_REQUEST8   /* issue 8 discards per 
round */
 #define DEF_MIN_DISCARD_ISSUE_TIME 50  /* 50 ms, if exists */
 #define DEF_MAX_DISCARD_ISSUE_TIME 6   /* 60 s, if no candidates */
 #define DEF_CP_INTERVAL60  /* 60 secs */
@@ -245,6 +245,15 @@ struct discard_cmd {
int error;  /* bio error */
 };
 
+struct discard_policy {
+   unsigned int min_interval;  /* used for candidates exist */
+   unsigned int max_interval;  /* used for candidates not exist */
+   unsigned int max_requests;  /* # of discards issued per round */
+   unsigned int io_aware_gran; /* minimum granularity discard not be 
aware of I/O */
+   bool io_aware;  /* issue discard in idle time */
+   bool sync;  /* submit discard with REQ_SYNC flag */
+};
+
 struct discard_cmd_control {
struct task_struct *f2fs_issue_discard; /* discard thread */
struct list_head entry_list;/* 4KB discard entry list */
@@ -263,6 +272,7 @@ struct discard_cmd_control {
atomic_t issing_discard;/* # of issing discard */
atomic_t discard_cmd_cnt;   /* # of cached cmd count */
struct rb_root root;/* root of discard rb-tree */
+   struct discard_policy dpolicy;  /* current discard policy */
 };
 
 /* for the list of fsync inodes, used only during recovery */
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index a0a0a887fc31..5853187230e7 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -879,6 +879,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info *sbi,
struct list_head *wait_list = fstrim ? &(dcc->fstrim_list) :
&(dcc->wait_list);
struct bio *bio = NULL;
+   int flag = dcc->dpolicy.sync ? REQ_SYNC : 0;
 
if (dc->state != D_PREP)
return;
@@ -897,7 +898,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info *sbi,
if (bio) {
bio->bi_private = dc;
bio->bi_end_io = f2fs_submit_discard_endio;
-   bio->bi_opf |= REQ_SYNC;
+   bio->bi_opf |= flag;
submit_bio(bio);
list_move_tail(&dc->list, wait_list);
__check_sit_bitmap(sbi, dc->start, dc->start + dc->len);
@@ -1092,6 +1093,7 @@ static void __issue_discard_cmd_range(struct f2fs_sb_info 
*sbi,
struct discard_cmd *prev_dc = NULL, *next_dc = NULL;
struct rb_node **insert_p = NULL, *insert_parent = NULL;
struct discard_cmd *dc;
+   struct discard_policy *dpolicy = &dcc->dpolicy;
struct blk_plug plug;
int issued;
 
@@ -1124,7 +1126,7 @@ static void __issue_discard_cmd_range(struct f2fs_sb_info 
*sbi,
 
__submit_discard_cmd(sbi, dc, true);
 
-   if (++issued >= DISCARD_ISSUE_RATE) {
+   if (++issued >= dpolicy->max_requests) {
start = dc->lstart + dc->len;
 
blk_finish_plug(&plug);
@@ -1152,6 +1154,7 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi, 
bool issue_cond)
struct list_head *pend_list;
struct discard_cmd *dc, *tmp;
struct blk_plug plug;
+   struct discard_policy *dpolicy = &dcc->dpolicy;
int iter = 0, issued = 0;
int i;
bool io_interrupted = false;
@@ -1179,14 +1182,16 @@ static int __issue_discard_cmd(struct

[PATCH v2 3/6] f2fs: split discard policy

2017-10-03 Thread Chao Yu
From: Chao Yu 

There are many different scenarios such as fstrim, umount, urgent or
background where we will issue discards, actually, they need use
different policy in aspect of io aware, discard granularity, delay
interval and so on. But now they just share one common discard policy,
so there will be race when changing policy in between these scenarios,
the interference of changing discard policy will be very serious.

This patch changes to split discard policy for different scenarios.

Signed-off-by: Chao Yu 
---
v2:
- rebase on last codes of Jaegeuk's dev-test branch.
 fs/f2fs/f2fs.h|  17 +--
 fs/f2fs/segment.c | 149 ++
 fs/f2fs/segment.h |   5 +-
 fs/f2fs/sysfs.c   |  13 -
 4 files changed, 88 insertions(+), 96 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 67dd6a317384..f274805e231d 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -210,10 +210,6 @@ struct discard_entry {
 #define plist_idx(blk_num) ((blk_num) >= MAX_PLIST_NUM ?   \
(MAX_PLIST_NUM - 1) : (blk_num - 1))
 
-#define P_ACTIVE   0x01
-#define P_TRIM 0x02
-#define plist_issue(tag)   (((tag) & P_ACTIVE) || ((tag) & P_TRIM))
-
 enum {
D_PREP,
D_SUBMIT,
@@ -245,13 +241,23 @@ struct discard_cmd {
int error;  /* bio error */
 };
 
+enum {
+   DPOLICY_BG,
+   DPOLICY_FORCE,
+   DPOLICY_FSTRIM,
+   DPOLICY_UMOUNT,
+   MAX_DPOLICY,
+};
+
 struct discard_policy {
+   int type;   /* type of discard */
unsigned int min_interval;  /* used for candidates exist */
unsigned int max_interval;  /* used for candidates not exist */
unsigned int max_requests;  /* # of discards issued per round */
unsigned int io_aware_gran; /* minimum granularity discard not be 
aware of I/O */
bool io_aware;  /* issue discard in idle time */
bool sync;  /* submit discard with REQ_SYNC flag */
+   unsigned int granularity;   /* discard granularity */
 };
 
 struct discard_cmd_control {
@@ -272,7 +278,6 @@ struct discard_cmd_control {
atomic_t issing_discard;/* # of issing discard */
atomic_t discard_cmd_cnt;   /* # of cached cmd count */
struct rb_root root;/* root of discard rb-tree */
-   struct discard_policy dpolicy;  /* current discard policy */
 };
 
 /* for the list of fsync inodes, used only during recovery */
@@ -2556,6 +2561,8 @@ int f2fs_flush_device_cache(struct f2fs_sb_info *sbi);
 void destroy_flush_cmd_control(struct f2fs_sb_info *sbi, bool free);
 void invalidate_blocks(struct f2fs_sb_info *sbi, block_t addr);
 bool is_checkpointed_data(struct f2fs_sb_info *sbi, block_t blkaddr);
+void init_discard_policy(struct discard_policy *dpolicy, int discard_type,
+   unsigned int granularity);
 void refresh_sit_entry(struct f2fs_sb_info *sbi, block_t old, block_t new);
 void stop_discard_thread(struct f2fs_sb_info *sbi);
 void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi);
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 5853187230e7..e8fcc1eb4e0a 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -873,13 +873,14 @@ void __check_sit_bitmap(struct f2fs_sb_info *sbi,
 
 /* this function is copied from blkdev_issue_discard from block/blk-lib.c */
 static void __submit_discard_cmd(struct f2fs_sb_info *sbi,
-   struct discard_cmd *dc, bool fstrim)
+   struct discard_policy *dpolicy,
+   struct discard_cmd *dc)
 {
struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
-   struct list_head *wait_list = fstrim ? &(dcc->fstrim_list) :
-   &(dcc->wait_list);
+   struct list_head *wait_list = (dpolicy->type == DPOLICY_FSTRIM) ?
+   &(dcc->fstrim_list) : &(dcc->wait_list);
struct bio *bio = NULL;
-   int flag = dcc->dpolicy.sync ? REQ_SYNC : 0;
+   int flag = dpolicy->sync ? REQ_SYNC : 0;
 
if (dc->state != D_PREP)
return;
@@ -1086,14 +1087,13 @@ static int __queue_discard_cmd(struct f2fs_sb_info *sbi,
 }
 
 static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi,
-   unsigned int start, unsigned int end,
-   unsigned int granularity)
+   struct discard_policy *dpolicy,
+   unsigned int start, unsigned int end)
 {
struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
struct discard_cmd *prev_dc = NULL, *next_dc = NULL;
struct rb_node **insert_p = NULL, *insert_parent 

[PATCH v2 1/6] f2fs: support issuing/waiting discard in range

2017-10-03 Thread Chao Yu
From: Chao Yu 

Fstrim intends to trim invalid blocks of filesystem only with specified
range and granularity, but actually, it will issue all previous cached
discard commands which may be out-of-range and be with unmatched
granularity, it's unneeded.

In order to fix above issues, this patch introduces new helps to support
to issue and wait discard in range and adds a new fstrim_list for tracking
in-flight discard from ->fstrim.

Signed-off-by: Chao Yu 
---
v2:
- rebase on last codes of Jaegeuk's dev-test branch.
 fs/f2fs/f2fs.h|   3 +-
 fs/f2fs/segment.c | 127 +-
 2 files changed, 108 insertions(+), 22 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index a3fdcdade50b..6d51e8342513 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -251,6 +251,7 @@ struct discard_cmd_control {
struct list_head pend_list[MAX_PLIST_NUM];/* store pending entries */
unsigned char pend_list_tag[MAX_PLIST_NUM];/* tag for pending entries */
struct list_head wait_list; /* store on-flushing entries */
+   struct list_head fstrim_list;   /* in-flight discard from 
fstrim */
wait_queue_head_t discard_wait_queue;   /* waiting queue for wake-up */
unsigned int discard_wake;  /* to wake up discard thread */
struct mutex cmd_lock;
@@ -2547,7 +2548,7 @@ void invalidate_blocks(struct f2fs_sb_info *sbi, block_t 
addr);
 bool is_checkpointed_data(struct f2fs_sb_info *sbi, block_t blkaddr);
 void refresh_sit_entry(struct f2fs_sb_info *sbi, block_t old, block_t new);
 void stop_discard_thread(struct f2fs_sb_info *sbi);
-void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi, bool umount);
+void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi);
 void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc);
 void release_discard_addrs(struct f2fs_sb_info *sbi);
 int npages_for_summary_flush(struct f2fs_sb_info *sbi, bool for_ra);
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index d5bad5aedded..a0a0a887fc31 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -873,9 +873,11 @@ void __check_sit_bitmap(struct f2fs_sb_info *sbi,
 
 /* this function is copied from blkdev_issue_discard from block/blk-lib.c */
 static void __submit_discard_cmd(struct f2fs_sb_info *sbi,
-   struct discard_cmd *dc)
+   struct discard_cmd *dc, bool fstrim)
 {
struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
+   struct list_head *wait_list = fstrim ? &(dcc->fstrim_list) :
+   &(dcc->wait_list);
struct bio *bio = NULL;
 
if (dc->state != D_PREP)
@@ -897,7 +899,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info *sbi,
bio->bi_end_io = f2fs_submit_discard_endio;
bio->bi_opf |= REQ_SYNC;
submit_bio(bio);
-   list_move_tail(&dc->list, &dcc->wait_list);
+   list_move_tail(&dc->list, wait_list);
__check_sit_bitmap(sbi, dc->start, dc->start + dc->len);
 
f2fs_update_iostat(sbi, FS_DISCARD, 1);
@@ -1082,6 +1084,68 @@ static int __queue_discard_cmd(struct f2fs_sb_info *sbi,
return 0;
 }
 
+static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi,
+   unsigned int start, unsigned int end,
+   unsigned int granularity)
+{
+   struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
+   struct discard_cmd *prev_dc = NULL, *next_dc = NULL;
+   struct rb_node **insert_p = NULL, *insert_parent = NULL;
+   struct discard_cmd *dc;
+   struct blk_plug plug;
+   int issued;
+
+next:
+   issued = 0;
+
+   mutex_lock(&dcc->cmd_lock);
+   f2fs_bug_on(sbi, !__check_rb_tree_consistence(sbi, &dcc->root));
+
+   dc = (struct discard_cmd *)__lookup_rb_tree_ret(&dcc->root,
+   NULL, start,
+   (struct rb_entry **)&prev_dc,
+   (struct rb_entry **)&next_dc,
+   &insert_p, &insert_parent, true);
+   if (!dc)
+   dc = next_dc;
+
+   blk_start_plug(&plug);
+
+   while (dc && dc->lstart <= end) {
+   struct rb_node *node;
+
+   if (dc->len < granularity)
+   goto skip;
+
+   if (dc->state != D_PREP) {
+   list_move_tail(&dc->list, &dcc->fstrim_list);
+   goto skip;
+   }
+
+   __submit_discard_cmd(sbi, dc, true);
+
+   if (++issued >= DISCARD_ISSUE_RATE) {
+   start = dc->lstart + dc->len;
+
+   blk_finish_plug(&plug);
+   mutex_unlock(&dcc->cmd_lock);

[PATCH] objtool: Upgrade libelf-devel warning to error for CONFIG_ORC_UNWINDER

2017-10-03 Thread Josh Poimboeuf
With CONFIG_ORC_UNWINDER, if the user doesn't have libelf-devel
installed, and they don't see the make warning, their ORC unwinder will
be silently broken.  Upgrade the warning to an error.

Reported-and-tested-by: Borislav Petkov 
Signed-off-by: Josh Poimboeuf 
---
 Makefile | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index cf007a31d575..5445cbf4d9a0 100644
--- a/Makefile
+++ b/Makefile
@@ -933,7 +933,11 @@ ifdef CONFIG_STACK_VALIDATION
   ifeq ($(has_libelf),1)
 objtool_target := tools/objtool FORCE
   else
-$(warning "Cannot use CONFIG_STACK_VALIDATION, please install libelf-dev, 
libelf-devel or elfutils-libelf-devel")
+ifdef CONFIG_ORC_UNWINDER
+  $(error "Cannot generate ORC metadata for CONFIG_ORC_UNWINDER, please 
install libelf-dev, libelf-devel or elfutils-libelf-devel")
+else
+  $(warning "Cannot use CONFIG_STACK_VALIDATION, please install 
libelf-dev, libelf-devel or elfutils-libelf-devel")
+endif
 SKIP_STACK_VALIDATION := 1
 export SKIP_STACK_VALIDATION
   endif
-- 
2.13.6



Re: [RFC v3 3/7] platform/x86: intel_pmc_ipc: Use regmap calls for GCR updates

2017-10-03 Thread sathyanarayanan kuppuswamy

Hi Andy,


On 10/01/2017 07:48 AM, Andy Shevchenko wrote:

On Tue, Sep 5, 2017 at 8:37 AM,
 wrote:

From: Kuppuswamy Sathyanarayanan 

Currently, update_no_reboot_bit() function implemented in this driver
uses mutex_lock to protect its register updates. But this function is
called with in atomic context in iTCO_wdt_start() and iTCO_wdt_stop()
functions in iTCO_wdt.c driver, which in turn causes "sleeping into
atomic context" issue. This patch fixes this issue by refactoring the
current GCR read/write/update functions with regmap APIs.

Since it sounds as candidate for stable,

Yes.

can we have split it to just
as less as possible intrusive fix + moving to regmap as a separate
change?

If we have to split it into two patches then,

Patch #1 will fix the "sleep in atomic context issue" by replacing 
mutex_lock() with spin_lock()

in GCR read/write APIs to protect the GCR memory updates.
Patch #2 will remove GCR read/write/update APIs and replace it with 
regmap APIs. But along with this
change we will also remove the spin_lock() added in previous patch 
because regmap calls are already

protected by its own locking mechanism.

Since Patch #2 will clean up what we do in Patch #1, Do we need to split 
it into two patches?



It should include Fixes: tag as well I suppose.

Agree. I will add Fixes tag in next version.



Signed-off-by: Kuppuswamy Sathyanarayanan 

---
  drivers/platform/x86/Kconfig |   1 +
  drivers/platform/x86/intel_pmc_ipc.c | 115 ---
  2 files changed, 40 insertions(+), 76 deletions(-)

diff --git a/drivers/platform/x86/Kconfig b/drivers/platform/x86/Kconfig
index 80b8795..45f4e79 100644
--- a/drivers/platform/x86/Kconfig
+++ b/drivers/platform/x86/Kconfig
@@ -1054,6 +1054,7 @@ config PVPANIC
  config INTEL_PMC_IPC
 tristate "Intel PMC IPC Driver"
 depends on ACPI
+   select REGMAP_MMIO
 ---help---
 This driver provides support for PMC control on some Intel platforms.
 The PMC is an ARC processor which defines IPC commands for 
communication
diff --git a/drivers/platform/x86/intel_pmc_ipc.c 
b/drivers/platform/x86/intel_pmc_ipc.c
index 021dcf6..40a25f8 100644
--- a/drivers/platform/x86/intel_pmc_ipc.c
+++ b/drivers/platform/x86/intel_pmc_ipc.c
@@ -31,9 +31,11 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
+#include 

  #include 

@@ -125,7 +127,7 @@ static struct intel_pmc_ipc_dev {

 /* gcr */
 void __iomem *gcr_mem_base;
-   bool has_gcr_regs;
+   struct regmap *gcr_regs;

 /* Telemetry */
 u8 telem_res_inval;
@@ -150,6 +152,14 @@ static char *ipc_err_sources[] = {
 "Unsigned kernel",
  };

+static struct regmap_config gcr_regmap_config = {
+.reg_bits = 32,
+.reg_stride = 4,
+.val_bits = 32,
+   .fast_io = true,
+   .max_register = PLAT_RESOURCE_GCR_SIZE,
+};
+
  /* Prevent concurrent calls to the PMC */
  static DEFINE_MUTEX(ipclock);

@@ -183,21 +193,6 @@ static inline u32 ipc_data_readl(u32 offset)
 return readl(ipcdev.ipc_base + IPC_READ_BUFFER + offset);
  }

-static inline u64 gcr_data_readq(u32 offset)
-{
-   return readq(ipcdev.gcr_mem_base + offset);
-}
-
-static inline int is_gcr_valid(u32 offset)
-{
-   if (!ipcdev.has_gcr_regs)
-   return -EACCES;
-
-   if (offset > PLAT_RESOURCE_GCR_SIZE)
-   return -EINVAL;
-
-   return 0;
-}

  /**
   * intel_pmc_gcr_read() - Read PMC GCR register
@@ -210,21 +205,10 @@ static inline int is_gcr_valid(u32 offset)
   */
  int intel_pmc_gcr_read(u32 offset, u32 *data)
  {
-   int ret;
-
-   mutex_lock(&ipclock);
-
-   ret = is_gcr_valid(offset);
-   if (ret < 0) {
-   mutex_unlock(&ipclock);
-   return ret;
-   }
-
-   *data = readl(ipcdev.gcr_mem_base + offset);
-
-   mutex_unlock(&ipclock);
+   if (!ipcdev.gcr_regs)
+   return -EACCES;

-   return 0;
+   return regmap_read(ipcdev.gcr_regs, offset, data);
  }
  EXPORT_SYMBOL_GPL(intel_pmc_gcr_read);

@@ -240,21 +224,10 @@ EXPORT_SYMBOL_GPL(intel_pmc_gcr_read);
   */
  int intel_pmc_gcr_write(u32 offset, u32 data)
  {
-   int ret;
-
-   mutex_lock(&ipclock);
-
-   ret = is_gcr_valid(offset);
-   if (ret < 0) {
-   mutex_unlock(&ipclock);
-   return ret;
-   }
-
-   writel(data, ipcdev.gcr_mem_base + offset);
-
-   mutex_unlock(&ipclock);
+   if (!ipcdev.gcr_regs)
+   return -EACCES;

-   return 0;
+   return regmap_write(ipcdev.gcr_regs, offset, data);
  }
  EXPORT_SYMBOL_GPL(intel_pmc_gcr_write);

@@ -271,33 +244,10 @@ EXPORT_SYMBOL_GPL(intel_pmc_gcr_write);
   */
  int intel_pmc_gcr_update(u32 offset, u32 mask, u32 val)
  {
-   u32 new_val;
-   int ret = 0;
-
-   mutex_lock(&ipclock);
-
-   ret = is_gcr_valid(offset);
-   if (ret < 0)
-   goto gcr_ipc_unlo

Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Jerome Glisse
On Tue, Oct 03, 2017 at 05:43:47PM -0700, Nadav Amit wrote:
> Jerome Glisse  wrote:
> 
> > On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> > 
> >> I'd like some more explanation about the inner working of "that new
> >> user" as per comment above.
> >> 
> >> It would be enough to drop mmu_notifier_invalidate_range from above
> >> without adding it to the filebacked case. The above gives higher prio
> >> to the hypothetical and uncertain future case, than to the current
> >> real filebacked case that doesn't need ->invalidate_range inside the
> >> PT lock, or do you see something that might already need such
> >> ->invalidate_range?
> > 
> > No i don't see any new user today that might need such invalidate but
> > i was trying to be extra cautious as i have a tendency to assume that
> > someone might do a patch that use try_to_unmap() without going through
> > all the comments in the function and thus possibly using it in a an
> > unexpected way from mmu_notifier callback point of view. I am fine
> > with putting the burden on new user to get it right and adding an
> > extra warning in the function description to try to warn people in a
> > sensible way.
> 
> I must be missing something. After the PTE is changed, but before the
> secondary TLB notification/invalidation - What prevents another thread from
> changing the mappings (e.g., using munmap/mmap), and setting a new page
> at that PTE?
> 
> Wouldn’t it end with the page being mapped without a secondary TLB flush in
> between?

munmap would call mmu_notifier to invalidate the range too so secondary
TLB would be properly flush before any new pte can be setup in for that
particular virtual address range. Unlike CPU TLB flush, secondary TLB
flush are un-conditional and thus current pte value does not play any
role.

Cheers,
Jérôme


Re: btusb "firmware request while host is not available" at resume

2017-10-03 Thread Luis R. Rodriguez
On Tue, Oct 3, 2017 at 5:20 PM, Luis R. Rodriguez  wrote:
> the ordering devised currently there is:
>
>   o device driver pm ops called
>   o notifier for suspend issued - *going to suspend*
>   o userspace frozen
>   o filesystem freeze
>
> On the way back up this order is inverted:
>
>   o filesystem freeze
>   o userspace frozen
>   o notifier for suspend issued - *going to suspend*
>   o device driver pm ops called

Fortunately I had it a tad bit wrong, but in a good way. Our ordering
on our way down is:

 o notifier for suspend issued - *going to suspend*
 o userspace frozen
 o filesystem freeze (new, being proposed)
 o device driver pm ops called

Then on our way up:

  o device driver pm ops called
  o filesystem thaw
  o userspace thaw
  o notifier for resume issued - *thawing*

So the driver callbacks get called *later*, so anything called in
notifiers do get a chance to quiesce things properly.

  Luis


Re: [f2fs-dev] [PATCH v2] f2fs: support flexible inline xattr size

2017-10-03 Thread Chao Yu
Hi Jaegeuk,

When I rebase on last dev-test branch, I found that there are splitted codes
from this patch which contains fixing codes for recovery flow. I guess you
have missed this v2 patch.

Could you merge and test this v2?

Thanks,

On 2017/9/6 21:59, Chao Yu wrote:
> From: Chao Yu 
> 
> Now, in product, more and more features based on file encryption were
> introduced, their demand of xattr space is increasing, however, inline
> xattr has fixed-size of 200 bytes, once inline xattr space is full, new
> increased xattr data would occupy additional xattr block which may bring
> us more space usage and performance regression during persisting.
> 
> In order to resolve above issue, it's better to expand inline xattr size
> flexibly according to user's requirement.
> 
> So this patch introduces new filesystem feature 'flexible inline xattr',
> and new mount option 'inline_xattr_size=%u', once mkfs enables the
> feature, we can use the option to make f2fs supporting flexible inline
> xattr size.
> 
> To support this feature, we add extra attribute i_inline_xattr_size in
> inode layout, indicating that how many space inline xattr borrows from
> block address mapping space in inode layout, by this, we can easily
> locate and store flexible-sized inline xattr data in inode.
> 
> Inode disk layout:
>   +--+
>   | .i_mode  |
>   | ...  |
>   | .i_ext   |
>   +--+
>   | .i_extra_isize   |
>   | .i_inline_xattr_size |---+
>   | ...  |   |
>   +--+   |
>   | .i_addr  |   |
>   |  - block address or  |   |
>   |  - inline data   |   |
>   +--+<---+  v
>   |inline xattr  |+---inline xattr range
>   +--+<---+
>   | .i_nid   |
>   +--+
>   |   node_footer|
>   | (nid, ino, offset)   |
>   +--+
> 
> Signed-off-by: Chao Yu 
> ---
> v2:
> - recover i_inline_xattr_size during SPOR.
>  fs/f2fs/f2fs.h  | 43 ++-
>  fs/f2fs/inode.c | 12 
>  fs/f2fs/namei.c |  6 ++
>  fs/f2fs/node.c  | 10 --
>  fs/f2fs/super.c | 32 +++-
>  fs/f2fs/sysfs.c |  7 +++
>  fs/f2fs/xattr.c | 18 +-
>  include/linux/f2fs_fs.h |  5 +++--
>  8 files changed, 106 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 0d4bf497d566..83d3ec57e303 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -93,6 +93,7 @@ extern char *fault_name[FAULT_MAX];
>  #define F2FS_MOUNT_GRPQUOTA  0x0010
>  #define F2FS_MOUNT_PRJQUOTA  0x0020
>  #define F2FS_MOUNT_QUOTA 0x0040
> +#define F2FS_MOUNT_INLINE_XATTR_SIZE 0x0080
>  
>  #define clear_opt(sbi, option)   ((sbi)->mount_opt.opt &= 
> ~F2FS_MOUNT_##option)
>  #define set_opt(sbi, option) ((sbi)->mount_opt.opt |= F2FS_MOUNT_##option)
> @@ -118,6 +119,7 @@ struct f2fs_mount_info {
>  #define F2FS_FEATURE_EXTRA_ATTR  0x0008
>  #define F2FS_FEATURE_PRJQUOTA0x0010
>  #define F2FS_FEATURE_INODE_CHKSUM0x0020
> +#define F2FS_FEATURE_FLEXIBLE_INLINE_XATTR   0x0040
>  
>  #define F2FS_HAS_FEATURE(sb, mask)   \
>   ((F2FS_SB(sb)->raw_super->feature & cpu_to_le32(mask)) != 0)
> @@ -379,11 +381,14 @@ struct f2fs_flush_device {
>  
>  /* for inline stuff */
>  #define DEF_INLINE_RESERVED_SIZE 1
> +#define DEF_MIN_INLINE_SIZE  1
>  static inline int get_extra_isize(struct inode *inode);
> -#define MAX_INLINE_DATA(inode)   (sizeof(__le32) * \
> - (CUR_ADDRS_PER_INODE(inode) - \
> - DEF_INLINE_RESERVED_SIZE - \
> - F2FS_INLINE_XATTR_ADDRS))
> +static inline int get_inline_xattr_addrs(struct inode *inode);
> +#define F2FS_INLINE_XATTR_ADDRS(inode)   get_inline_xattr_addrs(inode)
> +#define MAX_INLINE_DATA(inode)   (sizeof(__le32) *   
> \
> + (CUR_ADDRS_PER_INODE(inode) -   \
> + F2FS_INLINE_XATTR_ADDRS(inode) -\
> + DEF_INLINE_RESERVED_SIZE))
>  
>  /* for inline dir */
>  #define NR_INLINE_DENTRY(inode)  (MAX_INLINE_DATA(inode) * BITS_PER_BYTE 
> / \
> @@ -592,6 +597,7 @@ struct f2fs_inode_info {
>  
>   int i_extra_isize;  /* size of extra space located in 
> i_addr */
>   kprojid_t i_projid; /* id for project quota */
> + int i_inline_xattr_size;/* inline xattr size */
>  };
>  
>  static inline void get_extent_info(struct extent_info *ext,
> @@ -1043,6 +1049,7 @@ struct f2fs_sb_info {
>   loff_t max_file_blocks; /* max block index of 

Re: [GIT PULL] ARM: at91: fixes for 4.14 #1

2017-10-03 Thread Olof Johansson
On Mon, Sep 18, 2017 at 06:20:29PM +0200, Nicolas Ferre wrote:
> From: Nicolas Ferre 
> 
> Arnd, Olof,
> 
> Here are some fixes for 4.14. I send them while Alexandre is in the US. The
> most of this PR are modifications to the sama5d27 SOM1 EK board that was
> introducts lately. These modifications are errors due to typos, 
> misunderstanding
> of the DT binding or simply last minute hardware modifications.
> I also added one patch for the AT91 PM subsystem as it is the re-alignment on
> one treewide change by Russell to get rid of virt_to_phys() in low-level code.
> I think that unifying this move the soonest for all ARM platform is better.
> 
> Thanks, best regards,
> 
> The following changes since commit 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e:
> 
>   Linux 4.14-rc1 (2017-09-16 15:47:51 -0700)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/nferre/linux-at91.git 
> tags/at91-fixes
> 
> for you to fetch changes up to 093d79f62a89f47d9b5fd0746768146d9696535c:
> 
>   ARM: at91: Replace uses of virt_to_phys with __pa_symbol (2017-09-18 
> 10:05:38 +0200)

Applied to fixes.


-Olof



Re: [PATCH] ARM: defconfig: FRAMEBUFFER_CONSOLE can no longer be =m

2017-10-03 Thread Olof Johansson
On Mon, Sep 18, 2017 at 05:47:50PM +0200, Arnd Bergmann wrote:
> It is no longer possible to load this at runtime, so let's
> change the few remaining users to have it built-in all
> the time.
> 
> arch/arm/configs/zeus_defconfig:115:warning: symbol value 'm' invalid for 
> FRAMEBUFFER_CONSOLE
> arch/arm/configs/viper_defconfig:116:warning: symbol value 'm' invalid for 
> FRAMEBUFFER_CONSOLE
> arch/arm/configs/pxa_defconfig:474:warning: symbol value 'm' invalid for 
> FRAMEBUFFER_CONSOLE
> 
> Reported-by: kernelci.org bot 
> Fixes: 6104c37094e7 ("fbcon: Make fbcon a built-time depency for fbdev")
> Signed-off-by: Arnd Bergmann 

Applied to fixes.


-Olof


[PATCH] block/drbd: Fix a sleep-in-atomic bug in drbd_bcast_event

2017-10-03 Thread Jia-Ju Bai
The driver may sleep under a RCU lock, and the function call path is:
drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
drbd_khelper
  drbd_bcast_event
genlmsg_new(GFP_NOIO) --> may sleep

To fix it, GFP_NOIO is replaced with GFP_ATOMIC.
This bug is found by my static analysis tool and my code review.


Signed-off-by: Jia-Ju Bai 
---
 drivers/block/drbd/drbd_nl.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index a12f77e..713c965 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -4537,7 +4537,7 @@ void drbd_bcast_event(struct drbd_device *device, const 
struct sib_info *sib)
int err = -ENOMEM;
 
seq = atomic_inc_return(&drbd_genl_seq);
-   msg = genlmsg_new(NLMSG_GOODSIZE, GFP_NOIO);
+   msg = genlmsg_new(NLMSG_GOODSIZE, GFP_ATOMIC);
if (!msg)
goto failed;
 
-- 
1.7.9.5




[PATCH] block/drbd: Fix a sleep-in-atomic bug in notify_helper

2017-10-03 Thread Jia-Ju Bai
The driver may sleep under a RCU lock, and the function call path is:
drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
drbd_khelper
  notify_helper
genlmsg_new(GFP_NOIO) --> may sleep

To fix it, GFP_NOIO is replaced with GFP_ATOMIC.
This bug is found by my static analysis tool and my code review.

Signed-off-by: Jia-Ju Bai 
---
 drivers/block/drbd/drbd_nl.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index a12f77e..ad093da 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -4790,7 +4790,7 @@ void notify_helper(enum drbd_notification_type type,
helper_info.helper_name_len = min(strlen(name), 
sizeof(helper_info.helper_name));
helper_info.helper_status = status;
 
-   skb = genlmsg_new(NLMSG_GOODSIZE, GFP_NOIO);
+   skb = genlmsg_new(NLMSG_GOODSIZE, GFP_ATOMIC);
err = -ENOMEM;
if (!skb)
goto fail;
-- 
1.7.9.5




[BUG] drbd/block: A sleep-in-atomic bug in notify_helper

2017-10-03 Thread Jia-Ju Bai

The driver may sleep under a RCU lock, and the function call path is:
drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
drbd_khelper
  notify_helper
mutex_lock --> may sleep

I hope to fix it, but I do not find a proper way at present ...
This bug is found by my static analysis tool and my code review.


Thanks,
Jia-Ju Bai




Re: [PATCH 1/7] gpio: brcmstb: allow all instances to be wakeup sources

2017-10-03 Thread Gregory Fong
On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
> This commit allows a wakeup parent interrupt to be shared between
> instances.
>
> It also removes the redundant can_wake member of the private data
> structure by using whether the parent_wake_irq has been defined to
> indicate that the GPIO device can wake.
>
> Fixes: 19a7b6940b78 ("gpio: brcmstb: Add interrupt and wakeup source support")
> Signed-off-by: Doug Berger 

Acked-by: Gregory Fong 


Re: [RFC 4/5] ext4: add fs freezing support on suspend/hibernation

2017-10-03 Thread Theodore Ts'o
On Tue, Oct 03, 2017 at 10:13:21PM +0200, Luis R. Rodriguez wrote:
> > After we remove add the NEEDS_RECOVERY flag, we need to make sure
> > recovery flag is pushed out to disk before any other changes are
> > allowed to be pushed out to disk.  That's why we originally did the
> > update synchronously.
> 
> I see. I had to try to dig through linux-history to see why this was done, and
> I actually could not find an exact commit which explained what you just did.
> Thanks!
> 
> Hrm, so on freeze we issue the same commit as well, so is a sync *really* 
> needed
> on thaw?

So let me explain what's going on.  When we do a freeze, we make sure
all of the blocks that are written to the journal are writen to the
final location on disk, and the journal is is truncated.  (This is
called a "journal checkpoint".)  Then we clear the NEEDS RECOVERY
feature flag and set the EXT4_VALID_FS flags in the superblock.  In
the no journal case, we flush out all metadata writes, and we set the
EXT4_VALID_FS flag.  In both cases, we call ext4_commit_super(sb, 1)
to make sure the flags update is pushed out to disk.  This puts the
file system into a "quiscent" state; in fact, it looks like the file
system has been unmounted, so that it becomes safe to run
dump/restore, run fsck -n on the file system, etc.

The problem on the thaw side is that we need to make sure that
EXT4_VALID_FS flag is removed (and if journaling is enabled, the NEEDS
RECOVERY feature flag is set) and the superblock is flushed out to the
storage device before any other writes are persisted on the disk.  In
the case where we have journalling enabled, we could wait until the
first journal commit to write out the superblock, since in journal
mode all metadata updates to the disk are suppressed until the journal
commit.  We don't do that today, but it is a change we could make.

However, in the no journal node we need to make sure the EXT4_VALID_FS
flag is cleared on disk before any other metadata operations can take
place, and calling ext4_commit_super(sb, 1) is the only real way to do
that.

> No, it was am empirical evaluation done at testing, I observed bio_submit()
> stalls, we never get completion from the device. Even if we call thaw at the
> very end of resume, after the devices should be up, we still end up in the 
> same
> situation. Given how I order freezing filesystems after freezing userspace I 
> do
> believe we should thaw filesystems prior unfreezing userspace, its why I 
> placed
> the call where it is now.

So when we call ext4_commit_super(sb, 1), we issue the bio, and then
we block waiting for the I/O to complete.  That's a blocking call.  Is
the thaw context one which allows blocking?  If userspace is still
frozen, does that imply that the scheduler isn't allow to run?  If
that's the case, then that's probably the problem.

More generally, if the thawing code needs to allocate memory, or do
any number of things that could potentially block, this could
potentially be an issue.  Or if we have a network block device, or
something else in the storage stack that needs to run a kernel thread
context (or a workqueue, etc.) --- is the fact that userspace is
frozen mean the scheduler is going to refuse to schedule()?

I know at one point we made a distinction between freezing userspace
threads and kernel threads, but were there people who didn't like this
because of unnecessary complexity.  It sounds to me like on the thaw
side, we might also need to unfreeze kernel threads first, and then
allow userspace threads to run.  Do we do that today, or in the new
freeze/thaw code?  It's been quite a while since I've looked at that
part of the kernel.

- Ted



Re: [PATCH v2 1/4] KVM: LAPIC: Fix lapic timer mode transition

2017-10-03 Thread Wanpeng Li
2017-10-04 1:05 GMT+08:00 Radim Krčmář :
> 2017-09-28 18:04-0700, Wanpeng Li:
>> From: Wanpeng Li 
>>
>> SDM 10.5.4.1 TSC-Deadline Mode mentioned that "Transitioning between 
>> TSC-Deadline
>> mode and other timer modes also disarms the timer". So the APIC Timer 
>> Initial Count
>> Register for one-shot/periodic mode should be reset. This patch do it.
>
> At the beginning of the secion is also:
>
>   A write to the LVT Timer Register that changes the timer mode disarms
>   the local APIC timer. The supported timer modes are given in Table
>   10-2. The three modes of the local APIC timer are mutually exclusive.

Yeah, I saw it before sending out the patches, but it is mentioned in
TSC-deadline section which looks strange, if the timer is still
disarmed when switching between one-shot and periodic mode before
TSC-deadline is introduced and w/o TSC-deadline section?

>
> So we should also disarm when switching between one-shot and periodic.
>
> apic_update_lvtt() already has logic to determine whether the timer mode
> has changed and is the perfect place to clear APIC_TMICT.

Agreed, thanks for your review, Radim. :)

Regards,
Wanpeng Li

>
> Thanks.
>
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Signed-off-by: Wanpeng Li 
>> ---
>>  arch/x86/include/asm/apicdef.h | 1 +
>>  arch/x86/kvm/lapic.c   | 3 +++
>>  2 files changed, 4 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h
>> index c46bb99..d8ef1b4 100644
>> --- a/arch/x86/include/asm/apicdef.h
>> +++ b/arch/x86/include/asm/apicdef.h
>> @@ -100,6 +100,7 @@
>>  #define  APIC_TIMER_BASE_CLKIN   0x0
>>  #define  APIC_TIMER_BASE_TMBASE  0x1
>>  #define  APIC_TIMER_BASE_DIV 0x2
>> +#define  APIC_LVT_TIMER_MASK (3 << 17)
>>  #define  APIC_LVT_TIMER_ONESHOT  (0 << 17)
>>  #define  APIC_LVT_TIMER_PERIODIC (1 << 17)
>>  #define  APIC_LVT_TIMER_TSCDEADLINE  (2 << 17)
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 69c5612..a739cbb 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1722,6 +1722,9 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 
>> reg, u32 val)
>>   break;
>>
>>   case APIC_LVTT:
>> + if (apic_lvtt_tscdeadline(apic) != ((val &
>> + APIC_LVT_TIMER_MASK) == APIC_LVT_TIMER_TSCDEADLINE))
>> + kvm_lapic_set_reg(apic, APIC_TMICT, 0);
>>   if (!kvm_apic_sw_enabled(apic))
>>   val |= APIC_LVT_MASKED;
>>   val &= (apic_lvt_mask[0] | apic->lapic_timer.timer_mode_mask);
>> --
>> 2.7.4
>>


Re: [PATCH v2 2/4] KVM: LAPIC: Keep timer running when switching between one-shot and periodic mode

2017-10-03 Thread Wanpeng Li
2017-10-04 1:06 GMT+08:00 Radim Krčmář :
> 2017-09-28 18:04-0700, Wanpeng Li:
>> From: Wanpeng Li 
>>
>> If we take TSC-deadline mode timer out of the picture, the Intel SDM
>> does not say that the timer is disable when the timer mode is change,
>> either from one-shot to periodic or vice versa.
>
> I think it does, please see comment under [v2 1/4].

As I replied to [v2 1/4].

Regards,
Wanpeng Li

>
>> After this patch, the timer is no longer disarmed on change of mode, so
>> the counter (TMCCT) keeps counting down.
>>
>> So what does a write to LVTT changes ? On baremetal, the change of mode
>> is probably taken into account only when the counter reach 0. When this
>> happen, LVTT is use to figure out if the counter should restard counting
>> down from TMICT (so periodic mode) or stop counting (if one-shot mode).
>>
>> This patch is based on observation of the behavior of the APIC timer on
>> baremetal as well as check that they does not go against the description
>> written in the Intel SDM.
>>
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Signed-off-by: Wanpeng Li 
>> ---
>>  arch/x86/kvm/lapic.c | 40 
>>  1 file changed, 28 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index a739cbb..946c11b 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1301,7 +1301,7 @@ static void update_divide_count(struct kvm_lapic *apic)
>>  apic->divide_count);
>>  }
>>
>> -static void apic_update_lvtt(struct kvm_lapic *apic)
>> +static bool apic_update_lvtt(struct kvm_lapic *apic)
>>  {
>>   u32 timer_mode = kvm_lapic_get_reg(apic, APIC_LVTT) &
>>   apic->lapic_timer.timer_mode_mask;
>> @@ -1309,7 +1309,9 @@ static void apic_update_lvtt(struct kvm_lapic *apic)
>>   if (apic->lapic_timer.timer_mode != timer_mode) {
>>   apic->lapic_timer.timer_mode = timer_mode;
>>   hrtimer_cancel(&apic->lapic_timer.timer);
>> + return true;
>>   }
>> + return false;
>>  }
>>
>>  static void apic_timer_expired(struct kvm_lapic *apic)
>> @@ -1430,11 +1432,12 @@ static void start_sw_period(struct kvm_lapic *apic)
>>   HRTIMER_MODE_ABS_PINNED);
>>  }
>>
>> -static bool set_target_expiration(struct kvm_lapic *apic)
>> +static bool set_target_expiration(struct kvm_lapic *apic, bool timer_update)
>>  {
>> - ktime_t now;
>> - u64 tscl = rdtsc();
>> + ktime_t now, remaining;
>> + u64 tscl = rdtsc(), delta;
>>
>> + /* Calculate the next time the timer should trigger an interrupt */
>>   now = ktime_get();
>>   apic->lapic_timer.period = (u64)kvm_lapic_get_reg(apic, APIC_TMICT)
>>   * APIC_BUS_CYCLE_NS * apic->divide_count;
>> @@ -1470,9 +1473,21 @@ static bool set_target_expiration(struct kvm_lapic 
>> *apic)
>>  ktime_to_ns(ktime_add_ns(now,
>>   apic->lapic_timer.period)));
>>
>> + if (!timer_update)
>> + delta = apic->lapic_timer.period;
>> + else {
>> + remaining = ktime_sub(apic->lapic_timer.target_expiration, 
>> now);
>> + if (ktime_to_ns(remaining) < 0)
>> + remaining = 0;
>> + delta = mod_64(ktime_to_ns(remaining), 
>> apic->lapic_timer.period);
>> + }
>> +
>> + if (!delta)
>> + return false;
>> +
>>   apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
>> - nsec_to_cycles(apic->vcpu, apic->lapic_timer.period);
>> - apic->lapic_timer.target_expiration = ktime_add_ns(now, 
>> apic->lapic_timer.period);
>> + nsec_to_cycles(apic->vcpu, delta);
>> + apic->lapic_timer.target_expiration = ktime_add_ns(now, delta);
>>
>>   return true;
>>  }
>> @@ -1609,12 +1624,12 @@ void kvm_lapic_restart_hv_timer(struct kvm_vcpu 
>> *vcpu)
>>   restart_apic_timer(apic);
>>  }
>>
>> -static void start_apic_timer(struct kvm_lapic *apic)
>> +static void start_apic_timer(struct kvm_lapic *apic, bool timer_update)
>>  {
>>   atomic_set(&apic->lapic_timer.pending, 0);
>>
>>   if ((apic_lvtt_period(apic) || apic_lvtt_oneshot(apic))
>> - && !set_target_expiration(apic))
>> + && !set_target_expiration(apic, timer_update))
>>   return;
>>
>>   restart_apic_timer(apic);
>> @@ -1729,7 +1744,8 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 
>> reg, u32 val)
>>   val |= APIC_LVT_MASKED;
>>   val &= (apic_lvt_mask[0] | apic->lapic_timer.timer_mode_mask);
>>   kvm_lapic_set_reg(apic, APIC_LVTT, val);
>> - apic_update_lvtt(apic);
>> + if (apic_update_lvtt(apic) && !apic_lvtt_tscdeadline(apic))
>> + start_apic_timer(apic, true);
>>   break;
>>
>>   case APIC_TMICT:
>> @@ -1738,7 +1754,7 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 
>> reg, u32 val)
>>
>>   

Re: [PATCH 2/7] gpio: brcmstb: release the bgpio lock during irq handlers

2017-10-03 Thread Gregory Fong
Hi Doug,

On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
> The basic memory-mapped GPIO controller lock must be released
> before calling the registered GPIO interrupt handlers to allow
> the interrupt handlers to access the hardware.  Otherwise, the
> hardware accesses will deadlock when they attempt to grab the
> lock.

I was having some trouble understanding exactly what the problem was
here, but I think I see it now.  Since this locks the entire bank,
where some GPIOs might be set as inputs and some as inputs (and
interrupt sources), then an interrupt on a GPIO that is supposed to
set another GPIO in the bank would result in deadlock.  Is that
correct?  If so, please update the commit message to make that clear,
and nice fix.  If not that, it would be nice to know what scenario can
cause a problem.

Thanks,
Gregory


Re: [PATCH v2 3/4] KVM: LAPIC: Apply change to TDCR right away to the timer

2017-10-03 Thread Wanpeng Li
2017-10-04 1:28 GMT+08:00 Radim Krčmář :
> 2017-09-28 18:04-0700, Wanpeng Li:
>> From: Wanpeng Li 
>>
>> The description in the Intel SDM of how the divide configuration
>> register is used: "The APIC timer frequency will be the processor's bus
>> clock or core crystal clock frequency divided by the value specified in
>> the divide configuration register."
>>
>> Observation of baremetal shown that when the TDCR is change, the TMCCT
>> does not change or make a big jump in value, but the rate at which it
>> count down change.
>>
>> The patch update the emulation to APIC timer to so that a change to the
>> divide configuration would be reflected in the value of the counter and
>> when the next interrupt is triggered.
>>
>> Cc: Paolo Bonzini 
>> Cc: Radim Krčmář 
>> Signed-off-by: Wanpeng Li 
>> ---
>
> Why do we need to do more than just restart the timer?

Because the current timer (hv or sw) are still running. I think the
goal of this commit is to runtime update the rate of the current timer
which is running. Our restart_apic_timer() implementation just cancels
the current timer when switch between preemption timer and hrtimer.

Regards,
Wanpeng Li

>
> The TMCCT should remain roughly at the same level -- changing divide
> count modifies target_expiration and it looks like apic_get_tmcct()
> would get the same result like before changing divide count.
>
> Thanks.


Re: [PATCH 3/7] gpio: brcmstb: switch to handle_level_irq flow

2017-10-03 Thread Gregory Fong
On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
> Reading and writing the gpio bank status register each time a pending
> interrupt bit is serviced could cause new pending bits to be cleared
> without servicing the associated interrupts.
>
> By using the handle_level_irq flow instead of the handle_simple_irq
> flow we get proper handling of interrupt masking as well as acking
> of interrupts.  The irq_ack method is added to support this.
>
> Fixes: 19a7b6940b78 ("gpio: brcmstb: Add interrupt and wakeup source support")
> Signed-off-by: Doug Berger 

Acked-by: Gregory Fong 


Re: [PATCH 4/7] gpio: brcmstb: correct the configuration of level interrupts

2017-10-03 Thread Gregory Fong
On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
> This commit corrects a bug when configuring the GPIO hardware for
> IRQ_TYPE_LEVEL_LOW and IRQ_TYPE_LEVEL_HIGH interrupt types. The
> hardware is now correctly configured to support those types.
>
> Fixes: 19a7b6940b78 ("gpio: brcmstb: Add interrupt and wakeup source support")
> Signed-off-by: Doug Berger 

Acked-by: Gregory Fong 


Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-03 Thread Michael S. Tsirkin
On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote:
> On Sat, 30 Sep 2017 01:21:43 +0200
> "Rafael J. Wysocki"  wrote:
> 
> > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin 
> > wrote:
> > > intel idle driver does not DTRT when running within a VM:
> > > when going into a deep power state, the right thing to
> > > do is to exit to hypervisor rather than to keep polling
> > > within guest using mwait.
> > >
> > > Currently the solution is just to exit to hypervisor each time we go
> > > idle - this is why kvm does not expose the mwait leaf to guests even
> > > when it allows guests to do mwait.
> > >
> > > But that's not ideal - it seems better to use the idle driver to
> > > guess when will the next interrupt arrive.  
> > 
> > The idle driver alone is not sufficient for that, though.
> > 
> I second that. Why try to solve this problem at vendor specific driver
> level?

Well we still want to e.g. mwait if possible - saves power.

> perhaps just a pv idle driver that decide whether to vmexit
> based on something like local per vCPU timer expiration? I guess we
> can't predict other wake events such as interrupts.
> e.g.
> if (get_next_timer_interrupt() > kvm_halt_target_residency)
>   vmexit
> else
>   poll
> 
> Jacob

It's not always a poll, on x86 putting the CPU in a low power state
is possible within a VM.

Does not seem possible on other CPUs that's why it's vendor specific.

-- 
MST


Re: e4dace3615 ("lib: add test module for CONFIG_DEBUG_VIRTUAL"): [ 10.185572] kernel BUG at arch/x86/mm/physaddr.c:75!

2017-10-03 Thread Florian Fainelli
Hi Fengguang,

On 10/03/2017 07:24 AM, Fengguang Wu wrote:
> Hi Florian,
> 
> FYI we see this BUG in latest RC kernel and linux-next.
> Have it been fixed somewhere else?

The purpose of the test module is to make tests for CONFIG_DEBUG_VIRTUAL
trigger such warnings/VIRTUAL_BUG_ON, because of how the test is coded,
you may want to not load such module. If the module needs to be moved to
another directory because lib/* contains mostly functional tests, let me
know and I will cook something.

Thanks!

> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> 
> commit e4dace3615526fd66c86dd535ee4bc9e8c706e37
> Author: Florian Fainelli 
> AuthorDate: Fri Sep 8 16:15:31 2017 -0700
> Commit: Linus Torvalds 
> CommitDate: Fri Sep 8 18:26:49 2017 -0700
> 
>  lib: add test module for CONFIG_DEBUG_VIRTUAL
>  
>  Add a test module that allows testing that CONFIG_DEBUG_VIRTUAL works
>  correctly, at least that it can catch invalid calls to virt_to_phys()
>  against the non-linear kernel virtual address map.
>  
>  Link: 
> http://lkml.kernel.org/r/20170808164035.26725-1-f.faine...@gmail.com
>  Signed-off-by: Florian Fainelli 
>  Cc: "Luis R. Rodriguez" 
>  Cc: Kees Cook 
>  Signed-off-by: Andrew Morton 
>  Signed-off-by: Linus Torvalds 
> 
> 9888a588ea  lib/hexdump.c: return -EINVAL in case of error in hex2bin()
> e4dace3615  lib: add test module for CONFIG_DEBUG_VIRTUAL
> 9e66317d3c  Linux 4.14-rc3
> 1418b85217  Add linux-next specific files for 20170929
> +--+++---+---+
> |  | 9888a588ea | e4dace3615 | 
> v4.14-rc3 | next-20170929 |
> +--+++---+---+
> | boot_successes   | 34 | 0  | 0  
>| 0 |
> | boot_failures| 0  | 15 | 11 
>| 11|
> | kernel_BUG_at_arch/x86/mm/physaddr.c | 0  | 15 | 11 
>| 11|
> | invalid_opcode:#[##] | 0  | 15 | 11 
>| 11|
> | EIP:__phys_addr  | 0  | 15 | 11 
>| 11|
> | Kernel_panic-not_syncing:Fatal_exception | 0  | 15 | 11 
>| 11|
> +--+++---+---+
> 
> [9.837344] test_firmware: interface ready
> [9.838286] test passed
> [9.839498] test_printf: all 260 tests passed
> [   10.183840] test_bitmap: all 460506 tests passed
> [   10.184882] [ cut here ]
> [   10.185572] kernel BUG at arch/x86/mm/physaddr.c:75!
> [   10.186568] invalid opcode:  [#1]
> [   10.187104] CPU: 0 PID: 1 Comm: swapper Not tainted 4.13.0-09302-ge4dace3 
> #1
> [   10.187547] task: 91c45000 task.stack: 91c54000
> [   10.187547] EIP: __phys_addr+0x244/0x2a0
> [   10.187547] EFLAGS: 00210246 CPU: 0
> [   10.187547] EAX:  EBX: 92fe ECX:  EDX: 0001
> [   10.187547] ESI: 12fe EDI:  EBP: 8464e539 ESP: 91c55f08
> [   10.187547]  DS: 007b ES: 007b FS:  GS:  SS: 0068
> [   10.187547] CR0: 80050033 CR2:  CR3: 04792000 CR4: 0690
> [   10.187547] Call Trace:
> [   10.187547]  ? vprintk_func+0x88/0x150
> [   10.187547]  ? test_bitmap_init+0xe7/0xe7
> [   10.187547]  ? test_debug_virtual_init+0x22/0xf0
> [   10.187547]  ? do_one_initcall+0x127/0x3c8
> [   10.187547]  ? kernel_init_freeable+0x217/0x3de
> [   10.187547]  ? kernel_init_freeable+0x245/0x3de
> [   10.187547]  ? rest_init+0x200/0x200
> [   10.187547]  ? kernel_init+0x17/0x340
> [   10.187547]  ? rest_init+0x200/0x200
> [   10.187547]  ? schedule_tail_wrapper+0x6/0x8
> [   10.187547]  ? rest_init+0x200/0x200
> [   10.187547]  ? ret_from_fork+0x19/0x24
> [   10.187547] Code: c9 ba 01 00 00 00 c7 04 24 00 00 00 00 b8 08 31 17 84 e8 
> 30 58 19 00 ff 05 74 c1 1f 84 83 05 20 01 7e 84 01 83 15 24 01 7e 84 00 <0f> 
> 0b 83 05 30 01 7e 84 01 83 15 34 01 7e 84 00 8d 74 26 00 83
> [   10.187547] EIP: __phys_addr+0x244/0x2a0 SS:ESP: 0068:91c55f08
> [   10.205865] ---[ end trace 2cb481e044fc1df9 ]---
> [   10.206542] Kernel panic - not syncing: Fatal exception
> 
># HH:MM RESULT 
> GOOD BAD GOOD_BUT_DIRTY DIRTY_NOT_BAD
> git bisect start 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e v4.13 --
> git bisect good 3aea311c1b4002bd730a09530f80f2f2ad3bf495  # 04:45  G 10   
>   00   0  genksyms: fix gperf removal conversion
> git bisect  bad 6d8ef53e8b2fed8b0f91df0c6da7cc92747d934a  # 05:12  B  0   
>   9   21   0  Merge tag 'f2fs-for-4.14' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
> git bisect  bad d719518d9ce9132bad8a06e8029aeead328f66a3  # 0

Re: [PATCH 2/7] gpio: brcmstb: release the bgpio lock during irq handlers

2017-10-03 Thread Doug Berger
On 10/03/2017 06:55 PM, Gregory Fong wrote:
> Hi Doug,
> 
> On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
>> The basic memory-mapped GPIO controller lock must be released
>> before calling the registered GPIO interrupt handlers to allow
>> the interrupt handlers to access the hardware.  Otherwise, the
>> hardware accesses will deadlock when they attempt to grab the
>> lock.
> 
> I was having some trouble understanding exactly what the problem was
> here, but I think I see it now.  Since this locks the entire bank,
> where some GPIOs might be set as inputs and some as inputs (and
> interrupt sources), then an interrupt on a GPIO that is supposed to
> set another GPIO in the bank would result in deadlock.  Is that
> correct?  If so, please update the commit message to make that clear,
> and nice fix.  If not that, it would be nice to know what scenario can
> cause a problem.

That is an example, but there are really many possibilities.

Basically, if a registered interrupt handler wants to access its GPIO
you are likely to run into trouble.  Another example might be an
interrupt that is configured to trigger on either edge transition and
the handler wants to know whether the input is currently high or low.

I can submit a V2 with a change in the description if you would like,
but I'm not sure what the clearest example would be.

> 
> Thanks,
> Gregory
> 

Thanks for the feedback,
Doug



Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM

2017-10-03 Thread Michael S. Tsirkin
On Tue, Oct 03, 2017 at 11:02:55PM +0200, Thomas Gleixner wrote:
> On Mon, 2 Oct 2017, Jacob Pan wrote:
> > On Sat, 30 Sep 2017 01:21:43 +0200
> > "Rafael J. Wysocki"  wrote:
> > 
> > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin 
> > > wrote:
> > > > intel idle driver does not DTRT when running within a VM:
> > > > when going into a deep power state, the right thing to
> > > > do is to exit to hypervisor rather than to keep polling
> > > > within guest using mwait.
> > > >
> > > > Currently the solution is just to exit to hypervisor each time we go
> > > > idle - this is why kvm does not expose the mwait leaf to guests even
> > > > when it allows guests to do mwait.
> > > >
> > > > But that's not ideal - it seems better to use the idle driver to
> > > > guess when will the next interrupt arrive.  
> > > 
> > > The idle driver alone is not sufficient for that, though.
> > > 
> > I second that. Why try to solve this problem at vendor specific driver
> > level? perhaps just a pv idle driver that decide whether to vmexit
> > based on something like local per vCPU timer expiration? I guess we
> > can't predict other wake events such as interrupts.
> > e.g.
> > if (get_next_timer_interrupt() > kvm_halt_target_residency)
> 
> Bah. no. get_next_timer_interrupt() is not available for abuse in random
> cpuidle driver code. It has state and its tied to the nohz code.
> 
> There is the series from Audrey which makes use of the various idle
> prediction mechanisms, scheduler, irq timings, idle governor to get an idea
> about the estimated idle time. Exactly this information can be fed to the
> kvmidle driver which can act accordingly.
> 
> Hacking a random hardware specific idle driver is definitely the wrong
> approach. It might be useful to chain the kvmidle driver and hardware
> specific drivers at some point, i.e. if the kvmdriver decides not to exit
> it delegates the mwait decision to the proper hardware driver in order not
> to reimplement all the required logic again.

By making changes to idle core to allow that chaining?
Does this sound like something reasonable?

> But that's a different story.
> 
> See 
> http://lkml.kernel.org/r/1506756034-6340-1-git-send-email-aubrey...@intel.com

Will read that, thanks a lot.

> Thanks,
> 
>   tglx
> 
> 
> 


Re: [PATCH 5/7] gpio: brcmstb: enable masking of interrupts when changing type

2017-10-03 Thread Gregory Fong
Hi Doug,

On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
> Mask the GPIO interrupt while its type is being changed, just in case
> it can prevent a spurious interrupt.

"Just in case"?  I don't have access to hardware documentation for
this anymore, but I'd expect to some stronger claim that the hardware
actually requires masking before changing the trigger type.  If you
can quote documentation for this or explain an actual problem seen,
that would be good.

>
> Signed-off-by: Doug Berger 
> ---
>  drivers/gpio/gpio-brcmstb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpio/gpio-brcmstb.c b/drivers/gpio/gpio-brcmstb.c
> index 0418cb266586..e2fff559c1ca 100644
> --- a/drivers/gpio/gpio-brcmstb.c
> +++ b/drivers/gpio/gpio-brcmstb.c
> @@ -363,7 +363,9 @@ static int brcmstb_gpio_irq_setup(struct platform_device 
> *pdev,
> bank->irq_chip.irq_set_type = brcmstb_gpio_irq_set_type;
>
> /* Ensures that all non-wakeup IRQs are disabled at suspend */
> -   bank->irq_chip.flags = IRQCHIP_MASK_ON_SUSPEND;
> +   /* and that interrupts are masked when changing their type  */

This doesn't follow the kernel multi-line comment style, please adjust.

> +   bank->irq_chip.flags = IRQCHIP_MASK_ON_SUSPEND |
> +  IRQCHIP_SET_TYPE_MASKED;
>
> if (IS_ENABLED(CONFIG_PM_SLEEP) && !priv->parent_wake_irq &&
> of_property_read_bool(np, "wakeup-source")) {
> --
> 2.14.1
>

Thanks,
Gregory


Re: [PATCH 5/7] gpio: brcmstb: enable masking of interrupts when changing type

2017-10-03 Thread Doug Berger
On 10/03/2017 07:10 PM, Gregory Fong wrote:
> Hi Doug,
> 
> On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
>> Mask the GPIO interrupt while its type is being changed, just in case
>> it can prevent a spurious interrupt.
> 
> "Just in case"?  I don't have access to hardware documentation for
> this anymore, but I'd expect to some stronger claim that the hardware
> actually requires masking before changing the trigger type.  If you
> can quote documentation for this or explain an actual problem seen,
> that would be good.

Well spotted ;).  This was a protectionist change added at the request
of a user of the driver.  I believe that it is superfluous and I suppose
that belief leaked through in the language of my comment.

In actuality, the GPIO APIs don't make provision for clearing GPIO
interrupts before enabling them so this masking is really only a
deferral of a spurious interrupt if one is triggered by changes in the
hardware programming.

I can strike this from the upstream submission and save a couple lines
of source code (after IRQCHIP_MASK_ON_SUSPEND goes away) if you prefer.

> 
>>
>> Signed-off-by: Doug Berger 
>> ---
>>  drivers/gpio/gpio-brcmstb.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpio/gpio-brcmstb.c b/drivers/gpio/gpio-brcmstb.c
>> index 0418cb266586..e2fff559c1ca 100644
>> --- a/drivers/gpio/gpio-brcmstb.c
>> +++ b/drivers/gpio/gpio-brcmstb.c
>> @@ -363,7 +363,9 @@ static int brcmstb_gpio_irq_setup(struct platform_device 
>> *pdev,
>> bank->irq_chip.irq_set_type = brcmstb_gpio_irq_set_type;
>>
>> /* Ensures that all non-wakeup IRQs are disabled at suspend */
>> -   bank->irq_chip.flags = IRQCHIP_MASK_ON_SUSPEND;
>> +   /* and that interrupts are masked when changing their type  */
> 
> This doesn't follow the kernel multi-line comment style, please adjust.
> 
>> +   bank->irq_chip.flags = IRQCHIP_MASK_ON_SUSPEND |
>> +  IRQCHIP_SET_TYPE_MASKED;
>>
>> if (IS_ENABLED(CONFIG_PM_SLEEP) && !priv->parent_wake_irq &&
>> of_property_read_bool(np, "wakeup-source")) {
>> --
>> 2.14.1
>>
> 
> Thanks,
> Gregory
> 

Thanks again,
Doug


Re: e4dace3615 ("lib: add test module for CONFIG_DEBUG_VIRTUAL"): [ 10.185572] kernel BUG at arch/x86/mm/physaddr.c:75!

2017-10-03 Thread Fengguang Wu

On Tue, Oct 03, 2017 at 07:09:37PM -0700, Florian Fainelli wrote:

Hi Fengguang,

On 10/03/2017 07:24 AM, Fengguang Wu wrote:

Hi Florian,

FYI we see this BUG in latest RC kernel and linux-next.
Have it been fixed somewhere else?


The purpose of the test module is to make tests for CONFIG_DEBUG_VIRTUAL
trigger such warnings/VIRTUAL_BUG_ON, because of how the test is coded,


If so, it may help clarify things if there is a printk notice before
triggering the BUG?


you may want to not load such module. If the module needs to be moved to
another directory because lib/* contains mostly functional tests, let me
know and I will cook something.


Never mind, it looks easy to disable CONFIG_TEST_DEBUG_VIRTUAL. :)

Thanks,
Fengguang


https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

commit e4dace3615526fd66c86dd535ee4bc9e8c706e37
Author: Florian Fainelli 
AuthorDate: Fri Sep 8 16:15:31 2017 -0700
Commit: Linus Torvalds 
CommitDate: Fri Sep 8 18:26:49 2017 -0700

 lib: add test module for CONFIG_DEBUG_VIRTUAL

 Add a test module that allows testing that CONFIG_DEBUG_VIRTUAL works
 correctly, at least that it can catch invalid calls to virt_to_phys()
 against the non-linear kernel virtual address map.

 Link: http://lkml.kernel.org/r/20170808164035.26725-1-f.faine...@gmail.com
 Signed-off-by: Florian Fainelli 
 Cc: "Luis R. Rodriguez" 
 Cc: Kees Cook 
 Signed-off-by: Andrew Morton 
 Signed-off-by: Linus Torvalds 

9888a588ea  lib/hexdump.c: return -EINVAL in case of error in hex2bin()
e4dace3615  lib: add test module for CONFIG_DEBUG_VIRTUAL
9e66317d3c  Linux 4.14-rc3
1418b85217  Add linux-next specific files for 20170929
+--+++---+---+
|  | 9888a588ea | e4dace3615 | 
v4.14-rc3 | next-20170929 |
+--+++---+---+
| boot_successes   | 34 | 0  | 0
 | 0 |
| boot_failures| 0  | 15 | 11   
 | 11|
| kernel_BUG_at_arch/x86/mm/physaddr.c | 0  | 15 | 11   
 | 11|
| invalid_opcode:#[##] | 0  | 15 | 11   
 | 11|
| EIP:__phys_addr  | 0  | 15 | 11   
 | 11|
| Kernel_panic-not_syncing:Fatal_exception | 0  | 15 | 11   
 | 11|
+--+++---+---+

[9.837344] test_firmware: interface ready
[9.838286] test passed
[9.839498] test_printf: all 260 tests passed
[   10.183840] test_bitmap: all 460506 tests passed
[   10.184882] [ cut here ]
[   10.185572] kernel BUG at arch/x86/mm/physaddr.c:75!
[   10.186568] invalid opcode:  [#1]
[   10.187104] CPU: 0 PID: 1 Comm: swapper Not tainted 4.13.0-09302-ge4dace3 #1
[   10.187547] task: 91c45000 task.stack: 91c54000
[   10.187547] EIP: __phys_addr+0x244/0x2a0
[   10.187547] EFLAGS: 00210246 CPU: 0
[   10.187547] EAX:  EBX: 92fe ECX:  EDX: 0001
[   10.187547] ESI: 12fe EDI:  EBP: 8464e539 ESP: 91c55f08
[   10.187547]  DS: 007b ES: 007b FS:  GS:  SS: 0068
[   10.187547] CR0: 80050033 CR2:  CR3: 04792000 CR4: 0690
[   10.187547] Call Trace:
[   10.187547]  ? vprintk_func+0x88/0x150
[   10.187547]  ? test_bitmap_init+0xe7/0xe7
[   10.187547]  ? test_debug_virtual_init+0x22/0xf0
[   10.187547]  ? do_one_initcall+0x127/0x3c8
[   10.187547]  ? kernel_init_freeable+0x217/0x3de
[   10.187547]  ? kernel_init_freeable+0x245/0x3de
[   10.187547]  ? rest_init+0x200/0x200
[   10.187547]  ? kernel_init+0x17/0x340
[   10.187547]  ? rest_init+0x200/0x200
[   10.187547]  ? schedule_tail_wrapper+0x6/0x8
[   10.187547]  ? rest_init+0x200/0x200
[   10.187547]  ? ret_from_fork+0x19/0x24
[   10.187547] Code: c9 ba 01 00 00 00 c7 04 24 00 00 00 00 b8 08 31 17 84 e8 30 58 
19 00 ff 05 74 c1 1f 84 83 05 20 01 7e 84 01 83 15 24 01 7e 84 00 <0f> 0b 83 05 
30 01 7e 84 01 83 15 34 01 7e 84 00 8d 74 26 00 83
[   10.187547] EIP: __phys_addr+0x244/0x2a0 SS:ESP: 0068:91c55f08
[   10.205865] ---[ end trace 2cb481e044fc1df9 ]---
[   10.206542] Kernel panic - not syncing: Fatal exception

   # HH:MM RESULT GOOD 
BAD GOOD_BUT_DIRTY DIRTY_NOT_BAD
git bisect start 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e v4.13 --
git bisect good 3aea311c1b4002bd730a09530f80f2f2ad3bf495  # 04:45  G 10 
00   0  genksyms: fix gperf removal conversion
git bisect  bad 6d8ef53e8b2fed8b0f91df0c6da7cc92747d934a  # 05:12  B  0 
9   21   0  Merge tag 'f2fs-for-4.14' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
git bisect  bad d71

[PATCH v3] sched/deadline: make it configurable

2017-10-03 Thread Nicolas Pitre
On most small systems, the deadline scheduler class is a luxury that
rarely gets used if at all. It is preferable to have the ability to
configure it out to reduce the kernel size in that case.

Before:

$ size -t kernel/sched/built-in.o
   textdata bss dec hex filename
[...]
  244353452 108   279956d5b (TOTALS)

With CONFIG_SCHED_DL=n:

$ size -t kernel/sched/built-in.o
   textdata bss dec hex filename
[...]
  183363388  92   218165538 (TOTALS)

Signed-off-by: Nicolas Pitre 
---

Changes from v2:

- rebased to v4.14-rc2

Changes from v1:

- fix for a compilation error found by kbuild test robot


 include/linux/sched.h  |  2 ++
 include/linux/sched/deadline.h |  8 ++-
 init/Kconfig   |  8 +++
 kernel/locking/rtmutex.c   |  6 ++---
 kernel/sched/Makefile  |  5 ++--
 kernel/sched/core.c| 15 +++-
 kernel/sched/cpudeadline.h |  7 +-
 kernel/sched/debug.c   |  4 
 kernel/sched/rt.c  | 13 +++
 kernel/sched/sched.h   | 44 +---
 kernel/sched/stop_task.c   |  4 
 11 files changed, 90 insertions(+), 26 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 92fb8dd5a9..00b4bed170 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -565,7 +565,9 @@ struct task_struct {
 #ifdef CONFIG_CGROUP_SCHED
struct task_group   *sched_task_group;
 #endif
+#ifdef CONFIG_SCHED_DL
struct sched_dl_entity  dl;
+#endif
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 975be862e0..8f191a17dd 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -13,7 +13,7 @@
 
 static inline int dl_prio(int prio)
 {
-   if (unlikely(prio < MAX_DL_PRIO))
+   if (IS_ENABLED(CONFIG_SCHED_DL) && unlikely(prio < MAX_DL_PRIO))
return 1;
return 0;
 }
@@ -28,4 +28,10 @@ static inline bool dl_time_before(u64 a, u64 b)
return (s64)(a - b) < 0;
 }
 
+#ifdef CONFIG_SCHED_DL
+#define dl_deadline(tsk)   (tsk)->dl.deadline
+#else
+#define dl_deadline(tsk)   0
+#endif
+
 #endif /* _LINUX_SCHED_DEADLINE_H */
diff --git a/init/Kconfig b/init/Kconfig
index 78cb246101..f252e0dbee 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -959,6 +959,14 @@ config SCHED_AUTOGROUP
  desktop applications.  Task group autogeneration is currently based
  upon task session.
 
+config SCHED_DL
+   bool "Deadline Task Scheduling" if EXPERT
+   default y
+   help
+ This adds the sched_dl scheduling class to the kernel providing
+ support for the SCHED_DEADLINE policy. You might want to disable
+ this to reduce the kernel size. If unsure say y.
+
 config SYSFS_DEPRECATED
bool "Enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 6f3dba6e4e..12f5eb1953 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -228,7 +228,7 @@ static inline bool unlock_rt_mutex_safe(struct rt_mutex 
*lock,
  * Only use with rt_mutex_waiter_{less,equal}()
  */
 #define task_to_waiter(p)  \
-   &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = 
(p)->dl.deadline }
+   &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = 
dl_deadline(p) }
 
 static inline int
 rt_mutex_waiter_less(struct rt_mutex_waiter *left,
@@ -680,7 +680,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct 
*task,
 * the values of the node being removed.
 */
waiter->prio = task->prio;
-   waiter->deadline = task->dl.deadline;
+   waiter->deadline = dl_deadline(task);
 
rt_mutex_enqueue(lock, waiter);
 
@@ -954,7 +954,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
waiter->task = task;
waiter->lock = lock;
waiter->prio = task->prio;
-   waiter->deadline = task->dl.deadline;
+   waiter->deadline = dl_deadline(task);
 
/* Get the top priority waiter on the lock */
if (rt_mutex_has_waiters(lock))
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 78f54932ea..0d3baba207 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -16,9 +16,10 @@ CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
 endif
 
 obj-y += core.o loadavg.o clock.o cputime.o
-obj-y += idle_task.o fair.o rt.o deadline.o
+obj-y += idle_task.o fair.o rt.o
 obj-y += wait.o wait_bit.o swait.o completion.o idle.o
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
+obj-$(CONFIG_SCHED_DL) += deadline.o $(if $(CONFIG_SMP),cpudeadline.o)
+obj-$(CONFIG_SMP) += cpupri.o topology.o stop_task.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stat

Re: [PATCH] KVM: VMX: check match table

2017-10-03 Thread Nick Desaulniers
On Tue, Oct 03, 2017 at 11:42:18AM +0200, Paolo Bonzini wrote:
> On 01/10/2017 01:22, Nick Desaulniers wrote:
> > I don't follow (but I also don't know what any of these three letter
> > acryonyms acronyms stand for), does svm depend on vmx or vice-versa?
> Neither, one is Intel (VMX), the other is AMD (SVM).

Oh, neat, did not realize the vendors did not have different names for
their virtualization extensions.

https://rtfmp.com/2016/03/21/what-does-vmx-svm-cpu-flags-mean/

> Would this work for you?

It doesn't apply cleanly on Linus' tree, or the KVM tree master branch,
so I couldn't fully test it.  But it does look like it will do the
trick.

> And again, is this only with clang?

Indeed the warning was coming from Clang, but looks like some
additional cleanup was done, which is good.

Reviewed-by: Nick Desaulniers 


Re: [PATCH] KVM: VMX: check match table

2017-10-03 Thread Nick Desaulniers
- avi

I think an extra `did not` made it in to that last email...sorry!

On Tue, Oct 3, 2017 at 7:54 PM, Nick Desaulniers
 wrote:
> On Tue, Oct 03, 2017 at 11:42:18AM +0200, Paolo Bonzini wrote:
>> On 01/10/2017 01:22, Nick Desaulniers wrote:
>> > I don't follow (but I also don't know what any of these three letter
>> > acryonyms acronyms stand for), does svm depend on vmx or vice-versa?
>> Neither, one is Intel (VMX), the other is AMD (SVM).
>
> Oh, neat, did not realize the vendors did not have different names for
> their virtualization extensions.
>
> https://rtfmp.com/2016/03/21/what-does-vmx-svm-cpu-flags-mean/
>
>> Would this work for you?
>
> It doesn't apply cleanly on Linus' tree, or the KVM tree master branch,
> so I couldn't fully test it.  But it does look like it will do the
> trick.
>
>> And again, is this only with clang?
>
> Indeed the warning was coming from Clang, but looks like some
> additional cleanup was done, which is good.
>
> Reviewed-by: Nick Desaulniers 


Re: [PATCH 6/7] gpio: brcmstb: consolidate interrupt domains

2017-10-03 Thread Gregory Fong
Hi Doug,

On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
> The GPIOLIB IRQ chip helpers were very appealing, but badly broke
> the 1:1 mapping between a GPIO controller's device_node and its
> interrupt domain.

Out of curiosity, what sort of problems have you seen from this?

>
> This commit consolidates the per bank irq domains to a version
> where we have one larger interrupt domain per GPIO controller
> instance spanning multiple GPIO banks.

This works (and is reminiscent to my initially submitted
implementation at [1]), but I think it might make sense to keep as-is
(using the gpiolib irqchip helpers), and instead allocate an irqchip
fwnode per bank and use to_of_node() to set it as the of_node for the
gpiochip before calling gpiochip_irqchip_add().  OTOH, that capability
might go away...

Linus, can you comment on the FIXME in gpiochip_irqchip_add_key() that
says "get rid of this and use gpiochip->parent->of_node everywhere"?
It seems like it would still be beneficial to be able to override the
associated node for a gpiochip, since that's what's used for the
irqdomain, but if that's going away, obviously we don't want to start
using that now.

Thanks,
Gregory

[1] https://patchwork.kernel.org/patch/6347811/


Re: [RFC PATCH 1/2] kbuild: Pass HOSTCC and similar to tools Makefiles

2017-10-03 Thread Guenter Roeck
On Tue, Oct 3, 2017 at 1:48 PM, Douglas Anderson  wrote:
> The main Linux Makefiles and the tools sub-Makefiles have different
> conventions for passing in CC / CFLAGS.
>
> Here's brief summary for the kernel:
> * CC: target C compiler (must be passed as an argument to make to
>   override)
> * HOSTCC: host C compiler (must be passed as an argument to make to
>   override)
> * CFLAGS: ignored (kernel manages its own CFLAGS)
> * KCFLAGS: extra cflags for the target (expected as an env var)
> * HOSTCFLAGS: host C compiler flags (not modifiable by the caller of
>   make)
>
> Here's a brief summary for the tools:
> * CC: host C compiler (must be passed as an argument to make to
>   override)
> * CFLAGS: base arguments for the host C compiler which are added to by
>   sub builds (expected as an env var)
>
> When the main kernel Makefile calls into the tools Makefile, it should
> adapt things from its syntax to the tools Makefile syntax.
>
> If we don't do this, we have a few issues:
> * If someone wants to user another compiler (like clang, maybe) for
>   hostcc then the tools will still be compiled with gcc.
> * If you happen to be building and you left ${CFLAGS} set to something
>   random (maybe it matches ${CC}, the _target_ C compiler, not the
>   _host_ C compiler) then this random value will be used to compile
>   the tools.
>
> In any case, it seems like it makes sense to pass CC, CFLAGS, and
> similar properly into the tools Makefile.
>
> NOTE: in order to do this properly, we need to add some new
> definitions of HOSTAS and HOSTLD into the main kernel Makefile.  If we
> don't do this and someone overrides "AS" or "LD" on the command line
> then those (which were intended for the target) will be used to build
> host side tools.  We also make up some imaginary "HOSTASFLAGS" and
> "HOSTLDFLAGS".  Someone could specify these, but if they don't at
> least these blank variables will properly clobber ASFLAGS and LDFLAGS
> from the kernel.
>
> This was discovered in the Chrome OS build system where CFLAGS (an
> environment variable) was accidentally left pointing to flags that
> would be an appropriate match to CC.  The kernel didn't use these
> CFLAGS so it was never an issue.  Turning on STACKVALIDATION, however,
> suddenly invoked a tools build that blew up.
>
> Signed-off-by: Douglas Anderson 

Still not sure if I'd want to go that far. Just clearing out CFLAGS
would have fixed my problem, and I start to realize that I am more of
a minimalist when it comes to kernel changes.

Anyway, not that it means much,

Reviewed-by: Guenter Roeck 
Tested-by: Guenter Roeck 

[ I didn't test all possible flag combinations, only a few of them,
but I did make sure that "make allmodconfig; CFLAGS=-junk make" no
longer fails. ]

Guenter

> ---
>
>  Makefile | 30 --
>  1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index cf007a31d575..0d3af0677d88 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -298,7 +298,9 @@ HOST_LFS_CFLAGS := $(shell getconf LFS_CFLAGS)
>  HOST_LFS_LDFLAGS := $(shell getconf LFS_LDFLAGS)
>  HOST_LFS_LIBS := $(shell getconf LFS_LIBS)
>
> +HOSTAS   = as
>  HOSTCC   = gcc
> +HOSTLD   = ld
>  HOSTCXX  = g++
>  HOSTCFLAGS   := -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 \
> -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS)
> @@ -1616,11 +1618,35 @@ image_name:
>  # Clear a bunch of variables before executing the submake
>  tools/: FORCE
> $(Q)mkdir -p $(objtree)/tools
> -   $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% 
> -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/
> +   $(Q)ASFLAGS="$(HOSTASFLAGS)"\
> +   LDFLAGS="$(HOSTLDFLAGS)" \
> +   CFLAGS="$(HOSTCFLAGS)" \
> +   CXXFLAGS="$(HOSTCXXFLAGS)" \
> +   $(MAKE) \
> +   AS="$(HOSTAS)" \
> +   CC="$(HOSTCC)" \
> +   CXX="$(HOSTCXX)" \
> +   LD="$(HOSTLD)" \
> +   MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \
> +   O=$(abspath $(objtree)) \
> +   subdir=tools \
> +   -C $(src)/tools/
>
>  tools/%: FORCE
> $(Q)mkdir -p $(objtree)/tools
> -   $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% 
> -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/ $*
> +   $(Q)ASFLAGS="$(HOSTASFLAGS)"\
> +   LDFLAGS="$(HOSTLDFLAGS)" \
> +   CFLAGS="$(HOSTCFLAGS)" \
> +   CXXFLAGS="$(HOSTCXXFLAGS)" \
> +   $(MAKE) \
> +   AS="$(HOSTAS)" \
> +   CC="$(HOSTCC)" \
> +   CXX="$(HOSTCXX)" \
> +   LD="$(HOSTLD)" \
> +   MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \
> +   O=$(abspath $(objtree)) \
> +   subdir=tools \
> +   -C $(src)/tools/ $*
>
>  # Single targets
>

Re: [RFC PATCH 2/2] tools build: Take CC, AS, and LD from the command line

2017-10-03 Thread Guenter Roeck
On Tue, Oct 3, 2017 at 1:48 PM, Douglas Anderson  wrote:
> If the top-level tools build is called to build a tool and is passed
> CC, AS, or LD on the command line then someone wants to use a
> different compiler, assembler, or linker.  Let's honor that.
>
> This together with the change ("kbuild: Pass HOSTCC and similar to
> tools Makefiles") allows us to properly get the HOST C Compiler used
> when the main kernel Makefile calls into the tools.
>
> Signed-off-by: Douglas Anderson 
> ---
>
>  tools/scripts/Makefile.include | 22 --
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/tools/scripts/Makefile.include b/tools/scripts/Makefile.include
> index 9dc8f078a83c..00261848ec54 100644
> --- a/tools/scripts/Makefile.include
> +++ b/tools/scripts/Makefile.include
> @@ -10,6 +10,20 @@ endif
>  endif
>  endif
>
> +SUBMAKE_ARGS :=
> +
> +# If CC, LD, or AR were passed on the command line, pass them through to the
> +# command line of the sub-tools.
> +ifeq ($(origin CC), command line)
> +   SUBMAKE_ARGS += "CC=$(CC)"
> +endif
> +ifeq ($(origin LD), command line)
> +   SUBMAKE_ARGS += "LD=$(LD)"
> +endif
> +ifeq ($(origin AR), command line)
> +   SUBMAKE_ARGS += "AR=$(ar)"

s/ar/AR/ ?

> +endif
> +
>  # check that the output directory actually exists
>  ifneq ($(OUTPUT),)
>  OUTDIR := $(realpath $(OUTPUT))
> @@ -71,7 +85,9 @@ endif
>  #
>  descend = \
> +mkdir -p $(OUTPUT)$(1) && \
> -   $(MAKE) $(COMMAND_O) subdir=$(if $(subdir),$(subdir)/$(1),$(1)) 
> $(PRINT_DIR) -C $(1) $(2)
> +   $(MAKE) $(SUBMAKE_ARGS) $(COMMAND_O) \
> +   subdir=$(if $(subdir),$(subdir)/$(1),$(1)) \
> +   $(PRINT_DIR) -C $(1) $(2)
>
>  QUIET_SUBDIR0  = +$(MAKE) $(COMMAND_O) -C # space to separate -C and subdir
>  QUIET_SUBDIR1  =
> @@ -94,7 +110,9 @@ ifneq ($(silent),1)
> descend = \
> +@echo '  DESCEND  '$(1); \
> mkdir -p $(OUTPUT)$(1) && \
> -   $(MAKE) $(COMMAND_O) subdir=$(if 
> $(subdir),$(subdir)/$(1),$(1)) $(PRINT_DIR) -C $(1) $(2)
> +   $(MAKE) $(SUBMAKE_ARGS) $(COMMAND_O) \
> +   subdir=$(if $(subdir),$(subdir)/$(1),$(1)) \
> +   $(PRINT_DIR) -C $(1) $(2)
>
> QUIET_CLEAN= @printf '  CLEAN%s\n' $1;
> QUIET_INSTALL  = @printf '  INSTALL  %s\n' $1;
> --
> 2.14.2.920.gcf0c67979c-goog
>


[PATCH v2] tools/kvm_stat: Add Python 3 support to kvm_stat

2017-10-03 Thread Jeremy Cline
Make kvm_stat support Python 3 by changing the use of "print" to a
function rather than a statement, switching from "iteritems" and
"iterkeys" (removed in Python 3) to "items" and "keys" respectively,
and decoding bytes to strings when dealing with text.

With this change, kvm_stat is usable with Python 2.6 and greater.

Signed-off-by: Jeremy Cline 
---
There were a few flaws in my original patch that changed the behavior in
Python 2. Additionally, I missed several compatibility problems with
Python 3. Sorry!

Changes in v2:
 - Include the "end=' '" argument in print statements that previously
   had trailing commas since in Python 2 that causes the print
   statement to not print a newline.
 - decode bytes strings to text strings so that string concatenation
   works properly in Python 3.
 - replace "iterkeys" with "keys" as "iterkeys" is not in Python 3.

 tools/kvm/kvm_stat/kvm_stat | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/tools/kvm/kvm_stat/kvm_stat b/tools/kvm/kvm_stat/kvm_stat
index 32283d88701a..217cf6f95c36 100755
--- a/tools/kvm/kvm_stat/kvm_stat
+++ b/tools/kvm/kvm_stat/kvm_stat
@@ -19,9 +19,11 @@ Three different ways of output formatting are available:
 
 The data is sampled from the KVM's debugfs entries and its perf events.
 """
+from __future__ import print_function
 
 import curses
 import sys
+import locale
 import os
 import time
 import optparse
@@ -225,6 +227,8 @@ IOCTL_NUMBERS = {
 'RESET':   0x2403,
 }
 
+ENCODING = locale.getpreferredencoding(False)
+
 
 class Arch(object):
 """Encapsulates global architecture specific data.
@@ -666,7 +670,7 @@ class TracepointProvider(Provider):
 """Returns 'event name: current value' for all enabled events."""
 ret = defaultdict(int)
 for group in self.group_leaders:
-for name, val in group.read().iteritems():
+for name, val in group.read().items():
 if name in self._fields:
 ret[name] += val
 return ret
@@ -955,7 +959,7 @@ class Tui(object):
 except:
 raise Exception
 for line in child.stdout:
-line = line.lstrip().split(' ', 1)
+line = line.decode(ENCODING).lstrip().split(' ', 1)
 # perform a sanity check before calling the more expensive
 # function to possibly extract the guest name
 if ' -name ' in line[1]:
@@ -1005,7 +1009,7 @@ class Tui(object):
 name = ''
 try:
 line = open('/proc/{}/cmdline'
-.format(pid), 'rb').read().split('\0')
+.format(pid), 'r').read().split('\0')
 parms = line[line.index('-name') + 1].split(',')
 while '' in parms:
 # commas are escaped (i.e. ',,'), hence e.g. 'foo,bar' results
@@ -1170,7 +1174,7 @@ class Tui(object):
.format(self.stats.fields_filter))
 self.screen.addstr(3, 0, "New regex: ")
 curses.echo()
-regex = self.screen.getstr()
+regex = self.screen.getstr().decode(ENCODING)
 curses.noecho()
 if len(regex) == 0:
 self.stats.fields_filter = DEFAULT_REGEX
@@ -1204,7 +1208,7 @@ class Tui(object):
 
 curses.echo()
 self.screen.addstr(3, 0, "Pid [0 or pid]: ")
-pid = self.screen.getstr()
+pid = self.screen.getstr().decode(ENCODING)
 curses.noecho()
 
 try:
@@ -1233,7 +1237,7 @@ class Tui(object):
 self.screen.addstr(2, 0, 'Change delay from %.1fs to ' %
self._delay_regular)
 curses.echo()
-val = self.screen.getstr()
+val = self.screen.getstr().decode(ENCODING)
 curses.noecho()
 
 try:
@@ -1273,7 +1277,7 @@ class Tui(object):
 self.print_all_gnames(7)
 curses.echo()
 self.screen.addstr(3, 0, "Guest [ENTER or guest]: ")
-gname = self.screen.getstr()
+gname = self.screen.getstr().decode(ENCODING)
 curses.noecho()
 
 if not gname:
@@ -1369,25 +1373,25 @@ def batch(stats):
 s = stats.get()
 for key in sorted(s.keys()):
 values = s[key]
-print '%-42s%10d%10d' % (key, values[0], values[1])
+print('%-42s%10d%10d' % (key, values[0], values[1]))
 except KeyboardInterrupt:
 pass
 
 
 def log(stats):
 """Prints statistics as reiterating key block, multiple value blocks."""
-keys = sorted(stats.get().iterkeys())
+keys = sorted(stats.get().keys())
 
 def banner():
 for k in keys:
-print '%s' % k,
-print
+print(k, end=' ')
+print()
 
 def statline():
 s = stats.get()
 for k in keys:
-print ' %9d' % s[k][1],
-print
+

Re: [PATCH 2/7] gpio: brcmstb: release the bgpio lock during irq handlers

2017-10-03 Thread Gregory Fong
On Tue, Oct 3, 2017 at 7:09 PM, Doug Berger  wrote:
> On 10/03/2017 06:55 PM, Gregory Fong wrote:
>> Hi Doug,
>>
>> On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
>>> The basic memory-mapped GPIO controller lock must be released
>>> before calling the registered GPIO interrupt handlers to allow
>>> the interrupt handlers to access the hardware.  Otherwise, the
>>> hardware accesses will deadlock when they attempt to grab the
>>> lock.
>>
>> I was having some trouble understanding exactly what the problem was
>> here, but I think I see it now.  Since this locks the entire bank,
>> where some GPIOs might be set as inputs and some as inputs (and
>> interrupt sources), then an interrupt on a GPIO that is supposed to
>> set another GPIO in the bank would result in deadlock.  Is that
>> correct?  If so, please update the commit message to make that clear,
>> and nice fix.  If not that, it would be nice to know what scenario can
>> cause a problem.
>
> That is an example, but there are really many possibilities.
>
> Basically, if a registered interrupt handler wants to access its GPIO
> you are likely to run into trouble.  Another example might be an
> interrupt that is configured to trigger on either edge transition and
> the handler wants to know whether the input is currently high or low.
>
> I can submit a V2 with a change in the description if you would like,
> but I'm not sure what the clearest example would be.

If you could just mention both of these possible cases, that would be
great! With that change,

Acked-by: Gregory Fong 


Re: [PATCH 4.13 103/110] platform/x86: fujitsu-laptop: Dont oops when FUJ02E3 is not presnt

2017-10-03 Thread Jonathan Woithe
On Tue, Oct 03, 2017 at 05:27:25PM -0700, Darren Hart wrote:
> On Wed, Oct 04, 2017 at 08:39:34AM +1030, Jonathan Woithe wrote:
> > :
> > Reviewed-by: Jonathan Woithe 
> 
> Thanks for the due diligence here Jonathan, but for stable backport
> announcements, you don't need to speak up unless you have an objection.

No worries; thanks for the info.

jonathan


Re: [PATCH 5/7] gpio: brcmstb: enable masking of interrupts when changing type

2017-10-03 Thread Gregory Fong
On Tue, Oct 3, 2017 at 7:22 PM, Doug Berger  wrote:
> On 10/03/2017 07:10 PM, Gregory Fong wrote:
>> Hi Doug,
>>
>> On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger  wrote:
>>> Mask the GPIO interrupt while its type is being changed, just in case
>>> it can prevent a spurious interrupt.
>>
>> "Just in case"?  I don't have access to hardware documentation for
>> this anymore, but I'd expect to some stronger claim that the hardware
>> actually requires masking before changing the trigger type.  If you
>> can quote documentation for this or explain an actual problem seen,
>> that would be good.
>
> Well spotted ;).  This was a protectionist change added at the request
> of a user of the driver.  I believe that it is superfluous and I suppose
> that belief leaked through in the language of my comment.
>
> In actuality, the GPIO APIs don't make provision for clearing GPIO
> interrupts before enabling them so this masking is really only a
> deferral of a spurious interrupt if one is triggered by changes in the
> hardware programming.

Indeed.  If the hardware is behaving in unexpected ways that would be
a whole different kind of problem.

>
> I can strike this from the upstream submission and save a couple lines
> of source code (after IRQCHIP_MASK_ON_SUSPEND goes away) if you prefer.
>

Yes, please remove.

Thanks,
Gregory


Re: [RFC PATCH 1/2] kbuild: Pass HOSTCC and similar to tools Makefiles

2017-10-03 Thread Masahiro Yamada
2017-10-04 5:48 GMT+09:00 Douglas Anderson :
> The main Linux Makefiles and the tools sub-Makefiles have different
> conventions for passing in CC / CFLAGS.
>
> Here's brief summary for the kernel:
> * CC: target C compiler (must be passed as an argument to make to
>   override)
> * HOSTCC: host C compiler (must be passed as an argument to make to
>   override)
> * CFLAGS: ignored (kernel manages its own CFLAGS)
> * KCFLAGS: extra cflags for the target (expected as an env var)
> * HOSTCFLAGS: host C compiler flags (not modifiable by the caller of
>   make)
>
> Here's a brief summary for the tools:
> * CC: host C compiler (must be passed as an argument to make to
>   override)
> * CFLAGS: base arguments for the host C compiler which are added to by
>   sub builds (expected as an env var)
>
> When the main kernel Makefile calls into the tools Makefile, it should
> adapt things from its syntax to the tools Makefile syntax.
>
> If we don't do this, we have a few issues:
> * If someone wants to user another compiler (like clang, maybe) for
>   hostcc then the tools will still be compiled with gcc.
> * If you happen to be building and you left ${CFLAGS} set to something
>   random (maybe it matches ${CC}, the _target_ C compiler, not the
>   _host_ C compiler) then this random value will be used to compile
>   the tools.
>
> In any case, it seems like it makes sense to pass CC, CFLAGS, and
> similar properly into the tools Makefile.
>
> NOTE: in order to do this properly, we need to add some new
> definitions of HOSTAS and HOSTLD into the main kernel Makefile.  If we
> don't do this and someone overrides "AS" or "LD" on the command line
> then those (which were intended for the target) will be used to build
> host side tools.  We also make up some imaginary "HOSTASFLAGS" and
> "HOSTLDFLAGS".  Someone could specify these, but if they don't at
> least these blank variables will properly clobber ASFLAGS and LDFLAGS
> from the kernel.
>
> This was discovered in the Chrome OS build system where CFLAGS (an
> environment variable) was accidentally left pointing to flags that
> would be an appropriate match to CC.  The kernel didn't use these
> CFLAGS so it was never an issue.  Turning on STACKVALIDATION, however,
> suddenly invoked a tools build that blew up.
>
> Signed-off-by: Douglas Anderson 
> ---
>
>  Makefile | 30 --
>  1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index cf007a31d575..0d3af0677d88 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -298,7 +298,9 @@ HOST_LFS_CFLAGS := $(shell getconf LFS_CFLAGS)
>  HOST_LFS_LDFLAGS := $(shell getconf LFS_LDFLAGS)
>  HOST_LFS_LIBS := $(shell getconf LFS_LIBS)
>
> +HOSTAS   = as
>  HOSTCC   = gcc
> +HOSTLD   = ld
>  HOSTCXX  = g++
>  HOSTCFLAGS   := -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 \
> -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS)
> @@ -1616,11 +1618,35 @@ image_name:
>  # Clear a bunch of variables before executing the submake
>  tools/: FORCE
> $(Q)mkdir -p $(objtree)/tools
> -   $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% 
> -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/
> +   $(Q)ASFLAGS="$(HOSTASFLAGS)"\
> +   LDFLAGS="$(HOSTLDFLAGS)" \
> +   CFLAGS="$(HOSTCFLAGS)" \
> +   CXXFLAGS="$(HOSTCXXFLAGS)" \
> +   $(MAKE) \
> +   AS="$(HOSTAS)" \
> +   CC="$(HOSTCC)" \
> +   CXX="$(HOSTCXX)" \
> +   LD="$(HOSTLD)" \
> +   MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \
> +   O=$(abspath $(objtree)) \
> +   subdir=tools \
> +   -C $(src)/tools/
>
>  tools/%: FORCE
> $(Q)mkdir -p $(objtree)/tools
> -   $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% 
> -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/ $*
> +   $(Q)ASFLAGS="$(HOSTASFLAGS)"\
> +   LDFLAGS="$(HOSTLDFLAGS)" \
> +   CFLAGS="$(HOSTCFLAGS)" \
> +   CXXFLAGS="$(HOSTCXXFLAGS)" \
> +   $(MAKE) \
> +   AS="$(HOSTAS)" \
> +   CC="$(HOSTCC)" \
> +   CXX="$(HOSTCXX)" \
> +   LD="$(HOSTLD)" \
> +   MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \
> +   O=$(abspath $(objtree)) \
> +   subdir=tools \
> +   -C $(src)/tools/ $*
>

Please do not do this.



CC: for building tools that run on the target machine
HOSTCC: for building tools that run on the build machine


IMHO, this should be consistent to avoid confusion.


Grepping CROSS_COMPILE under tools/ directory,
I see many Makefile expect CC for the target system.

For ex, tools/croup/Makefile

CC = $(CROSS_COMPILE)gcc



Why don't you fix tools/objtool/Makefile ?





-- 
Best Regards
Masahiro Yamada


Re: [PATCH 1/8] ARM: dts: aspeed: Move pinctrl subnodes to improve readability

2017-10-03 Thread Andrew Jeffery
On Thu, 2017-09-28 at 17:21 +0930, Joel Stanley wrote:
> > From: Andrew Jeffery 
> 
> Moving the subnodes out of the pinctrl node declaration to a reference
> allows easier access to the remaining parts of the devicetree.
> 
> > Signed-off-by: Andrew Jeffery 
> > Reviewed-by: Xo Wang 
> > Signed-off-by: Joel Stanley 
> ---
>  arch/arm/boot/dts/aspeed-g4.dtsi | 1483 ++--
>  arch/arm/boot/dts/aspeed-g5.dtsi | 1549 
> +++---
>  2 files changed, 1518 insertions(+), 1514 deletions(-)

I hate the way the diff came out for the g5, it's almost impossible to
read. Also not sure how useful it is to ack my own patch, but given the
rebasing it has suffered:

Acked-by: Andrew Jeffery 

Cheers,

Andrew

signature.asc
Description: This is a digitally signed message part


Re: [PATCH 2/8] ARM: dts: aspeed: Reorder ADC node

2017-10-03 Thread Andrew Jeffery
On Thu, 2017-09-28 at 17:21 +0930, Joel Stanley wrote:
> We try to keep the nodes in address order. The ADC node was out of
> place.
> 
> Signed-off-by: Joel Stanley 
> ---
>  arch/arm/boot/dts/aspeed-g4.dtsi | 16 
>  arch/arm/boot/dts/aspeed-g5.dtsi | 16 
>  2 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi 
> b/arch/arm/boot/dts/aspeed-g4.dtsi
> index 1edd0cee6221..a4579498fc25 100644
> --- a/arch/arm/boot/dts/aspeed-g4.dtsi
> +++ b/arch/arm/boot/dts/aspeed-g4.dtsi
> @@ -129,6 +129,14 @@
>   };
>   };
>  
> + adc: adc@1e6e9000 {
> + compatible = "aspeed,ast2400-adc";
> + reg = <0x1e6e9000 0xb0>;
> + clocks = <&syscon ASPEED_CLK_APB>;

We can't do this yet as the clk driver isn't yet merged, and it breaks from the
"just move the node" description in the commit message.

> + #io-channel-cells = <1>;
> + status = "disabled";
> + };
> +
>   sram@1e72 {
>   compatible = "mmio-sram";
>   reg = <0x1e72 0x8000>;  // 32K
> @@ -227,14 +235,6 @@
>   no-loopback-test;
>   status = "disabled";
>   };
> -
> - adc: adc@1e6e9000 {
> - compatible = "aspeed,ast2400-adc";
> - reg = <0x1e6e9000 0xb0>;
> - clocks = <&clk_apb>;
> - #io-channel-cells = <1>;
> - status = "disabled";
> - };
>   };
>   };
>  };
> diff --git a/arch/arm/boot/dts/aspeed-g5.dtsi 
> b/arch/arm/boot/dts/aspeed-g5.dtsi
> index f56dd67efa50..f6430b313f90 100644
> --- a/arch/arm/boot/dts/aspeed-g5.dtsi
> +++ b/arch/arm/boot/dts/aspeed-g5.dtsi
> @@ -173,6 +173,14 @@
>   reg-io-width = <4>;
>   };
>  
> + adc: adc@1e6e9000 {
> + compatible = "aspeed,ast2500-adc";
> + reg = <0x1e6e9000 0xb0>;

Did you intend to change the size cell value here? It now matches the g4, but
there was an explicit comment about the size for some reason in the -hunk
below. It's probably worth an explicit call-out if we're going to change it. 

> + clocks = <&syscon ASPEED_CLK_APB>;

See the clk comment on the g4 diff.

Cheers,

Andrew

> + #io-channel-cells = <1>;
> + status = "disabled";
> + };
> +
>   sram@1e72 {
>   compatible = "mmio-sram";
>   reg = <0x1e72 0x9000>;  // 36K
> @@ -307,14 +315,6 @@
>   no-loopback-test;
>   status = "disabled";
>   };
> -
> - adc: adc@1e6e9000 {
> - compatible = "aspeed,ast2500-adc";
> - reg = <0x1e6e9000 0xb0>;
> - clocks = <&clk_apb>;
> - #io-channel-cells = <1>;
> - status = "disabled";
> - };
>   };
>   };
>  };

signature.asc
Description: This is a digitally signed message part


[PATCH] kconfig: Document the 'menu' struct

2017-10-03 Thread Ulf Magnusson
Understanding what it represents helps a lot when reading the code, and
it's not obvious, so document it.

The ROOT_MENU flag is only set and tested by the gconf and qconf front
ends, so leave it undocumented here. The obvious guess for what it means
is correct.

Signed-off-by: Ulf Magnusson 
---
 scripts/kconfig/expr.h | 45 +
 1 file changed, 45 insertions(+)

diff --git a/scripts/kconfig/expr.h b/scripts/kconfig/expr.h
index a73f762..ae7583c 100644
--- a/scripts/kconfig/expr.h
+++ b/scripts/kconfig/expr.h
@@ -166,22 +166,67 @@ struct property {
for (st = sym->prop; st; st = st->next) \
if (st->text)
 
+/*
+ * Represents a node in the menu tree, as seen in e.g. menuconfig (though used
+ * for all front ends). Each symbol, menu, etc. defined in the Kconfig files
+ * gets a node. A symbol defined in multiple locations gets one node at each
+ * location.
+ */
 struct menu {
+   /* The next menu node at the same level */
struct menu *next;
+
+   /* The parent menu node, corresponding to e.g. a menu or choice */
struct menu *parent;
+
+   /* The first child menu node, for e.g. menus and choices */
struct menu *list;
+
+   /*
+* The symbol associated with the menu node. Choices are implemented as
+* a special kind of symbol. NULL for menus, comments, and ifs.
+*/
struct symbol *sym;
+
+   /*
+* The prompt associated with the node. This holds the prompt for a
+* symbol as well as the text for a menu or comment, along with the
+* type (P_PROMPT, P_MENU, etc.)
+*/
struct property *prompt;
+
+   /*
+* 'visible if' dependencies. If more than one is given, they will be
+* ANDed together.
+*/
struct expr *visibility;
+
+   /*
+* Ordinary dependencies from e.g. 'depends on' and 'if', ANDed
+* together
+*/
struct expr *dep;
+
+   /* MENU_* flags */
unsigned int flags;
+
+   /* Any help text associated with the node */
char *help;
+
+   /* The location where the menu node appears in the Kconfig files */
struct file *file;
int lineno;
+
+   /* For use by front ends that need to store auxiliary data */
void *data;
 };
 
+/*
+ * Set on a menu node when the corresponding symbol changes state in some way.
+ * Can be checked by front ends.
+ */
 #define MENU_CHANGED   0x0001
+
 #define MENU_ROOT  0x0002
 
 struct jump_key {
-- 
2.7.4



[PATCH v9 20/29] x86/insn-eval: Add wrapper function for 32 and 64-bit addresses

2017-10-03 Thread Ricardo Neri
The function insn_get_addr_ref() is capable of handling only 64-bit
addresses. A previous commit introduced a function to handle 32-bit
addresses. Invoke these two functions from a third wrapper function that
calls the appropriate routine based on the address size specified in the
instruction structure (obtained by looking at the code segment default
address size and the address override prefix, if present).

While doing this, rename the original function insn_get_addr_ref() with
the more appropriate name get_addr_ref_64(), ensure it is only used
for 64-bit addresses.

Also, since 64-bit addresses are not possible in 32-bit builds, provide
a dummy function such case.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/lib/insn-eval.c | 57 +++-
 1 file changed, 52 insertions(+), 5 deletions(-)

diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 945c9b7..1d510a6 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -872,12 +872,28 @@ static void __user *get_addr_ref_32(struct insn *insn, 
struct pt_regs *regs)
return (void __user *)linear_addr;
 }
 
-/*
- * return the address being referenced be instruction
- * for rm=3 returning the content of the rm reg
- * for rm!=3 calculates the address using SIB and Disp
+/**
+ * get_addr_ref_64() - Obtain a 64-bit linear address
+ * @insn:  Instruction struct with ModRM and SIB bytes and displacement
+ * @regs:  Structure with register values as seen when entering kernel mode
+ *
+ * This function is to be used with 64-bit address encodings to obtain the
+ * linear memory address referred by the instruction's ModRM, SIB,
+ * displacement bytes and segment base address, as applicable.
+ *
+ * Returns:
+ *
+ * Linear address referenced by instruction and registers on success.
+ *
+ * -1L on error.
  */
-void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
+#ifndef CONFIG_X86_64
+static void __user *get_addr_ref_64(struct insn *insn, struct pt_regs *regs)
+{
+   return (void __user *)-1L;
+}
+#else
+static void __user *get_addr_ref_64(struct insn *insn, struct pt_regs *regs)
 {
int addr_offset, base_offset, indx_offset, seg_reg_indx;
unsigned long linear_addr = -1L, seg_base_addr;
@@ -888,6 +904,9 @@ void __user *insn_get_addr_ref(struct insn *insn, struct 
pt_regs *regs)
insn_get_sib(insn);
sib = insn->sib.value;
 
+   if (insn->addr_bytes != 8)
+   goto out;
+
if (X86_MODRM_MOD(insn->modrm.value) == 3) {
addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
if (addr_offset < 0)
@@ -971,3 +990,31 @@ void __user *insn_get_addr_ref(struct insn *insn, struct 
pt_regs *regs)
 out:
return (void __user *)linear_addr;
 }
+#endif /* CONFIG_X86_64 */
+
+/**
+ * insn_get_addr_ref() - Obtain the linear address referred by instruction
+ * @insn:  Instruction structure containing ModRM byte and displacement
+ * @regs:  Structure with register values as seen when entering kernel mode
+ *
+ * Obtain the linear address referred by the instruction's ModRM, SIB and
+ * displacement bytes, and segment base, as applicable. In protected mode,
+ * segment limits are enforced.
+ *
+ * Returns:
+ *
+ * Linear address referenced by instruction and registers on success.
+ *
+ * -1L on error.
+ */
+void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
+{
+   switch (insn->addr_bytes) {
+   case 4:
+   return get_addr_ref_32(insn, regs);
+   case 8:
+   return get_addr_ref_64(insn, regs);
+   default:
+   return (void __user *)-1L;
+   }
+}
-- 
2.7.4



[PATCH v9 03/29] ptrace,x86: Make user_64bit_mode() available to 32-bit builds

2017-10-03 Thread Ricardo Neri
In its current form, user_64bit_mode() can only be used when CONFIG_X86_64
is selected. This implies that code built with CONFIG_X86_64=n cannot use
it. If a piece of code needs to be built for both CONFIG_X86_64=y and
CONFIG_X86_64=n and wants to use this function, it needs to wrap it in
an #ifdef/#endif; potentially, in multiple places.

This can be easily avoided with a single #ifdef/#endif pair within
user_64bit_mode() itself.

Suggested-by: Borislav Petkov 
Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Reviewed-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/ptrace.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 91c04c8..e2afbf6 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -135,9 +135,9 @@ static inline int v8086_mode(struct pt_regs *regs)
 #endif
 }
 
-#ifdef CONFIG_X86_64
 static inline bool user_64bit_mode(struct pt_regs *regs)
 {
+#ifdef CONFIG_X86_64
 #ifndef CONFIG_PARAVIRT
/*
 * On non-paravirt systems, this is the only long mode CPL 3
@@ -148,8 +148,12 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
/* Headers are too twisted for this to go in paravirt.h. */
return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;
 #endif
+#else /* !CONFIG_X86_64 */
+   return false;
+#endif
 }
 
+#ifdef CONFIG_X86_64
 #define current_user_stack_pointer()   current_pt_regs()->sp
 #define compat_user_stack_pointer()current_pt_regs()->sp
 #endif
-- 
2.7.4



[PATCH v9 19/29] x86/insn-eval: Add support to resolve 32-bit address encodings

2017-10-03 Thread Ricardo Neri
32-bit and 64-bit address encodings are identical. Thus, the same logic
could be used to resolve the effective address. However, there are two key
differences: address size and enforcement of segment limits.

If running a 32-bit process on a 64-bit kernel, it is best to perform
the address calculation using 32-bit data types. In this manner hardware
is used for the arithmetic, including handling of signs and overflows.

32-bit addresses are generally used in protected mode; segment limits are
enforced in this mode. This implementation obtains the limit of the
segment associated with the instruction operands and prefixes. If the
computed address is outside the segment limits, an error is returned. It
is also possible to use 32-bit address in long mode and virtual-8086 mode
by using an address override prefix. In such cases, segment limits are not
enforced.

The new function get_addr_ref_32() is almost identical to the existing
function insn_get_addr_ref() (used for 64-bit addresses); except for the
differences mentioned above. For the sake of simplicity and readability,
it is better to use two separate functions.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/lib/insn-eval.c | 160 +++
 1 file changed, 160 insertions(+)

diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index b3aa891..945c9b7 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -712,6 +712,166 @@ int insn_get_modrm_rm_off(struct insn *insn, struct 
pt_regs *regs)
return get_reg_offset(insn, regs, REG_TYPE_RM);
 }
 
+/**
+ * get_addr_ref_32() - Obtain a 32-bit linear address
+ * @insn:  Instruction with ModRM, SIB bytes and displacement
+ * @regs:  Register values as seen when entering kernel mode
+ *
+ * This function is to be used with 32-bit address encodings to obtain the
+ * linear memory address referred by the instruction's ModRM, SIB,
+ * displacement bytes and segment base address, as applicable. If in protected
+ * mode, segment limits are enforced.
+ *
+ * Returns:
+ *
+ * Linear address referenced by instruction and registers on success.
+ *
+ * -1L on error.
+ */
+static void __user *get_addr_ref_32(struct insn *insn, struct pt_regs *regs)
+{
+   int eff_addr, base, indx, addr_offset, base_offset, indx_offset;
+   unsigned long linear_addr = -1L, seg_base_addr, seg_limit, tmp;
+   int seg_reg_indx;
+   insn_byte_t sib;
+
+   insn_get_modrm(insn);
+   insn_get_sib(insn);
+   sib = insn->sib.value;
+
+   if (insn->addr_bytes != 4)
+   goto out;
+
+   if (X86_MODRM_MOD(insn->modrm.value) == 3) {
+   addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
+   if (addr_offset < 0)
+   goto out;
+
+   tmp = regs_get_register(regs, addr_offset);
+   /* The 4 most significant bytes must be zero. */
+   if (tmp & ~0xL)
+   goto out;
+
+   eff_addr = (int)(tmp & 0x);
+
+   seg_reg_indx = resolve_seg_reg(insn, regs, addr_offset);
+   if (seg_reg_indx < 0)
+   goto out;
+
+   seg_base_addr = insn_get_seg_base(regs, seg_reg_indx);
+   if (seg_base_addr == -1L)
+   goto out;
+
+   seg_limit = get_seg_limit(regs, seg_reg_indx);
+   } else {
+   if (insn->sib.nbytes) {
+   /*
+* Negative values in the base and index offset means
+* an error when decoding the SIB byte. Except -EDOM,
+* which means that the registers should not be used
+* in the address computation.
+*/
+   base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE);
+   if (base_offset == -EDOM) {
+   base = 0;
+   } else if (base_offset < 0) {
+   goto out;
+   } else {
+   tmp = regs_get_register(regs, base_offset);
+   /* The 4 most significant bytes must be zero. */
+   if (tmp & ~0xL)
+   goto out;
+
+   base = (int)(tmp & 0x);
+   }
+
+   indx_offset = get_reg_offset(insn, regs, 
REG_TYPE_INDEX);
+   if (indx_offset == -EDOM) {
+   indx = 0;
+   } else if (indx_offset 

[PATCH v9 28/29] selftests/x86: Add tests for User-Mode Instruction Prevention

2017-10-03 Thread Ricardo Neri
Certain user space programs that run on virtual-8086 mode may utilize
instructions protected by the User-Mode Instruction Prevention (UMIP)
security feature present in new Intel processors: SGDT, SIDT and SMSW. In
such a case, a general protection fault is issued if UMIP is enabled. When
such a fault happens, the kernel traps it and emulates the results of
these instructions with dummy values. The purpose of this new
test is to verify whether the impacted instructions can be executed
without causing such #GP. If no #GP exceptions occur, we expect to exit
virtual-8086 mode from INT3.

The instructions protected by UMIP are executed in representative use
cases:
 a) displacement-only memory addressing
 b) register-indirect memory addressing
 c) results stored directly in operands

Unfortunately, it is not possible to check the results against a set of
expected values because no emulation will occur in systems that do not
have the UMIP feature. Instead, results are printed for verification. A
simple verification is done to ensure that results of all tests are
identical.

Cc: Andy Lutomirski 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Chen Yucong 
Cc: Chris Metcalf 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Huang Rui 
Cc: Jiri Slaby 
Cc: Jonathan Corbet 
Cc: Michael S. Tsirkin 
Cc: Paul Gortmaker 
Cc: Peter Zijlstra 
Cc: Ravi V. Shankar 
Cc: Shuah Khan 
Cc: Vlastimil Babka 
Signed-off-by: Ricardo Neri 
---
 tools/testing/selftests/x86/entry_from_vm86.c | 73 ++-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/x86/entry_from_vm86.c 
b/tools/testing/selftests/x86/entry_from_vm86.c
index d075ea0..f7d9cea 100644
--- a/tools/testing/selftests/x86/entry_from_vm86.c
+++ b/tools/testing/selftests/x86/entry_from_vm86.c
@@ -95,6 +95,22 @@ asm (
"int3\n\t"
"vmcode_int80:\n\t"
"int $0x80\n\t"
+   "vmcode_umip:\n\t"
+   /* addressing via displacements */
+   "smsw (2052)\n\t"
+   "sidt (2054)\n\t"
+   "sgdt (2060)\n\t"
+   /* addressing via registers */
+   "mov $2066, %bx\n\t"
+   "smsw (%bx)\n\t"
+   "mov $2068, %bx\n\t"
+   "sidt (%bx)\n\t"
+   "mov $2074, %bx\n\t"
+   "sgdt (%bx)\n\t"
+   /* register operands, only for smsw */
+   "smsw %ax\n\t"
+   "mov %ax, (2080)\n\t"
+   "int3\n\t"
".size vmcode, . - vmcode\n\t"
"end_vmcode:\n\t"
".code32\n\t"
@@ -103,7 +119,7 @@ asm (
 
 extern unsigned char vmcode[], end_vmcode[];
 extern unsigned char vmcode_bound[], vmcode_sysenter[], vmcode_syscall[],
-   vmcode_sti[], vmcode_int3[], vmcode_int80[];
+   vmcode_sti[], vmcode_int3[], vmcode_int80[], vmcode_umip[];
 
 /* Returns false if the test was skipped. */
 static bool do_test(struct vm86plus_struct *v86, unsigned long eip,
@@ -160,6 +176,58 @@ static bool do_test(struct vm86plus_struct *v86, unsigned 
long eip,
return true;
 }
 
+void do_umip_tests(struct vm86plus_struct *vm86, unsigned char *test_mem)
+{
+   struct table_desc {
+   unsigned short limit;
+   unsigned long base;
+   } __attribute__((packed));
+
+   /* Initialize variables with arbitrary values */
+   struct table_desc gdt1 = { .base = 0x3c3c3c3c, .limit = 0x };
+   struct table_desc gdt2 = { .base = 0x1a1a1a1a, .limit = 0xaeae };
+   struct table_desc idt1 = { .base = 0x7b7b7b7b, .limit = 0xf1f1 };
+   struct table_desc idt2 = { .base = 0x89898989, .limit = 0x1313 };
+   unsigned short msw1 = 0x1414, msw2 = 0x2525, msw3 = 3737;
+
+   /* UMIP -- exit with INT3 unless kernel emulation did not trap #GP */
+   do_test(vm86, vmcode_umip - vmcode, VM86_TRAP, 3, "UMIP tests");
+
+   /* Results from displacement-only addressing */
+   msw1 = *(unsigned short *)(test_mem + 2052);
+   memcpy(&idt1, test_mem + 2054, sizeof(idt1));
+   memcpy(&gdt1, test_mem + 2060, sizeof(gdt1));
+
+   /* Results from register-indirect addressing */
+   msw2 = *(unsigned short *)(test_mem + 2066);
+   memcpy(&idt2, test_mem + 2068, sizeof(idt2));
+   memcpy(&gdt2, test_mem + 2074, sizeof(gdt2));
+
+   /* Results when using register operands */
+   msw3 = *(unsigned short *)(test_mem + 2080);
+
+   printf("[INFO]\tResult from SMSW:[0x%04x]\n", msw1);
+   printf("[INFO]\tResult from SIDT: limit[0x%04x]base[0x%08lx]\n",
+  idt1.limit, idt1.base);
+   printf("[INFO]\tResult from SGDT: limit[0x%04x]base[0x%08lx]\n",
+  gdt1.limit, gdt1.base);
+
+   if (msw1 != msw2 || msw1 != msw3)
+   printf("[FAIL]\tAll the results of SMSW should be the same.\n");
+   else
+   printf("[PASS]\tAll the results from SMSW are identical.\n");
+
+   if (memcmp(&gdt1, &gdt2, sizeof(gdt1)))
+   printf("[FAIL]\tAll the results of SGDT should be the same.\n");
+   else
+   printf("[PASS]\tAll the results from SG

[PATCH v9 29/29] selftests/x86: Add tests for instruction str and sldt

2017-10-03 Thread Ricardo Neri
The instructions str and sldt are not recognized when running on virtual-
8086 mode and generate an invalid operand exception. These two
instructions are protected by the Intel User-Mode Instruction Prevention
(UMIP) security feature. In protected mode, if UMIP is enabled, these
instructions generate a general protection fault if called from CPL > 0.
Linux traps the general protection fault and emulates the instructions
sgdt, sidt and smsw; but not str and sldt.

These tests are added to verify that the emulation code does not emulate
these two instructions but the expected invalid operand exception is
seen.

Tests fallback to exit with int3 in case emulation does happen.

Cc: Andy Lutomirski 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Chen Yucong 
Cc: Chris Metcalf 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Huang Rui 
Cc: Jiri Slaby 
Cc: Jonathan Corbet 
Cc: Michael S. Tsirkin 
Cc: Paul Gortmaker 
Cc: Peter Zijlstra 
Cc: Ravi V. Shankar 
Cc: Shuah Khan 
Cc: Vlastimil Babka 
Signed-off-by: Ricardo Neri 
---
 tools/testing/selftests/x86/entry_from_vm86.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/x86/entry_from_vm86.c 
b/tools/testing/selftests/x86/entry_from_vm86.c
index f7d9cea..361466a 100644
--- a/tools/testing/selftests/x86/entry_from_vm86.c
+++ b/tools/testing/selftests/x86/entry_from_vm86.c
@@ -111,6 +111,11 @@ asm (
"smsw %ax\n\t"
"mov %ax, (2080)\n\t"
"int3\n\t"
+   "vmcode_umip_str:\n\t"
+   "str %eax\n\t"
+   "vmcode_umip_sldt:\n\t"
+   "sldt %eax\n\t"
+   "int3\n\t"
".size vmcode, . - vmcode\n\t"
"end_vmcode:\n\t"
".code32\n\t"
@@ -119,7 +124,8 @@ asm (
 
 extern unsigned char vmcode[], end_vmcode[];
 extern unsigned char vmcode_bound[], vmcode_sysenter[], vmcode_syscall[],
-   vmcode_sti[], vmcode_int3[], vmcode_int80[], vmcode_umip[];
+   vmcode_sti[], vmcode_int3[], vmcode_int80[], vmcode_umip[],
+   vmcode_umip_str[], vmcode_umip_sldt[];
 
 /* Returns false if the test was skipped. */
 static bool do_test(struct vm86plus_struct *v86, unsigned long eip,
@@ -226,6 +232,16 @@ void do_umip_tests(struct vm86plus_struct *vm86, unsigned 
char *test_mem)
printf("[FAIL]\tAll the results of SIDT should be the same.\n");
else
printf("[PASS]\tAll the results from SIDT are identical.\n");
+
+   sethandler(SIGILL, sighandler, 0);
+   do_test(vm86, vmcode_umip_str - vmcode, VM86_SIGNAL, 0,
+   "STR instruction");
+   clearhandler(SIGILL);
+
+   sethandler(SIGILL, sighandler, 0);
+   do_test(vm86, vmcode_umip_sldt - vmcode, VM86_SIGNAL, 0,
+   "SLDT instruction");
+   clearhandler(SIGILL);
 }
 
 int main(void)
-- 
2.7.4



[PATCH v9 26/29] x86: Enable User-Mode Instruction Prevention

2017-10-03 Thread Ricardo Neri
User-Mode Instruction Prevention (UMIP) is enabled by setting/clearing a
bit in %cr4.

It makes sense to enable UMIP at some point while booting, before user
spaces come up. Like SMAP and SMEP, is not critical to have it enabled
very early during boot. This is because UMIP is relevant only when there is
a userspace to be protected from. Given the similarities in relevance, it
makes sense to enable UMIP along with SMAP and SMEP.

UMIP is enabled by default. It can be disabled by adding clearcpuid=514
to the kernel parameters.

Cc: Andy Lutomirski 
Cc: Andrew Morton 
Cc: H. Peter Anvin 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Chen Yucong 
Cc: Chris Metcalf 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Huang Rui 
Cc: Jiri Slaby 
Cc: Jonathan Corbet 
Cc: Michael S. Tsirkin 
Cc: Paul Gortmaker 
Cc: Peter Zijlstra 
Cc: Ravi V. Shankar 
Cc: Shuah Khan 
Cc: Vlastimil Babka 
Cc: Tony Luck 
Cc: Paolo Bonzini 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/Kconfig | 10 ++
 arch/x86/kernel/cpu/common.c | 25 -
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5442735..b7c06d8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1803,6 +1803,16 @@ config X86_SMAP
 
  If unsure, say Y.
 
+config X86_INTEL_UMIP
+   def_bool n
+   depends on CPU_SUP_INTEL
+   prompt "Intel User Mode Instruction Prevention" if EXPERT
+   ---help---
+ The User Mode Instruction Prevention (UMIP) is a security
+ feature in newer Intel processors. If enabled, a general
+ protection fault is issued if the instructions SGDT, SLDT,
+ SIDT, SMSW and STR are executed in user mode.
+
 config X86_INTEL_MPX
prompt "Intel MPX (Memory Protection Extensions)"
def_bool n
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 03f9a1a..45b5ec4 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -329,6 +329,28 @@ static __always_inline void setup_smap(struct cpuinfo_x86 
*c)
}
 }
 
+static __always_inline void setup_umip(struct cpuinfo_x86 *c)
+{
+   /* Check the boot processor, plus build option for UMIP. */
+   if (!cpu_feature_enabled(X86_FEATURE_UMIP))
+   goto out;
+
+   /* Check the current processor's cpuid bits. */
+   if (!cpu_has(c, X86_FEATURE_UMIP))
+   goto out;
+
+   cr4_set_bits(X86_CR4_UMIP);
+
+   return;
+
+out:
+   /*
+* Make sure UMIP is disabled in case it was enabled in a
+* previous boot (e.g., via kexec).
+*/
+   cr4_clear_bits(X86_CR4_UMIP);
+}
+
 /*
  * Protection Keys are not available in 32-bit mode.
  */
@@ -1147,9 +1169,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
/* Disable the PN if appropriate */
squash_the_stupid_serial_number(c);
 
-   /* Set up SMEP/SMAP */
+   /* Set up SMEP/SMAP/UMIP */
setup_smep(c);
setup_smap(c);
+   setup_umip(c);
 
/*
 * The vendor-specific functions might have changed features.
-- 
2.7.4



[PATCH v9 27/29] x86/traps: Fixup general protection faults caused by UMIP

2017-10-03 Thread Ricardo Neri
If the User-Mode Instruction Prevention CPU feature is available and
enabled, a general protection fault will be issued if the instructions
sgdt, sldt, sidt, str or smsw are executed from user-mode context
(CPL > 0). If the fault was caused by any of the instructions protected
by UMIP, fixup_umip_exception() will emulate dummy results for these
instructions as follows: if running a 32-bit process, sgdt, sidt and smsw
are emulated; str and sldt are not emulated. No emulation is done for
64-bit processes.

If emulation is successful, the result is passed to the user space program
and no SIGSEGV signal is emitted.

Please note that fixup_umip_exception() also caters for the case when
the fault originated while running in virtual-8086 mode.

Cc: Andy Lutomirski 
Cc: Andrew Morton 
Cc: H. Peter Anvin 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Chen Yucong 
Cc: Chris Metcalf 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Huang Rui 
Cc: Jiri Slaby 
Cc: Jonathan Corbet 
Cc: Michael S. Tsirkin 
Cc: Paul Gortmaker 
Cc: Peter Zijlstra 
Cc: Ravi V. Shankar 
Cc: Shuah Khan 
Cc: Vlastimil Babka 
Cc: Tony Luck 
Cc: Paolo Bonzini 
Cc: x...@kernel.org
Reviewed-by: Andy Lutomirski 
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/traps.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a5791f3..4c0aa6c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -60,6 +60,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_X86_64
 #include 
@@ -514,6 +515,10 @@ do_general_protection(struct pt_regs *regs, long 
error_code)
RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
cond_local_irq_enable(regs);
 
+   if (static_cpu_has(X86_FEATURE_UMIP))
+   if (user_mode(regs) && fixup_umip_exception(regs))
+   return;
+
if (v8086_mode(regs)) {
local_irq_enable();
handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
-- 
2.7.4



[PATCH v9 05/29] x86/mpx: Simplify handling of errors when computing linear addresses

2017-10-03 Thread Ricardo Neri
When errors occur in the computation of the linear address, -1L is
returned. Rather than having a separate return path for errors, the
variable used to return the computed linear address can be initialized
with the error value. Hence, only one return path is needed. This makes
the function easier to read.

While here, ensure that the error value is -1L, a 64-bit value, rather
than -1, a 32-bit value.

Cc: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Peter Zijlstra 
Cc: Nathan Howard 
Cc: Adan Hawthorn 
Cc: Joe Perches 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Suggested-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
 arch/x86/mm/mpx.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9ceaa95..f4c48a0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -138,7 +138,7 @@ static int get_reg_offset(struct insn *insn, struct pt_regs 
*regs,
  */
 static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 {
-   unsigned long addr, base, indx;
+   unsigned long addr = -1L, base, indx;
int addr_offset, base_offset, indx_offset;
insn_byte_t sib;
 
@@ -149,17 +149,17 @@ static void __user *mpx_get_addr_ref(struct insn *insn, 
struct pt_regs *regs)
if (X86_MODRM_MOD(insn->modrm.value) == 3) {
addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
if (addr_offset < 0)
-   goto out_err;
+   goto out;
addr = regs_get_register(regs, addr_offset);
} else {
if (insn->sib.nbytes) {
base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE);
if (base_offset < 0)
-   goto out_err;
+   goto out;
 
indx_offset = get_reg_offset(insn, regs, 
REG_TYPE_INDEX);
if (indx_offset < 0)
-   goto out_err;
+   goto out;
 
base = regs_get_register(regs, base_offset);
indx = regs_get_register(regs, indx_offset);
@@ -167,14 +167,13 @@ static void __user *mpx_get_addr_ref(struct insn *insn, 
struct pt_regs *regs)
} else {
addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
if (addr_offset < 0)
-   goto out_err;
+   goto out;
addr = regs_get_register(regs, addr_offset);
}
addr += insn->displacement.value;
}
+out:
return (void __user *)addr;
-out_err:
-   return (void __user *)-1;
 }
 
 static int mpx_insn_decode(struct insn *insn,
-- 
2.7.4



[PATCH 1/4] kbuild: replace $(hdr-arch) with $(SRCARCH)

2017-10-03 Thread Masahiro Yamada
Since commit 5e53879008b9 ("sparc,sparc64: unify Makefile"), hdr-arch
and SRCARCH always match.

Signed-off-by: Masahiro Yamada 
---

 Makefile | 21 +
 scripts/Makefile.headersinst |  2 +-
 2 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/Makefile b/Makefile
index cf007a3..8d900ba 100644
--- a/Makefile
+++ b/Makefile
@@ -283,9 +283,6 @@ ifeq ($(ARCH),tilegx)
SRCARCH := tile
 endif
 
-# Where to locate arch specific headers
-hdr-arch  := $(SRCARCH)
-
 KCONFIG_CONFIG ?= .config
 export KCONFIG_CONFIG
 
@@ -378,8 +375,8 @@ CFLAGS_KCOV := $(call 
cc-option,-fsanitize-coverage=trace-pc,)
 
 # Use USERINCLUDE when you must reference the UAPI directories only.
 USERINCLUDE:= \
-   -I$(srctree)/arch/$(hdr-arch)/include/uapi \
-   -I$(objtree)/arch/$(hdr-arch)/include/generated/uapi \
+   -I$(srctree)/arch/$(SRCARCH)/include/uapi \
+   -I$(objtree)/arch/$(SRCARCH)/include/generated/uapi \
-I$(srctree)/include/uapi \
-I$(objtree)/include/generated/uapi \
 -include $(srctree)/include/linux/kconfig.h
@@ -387,8 +384,8 @@ USERINCLUDE:= \
 # Use LINUXINCLUDE when you must reference the include/ directory.
 # Needed to be compatible with the O= option
 LINUXINCLUDE:= \
-   -I$(srctree)/arch/$(hdr-arch)/include \
-   -I$(objtree)/arch/$(hdr-arch)/include/generated \
+   -I$(srctree)/arch/$(SRCARCH)/include \
+   -I$(objtree)/arch/$(SRCARCH)/include/generated \
$(if $(KBUILD_SRC), -I$(srctree)/include) \
-I$(objtree)/include \
$(USERINCLUDE)
@@ -1134,8 +1131,8 @@ headerdep:
 #Default location for installed headers
 export INSTALL_HDR_PATH = $(objtree)/usr
 
-# If we do an all arch process set dst to include/arch-$(hdr-arch)
-hdr-dst = $(if $(KBUILD_HEADERS), dst=include/arch-$(hdr-arch), dst=include)
+# If we do an all arch process set dst to include/arch-$(SRCARCH)
+hdr-dst = $(if $(KBUILD_HEADERS), dst=include/arch-$(SRCARCH), dst=include)
 
 PHONY += archheaders
 archheaders:
@@ -1153,10 +1150,10 @@ headers_install_all:
 
 PHONY += headers_install
 headers_install: __headers
-   $(if $(wildcard $(srctree)/arch/$(hdr-arch)/include/uapi/asm/Kbuild),, \
+   $(if $(wildcard $(srctree)/arch/$(SRCARCH)/include/uapi/asm/Kbuild),, \
  $(error Headers not exportable for the $(SRCARCH) architecture))
$(Q)$(MAKE) $(hdr-inst)=include/uapi dst=include
-   $(Q)$(MAKE) $(hdr-inst)=arch/$(hdr-arch)/include/uapi $(hdr-dst)
+   $(Q)$(MAKE) $(hdr-inst)=arch/$(SRCARCH)/include/uapi $(hdr-dst)
 
 PHONY += headers_check_all
 headers_check_all: headers_install_all
@@ -1165,7 +1162,7 @@ headers_check_all: headers_install_all
 PHONY += headers_check
 headers_check: headers_install
$(Q)$(MAKE) $(hdr-inst)=include/uapi dst=include HDRCHECK=1
-   $(Q)$(MAKE) $(hdr-inst)=arch/$(hdr-arch)/include/uapi $(hdr-dst) 
HDRCHECK=1
+   $(Q)$(MAKE) $(hdr-inst)=arch/$(SRCARCH)/include/uapi $(hdr-dst) 
HDRCHECK=1
 
 # ---
 # Kernel selftest
diff --git a/scripts/Makefile.headersinst b/scripts/Makefile.headersinst
index 343d586..5692d7a 100644
--- a/scripts/Makefile.headersinst
+++ b/scripts/Makefile.headersinst
@@ -30,7 +30,7 @@ __headers: $(subdirs)
 $(subdirs):
$(Q)$(MAKE) $(hdr-inst)=$(obj)/$@ dst=$(dst)/$@
 
-# Skip header install/check for include/uapi and arch/$(hdr-arch)/include/uapi.
+# Skip header install/check for include/uapi and arch/$(SRCARCH)/include/uapi.
 # We have only sub-directories there.
 skip-inst := $(if $(filter %/uapi,$(obj)),1)
 
-- 
2.7.4



[PATCH v9 23/29] x86/cpufeature: Add User-Mode Instruction Prevention definitions

2017-10-03 Thread Ricardo Neri
User-Mode Instruction Prevention is a security feature present in new
Intel processors that, when set, prevents the execution of a subset of
instructions if such instructions are executed in user mode (CPL > 0).
Attempting to execute such instructions causes a general protection
exception.

The subset of instructions comprises:

 * SGDT - Store Global Descriptor Table
 * SIDT - Store Interrupt Descriptor Table
 * SLDT - Store Local Descriptor Table
 * SMSW - Store Machine Status Word
 * STR  - Store Task Register

This feature is also added to the list of disabled-features to allow
a cleaner handling of build-time configuration.

Cc: Andy Lutomirski 
Cc: Andrew Morton 
Cc: H. Peter Anvin 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Chen Yucong 
Cc: Chris Metcalf 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Huang Rui 
Cc: Jiri Slaby 
Cc: Jonathan Corbet 
Cc: Michael S. Tsirkin 
Cc: Paul Gortmaker 
Cc: Peter Zijlstra 
Cc: Ravi V. Shankar 
Cc: Shuah Khan 
Cc: Vlastimil Babka 
Cc: Tony Luck 
Cc: Paolo Bonzini 
Cc: x...@kernel.org
Reviewed-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/cpufeatures.h  | 1 +
 arch/x86/include/asm/disabled-features.h| 8 +++-
 arch/x86/include/uapi/asm/processor-flags.h | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 2519c6c..d1f18f2 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -292,6 +292,7 @@
 
 /* Intel-defined CPU features, CPUID level 0x0007:0 (ecx), word 16 */
 #define X86_FEATURE_AVX512VBMI  (16*32+ 1) /* AVX512 Vector Bit Manipulation 
instructions*/
+#define X86_FEATURE_UMIP   (16*32+ 2) /* User Mode Instruction Protection 
*/
 #define X86_FEATURE_PKU(16*32+ 3) /* Protection Keys for 
Userspace */
 #define X86_FEATURE_OSPKE  (16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors of DW/QW 
*/
diff --git a/arch/x86/include/asm/disabled-features.h 
b/arch/x86/include/asm/disabled-features.h
index c10c912..14d6d50 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -16,6 +16,12 @@
 # define DISABLE_MPX   (1<<(X86_FEATURE_MPX & 31))
 #endif
 
+#ifdef CONFIG_X86_INTEL_UMIP
+# define DISABLE_UMIP  0
+#else
+# define DISABLE_UMIP  (1<<(X86_FEATURE_UMIP & 31))
+#endif
+
 #ifdef CONFIG_X86_64
 # define DISABLE_VME   (1<<(X86_FEATURE_VME & 31))
 # define DISABLE_K6_MTRR   (1<<(X86_FEATURE_K6_MTRR & 31))
@@ -63,7 +69,7 @@
 #define DISABLED_MASK130
 #define DISABLED_MASK140
 #define DISABLED_MASK150
-#define DISABLED_MASK16(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57)
+#define DISABLED_MASK16
(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
 #define DISABLED_MASK170
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
 
diff --git a/arch/x86/include/uapi/asm/processor-flags.h 
b/arch/x86/include/uapi/asm/processor-flags.h
index 39946d0..cf4c876 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR _BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT 10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT _BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_UMIP_BIT   11 /* enable UMIP support */
+#define X86_CR4_UMIP   _BITUL(X86_CR4_UMIP_BIT)
 #define X86_CR4_LA57_BIT   12 /* enable 5-level page tables */
 #define X86_CR4_LA57   _BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT   13 /* enable VMX virtualization */
-- 
2.7.4



[RFC PATCH 4/4] kbuild: evaluate cc-option and friends only when building kernel

2017-10-03 Thread Masahiro Yamada
$(call cc-option,...) is costly because it invokes the C compiler.

Given that only some targets need compiler flags, it is pointless
to compute them all the time.

The variable no-dot-config-targets lists the targets we can run
without the .config file, such as "make clean", "make help", etc.

My idea is similar here.  We can add no-compiler-targets to list the
targets we can run without the target compiler information.  This
includes no-dot-config-targets + config targets + misc.

When we run only targets listed in the no-compiler-targets, we can
set cc-option and friends to no-op.  The hostcc-option is an exception
since the host compiler is needed for building fixdep, kconfig, etc.

I did not add "dtbs" and "%.dtb" to no-compiler-targets.  This is
intentional.  Theoretically, we can build device tree blobs without
the C compiler.  It it true that in-kernel DT files are pre-processed
by CPP before DTC, but KBUILD_CPPFLAGS is unused.  However, for the
reason of scripts/dtc/ location, Kbuild descends into scripts/mod/
before the DT build.  I do not want to trigger the unrelated re-build
of modpost when building DT.  Perhaps, we can fix this with further
refactoring, but not now.

Signed-off-by: Masahiro Yamada 
---

 Makefile   | 10 ++
 scripts/Kbuild.include | 14 +-
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/Makefile b/Makefile
index a4fd682..80e1a38 100644
--- a/Makefile
+++ b/Makefile
@@ -224,9 +224,13 @@ no-dot-config-targets := clean mrproper distclean \
 $(version_h) headers_% archheaders archscripts \
 kernelversion %src-pkg
 
+no-compiler-targets := $(no-dot-config-targets) config %config \
+  kernelrelease image_name
+
 config-targets := 0
 mixed-targets  := 0
 dot-config := 1
+need-compiler  := 1
 
 ifneq ($(filter $(no-dot-config-targets), $(MAKECMDGOALS)),)
ifeq ($(filter-out $(no-dot-config-targets), $(MAKECMDGOALS)),)
@@ -234,6 +238,12 @@ ifneq ($(filter $(no-dot-config-targets), 
$(MAKECMDGOALS)),)
endif
 endif
 
+ifneq ($(filter $(no-compiler-targets), $(MAKECMDGOALS)),)
+   ifeq ($(filter-out $(no-compiler-targets), $(MAKECMDGOALS)),)
+   need-compiler := 0
+   endif
+endif
+
 ifeq ($(KBUILD_EXTMOD),)
 ifneq ($(filter config %config,$(MAKECMDGOALS)),)
 config-targets := 1
diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index 9ffd3dd..222d0a2 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -96,6 +96,13 @@ try-run = $(shell set -e;\
fi; \
rm -f "$$TMP" "$$TMPO")
 
+# hostcc-option
+# Usage: cflags-y += $(call hostcc-option,-march=winchip-c6,-march=i586)
+hostcc-option = $(call __cc-option, $(HOSTCC),\
+   $(HOSTCFLAGS) $(HOST_EXTRACFLAGS),$(1),$(2))
+
+ifeq ($(need-compiler),1)
+
 # as-option
 # Usage: cflags-y += $(call as-option,-Wa$(comma)-isa=foo,)
 
@@ -123,11 +130,6 @@ CC_OPTION_CFLAGS = $(filter-out 
$(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS))
 cc-option = $(call __cc-option, $(CC),\
$(KBUILD_CPPFLAGS) $(CC_OPTION_CFLAGS),$(1),$(2))
 
-# hostcc-option
-# Usage: cflags-y += $(call hostcc-option,-march=winchip-c6,-march=i586)
-hostcc-option = $(call __cc-option, $(HOSTCC),\
-   $(HOSTCFLAGS) $(HOST_EXTRACFLAGS),$(1),$(2))
-
 # cc-option-yn
 # Usage: flag := $(call cc-option-yn,-march=winchip-c6)
 cc-option-yn = $(call try-run,\
@@ -180,6 +182,8 @@ ld-version = $(shell $(LD) --version | 
$(srctree)/scripts/ld-version.sh)
 # Usage:  $(call ld-ifversion, -ge, 22252, y)
 ld-ifversion = $(shell [ $(ld-version) $(1) $(2) ] && echo $(3) || echo $(4))
 
+endif
+
 ##
 
 ###
-- 
2.7.4



[PATCH v9 24/29] x86: Add emulation code for UMIP instructions

2017-10-03 Thread Ricardo Neri
The feature User-Mode Instruction Prevention present in recent Intel
processor prevents a group of instructions (sgdt, sidt, sldt, smsw, and
str) from being executed with CPL > 0. Otherwise, a general protection
fault is issued.

Rather than relaying to the user space the general protection fault caused
by the UMIP-protected instructions (in the form of a SIGSEGV signal), it
can be trapped and emulate the result of such instructions to provide dummy
values. This allows to both conserve the current kernel behavior and not
reveal the system resources that UMIP intends to protect (i.e., the
locations of the global descriptor and interrupt descriptor tables, the
segment selectors of the local descriptor table, the value of the task
state register and the contents of the CR0 register).

This emulation is needed because certain applications (e.g., WineHQ and
DOSEMU2) rely on this subset of instructions to function. Given that sldt
and str are not commonly used in programs that run on WineHQ or DOSEMU2,
they are not emulated. Also, emulation is provided only for 32-bit
processes; 64-bit processes that attempt to use the instructions that UMIP
protects will receive the SIGSEGV signal issued as a consequence of the
general protection fault.

The instructions protected by UMIP can be split in two groups. Those which
return a kernel memory address (sgdt and sidt) and those which return a
value (sldt, str and smsw).

For the instructions that return a kernel memory address, applications such
as WineHQ rely on the result being located in the kernel memory space, not
the actual location of the table. The result is emulated as a hard-coded
value that lies close to the top of the kernel memory. The limit for the
GDT and the IDT are set to zero.

The instruction smsw is emulated to return the value that the register CR0
has at boot time as set in the head_32.

Care is taken to appropriately emulate the results when segmentation is
used. That is, rather than relying on USER_DS and USER_CS, the function
insn_get_addr_ref() inspects the segment descriptor pointed by the
registers in pt_regs. This ensures that we correctly obtain the segment
base address and the address and operand sizes even if the user space
application uses a local descriptor table.

Cc: Andy Lutomirski 
Cc: Andrew Morton 
Cc: H. Peter Anvin 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Chen Yucong 
Cc: Chris Metcalf 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Huang Rui 
Cc: Jiri Slaby 
Cc: Jonathan Corbet 
Cc: Michael S. Tsirkin 
Cc: Paul Gortmaker 
Cc: Peter Zijlstra 
Cc: Ravi V. Shankar 
Cc: Shuah Khan 
Cc: Vlastimil Babka 
Cc: Tony Luck 
Cc: Paolo Bonzini 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/umip.h |  12 ++
 arch/x86/kernel/Makefile|   1 +
 arch/x86/kernel/umip.c  | 309 
 3 files changed, 322 insertions(+)
 create mode 100644 arch/x86/include/asm/umip.h
 create mode 100644 arch/x86/kernel/umip.c

diff --git a/arch/x86/include/asm/umip.h b/arch/x86/include/asm/umip.h
new file mode 100644
index 000..db43f2a
--- /dev/null
+++ b/arch/x86/include/asm/umip.h
@@ -0,0 +1,12 @@
+#ifndef _ASM_X86_UMIP_H
+#define _ASM_X86_UMIP_H
+
+#include 
+#include 
+
+#ifdef CONFIG_X86_INTEL_UMIP
+bool fixup_umip_exception(struct pt_regs *regs);
+#else
+static inline bool fixup_umip_exception(struct pt_regs *regs) { return false; }
+#endif  /* CONFIG_X86_INTEL_UMIP */
+#endif  /* _ASM_X86_UMIP_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index d8e2b70..bafeee1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -125,6 +125,7 @@ obj-$(CONFIG_EFI)   += sysfb_efi.o
 obj-$(CONFIG_PERF_EVENTS)  += perf_regs.o
 obj-$(CONFIG_TRACING)  += tracepoint.o
 obj-$(CONFIG_SCHED_MC_PRIO)+= itmt.o
+obj-$(CONFIG_X86_INTEL_UMIP)   += umip.o
 
 obj-$(CONFIG_ORC_UNWINDER) += unwind_orc.o
 obj-$(CONFIG_FRAME_POINTER_UNWINDER)   += unwind_frame.o
diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
new file mode 100644
index 000..1f338cb
--- /dev/null
+++ b/arch/x86/kernel/umip.c
@@ -0,0 +1,309 @@
+/*
+ * umip.c Emulation for instruction protected by the Intel User-Mode
+ * Instruction Prevention feature
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ * Ricardo Neri 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/** DOC: Emulation for User-Mode Instruction Prevention (UMIP)
+ *
+ * The feature User-Mode Instruction Prevention present in recent Intel
+ * processor prevents a group of instructions (sgdt, sidt, sldt, smsw, and str)
+ * from being executed with CPL > 0. Otherwise, a general protection fault is
+ * issued.
+ *
+ * Rather than relaying to the user space the general protection fault caused 
by
+ * the UMIP-protected instructions (in the form of a SIGSEGV signal), it can be
+ * trapped and emulate the result of such instructions to provide dummy

[PATCH 2/4] kbuild: move "_all" target out of $(KBUILD_SRC) conditional

2017-10-03 Thread Masahiro Yamada
The first "_all" occurrence around line 120 is only visible when
KBUILD_SRC is unset.

If O=... is specified, the working directory is relocated, then the
only second occurrence around line 193 is visible, that is not set
to PHONY.

Move the first one to an always visible place.  This clarifies "_all"
is our default target and it is always set to PHONY.

Signed-off-by: Masahiro Yamada 
---

 Makefile | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Makefile b/Makefile
index 8d900ba..39a7c03 100644
--- a/Makefile
+++ b/Makefile
@@ -10,6 +10,10 @@ NAME = Fearless Coyote
 # Comments in this file are targeted only to the developer, do not
 # expect to learn how to build the kernel reading this file.
 
+# That's our default target when none is given on the command line
+PHONY := _all
+_all:
+
 # o Do not use make's built-in rules and variables
 #   (this increases performance and avoids hard-to-debug behaviour);
 # o Look for make include files relative to root of kernel src
@@ -116,10 +120,6 @@ ifeq ("$(origin O)", "command line")
   KBUILD_OUTPUT := $(O)
 endif
 
-# That's our default target when none is given on the command line
-PHONY := _all
-_all:
-
 # Cancel implicit rules on top Makefile
 $(CURDIR)/Makefile Makefile: ;
 
-- 
2.7.4



[PATCH 3/4] kbuild: re-order the code to not parse unnecessary variables

2017-10-03 Thread Masahiro Yamada
The top Makefile is divided into some sections such as mixed targets,
config targets, build targets, etc.

When we build mixed targets, Kbuild just invokes submake to process
them one by one.  In this case, compiler-related variables like CC,
KBUILD_CFLAGS, etc. are unneeded.

Check what kind of targets we are building first, then parse necessary
variables for building them.

Signed-off-by: Masahiro Yamada 
---

 Makefile | 233 ---
 1 file changed, 118 insertions(+), 115 deletions(-)

diff --git a/Makefile b/Makefile
index 39a7c03..a4fd682 100644
--- a/Makefile
+++ b/Makefile
@@ -186,15 +186,6 @@ ifeq ("$(origin M)", "command line")
   KBUILD_EXTMOD := $(M)
 endif
 
-# If building an external module we do not care about the all: rule
-# but instead _all depend on modules
-PHONY += all
-ifeq ($(KBUILD_EXTMOD),)
-_all: all
-else
-_all: modules
-endif
-
 ifeq ($(KBUILD_SRC),)
 # building in the source tree
 srctree := .
@@ -206,6 +197,9 @@ else
 srctree := $(KBUILD_SRC)
 endif
 endif
+
+export KBUILD_CHECKSRC KBUILD_EXTMOD KBUILD_SRC
+
 objtree:= .
 src:= $(srctree)
 obj:= $(objtree)
@@ -214,6 +208,74 @@ VPATH  := $(srctree)$(if 
$(KBUILD_EXTMOD),:$(KBUILD_EXTMOD))
 
 export srctree objtree VPATH
 
+# To make sure we do not include .config for any of the *config targets
+# catch them early, and hand them over to scripts/kconfig/Makefile
+# It is allowed to specify more targets when calling make, including
+# mixing *config targets and build targets.
+# For example 'make oldconfig all'.
+# Detect when mixed targets is specified, and make a second invocation
+# of make so .config is not included in this case either (for *config).
+
+version_h := include/generated/uapi/linux/version.h
+old_version_h := include/linux/version.h
+
+no-dot-config-targets := clean mrproper distclean \
+cscope gtags TAGS tags help% %docs check% coccicheck \
+$(version_h) headers_% archheaders archscripts \
+kernelversion %src-pkg
+
+config-targets := 0
+mixed-targets  := 0
+dot-config := 1
+
+ifneq ($(filter $(no-dot-config-targets), $(MAKECMDGOALS)),)
+   ifeq ($(filter-out $(no-dot-config-targets), $(MAKECMDGOALS)),)
+   dot-config := 0
+   endif
+endif
+
+ifeq ($(KBUILD_EXTMOD),)
+ifneq ($(filter config %config,$(MAKECMDGOALS)),)
+config-targets := 1
+ifneq ($(words $(MAKECMDGOALS)),1)
+mixed-targets := 1
+endif
+endif
+endif
+# install and modules_install need also be processed one by one
+ifneq ($(filter install,$(MAKECMDGOALS)),)
+ifneq ($(filter modules_install,$(MAKECMDGOALS)),)
+   mixed-targets := 1
+endif
+endif
+
+ifeq ($(mixed-targets),1)
+# ===
+# We're called with mixed targets (*config and build targets).
+# Handle them one by one.
+
+PHONY += $(MAKECMDGOALS) __build_one_by_one
+
+$(filter-out __build_one_by_one, $(MAKECMDGOALS)): __build_one_by_one
+   @:
+
+__build_one_by_one:
+   $(Q)set -e; \
+   for i in $(MAKECMDGOALS); do \
+   $(MAKE) -f $(srctree)/Makefile $$i; \
+   done
+
+else
+
+# We need some generic definitions (do not try to remake the file).
+scripts/Kbuild.include: ;
+include scripts/Kbuild.include
+
+# Read KERNELRELEASE from include/config/kernel.release (if it exists)
+KERNELRELEASE = $(shell cat include/config/kernel.release 2> /dev/null)
+KERNELVERSION = $(VERSION)$(if $(PATCHLEVEL),.$(PATCHLEVEL)$(if 
$(SUBLEVEL),.$(SUBLEVEL)))$(EXTRAVERSION)
+export VERSION PATCHLEVEL SUBLEVEL KERNELRELEASE KERNELVERSION
+
 # SUBARCH tells the usermode build what the underlying arch is.  That is set
 # first, and if a usermode build is happening, the "ARCH=um" on the command
 # line overrides the setting of ARCH below.  If a native build is happening,
@@ -308,40 +370,6 @@ HOSTCFLAGS  += -Wno-unused-value -Wno-unused-parameter \
-Wno-missing-field-initializers -fno-delete-null-pointer-checks
 endif
 
-# Decide whether to build built-in, modular, or both.
-# Normally, just do built-in.
-
-KBUILD_MODULES :=
-KBUILD_BUILTIN := 1
-
-# If we have only "make modules", don't compile built-in objects.
-# When we're building modules with modversions, we need to consider
-# the built-in objects during the descend as well, in order to
-# make sure the checksums are up to date before we record them.
-
-ifeq ($(MAKECMDGOALS),modules)
-  KBUILD_BUILTIN := $(if $(CONFIG_MODVERSIONS),1)
-endif
-
-# If we have "make  modules", compile modules
-# in addition to whatever we do anyway.
-# Just "make" or "make all" shall build modules as well
-
-ifneq ($(filter all _all modules,$(MAKECMDGOALS)),)
-  KBUILD_MODULES := 1
-endif
-
-ifeq ($(MAKECMDGOALS),)
-  KBUILD

[PATCH v9 25/29] x86/umip: Force a page fault when unable to copy emulated result to user

2017-10-03 Thread Ricardo Neri
fixup_umip_exception() will be called from do_general_protection(). If the
former returns false, the latter will issue a SIGSEGV with SEND_SIG_PRIV.
However, when emulation is successful but the emulated result cannot be
copied to user space memory, it is more accurate to issue a SIGSEGV with
SEGV_MAPERR with the offending address. A new function, inspired in
force_sig_info_fault(), is introduced to model the page fault.

Cc: Andy Lutomirski 
Cc: Andrew Morton 
Cc: H. Peter Anvin 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Chen Yucong 
Cc: Chris Metcalf 
Cc: Dave Hansen 
Cc: Fenghua Yu 
Cc: Huang Rui 
Cc: Jiri Slaby 
Cc: Jonathan Corbet 
Cc: Michael S. Tsirkin 
Cc: Paul Gortmaker 
Cc: Peter Zijlstra 
Cc: Ravi V. Shankar 
Cc: Shuah Khan 
Cc: Vlastimil Babka 
Cc: Tony Luck 
Cc: Paolo Bonzini 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/umip.c | 45 +++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
index 1f338cb..6d07eb5 100644
--- a/arch/x86/kernel/umip.c
+++ b/arch/x86/kernel/umip.c
@@ -187,6 +187,41 @@ static int emulate_umip_insn(struct insn *insn, int 
umip_inst,
 }
 
 /**
+ * force_sig_info_umip_fault() - Force a SIGSEGV with SEGV_MAPERR
+ * @addr:  Address that caused the signal
+ * @regs:  Register set containing the instruction pointer
+ *
+ * Force a SIGSEGV signal with SEGV_MAPERR as the error code. This function is
+ * intended to be used to provide a segmentation fault when the result of the
+ * UMIP emulation could not be copied to the user space memory.
+ *
+ * Returns: none
+ */
+static void force_sig_info_umip_fault(void __user *addr, struct pt_regs *regs)
+{
+   siginfo_t info;
+   struct task_struct *tsk = current;
+
+   tsk->thread.cr2 = (unsigned long)addr;
+   tsk->thread.error_code  = X86_PF_USER | X86_PF_WRITE;
+   tsk->thread.trap_nr = X86_TRAP_PF;
+
+   info.si_signo   = SIGSEGV;
+   info.si_errno   = 0;
+   info.si_code= SEGV_MAPERR;
+   info.si_addr= addr;
+   force_sig_info(SIGSEGV, &info, tsk);
+
+   if (!(show_unhandled_signals && unhandled_signal(tsk, SIGSEGV)))
+   return;
+
+   pr_err_ratelimited("%s[%d] umip emulation segfault ip:%lx sp:%lx 
error:%x in %lx\n",
+  tsk->comm, task_pid_nr(tsk), regs->ip,
+  regs->sp, X86_PF_USER | X86_PF_WRITE,
+  regs->ip);
+}
+
+/**
  * fixup_umip_exception() - Fixup #GP faults caused by UMIP
  * @regs:  Registers as saved when entering the #GP trap
  *
@@ -299,8 +334,14 @@ bool fixup_umip_exception(struct pt_regs *regs)
return false;
 
nr_copied = copy_to_user(uaddr, dummy_data, dummy_data_size);
-   if (nr_copied  > 0)
-   return false;
+   if (nr_copied  > 0) {
+   /*
+* If copy fails, send a signal and tell caller that
+* fault was fixed up.
+*/
+   force_sig_info_umip_fault(uaddr, regs);
+   return true;
+   }
}
 
/* increase IP to let the program keep going */
-- 
2.7.4



[PATCH v9 16/29] x86/insn-eval: Add function to get default params of code segment

2017-10-03 Thread Ricardo Neri
Obtain the default values of the address and operand sizes as specified in
the D and L bits of the the segment descriptor selected by the register
CS. The function can be used for both protected and long modes.
For virtual-8086 mode, the default address and operand sizes are always 2
bytes.

The returned parameters are encoded in a signed 8-bit data type. Auxiliar
macros are provided to encode and decode such values.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/insn-eval.h |  5 
 arch/x86/lib/insn-eval.c | 64 
 2 files changed, 69 insertions(+)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 25d6e44..a5886ecc 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -11,8 +11,13 @@
 #include 
 #include 
 
+#define INSN_CODE_SEG_ADDR_SZ(params) ((params >> 4) & 0xf)
+#define INSN_CODE_SEG_OPND_SZ(params) (params & 0xf)
+#define INSN_CODE_SEG_PARAMS(oper_sz, addr_sz) (oper_sz | (addr_sz << 4))
+
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
 unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx);
+char insn_get_code_seg_defaults(struct pt_regs *regs);
 
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 02c4498..cb2734a 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -624,6 +624,70 @@ static unsigned long get_seg_limit(struct pt_regs *regs, 
int seg_reg_idx)
 }
 
 /**
+ * insn_get_code_seg_defaults() - Obtain code segment default parameters
+ * @regs:  Structure with register values as seen when entering kernel mode
+ *
+ * Obtain the default parameters of the code segment: address and operand 
sizes.
+ * The code segment is obtained from the selector contained in the CS register
+ * in regs. In protected mode, the default address is determined by inspecting
+ * the L and D bits of the segment descriptor. In virtual-8086 mode, the 
default
+ * is always two bytes for both address and operand sizes.
+ *
+ * Returns:
+ *
+ * A signed 8-bit value containing the default parameters on success.
+ *
+ * -EINVAL on error.
+ */
+char insn_get_code_seg_defaults(struct pt_regs *regs)
+{
+   struct desc_struct *desc;
+   short sel;
+
+   if (v8086_mode(regs))
+   /* Address and operand size are both 16-bit. */
+   return INSN_CODE_SEG_PARAMS(2, 2);
+
+   sel = get_segment_selector(regs, INAT_SEG_REG_CS);
+   if (sel < 0)
+   return -1L;
+
+   desc = get_desc(sel);
+   if (!desc)
+   return -EINVAL;
+
+   /*
+* The most significant byte of the Type field of the segment descriptor
+* determines whether a segment contains data or code. If this is a data
+* segment, return error.
+*/
+   if (!(desc->type & BIT(3)))
+   return -EINVAL;
+
+   switch ((desc->l << 1) | desc->d) {
+   case 0: /*
+* Legacy mode. CS.L=0, CS.D=0. Address and operand size are
+* both 16-bit.
+*/
+   return INSN_CODE_SEG_PARAMS(2, 2);
+   case 1: /*
+* Legacy mode. CS.L=0, CS.D=1. Address and operand size are
+* both 32-bit.
+*/
+   return INSN_CODE_SEG_PARAMS(4, 4);
+   case 2: /*
+* IA-32e 64-bit mode. CS.L=1, CS.D=0. Address size is 64-bit;
+* operand size is 32-bit.
+*/
+   return INSN_CODE_SEG_PARAMS(4, 8);
+   case 3: /* Invalid setting. CS.L=1, CS.D=1 */
+   /* fall through */
+   default:
+   return -EINVAL;
+   }
+}
+
+/**
  * insn_get_modrm_rm_off() - Obtain register in r/m part of the ModRM byte
  * @insn:  Instruction containing the ModRM byte
  * @regs:  Register values as seen when entering kernel mode
-- 
2.7.4



[PATCH v9 17/29] x86/insn-eval: Indicate a 32-bit displacement if ModRM.mod is 0 and ModRM.rm is 101b

2017-10-03 Thread Ricardo Neri
Section 2.2.1.3 of the Intel 64 and IA-32 Architectures Software
Developer's Manual volume 2A states that when ModRM.mod is zero and
ModRM.rm is 101b, a 32-bit displacement follows the ModRM byte. This means
that none of the registers are used in the computation of the effective
address. A return value of -EDOM indicates callers that they should not
use the value of registers when computing the effective address for the
instruction.

In long mode, the effective address is given by the 32-bit displacement
plus the location of the next instruction. In protected mode, only the
displacement is used.

The instruction decoder takes care of obtaining the displacement.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/lib/insn-eval.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index cb2734a..dd84819 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -408,6 +408,14 @@ static int get_reg_offset(struct insn *insn, struct 
pt_regs *regs,
switch (type) {
case REG_TYPE_RM:
regno = X86_MODRM_RM(insn->modrm.value);
+
+   /*
+* ModRM.mod == 0 and ModRM.rm == 5 means a 32-bit displacement
+* follows the ModRM byte.
+*/
+   if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5)
+   return -EDOM;
+
if (X86_REX_B(insn->rex_prefix.value))
regno += 8;
break;
@@ -754,10 +762,21 @@ void __user *insn_get_addr_ref(struct insn *insn, struct 
pt_regs *regs)
eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
} else {
addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
-   if (addr_offset < 0)
+   /*
+* -EDOM means that we must ignore the address_offset.
+* In such a case, in 64-bit mode the effective address
+* relative to the RIP of the following instruction.
+*/
+   if (addr_offset == -EDOM) {
+   if (user_64bit_mode(regs))
+   eff_addr = (long)regs->ip + 
insn->length;
+   else
+   eff_addr = 0;
+   } else if (addr_offset < 0) {
goto out;
-
-   eff_addr = regs_get_register(regs, addr_offset);
+   } else {
+   eff_addr = regs_get_register(regs, addr_offset);
+   }
}
 
eff_addr += insn->displacement.value;
-- 
2.7.4



[PATCH v9 21/29] x86/insn-eval: Handle 32-bit address encodings in virtual-8086 mode

2017-10-03 Thread Ricardo Neri
It is possible to utilize 32-bit address encodings in virtual-8086 mode via
an address override instruction prefix. However, the range of the
effective address is still limited to [0x-0x]. In such a case, return
error.

Also, linear addresses in virtual-8086 mode are limited to 20 bits. Enforce
such limit by truncating the most significant bytes of the computed linear
address.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/lib/insn-eval.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 1d510a6..d43808c 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -862,12 +862,23 @@ static void __user *get_addr_ref_32(struct insn *insn, 
struct pt_regs *regs)
goto out;
 
/*
+* Even though 32-bit address encodings are allowed in virtual-8086
+* mode, the address range is still limited to [0x-0x].
+*/
+   if (v8086_mode(regs) && (eff_addr & ~0x))
+   goto out;
+
+   /*
 * Data type long could be 64 bits in size. Ensure that our 32-bit
 * effective address is not sign-extended when computing the linear
 * address.
 */
linear_addr = (unsigned long)(eff_addr & 0x) + seg_base_addr;
 
+   /* Limit linear address to 20 bits */
+   if (v8086_mode(regs))
+   linear_addr &= 0xf;
+
 out:
return (void __user *)linear_addr;
 }
-- 
2.7.4



[PATCH v9 22/29] x86/insn-eval: Add support to resolve 16-bit addressing encodings

2017-10-03 Thread Ricardo Neri
Tasks running in virtual-8086 mode, in protected mode with code segment
descriptors that specify 16-bit default address sizes via the
D bit, or via an address override prefix will use 16-bit addressing form
encodings as described in the Intel 64 and IA-32 Architecture Software
Developer's Manual Volume 2A Section 2.1.5, Table 2-1.

16-bit addressing encodings differ in several ways from the 32-bit/64-bit
addressing form encodings: ModRM.rm points to different registers and, in
some cases, effective addresses are indicated by the addition of the value
of two registers. Also, there is no support for SIB bytes. Thus, a
separate function is needed to parse this form of addressing.

A couple of functions are introduced. get_reg_offset_16() obtains the
offset from the base of pt_regs of the registers indicated by the ModRM
byte of the address encoding. get_addr_ref_16() computes the linear
address indicated by the instructions using the value of the registers
given by ModRM and the base address of the applicable segment.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/lib/insn-eval.c | 182 +++
 1 file changed, 182 insertions(+)

diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index d43808c..2f859a1 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -462,6 +462,80 @@ static int get_reg_offset(struct insn *insn, struct 
pt_regs *regs,
 }
 
 /**
+ * get_reg_offset_16() - Obtain offset of register indicated by instruction
+ * @insn:  Instruction containing ModRM byte
+ * @regs:  Register values as seen when entering kernel mode
+ * @offs1: Offset of the first operand register
+ * @offs2: Offset of the second opeand register, if applicable
+ *
+ * Obtain the offset, in pt_regs, of the registers indicated by the ModRM byte
+ * within insn. This function is to be used with 16-bit address encodings. The
+ * offs1 and offs2 will be written with the offset of the two registers
+ * indicated by the instruction. In cases where any of the registers is not
+ * referenced by the instruction, the value will be set to -EDOM.
+ *
+ * Returns:
+ *
+ * 0 on success, -EINVAL on error.
+ */
+static int get_reg_offset_16(struct insn *insn, struct pt_regs *regs,
+int *offs1, int *offs2)
+{
+   /*
+* 16-bit addressing can use one or two registers. Specifics of
+* encodings are given in Table 2-1. "16-Bit Addressing Forms with the
+* ModR/M Byte" of the Intel Software Development Manual.
+*/
+   static const int regoff1[] = {
+   offsetof(struct pt_regs, bx),
+   offsetof(struct pt_regs, bx),
+   offsetof(struct pt_regs, bp),
+   offsetof(struct pt_regs, bp),
+   offsetof(struct pt_regs, si),
+   offsetof(struct pt_regs, di),
+   offsetof(struct pt_regs, bp),
+   offsetof(struct pt_regs, bx),
+   };
+
+   static const int regoff2[] = {
+   offsetof(struct pt_regs, si),
+   offsetof(struct pt_regs, di),
+   offsetof(struct pt_regs, si),
+   offsetof(struct pt_regs, di),
+   -EDOM,
+   -EDOM,
+   -EDOM,
+   -EDOM,
+   };
+
+   if (!offs1 || !offs2)
+   return -EINVAL;
+
+   /* Operand is a register, use the generic function. */
+   if (X86_MODRM_MOD(insn->modrm.value) == 3) {
+   *offs1 = insn_get_modrm_rm_off(insn, regs);
+   *offs2 = -EDOM;
+   return 0;
+   }
+
+   *offs1 = regoff1[X86_MODRM_RM(insn->modrm.value)];
+   *offs2 = regoff2[X86_MODRM_RM(insn->modrm.value)];
+
+   /*
+* If ModRM.mod is 0 and ModRM.rm is 110b, then we use displacement-
+* only addressing. This means that no registers are involved in
+* computing the effective address. Thus, ensure that the first
+* register offset is invalild. The second register offset is already
+* invalid under the aforementioned conditions.
+*/
+   if ((X86_MODRM_MOD(insn->modrm.value) == 0) &&
+   (X86_MODRM_RM(insn->modrm.value) == 6))
+   *offs1 = -EDOM;
+
+   return 0;
+}
+
+/**
  * get_desc() - Obtain address of segment descriptor
  * @sel:   Segment selector
  *
@@ -713,6 +787,112 @@ int insn_get_modrm_rm_off(struct insn *insn, struct 
pt_regs *regs)
 }
 
 /**
+ * get_addr_ref_16() - Obtain the 16-bit address referred by instruction
+ * @insn:  Instruction containing ModRM byte and displacement
+ * @regs:  Register values as seen when entering kernel mode
+ *
+ * Thi

[PATCH v9 18/29] x86/insn-eval: Incorporate segment base in linear address computation

2017-10-03 Thread Ricardo Neri
insn_get_addr_ref() returns the effective address as defined by the
section 3.7.5.1 Vol 1 of the Intel 64 and IA-32 Architectures Software
Developer's Manual. In order to compute the linear address, we must add
to the effective address the segment base address as set in the segment
descriptor. The segment descriptor to use depends on the register used as
operand and segment override prefixes, if any.

In most cases, the segment base address will be 0 if the USER_DS/USER32_DS
segment is used or if segmentation is not used. However, the base address
is not necessarily zero if a user programs defines its own segments. This
is possible by using a local descriptor table.

Since the effective address is a signed quantity, the unsigned segment
base address is saved in a separate variable and added to the final,
unsigned, effective address.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/lib/insn-eval.c | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index dd84819..b3aa891 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -719,8 +719,8 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs 
*regs)
  */
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 {
-   int addr_offset, base_offset, indx_offset;
-   unsigned long linear_addr = -1L;
+   int addr_offset, base_offset, indx_offset, seg_reg_indx;
+   unsigned long linear_addr = -1L, seg_base_addr;
long eff_addr, base, indx;
insn_byte_t sib;
 
@@ -734,6 +734,14 @@ void __user *insn_get_addr_ref(struct insn *insn, struct 
pt_regs *regs)
goto out;
 
eff_addr = regs_get_register(regs, addr_offset);
+
+   seg_reg_indx = resolve_seg_reg(insn, regs, addr_offset);
+   if (seg_reg_indx < 0)
+   goto out;
+
+   seg_base_addr = insn_get_seg_base(regs, seg_reg_indx);
+   if (seg_base_addr == -1L)
+   goto out;
} else {
if (insn->sib.nbytes) {
/*
@@ -760,6 +768,14 @@ void __user *insn_get_addr_ref(struct insn *insn, struct 
pt_regs *regs)
indx = regs_get_register(regs, indx_offset);
 
eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
+
+   seg_reg_indx = resolve_seg_reg(insn, regs, base_offset);
+   if (seg_reg_indx < 0)
+   goto out;
+
+   seg_base_addr = insn_get_seg_base(regs, seg_reg_indx);
+   if (seg_base_addr == -1L)
+   goto out;
} else {
addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
/*
@@ -777,12 +793,20 @@ void __user *insn_get_addr_ref(struct insn *insn, struct 
pt_regs *regs)
} else {
eff_addr = regs_get_register(regs, addr_offset);
}
+
+   seg_reg_indx = resolve_seg_reg(insn, regs, addr_offset);
+   if (seg_reg_indx < 0)
+   goto out;
+
+   seg_base_addr = insn_get_seg_base(regs, seg_reg_indx);
+   if (seg_base_addr == -1L)
+   goto out;
}
 
eff_addr += insn->displacement.value;
}
 
-   linear_addr = (unsigned long)eff_addr;
+   linear_addr = (unsigned long)eff_addr + seg_base_addr;
 
 out:
return (void __user *)linear_addr;
-- 
2.7.4



[PATCH v9 04/29] uprobes/x86: Use existing definitions for segment override prefixes

2017-10-03 Thread Ricardo Neri
Rather than using hard-coded values of the segment override prefixes,
leverage the existing definitions provided in inat.h.

Suggested-by: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: Masami Hiramatsu 
Cc: Denys Vlasenko 
Cc: Srikar Dronamraju 
Cc: Ravi V. Shankar 
Signed-off-by: Ricardo Neri 
---
 arch/x86/kernel/uprobes.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 495c776..a3755d2 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -271,12 +271,15 @@ static bool is_prefix_bad(struct insn *insn)
int i;
 
for (i = 0; i < insn->prefixes.nbytes; i++) {
-   switch (insn->prefixes.bytes[i]) {
-   case 0x26:  /* INAT_PFX_ES   */
-   case 0x2E:  /* INAT_PFX_CS   */
-   case 0x36:  /* INAT_PFX_DS   */
-   case 0x3E:  /* INAT_PFX_SS   */
-   case 0xF0:  /* INAT_PFX_LOCK */
+   insn_attr_t attr;
+
+   attr = inat_get_opcode_attribute(insn->prefixes.bytes[i]);
+   switch (attr) {
+   case INAT_MAKE_PREFIX(INAT_PFX_ES):
+   case INAT_MAKE_PREFIX(INAT_PFX_CS):
+   case INAT_MAKE_PREFIX(INAT_PFX_DS):
+   case INAT_MAKE_PREFIX(INAT_PFX_SS):
+   case INAT_MAKE_PREFIX(INAT_PFX_LOCK):
return true;
}
}
-- 
2.7.4



[PATCH v9 13/29] x86/insn-eval: Add utility functions to get segment selector

2017-10-03 Thread Ricardo Neri
When computing a linear address and segmentation is used, we need to know
the base address of the segment involved in the computation. In most of
the cases, the segment base address will be zero as in USER_DS/USER32_DS.
However, it may be possible that a user space program defines its own
segments via a local descriptor table. In such a case, the segment base
address may not be zero. Thus, the segment base address is needed to
calculate correctly the linear address.

If running in protected mode, the segment selector to be used when
computing a linear address is determined by either any of segment override
prefixes in the instruction or inferred from the registers involved in the
computation of the effective address; in that order. Also, there are cases
when the segment override prefixes shall be ignored (i.e., code segments
are always selected by the CS segment register; string instructions always
use the ES segment register when using rDI register as operand). In long
mode, segment registers are ignored, except for FS and GS. In these two
cases, base addresses are obtained from the respective MSRs.

For clarity, this process can be split into four steps (and an equal
number of functions): determine if segment prefixes overrides can be used;
parse the segment override prefixes, and use them if found; if not found
or cannot be used, use the default segment registers associated with the
operand registers. Once the segment register to use has been identified,
read its value to obtain the segment selector.

The method to obtain the segment selector depends on several factors. In
32-bit builds, segment selectors are saved into a pt_regs structure
when switching to kernel mode. The same is also true for virtual-8086
mode. In 64-bit builds, segmentation is mostly ignored, except when
running a program in 32-bit legacy mode. In this case, CS and SS can be
obtained from pt_regs. DS, ES, FS and GS can be read directly from
the respective segment registers.

In order to identify the segment registers, a new set of #defines is
introduced. It also includes two special identifiers. One of them
indicates when the default segment register associated with instruction
operands shall be used. Another one indicates that the contents of the
segment register shall be ignored; this identifier is used when in long
mode.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/inat.h |  10 ++
 arch/x86/lib/insn-eval.c| 321 
 2 files changed, 331 insertions(+)

diff --git a/arch/x86/include/asm/inat.h b/arch/x86/include/asm/inat.h
index 02aff08..1c78580 100644
--- a/arch/x86/include/asm/inat.h
+++ b/arch/x86/include/asm/inat.h
@@ -97,6 +97,16 @@
 #define INAT_MAKE_GROUP(grp)   ((grp << INAT_GRP_OFFS) | INAT_MODRM)
 #define INAT_MAKE_IMM(imm) (imm << INAT_IMM_OFFS)
 
+/* Identifiers for segment registers */
+#define INAT_SEG_REG_IGNORE0
+#define INAT_SEG_REG_DEFAULT   1
+#define INAT_SEG_REG_CS2
+#define INAT_SEG_REG_SS3
+#define INAT_SEG_REG_DS4
+#define INAT_SEG_REG_ES5
+#define INAT_SEG_REG_FS6
+#define INAT_SEG_REG_GS7
+
 /* Attribute search APIs */
 extern insn_attr_t inat_get_opcode_attribute(insn_byte_t opcode);
 extern int inat_get_last_prefix_id(insn_byte_t last_pfx);
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index ac7b87c..77b48f9 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #undef pr_fmt
 #define pr_fmt(fmt) "insn: " fmt
@@ -47,6 +48,326 @@ static bool is_string_insn(struct insn *insn)
}
 }
 
+/**
+ * get_overridden_seg_reg_idx() - obtain segment register override index
+ * @insn:  Instruction with segment override prefixes
+ *
+ * Inspect the instruction prefixes and find segment overrides, if any.
+ *
+ * Returns:
+ *
+ * A constant identifying the segment register to use, among CS, SS, DS,
+ * ES, FS, or GS. INAT_SEG_REG_DEFAULT is returned if no segment override
+ * prefixes were found.
+ *
+ * -EINVAL in case of error.
+ */
+static int get_overridden_seg_reg_idx(struct insn *insn)
+{
+   int idx = INAT_SEG_REG_DEFAULT;
+   int sel_overrides = 0, i;
+
+   if (!insn)
+   return -EINVAL;
+
+   insn_get_prefixes(insn);
+
+   /* Look for any segment override prefixes. */
+   for (i = 0; i < insn->prefixes.nbytes; i++) {
+   insn_attr_t attr;
+
+   attr = inat_get_opcode_attribute(insn->prefixes.bytes[i]);
+   switch (attr) {
+   case INAT_MAKE_PREFIX(INAT_PFX_CS):
+

[PATCH v9 14/29] x86/insn-eval: Add utility function to get segment descriptor

2017-10-03 Thread Ricardo Neri
The segment descriptor contains information that is relevant to how linear
addresses need to be computed. It contains the default size of addresses
as well as the base address of the segment. Thus, given a segment
selector, we ought to look at segment descriptor to correctly calculate
the linear address.

In protected mode, the segment selector might indicate a segment
descriptor from either the global descriptor table or a local descriptor
table. Both cases are considered in this function.

This function is a prerequisite for functions in subsequent commits that
will obtain the aforementioned attributes of the segment descriptor.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/lib/insn-eval.c | 57 
 1 file changed, 57 insertions(+)

diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 77b48f9..d599dc3 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -6,9 +6,13 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #undef pr_fmt
@@ -450,6 +454,59 @@ static int get_reg_offset(struct insn *insn, struct 
pt_regs *regs,
 }
 
 /**
+ * get_desc() - Obtain address of segment descriptor
+ * @sel:   Segment selector
+ *
+ * Given a segment selector, obtain a pointer to the segment descriptor.
+ * Both global and local descriptor tables are supported.
+ *
+ * Returns:
+ *
+ * Pointer to segment descriptor on success.
+ *
+ * NULL on error.
+ */
+static struct desc_struct *get_desc(unsigned short sel)
+{
+   struct desc_ptr gdt_desc = {0, 0};
+   unsigned long desc_base;
+
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+   struct desc_struct *desc = NULL;
+   struct ldt_struct *ldt;
+
+   if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+   /* Bits [15:3] contain the index of the desired entry. */
+   sel >>= 3;
+
+   mutex_lock(¤t->active_mm->context.lock);
+   ldt = current->active_mm->context.ldt;
+   if (ldt && sel < ldt->nr_entries)
+   desc = &ldt->entries[sel];
+
+   mutex_unlock(¤t->active_mm->context.lock);
+
+   return desc;
+   }
+#endif
+   native_store_gdt(&gdt_desc);
+
+   /*
+* Segment descriptors have a size of 8 bytes. Thus, the index is
+* multiplied by 8 to obtain the memory offset of the desired descriptor
+* from the base of the GDT. As bits [15:3] of the segment selector
+* contain the index, it can be regarded as multiplied by 8 already.
+* All that remains is to clear bits [2:0].
+*/
+   desc_base = sel & ~(SEGMENT_RPL_MASK | SEGMENT_TI_MASK);
+
+   if (desc_base > gdt_desc.size)
+   return NULL;
+
+   return (struct desc_struct *)(gdt_desc.address + desc_base);
+}
+
+/**
  * insn_get_modrm_rm_off() - Obtain register in r/m part of the ModRM byte
  * @insn:  Instruction containing the ModRM byte
  * @regs:  Register values as seen when entering kernel mode
-- 
2.7.4



[PATCH v9 12/29] x86/insn-eval: Add utility function to identify string instructions

2017-10-03 Thread Ricardo Neri
String instructions are special because, in protected mode, the linear
address is always obtained via the ES segment register in operands that
use the (E)DI register; the DS segment register in operands that use
the (E)SI register. Furthermore, segment override prefixes are ignored
when calculating a linear address involving the (E)DI register; segment
override prefixes can be used when calculating linear addresses involving
the (E)SI register.

It follows that linear addresses are calculated differently for the case of
string instructions. The purpose of this utility function is to identify
such instructions for callers to determine a linear address correctly.

Note that this function only identifies string instructions; it does not
determine what segment register to use in the address computation. That is
left to callers. A subsequent commmit introduces a function to determine
the segment register to use given the instruction, operands and
segment override prefixes.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Reviewed-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
 arch/x86/lib/insn-eval.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 405ffeb..ac7b87c 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -19,6 +19,34 @@ enum reg_type {
REG_TYPE_BASE,
 };
 
+/**
+ * is_string_insn() - Determine if instruction is a string instruction
+ * @insn:  Instruction containing the opcode to inspect
+ *
+ * Returns:
+ *
+ * true if the instruction, determined by the opcode, is any of the
+ * string instructions as defined in the Intel Software Development manual.
+ * False otherwise.
+ */
+static bool is_string_insn(struct insn *insn)
+{
+   insn_get_opcode(insn);
+
+   /* All string instructions have a 1-byte opcode. */
+   if (insn->opcode.nbytes != 1)
+   return false;
+
+   switch (insn->opcode.bytes[0]) {
+   case 0x6c ... 0x6f: /* INS, OUTS */
+   case 0xa4 ... 0xa7: /* MOVS, CMPS */
+   case 0xaa ... 0xaf: /* STOS, LODS, SCAS */
+   return true;
+   default:
+   return false;
+   }
+}
+
 static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
  enum reg_type type)
 {
-- 
2.7.4



[PATCH v9 15/29] x86/insn-eval: Add utility functions to get segment descriptor base address and limit

2017-10-03 Thread Ricardo Neri
With segmentation, the base address of the segment is needed to compute a
linear address. This base address is obtained from the applicable segment
descriptor. Such segment descriptor is referenced from a segment selector.
These new functions obtain the segment base and limit of the segment
selector indicated by segment register index given as argument. This index
is any of the INAT_SEG_REG_* family of #define's.

The logic to obtain the segment selector is wrapped in the function
get_seg_selector() with the inputs described above. Once the selector is
known, the base address is determined. In protected mode, the selector is
used to obtain the segment descriptor and then its base address. In 64-bit
user mode, the segment base address is zero except when FS or GS are used.
In virtual-8086 mode, the base address is computed as the value of the
segment selector shifted 4 positions to the left.

In protected mode, segment limits are enforced. Thus, a function to
determine the limit of the segment is added. Segment limits are not
enforced in long or virtual-8086. For the latter, addresses are limited
to 20 bits; address size will be handled when computing the linear
address.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/insn-eval.h |   1 +
 arch/x86/lib/insn-eval.c | 117 +++
 2 files changed, 118 insertions(+)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 7e8c963..25d6e44 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -13,5 +13,6 @@
 
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
+unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx);
 
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index d599dc3..02c4498 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -507,6 +507,123 @@ static struct desc_struct *get_desc(unsigned short sel)
 }
 
 /**
+ * insn_get_seg_base() - Obtain base address of segment descriptor.
+ * @regs:  Register values as seen when entering kernel mode
+ * @seg_reg_idx:   Index of the segment register pointing to seg descriptor
+ *
+ * Obtain the base address of the segment as indicated by the segment 
descriptor
+ * pointed by the segment selector. The segment selector is obtained from the
+ * input segment register index seg_reg_idx.
+ *
+ * Returns:
+ *
+ * In protected mode, base address of the segment. Zero in long mode,
+ * except when FS or GS are used. In virtual-8086 mode, the segment
+ * selector shifted 4 positions to the right.
+ *
+ * -1L in case of error.
+ */
+unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
+{
+   struct desc_struct *desc;
+   short sel;
+
+   sel = get_segment_selector(regs, seg_reg_idx);
+   if (sel < 0)
+   return -1L;
+
+   if (v8086_mode(regs))
+   /*
+* Base is simply the segment selector shifted 4
+* positions to the right.
+*/
+   return (unsigned long)(sel << 4);
+
+   if (user_64bit_mode(regs)) {
+   /*
+* Only FS or GS will have a base address, the rest of
+* the segments' bases are forced to 0.
+*/
+   unsigned long base;
+
+   if (seg_reg_idx == INAT_SEG_REG_FS)
+   rdmsrl(MSR_FS_BASE, base);
+   else if (seg_reg_idx == INAT_SEG_REG_GS)
+   /*
+* swapgs was called at the kernel entry point. Thus,
+* MSR_KERNEL_GS_BASE will have the user-space GS base.
+*/
+   rdmsrl(MSR_KERNEL_GS_BASE, base);
+   else if (seg_reg_idx != INAT_SEG_REG_IGNORE)
+   /* We should ignore the rest of segment registers. */
+   base = -1L;
+   else
+   base = 0;
+   return base;
+   }
+
+   /* In protected mode the segment selector cannot be null. */
+   if (!sel)
+   return -1L;
+
+   desc = get_desc(sel);
+   if (!desc)
+   return -1L;
+
+   return get_desc_base(desc);
+}
+
+/**
+ * get_seg_limit() - Obtain the limit of a segment descriptor
+ * @regs:  Register values as seen when entering kernel mode
+ * @seg_reg_idx:   Index of the segment register pointing to seg descriptor
+ *
+ * Obtain the limit of the segment as indicated by the 

[PATCH v9 11/29] x86/insn-eval: Add a utility function to get register offsets

2017-10-03 Thread Ricardo Neri
The function get_reg_offset() returns the offset to the register the
argument specifies as indicated in an enumeration of type offset. Callers
of this function would need the definition of such enumeration. This is
not needed. Instead, add helper functions for this purpose. These functions
are useful in cases when, for instance, the caller needs to decide whether
the operand is a register or a memory location by looking at the rm part
of the ModRM byte. As of now, this is the only helper function that is
needed.

Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Reviewed-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/insn-eval.h |  1 +
 arch/x86/lib/insn-eval.c | 17 +
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 5cab1b1..7e8c963 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -12,5 +12,6 @@
 #include 
 
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
+int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
 
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 4931d92..405ffeb 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -100,6 +100,23 @@ static int get_reg_offset(struct insn *insn, struct 
pt_regs *regs,
return regoff[regno];
 }
 
+/**
+ * insn_get_modrm_rm_off() - Obtain register in r/m part of the ModRM byte
+ * @insn:  Instruction containing the ModRM byte
+ * @regs:  Register values as seen when entering kernel mode
+ *
+ * Returns:
+ *
+ * The register indicated by the r/m part of the ModRM byte. The
+ * register is obtained as an offset from the base of pt_regs. In specific
+ * cases, the returned value can be -EDOM to indicate that the particular value
+ * of ModRM does not refer to a register and shall be ignored.
+ */
+int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs)
+{
+   return get_reg_offset(insn, regs, REG_TYPE_RM);
+}
+
 /*
  * return the address being referenced be instruction
  * for rm=3 returning the content of the rm reg
-- 
2.7.4



[PATCH v9 07/29] x86/mpx: Do not use SIB.index if its value is 100b and ModRM.mod is not 11b

2017-10-03 Thread Ricardo Neri
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software
Developer's Manual volume 2A states that when ModRM.mod !=11b and
ModRM.rm = 100b indexed register-indirect addressing is used. In other
words, a SIB byte follows the ModRM byte. In the specific case of
SIB.index = 100b, the scale*index portion of the computation of the
effective address is null. To signal callers of this particular situation,
get_reg_offset() can return -EDOM (-EINVAL continues to indicate that an
error when decoding the SIB byte).

An example of this situation can be the following instruction:

   8b 4c 23 80   mov -0x80(%rbx,%riz,1),%rcx
   ModRM:0x4c [mod:1b][reg:1b][rm:100b]
   SIB:  0x23 [scale:0b][index:100b][base:11b]
   Displacement: 0x80  (1-byte, as per ModRM.mod = 1b)

The %riz 'register' indicates a null index.

In long mode, a REX prefix may be used. When a REX prefix is present,
REX.X adds a fourth bit to the register selection of SIB.index. This gives
the ability to refer to all the 16 general purpose registers. When REX.X is
1b and SIB.index is 100b, the index is indicated in %r12. In our example,
this would look like:

   42 8b 4c 23 80mov -0x80(%rbx,%r12,1),%rcx
   REX:  0x42 [W:0b][R:0b][X:1b][B:0b]
   ModRM:0x4c [mod:1b][reg:1b][rm:100b]
   SIB:  0x23 [scale:0b][.X: 1b, index:100b][.B:0b, base:11b]
   Displacement: 0x80  (1-byte, as per ModRM.mod = 1b)

%r12 is a valid register to use in the scale*index part of the effective
address computation.

Cc: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Peter Zijlstra 
Cc: Nathan Howard 
Cc: Adan Hawthorn 
Cc: Joe Perches 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Reviewed-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
 arch/x86/mm/mpx.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 57e5bf5..2ad1d4a 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -110,6 +110,15 @@ static int get_reg_offset(struct insn *insn, struct 
pt_regs *regs,
regno = X86_SIB_INDEX(insn->sib.value);
if (X86_REX_X(insn->rex_prefix.value))
regno += 8;
+
+   /*
+* If ModRM.mod != 3 and SIB.index = 4 the scale*index
+* portion of the address computation is null. This is
+* true only if REX.X is 0. In such a case, the SIB index
+* is used in the address computation.
+*/
+   if (X86_MODRM_MOD(insn->modrm.value) != 3 && regno == 4)
+   return -EDOM;
break;
 
case REG_TYPE_BASE:
@@ -160,11 +169,19 @@ static void __user *mpx_get_addr_ref(struct insn *insn, 
struct pt_regs *regs)
goto out;
 
indx_offset = get_reg_offset(insn, regs, 
REG_TYPE_INDEX);
-   if (indx_offset < 0)
+   /*
+* A negative offset generally means a error, except
+* -EDOM, which means that the contents of the register
+* should not be used as index.
+*/
+   if (indx_offset == -EDOM)
+   indx = 0;
+   else if (indx_offset < 0)
goto out;
+   else
+   indx = regs_get_register(regs, indx_offset);
 
base = regs_get_register(regs, base_offset);
-   indx = regs_get_register(regs, indx_offset);
 
eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
} else {
-- 
2.7.4



[PATCH v9 06/29] x86/mpx: Use signed variables to compute effective addresses

2017-10-03 Thread Ricardo Neri
Even though memory addresses are unsigned, the operands used to compute the
effective address do have a sign. This is true for ModRM.rm, SIB.base,
SIB.index as well as the displacement bytes. Thus, signed variables shall
be used when computing the effective address from these operands. Once the
signed effective address has been computed, it is casted to an unsigned
long to determine the linear address.

Variables are renamed to better reflect the type of address being
computed.

Cc: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Peter Zijlstra 
Cc: Nathan Howard 
Cc: Adan Hawthorn 
Cc: Joe Perches 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/mm/mpx.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index f4c48a0..57e5bf5 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -138,8 +138,9 @@ static int get_reg_offset(struct insn *insn, struct pt_regs 
*regs,
  */
 static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 {
-   unsigned long addr = -1L, base, indx;
int addr_offset, base_offset, indx_offset;
+   unsigned long linear_addr = -1L;
+   long eff_addr, base, indx;
insn_byte_t sib;
 
insn_get_modrm(insn);
@@ -150,7 +151,8 @@ static void __user *mpx_get_addr_ref(struct insn *insn, 
struct pt_regs *regs)
addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
if (addr_offset < 0)
goto out;
-   addr = regs_get_register(regs, addr_offset);
+
+   eff_addr = regs_get_register(regs, addr_offset);
} else {
if (insn->sib.nbytes) {
base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE);
@@ -163,17 +165,23 @@ static void __user *mpx_get_addr_ref(struct insn *insn, 
struct pt_regs *regs)
 
base = regs_get_register(regs, base_offset);
indx = regs_get_register(regs, indx_offset);
-   addr = base + indx * (1 << X86_SIB_SCALE(sib));
+
+   eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
} else {
addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
if (addr_offset < 0)
goto out;
-   addr = regs_get_register(regs, addr_offset);
+
+   eff_addr = regs_get_register(regs, addr_offset);
}
-   addr += insn->displacement.value;
+
+   eff_addr += insn->displacement.value;
}
+
+   linear_addr = (unsigned long)eff_addr;
+
 out:
-   return (void __user *)addr;
+   return (void __user *)linear_addr;
 }
 
 static int mpx_insn_decode(struct insn *insn,
-- 
2.7.4



[PATCH v9 08/29] x86/mpx: Do not use SIB.base if its value is 101b and ModRM.mod = 0

2017-10-03 Thread Ricardo Neri
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software
Developer's Manual volume 2A states that if a SIB byte is used and
SIB.base is 101b and ModRM.mod is zero, then the base part of the base
part of the effective address computation is null. To signal this
situation, a -EDOM error is returned to indicate callers to ignore the
base value present in the register operand.

In this scenario, a 32-bit displacement follows the SIB byte. Displacement
is obtained when the instruction decoder parses the operands.

Cc: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Peter Zijlstra 
Cc: Nathan Howard 
Cc: Adan Hawthorn 
Cc: Joe Perches 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Reviewed-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
 arch/x86/mm/mpx.c | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 2ad1d4a..581a960 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -123,6 +123,14 @@ static int get_reg_offset(struct insn *insn, struct 
pt_regs *regs,
 
case REG_TYPE_BASE:
regno = X86_SIB_BASE(insn->sib.value);
+   /*
+* If ModRM.mod is 0 and SIB.base == 5, the base of the
+* register-indirect addressing is 0. In this case, a
+* 32-bit displacement follows the SIB byte.
+*/
+   if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5)
+   return -EDOM;
+
if (X86_REX_B(insn->rex_prefix.value))
regno += 8;
break;
@@ -164,16 +172,22 @@ static void __user *mpx_get_addr_ref(struct insn *insn, 
struct pt_regs *regs)
eff_addr = regs_get_register(regs, addr_offset);
} else {
if (insn->sib.nbytes) {
+   /*
+* Negative values in the base and index offset means
+* an error when decoding the SIB byte. Except -EDOM,
+* which means that the registers should not be used
+* in the address computation.
+*/
base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE);
-   if (base_offset < 0)
+   if (base_offset == -EDOM)
+   base = 0;
+   else if (base_offset < 0)
goto out;
+   else
+   base = regs_get_register(regs, base_offset);
 
indx_offset = get_reg_offset(insn, regs, 
REG_TYPE_INDEX);
-   /*
-* A negative offset generally means a error, except
-* -EDOM, which means that the contents of the register
-* should not be used as index.
-*/
+
if (indx_offset == -EDOM)
indx = 0;
else if (indx_offset < 0)
@@ -181,8 +195,6 @@ static void __user *mpx_get_addr_ref(struct insn *insn, 
struct pt_regs *regs)
else
indx = regs_get_register(regs, indx_offset);
 
-   base = regs_get_register(regs, base_offset);
-
eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
} else {
addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
-- 
2.7.4



[PATCH v9 09/29] x86/mpx, x86/insn: Relocate insn util functions to a new insn-eval file

2017-10-03 Thread Ricardo Neri
Other kernel submodules can benefit from using the utility functions
defined in mpx.c to obtain the addresses and values of operands contained
in the general purpose registers. An instance of this is the emulation code
used for instructions protected by the Intel User-Mode Instruction
Prevention feature.

Thus, these functions are relocated to a new insn-eval.c file. The reason
to not relocate these utilities into insn.c is that the latter solely
analyses instructions given by a struct insn without any knowledge of the
meaning of the values of instruction operands. This new utility insn-
eval.c aims to be used to resolve userspace linear addresses based on
the contents of the instruction operands as well as the contents of pt_regs
structure.

These utilities come with a separate header. This is to avoid taking insn.c
out of sync from the instructions decoders under tools/obj and tools/perf.
This also avoids adding cumbersome #ifdef's for the #include'd files
required to decode instructions in a kernel context.

Functions are simply relocated. There are not functional or indentation
changes.

Cc: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Reviewed-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
The checkpatch script issues the following warning with this
commit:

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code
rather than BUG() or BUG_ON()
+   BUG();

This warning will be fixed in a subsequent patch.
---
 arch/x86/include/asm/insn-eval.h |  16 
 arch/x86/lib/Makefile|   2 +-
 arch/x86/lib/insn-eval.c | 163 +++
 arch/x86/mm/mpx.c| 156 +
 4 files changed, 182 insertions(+), 155 deletions(-)
 create mode 100644 arch/x86/include/asm/insn-eval.h
 create mode 100644 arch/x86/lib/insn-eval.c

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
new file mode 100644
index 000..5cab1b1
--- /dev/null
+++ b/arch/x86/include/asm/insn-eval.h
@@ -0,0 +1,16 @@
+#ifndef _ASM_X86_INSN_EVAL_H
+#define _ASM_X86_INSN_EVAL_H
+/*
+ * A collection of utility functions for x86 instruction analysis to be
+ * used in a kernel context. Useful when, for instance, making sense
+ * of the registers indicated by operands.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
+
+#endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 34a7413..675d7b0 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,7 +23,7 @@ lib-y := delay.o misc.o cmdline.o cpu.o
 lib-y += usercopy_$(BITS).o usercopy.o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
-lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
 lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
new file mode 100644
index 000..df9418c
--- /dev/null
+++ b/arch/x86/lib/insn-eval.c
@@ -0,0 +1,163 @@
+/*
+ * Utility functions for x86 operand and address decoding
+ *
+ * Copyright (C) Intel Corporation 2017
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+enum reg_type {
+   REG_TYPE_RM = 0,
+   REG_TYPE_INDEX,
+   REG_TYPE_BASE,
+};
+
+static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
+ enum reg_type type)
+{
+   int regno = 0;
+
+   static const int regoff[] = {
+   offsetof(struct pt_regs, ax),
+   offsetof(struct pt_regs, cx),
+   offsetof(struct pt_regs, dx),
+   offsetof(struct pt_regs, bx),
+   offsetof(struct pt_regs, sp),
+   offsetof(struct pt_regs, bp),
+   offsetof(struct pt_regs, si),
+   offsetof(struct pt_regs, di),
+#ifdef CONFIG_X86_64
+   offsetof(struct pt_regs, r8),
+   offsetof(struct pt_regs, r9),
+   offsetof(struct pt_regs, r10),
+   offsetof(struct pt_regs, r11),
+   offsetof(struct pt_regs, r12),
+   offsetof(struct pt_regs, r13),
+   offsetof(struct pt_regs, r14),
+   offsetof(struct pt_regs, r15),
+#endif
+   };
+   int nr_registers = ARRAY_SIZE(regoff);
+   /*
+* Don't possibly decode a 32-bit instructions as
+* reading a 64-bit-only register.
+*/
+   if (IS_ENABLED(CONFIG_X86_64) && !insn->x86_64)
+   nr_registers -= 8;
+
+   switch (type) {
+

[PATCH v9 10/29] x86/insn-eval: Do not BUG on invalid register type

2017-10-03 Thread Ricardo Neri
We are not in a critical failure path. The invalid register type is caused
when trying to decode invalid instruction bytes from a user-space program.
Thus, simply print an error message. To prevent this warning from being
abused from user space programs, use the rate-limited variant of pr_err().
along with a descriptive prefix.

Cc: Borislav Petkov 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Adam Buchbinder 
Cc: Colin Ian King 
Cc: Lorenzo Stoakes 
Cc: Qiaowei Ren 
Cc: Arnaldo Carvalho de Melo 
Cc: Masami Hiramatsu 
Cc: Adrian Hunter 
Cc: Kees Cook 
Cc: Thomas Garnier 
Cc: Peter Zijlstra 
Cc: Dmitry Vyukov 
Cc: Ravi V. Shankar 
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
 arch/x86/lib/insn-eval.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index df9418c..4931d92 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -5,10 +5,14 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 
+#undef pr_fmt
+#define pr_fmt(fmt) "insn: " fmt
+
 enum reg_type {
REG_TYPE_RM = 0,
REG_TYPE_INDEX,
@@ -85,9 +89,8 @@ static int get_reg_offset(struct insn *insn, struct pt_regs 
*regs,
break;
 
default:
-   pr_err("invalid register type");
-   BUG();
-   break;
+   pr_err_ratelimited("invalid register type: %d\n", type);
+   return -EINVAL;
}
 
if (regno >= nr_registers) {
-- 
2.7.4



[PATCH v9 00/29] x86: Enable User-Mode Instruction Prevention

2017-10-03 Thread Ricardo Neri
This is v9 of this series. The seven previous submissions can be found
here [1], here [2], here[3], here[4], here[5], here[6], here[7] and here[8].
This version addresses the feedback comments from Borislav Petkov received on
v7. Please see details in the change log.

=== What is UMIP?

User-Mode Instruction Prevention (UMIP) is a security feature present in
new Intel Processors. If enabled, it prevents the execution of certain
instructions if the Current Privilege Level (CPL) is greater than 0. If
these instructions were executed while in CPL > 0, user space applications
could have access to system-wide settings such as the global and local
descriptor tables, the segment selectors to the current task state and the
local descriptor table. Hiding these system resources reduces the tools
available to craft privilege escalation attacks such as [9].

These are the instructions covered by UMIP:
* SGDT - Store Global Descriptor Table
* SIDT - Store Interrupt Descriptor Table
* SLDT - Store Local Descriptor Table
* SMSW - Store Machine Status Word
* STR - Store Task Register

If any of these instructions is executed with CPL > 0, a general protection
exception is issued when UMIP is enabled.

=== How does it impact applications?

When enabled, However, UMIP will change the behavior that certain
applications expect from the operating system. For instance, programs
running on WineHQ and DOSEMU2 rely on some of these instructions to
function. Stas Sergeev found that Microsoft Windows 3.1 and dos4gw use the
instruction SMSW when running in virtual-8086 mode[10]. SGDT and SIDT can
also be used on virtual-8086 mode.

In order to not change the behavior of the system. This patchset emulates
SGDT, SIDT and SMSW. This should be sufficient to not break the
applications mentioned above. Regarding the two remaining instructions, STR
and SLDT, the WineHQ team has shown interest catching the general protection
fault and use it as a vehicle to fix broken applications[11]. Furthermore,
STR and SLDT can only run in protected and long modes.

DOSEMU2 emulates virtual-8086 mode via KVM. No applications will be broken
unless DOSEMU2 decides to enable the CR4.UMIP bit in platforms that support
it. Also, this should not pose a security risk as no system resouces would
be revealed. Instead, code running inside the KVM would only see the KVM's
GDT, IDT and MSW.

Please note that UMIP is always enabled for both 64-bit and 32-bit Linux
builds. However, emulation of the UMIP-protected instructions is not done
for 64-bit processes. 64-bit user space applications will receive the
SIGSEGV signal when UMIP instructions causes a general protection fault.

=== How are UMIP-protected instructions emulated?

UMIP is kept enabled at all times when the CONFIG_x86_INTEL_UMIP option is
selected. If a general protection fault caused by the instructions
protected by UMIP is detected, such fault will be trapped and fixed-up. The
return values will be dummy as follows:
 
 * SGDT and SIDT return hard-coded dummy values as the base of the global
   descriptor and interrupt descriptor tables. These hard-coded values
   correspond to memory addresses that are near the end of the kernel
   memory map. This is also the case for virtual-8086 mode tasks. In all
   my experiments with 32-bit processes, the base of GDT and IDT was always
   a 4-byte address, even for 16-bit operands. Thus, my emulation code does
   the same. In all cases, the limit of the table is set to 0.
 * SMSW returns the value with which the CR0 register is programmed in
   head_32/64.S at boot time. This is, the following bits are enabled:
   CR0.0 for Protection Enable, CR.1 for Monitor Coprocessor, CR.4 for
   Extension Type, which will always be 1 in recent processors with UMIP;
   CR.5 for Numeric Error, CR0.16 for Write Protect, CR0.18 for Alignment
   Mask and CR0.31 for Paging. As per the Intel 64 and IA-32 Architectures
   Software Developer's Manual, SMSW returns a 16-bit results for memory
   operands. However, when the operand is a register, the results can be up
   to CR0[63:0]. Since the emulation code only kicks-in for 32-bit
   processes, we return up to CR[31:0].
 * The proposed emulation code is handles faults that happens in both
   protected and virtual-8086 mode.
 * Again, STR and SLDT are not emulated.

=== How is this series laid out?

++ Preparatory work
As per suggestions from Andy Lutormirsky and Borislav Petkov, I moved
the x86 page fault error codes to a header. Also, I made user_64bit_mode
available to x86_32 builds. This helps to reuse code and reduce the number
of #ifdef's in these patches. Borislav also suggested to uprobes should use
the existing definitions in arch/x86/include/asm/inat.h instead of hard-
coded values when checking instruction prefixes. I included this change
in the series.

++ Fix bugs in MPX address decoder
I found very useful the code for Intel MPX (Memory Protection Extensions)
used to parse opcodes and the memory locations contained in th

[PATCH v9 02/29] x86/boot: Relocate definition of the initial state of CR0

2017-10-03 Thread Ricardo Neri
Both head_32.S and head_64.S utilize the same value to initialize the
control register CR0. Also, other parts of the kernel might want to access
this initial definition (e.g., emulation code for User-Mode Instruction
Prevention uses this state to provide a sane dummy value for CR0 when
emulating the smsw instruction). Thus, relocate this definition to a
header file from which it can be conveniently accessed.

Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org
Suggested-by: Borislav Petkov 
Reviewed-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/uapi/asm/processor-flags.h | 3 +++
 arch/x86/kernel/head_32.S   | 3 ---
 arch/x86/kernel/head_64.S   | 3 ---
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/uapi/asm/processor-flags.h 
b/arch/x86/include/uapi/asm/processor-flags.h
index 185f3d1..39946d0 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -151,5 +151,8 @@
 #define CX86_ARR_BASE  0xc4
 #define CX86_RCR_BASE  0xdc
 
+#define CR0_STATE  (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
+X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
+X86_CR0_PG)
 
 #endif /* _UAPI_ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 9ed3074..c3cfc65 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -211,9 +211,6 @@ ENTRY(startup_32_smp)
 #endif
 
 .Ldefault_entry:
-#define CR0_STATE  (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
-X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
-X86_CR0_PG)
movl $(CR0_STATE & ~X86_CR0_PG),%eax
movl %eax,%cr0
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 42e32c2..205dabc 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -152,9 +152,6 @@ ENTRY(secondary_startup_64)
 1: wrmsr   /* Make changes effective */
 
/* Setup cr0 */
-#define CR0_STATE  (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
-X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
-X86_CR0_PG)
movl$CR0_STATE, %eax
/* Make changes effective */
movq%rax, %cr0
-- 
2.7.4



[PATCH v9 01/29] x86/mm: Relocate page fault error codes to traps.h

2017-10-03 Thread Ricardo Neri
Up to this point, only fault.c used the definitions of the page fault error
codes. Thus, it made sense to keep them within such file. Other portions of
code might be interested in those definitions too. For instance, the User-
Mode Instruction Prevention emulation code will use such definitions to
emulate a page fault when it is unable to successfully copy the results
of the emulated instructions to user space.

While relocating the error code enumeration, the prefix X86_ is used to
make it consistent with the rest of the definitions in traps.h. Of course,
code using the enumeration had to be updated as well. No functional changes
were performed.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Andy Lutomirski 
Cc: "Kirill A. Shutemov" 
Cc: Josh Poimboeuf 
Cc: Dave Hansen 
Cc: Paul Gortmaker 
Cc: x...@kernel.org
Reviewed-by: Andy Lutomirski 
Reviewed-by: Borislav Petkov 
Signed-off-by: Ricardo Neri 
---
 arch/x86/include/asm/traps.h | 18 +
 arch/x86/mm/fault.c  | 88 +---
 2 files changed, 52 insertions(+), 54 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 5545f64..da3c3a3 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -144,4 +144,22 @@ enum {
X86_TRAP_IRET = 32, /* 32, IRET Exception */
 };
 
+/*
+ * Page fault error code bits:
+ *
+ *   bit 0 ==   0: no page found   1: protection fault
+ *   bit 1 ==   0: read access 1: write access
+ *   bit 2 ==   0: kernel-mode access  1: user-mode access
+ *   bit 3 ==  1: use of reserved bit detected
+ *   bit 4 ==  1: fault was an instruction fetch
+ *   bit 5 ==  1: protection keys block access
+ */
+enum x86_pf_error_code {
+   X86_PF_PROT =   1 << 0,
+   X86_PF_WRITE=   1 << 1,
+   X86_PF_USER =   1 << 2,
+   X86_PF_RSVD =   1 << 3,
+   X86_PF_INSTR=   1 << 4,
+   X86_PF_PK   =   1 << 5,
+};
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e2baeaa..db71c73 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -29,26 +29,6 @@
 #include 
 
 /*
- * Page fault error code bits:
- *
- *   bit 0 ==   0: no page found   1: protection fault
- *   bit 1 ==   0: read access 1: write access
- *   bit 2 ==   0: kernel-mode access  1: user-mode access
- *   bit 3 ==  1: use of reserved bit detected
- *   bit 4 ==  1: fault was an instruction fetch
- *   bit 5 ==  1: protection keys block access
- */
-enum x86_pf_error_code {
-
-   PF_PROT =   1 << 0,
-   PF_WRITE=   1 << 1,
-   PF_USER =   1 << 2,
-   PF_RSVD =   1 << 3,
-   PF_INSTR=   1 << 4,
-   PF_PK   =   1 << 5,
-};
-
-/*
  * Returns 0 if mmiotrace is disabled, or if the fault is not
  * handled by mmiotrace:
  */
@@ -149,7 +129,7 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, 
unsigned long addr)
 * If it was a exec (instruction fetch) fault on NX page, then
 * do not ignore the fault:
 */
-   if (error_code & PF_INSTR)
+   if (error_code & X86_PF_INSTR)
return 0;
 
instr = (void *)convert_ip_to_linear(current, regs);
@@ -179,7 +159,7 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, 
unsigned long addr)
  * siginfo so userspace can discover which protection key was set
  * on the PTE.
  *
- * If we get here, we know that the hardware signaled a PF_PK
+ * If we get here, we know that the hardware signaled a X86_PF_PK
  * fault and that there was a VMA once we got in the fault
  * handler.  It does *not* guarantee that the VMA we find here
  * was the one that we faulted on.
@@ -204,7 +184,7 @@ static void fill_sig_info_pkey(int si_code, siginfo_t 
*info, u32 *pkey)
/*
 * force_sig_info_fault() is called from a number of
 * contexts, some of which have a VMA and some of which
-* do not.  The PF_PK handing happens after we have a
+* do not.  The X86_PF_PK handing happens after we have a
 * valid VMA, so we should never reach this without a
 * valid VMA.
 */
@@ -697,7 +677,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long 
error_code,
if (!oops_may_print())
return;
 
-   if (error_code & PF_INSTR) {
+   if (error_code & X86_PF_INSTR) {
unsigned int level;
pgd_t *pgd;
pte_t *pte;
@@ -779,7 +759,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 */
if (current->thread.sig_on_uaccess_err && signal) {
   

Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables

2017-10-03 Thread David Miller
From: Santosh Shilimkar 
Date: Tue, 3 Oct 2017 17:42:39 -0700

> Hi Dave,
> 
> On 10/2/2017 1:30 PM, Santosh Shilimkar wrote:
>> On 10/1/2017 10:56 PM, David Miller wrote:
>>> From: David Miller 
> [...]
> 
>>>
>>> Actually, reverted, this breaks the build.
>>>
>>> net/rds/rdma_transport.c:38:10: fatal error: ib.h: No such file or
>>> directory
>>>   #include "ib.h"
>>>
>>> Although I can't see how in the world this patch is causing such
>>> an error.
>>>
>> Weird indeed. Will sort this out with Avinash. Thanks Dave.
>> 
> I tried few build combinations on net-next but couldn't
> reproduce the build failure. AFAIU, the build error can only
> happen if for some reason the ib.h file got deleted
> accidentally. I did delete ib.h file and saw below error
> 
>   CC [M]  net/rds/rdma_transport.o
> net/rds/rdma_transport.c:38:16: error: ib.h: No such file or directory
> 
> Could it be the case for some reason the ib.h file got
> deleted or mangled while applying the $subject patch ?

Please post the patch again, I will try to sort it out.


Re: [RFC PATCH 1/2] kbuild: Add a cache for generated variables

2017-10-03 Thread Masahiro Yamada
Hi Douglas,


2017-09-27 2:55 GMT+09:00 Douglas Anderson :
> While timing a "no-op" build of the kernel (incrementally building the
> kernel even though nothing changed) in the Chrome OS build system I
> found that it was much slower than I expected.
>
> Digging into things a bit, I found that quite a bit of the time was
> spent invoking the C compiler even though we weren't actually building
> anything.  Currently in the Chrome OS build system the C compiler is
> called through a number of wrappers (one of which is written in
> python!) and can take upwards of 100 ms to invoke even if we're not
> doing anything difficult, so these invocations of the compiler were
> taking a lot of time.  Worse the invocations couldn't seem to take
> advantage of the multiple cores on my system.
>
> Certainly it seems like we could make the compiler invocations in the
> Chrome OS build system faster, but only to a point.  Inherently
> invoking a program as big as a C compiler is a fairly heavy
> operation.  Thus even if we can speed the compiler calls it made sense
> to track down what was happening.
>
> It turned out that all the compiler invocations were coming from
> usages like this in the kernel's Makefile:
>
> KBUILD_CFLAGS += $(call cc-option,-fno-delete-null-pointer-checks,)
>
> Due to the way cc-option and similar statements work the above
> contains an implicit call to the C compiler.  ...and due to the fact
> that we're storing the result in KBUILD_CFLAGS, a simply expanded
> variable, the call will happen every time the Makefile is parsed, even
> if there are no users of KBUILD_CFLAGS.
>
> Rather than redoing this computation every time, it makes a lot of
> sense to cache the result of all of the Makefile's compiler calls just
> like we do when we compile a ".c" file to a ".o" file.  Conceptually
> this is quite a simple idea.  ...and since the calls to invoke the
> compiler and similar tools are centrally located in the Kbuild.include
> file this doesn't even need to be super invasive.
>
> Implementing the cache in a simple-to-use and efficient way is not
> quite as simple as it first sounds, though.  To get maximum speed we
> really want the cache in a format that make can natively understand
> and make doesn't really have an ability to load/parse files. ...but
> make _can_ import other Makefiles, so the solution is to store the
> cache in Makefile format.  This requires coming up with a valid/unique
> Makefile variable name for each value to be cached, but that's
> solvable with some cleverness.
>
> After this change, we'll automatically create a "Makefile-cache.o"
> file that will contain our cached variables.  We'll load this on each
> invocation of make and will avoid recomputing anything that's already
> in our cache.  The cache is stored in a format that it shouldn't need
> any invalidation since anything that might change should affect the
> "key" and any old cached value won't be used.  The cache will be
> cleaned by virtue of the ".o" suffix by a "make clean".
>
> Signed-off-by: Douglas Anderson 



Thanks for the patches.  The idea is interesting.

I am not a Chrome developer, but cc-option could be improved somehow.


I examined two approaches to mitigate the pain.

[1] Skip cc-option completely when we run non-build targets
such as "make help", "make clean", etc.

[2] Cache the result of cc-option like this patch


I wrote some patches for [1]
https://patchwork.kernel.org/patch/9983825/
https://patchwork.kernel.org/patch/9983829/
https://patchwork.kernel.org/patch/9983833/
https://patchwork.kernel.org/patch/9983827/

Comments are welcome.  :)


[1] does not solve the slowness in the incremental build.
Instead, it makes non-build targets faster
from a clean source tree because cc-option is zero cost.


[2] solves the issues in the incremental build.
One funny thing is, it computes cc-option and store the cache file
even for "make clean", where we know the cache file will be
immediately deleted.


We can apply [1] or [2], or maybe both of them
because [1] and [2] are trying to solve the different issues.



The cache approach seemed clever, but I do not like the implementation.
The code is even more unreadable with lots of escape sequence.


Here is my suggestion for simpler implementation (including bike-shed)


Implement a new function "shell-cache".  It works like $(shell ...)

The difference is
$(call shell-cached,...) returns the cached result, or run the command
if not cached.



Also, add try-run-cached.  This is a cached variant of try-run.


I just played with it, and seems working.
(I did not have spend much time for testing, more wider test is needed.)


The code is like something like this:



make-cache := $(if $(KBUILD_EXTMOD),$(KBUILD_EXTMOD)/,$(if
$(obj),$(obj)/)).cache.mk

-include $(make-cache)

__sanitize-opt = $(subst $$,_,$(subst $(right_paren),_,$(subst
$(left_paren),_,$(subst \,_,$(subst =,_,$(subst $(space),_,$(subst
$(comma),_,$(1

define __run-and-store
ifeq ($(

Re: [RFC PATCH 1/2] kbuild: Pass HOSTCC and similar to tools Makefiles

2017-10-03 Thread Guenter Roeck
On Tue, Oct 3, 2017 at 8:20 PM, Masahiro Yamada
 wrote:
> 2017-10-04 5:48 GMT+09:00 Douglas Anderson :
>> The main Linux Makefiles and the tools sub-Makefiles have different
>> conventions for passing in CC / CFLAGS.
>>
>> Here's brief summary for the kernel:
>> * CC: target C compiler (must be passed as an argument to make to
>>   override)
>> * HOSTCC: host C compiler (must be passed as an argument to make to
>>   override)
>> * CFLAGS: ignored (kernel manages its own CFLAGS)
>> * KCFLAGS: extra cflags for the target (expected as an env var)
>> * HOSTCFLAGS: host C compiler flags (not modifiable by the caller of
>>   make)
>>
>> Here's a brief summary for the tools:
>> * CC: host C compiler (must be passed as an argument to make to
>>   override)
>> * CFLAGS: base arguments for the host C compiler which are added to by
>>   sub builds (expected as an env var)
>>
>> When the main kernel Makefile calls into the tools Makefile, it should
>> adapt things from its syntax to the tools Makefile syntax.
>>
>> If we don't do this, we have a few issues:
>> * If someone wants to user another compiler (like clang, maybe) for
>>   hostcc then the tools will still be compiled with gcc.
>> * If you happen to be building and you left ${CFLAGS} set to something
>>   random (maybe it matches ${CC}, the _target_ C compiler, not the
>>   _host_ C compiler) then this random value will be used to compile
>>   the tools.
>>
>> In any case, it seems like it makes sense to pass CC, CFLAGS, and
>> similar properly into the tools Makefile.
>>
>> NOTE: in order to do this properly, we need to add some new
>> definitions of HOSTAS and HOSTLD into the main kernel Makefile.  If we
>> don't do this and someone overrides "AS" or "LD" on the command line
>> then those (which were intended for the target) will be used to build
>> host side tools.  We also make up some imaginary "HOSTASFLAGS" and
>> "HOSTLDFLAGS".  Someone could specify these, but if they don't at
>> least these blank variables will properly clobber ASFLAGS and LDFLAGS
>> from the kernel.
>>
>> This was discovered in the Chrome OS build system where CFLAGS (an
>> environment variable) was accidentally left pointing to flags that
>> would be an appropriate match to CC.  The kernel didn't use these
>> CFLAGS so it was never an issue.  Turning on STACKVALIDATION, however,
>> suddenly invoked a tools build that blew up.
>>
>> Signed-off-by: Douglas Anderson 
>> ---
>>
>>  Makefile | 30 --
>>  1 file changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/Makefile b/Makefile
>> index cf007a31d575..0d3af0677d88 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -298,7 +298,9 @@ HOST_LFS_CFLAGS := $(shell getconf LFS_CFLAGS)
>>  HOST_LFS_LDFLAGS := $(shell getconf LFS_LDFLAGS)
>>  HOST_LFS_LIBS := $(shell getconf LFS_LIBS)
>>
>> +HOSTAS   = as
>>  HOSTCC   = gcc
>> +HOSTLD   = ld
>>  HOSTCXX  = g++
>>  HOSTCFLAGS   := -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 \
>> -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS)
>> @@ -1616,11 +1618,35 @@ image_name:
>>  # Clear a bunch of variables before executing the submake
>>  tools/: FORCE
>> $(Q)mkdir -p $(objtree)/tools
>> -   $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% 
>> -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/
>> +   $(Q)ASFLAGS="$(HOSTASFLAGS)"\
>> +   LDFLAGS="$(HOSTLDFLAGS)" \
>> +   CFLAGS="$(HOSTCFLAGS)" \
>> +   CXXFLAGS="$(HOSTCXXFLAGS)" \
>> +   $(MAKE) \
>> +   AS="$(HOSTAS)" \
>> +   CC="$(HOSTCC)" \
>> +   CXX="$(HOSTCXX)" \
>> +   LD="$(HOSTLD)" \
>> +   MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \
>> +   O=$(abspath $(objtree)) \
>> +   subdir=tools \
>> +   -C $(src)/tools/
>>
>>  tools/%: FORCE
>> $(Q)mkdir -p $(objtree)/tools
>> -   $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% 
>> -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/ $*
>> +   $(Q)ASFLAGS="$(HOSTASFLAGS)"\
>> +   LDFLAGS="$(HOSTLDFLAGS)" \
>> +   CFLAGS="$(HOSTCFLAGS)" \
>> +   CXXFLAGS="$(HOSTCXXFLAGS)" \
>> +   $(MAKE) \
>> +   AS="$(HOSTAS)" \
>> +   CC="$(HOSTCC)" \
>> +   CXX="$(HOSTCXX)" \
>> +   LD="$(HOSTLD)" \
>> +   MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \
>> +   O=$(abspath $(objtree)) \
>> +   subdir=tools \
>> +   -C $(src)/tools/ $*
>>
>
> Please do not do this.
>
>
>
> CC: for building tools that run on the target machine
> HOSTCC: for building tools that run on the build machine
>
>
> IMHO, this should be consistent to avoid confusion.
>
>
> Grepping CROSS_COMPILE under tools/ directory,
> I see many Makefile expect CC for the t

[PATCH] kbuild: fix dependency of dtbs targets

2017-10-03 Thread Masahiro Yamada
The target "dtbs" should depend on "scripts" because it needs to
build dtc.  The "prepare" target is unneeded here.

Signed-off-by: Masahiro Yamada 
---

 arch/arm/Makefile   | 2 +-
 arch/arm64/Makefile | 2 +-
 arch/nios2/Makefile | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm/Makefile b/arch/arm/Makefile
index 47d3a1a..cdb5b55 100644
--- a/arch/arm/Makefile
+++ b/arch/arm/Makefile
@@ -342,7 +342,7 @@ $(INSTALL_TARGETS):
 
 PHONY += dtbs dtbs_install
 
-dtbs: prepare scripts
+dtbs: scripts
$(Q)$(MAKE) $(build)=$(boot)/dts
 
 dtbs_install:
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 939b310..fd56fbc 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -133,7 +133,7 @@ zinstall install:
 
 PHONY += dtbs dtbs_install
 
-dtbs: prepare scripts
+dtbs: scripts
$(Q)$(MAKE) $(build)=$(boot)/dts
 
 dtbs_install:
diff --git a/arch/nios2/Makefile b/arch/nios2/Makefile
index 8673a79..abc3f5b 100644
--- a/arch/nios2/Makefile
+++ b/arch/nios2/Makefile
@@ -61,7 +61,7 @@ archclean:
 %.dtb: | scripts
$(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@
 
-dtbs:
+dtbs: scripts
$(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@
 
 $(BOOT_TARGETS): vmlinux
-- 
2.7.4



Re: [PATCH 3/8] ARM: dts: aspeed: Add I2C buses

2017-10-03 Thread Andrew Jeffery
On Thu, 2017-09-28 at 11:35 -0700, Brendan Higgins wrote:
> > On Thu, Sep 28, 2017 at 12:51 AM, Joel Stanley  wrote:
> > Now with an upstream i2c bus driver, we can add  the 14 i2c buses that
> > exist in ASPEED G4 and G5 generation SoCs.
> > 
> > It also adds aliases for the 14 built-in I2C busses to ensure userspace
> > sees the numbering staring from zero and counting up.
> > 
> > Signed-off-by: Joel Stanley 
> > ---
> 
> > Reviewed-by: Brendan Higgins 
> 
> nit: can we make the i2c labels and the pinctrl labels match?
> 
> For example:
> 
> > +
> > +   i2c13: i2c-bus@480 {
> > +   #address-cells = <1>;
> > +   #size-cells = <0>;
> > +   #interrupt-cells = <1>;
> > +
> > +   reg = <0x480 0x40>;
> > +   compatible = "aspeed,ast2500-i2c-bus";
> > +   clocks = <&clk_apb>;
> > +   bus-frequency = <10>;
> > +   interrupts = <13>;
> > +   interrupt-parent = <&i2c_ic>;
> > +   pinctrl-names = "default";
> > +   pinctrl-0 = <&pinctrl_i2c14_default>;
> > +   status = "disabled";
> > +   };
> 
> "i2c13" has a pinctrl-0 of "pinctrl_i2c14_default"
> 
> I know that pinctrl_i2c14_default is consistent with the function and
> groups it uses, but I would like to see them all be consistent at some
> point in the future.

I doubt we'll be making them consistent: How the i2c devices are aliased by
Linux are independent of how they're labelled in hardware or the datasheet.

For sanity's sake pinmux's functions and groups follow the datasheet (though if
you want to maintain the Aspeed pinctrl drivers, be my guest :D)

We could rename the pinmux nodes to offset them from the function and group,
but it feels less self-contained if the nodes are named at the whim of the
device requesting the function if there's no further configuration on top of
the function and group*. It also makes it harder to auto-explode the function
name into the node definitions as we'd then need some further mangling to make
things line up.

* The pinctrl devicetree bindings allow for definition of nested nodes to
specify collections and combinations of pin function and group, and pin
configuration such as drive-strength. Thus the nodes can be a lot more complex
than what we've defined up until now, but what we've got enables a lot of
pretty straight-forward uses-cases.

We could alias the i2c devices in line with the hardware, but that breaks the
convention of starting device numbering at 0.

Ultimately Joel and I chose to make the break at the point where the i2c device
requests the mux function. It's up to the device to know what hardware mux
configuration it needs and that's what this approach documents. The
inconsistency is irritating but I don't feel like it's the end of the world,
and I think the alternatives are (slightly) worse.

Cheers,

Andrew

signature.asc
Description: This is a digitally signed message part


  1   2   3   4   5   6   7   8   9   10   >