Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables
Hi Dave, On 10/2/2017 1:30 PM, Santosh Shilimkar wrote: On 10/1/2017 10:56 PM, David Miller wrote: From: David Miller [...] Actually, reverted, this breaks the build. net/rds/rdma_transport.c:38:10: fatal error: ib.h: No such file or directory #include "ib.h" Although I can't see how in the world this patch is causing such an error. Weird indeed. Will sort this out with Avinash. Thanks Dave. I tried a few build combinations on net-next but couldn't reproduce the build failure. AFAIU, the build error can only happen if for some reason the ib.h file got deleted accidentally. I did delete the ib.h file and saw the below error: CC [M] net/rds/rdma_transport.o net/rds/rdma_transport.c:38:16: error: ib.h: No such file or directory Could it be that the ib.h file got deleted or mangled for some reason while applying the $subject patch? Regards, Santosh
Re: [PATCH 1/6] f2fs: support issuing/waiting discard in range
Hi Jaegeuk, On 2017/10/4 4:16, Jaegeuk Kim wrote: > On 09/19, Chao Yu wrote: >> From: Chao Yu >> >> Fstrim intends to trim invalid blocks of filesystem only with specified >> range and granularity, but actually, it will issue all previous cached >> discard commands which may be out-of-range and be with unmatched >> granularity, it's unneeded. >> >> In order to fix above issues, this patch introduces new helps to support >> to issue and wait discard in range and adds a new fstrim_list for tracking >> in-flight discard from ->fstrim. >> >> Signed-off-by: Chao Yu >> --- >> fs/f2fs/f2fs.h| 1 + >> fs/f2fs/segment.c | 125 >> +- >> 2 files changed, 106 insertions(+), 20 deletions(-) >> >> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h >> index 89b4927dcd79..cb13c7df6971 100644 >> --- a/fs/f2fs/f2fs.h >> +++ b/fs/f2fs/f2fs.h >> @@ -249,6 +249,7 @@ struct discard_cmd_control { >> struct list_head pend_list[MAX_PLIST_NUM];/* store pending entries */ >> unsigned char pend_list_tag[MAX_PLIST_NUM];/* tag for pending entries */ >> struct list_head wait_list; /* store on-flushing entries */ >> +struct list_head fstrim_list; /* in-flight discard from >> fstrim */ >> wait_queue_head_t discard_wait_queue; /* waiting queue for wake-up */ >> unsigned int discard_wake; /* to wake up discard thread */ >> struct mutex cmd_lock; >> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c >> index dedf0209d820..088936c61905 100644 >> --- a/fs/f2fs/segment.c >> +++ b/fs/f2fs/segment.c >> @@ -845,9 +845,11 @@ void __check_sit_bitmap(struct f2fs_sb_info *sbi, >> >> /* this function is copied from blkdev_issue_discard from block/blk-lib.c */ >> static void __submit_discard_cmd(struct f2fs_sb_info *sbi, >> -struct discard_cmd *dc) >> +struct discard_cmd *dc, bool fstrim) >> { >> struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; >> +struct list_head *wait_list = fstrim ? 
&(dcc->fstrim_list) : >> +&(dcc->wait_list); >> struct bio *bio = NULL; >> >> if (dc->state != D_PREP) >> @@ -869,7 +871,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info >> *sbi, >> bio->bi_end_io = f2fs_submit_discard_endio; >> bio->bi_opf |= REQ_SYNC; >> submit_bio(bio); >> -list_move_tail(&dc->list, &dcc->wait_list); >> +list_move_tail(&dc->list, wait_list); >> __check_sit_bitmap(sbi, dc->start, dc->start + dc->len); >> >> f2fs_update_iostat(sbi, FS_DISCARD, 1); >> @@ -1054,6 +1056,68 @@ static int __queue_discard_cmd(struct f2fs_sb_info >> *sbi, >> return 0; >> } >> >> +static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi, >> +unsigned int start, unsigned int end, >> +unsigned int granularity) >> +{ >> +struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; >> +struct discard_cmd *prev_dc = NULL, *next_dc = NULL; >> +struct rb_node **insert_p = NULL, *insert_parent = NULL; >> +struct discard_cmd *dc; >> +struct blk_plug plug; >> +int issued; >> + >> +next: >> +issued = 0; >> + >> +mutex_lock(&dcc->cmd_lock); >> +f2fs_bug_on(sbi, !__check_rb_tree_consistence(sbi, &dcc->root)); >> + >> +dc = (struct discard_cmd *)__lookup_rb_tree_ret(&dcc->root, >> +NULL, start, >> +(struct rb_entry **)&prev_dc, >> +(struct rb_entry **)&next_dc, >> +&insert_p, &insert_parent, true); >> +if (!dc) >> +dc = next_dc; >> + >> +blk_start_plug(&plug); >> + >> +while (dc && dc->lstart <= end) { >> +struct rb_node *node; >> + >> +if (dc->len < granularity) >> +goto skip; >> + >> +if (dc->state != D_PREP) { >> +list_move_tail(&dc->list, &dcc->fstrim_list); >> +goto skip; >> +} >> + >> +__submit_discard_cmd(sbi, dc, true); >> + >> +if (++issued >= DISCARD_ISSUE_RATE) { >> +start = dc->lstart + dc->len; >> + >> +blk_finish_plug(&plug); >> +mutex_unlock(&dcc->cmd_lock); >> + >> +schedule(); >> + >> +goto next; >> +} >> +skip: >> +node = rb_next(&dc->rb_node); >> +dc = rb_entry_safe(node, struct discard_cmd, rb_node); >> + >> +if (fatal_signal_pending(current)) >> +break; >> +} >> + >> +blk_finish_plug(&plug); >> +mutex_unlock(&dcc->cmd_lock); >> +} >> + >> static int __issue_disca
Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless
Jerome Glisse wrote: > On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote: > >> I'd like some more explanation about the inner working of "that new >> user" as per comment above. >> >> It would be enough to drop mmu_notifier_invalidate_range from above >> without adding it to the filebacked case. The above gives higher prio >> to the hypothetical and uncertain future case, than to the current >> real filebacked case that doesn't need ->invalidate_range inside the >> PT lock, or do you see something that might already need such >> ->invalidate_range? > > No i don't see any new user today that might need such invalidate but > i was trying to be extra cautious as i have a tendency to assume that > someone might do a patch that uses try_to_unmap() without going through > all the comments in the function and thus possibly using it in an > unexpected way from the mmu_notifier callback point of view. I am fine > with putting the burden on the new user to get it right and adding an > extra warning in the function description to try to warn people in a > sensible way. I must be missing something. After the PTE is changed, but before the secondary TLB notification/invalidation - what prevents another thread from changing the mappings (e.g., using munmap/mmap) and setting a new page at that PTE? Wouldn't it end with the page being mapped without a secondary TLB flush in between? Nadav
Re: [RFC 5/5] pm: remove kernel thread freezing
On Tue, Oct 03, 2017 at 11:15:07PM +0200, Rafael J. Wysocki wrote: > On Tuesday, October 3, 2017 8:59:00 PM CEST Rafael J. Wysocki wrote: > > On Tuesday, October 3, 2017 8:53:13 PM CEST Luis R. Rodriguez wrote: > > > Now that all filesystems which used to rely on kthread > > > freezing have been converted to filesystem freeze/thawing > > > we can remove the kernel kthread freezer. > > > > > > Signed-off-by: Luis R. Rodriguez > > > > I like this one. :-) > > However, suspend_freeze/thaw_processes() require some more work. > > In particular, the freezing of workqueues is being removed here > without a replacement. Hrm, where do you see that? freeze_workqueues_busy() is still called from try_to_freeze_tasks(). Likewise thaw_processes() also calls thaw_workqueues(). I did forget to nuke pm_nosig_freezing though. Also, as I have noted a few times now, we have yet to determine if we can remove all freezer calls on kthreads without causing a regression. Granted, I'm being overly careful here, but I still would not be surprised to learn about a stupid use case where this will be hard to remove now. Only once this is done should this patch be considered. This will take time. So I'd like more general consensus on the long-term approach: 1) First address each fs to use its freezer calls on suspend/hibernate and thaw on resume. While at it, remove freezer API calls on their respective kthreads. 2) In parallel, address removing freezer API calls on non-IO kthreads; these should be trivial, but who knows what surprises lurk here. 3) Look for kthreads which seem to generate IO -- address / review whether removal of the freezer API can be done somehow with quiescing. This is currently being done, for example, on SCSI / md. 4) Only after all the above is done should we consider this patch or some form of it. Luis
Re: [f2fs-dev] [PATCH 1/6] f2fs: support issuing/waiting discard in range
Hi Jaegeuk, On 2017/10/4 4:52, Jaegeuk Kim wrote: > Hi Chao, > > Could you please rebase the following patches on top of dev-test? > > f2fs-support-issuing-waiting-discard-in-range > f2fs-wrap-discard-policy > f2fs-split-discard-policy > f2fs-reduce-cmd_lock-coverage-in-__issue_discard_cmd > f2fs-trace-f2fs_remove_discard > f2fs-give-up-CP_TRIMMED_FLAG-if-it-drops-discards > > It seems we need to rearrange some of patches for better review. No problem, let me rebase them. :) Thanks, > > Thanks, > > On 10/03, Jaegeuk Kim wrote: >> On 09/19, Chao Yu wrote: >>> From: Chao Yu >>> >>> Fstrim intends to trim invalid blocks of filesystem only with specified >>> range and granularity, but actually, it will issue all previous cached >>> discard commands which may be out-of-range and be with unmatched >>> granularity, it's unneeded. >>> >>> In order to fix above issues, this patch introduces new helps to support >>> to issue and wait discard in range and adds a new fstrim_list for tracking >>> in-flight discard from ->fstrim. >>> >>> Signed-off-by: Chao Yu >>> --- >>> fs/f2fs/f2fs.h| 1 + >>> fs/f2fs/segment.c | 125 >>> +- >>> 2 files changed, 106 insertions(+), 20 deletions(-) >>> >>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h >>> index 89b4927dcd79..cb13c7df6971 100644 >>> --- a/fs/f2fs/f2fs.h >>> +++ b/fs/f2fs/f2fs.h >>> @@ -249,6 +249,7 @@ struct discard_cmd_control { >>> struct list_head pend_list[MAX_PLIST_NUM];/* store pending entries */ >>> unsigned char pend_list_tag[MAX_PLIST_NUM];/* tag for pending entries */ >>> struct list_head wait_list; /* store on-flushing entries */ >>> + struct list_head fstrim_list; /* in-flight discard from >>> fstrim */ >>> wait_queue_head_t discard_wait_queue; /* waiting queue for wake-up */ >>> unsigned int discard_wake; /* to wake up discard thread */ >>> struct mutex cmd_lock; >>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c >>> index dedf0209d820..088936c61905 100644 >>> --- a/fs/f2fs/segment.c >>> +++ b/fs/f2fs/segment.c >>> @@ -845,9 +845,11 @@ void __check_sit_bitmap(struct f2fs_sb_info *sbi, >>> >>> /* this function is copied from blkdev_issue_discard from block/blk-lib.c >>> */ >>> static void __submit_discard_cmd(struct f2fs_sb_info *sbi, >>> - struct discard_cmd *dc) >>> + struct discard_cmd *dc, bool fstrim) >>> { >>> struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; >>> + struct list_head *wait_list = fstrim ? 
&(dcc->fstrim_list) : >>> + &(dcc->wait_list); >>> struct bio *bio = NULL; >>> >>> if (dc->state != D_PREP) >>> @@ -869,7 +871,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info >>> *sbi, >>> bio->bi_end_io = f2fs_submit_discard_endio; >>> bio->bi_opf |= REQ_SYNC; >>> submit_bio(bio); >>> - list_move_tail(&dc->list, &dcc->wait_list); >>> + list_move_tail(&dc->list, wait_list); >>> __check_sit_bitmap(sbi, dc->start, dc->start + dc->len); >>> >>> f2fs_update_iostat(sbi, FS_DISCARD, 1); >>> @@ -1054,6 +1056,68 @@ static int __queue_discard_cmd(struct f2fs_sb_info >>> *sbi, >>> return 0; >>> } >>> >>> +static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi, >>> + unsigned int start, unsigned int end, >>> + unsigned int granularity) >>> +{ >>> + struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; >>> + struct discard_cmd *prev_dc = NULL, *next_dc = NULL; >>> + struct rb_node **insert_p = NULL, *insert_parent = NULL; >>> + struct discard_cmd *dc; >>> + struct blk_plug plug; >>> + int issued; >>> + >>> +next: >>> + issued = 0; >>> + >>> + mutex_lock(&dcc->cmd_lock); >>> + f2fs_bug_on(sbi, !__check_rb_tree_consistence(sbi, &dcc->root)); >>> + >>> + dc = (struct discard_cmd *)__lookup_rb_tree_ret(&dcc->root, >>> + NULL, start, >>> + (struct rb_entry **)&prev_dc, >>> + (struct rb_entry **)&next_dc, >>> + &insert_p, &insert_parent, true); >>> + if (!dc) >>> + dc = next_dc; >>> + >>> + blk_start_plug(&plug); >>> + >>> + while (dc && dc->lstart <= end) { >>> + struct rb_node *node; >>> + >>> + if (dc->len < granularity) >>> + goto skip; >>> + >>> + if (dc->state != D_PREP) { >>> + list_move_tail(&dc->list, &dcc->fstrim_list); >>> + goto skip; >>> + } >>> + >>> + __submit_discard_cmd(sbi, dc, true); >>> + >>> + if (++issued >= DISCARD_ISSUE_RATE) { >>> + start = dc->lstart + dc-
Re: [f2fs-dev] [PATCH v4] f2fs: introduce discard_granularity sysfs entry
Hi Ju Hyung, Jaegeuk, On 2017/10/4 4:17, Jaegeuk Kim wrote: > On 10/04, Ju Hyung Park wrote: >> Hi Chao. >> >> Yep, that patch seems to have fixed it. >> Doing "while true; do fstrim -v /; done" while "rm -rf"ing a 2GB >> kbuild directory >> (with lots of small .o files and stuff) ended flawlessly. Thanks for the test. ;) >> >> I hope to see this patch merged with next 4.14 merge cycle. > > Cool! I'll merge this patch and submit for 4.14. :) Thanks for the quick merging. ;) Thanks, > > Thanks, > >> >> Thanks :) >> >> On Tue, Oct 3, 2017 at 12:59 AM, Chao Yu wrote: >>> Hi Park, >>> >>> Thanks for the report, could have a try with below patch: >>> >>> From 5fa30e8cdcb93f210e25142c48a884be383c6121 Mon Sep 17 00:00:00 2001 >>> From: Chao Yu >>> Date: Mon, 2 Oct 2017 02:50:16 +0800 >>> Subject: [PATCH] f2fs: fix potential panic during fstrim >>> >>> As Ju Hyung Park reported: >>> >>> "When 'fstrim' is called for manual trim, a BUG() can be triggered >>> randomly with this patch. >>> >>> I'm seeing this issue on both x86 Desktop and arm64 Android phone. >>> >>> On x86 Desktop, this was caused during Ubuntu boot-up. I have a >>> cronjob installed which calls 'fstrim -v /' during boot. On arm64 >>> Android, this was caused during GC looping with 1ms gc_min_sleep_time >>> & gc_max_sleep_time." >>> >>> Root cause of this issue is that f2fs_wait_discard_bios can only be >>> used by f2fs_put_super, because during put_super there must be no >>> other referrers, so it can ignore discard entry's reference count >>> when removing the entry, otherwise in other caller we will hit bug_on >>> in __remove_discard_cmd as there may be other issuer added reference >>> count in discard entry. >>> >>> Thread AThread B >>> - issue_discard_thread >>> - f2fs_ioc_fitrim >>> - f2fs_trim_fs >>> - f2fs_wait_discard_bios >>>- __issue_discard_cmd >>> - __submit_discard_cmd >>> - __wait_discard_cmd >>> - dc->ref++ >>> - __wait_one_discard_bio >>>- __wait_discard_cmd >>> - __remove_discard_cmd >>> - f2fs_bug_on(sbi, dc->ref) >>> >>> Fixes: 969d1b180d987c2be02de890d0fff0f66a0e80de >>> Reported-by: Ju Hyung Park >>> Signed-off-by: Chao Yu >>> --- >>> fs/f2fs/f2fs.h| 2 +- >>> fs/f2fs/segment.c | 6 +++--- >>> fs/f2fs/super.c | 2 +- >>> 3 files changed, 5 insertions(+), 5 deletions(-) >>> >>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h >>> index 9a7c90386947..4b4a72f392be 100644 >>> --- a/fs/f2fs/f2fs.h >>> +++ b/fs/f2fs/f2fs.h >>> @@ -2525,7 +2525,7 @@ void invalidate_blocks(struct f2fs_sb_info *sbi, >>> block_t addr); >>> bool is_checkpointed_data(struct f2fs_sb_info *sbi, block_t blkaddr); >>> void refresh_sit_entry(struct f2fs_sb_info *sbi, block_t old, block_t new); >>> void stop_discard_thread(struct f2fs_sb_info *sbi); >>> -void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi); >>> +void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi, bool umount); >>> void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control >>> *cpc); >>> void release_discard_addrs(struct f2fs_sb_info *sbi); >>> int npages_for_summary_flush(struct f2fs_sb_info *sbi, bool for_ra); >>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c >>> index dedf0209d820..e28245b7e44e 100644 >>> --- a/fs/f2fs/segment.c >>> +++ b/fs/f2fs/segment.c >>> @@ -1210,11 +1210,11 @@ void stop_discard_thread(struct f2fs_sb_info *sbi) >>> } >>> >>> /* This comes from f2fs_put_super and f2fs_trim_fs */ >>> -void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi) >>> +void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi, bool umount) >>> { >>> __issue_discard_cmd(sbi, 
false); >>> __drop_discard_cmd(sbi); >>> - __wait_discard_cmd(sbi, false); >>> + __wait_discard_cmd(sbi, !umount); >>> } >>> >>> static void mark_discard_range_all(struct f2fs_sb_info *sbi) >>> @@ -2244,7 +2244,7 @@ int f2fs_trim_fs(struct f2fs_sb_info *sbi, struct >>> fstrim_range *range) >>> } >>> /* It's time to issue all the filed discards */ >>> mark_discard_range_all(sbi); >>> - f2fs_wait_discard_bios(sbi); >>> + f2fs_wait_discard_bios(sbi, false); >>> out: >>> range->len = F2FS_BLK_TO_BYTES(cpc.trimmed); >>> return err; >>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c >>> index 89f61eb3d167..933c3d529e65 100644 >>> --- a/fs/f2fs/super.c >>> +++ b/fs/f2fs/super.c >>> @@ -801,7 +801,7 @@ static void f2fs_put_super(struct super_block *sb) >>> } >>> >>> /* be sure to wait for any on-going discard commands */ >>> - f2fs_wait_discard_bios(sbi); >>> + f2fs_wait_discard_bios(sbi, true); >>> >>> if (f2fs_discard_en(sbi) && !sbi->discard_blks) { >>> struct cp_control cpc = { >>> -- >>> 2.14.1.145.gb3622a4ee >>> >>> On 2017/10/2 3:29, Ju Hy
[PATCH net] RDS: IB: Initialize max_items based on underlying device attributes
Use max_1m_mrs/max_8k_mrs while setting max_items, as the former variables are set based on the underlying device attricutes. Signed-off-by: Avinash Repaka --- net/rds/ib_rdma.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index 9a3c54e..e678699 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -601,11 +601,11 @@ struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev, if (pool_type == RDS_IB_MR_1M_POOL) { /* +1 allows for unaligned MRs */ pool->fmr_attr.max_pages = RDS_MR_1M_MSG_SIZE + 1; - pool->max_items = RDS_MR_1M_POOL_SIZE; + pool->max_items = rds_ibdev->max_1m_mrs; } else { /* pool_type == RDS_IB_MR_8K_POOL */ pool->fmr_attr.max_pages = RDS_MR_8K_MSG_SIZE + 1; - pool->max_items = RDS_MR_8K_POOL_SIZE; + pool->max_items = rds_ibdev->max_8k_mrs; } pool->max_free_pinned = pool->max_items * pool->fmr_attr.max_pages / 4; -- 2.4.11
Re: [RFC v3 7/7] platform/x86: intel_scu_ipc: Use generic Intel IPC device calls
Hi Guenter, On 09/28/2017 06:35 AM, Guenter Roeck wrote: On 09/28/2017 05:55 AM, Alexandre Belloni wrote: On 04/09/2017 at 22:37:27 -0700, sathyanarayanan.kuppusw...@linux.intel.com wrote: From: Kuppuswamy Sathyanarayanan Removed redundant IPC helper functions and refactored the driver to use generic IPC device driver APIs. This patch also cleans up SCU IPC user drivers to use the APIs provided by the generic IPC driver. Signed-off-by: Kuppuswamy Sathyanarayanan For the RTC part: Acked-by: Alexandre Belloni --- arch/x86/include/asm/intel_scu_ipc.h | 23 +- arch/x86/platform/intel-mid/intel-mid.c | 12 +- drivers/platform/x86/Kconfig | 1 + drivers/platform/x86/intel_scu_ipc.c | 483 +--- drivers/rtc/rtc-mrst.c | 15 +- drivers/watchdog/intel-mid_wdt.c | 12 +- drivers/watchdog/intel_scu_watchdog.c | 17 +- 7 files changed, 251 insertions(+), 312 deletions(-) I think it would help in general to mention at least in the description which drivers are touched. I will add this info in my next patch version. I see calls to intel_ipc_dev_get() but no associated put calls. Missing the context, it may well be that those are not necessary, but then how does the ipc code know that the device reference is no longer needed? I also noticed this issue. Initially I thought we didn't need to deal with the ref count because in most cases IPC client drivers are compiled as built-in drivers (because these APIs are required during the early boot process to send IPC commands to the ARC processor) and hence will have a lifetime until the device is powered off. But after looking into Kconfig params, I noticed that at least the intel_pmc_ipc and intel_punit_ipc drivers can be compiled as modules. And hence we will have a dangling pointer reference issue if we don't maintain a proper refcount for device references. So, I have already fixed this problem by adding intel_ipc_dev_put() and devm_intel_ipc_dev_get() APIs. This fix will be available in the next version.
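For illustration, a minimal sketch of how such a devm-managed getter could be layered on top of the proposed intel_ipc_dev_get()/intel_ipc_dev_put() pair, using the common devm_add_action_or_reset() pattern; the names and return-value conventions here are assumptions about the not-yet-posted v4 API, not the final code:

/* Hypothetical sketch: drop the IPC device reference automatically when the
 * consumer device is unbound. Assumes intel_ipc_dev_get() returns a
 * reference-counted pointer (or ERR_PTR/NULL on failure) and
 * intel_ipc_dev_put() releases that reference. */
static void devm_intel_ipc_dev_release(void *ipc_dev)
{
        intel_ipc_dev_put(ipc_dev);
}

struct intel_ipc_dev *devm_intel_ipc_dev_get(struct device *dev,
                                             const char *devname)
{
        struct intel_ipc_dev *ipc_dev = intel_ipc_dev_get(devname);
        int ret;

        if (IS_ERR_OR_NULL(ipc_dev))
                return ipc_dev;

        ret = devm_add_action_or_reset(dev, devm_intel_ipc_dev_release,
                                       ipc_dev);
        if (ret)
                return ERR_PTR(ret);

        return ipc_dev;
}

With such a wrapper, a consumer that takes a reference in probe() needs no explicit put in its error paths or remove path; the reference is dropped automatically when the consumer device is unbound.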
Guenter diff --git a/arch/x86/include/asm/intel_scu_ipc.h b/arch/x86/include/asm/intel_scu_ipc.h index 81d3d87..5842534 100644 --- a/arch/x86/include/asm/intel_scu_ipc.h +++ b/arch/x86/include/asm/intel_scu_ipc.h @@ -14,9 +14,19 @@ #define IPCMSG_COLD_BOOT 0xF3 #define IPCMSG_VRTC 0xFA /* Set vRTC device */ - /* Command id associated with message IPCMSG_VRTC */ - #define IPC_CMD_VRTC_SETTIME 1 /* Set time */ - #define IPC_CMD_VRTC_SETALARM 2 /* Set alarm */ + +/* Command id associated with message IPCMSG_VRTC */ +#define IPC_CMD_VRTC_SETTIME 1 /* Set time */ +#define IPC_CMD_VRTC_SETALARM 2 /* Set alarm */ + +#define INTEL_SCU_IPC_DEV "intel_scu_ipc" +#define SCU_PARAM_LEN 2 + +static inline void scu_cmd_init(u32 *cmd, u32 param1, u32 param2) +{ + cmd[0] = param1; + cmd[1] = param2; +} /* Read single register */ int intel_scu_ipc_ioread8(u16 addr, u8 *data); @@ -45,13 +55,6 @@ int intel_scu_ipc_writev(u16 *addr, u8 *data, int len); /* Update single register based on the mask */ int intel_scu_ipc_update_register(u16 addr, u8 data, u8 mask); -/* Issue commands to the SCU with or without data */ -int intel_scu_ipc_simple_command(int cmd, int sub); -int intel_scu_ipc_command(int cmd, int sub, u32 *in, int inlen, - u32 *out, int outlen); -int intel_scu_ipc_raw_command(int cmd, int sub, u8 *in, int inlen, - u32 *out, int outlen, u32 dptr, u32 sptr); - /* I2C control api */ int intel_scu_ipc_i2c_cntrl(u32 addr, u32 *data); diff --git a/arch/x86/platform/intel-mid/intel-mid.c b/arch/x86/platform/intel-mid/intel-mid.c index 12a2725..27541e9 100644 --- a/arch/x86/platform/intel-mid/intel-mid.c +++ b/arch/x86/platform/intel-mid/intel-mid.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -68,18 +69,23 @@ static void *(*get_intel_mid_ops[])(void) = INTEL_MID_OPS_INIT; enum intel_mid_cpu_type __intel_mid_cpu_chip; EXPORT_SYMBOL_GPL(__intel_mid_cpu_chip); +static struct intel_ipc_dev *ipc_dev; + static void intel_mid_power_off(void) { + u32 cmds[SCU_PARAM_LEN] = {IPCMSG_COLD_OFF, 1}; /* Shut down South Complex via PWRMU */ intel_mid_pwr_power_off(); /* Only for Tangier, the rest will ignore this command */ - intel_scu_ipc_simple_command(IPCMSG_COLD_OFF, 1); + ipc_dev_simple_cmd(ipc_dev, cmds, SCU_PARAM_LEN); }; static void intel_mid_reboot(void) { - intel_scu_ipc_simple_command(IPCMSG_COLD_BOOT, 0); + u32 cmds[SCU_PARAM_LEN] = {IPCMSG_COLD_BOOT, 0}; + + ipc_dev_simple_cmd(ipc_dev, cmds, SCU_PARAM_LEN); } static unsigned long __init intel_mid_calibrate_tsc(void) @@ -206,6 +212,8 @@ void __init x86_intel_mid_early_setup(void) x86_init.mpparse.find_smp_config = x86_init_noop; x86_init.mpparse.get_smp_config = x86_init_uint_noop; set_bit(MP_BUS_ISA,
Re: [RFC v3 1/7] platform/x86: intel_pmc_ipc: Use devm_* calls in driver probe function
Hi, On 10/01/2017 07:34 AM, Andy Shevchenko wrote: On Tue, Sep 5, 2017 at 8:37 AM, wrote: From: Kuppuswamy Sathyanarayanan This patch cleans up unnecessary free/alloc calls in ipc_plat_probe(), ipc_pci_probe() and ipc_plat_get_res() functions by using devm_* calls. This patch also adds proper error handling for failure cases in ipc_pci_probe() function. Pushed (this patch only) to my review queue with slight style changes, thanks. Thanks Andy. Will rebase my rest of the patches on top of this patch. Signed-off-by: Kuppuswamy Sathyanarayanan --- drivers/platform/x86/intel_pmc_ipc.c | 104 --- 1 file changed, 34 insertions(+), 70 deletions(-) Changes since v2: * Used pcim_* device managed functions. Changes since v1: * Merged devm_* related changes into a single function. * Instead of removing free_irq, use devm_free_irq function. diff --git a/drivers/platform/x86/intel_pmc_ipc.c b/drivers/platform/x86/intel_pmc_ipc.c index bb792a5..5eef649 100644 --- a/drivers/platform/x86/intel_pmc_ipc.c +++ b/drivers/platform/x86/intel_pmc_ipc.c @@ -480,52 +480,39 @@ static irqreturn_t ioc(int irq, void *dev_id) static int ipc_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) { - resource_size_t pci_resource; int ret; - int len; + struct intel_pmc_ipc_dev *pmc = &ipcdev; - ipcdev.dev = &pci_dev_get(pdev)->dev; - ipcdev.irq_mode = IPC_TRIGGER_MODE_IRQ; + /* Only one PMC is supported */ + if (pmc->dev) + return -EBUSY; - ret = pci_enable_device(pdev); + pmc->irq_mode = IPC_TRIGGER_MODE_IRQ; + + ret = pcim_enable_device(pdev); if (ret) return ret; - ret = pci_request_regions(pdev, "intel_pmc_ipc"); + ret = pcim_iomap_regions(pdev, 1 << 0, pci_name(pdev)); if (ret) return ret; - pci_resource = pci_resource_start(pdev, 0); - len = pci_resource_len(pdev, 0); - if (!pci_resource || !len) { - dev_err(&pdev->dev, "Failed to get resource\n"); - return -ENOMEM; - } + init_completion(&pmc->cmd_complete); - init_completion(&ipcdev.cmd_complete); + pmc->ipc_base = pcim_iomap_table(pdev)[0]; - if (request_irq(pdev->irq, ioc, 0, "intel_pmc_ipc", &ipcdev)) { + ret = devm_request_irq(&pdev->dev, pdev->irq, ioc, 0, "intel_pmc_ipc", + pmc); + if (ret) { dev_err(&pdev->dev, "Failed to request irq\n"); - return -EBUSY; + return ret; } - ipcdev.ipc_base = ioremap_nocache(pci_resource, len); - if (!ipcdev.ipc_base) { - dev_err(&pdev->dev, "Failed to ioremap ipc base\n"); - free_irq(pdev->irq, &ipcdev); - ret = -ENOMEM; - } + pmc->dev = &pdev->dev; - return ret; -} + pci_set_drvdata(pdev, pmc); -static void ipc_pci_remove(struct pci_dev *pdev) -{ - free_irq(pdev->irq, &ipcdev); - pci_release_regions(pdev); - pci_dev_put(pdev); - iounmap(ipcdev.ipc_base); - ipcdev.dev = NULL; + return 0; } static const struct pci_device_id ipc_pci_ids[] = { @@ -540,7 +527,6 @@ static struct pci_driver ipc_pci_driver = { .name = "intel_pmc_ipc", .id_table = ipc_pci_ids, .probe = ipc_pci_probe, - .remove = ipc_pci_remove, }; static ssize_t intel_pmc_ipc_simple_cmd_store(struct device *dev, @@ -849,18 +835,16 @@ static int ipc_plat_get_res(struct platform_device *pdev) dev_err(&pdev->dev, "Failed to get ipc resource\n"); return -ENXIO; } - size = PLAT_RESOURCE_IPC_SIZE + PLAT_RESOURCE_GCR_SIZE; + res->end = (res->start + PLAT_RESOURCE_IPC_SIZE + + PLAT_RESOURCE_GCR_SIZE - 1); - if (!request_mem_region(res->start, size, pdev->name)) { - dev_err(&pdev->dev, "Failed to request ipc resource\n"); - return -EBUSY; - } - addr = ioremap_nocache(res->start, size); - if (!addr) { - dev_err(&pdev->dev, "I/O memory remapping failed\n"); - 
release_mem_region(res->start, size); - return -ENOMEM; + addr = devm_ioremap_resource(&pdev->dev, res); + if (IS_ERR(addr)) { + dev_err(&pdev->dev, + "PMC I/O memory remapping failed\n"); + return PTR_ERR(addr); } + ipcdev.ipc_base = addr; ipcdev.gcr_mem_base = addr + PLAT_RESOURCE_GCR_OFFSET; @@ -917,7 +901,6 @@ MODULE_DEVICE_TABLE(acpi, ipc_acpi_ids); static int ipc_plat_probe(struct platform_device *pdev) { - struct resource *res; int ret; ipcdev.dev = &pdev->dev; @@ -939,61 +922,42 @@ static int ipc_plat_probe(struct platform_device *pdev) ret = ipc_create_pmc_devices(); if (ret) {
Re: [RFC v3 2/7] platform/x86: intel_pmc_ipc: Use MFD framework to create dependent devices
Hi Andy, On 10/01/2017 07:44 AM, Andy Shevchenko wrote: On Tue, Sep 5, 2017 at 8:37 AM, wrote: From: Kuppuswamy Sathyanarayanan Currently, we have a lot of repetitive code in the dependent device resource allocation and device creation handling code. This logic can be improved if we use the MFD framework for dependent device creation. This patch adds this support. + punit_cell.id = -1; I will remove this line in the next version. + return devm_mfd_add_devices(&pdev->dev, PLATFORM_DEVID_AUTO, + &punit_cell, 1, NULL, 0, NULL); IIRC you don't need to fill the cell ID in case of DEVID_AUTO. I am planning to use DEVID_NONE here to match the current behavior. Unless you have some concerns? -- Sathyanarayanan Kuppuswamy Linux kernel developer
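For illustration, a minimal sketch of the direction discussed above: register the PUNIT cell with PLATFORM_DEVID_NONE and no explicit cell id, matching the current single-instance behavior. The cell and function names are illustrative, not the final patch:

#include <linux/mfd/core.h>
#include <linux/platform_device.h>

/* Hypothetical sketch: one statically described PUNIT cell, registered with
 * PLATFORM_DEVID_NONE so no device id suffix is generated. */
static const struct mfd_cell punit_cell = {
        .name = "intel_punit_ipc",
};

static int ipc_create_punit_device(struct platform_device *pdev)
{
        return devm_mfd_add_devices(&pdev->dev, PLATFORM_DEVID_NONE,
                                    &punit_cell, 1, NULL, 0, NULL);
}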
Re: [RFC 5/5] pm: remove kernel thread freezing
On Wed, 2017-10-04 at 02:47 +0200, Luis R. Rodriguez wrote: > 3) Look for kthreads which seem to generate IO -- address / review whether > removal of the freezer API can be done somehow with quiescing. This > is currently being done, for example, on SCSI / md. > 4) Only after all the above is done should we consider this patch or some > form of it. After having given this more thought, I think we should omit these last two steps. Modifying the md driver such that it does not submit I/O requests while processes are frozen requires either using the freezer API or open-coding it. I think there is general agreement in the kernel community that open-coding a single mechanism in multiple drivers is wrong. Does this mean that instead of removing the freezer API we should keep it and review all its users? Bart.
Re: d82fed7529 ("locking/lockdep/selftests: Fix mixed read-write .."): BUG: -1 unexpected failures (out of 262) - debugging disabled! |
On Wed, Aug 30, 2017 at 09:41:59AM +0200, Peter Zijlstra wrote: On Wed, Aug 30, 2017 at 08:29:47AM +0200, Peter Zijlstra wrote: On Wed, Aug 30, 2017 at 11:37:21AM +0800, kernel test robot wrote: > [0.004000] - > [0.004000] BUG: -1 unexpected failures (out of 262) - debugging disabled! | > [0.004000] - lol.. however did that happen.. /me goes look. Ah, I think this should cure.. It works! Tested-by: Fengguang Wu --- diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c index cd0b5c964bd0..2b827b8a1d8c 100644 --- a/lib/locking-selftest.c +++ b/lib/locking-selftest.c @@ -2031,11 +2031,13 @@ void locking_selftest(void) print_testname("mixed read-lock/lock-write ABBA"); pr_cont(" |"); dotest(rlock_ABBA1, FAILURE, LOCKTYPE_RWLOCK); +#ifdef CONFIG_PROVE_LOCKING /* * Lockdep does indeed fail here, but there's nothing we can do about * that now. Don't kill lockdep for it. */ unexpected_testcase_failures--; +#endif pr_cont(" |"); dotest(rwsem_ABBA1, FAILURE, LOCKTYPE_RWSEM);
Re: [PATCH 02/18] tracing/filter: use ARRAY_SIZE
On Sun, 1 Oct 2017 15:30:40 -0400 Jérémy Lefaure wrote: > It is useless to re-invent the ARRAY_SIZE macro so let's use it instead > of DATA_CNT. > > Found with Coccinelle with the following semantic patch: > @r depends on (org || report)@ > type T; > T[] E; > position p; > @@ > ( > (sizeof(E)@p /sizeof(*E)) > | > (sizeof(E)@p /sizeof(E[...])) > | > (sizeof(E)@p /sizeof(T)) > ) > > Signed-off-by: Jérémy Lefaure > --- > kernel/trace/trace_events_filter.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/kernel/trace/trace_events_filter.c > b/kernel/trace/trace_events_filter.c > index 61e7f0678d33..02d0f378dc5c 100644 > --- a/kernel/trace/trace_events_filter.c > +++ b/kernel/trace/trace_events_filter.c > @@ -23,6 +23,7 @@ > #include > #include > #include > +#include > > #include "trace.h" > #include "trace_output.h" > @@ -2376,8 +2377,6 @@ static struct test_filter_data_t { > #undef YES > #undef NO > > -#define DATA_CNT (sizeof(test_filter_data)/sizeof(struct test_filter_data_t)) Just change DATA_CNT to be: #define DATA_CNT ARRAY_SIZE(test_filter_data) and don't change the rest. -- Steve > - > static int test_pred_visited; > > static int test_pred_visited_fn(struct filter_pred *pred, void *event) > @@ -2417,7 +2416,7 @@ static __init int ftrace_test_event_filter(void) > > printk(KERN_INFO "Testing ftrace filter: "); > > - for (i = 0; i < DATA_CNT; i++) { > + for (i = 0; i < ARRAY_SIZE(test_filter_data); i++) { > struct event_filter *filter = NULL; > struct test_filter_data_t *d = &test_filter_data[i]; > int err; > @@ -2463,7 +2462,7 @@ static __init int ftrace_test_event_filter(void) > } > } > > - if (i == DATA_CNT) > + if (i == ARRAY_SIZE(test_filter_data)) > printk(KERN_CONT "OK\n"); > > return 0;
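In other words, the minimal change suggested above boils down to a one-line diff (a sketch of the direction, not a tested patch), with the rest of the function left untouched:

-#define DATA_CNT (sizeof(test_filter_data)/sizeof(struct test_filter_data_t))
+#define DATA_CNT ARRAY_SIZE(test_filter_data)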
Re: [RFC v3 0/7] PMC/PUNIT IPC driver cleanup
Hi Andy, Thanks for the review. I really appreciate you taking the time to review such a big chunk of code changes. On 10/01/2017 07:46 AM, Andy Shevchenko wrote: On Tue, Sep 5, 2017 at 8:37 AM, wrote: From: Kuppuswamy Sathyanarayanan Hi All, Currently the intel_pmc_ipc.c, intel_punit_ipc.c and intel_scu_ipc.c drivers implement the same IPC features. This code duplication could be avoided if we implement the IPC driver as a generic library and let custom device drivers use the APIs provided by the generic driver. This patchset mainly addresses this issue. Along with the above code duplication issue, this patchset also addresses the following issues in the intel_pmc_ipc and intel_punit_ipc drivers. 1. The intel_pmc_ipc.c driver does not use any resource managed (devm_*) calls. 2. In the intel_pmc_ipc.c driver, dependent devices like PUNIT, Telemetry and iTCO are created manually, using a lot of redundant buffer code. 3. A global variable is used to store the IPC device structure, and it is used across all functions in intel_pmc_ipc.c and intel_punit_ipc.c. More info on the Intel IPC device library: - A generic Intel IPC class driver has been implemented and all common IPC helper functions have been moved to this driver. It exposes APIs to create an IPC device channel and to send raw and simple IPC commands. It also creates a device attribute to send IPC commands from user space. The API for creating a new IPC channel device is: struct intel_ipc_dev *devm_intel_ipc_dev_create(struct device *dev, const char *devname, struct intel_ipc_dev_cfg *cfg, struct intel_ipc_dev_ops *ops) The IPC channel drivers (PUNIT/PMC/SCU), when creating a new device, can configure their device params like register mapping, irq, irq-mode, channel type, etc. using the intel_ipc_dev_cfg and intel_ipc_dev_ops arguments. After a new IPC channel device is created, IPC users can use the generic APIs to make IPC calls. For example, with this new model, an IPC call to the PMC device will look like: pmc_ipc_dev = intel_ipc_dev_get(INTEL_PMC_IPC_DEV); ipc_dev_raw_cmd(pmc_ipc_dev, cmd, PMC_PARAM_LEN, (u32 *)ipc_in, 1, NULL, 0, 0, 0); I am still testing the driver on different products, but posted it to get some early comments. I also welcome any PMC/PUNIT driver users to check these patches in their products. I have applied the first patch from the series with some small changes. Please rebase the rest on top of my review branch (review-andy). Will do. I will soon send the v4 version. Before sending the new version, address the comments that others and I made. Yes. -- Sathyanarayanan Kuppuswamy Linux kernel developer
Re: [RFC v3 4/7] platform: x86: Add generic Intel IPC driver
Hi, On 10/01/2017 07:59 AM, Andy Shevchenko wrote: On Tue, Sep 5, 2017 at 8:37 AM, wrote: From: Kuppuswamy Sathyanarayanan Currently intel_scu_ipc.c, intel_pmc_ipc.c and intel_punit_ipc.c redundantly implement the same IPC features and have a lot of code duplication between them. This driver addresses this issue by grouping the common IPC functionalities under the same driver. +static const char *ipc_dev_err_string(struct intel_ipc_dev *ipc_dev, + int error) +{ + switch (error) { + case IPC_DEV_ERR_NONE: + return "No error"; + case IPC_DEV_ERR_CMD_NOT_SUPPORTED: + return "Command not-supported/Invalid"; + case IPC_DEV_ERR_CMD_NOT_SERVICED: + return "Command not-serviced/Invalid param"; + case IPC_DEV_ERR_UNABLE_TO_SERVICE: + return "Unable-to-service/Cmd-timeout"; + case IPC_DEV_ERR_CMD_INVALID: + return "Command-invalid/Cmd-locked"; + case IPC_DEV_ERR_CMD_FAILED: + return "Command-failed/Invalid-VR-id"; + case IPC_DEV_ERR_EMSECURITY: + return "Invalid Battery/VR-Error"; + case IPC_DEV_ERR_UNSIGNEDKERNEL: + return "Unsigned kernel"; + default: + return "Unknown Command"; + }; +} Since it's a continuous list you can define an array of messages like const char * const errors[] = { [..._ERR_...] = "", ... }; Also you can use an enum in the header and define _ERR_MAX there. Thus, the code would be transformed to if (error < _ERR_MAX) return errors[error]; return "Unknown Command"; Thanks, will fix it in the next version. -- Sathyanarayanan Kuppuswamy Linux kernel developer
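As a rough sketch of that suggestion (assuming IPC_DEV_ERR_NONE is 0 and the IPC_DEV_ERR_* values are contiguous; names follow the posted patch, the final code may differ):

/* Table-lookup variant of ipc_dev_err_string(): designated initializers keep
 * the strings next to their error codes, and anything out of range falls
 * back to "Unknown Command". */
static const char * const ipc_dev_errors[] = {
        [IPC_DEV_ERR_NONE]              = "No error",
        [IPC_DEV_ERR_CMD_NOT_SUPPORTED] = "Command not-supported/Invalid",
        [IPC_DEV_ERR_CMD_NOT_SERVICED]  = "Command not-serviced/Invalid param",
        [IPC_DEV_ERR_UNABLE_TO_SERVICE] = "Unable-to-service/Cmd-timeout",
        [IPC_DEV_ERR_CMD_INVALID]       = "Command-invalid/Cmd-locked",
        [IPC_DEV_ERR_CMD_FAILED]        = "Command-failed/Invalid-VR-id",
        [IPC_DEV_ERR_EMSECURITY]        = "Invalid Battery/VR-Error",
        [IPC_DEV_ERR_UNSIGNEDKERNEL]    = "Unsigned kernel",
};

static const char *ipc_dev_err_string(struct intel_ipc_dev *ipc_dev, int error)
{
        if (error >= 0 && error < ARRAY_SIZE(ipc_dev_errors))
                return ipc_dev_errors[error];

        return "Unknown Command";
}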
Re: [PATCH net] RDS: IB: Initialize max_items based on underlying device attributes
Hi Avinash, On 10/3/2017 5:50 PM, Avinash Repaka wrote: Use max_1m_mrs/max_8k_mrs while setting max_items, as the former variables are set based on the underlying device attricutes. s/attricutes/attributes Signed-off-by: Avinash Repaka --- net/rds/ib_rdma.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) This patch is fine, but it would be nice if you could collect these changes in a series as you are trying to update the FRWR support, like this patch and the other cleanup patch you posted yesterday. With that log fixed, Acked-by: Santosh Shilimkar
[PATCH v2 5/6] f2fs: trace f2fs_remove_discard
From: Chao Yu This patch adds tracepoint to trace f2fs_remove_discard. Signed-off-by: Chao Yu --- v2: - rebase on last codes of Jaegeuk's dev-test branch. fs/f2fs/segment.c | 2 ++ include/trace/events/f2fs.h | 7 +++ 2 files changed, 9 insertions(+) diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index ba433828c170..4a108321233d 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -822,6 +822,8 @@ static void __remove_discard_cmd(struct f2fs_sb_info *sbi, { struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; + trace_f2fs_remove_discard(dc->bdev, dc->start, dc->len); + f2fs_bug_on(sbi, dc->ref); if (dc->error == -EOPNOTSUPP) diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h index 5d216f7fb05a..0e7a31694ff5 100644 --- a/include/trace/events/f2fs.h +++ b/include/trace/events/f2fs.h @@ -1286,6 +1286,13 @@ DEFINE_EVENT(f2fs_discard, f2fs_issue_discard, TP_ARGS(dev, blkstart, blklen) ); +DEFINE_EVENT(f2fs_discard, f2fs_remove_discard, + + TP_PROTO(struct block_device *dev, block_t blkstart, block_t blklen), + + TP_ARGS(dev, blkstart, blklen) +); + TRACE_EVENT(f2fs_issue_reset_zone, TP_PROTO(struct block_device *dev, block_t blkstart), -- 2.14.1.145.gb3622a4ee
[PATCH v2 4/6] f2fs: reduce cmd_lock coverage in __issue_discard_cmd
From: Chao Yu __submit_discard_cmd may lead long latency due to exhaustion of I/O request resource in block layer, so issuing all discard under cmd_lock may lead to hangtask, in order to avoid that, let's reduce it's coverage. Signed-off-by: Chao Yu --- v2: - rebase on last codes of Jaegeuk's dev-test branch. fs/f2fs/segment.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index e8fcc1eb4e0a..ba433828c170 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -1158,14 +1158,14 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi, int i, iter = 0, issued = 0; bool io_interrupted = false; - mutex_lock(&dcc->cmd_lock); - f2fs_bug_on(sbi, - !__check_rb_tree_consistence(sbi, &dcc->root)); - blk_start_plug(&plug); for (i = MAX_PLIST_NUM - 1; i >= 0; i--) { if (i + 1 < dpolicy->granularity) break; pend_list = &dcc->pend_list[i]; + + mutex_lock(&dcc->cmd_lock); + f2fs_bug_on(sbi, !__check_rb_tree_consistence(sbi, &dcc->root)); + blk_start_plug(&plug); list_for_each_entry_safe(dc, tmp, pend_list, list) { f2fs_bug_on(sbi, dc->state != D_PREP); @@ -1179,12 +1179,14 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi, issued++; skip: if (++iter >= dpolicy->max_requests) - goto out; + break; } + blk_finish_plug(&plug); + mutex_unlock(&dcc->cmd_lock); + + if (iter >= dpolicy->max_requests) + break; } -out: - blk_finish_plug(&plug); - mutex_unlock(&dcc->cmd_lock); if (!issued && io_interrupted) issued = -1; -- 2.14.1.145.gb3622a4ee
[PATCH v2 6/6] f2fs: give up CP_TRIMMED_FLAG if it drops discards
From: Chao Yu In ->umount, once we drop remained discard entries, we should not set CP_TRIMMED_FLAG with another checkpoint. Signed-off-by: Chao Yu --- v2: - rebase on last codes of Jaegeuk's dev-test branch. fs/f2fs/f2fs.h| 2 +- fs/f2fs/segment.c | 15 +++ fs/f2fs/super.c | 5 +++-- 3 files changed, 15 insertions(+), 7 deletions(-) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index f274805e231d..c85f49c41003 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -2565,7 +2565,7 @@ void init_discard_policy(struct discard_policy *dpolicy, int discard_type, unsigned int granularity); void refresh_sit_entry(struct f2fs_sb_info *sbi, block_t old, block_t new); void stop_discard_thread(struct f2fs_sb_info *sbi); -void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi); +bool f2fs_wait_discard_bios(struct f2fs_sb_info *sbi); void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc); void release_discard_addrs(struct f2fs_sb_info *sbi); int npages_for_summary_flush(struct f2fs_sb_info *sbi, bool for_ra); diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index 4a108321233d..bfbcff8339c5 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -1196,12 +1196,13 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi, return issued; } -static void __drop_discard_cmd(struct f2fs_sb_info *sbi) +static bool __drop_discard_cmd(struct f2fs_sb_info *sbi) { struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; struct list_head *pend_list; struct discard_cmd *dc, *tmp; int i; + bool dropped = false; mutex_lock(&dcc->cmd_lock); for (i = MAX_PLIST_NUM - 1; i >= 0; i--) { @@ -1209,9 +1210,12 @@ static void __drop_discard_cmd(struct f2fs_sb_info *sbi) list_for_each_entry_safe(dc, tmp, pend_list, list) { f2fs_bug_on(sbi, dc->state != D_PREP); __remove_discard_cmd(sbi, dc); + dropped = true; } } mutex_unlock(&dcc->cmd_lock); + + return dropped; } static void __wait_one_discard_bio(struct f2fs_sb_info *sbi, @@ -1306,15 +1310,18 @@ void stop_discard_thread(struct f2fs_sb_info *sbi) } /* This comes from f2fs_put_super */ -void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi) +bool f2fs_wait_discard_bios(struct f2fs_sb_info *sbi) { struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; struct discard_policy dpolicy; + bool dropped; init_discard_policy(&dpolicy, DPOLICY_UMOUNT, dcc->discard_granularity); __issue_discard_cmd(sbi, &dpolicy); - __drop_discard_cmd(sbi); + dropped = __drop_discard_cmd(sbi); __wait_all_discard_cmd(sbi, &dpolicy); + + return dropped; } static int issue_discard_thread(void *data) @@ -1659,7 +1666,7 @@ void init_discard_policy(struct discard_policy *dpolicy, dpolicy->max_interval = DEF_MAX_DISCARD_ISSUE_TIME; dpolicy->max_requests = DEF_MAX_DISCARD_REQUEST; dpolicy->io_aware_gran = MAX_PLIST_NUM; - dpolicy->io_aware = false; + dpolicy->io_aware = true; } else if (discard_type == DPOLICY_FSTRIM) { dpolicy->max_requests = DEF_MAX_DISCARD_REQUEST; dpolicy->io_aware_gran = MAX_PLIST_NUM; diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c index a13269d1a1f0..1d68c18a487b 100644 --- a/fs/f2fs/super.c +++ b/fs/f2fs/super.c @@ -807,6 +807,7 @@ static void f2fs_put_super(struct super_block *sb) { struct f2fs_sb_info *sbi = F2FS_SB(sb); int i; + bool dropped; f2fs_quota_off_umount(sb); @@ -827,9 +828,9 @@ static void f2fs_put_super(struct super_block *sb) } /* be sure to wait for any on-going discard commands */ - f2fs_wait_discard_bios(sbi, true); + dropped = f2fs_wait_discard_bios(sbi); - if (f2fs_discard_en(sbi) && !sbi->discard_blks) { + if (f2fs_discard_en(sbi) && !sbi->discard_blks 
&& !dropped) { struct cp_control cpc = { .reason = CP_UMOUNT | CP_TRIMMED, }; -- 2.14.1.145.gb3622a4ee
[PATCH v2 2/6] f2fs: wrap discard policy
From: Chao Yu This patch wraps scattered optional parameters into discard policy as below, later, with it we expect that we can adjust these parameters with proper strategy in different scenario. struct discard_policy { unsigned int min_interval; /* used for candidates exist */ unsigned int max_interval; /* used for candidates not exist */ unsigned int max_requests; /* # of discards issued per round */ unsigned int io_aware_gran; /* minimum granularity discard not be aware of I/O */ bool io_aware; /* issue discard in idle time */ bool sync; /* submit discard with REQ_SYNC flag */ }; This patch doesn't change any logic of codes. Signed-off-by: Chao Yu --- v2: - rebase on last codes of Jaegeuk's dev-test branch. fs/f2fs/f2fs.h| 12 +++- fs/f2fs/segment.c | 38 +- 2 files changed, 40 insertions(+), 10 deletions(-) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 6d51e8342513..67dd6a317384 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -149,7 +149,7 @@ enum { #define BATCHED_TRIM_BLOCKS(sbi) \ (BATCHED_TRIM_SEGMENTS(sbi) << (sbi)->log_blocks_per_seg) #define MAX_DISCARD_BLOCKS(sbi)BLKS_PER_SEC(sbi) -#define DISCARD_ISSUE_RATE 8 +#define DEF_MAX_DISCARD_REQUEST8 /* issue 8 discards per round */ #define DEF_MIN_DISCARD_ISSUE_TIME 50 /* 50 ms, if exists */ #define DEF_MAX_DISCARD_ISSUE_TIME 6 /* 60 s, if no candidates */ #define DEF_CP_INTERVAL60 /* 60 secs */ @@ -245,6 +245,15 @@ struct discard_cmd { int error; /* bio error */ }; +struct discard_policy { + unsigned int min_interval; /* used for candidates exist */ + unsigned int max_interval; /* used for candidates not exist */ + unsigned int max_requests; /* # of discards issued per round */ + unsigned int io_aware_gran; /* minimum granularity discard not be aware of I/O */ + bool io_aware; /* issue discard in idle time */ + bool sync; /* submit discard with REQ_SYNC flag */ +}; + struct discard_cmd_control { struct task_struct *f2fs_issue_discard; /* discard thread */ struct list_head entry_list;/* 4KB discard entry list */ @@ -263,6 +272,7 @@ struct discard_cmd_control { atomic_t issing_discard;/* # of issing discard */ atomic_t discard_cmd_cnt; /* # of cached cmd count */ struct rb_root root;/* root of discard rb-tree */ + struct discard_policy dpolicy; /* current discard policy */ }; /* for the list of fsync inodes, used only during recovery */ diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index a0a0a887fc31..5853187230e7 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -879,6 +879,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info *sbi, struct list_head *wait_list = fstrim ? &(dcc->fstrim_list) : &(dcc->wait_list); struct bio *bio = NULL; + int flag = dcc->dpolicy.sync ? 
REQ_SYNC : 0; if (dc->state != D_PREP) return; @@ -897,7 +898,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info *sbi, if (bio) { bio->bi_private = dc; bio->bi_end_io = f2fs_submit_discard_endio; - bio->bi_opf |= REQ_SYNC; + bio->bi_opf |= flag; submit_bio(bio); list_move_tail(&dc->list, wait_list); __check_sit_bitmap(sbi, dc->start, dc->start + dc->len); @@ -1092,6 +1093,7 @@ static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi, struct discard_cmd *prev_dc = NULL, *next_dc = NULL; struct rb_node **insert_p = NULL, *insert_parent = NULL; struct discard_cmd *dc; + struct discard_policy *dpolicy = &dcc->dpolicy; struct blk_plug plug; int issued; @@ -1124,7 +1126,7 @@ static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi, __submit_discard_cmd(sbi, dc, true); - if (++issued >= DISCARD_ISSUE_RATE) { + if (++issued >= dpolicy->max_requests) { start = dc->lstart + dc->len; blk_finish_plug(&plug); @@ -1152,6 +1154,7 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi, bool issue_cond) struct list_head *pend_list; struct discard_cmd *dc, *tmp; struct blk_plug plug; + struct discard_policy *dpolicy = &dcc->dpolicy; int iter = 0, issued = 0; int i; bool io_interrupted = false; @@ -1179,14 +1182,16 @@ static int __issue_discard_cmd(struct
[PATCH v2 3/6] f2fs: split discard policy
From: Chao Yu There are many different scenarios such as fstrim, umount, urgent or background where we will issue discards, actually, they need use different policy in aspect of io aware, discard granularity, delay interval and so on. But now they just share one common discard policy, so there will be race when changing policy in between these scenarios, the interference of changing discard policy will be very serious. This patch changes to split discard policy for different scenarios. Signed-off-by: Chao Yu --- v2: - rebase on last codes of Jaegeuk's dev-test branch. fs/f2fs/f2fs.h| 17 +-- fs/f2fs/segment.c | 149 ++ fs/f2fs/segment.h | 5 +- fs/f2fs/sysfs.c | 13 - 4 files changed, 88 insertions(+), 96 deletions(-) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 67dd6a317384..f274805e231d 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -210,10 +210,6 @@ struct discard_entry { #define plist_idx(blk_num) ((blk_num) >= MAX_PLIST_NUM ? \ (MAX_PLIST_NUM - 1) : (blk_num - 1)) -#define P_ACTIVE 0x01 -#define P_TRIM 0x02 -#define plist_issue(tag) (((tag) & P_ACTIVE) || ((tag) & P_TRIM)) - enum { D_PREP, D_SUBMIT, @@ -245,13 +241,23 @@ struct discard_cmd { int error; /* bio error */ }; +enum { + DPOLICY_BG, + DPOLICY_FORCE, + DPOLICY_FSTRIM, + DPOLICY_UMOUNT, + MAX_DPOLICY, +}; + struct discard_policy { + int type; /* type of discard */ unsigned int min_interval; /* used for candidates exist */ unsigned int max_interval; /* used for candidates not exist */ unsigned int max_requests; /* # of discards issued per round */ unsigned int io_aware_gran; /* minimum granularity discard not be aware of I/O */ bool io_aware; /* issue discard in idle time */ bool sync; /* submit discard with REQ_SYNC flag */ + unsigned int granularity; /* discard granularity */ }; struct discard_cmd_control { @@ -272,7 +278,6 @@ struct discard_cmd_control { atomic_t issing_discard;/* # of issing discard */ atomic_t discard_cmd_cnt; /* # of cached cmd count */ struct rb_root root;/* root of discard rb-tree */ - struct discard_policy dpolicy; /* current discard policy */ }; /* for the list of fsync inodes, used only during recovery */ @@ -2556,6 +2561,8 @@ int f2fs_flush_device_cache(struct f2fs_sb_info *sbi); void destroy_flush_cmd_control(struct f2fs_sb_info *sbi, bool free); void invalidate_blocks(struct f2fs_sb_info *sbi, block_t addr); bool is_checkpointed_data(struct f2fs_sb_info *sbi, block_t blkaddr); +void init_discard_policy(struct discard_policy *dpolicy, int discard_type, + unsigned int granularity); void refresh_sit_entry(struct f2fs_sb_info *sbi, block_t old, block_t new); void stop_discard_thread(struct f2fs_sb_info *sbi); void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi); diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index 5853187230e7..e8fcc1eb4e0a 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -873,13 +873,14 @@ void __check_sit_bitmap(struct f2fs_sb_info *sbi, /* this function is copied from blkdev_issue_discard from block/blk-lib.c */ static void __submit_discard_cmd(struct f2fs_sb_info *sbi, - struct discard_cmd *dc, bool fstrim) + struct discard_policy *dpolicy, + struct discard_cmd *dc) { struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; - struct list_head *wait_list = fstrim ? &(dcc->fstrim_list) : - &(dcc->wait_list); + struct list_head *wait_list = (dpolicy->type == DPOLICY_FSTRIM) ? + &(dcc->fstrim_list) : &(dcc->wait_list); struct bio *bio = NULL; - int flag = dcc->dpolicy.sync ? REQ_SYNC : 0; + int flag = dpolicy->sync ? 
REQ_SYNC : 0; if (dc->state != D_PREP) return; @@ -1086,14 +1087,13 @@ static int __queue_discard_cmd(struct f2fs_sb_info *sbi, } static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi, - unsigned int start, unsigned int end, - unsigned int granularity) + struct discard_policy *dpolicy, + unsigned int start, unsigned int end) { struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; struct discard_cmd *prev_dc = NULL, *next_dc = NULL; struct rb_node **insert_p = NULL, *insert_parent
[PATCH v2 1/6] f2fs: support issuing/waiting discard in range
From: Chao Yu Fstrim intends to trim invalid blocks of filesystem only with specified range and granularity, but actually, it will issue all previous cached discard commands which may be out-of-range and be with unmatched granularity, it's unneeded. In order to fix above issues, this patch introduces new helps to support to issue and wait discard in range and adds a new fstrim_list for tracking in-flight discard from ->fstrim. Signed-off-by: Chao Yu --- v2: - rebase on last codes of Jaegeuk's dev-test branch. fs/f2fs/f2fs.h| 3 +- fs/f2fs/segment.c | 127 +- 2 files changed, 108 insertions(+), 22 deletions(-) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index a3fdcdade50b..6d51e8342513 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -251,6 +251,7 @@ struct discard_cmd_control { struct list_head pend_list[MAX_PLIST_NUM];/* store pending entries */ unsigned char pend_list_tag[MAX_PLIST_NUM];/* tag for pending entries */ struct list_head wait_list; /* store on-flushing entries */ + struct list_head fstrim_list; /* in-flight discard from fstrim */ wait_queue_head_t discard_wait_queue; /* waiting queue for wake-up */ unsigned int discard_wake; /* to wake up discard thread */ struct mutex cmd_lock; @@ -2547,7 +2548,7 @@ void invalidate_blocks(struct f2fs_sb_info *sbi, block_t addr); bool is_checkpointed_data(struct f2fs_sb_info *sbi, block_t blkaddr); void refresh_sit_entry(struct f2fs_sb_info *sbi, block_t old, block_t new); void stop_discard_thread(struct f2fs_sb_info *sbi); -void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi, bool umount); +void f2fs_wait_discard_bios(struct f2fs_sb_info *sbi); void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc); void release_discard_addrs(struct f2fs_sb_info *sbi); int npages_for_summary_flush(struct f2fs_sb_info *sbi, bool for_ra); diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index d5bad5aedded..a0a0a887fc31 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -873,9 +873,11 @@ void __check_sit_bitmap(struct f2fs_sb_info *sbi, /* this function is copied from blkdev_issue_discard from block/blk-lib.c */ static void __submit_discard_cmd(struct f2fs_sb_info *sbi, - struct discard_cmd *dc) + struct discard_cmd *dc, bool fstrim) { struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; + struct list_head *wait_list = fstrim ? 
&(dcc->fstrim_list) : + &(dcc->wait_list); struct bio *bio = NULL; if (dc->state != D_PREP) @@ -897,7 +899,7 @@ static void __submit_discard_cmd(struct f2fs_sb_info *sbi, bio->bi_end_io = f2fs_submit_discard_endio; bio->bi_opf |= REQ_SYNC; submit_bio(bio); - list_move_tail(&dc->list, &dcc->wait_list); + list_move_tail(&dc->list, wait_list); __check_sit_bitmap(sbi, dc->start, dc->start + dc->len); f2fs_update_iostat(sbi, FS_DISCARD, 1); @@ -1082,6 +1084,68 @@ static int __queue_discard_cmd(struct f2fs_sb_info *sbi, return 0; } +static void __issue_discard_cmd_range(struct f2fs_sb_info *sbi, + unsigned int start, unsigned int end, + unsigned int granularity) +{ + struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info; + struct discard_cmd *prev_dc = NULL, *next_dc = NULL; + struct rb_node **insert_p = NULL, *insert_parent = NULL; + struct discard_cmd *dc; + struct blk_plug plug; + int issued; + +next: + issued = 0; + + mutex_lock(&dcc->cmd_lock); + f2fs_bug_on(sbi, !__check_rb_tree_consistence(sbi, &dcc->root)); + + dc = (struct discard_cmd *)__lookup_rb_tree_ret(&dcc->root, + NULL, start, + (struct rb_entry **)&prev_dc, + (struct rb_entry **)&next_dc, + &insert_p, &insert_parent, true); + if (!dc) + dc = next_dc; + + blk_start_plug(&plug); + + while (dc && dc->lstart <= end) { + struct rb_node *node; + + if (dc->len < granularity) + goto skip; + + if (dc->state != D_PREP) { + list_move_tail(&dc->list, &dcc->fstrim_list); + goto skip; + } + + __submit_discard_cmd(sbi, dc, true); + + if (++issued >= DISCARD_ISSUE_RATE) { + start = dc->lstart + dc->len; + + blk_finish_plug(&plug); + mutex_unlock(&dcc->cmd_lock);
[PATCH] objtool: Upgrade libelf-devel warning to error for CONFIG_ORC_UNWINDER
With CONFIG_ORC_UNWINDER, if the user doesn't have libelf-devel installed, and they don't see the make warning, their ORC unwinder will be silently broken. Upgrade the warning to an error. Reported-and-tested-by: Borislav Petkov Signed-off-by: Josh Poimboeuf --- Makefile | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/Makefile b/Makefile index cf007a31d575..5445cbf4d9a0 100644 --- a/Makefile +++ b/Makefile @@ -933,7 +933,11 @@ ifdef CONFIG_STACK_VALIDATION ifeq ($(has_libelf),1) objtool_target := tools/objtool FORCE else -$(warning "Cannot use CONFIG_STACK_VALIDATION, please install libelf-dev, libelf-devel or elfutils-libelf-devel") +ifdef CONFIG_ORC_UNWINDER + $(error "Cannot generate ORC metadata for CONFIG_ORC_UNWINDER, please install libelf-dev, libelf-devel or elfutils-libelf-devel") +else + $(warning "Cannot use CONFIG_STACK_VALIDATION, please install libelf-dev, libelf-devel or elfutils-libelf-devel") +endif SKIP_STACK_VALIDATION := 1 export SKIP_STACK_VALIDATION endif -- 2.13.6
Re: [RFC v3 3/7] platform/x86: intel_pmc_ipc: Use regmap calls for GCR updates
Hi Andy, On 10/01/2017 07:48 AM, Andy Shevchenko wrote: On Tue, Sep 5, 2017 at 8:37 AM, wrote: From: Kuppuswamy Sathyanarayanan Currently, update_no_reboot_bit() function implemented in this driver uses mutex_lock to protect its register updates. But this function is called with in atomic context in iTCO_wdt_start() and iTCO_wdt_stop() functions in iTCO_wdt.c driver, which in turn causes "sleeping into atomic context" issue. This patch fixes this issue by refactoring the current GCR read/write/update functions with regmap APIs. Since it sounds as candidate for stable, Yes. can we have split it to just as less as possible intrusive fix + moving to regmap as a separate change? If we have to split it into two patches then, Patch #1 will fix the "sleep in atomic context issue" by replacing mutex_lock() with spin_lock() in GCR read/write APIs to protect the GCR memory updates. Patch #2 will remove GCR read/write/update APIs and replace it with regmap APIs. But along with this change we will also remove the spin_lock() added in previous patch because regmap calls are already protected by its own locking mechanism. Since Patch #2 will clean up what we do in Patch #1, Do we need to split it into two patches? It should include Fixes: tag as well I suppose. Agree. I will add Fixes tag in next version. Signed-off-by: Kuppuswamy Sathyanarayanan --- drivers/platform/x86/Kconfig | 1 + drivers/platform/x86/intel_pmc_ipc.c | 115 --- 2 files changed, 40 insertions(+), 76 deletions(-) diff --git a/drivers/platform/x86/Kconfig b/drivers/platform/x86/Kconfig index 80b8795..45f4e79 100644 --- a/drivers/platform/x86/Kconfig +++ b/drivers/platform/x86/Kconfig @@ -1054,6 +1054,7 @@ config PVPANIC config INTEL_PMC_IPC tristate "Intel PMC IPC Driver" depends on ACPI + select REGMAP_MMIO ---help--- This driver provides support for PMC control on some Intel platforms. 
The PMC is an ARC processor which defines IPC commands for communication diff --git a/drivers/platform/x86/intel_pmc_ipc.c b/drivers/platform/x86/intel_pmc_ipc.c index 021dcf6..40a25f8 100644 --- a/drivers/platform/x86/intel_pmc_ipc.c +++ b/drivers/platform/x86/intel_pmc_ipc.c @@ -31,9 +31,11 @@ #include #include #include +#include #include #include #include +#include #include @@ -125,7 +127,7 @@ static struct intel_pmc_ipc_dev { /* gcr */ void __iomem *gcr_mem_base; - bool has_gcr_regs; + struct regmap *gcr_regs; /* Telemetry */ u8 telem_res_inval; @@ -150,6 +152,14 @@ static char *ipc_err_sources[] = { "Unsigned kernel", }; +static struct regmap_config gcr_regmap_config = { +.reg_bits = 32, +.reg_stride = 4, +.val_bits = 32, + .fast_io = true, + .max_register = PLAT_RESOURCE_GCR_SIZE, +}; + /* Prevent concurrent calls to the PMC */ static DEFINE_MUTEX(ipclock); @@ -183,21 +193,6 @@ static inline u32 ipc_data_readl(u32 offset) return readl(ipcdev.ipc_base + IPC_READ_BUFFER + offset); } -static inline u64 gcr_data_readq(u32 offset) -{ - return readq(ipcdev.gcr_mem_base + offset); -} - -static inline int is_gcr_valid(u32 offset) -{ - if (!ipcdev.has_gcr_regs) - return -EACCES; - - if (offset > PLAT_RESOURCE_GCR_SIZE) - return -EINVAL; - - return 0; -} /** * intel_pmc_gcr_read() - Read PMC GCR register @@ -210,21 +205,10 @@ static inline int is_gcr_valid(u32 offset) */ int intel_pmc_gcr_read(u32 offset, u32 *data) { - int ret; - - mutex_lock(&ipclock); - - ret = is_gcr_valid(offset); - if (ret < 0) { - mutex_unlock(&ipclock); - return ret; - } - - *data = readl(ipcdev.gcr_mem_base + offset); - - mutex_unlock(&ipclock); + if (!ipcdev.gcr_regs) + return -EACCES; - return 0; + return regmap_read(ipcdev.gcr_regs, offset, data); } EXPORT_SYMBOL_GPL(intel_pmc_gcr_read); @@ -240,21 +224,10 @@ EXPORT_SYMBOL_GPL(intel_pmc_gcr_read); */ int intel_pmc_gcr_write(u32 offset, u32 data) { - int ret; - - mutex_lock(&ipclock); - - ret = is_gcr_valid(offset); - if (ret < 0) { - mutex_unlock(&ipclock); - return ret; - } - - writel(data, ipcdev.gcr_mem_base + offset); - - mutex_unlock(&ipclock); + if (!ipcdev.gcr_regs) + return -EACCES; - return 0; + return regmap_write(ipcdev.gcr_regs, offset, data); } EXPORT_SYMBOL_GPL(intel_pmc_gcr_write); @@ -271,33 +244,10 @@ EXPORT_SYMBOL_GPL(intel_pmc_gcr_write); */ int intel_pmc_gcr_update(u32 offset, u32 mask, u32 val) { - u32 new_val; - int ret = 0; - - mutex_lock(&ipclock); - - ret = is_gcr_valid(offset); - if (ret < 0) - goto gcr_ipc_unlo
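For the two-patch split being discussed, a minimal sketch of what the smaller "Patch #1" could look like (illustrative only: it keeps the driver's existing is_gcr_valid()/ipcdev names from the diff above and assumes a dedicated gcr_lock spinlock rather than converting the sleeping ipclock mutex), so the GCR helpers become safe to call from the atomic iTCO_wdt_start()/iTCO_wdt_stop() paths:

static DEFINE_SPINLOCK(gcr_lock);	/* assumed new lock, protects only the GCR window */

int intel_pmc_gcr_update(u32 offset, u32 mask, u32 val)
{
	unsigned long flags;
	u32 new_val;
	int ret;

	spin_lock_irqsave(&gcr_lock, flags);

	ret = is_gcr_valid(offset);
	if (ret < 0)
		goto out_unlock;

	/* read-modify-write of the GCR register without ever sleeping */
	new_val = readl(ipcdev.gcr_mem_base + offset);
	new_val &= ~mask;
	new_val |= val & mask;
	writel(new_val, ipcdev.gcr_mem_base + offset);

out_unlock:
	spin_unlock_irqrestore(&gcr_lock, flags);
	return ret;
}

The regmap conversion in "Patch #2" would then drop that spinlock again, since regmap-mmio with .fast_io = true already serializes accesses with its own internal spinlock.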
Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless
On Tue, Oct 03, 2017 at 05:43:47PM -0700, Nadav Amit wrote: > Jerome Glisse wrote: > > > On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote: > > > >> I'd like some more explanation about the inner working of "that new > >> user" as per comment above. > >> > >> It would be enough to drop mmu_notifier_invalidate_range from above > >> without adding it to the filebacked case. The above gives higher prio > >> to the hypothetical and uncertain future case, than to the current > >> real filebacked case that doesn't need ->invalidate_range inside the > >> PT lock, or do you see something that might already need such > >> ->invalidate_range? > > > > No i don't see any new user today that might need such invalidate but > > i was trying to be extra cautious as i have a tendency to assume that > > someone might do a patch that use try_to_unmap() without going through > > all the comments in the function and thus possibly using it in a an > > unexpected way from mmu_notifier callback point of view. I am fine > > with putting the burden on new user to get it right and adding an > > extra warning in the function description to try to warn people in a > > sensible way. > > I must be missing something. After the PTE is changed, but before the > secondary TLB notification/invalidation - What prevents another thread from > changing the mappings (e.g., using munmap/mmap), and setting a new page > at that PTE? > > Wouldn’t it end with the page being mapped without a secondary TLB flush in > between? munmap would call mmu_notifier to invalidate the range too so secondary TLB would be properly flush before any new pte can be setup in for that particular virtual address range. Unlike CPU TLB flush, secondary TLB flush are un-conditional and thus current pte value does not play any role. Cheers, Jérôme
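To illustrate the point about munmap(), the rough shape of the bracketing that a range teardown goes through is sketched below (schematic only, not actual mm/ code; the signatures are the 4.14-era ones):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static void unmap_range_sketch(struct mm_struct *mm,
			       unsigned long start, unsigned long end)
{
	/* secondary MMUs are told the range is going away... */
	mmu_notifier_invalidate_range_start(mm, start, end);

	/* ...the CPU page tables for [start, end) are torn down here... */

	/*
	 * ...and the secondary TLBs are flushed, unconditionally, before
	 * this returns, so the range cannot be re-populated while a
	 * device still caches a stale translation.
	 */
	mmu_notifier_invalidate_range_end(mm, start, end);
}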
Re: btusb "firmware request while host is not available" at resume
On Tue, Oct 3, 2017 at 5:20 PM, Luis R. Rodriguez wrote:
> the ordering devised currently there is:
>
> o device driver pm ops called
> o notifier for suspend issued - *going to suspend*
> o userspace frozen
> o filesystem freeze
>
> On the way back up this order is inverted:
>
> o filesystem freeze
> o userspace frozen
> o notifier for suspend issued - *going to suspend*
> o device driver pm ops called

Fortunately I had it a tad bit wrong, but in a good way. Our ordering on our way down is:

o notifier for suspend issued - *going to suspend*
o userspace frozen
o filesystem freeze (new, being proposed)
o device driver pm ops called

Then on our way up:

o device driver pm ops called
o filesystem thaw
o userspace thaw
o notifier for resume issued - *thawing*

So the driver callbacks get called *later*, so anything called in notifiers does get a chance to quiesce things properly.

Luis
Re: [f2fs-dev] [PATCH v2] f2fs: support flexible inline xattr size
Hi Jaegeuk, When I rebase on last dev-test branch, I found that there are splitted codes from this patch which contains fixing codes for recovery flow. I guess you have missed this v2 patch. Could you merge and test this v2? Thanks, On 2017/9/6 21:59, Chao Yu wrote: > From: Chao Yu > > Now, in product, more and more features based on file encryption were > introduced, their demand of xattr space is increasing, however, inline > xattr has fixed-size of 200 bytes, once inline xattr space is full, new > increased xattr data would occupy additional xattr block which may bring > us more space usage and performance regression during persisting. > > In order to resolve above issue, it's better to expand inline xattr size > flexibly according to user's requirement. > > So this patch introduces new filesystem feature 'flexible inline xattr', > and new mount option 'inline_xattr_size=%u', once mkfs enables the > feature, we can use the option to make f2fs supporting flexible inline > xattr size. > > To support this feature, we add extra attribute i_inline_xattr_size in > inode layout, indicating that how many space inline xattr borrows from > block address mapping space in inode layout, by this, we can easily > locate and store flexible-sized inline xattr data in inode. > > Inode disk layout: > +--+ > | .i_mode | > | ... | > | .i_ext | > +--+ > | .i_extra_isize | > | .i_inline_xattr_size |---+ > | ... | | > +--+ | > | .i_addr | | > | - block address or | | > | - inline data | | > +--+<---+ v > |inline xattr |+---inline xattr range > +--+<---+ > | .i_nid | > +--+ > | node_footer| > | (nid, ino, offset) | > +--+ > > Signed-off-by: Chao Yu > --- > v2: > - recover i_inline_xattr_size during SPOR. > fs/f2fs/f2fs.h | 43 ++- > fs/f2fs/inode.c | 12 > fs/f2fs/namei.c | 6 ++ > fs/f2fs/node.c | 10 -- > fs/f2fs/super.c | 32 +++- > fs/f2fs/sysfs.c | 7 +++ > fs/f2fs/xattr.c | 18 +- > include/linux/f2fs_fs.h | 5 +++-- > 8 files changed, 106 insertions(+), 27 deletions(-) > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index 0d4bf497d566..83d3ec57e303 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -93,6 +93,7 @@ extern char *fault_name[FAULT_MAX]; > #define F2FS_MOUNT_GRPQUOTA 0x0010 > #define F2FS_MOUNT_PRJQUOTA 0x0020 > #define F2FS_MOUNT_QUOTA 0x0040 > +#define F2FS_MOUNT_INLINE_XATTR_SIZE 0x0080 > > #define clear_opt(sbi, option) ((sbi)->mount_opt.opt &= > ~F2FS_MOUNT_##option) > #define set_opt(sbi, option) ((sbi)->mount_opt.opt |= F2FS_MOUNT_##option) > @@ -118,6 +119,7 @@ struct f2fs_mount_info { > #define F2FS_FEATURE_EXTRA_ATTR 0x0008 > #define F2FS_FEATURE_PRJQUOTA0x0010 > #define F2FS_FEATURE_INODE_CHKSUM0x0020 > +#define F2FS_FEATURE_FLEXIBLE_INLINE_XATTR 0x0040 > > #define F2FS_HAS_FEATURE(sb, mask) \ > ((F2FS_SB(sb)->raw_super->feature & cpu_to_le32(mask)) != 0) > @@ -379,11 +381,14 @@ struct f2fs_flush_device { > > /* for inline stuff */ > #define DEF_INLINE_RESERVED_SIZE 1 > +#define DEF_MIN_INLINE_SIZE 1 > static inline int get_extra_isize(struct inode *inode); > -#define MAX_INLINE_DATA(inode) (sizeof(__le32) * \ > - (CUR_ADDRS_PER_INODE(inode) - \ > - DEF_INLINE_RESERVED_SIZE - \ > - F2FS_INLINE_XATTR_ADDRS)) > +static inline int get_inline_xattr_addrs(struct inode *inode); > +#define F2FS_INLINE_XATTR_ADDRS(inode) get_inline_xattr_addrs(inode) > +#define MAX_INLINE_DATA(inode) (sizeof(__le32) * > \ > + (CUR_ADDRS_PER_INODE(inode) - \ > + F2FS_INLINE_XATTR_ADDRS(inode) -\ > + DEF_INLINE_RESERVED_SIZE)) > > /* for inline dir */ > #define NR_INLINE_DENTRY(inode) 
(MAX_INLINE_DATA(inode) * BITS_PER_BYTE > / \ > @@ -592,6 +597,7 @@ struct f2fs_inode_info { > > int i_extra_isize; /* size of extra space located in > i_addr */ > kprojid_t i_projid; /* id for project quota */ > + int i_inline_xattr_size;/* inline xattr size */ > }; > > static inline void get_extent_info(struct extent_info *ext, > @@ -1043,6 +1049,7 @@ struct f2fs_sb_info { > loff_t max_file_blocks; /* max block index of
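To put numbers on the fixed-size case mentioned above (assuming the long-standing default of F2FS_INLINE_XATTR_ADDRS = 50): the inline xattr area is 50 four-byte address slots borrowed from i_addr, i.e. 50 * 4 = 200 bytes. With the new i_inline_xattr_size field that slot count becomes per-inode, so doubling it to 100 slots, for example, would give 400 bytes of inline xattr space at the cost of 50 fewer i_addr slots for inline data and block addresses.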
Re: [GIT PULL] ARM: at91: fixes for 4.14 #1
On Mon, Sep 18, 2017 at 06:20:29PM +0200, Nicolas Ferre wrote: > From: Nicolas Ferre > > Arnd, Olof, > > Here are some fixes for 4.14. I send them while Alexandre is in the US. The > most of this PR are modifications to the sama5d27 SOM1 EK board that was > introducts lately. These modifications are errors due to typos, > misunderstanding > of the DT binding or simply last minute hardware modifications. > I also added one patch for the AT91 PM subsystem as it is the re-alignment on > one treewide change by Russell to get rid of virt_to_phys() in low-level code. > I think that unifying this move the soonest for all ARM platform is better. > > Thanks, best regards, > > The following changes since commit 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e: > > Linux 4.14-rc1 (2017-09-16 15:47:51 -0700) > > are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/nferre/linux-at91.git > tags/at91-fixes > > for you to fetch changes up to 093d79f62a89f47d9b5fd0746768146d9696535c: > > ARM: at91: Replace uses of virt_to_phys with __pa_symbol (2017-09-18 > 10:05:38 +0200) Applied to fixes. -Olof
Re: [PATCH] ARM: defconfig: FRAMEBUFFER_CONSOLE can no longer be =m
On Mon, Sep 18, 2017 at 05:47:50PM +0200, Arnd Bergmann wrote: > It is no longer possible to load this at runtime, so let's > change the few remaining users to have it built-in all > the time. > > arch/arm/configs/zeus_defconfig:115:warning: symbol value 'm' invalid for > FRAMEBUFFER_CONSOLE > arch/arm/configs/viper_defconfig:116:warning: symbol value 'm' invalid for > FRAMEBUFFER_CONSOLE > arch/arm/configs/pxa_defconfig:474:warning: symbol value 'm' invalid for > FRAMEBUFFER_CONSOLE > > Reported-by: kernelci.org bot > Fixes: 6104c37094e7 ("fbcon: Make fbcon a built-time depency for fbdev") > Signed-off-by: Arnd Bergmann Applied to fixes. -Olof
[PATCH] block/drbd: Fix a sleep-in-atomic bug in drbd_bcast_event
The driver may sleep under an RCU lock, and the function call path is:

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
    drbd_khelper
      drbd_bcast_event
        genlmsg_new(GFP_NOIO) --> may sleep

To fix it, GFP_NOIO is replaced with GFP_ATOMIC. This bug is found by my static analysis tool and my code review.

Signed-off-by: Jia-Ju Bai
---
 drivers/block/drbd/drbd_nl.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index a12f77e..713c965 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -4537,7 +4537,7 @@ void drbd_bcast_event(struct drbd_device *device, const struct sib_info *sib)
 	int err = -ENOMEM;
 
 	seq = atomic_inc_return(&drbd_genl_seq);
-	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_NOIO);
+	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_ATOMIC);
 	if (!msg)
 		goto failed;
-- 
1.7.9.5
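As a side note on the flag semantics (not drbd code, just an illustration): GFP_NOIO only prevents the allocator from recursing into the I/O path, it may still sleep in direct reclaim, whereas GFP_ATOMIC never sleeps and is therefore usable inside an RCU read-side critical section:

#include <linux/rcupdate.h>
#include <net/genetlink.h>

static struct sk_buff *alloc_under_rcu_example(void)
{
	struct sk_buff *skb;

	rcu_read_lock();
	/* genlmsg_new(NLMSG_GOODSIZE, GFP_NOIO) here could sleep -> bug */
	skb = genlmsg_new(NLMSG_GOODSIZE, GFP_ATOMIC);	/* never sleeps */
	rcu_read_unlock();

	return skb;	/* may be NULL: GFP_ATOMIC fails rather than waits */
}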
[PATCH] block/drbd: Fix a sleep-in-atomic bug in notify_helper
The driver may sleep under an RCU lock, and the function call path is:

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
    drbd_khelper
      notify_helper
        genlmsg_new(GFP_NOIO) --> may sleep

To fix it, GFP_NOIO is replaced with GFP_ATOMIC. This bug is found by my static analysis tool and my code review.

Signed-off-by: Jia-Ju Bai
---
 drivers/block/drbd/drbd_nl.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index a12f77e..ad093da 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -4790,7 +4790,7 @@ void notify_helper(enum drbd_notification_type type,
 	helper_info.helper_name_len = min(strlen(name), sizeof(helper_info.helper_name));
 	helper_info.helper_status = status;
 
-	skb = genlmsg_new(NLMSG_GOODSIZE, GFP_NOIO);
+	skb = genlmsg_new(NLMSG_GOODSIZE, GFP_ATOMIC);
 	err = -ENOMEM;
 	if (!skb)
 		goto fail;
-- 
1.7.9.5
[BUG] drbd/block: A sleep-in-atomic bug in notify_helper
The driver may sleep under an RCU lock, and the function call path is:

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
    drbd_khelper
      notify_helper
        mutex_lock --> may sleep

I hope to fix it, but I have not found a proper way yet ... This bug is found by my static analysis tool and my code review.

Thanks,
Jia-Ju Bai
Re: [PATCH 1/7] gpio: brcmstb: allow all instances to be wakeup sources
On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: > This commit allows a wakeup parent interrupt to be shared between > instances. > > It also removes the redundant can_wake member of the private data > structure by using whether the parent_wake_irq has been defined to > indicate that the GPIO device can wake. > > Fixes: 19a7b6940b78 ("gpio: brcmstb: Add interrupt and wakeup source support") > Signed-off-by: Doug Berger Acked-by: Gregory Fong
Re: [RFC 4/5] ext4: add fs freezing support on suspend/hibernation
On Tue, Oct 03, 2017 at 10:13:21PM +0200, Luis R. Rodriguez wrote: > > After we remove add the NEEDS_RECOVERY flag, we need to make sure > > recovery flag is pushed out to disk before any other changes are > > allowed to be pushed out to disk. That's why we originally did the > > update synchronously. > > I see. I had to try to dig through linux-history to see why this was done, and > I actually could not find an exact commit which explained what you just did. > Thanks! > > Hrm, so on freeze we issue the same commit as well, so is a sync *really* > needed > on thaw? So let me explain what's going on. When we do a freeze, we make sure all of the blocks that are written to the journal are writen to the final location on disk, and the journal is is truncated. (This is called a "journal checkpoint".) Then we clear the NEEDS RECOVERY feature flag and set the EXT4_VALID_FS flags in the superblock. In the no journal case, we flush out all metadata writes, and we set the EXT4_VALID_FS flag. In both cases, we call ext4_commit_super(sb, 1) to make sure the flags update is pushed out to disk. This puts the file system into a "quiscent" state; in fact, it looks like the file system has been unmounted, so that it becomes safe to run dump/restore, run fsck -n on the file system, etc. The problem on the thaw side is that we need to make sure that EXT4_VALID_FS flag is removed (and if journaling is enabled, the NEEDS RECOVERY feature flag is set) and the superblock is flushed out to the storage device before any other writes are persisted on the disk. In the case where we have journalling enabled, we could wait until the first journal commit to write out the superblock, since in journal mode all metadata updates to the disk are suppressed until the journal commit. We don't do that today, but it is a change we could make. However, in the no journal node we need to make sure the EXT4_VALID_FS flag is cleared on disk before any other metadata operations can take place, and calling ext4_commit_super(sb, 1) is the only real way to do that. > No, it was am empirical evaluation done at testing, I observed bio_submit() > stalls, we never get completion from the device. Even if we call thaw at the > very end of resume, after the devices should be up, we still end up in the > same > situation. Given how I order freezing filesystems after freezing userspace I > do > believe we should thaw filesystems prior unfreezing userspace, its why I > placed > the call where it is now. So when we call ext4_commit_super(sb, 1), we issue the bio, and then we block waiting for the I/O to complete. That's a blocking call. Is the thaw context one which allows blocking? If userspace is still frozen, does that imply that the scheduler isn't allow to run? If that's the case, then that's probably the problem. More generally, if the thawing code needs to allocate memory, or do any number of things that could potentially block, this could potentially be an issue. Or if we have a network block device, or something else in the storage stack that needs to run a kernel thread context (or a workqueue, etc.) --- is the fact that userspace is frozen mean the scheduler is going to refuse to schedule()? I know at one point we made a distinction between freezing userspace threads and kernel threads, but were there people who didn't like this because of unnecessary complexity. It sounds to me like on the thaw side, we might also need to unfreeze kernel threads first, and then allow userspace threads to run. 
Do we do that today, or in the new freeze/thaw code? It's been quite a while since I've looked at that part of the kernel. - Ted
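Condensed into code, the sequence described above looks roughly like this (a sketch, not the actual fs/ext4/super.c code; journal flushing, the no-journal case, locking and error paths are all omitted):

static int ext4_freeze_sketch(struct super_block *sb)
{
	/* the journal has already been checkpointed; mark the fs "clean" */
	ext4_clear_feature_journal_needs_recovery(sb);

	/* push the flag change out synchronously */
	return ext4_commit_super(sb, 1);
}

static int ext4_unfreeze_sketch(struct super_block *sb)
{
	/* re-arm recovery... */
	ext4_set_feature_journal_needs_recovery(sb);

	/* ...and make sure it reaches the media before any other write */
	return ext4_commit_super(sb, 1);
}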
Re: [PATCH v2 1/4] KVM: LAPIC: Fix lapic timer mode transition
2017-10-04 1:05 GMT+08:00 Radim Krčmář : > 2017-09-28 18:04-0700, Wanpeng Li: >> From: Wanpeng Li >> >> SDM 10.5.4.1 TSC-Deadline Mode mentioned that "Transitioning between >> TSC-Deadline >> mode and other timer modes also disarms the timer". So the APIC Timer >> Initial Count >> Register for one-shot/periodic mode should be reset. This patch do it. > > At the beginning of the secion is also: > > A write to the LVT Timer Register that changes the timer mode disarms > the local APIC timer. The supported timer modes are given in Table > 10-2. The three modes of the local APIC timer are mutually exclusive. Yeah, I saw it before sending out the patches, but it is mentioned in TSC-deadline section which looks strange, if the timer is still disarmed when switching between one-shot and periodic mode before TSC-deadline is introduced and w/o TSC-deadline section? > > So we should also disarm when switching between one-shot and periodic. > > apic_update_lvtt() already has logic to determine whether the timer mode > has changed and is the perfect place to clear APIC_TMICT. Agreed, thanks for your review, Radim. :) Regards, Wanpeng Li > > Thanks. > >> Cc: Paolo Bonzini >> Cc: Radim Krčmář >> Signed-off-by: Wanpeng Li >> --- >> arch/x86/include/asm/apicdef.h | 1 + >> arch/x86/kvm/lapic.c | 3 +++ >> 2 files changed, 4 insertions(+) >> >> diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h >> index c46bb99..d8ef1b4 100644 >> --- a/arch/x86/include/asm/apicdef.h >> +++ b/arch/x86/include/asm/apicdef.h >> @@ -100,6 +100,7 @@ >> #define APIC_TIMER_BASE_CLKIN 0x0 >> #define APIC_TIMER_BASE_TMBASE 0x1 >> #define APIC_TIMER_BASE_DIV 0x2 >> +#define APIC_LVT_TIMER_MASK (3 << 17) >> #define APIC_LVT_TIMER_ONESHOT (0 << 17) >> #define APIC_LVT_TIMER_PERIODIC (1 << 17) >> #define APIC_LVT_TIMER_TSCDEADLINE (2 << 17) >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c >> index 69c5612..a739cbb 100644 >> --- a/arch/x86/kvm/lapic.c >> +++ b/arch/x86/kvm/lapic.c >> @@ -1722,6 +1722,9 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 >> reg, u32 val) >> break; >> >> case APIC_LVTT: >> + if (apic_lvtt_tscdeadline(apic) != ((val & >> + APIC_LVT_TIMER_MASK) == APIC_LVT_TIMER_TSCDEADLINE)) >> + kvm_lapic_set_reg(apic, APIC_TMICT, 0); >> if (!kvm_apic_sw_enabled(apic)) >> val |= APIC_LVT_MASKED; >> val &= (apic_lvt_mask[0] | apic->lapic_timer.timer_mode_mask); >> -- >> 2.7.4 >>
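A sketch of the direction agreed above, i.e. clearing TMICT from the mode-change path itself so that any LVTT mode transition disarms the timer (illustrative, not the final patch; all names come from the existing lapic.c code quoted in this thread):

static void apic_update_lvtt(struct kvm_lapic *apic)
{
	u32 timer_mode = kvm_lapic_get_reg(apic, APIC_LVTT) &
			apic->lapic_timer.timer_mode_mask;

	if (apic->lapic_timer.timer_mode != timer_mode) {
		apic->lapic_timer.timer_mode = timer_mode;
		hrtimer_cancel(&apic->lapic_timer.timer);
		/* a write to LVTT that changes the mode disarms the timer (SDM) */
		kvm_lapic_set_reg(apic, APIC_TMICT, 0);
	}
}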
Re: [PATCH v2 2/4] KVM: LAPIC: Keep timer running when switching between one-shot and periodic mode
2017-10-04 1:06 GMT+08:00 Radim Krčmář : > 2017-09-28 18:04-0700, Wanpeng Li: >> From: Wanpeng Li >> >> If we take TSC-deadline mode timer out of the picture, the Intel SDM >> does not say that the timer is disable when the timer mode is change, >> either from one-shot to periodic or vice versa. > > I think it does, please see comment under [v2 1/4]. As I replied to [v2 1/4]. Regards, Wanpeng Li > >> After this patch, the timer is no longer disarmed on change of mode, so >> the counter (TMCCT) keeps counting down. >> >> So what does a write to LVTT changes ? On baremetal, the change of mode >> is probably taken into account only when the counter reach 0. When this >> happen, LVTT is use to figure out if the counter should restard counting >> down from TMICT (so periodic mode) or stop counting (if one-shot mode). >> >> This patch is based on observation of the behavior of the APIC timer on >> baremetal as well as check that they does not go against the description >> written in the Intel SDM. >> >> Cc: Paolo Bonzini >> Cc: Radim Krčmář >> Signed-off-by: Wanpeng Li >> --- >> arch/x86/kvm/lapic.c | 40 >> 1 file changed, 28 insertions(+), 12 deletions(-) >> >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c >> index a739cbb..946c11b 100644 >> --- a/arch/x86/kvm/lapic.c >> +++ b/arch/x86/kvm/lapic.c >> @@ -1301,7 +1301,7 @@ static void update_divide_count(struct kvm_lapic *apic) >> apic->divide_count); >> } >> >> -static void apic_update_lvtt(struct kvm_lapic *apic) >> +static bool apic_update_lvtt(struct kvm_lapic *apic) >> { >> u32 timer_mode = kvm_lapic_get_reg(apic, APIC_LVTT) & >> apic->lapic_timer.timer_mode_mask; >> @@ -1309,7 +1309,9 @@ static void apic_update_lvtt(struct kvm_lapic *apic) >> if (apic->lapic_timer.timer_mode != timer_mode) { >> apic->lapic_timer.timer_mode = timer_mode; >> hrtimer_cancel(&apic->lapic_timer.timer); >> + return true; >> } >> + return false; >> } >> >> static void apic_timer_expired(struct kvm_lapic *apic) >> @@ -1430,11 +1432,12 @@ static void start_sw_period(struct kvm_lapic *apic) >> HRTIMER_MODE_ABS_PINNED); >> } >> >> -static bool set_target_expiration(struct kvm_lapic *apic) >> +static bool set_target_expiration(struct kvm_lapic *apic, bool timer_update) >> { >> - ktime_t now; >> - u64 tscl = rdtsc(); >> + ktime_t now, remaining; >> + u64 tscl = rdtsc(), delta; >> >> + /* Calculate the next time the timer should trigger an interrupt */ >> now = ktime_get(); >> apic->lapic_timer.period = (u64)kvm_lapic_get_reg(apic, APIC_TMICT) >> * APIC_BUS_CYCLE_NS * apic->divide_count; >> @@ -1470,9 +1473,21 @@ static bool set_target_expiration(struct kvm_lapic >> *apic) >> ktime_to_ns(ktime_add_ns(now, >> apic->lapic_timer.period))); >> >> + if (!timer_update) >> + delta = apic->lapic_timer.period; >> + else { >> + remaining = ktime_sub(apic->lapic_timer.target_expiration, >> now); >> + if (ktime_to_ns(remaining) < 0) >> + remaining = 0; >> + delta = mod_64(ktime_to_ns(remaining), >> apic->lapic_timer.period); >> + } >> + >> + if (!delta) >> + return false; >> + >> apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) + >> - nsec_to_cycles(apic->vcpu, apic->lapic_timer.period); >> - apic->lapic_timer.target_expiration = ktime_add_ns(now, >> apic->lapic_timer.period); >> + nsec_to_cycles(apic->vcpu, delta); >> + apic->lapic_timer.target_expiration = ktime_add_ns(now, delta); >> >> return true; >> } >> @@ -1609,12 +1624,12 @@ void kvm_lapic_restart_hv_timer(struct kvm_vcpu >> *vcpu) >> restart_apic_timer(apic); >> } >> >> -static void 
start_apic_timer(struct kvm_lapic *apic) >> +static void start_apic_timer(struct kvm_lapic *apic, bool timer_update) >> { >> atomic_set(&apic->lapic_timer.pending, 0); >> >> if ((apic_lvtt_period(apic) || apic_lvtt_oneshot(apic)) >> - && !set_target_expiration(apic)) >> + && !set_target_expiration(apic, timer_update)) >> return; >> >> restart_apic_timer(apic); >> @@ -1729,7 +1744,8 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 >> reg, u32 val) >> val |= APIC_LVT_MASKED; >> val &= (apic_lvt_mask[0] | apic->lapic_timer.timer_mode_mask); >> kvm_lapic_set_reg(apic, APIC_LVTT, val); >> - apic_update_lvtt(apic); >> + if (apic_update_lvtt(apic) && !apic_lvtt_tscdeadline(apic)) >> + start_apic_timer(apic, true); >> break; >> >> case APIC_TMICT: >> @@ -1738,7 +1754,7 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 >> reg, u32 val) >> >>
Re: [PATCH 2/7] gpio: brcmstb: release the bgpio lock during irq handlers
Hi Doug, On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: > The basic memory-mapped GPIO controller lock must be released > before calling the registered GPIO interrupt handlers to allow > the interrupt handlers to access the hardware. Otherwise, the > hardware accesses will deadlock when they attempt to grab the > lock. I was having some trouble understanding exactly what the problem was here, but I think I see it now. Since this locks the entire bank, where some GPIOs might be set as outputs and some as inputs (and interrupt sources), then an interrupt on a GPIO that is supposed to set another GPIO in the bank would result in deadlock. Is that correct? If so, please update the commit message to make that clear, and nice fix. If not that, it would be nice to know what scenario can cause a problem. Thanks, Gregory
Re: [PATCH v2 3/4] KVM: LAPIC: Apply change to TDCR right away to the timer
2017-10-04 1:28 GMT+08:00 Radim Krčmář : > 2017-09-28 18:04-0700, Wanpeng Li: >> From: Wanpeng Li >> >> The description in the Intel SDM of how the divide configuration >> register is used: "The APIC timer frequency will be the processor's bus >> clock or core crystal clock frequency divided by the value specified in >> the divide configuration register." >> >> Observation of baremetal shown that when the TDCR is change, the TMCCT >> does not change or make a big jump in value, but the rate at which it >> count down change. >> >> The patch update the emulation to APIC timer to so that a change to the >> divide configuration would be reflected in the value of the counter and >> when the next interrupt is triggered. >> >> Cc: Paolo Bonzini >> Cc: Radim Krčmář >> Signed-off-by: Wanpeng Li >> --- > > Why do we need to do more than just restart the timer? Because the current timer (hv or sw) is still running. I think the goal of this commit is to update the rate of the currently running timer at runtime. Our restart_apic_timer() implementation just cancels the current timer when switching between the preemption timer and the hrtimer. Regards, Wanpeng Li > > The TMCCT should remain roughly at the same level -- changing divide > count modifies target_expiration and it looks like apic_get_tmcct() > would get the same result like before changing divide count. > > Thanks.
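For a concrete feel of the rate change being described (numbers made up for illustration): with a 1 GHz bus clock and a TDCR divide of 2, TMCCT decrements once every 2 ns; if TDCR is rewritten to divide-by-8 while 1,000,000 ticks remain, the count stays where it is but now decrements every 8 ns, so the remaining time stretches from 2 ms to 8 ms. That is why the emulation has to recompute the expiration from the current TMCCT and the new divider rather than simply restart the timer.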
Re: [PATCH 3/7] gpio: brcmstb: switch to handle_level_irq flow
On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: > Reading and writing the gpio bank status register each time a pending > interrupt bit is serviced could cause new pending bits to be cleared > without servicing the associated interrupts. > > By using the handle_level_irq flow instead of the handle_simple_irq > flow we get proper handling of interrupt masking as well as acking > of interrupts. The irq_ack method is added to support this. > > Fixes: 19a7b6940b78 ("gpio: brcmstb: Add interrupt and wakeup source support") > Signed-off-by: Doug Berger Acked-by: Gregory Fong
Re: [PATCH 4/7] gpio: brcmstb: correct the configuration of level interrupts
On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: > This commit corrects a bug when configuring the GPIO hardware for > IRQ_TYPE_LEVEL_LOW and IRQ_TYPE_LEVEL_HIGH interrupt types. The > hardware is now correctly configured to support those types. > > Fixes: 19a7b6940b78 ("gpio: brcmstb: Add interrupt and wakeup source support") > Signed-off-by: Doug Berger Acked-by: Gregory Fong
Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM
On Mon, Oct 02, 2017 at 10:12:49AM -0700, Jacob Pan wrote: > On Sat, 30 Sep 2017 01:21:43 +0200 > "Rafael J. Wysocki" wrote: > > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin > > wrote: > > > intel idle driver does not DTRT when running within a VM: > > > when going into a deep power state, the right thing to > > > do is to exit to hypervisor rather than to keep polling > > > within guest using mwait. > > > > > > Currently the solution is just to exit to hypervisor each time we go > > > idle - this is why kvm does not expose the mwait leaf to guests even > > > when it allows guests to do mwait. > > > > > > But that's not ideal - it seems better to use the idle driver to > > > guess when will the next interrupt arrive. > > > > The idle driver alone is not sufficient for that, though. > > > I second that. Why try to solve this problem at vendor specific driver > level? Well we still want to e.g. mwait if possible - saves power. > perhaps just a pv idle driver that decide whether to vmexit > based on something like local per vCPU timer expiration? I guess we > can't predict other wake events such as interrupts. > e.g. > if (get_next_timer_interrupt() > kvm_halt_target_residency) > vmexit > else > poll > > Jacob It's not always a poll, on x86 putting the CPU in a low power state is possible within a VM. Does not seem possible on other CPUs that's why it's vendor specific. -- MST
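An x86-flavoured sketch of the guest-side policy being discussed (kvm_halt_target_residency_ns and the predicted-idle input are placeholders here, not an existing interface): stay in the guest for short predicted idle, HLT for long idle so the exit lets the host reuse the pCPU:

#include <linux/types.h>
#include <asm/processor.h>
#include <asm/irqflags.h>

static u64 kvm_halt_target_residency_ns = 200000;	/* placeholder threshold */

static void kvm_guest_idle_sketch(u64 predicted_idle_ns)
{
	if (predicted_idle_ns < kvm_halt_target_residency_ns) {
		/* short idle: poll (or mwait, where the host exposes it) */
		cpu_relax();
	} else {
		/* long idle: HLT is normally intercepted, so the host gets the CPU back */
		safe_halt();
	}
}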
Re: e4dace3615 ("lib: add test module for CONFIG_DEBUG_VIRTUAL"): [ 10.185572] kernel BUG at arch/x86/mm/physaddr.c:75!
Hi Fengguang, On 10/03/2017 07:24 AM, Fengguang Wu wrote: > Hi Florian, > > FYI we see this BUG in latest RC kernel and linux-next. > Have it been fixed somewhere else? The purpose of the test module is to make tests for CONFIG_DEBUG_VIRTUAL trigger such warnings/VIRTUAL_BUG_ON, because of how the test is coded, you may want to not load such module. If the module needs to be moved to another directory because lib/* contains mostly functional tests, let me know and I will cook something. Thanks! > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > > commit e4dace3615526fd66c86dd535ee4bc9e8c706e37 > Author: Florian Fainelli > AuthorDate: Fri Sep 8 16:15:31 2017 -0700 > Commit: Linus Torvalds > CommitDate: Fri Sep 8 18:26:49 2017 -0700 > > lib: add test module for CONFIG_DEBUG_VIRTUAL > > Add a test module that allows testing that CONFIG_DEBUG_VIRTUAL works > correctly, at least that it can catch invalid calls to virt_to_phys() > against the non-linear kernel virtual address map. > > Link: > http://lkml.kernel.org/r/20170808164035.26725-1-f.faine...@gmail.com > Signed-off-by: Florian Fainelli > Cc: "Luis R. Rodriguez" > Cc: Kees Cook > Signed-off-by: Andrew Morton > Signed-off-by: Linus Torvalds > > 9888a588ea lib/hexdump.c: return -EINVAL in case of error in hex2bin() > e4dace3615 lib: add test module for CONFIG_DEBUG_VIRTUAL > 9e66317d3c Linux 4.14-rc3 > 1418b85217 Add linux-next specific files for 20170929 > +--+++---+---+ > | | 9888a588ea | e4dace3615 | > v4.14-rc3 | next-20170929 | > +--+++---+---+ > | boot_successes | 34 | 0 | 0 >| 0 | > | boot_failures| 0 | 15 | 11 >| 11| > | kernel_BUG_at_arch/x86/mm/physaddr.c | 0 | 15 | 11 >| 11| > | invalid_opcode:#[##] | 0 | 15 | 11 >| 11| > | EIP:__phys_addr | 0 | 15 | 11 >| 11| > | Kernel_panic-not_syncing:Fatal_exception | 0 | 15 | 11 >| 11| > +--+++---+---+ > > [9.837344] test_firmware: interface ready > [9.838286] test passed > [9.839498] test_printf: all 260 tests passed > [ 10.183840] test_bitmap: all 460506 tests passed > [ 10.184882] [ cut here ] > [ 10.185572] kernel BUG at arch/x86/mm/physaddr.c:75! > [ 10.186568] invalid opcode: [#1] > [ 10.187104] CPU: 0 PID: 1 Comm: swapper Not tainted 4.13.0-09302-ge4dace3 > #1 > [ 10.187547] task: 91c45000 task.stack: 91c54000 > [ 10.187547] EIP: __phys_addr+0x244/0x2a0 > [ 10.187547] EFLAGS: 00210246 CPU: 0 > [ 10.187547] EAX: EBX: 92fe ECX: EDX: 0001 > [ 10.187547] ESI: 12fe EDI: EBP: 8464e539 ESP: 91c55f08 > [ 10.187547] DS: 007b ES: 007b FS: GS: SS: 0068 > [ 10.187547] CR0: 80050033 CR2: CR3: 04792000 CR4: 0690 > [ 10.187547] Call Trace: > [ 10.187547] ? vprintk_func+0x88/0x150 > [ 10.187547] ? test_bitmap_init+0xe7/0xe7 > [ 10.187547] ? test_debug_virtual_init+0x22/0xf0 > [ 10.187547] ? do_one_initcall+0x127/0x3c8 > [ 10.187547] ? kernel_init_freeable+0x217/0x3de > [ 10.187547] ? kernel_init_freeable+0x245/0x3de > [ 10.187547] ? rest_init+0x200/0x200 > [ 10.187547] ? kernel_init+0x17/0x340 > [ 10.187547] ? rest_init+0x200/0x200 > [ 10.187547] ? schedule_tail_wrapper+0x6/0x8 > [ 10.187547] ? rest_init+0x200/0x200 > [ 10.187547] ? 
ret_from_fork+0x19/0x24 > [ 10.187547] Code: c9 ba 01 00 00 00 c7 04 24 00 00 00 00 b8 08 31 17 84 e8 > 30 58 19 00 ff 05 74 c1 1f 84 83 05 20 01 7e 84 01 83 15 24 01 7e 84 00 <0f> > 0b 83 05 30 01 7e 84 01 83 15 34 01 7e 84 00 8d 74 26 00 83 > [ 10.187547] EIP: __phys_addr+0x244/0x2a0 SS:ESP: 0068:91c55f08 > [ 10.205865] ---[ end trace 2cb481e044fc1df9 ]--- > [ 10.206542] Kernel panic - not syncing: Fatal exception > ># HH:MM RESULT > GOOD BAD GOOD_BUT_DIRTY DIRTY_NOT_BAD > git bisect start 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e v4.13 -- > git bisect good 3aea311c1b4002bd730a09530f80f2f2ad3bf495 # 04:45 G 10 > 00 0 genksyms: fix gperf removal conversion > git bisect bad 6d8ef53e8b2fed8b0f91df0c6da7cc92747d934a # 05:12 B 0 > 9 21 0 Merge tag 'f2fs-for-4.14' of > git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs > git bisect bad d719518d9ce9132bad8a06e8029aeead328f66a3 # 0
Re: [PATCH 2/7] gpio: brcmstb: release the bgpio lock during irq handlers
On 10/03/2017 06:55 PM, Gregory Fong wrote: > Hi Doug, > > On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: >> The basic memory-mapped GPIO controller lock must be released >> before calling the registered GPIO interrupt handlers to allow >> the interrupt handlers to access the hardware. Otherwise, the >> hardware accesses will deadlock when they attempt to grab the >> lock. > > I was having some trouble understanding exactly what the problem was > here, but I think I see it now. Since this locks the entire bank, > where some GPIOs might be set as inputs and some as inputs (and > interrupt sources), then an interrupt on a GPIO that is supposed to > set another GPIO in the bank would result in deadlock. Is that > correct? If so, please update the commit message to make that clear, > and nice fix. If not that, it would be nice to know what scenario can > cause a problem. That is an example, but there are really many possibilities. Basically, if a registered interrupt handler wants to access its GPIO you are likely to run into trouble. Another example might be an interrupt that is configured to trigger on either edge transition and the handler wants to know whether the input is currently high or low. I can submit a V2 with a change in the description if you would like, but I'm not sure what the clearest example would be. > > Thanks, > Gregory > Thanks for the feedback, Doug
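To make the scenario concrete, the shape of the fix is roughly the following (schematic: the status read, acking and the irq-domain plumbing are simplified and the names are illustrative):

#include <linux/gpio/driver.h>
#include <linux/irq.h>
#include <linux/irqdomain.h>
#include <linux/io.h>
#include <linux/bitops.h>

static void gpio_bank_irq_handler_sketch(struct gpio_chip *gc,
					 struct irq_domain *domain,
					 void __iomem *status_reg)
{
	unsigned long status, flags;
	unsigned int bit;

	spin_lock_irqsave(&gc->bgpio_lock, flags);
	status = readl(status_reg);		/* mask/ack handling omitted */
	for_each_set_bit(bit, &status, 32) {
		/*
		 * Drop the bgpio lock around the handler: the handler may
		 * itself read or set GPIOs in this bank, which takes the
		 * same lock and would otherwise deadlock.
		 */
		spin_unlock_irqrestore(&gc->bgpio_lock, flags);
		generic_handle_irq(irq_find_mapping(domain, bit));
		spin_lock_irqsave(&gc->bgpio_lock, flags);
	}
	spin_unlock_irqrestore(&gc->bgpio_lock, flags);
}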
Re: [PATCH RFC hack dont apply] intel_idle: support running within a VM
On Tue, Oct 03, 2017 at 11:02:55PM +0200, Thomas Gleixner wrote: > On Mon, 2 Oct 2017, Jacob Pan wrote: > > On Sat, 30 Sep 2017 01:21:43 +0200 > > "Rafael J. Wysocki" wrote: > > > > > On Sat, Sep 30, 2017 at 12:01 AM, Michael S. Tsirkin > > > wrote: > > > > intel idle driver does not DTRT when running within a VM: > > > > when going into a deep power state, the right thing to > > > > do is to exit to hypervisor rather than to keep polling > > > > within guest using mwait. > > > > > > > > Currently the solution is just to exit to hypervisor each time we go > > > > idle - this is why kvm does not expose the mwait leaf to guests even > > > > when it allows guests to do mwait. > > > > > > > > But that's not ideal - it seems better to use the idle driver to > > > > guess when will the next interrupt arrive. > > > > > > The idle driver alone is not sufficient for that, though. > > > > > I second that. Why try to solve this problem at vendor specific driver > > level? perhaps just a pv idle driver that decide whether to vmexit > > based on something like local per vCPU timer expiration? I guess we > > can't predict other wake events such as interrupts. > > e.g. > > if (get_next_timer_interrupt() > kvm_halt_target_residency) > > Bah. no. get_next_timer_interrupt() is not available for abuse in random > cpuidle driver code. It has state and its tied to the nohz code. > > There is the series from Audrey which makes use of the various idle > prediction mechanisms, scheduler, irq timings, idle governor to get an idea > about the estimated idle time. Exactly this information can be fed to the > kvmidle driver which can act accordingly. > > Hacking a random hardware specific idle driver is definitely the wrong > approach. It might be useful to chain the kvmidle driver and hardware > specific drivers at some point, i.e. if the kvmdriver decides not to exit > it delegates the mwait decision to the proper hardware driver in order not > to reimplement all the required logic again. By making changes to idle core to allow that chaining? Does this sound like something reasonable? > But that's a different story. > > See > http://lkml.kernel.org/r/1506756034-6340-1-git-send-email-aubrey...@intel.com Will read that, thanks a lot. > Thanks, > > tglx > > >
Re: [PATCH 5/7] gpio: brcmstb: enable masking of interrupts when changing type
Hi Doug, On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: > Mask the GPIO interrupt while its type is being changed, just in case > it can prevent a spurious interrupt. "Just in case"? I don't have access to hardware documentation for this anymore, but I'd expect to some stronger claim that the hardware actually requires masking before changing the trigger type. If you can quote documentation for this or explain an actual problem seen, that would be good. > > Signed-off-by: Doug Berger > --- > drivers/gpio/gpio-brcmstb.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpio/gpio-brcmstb.c b/drivers/gpio/gpio-brcmstb.c > index 0418cb266586..e2fff559c1ca 100644 > --- a/drivers/gpio/gpio-brcmstb.c > +++ b/drivers/gpio/gpio-brcmstb.c > @@ -363,7 +363,9 @@ static int brcmstb_gpio_irq_setup(struct platform_device > *pdev, > bank->irq_chip.irq_set_type = brcmstb_gpio_irq_set_type; > > /* Ensures that all non-wakeup IRQs are disabled at suspend */ > - bank->irq_chip.flags = IRQCHIP_MASK_ON_SUSPEND; > + /* and that interrupts are masked when changing their type */ This doesn't follow the kernel multi-line comment style, please adjust. > + bank->irq_chip.flags = IRQCHIP_MASK_ON_SUSPEND | > + IRQCHIP_SET_TYPE_MASKED; > > if (IS_ENABLED(CONFIG_PM_SLEEP) && !priv->parent_wake_irq && > of_property_read_bool(np, "wakeup-source")) { > -- > 2.14.1 > Thanks, Gregory
Re: [PATCH 5/7] gpio: brcmstb: enable masking of interrupts when changing type
On 10/03/2017 07:10 PM, Gregory Fong wrote: > Hi Doug, > > On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: >> Mask the GPIO interrupt while its type is being changed, just in case >> it can prevent a spurious interrupt. > > "Just in case"? I don't have access to hardware documentation for > this anymore, but I'd expect to some stronger claim that the hardware > actually requires masking before changing the trigger type. If you > can quote documentation for this or explain an actual problem seen, > that would be good. Well spotted ;). This was a protectionist change added at the request of a user of the driver. I believe that it is superfluous and I suppose that belief leaked through in the language of my comment. In actuality, the GPIO APIs don't make provision for clearing GPIO interrupts before enabling them so this masking is really only a deferral of a spurious interrupt if one is triggered by changes in the hardware programming. I can strike this from the upstream submission and save a couple lines of source code (after IRQCHIP_MASK_ON_SUSPEND goes away) if you prefer. > >> >> Signed-off-by: Doug Berger >> --- >> drivers/gpio/gpio-brcmstb.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpio/gpio-brcmstb.c b/drivers/gpio/gpio-brcmstb.c >> index 0418cb266586..e2fff559c1ca 100644 >> --- a/drivers/gpio/gpio-brcmstb.c >> +++ b/drivers/gpio/gpio-brcmstb.c >> @@ -363,7 +363,9 @@ static int brcmstb_gpio_irq_setup(struct platform_device >> *pdev, >> bank->irq_chip.irq_set_type = brcmstb_gpio_irq_set_type; >> >> /* Ensures that all non-wakeup IRQs are disabled at suspend */ >> - bank->irq_chip.flags = IRQCHIP_MASK_ON_SUSPEND; >> + /* and that interrupts are masked when changing their type */ > > This doesn't follow the kernel multi-line comment style, please adjust. > >> + bank->irq_chip.flags = IRQCHIP_MASK_ON_SUSPEND | >> + IRQCHIP_SET_TYPE_MASKED; >> >> if (IS_ENABLED(CONFIG_PM_SLEEP) && !priv->parent_wake_irq && >> of_property_read_bool(np, "wakeup-source")) { >> -- >> 2.14.1 >> > > Thanks, > Gregory > Thanks again, Doug
Re: e4dace3615 ("lib: add test module for CONFIG_DEBUG_VIRTUAL"): [ 10.185572] kernel BUG at arch/x86/mm/physaddr.c:75!
On Tue, Oct 03, 2017 at 07:09:37PM -0700, Florian Fainelli wrote: Hi Fengguang, On 10/03/2017 07:24 AM, Fengguang Wu wrote: Hi Florian, FYI we see this BUG in latest RC kernel and linux-next. Have it been fixed somewhere else? The purpose of the test module is to make tests for CONFIG_DEBUG_VIRTUAL trigger such warnings/VIRTUAL_BUG_ON, because of how the test is coded, If so, it may help clarify things if there is a printk notice before triggering the BUG? you may want to not load such module. If the module needs to be moved to another directory because lib/* contains mostly functional tests, let me know and I will cook something. Never mind, it looks easy to disable CONFIG_TEST_DEBUG_VIRTUAL. :) Thanks, Fengguang https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master commit e4dace3615526fd66c86dd535ee4bc9e8c706e37 Author: Florian Fainelli AuthorDate: Fri Sep 8 16:15:31 2017 -0700 Commit: Linus Torvalds CommitDate: Fri Sep 8 18:26:49 2017 -0700 lib: add test module for CONFIG_DEBUG_VIRTUAL Add a test module that allows testing that CONFIG_DEBUG_VIRTUAL works correctly, at least that it can catch invalid calls to virt_to_phys() against the non-linear kernel virtual address map. Link: http://lkml.kernel.org/r/20170808164035.26725-1-f.faine...@gmail.com Signed-off-by: Florian Fainelli Cc: "Luis R. Rodriguez" Cc: Kees Cook Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds 9888a588ea lib/hexdump.c: return -EINVAL in case of error in hex2bin() e4dace3615 lib: add test module for CONFIG_DEBUG_VIRTUAL 9e66317d3c Linux 4.14-rc3 1418b85217 Add linux-next specific files for 20170929 +--+++---+---+ | | 9888a588ea | e4dace3615 | v4.14-rc3 | next-20170929 | +--+++---+---+ | boot_successes | 34 | 0 | 0 | 0 | | boot_failures| 0 | 15 | 11 | 11| | kernel_BUG_at_arch/x86/mm/physaddr.c | 0 | 15 | 11 | 11| | invalid_opcode:#[##] | 0 | 15 | 11 | 11| | EIP:__phys_addr | 0 | 15 | 11 | 11| | Kernel_panic-not_syncing:Fatal_exception | 0 | 15 | 11 | 11| +--+++---+---+ [9.837344] test_firmware: interface ready [9.838286] test passed [9.839498] test_printf: all 260 tests passed [ 10.183840] test_bitmap: all 460506 tests passed [ 10.184882] [ cut here ] [ 10.185572] kernel BUG at arch/x86/mm/physaddr.c:75! [ 10.186568] invalid opcode: [#1] [ 10.187104] CPU: 0 PID: 1 Comm: swapper Not tainted 4.13.0-09302-ge4dace3 #1 [ 10.187547] task: 91c45000 task.stack: 91c54000 [ 10.187547] EIP: __phys_addr+0x244/0x2a0 [ 10.187547] EFLAGS: 00210246 CPU: 0 [ 10.187547] EAX: EBX: 92fe ECX: EDX: 0001 [ 10.187547] ESI: 12fe EDI: EBP: 8464e539 ESP: 91c55f08 [ 10.187547] DS: 007b ES: 007b FS: GS: SS: 0068 [ 10.187547] CR0: 80050033 CR2: CR3: 04792000 CR4: 0690 [ 10.187547] Call Trace: [ 10.187547] ? vprintk_func+0x88/0x150 [ 10.187547] ? test_bitmap_init+0xe7/0xe7 [ 10.187547] ? test_debug_virtual_init+0x22/0xf0 [ 10.187547] ? do_one_initcall+0x127/0x3c8 [ 10.187547] ? kernel_init_freeable+0x217/0x3de [ 10.187547] ? kernel_init_freeable+0x245/0x3de [ 10.187547] ? rest_init+0x200/0x200 [ 10.187547] ? kernel_init+0x17/0x340 [ 10.187547] ? rest_init+0x200/0x200 [ 10.187547] ? schedule_tail_wrapper+0x6/0x8 [ 10.187547] ? rest_init+0x200/0x200 [ 10.187547] ? 
ret_from_fork+0x19/0x24 [ 10.187547] Code: c9 ba 01 00 00 00 c7 04 24 00 00 00 00 b8 08 31 17 84 e8 30 58 19 00 ff 05 74 c1 1f 84 83 05 20 01 7e 84 01 83 15 24 01 7e 84 00 <0f> 0b 83 05 30 01 7e 84 01 83 15 34 01 7e 84 00 8d 74 26 00 83 [ 10.187547] EIP: __phys_addr+0x244/0x2a0 SS:ESP: 0068:91c55f08 [ 10.205865] ---[ end trace 2cb481e044fc1df9 ]--- [ 10.206542] Kernel panic - not syncing: Fatal exception # HH:MM RESULT GOOD BAD GOOD_BUT_DIRTY DIRTY_NOT_BAD git bisect start 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e v4.13 -- git bisect good 3aea311c1b4002bd730a09530f80f2f2ad3bf495 # 04:45 G 10 00 0 genksyms: fix gperf removal conversion git bisect bad 6d8ef53e8b2fed8b0f91df0c6da7cc92747d934a # 05:12 B 0 9 21 0 Merge tag 'f2fs-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs git bisect bad d71
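Following up on the printk suggestion above, a sketch of what the test could log before intentionally tripping the check (illustrative only; the real module is lib/test_debug_virtual.c and its exact messages and addresses differ):

#include <linux/module.h>
#include <linux/vmalloc.h>
#include <asm/io.h>

static int __init test_debug_virtual_sketch_init(void)
{
	void *va = vmalloc(PAGE_SIZE);	/* guaranteed to be outside the linear map */
	phys_addr_t pa;

	if (!va)
		return -ENOMEM;

	pr_info("test_debug_virtual: calling virt_to_phys(%p); expect a BUG when CONFIG_DEBUG_VIRTUAL is set\n", va);
	pa = virt_to_phys(va);		/* trips VIRTUAL_BUG_ON() here */
	pr_info("test_debug_virtual: got %pa, so CONFIG_DEBUG_VIRTUAL is probably off\n", &pa);

	vfree(va);
	return 0;
}
module_init(test_debug_virtual_sketch_init);
MODULE_LICENSE("GPL");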
[PATCH v3] sched/deadline: make it configurable
On most small systems, the deadline scheduler class is a luxury that rarely gets used if at all. It is preferable to have the ability to configure it out to reduce the kernel size in that case. Before: $ size -t kernel/sched/built-in.o textdata bss dec hex filename [...] 244353452 108 279956d5b (TOTALS) With CONFIG_SCHED_DL=n: $ size -t kernel/sched/built-in.o textdata bss dec hex filename [...] 183363388 92 218165538 (TOTALS) Signed-off-by: Nicolas Pitre --- Changes from v2: - rebased to v4.14-rc2 Changes from v1: - fix for a compilation error found by kbuild test robot include/linux/sched.h | 2 ++ include/linux/sched/deadline.h | 8 ++- init/Kconfig | 8 +++ kernel/locking/rtmutex.c | 6 ++--- kernel/sched/Makefile | 5 ++-- kernel/sched/core.c| 15 +++- kernel/sched/cpudeadline.h | 7 +- kernel/sched/debug.c | 4 kernel/sched/rt.c | 13 +++ kernel/sched/sched.h | 44 +--- kernel/sched/stop_task.c | 4 11 files changed, 90 insertions(+), 26 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 92fb8dd5a9..00b4bed170 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -565,7 +565,9 @@ struct task_struct { #ifdef CONFIG_CGROUP_SCHED struct task_group *sched_task_group; #endif +#ifdef CONFIG_SCHED_DL struct sched_dl_entity dl; +#endif #ifdef CONFIG_PREEMPT_NOTIFIERS /* List of struct preempt_notifier: */ diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h index 975be862e0..8f191a17dd 100644 --- a/include/linux/sched/deadline.h +++ b/include/linux/sched/deadline.h @@ -13,7 +13,7 @@ static inline int dl_prio(int prio) { - if (unlikely(prio < MAX_DL_PRIO)) + if (IS_ENABLED(CONFIG_SCHED_DL) && unlikely(prio < MAX_DL_PRIO)) return 1; return 0; } @@ -28,4 +28,10 @@ static inline bool dl_time_before(u64 a, u64 b) return (s64)(a - b) < 0; } +#ifdef CONFIG_SCHED_DL +#define dl_deadline(tsk) (tsk)->dl.deadline +#else +#define dl_deadline(tsk) 0 +#endif + #endif /* _LINUX_SCHED_DEADLINE_H */ diff --git a/init/Kconfig b/init/Kconfig index 78cb246101..f252e0dbee 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -959,6 +959,14 @@ config SCHED_AUTOGROUP desktop applications. Task group autogeneration is currently based upon task session. +config SCHED_DL + bool "Deadline Task Scheduling" if EXPERT + default y + help + This adds the sched_dl scheduling class to the kernel providing + support for the SCHED_DEADLINE policy. You might want to disable + this to reduce the kernel size. If unsure say y. + config SYSFS_DEPRECATED bool "Enable deprecated sysfs features to support old userspace tools" depends on SYSFS diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 6f3dba6e4e..12f5eb1953 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -228,7 +228,7 @@ static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock, * Only use with rt_mutex_waiter_{less,equal}() */ #define task_to_waiter(p) \ - &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline } + &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = dl_deadline(p) } static inline int rt_mutex_waiter_less(struct rt_mutex_waiter *left, @@ -680,7 +680,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, * the values of the node being removed. 
*/ waiter->prio = task->prio; - waiter->deadline = task->dl.deadline; + waiter->deadline = dl_deadline(task); rt_mutex_enqueue(lock, waiter); @@ -954,7 +954,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, waiter->task = task; waiter->lock = lock; waiter->prio = task->prio; - waiter->deadline = task->dl.deadline; + waiter->deadline = dl_deadline(task); /* Get the top priority waiter on the lock */ if (rt_mutex_has_waiters(lock)) diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 78f54932ea..0d3baba207 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -16,9 +16,10 @@ CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer endif obj-y += core.o loadavg.o clock.o cputime.o -obj-y += idle_task.o fair.o rt.o deadline.o +obj-y += idle_task.o fair.o rt.o obj-y += wait.o wait_bit.o swait.o completion.o idle.o -obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o +obj-$(CONFIG_SCHED_DL) += deadline.o $(if $(CONFIG_SMP),cpudeadline.o) +obj-$(CONFIG_SMP) += cpupri.o topology.o stop_task.o obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o obj-$(CONFIG_SCHEDSTATS) += stat
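For the record, the size figures above work out to 27995 - 21816 = 6179 bytes, i.e. roughly 6 KB of text+data+bss shaved off kernel/sched/built-in.o on that (evidently minimal) configuration.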
Re: [PATCH] KVM: VMX: check match table
On Tue, Oct 03, 2017 at 11:42:18AM +0200, Paolo Bonzini wrote: > On 01/10/2017 01:22, Nick Desaulniers wrote: > > I don't follow (but I also don't know what any of these three letter > > acryonyms acronyms stand for), does svm depend on vmx or vice-versa? > Neither, one is Intel (VMX), the other is AMD (SVM). Oh, neat, did not realize the vendors did not have different names for their virtualization extensions. https://rtfmp.com/2016/03/21/what-does-vmx-svm-cpu-flags-mean/ > Would this work for you? It doesn't apply cleanly on Linus' tree, or the KVM tree master branch, so I couldn't fully test it. But it does look like it will do the trick. > And again, is this only with clang? Indeed the warning was coming from Clang, but looks like some additional cleanup was done, which is good. Reviewed-by: Nick Desaulniers
Re: [PATCH] KVM: VMX: check match table
- avi I think an extra `did not` made it in to that last email...sorry! On Tue, Oct 3, 2017 at 7:54 PM, Nick Desaulniers wrote: > On Tue, Oct 03, 2017 at 11:42:18AM +0200, Paolo Bonzini wrote: >> On 01/10/2017 01:22, Nick Desaulniers wrote: >> > I don't follow (but I also don't know what any of these three letter >> > acryonyms acronyms stand for), does svm depend on vmx or vice-versa? >> Neither, one is Intel (VMX), the other is AMD (SVM). > > Oh, neat, did not realize the vendors did not have different names for > their virtualization extensions. > > https://rtfmp.com/2016/03/21/what-does-vmx-svm-cpu-flags-mean/ > >> Would this work for you? > > It doesn't apply cleanly on Linus' tree, or the KVM tree master branch, > so I couldn't fully test it. But it does look like it will do the > trick. > >> And again, is this only with clang? > > Indeed the warning was coming from Clang, but looks like some > additional cleanup was done, which is good. > > Reviewed-by: Nick Desaulniers
Re: [PATCH 6/7] gpio: brcmstb: consolidate interrupt domains
Hi Doug, On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: > The GPIOLIB IRQ chip helpers were very appealing, but badly broke > the 1:1 mapping between a GPIO controller's device_node and its > interrupt domain. Out of curiosity, what sort of problems have you seen from this? > > This commit consolidates the per bank irq domains to a version > where we have one larger interrupt domain per GPIO controller > instance spanning multiple GPIO banks. This works (and is reminiscent to my initially submitted implementation at [1]), but I think it might make sense to keep as-is (using the gpiolib irqchip helpers), and instead allocate an irqchip fwnode per bank and use to_of_node() to set it as the of_node for the gpiochip before calling gpiochip_irqchip_add(). OTOH, that capability might go away... Linus, can you comment on the FIXME in gpiochip_irqchip_add_key() that says "get rid of this and use gpiochip->parent->of_node everywhere"? It seems like it would still be beneficial to be able to override the associated node for a gpiochip, since that's what's used for the irqdomain, but if that's going away, obviously we don't want to start using that now. Thanks, Gregory [1] https://patchwork.kernel.org/patch/6347811/
Re: [RFC PATCH 1/2] kbuild: Pass HOSTCC and similar to tools Makefiles
On Tue, Oct 3, 2017 at 1:48 PM, Douglas Anderson wrote: > The main Linux Makefiles and the tools sub-Makefiles have different > conventions for passing in CC / CFLAGS. > > Here's brief summary for the kernel: > * CC: target C compiler (must be passed as an argument to make to > override) > * HOSTCC: host C compiler (must be passed as an argument to make to > override) > * CFLAGS: ignored (kernel manages its own CFLAGS) > * KCFLAGS: extra cflags for the target (expected as an env var) > * HOSTCFLAGS: host C compiler flags (not modifiable by the caller of > make) > > Here's a brief summary for the tools: > * CC: host C compiler (must be passed as an argument to make to > override) > * CFLAGS: base arguments for the host C compiler which are added to by > sub builds (expected as an env var) > > When the main kernel Makefile calls into the tools Makefile, it should > adapt things from its syntax to the tools Makefile syntax. > > If we don't do this, we have a few issues: > * If someone wants to user another compiler (like clang, maybe) for > hostcc then the tools will still be compiled with gcc. > * If you happen to be building and you left ${CFLAGS} set to something > random (maybe it matches ${CC}, the _target_ C compiler, not the > _host_ C compiler) then this random value will be used to compile > the tools. > > In any case, it seems like it makes sense to pass CC, CFLAGS, and > similar properly into the tools Makefile. > > NOTE: in order to do this properly, we need to add some new > definitions of HOSTAS and HOSTLD into the main kernel Makefile. If we > don't do this and someone overrides "AS" or "LD" on the command line > then those (which were intended for the target) will be used to build > host side tools. We also make up some imaginary "HOSTASFLAGS" and > "HOSTLDFLAGS". Someone could specify these, but if they don't at > least these blank variables will properly clobber ASFLAGS and LDFLAGS > from the kernel. > > This was discovered in the Chrome OS build system where CFLAGS (an > environment variable) was accidentally left pointing to flags that > would be an appropriate match to CC. The kernel didn't use these > CFLAGS so it was never an issue. Turning on STACKVALIDATION, however, > suddenly invoked a tools build that blew up. > > Signed-off-by: Douglas Anderson Still not sure if I'd want to go that far. Just clearing out CFLAGS would have fixed my problem, and I start to realize that I am more of a minimalist when it comes to kernel changes. Anyway, not that it means much, Reviewed-by: Guenter Roeck Tested-by: Guenter Roeck [ I didn't test all possible flag combinations, only a few of them, but I did make sure that "make allmodconfig; CFLAGS=-junk make" no longer fails. 
] Guenter > --- > > Makefile | 30 -- > 1 file changed, 28 insertions(+), 2 deletions(-) > > diff --git a/Makefile b/Makefile > index cf007a31d575..0d3af0677d88 100644 > --- a/Makefile > +++ b/Makefile > @@ -298,7 +298,9 @@ HOST_LFS_CFLAGS := $(shell getconf LFS_CFLAGS) > HOST_LFS_LDFLAGS := $(shell getconf LFS_LDFLAGS) > HOST_LFS_LIBS := $(shell getconf LFS_LIBS) > > +HOSTAS = as > HOSTCC = gcc > +HOSTLD = ld > HOSTCXX = g++ > HOSTCFLAGS := -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 \ > -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS) > @@ -1616,11 +1618,35 @@ image_name: > # Clear a bunch of variables before executing the submake > tools/: FORCE > $(Q)mkdir -p $(objtree)/tools > - $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% > -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/ > + $(Q)ASFLAGS="$(HOSTASFLAGS)"\ > + LDFLAGS="$(HOSTLDFLAGS)" \ > + CFLAGS="$(HOSTCFLAGS)" \ > + CXXFLAGS="$(HOSTCXXFLAGS)" \ > + $(MAKE) \ > + AS="$(HOSTAS)" \ > + CC="$(HOSTCC)" \ > + CXX="$(HOSTCXX)" \ > + LD="$(HOSTLD)" \ > + MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \ > + O=$(abspath $(objtree)) \ > + subdir=tools \ > + -C $(src)/tools/ > > tools/%: FORCE > $(Q)mkdir -p $(objtree)/tools > - $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% > -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/ $* > + $(Q)ASFLAGS="$(HOSTASFLAGS)"\ > + LDFLAGS="$(HOSTLDFLAGS)" \ > + CFLAGS="$(HOSTCFLAGS)" \ > + CXXFLAGS="$(HOSTCXXFLAGS)" \ > + $(MAKE) \ > + AS="$(HOSTAS)" \ > + CC="$(HOSTCC)" \ > + CXX="$(HOSTCXX)" \ > + LD="$(HOSTLD)" \ > + MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \ > + O=$(abspath $(objtree)) \ > + subdir=tools \ > + -C $(src)/tools/ $* > > # Single targets >
Re: [RFC PATCH 2/2] tools build: Take CC, AS, and LD from the command line
On Tue, Oct 3, 2017 at 1:48 PM, Douglas Anderson wrote: > If the top-level tools build is called to build a tool and is passed > CC, AS, or LD on the command line then someone wants to use a > different compiler, assembler, or linker. Let's honor that. > > This together with the change ("kbuild: Pass HOSTCC and similar to > tools Makefiles") allows us to properly get the HOST C Compiler used > when the main kernel Makefile calls into the tools. > > Signed-off-by: Douglas Anderson > --- > > tools/scripts/Makefile.include | 22 -- > 1 file changed, 20 insertions(+), 2 deletions(-) > > diff --git a/tools/scripts/Makefile.include b/tools/scripts/Makefile.include > index 9dc8f078a83c..00261848ec54 100644 > --- a/tools/scripts/Makefile.include > +++ b/tools/scripts/Makefile.include > @@ -10,6 +10,20 @@ endif > endif > endif > > +SUBMAKE_ARGS := > + > +# If CC, LD, or AR were passed on the command line, pass them through to the > +# command line of the sub-tools. > +ifeq ($(origin CC), command line) > + SUBMAKE_ARGS += "CC=$(CC)" > +endif > +ifeq ($(origin LD), command line) > + SUBMAKE_ARGS += "LD=$(LD)" > +endif > +ifeq ($(origin AR), command line) > + SUBMAKE_ARGS += "AR=$(ar)" s/ar/AR/ ? > +endif > + > # check that the output directory actually exists > ifneq ($(OUTPUT),) > OUTDIR := $(realpath $(OUTPUT)) > @@ -71,7 +85,9 @@ endif > # > descend = \ > +mkdir -p $(OUTPUT)$(1) && \ > - $(MAKE) $(COMMAND_O) subdir=$(if $(subdir),$(subdir)/$(1),$(1)) > $(PRINT_DIR) -C $(1) $(2) > + $(MAKE) $(SUBMAKE_ARGS) $(COMMAND_O) \ > + subdir=$(if $(subdir),$(subdir)/$(1),$(1)) \ > + $(PRINT_DIR) -C $(1) $(2) > > QUIET_SUBDIR0 = +$(MAKE) $(COMMAND_O) -C # space to separate -C and subdir > QUIET_SUBDIR1 = > @@ -94,7 +110,9 @@ ifneq ($(silent),1) > descend = \ > +@echo ' DESCEND '$(1); \ > mkdir -p $(OUTPUT)$(1) && \ > - $(MAKE) $(COMMAND_O) subdir=$(if > $(subdir),$(subdir)/$(1),$(1)) $(PRINT_DIR) -C $(1) $(2) > + $(MAKE) $(SUBMAKE_ARGS) $(COMMAND_O) \ > + subdir=$(if $(subdir),$(subdir)/$(1),$(1)) \ > + $(PRINT_DIR) -C $(1) $(2) > > QUIET_CLEAN= @printf ' CLEAN%s\n' $1; > QUIET_INSTALL = @printf ' INSTALL %s\n' $1; > -- > 2.14.2.920.gcf0c67979c-goog >
[PATCH v2] tools/kvm_stat: Add Python 3 support to kvm_stat
Make kvm_stat support Python 3 by changing the use of "print" to a function rather than a statement, switching from "iteritems" and "iterkeys" (removed in Python 3) to "items" and "keys" respectively, and decoding bytes to strings when dealing with text. With this change, kvm_stat is usable with Python 2.6 and greater. Signed-off-by: Jeremy Cline --- There were a few flaws in my original patch that changed the behavior in Python 2. Additionally, I missed several compatibility problems with Python 3. Sorry! Changes in v2: - Include the "end=' '" argument in print statements that previously had trailing commas since in Python 2 that causes the print statement to not print a newline. - decode bytes strings to text strings so that string concatenation works properly in Python 3. - replace "iterkeys" with "keys" as "iterkeys" is not in Python 3. tools/kvm/kvm_stat/kvm_stat | 30 +- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/tools/kvm/kvm_stat/kvm_stat b/tools/kvm/kvm_stat/kvm_stat index 32283d88701a..217cf6f95c36 100755 --- a/tools/kvm/kvm_stat/kvm_stat +++ b/tools/kvm/kvm_stat/kvm_stat @@ -19,9 +19,11 @@ Three different ways of output formatting are available: The data is sampled from the KVM's debugfs entries and its perf events. """ +from __future__ import print_function import curses import sys +import locale import os import time import optparse @@ -225,6 +227,8 @@ IOCTL_NUMBERS = { 'RESET': 0x2403, } +ENCODING = locale.getpreferredencoding(False) + class Arch(object): """Encapsulates global architecture specific data. @@ -666,7 +670,7 @@ class TracepointProvider(Provider): """Returns 'event name: current value' for all enabled events.""" ret = defaultdict(int) for group in self.group_leaders: -for name, val in group.read().iteritems(): +for name, val in group.read().items(): if name in self._fields: ret[name] += val return ret @@ -955,7 +959,7 @@ class Tui(object): except: raise Exception for line in child.stdout: -line = line.lstrip().split(' ', 1) +line = line.decode(ENCODING).lstrip().split(' ', 1) # perform a sanity check before calling the more expensive # function to possibly extract the guest name if ' -name ' in line[1]: @@ -1005,7 +1009,7 @@ class Tui(object): name = '' try: line = open('/proc/{}/cmdline' -.format(pid), 'rb').read().split('\0') +.format(pid), 'r').read().split('\0') parms = line[line.index('-name') + 1].split(',') while '' in parms: # commas are escaped (i.e. ',,'), hence e.g. 
'foo,bar' results @@ -1170,7 +1174,7 @@ class Tui(object): .format(self.stats.fields_filter)) self.screen.addstr(3, 0, "New regex: ") curses.echo() -regex = self.screen.getstr() +regex = self.screen.getstr().decode(ENCODING) curses.noecho() if len(regex) == 0: self.stats.fields_filter = DEFAULT_REGEX @@ -1204,7 +1208,7 @@ class Tui(object): curses.echo() self.screen.addstr(3, 0, "Pid [0 or pid]: ") -pid = self.screen.getstr() +pid = self.screen.getstr().decode(ENCODING) curses.noecho() try: @@ -1233,7 +1237,7 @@ class Tui(object): self.screen.addstr(2, 0, 'Change delay from %.1fs to ' % self._delay_regular) curses.echo() -val = self.screen.getstr() +val = self.screen.getstr().decode(ENCODING) curses.noecho() try: @@ -1273,7 +1277,7 @@ class Tui(object): self.print_all_gnames(7) curses.echo() self.screen.addstr(3, 0, "Guest [ENTER or guest]: ") -gname = self.screen.getstr() +gname = self.screen.getstr().decode(ENCODING) curses.noecho() if not gname: @@ -1369,25 +1373,25 @@ def batch(stats): s = stats.get() for key in sorted(s.keys()): values = s[key] -print '%-42s%10d%10d' % (key, values[0], values[1]) +print('%-42s%10d%10d' % (key, values[0], values[1])) except KeyboardInterrupt: pass def log(stats): """Prints statistics as reiterating key block, multiple value blocks.""" -keys = sorted(stats.get().iterkeys()) +keys = sorted(stats.get().keys()) def banner(): for k in keys: -print '%s' % k, -print +print(k, end=' ') +print() def statline(): s = stats.get() for k in keys: -print ' %9d' % s[k][1], -print +
Re: [PATCH 2/7] gpio: brcmstb: release the bgpio lock during irq handlers
On Tue, Oct 3, 2017 at 7:09 PM, Doug Berger wrote: > On 10/03/2017 06:55 PM, Gregory Fong wrote: >> Hi Doug, >> >> On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: >>> The basic memory-mapped GPIO controller lock must be released >>> before calling the registered GPIO interrupt handlers to allow >>> the interrupt handlers to access the hardware. Otherwise, the >>> hardware accesses will deadlock when they attempt to grab the >>> lock. >> >> I was having some trouble understanding exactly what the problem was >> here, but I think I see it now. Since this locks the entire bank, >> where some GPIOs might be set as inputs and some as inputs (and >> interrupt sources), then an interrupt on a GPIO that is supposed to >> set another GPIO in the bank would result in deadlock. Is that >> correct? If so, please update the commit message to make that clear, >> and nice fix. If not that, it would be nice to know what scenario can >> cause a problem. > > That is an example, but there are really many possibilities. > > Basically, if a registered interrupt handler wants to access its GPIO > you are likely to run into trouble. Another example might be an > interrupt that is configured to trigger on either edge transition and > the handler wants to know whether the input is currently high or low. > > I can submit a V2 with a change in the description if you would like, > but I'm not sure what the clearest example would be. If you could just mention both of these possible cases, that would be great! With that change, Acked-by: Gregory Fong
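As an aside, the fix being discussed, latching and acknowledging the pending status under the bgpio spinlock and only then dispatching the per-GPIO handlers, can be sketched roughly as below. This is illustrative only, not the actual gpio-brcmstb code: the struct, the helper, and the assumption that the chip uses the generic-GPIO (bgpio) lock are all hypothetical.

#include <linux/bitops.h>
#include <linux/gpio/driver.h>
#include <linux/irq.h>
#include <linux/irqdesc.h>
#include <linux/irqdomain.h>
#include <linux/spinlock.h>

struct example_bank {                   /* hypothetical driver state */
        struct gpio_chip gc;
        struct irq_domain *domain;
        int width;
};

/* stands in for reading the bank's pending-status register and acking it */
static unsigned long read_and_ack_pending(struct example_bank *bank)
{
        return 0;       /* hardware access elided in this sketch */
}

static void example_bank_irq_handler(struct irq_desc *desc)
{
        struct example_bank *bank = irq_desc_get_handler_data(desc);
        unsigned long status, flags;
        int bit;

        spin_lock_irqsave(&bank->gc.bgpio_lock, flags);
        status = read_and_ack_pending(bank);
        spin_unlock_irqrestore(&bank->gc.bgpio_lock, flags);

        /*
         * The lock is dropped before the handlers run, so a handler that
         * reads or writes a GPIO in the same bank can take the bgpio lock
         * itself without deadlocking.
         */
        for_each_set_bit(bit, &status, bank->width)
                generic_handle_irq(irq_find_mapping(bank->domain, bit));
}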
Re: [PATCH 4.13 103/110] platform/x86: fujitsu-laptop: Dont oops when FUJ02E3 is not presnt
On Tue, Oct 03, 2017 at 05:27:25PM -0700, Darren Hart wrote: > On Wed, Oct 04, 2017 at 08:39:34AM +1030, Jonathan Woithe wrote: > > : > > Reviewed-by: Jonathan Woithe > > Thanks for the due diligence here Jonathan, but for stable backport > announcements, you don't need to speak up unless you have an objection. No worries; thanks for the info. jonathan
Re: [PATCH 5/7] gpio: brcmstb: enable masking of interrupts when changing type
On Tue, Oct 3, 2017 at 7:22 PM, Doug Berger wrote: > On 10/03/2017 07:10 PM, Gregory Fong wrote: >> Hi Doug, >> >> On Fri, Sep 29, 2017 at 8:40 PM, Doug Berger wrote: >>> Mask the GPIO interrupt while its type is being changed, just in case >>> it can prevent a spurious interrupt. >> >> "Just in case"? I don't have access to hardware documentation for >> this anymore, but I'd expect to some stronger claim that the hardware >> actually requires masking before changing the trigger type. If you >> can quote documentation for this or explain an actual problem seen, >> that would be good. > > Well spotted ;). This was a protectionist change added at the request > of a user of the driver. I believe that it is superfluous and I suppose > that belief leaked through in the language of my comment. > > In actuality, the GPIO APIs don't make provision for clearing GPIO > interrupts before enabling them so this masking is really only a > deferral of a spurious interrupt if one is triggered by changes in the > hardware programming. Indeed. If the hardware is behaving in unexpected ways that would be a whole different kind of problem. > > I can strike this from the upstream submission and save a couple lines > of source code (after IRQCHIP_MASK_ON_SUSPEND goes away) if you prefer. > Yes, please remove. Thanks, Gregory
Re: [RFC PATCH 1/2] kbuild: Pass HOSTCC and similar to tools Makefiles
2017-10-04 5:48 GMT+09:00 Douglas Anderson : > The main Linux Makefiles and the tools sub-Makefiles have different > conventions for passing in CC / CFLAGS. > > Here's brief summary for the kernel: > * CC: target C compiler (must be passed as an argument to make to > override) > * HOSTCC: host C compiler (must be passed as an argument to make to > override) > * CFLAGS: ignored (kernel manages its own CFLAGS) > * KCFLAGS: extra cflags for the target (expected as an env var) > * HOSTCFLAGS: host C compiler flags (not modifiable by the caller of > make) > > Here's a brief summary for the tools: > * CC: host C compiler (must be passed as an argument to make to > override) > * CFLAGS: base arguments for the host C compiler which are added to by > sub builds (expected as an env var) > > When the main kernel Makefile calls into the tools Makefile, it should > adapt things from its syntax to the tools Makefile syntax. > > If we don't do this, we have a few issues: > * If someone wants to user another compiler (like clang, maybe) for > hostcc then the tools will still be compiled with gcc. > * If you happen to be building and you left ${CFLAGS} set to something > random (maybe it matches ${CC}, the _target_ C compiler, not the > _host_ C compiler) then this random value will be used to compile > the tools. > > In any case, it seems like it makes sense to pass CC, CFLAGS, and > similar properly into the tools Makefile. > > NOTE: in order to do this properly, we need to add some new > definitions of HOSTAS and HOSTLD into the main kernel Makefile. If we > don't do this and someone overrides "AS" or "LD" on the command line > then those (which were intended for the target) will be used to build > host side tools. We also make up some imaginary "HOSTASFLAGS" and > "HOSTLDFLAGS". Someone could specify these, but if they don't at > least these blank variables will properly clobber ASFLAGS and LDFLAGS > from the kernel. > > This was discovered in the Chrome OS build system where CFLAGS (an > environment variable) was accidentally left pointing to flags that > would be an appropriate match to CC. The kernel didn't use these > CFLAGS so it was never an issue. Turning on STACKVALIDATION, however, > suddenly invoked a tools build that blew up. 
> > Signed-off-by: Douglas Anderson > --- > > Makefile | 30 -- > 1 file changed, 28 insertions(+), 2 deletions(-) > > diff --git a/Makefile b/Makefile > index cf007a31d575..0d3af0677d88 100644 > --- a/Makefile > +++ b/Makefile > @@ -298,7 +298,9 @@ HOST_LFS_CFLAGS := $(shell getconf LFS_CFLAGS) > HOST_LFS_LDFLAGS := $(shell getconf LFS_LDFLAGS) > HOST_LFS_LIBS := $(shell getconf LFS_LIBS) > > +HOSTAS = as > HOSTCC = gcc > +HOSTLD = ld > HOSTCXX = g++ > HOSTCFLAGS := -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 \ > -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS) > @@ -1616,11 +1618,35 @@ image_name: > # Clear a bunch of variables before executing the submake > tools/: FORCE > $(Q)mkdir -p $(objtree)/tools > - $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% > -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/ > + $(Q)ASFLAGS="$(HOSTASFLAGS)"\ > + LDFLAGS="$(HOSTLDFLAGS)" \ > + CFLAGS="$(HOSTCFLAGS)" \ > + CXXFLAGS="$(HOSTCXXFLAGS)" \ > + $(MAKE) \ > + AS="$(HOSTAS)" \ > + CC="$(HOSTCC)" \ > + CXX="$(HOSTCXX)" \ > + LD="$(HOSTLD)" \ > + MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \ > + O=$(abspath $(objtree)) \ > + subdir=tools \ > + -C $(src)/tools/ > > tools/%: FORCE > $(Q)mkdir -p $(objtree)/tools > - $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% > -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/ $* > + $(Q)ASFLAGS="$(HOSTASFLAGS)"\ > + LDFLAGS="$(HOSTLDFLAGS)" \ > + CFLAGS="$(HOSTCFLAGS)" \ > + CXXFLAGS="$(HOSTCXXFLAGS)" \ > + $(MAKE) \ > + AS="$(HOSTAS)" \ > + CC="$(HOSTCC)" \ > + CXX="$(HOSTCXX)" \ > + LD="$(HOSTLD)" \ > + MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \ > + O=$(abspath $(objtree)) \ > + subdir=tools \ > + -C $(src)/tools/ $* > Please do not do this. CC: for building tools that run on the target machine HOSTCC: for building tools that run on the build machine IMHO, this should be consistent to avoid confusion. Grepping CROSS_COMPILE under tools/ directory, I see many Makefile expect CC for the target system. For ex, tools/croup/Makefile CC = $(CROSS_COMPILE)gcc Why don't you fix tools/objtool/Makefile ? -- Best Regards Masahiro Yamada
Re: [PATCH 1/8] ARM: dts: aspeed: Move pinctrl subnodes to improve readability
On Thu, 2017-09-28 at 17:21 +0930, Joel Stanley wrote: > > From: Andrew Jeffery > > Moving the subnodes out of the pinctrl node declaration to a reference > allows easier access to the remaining parts of the devicetree. > > > Signed-off-by: Andrew Jeffery > > Reviewed-by: Xo Wang > > Signed-off-by: Joel Stanley > --- > arch/arm/boot/dts/aspeed-g4.dtsi | 1483 ++-- > arch/arm/boot/dts/aspeed-g5.dtsi | 1549 > +++--- > 2 files changed, 1518 insertions(+), 1514 deletions(-) I hate the way the diff came out for the g5, it's almost impossible to read. Also not sure how useful it is to ack my own patch, but given the rebasing it has suffered: Acked-by: Andrew Jeffery Cheers, Andrew
Re: [PATCH 2/8] ARM: dts: aspeed: Reorder ADC node
On Thu, 2017-09-28 at 17:21 +0930, Joel Stanley wrote: > We try to keep the nodes in address order. The ADC node was out of > place. > > Signed-off-by: Joel Stanley > --- > arch/arm/boot/dts/aspeed-g4.dtsi | 16 > arch/arm/boot/dts/aspeed-g5.dtsi | 16 > 2 files changed, 16 insertions(+), 16 deletions(-) > > diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi > b/arch/arm/boot/dts/aspeed-g4.dtsi > index 1edd0cee6221..a4579498fc25 100644 > --- a/arch/arm/boot/dts/aspeed-g4.dtsi > +++ b/arch/arm/boot/dts/aspeed-g4.dtsi > @@ -129,6 +129,14 @@ > }; > }; > > + adc: adc@1e6e9000 { > + compatible = "aspeed,ast2400-adc"; > + reg = <0x1e6e9000 0xb0>; > + clocks = <&syscon ASPEED_CLK_APB>; We can't do this yet as the clk driver isn't yet merged, and it breaks from the "just move the node" description in the commit message. > + #io-channel-cells = <1>; > + status = "disabled"; > + }; > + > sram@1e72 { > compatible = "mmio-sram"; > reg = <0x1e72 0x8000>; // 32K > @@ -227,14 +235,6 @@ > no-loopback-test; > status = "disabled"; > }; > - > - adc: adc@1e6e9000 { > - compatible = "aspeed,ast2400-adc"; > - reg = <0x1e6e9000 0xb0>; > - clocks = <&clk_apb>; > - #io-channel-cells = <1>; > - status = "disabled"; > - }; > }; > }; > }; > diff --git a/arch/arm/boot/dts/aspeed-g5.dtsi > b/arch/arm/boot/dts/aspeed-g5.dtsi > index f56dd67efa50..f6430b313f90 100644 > --- a/arch/arm/boot/dts/aspeed-g5.dtsi > +++ b/arch/arm/boot/dts/aspeed-g5.dtsi > @@ -173,6 +173,14 @@ > reg-io-width = <4>; > }; > > + adc: adc@1e6e9000 { > + compatible = "aspeed,ast2500-adc"; > + reg = <0x1e6e9000 0xb0>; Did you intend to change the size cell value here? It now matches the g4, but there was an explicit comment about the size for some reason in the -hunk below. It's probably worth an explicit call-out if we're going to change it. > + clocks = <&syscon ASPEED_CLK_APB>; See the clk comment on the g4 diff. Cheers, Andrew > + #io-channel-cells = <1>; > + status = "disabled"; > + }; > + > sram@1e72 { > compatible = "mmio-sram"; > reg = <0x1e72 0x9000>; // 36K > @@ -307,14 +315,6 @@ > no-loopback-test; > status = "disabled"; > }; > - > - adc: adc@1e6e9000 { > - compatible = "aspeed,ast2500-adc"; > - reg = <0x1e6e9000 0xb0>; > - clocks = <&clk_apb>; > - #io-channel-cells = <1>; > - status = "disabled"; > - }; > }; > }; > }; signature.asc Description: This is a digitally signed message part
[PATCH] kconfig: Document the 'menu' struct
Understanding what it represents helps a lot when reading the code, and it's not obvious, so document it. The ROOT_MENU flag is only set and tested by the gconf and qconf front ends, so leave it undocumented here. The obvious guess for what it means is correct. Signed-off-by: Ulf Magnusson --- scripts/kconfig/expr.h | 45 + 1 file changed, 45 insertions(+) diff --git a/scripts/kconfig/expr.h b/scripts/kconfig/expr.h index a73f762..ae7583c 100644 --- a/scripts/kconfig/expr.h +++ b/scripts/kconfig/expr.h @@ -166,22 +166,67 @@ struct property { for (st = sym->prop; st; st = st->next) \ if (st->text) +/* + * Represents a node in the menu tree, as seen in e.g. menuconfig (though used + * for all front ends). Each symbol, menu, etc. defined in the Kconfig files + * gets a node. A symbol defined in multiple locations gets one node at each + * location. + */ struct menu { + /* The next menu node at the same level */ struct menu *next; + + /* The parent menu node, corresponding to e.g. a menu or choice */ struct menu *parent; + + /* The first child menu node, for e.g. menus and choices */ struct menu *list; + + /* +* The symbol associated with the menu node. Choices are implemented as +* a special kind of symbol. NULL for menus, comments, and ifs. +*/ struct symbol *sym; + + /* +* The prompt associated with the node. This holds the prompt for a +* symbol as well as the text for a menu or comment, along with the +* type (P_PROMPT, P_MENU, etc.) +*/ struct property *prompt; + + /* +* 'visible if' dependencies. If more than one is given, they will be +* ANDed together. +*/ struct expr *visibility; + + /* +* Ordinary dependencies from e.g. 'depends on' and 'if', ANDed +* together +*/ struct expr *dep; + + /* MENU_* flags */ unsigned int flags; + + /* Any help text associated with the node */ char *help; + + /* The location where the menu node appears in the Kconfig files */ struct file *file; int lineno; + + /* For use by front ends that need to store auxiliary data */ void *data; }; +/* + * Set on a menu node when the corresponding symbol changes state in some way. + * Can be checked by front ends. + */ #define MENU_CHANGED 0x0001 + #define MENU_ROOT 0x0002 struct jump_key { -- 2.7.4
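As a usage illustration of the fields documented above, here is a minimal sketch of the depth-first traversal a front end can perform over the menu tree. It assumes kconfig's internal "lkc.h" header (which the real front ends include) for struct menu, struct property, rootmenu and conf_parse(); the printing is only an example.

#include <stdio.h>
#include "lkc.h"        /* kconfig internal header: struct menu, rootmenu */

/* Walk the tree via ->list (first child) and ->next (next sibling),
 * printing the prompt text of every node that has one. */
static void walk_menu(struct menu *menu, int depth)
{
        struct menu *child;

        for (child = menu->list; child; child = child->next) {
                if (child->prompt && child->prompt->text)
                        printf("%*s%s\n", 2 * depth, "", child->prompt->text);
                walk_menu(child, depth + 1);
        }
}

/* e.g. after conf_parse("Kconfig"): walk_menu(&rootmenu, 0); */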
[PATCH v9 20/29] x86/insn-eval: Add wrapper function for 32 and 64-bit addresses
The function insn_get_addr_ref() is capable of handling only 64-bit addresses. A previous commit introduced a function to handle 32-bit addresses. Invoke these two functions from a third wrapper function that calls the appropriate routine based on the address size specified in the instruction structure (obtained by looking at the code segment default address size and the address override prefix, if present). While doing this, rename the original function insn_get_addr_ref() with the more appropriate name get_addr_ref_64(), ensure it is only used for 64-bit addresses. Also, since 64-bit addresses are not possible in 32-bit builds, provide a dummy function such case. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/lib/insn-eval.c | 57 +++- 1 file changed, 52 insertions(+), 5 deletions(-) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 945c9b7..1d510a6 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -872,12 +872,28 @@ static void __user *get_addr_ref_32(struct insn *insn, struct pt_regs *regs) return (void __user *)linear_addr; } -/* - * return the address being referenced be instruction - * for rm=3 returning the content of the rm reg - * for rm!=3 calculates the address using SIB and Disp +/** + * get_addr_ref_64() - Obtain a 64-bit linear address + * @insn: Instruction struct with ModRM and SIB bytes and displacement + * @regs: Structure with register values as seen when entering kernel mode + * + * This function is to be used with 64-bit address encodings to obtain the + * linear memory address referred by the instruction's ModRM, SIB, + * displacement bytes and segment base address, as applicable. + * + * Returns: + * + * Linear address referenced by instruction and registers on success. + * + * -1L on error. */ -void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) +#ifndef CONFIG_X86_64 +static void __user *get_addr_ref_64(struct insn *insn, struct pt_regs *regs) +{ + return (void __user *)-1L; +} +#else +static void __user *get_addr_ref_64(struct insn *insn, struct pt_regs *regs) { int addr_offset, base_offset, indx_offset, seg_reg_indx; unsigned long linear_addr = -1L, seg_base_addr; @@ -888,6 +904,9 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) insn_get_sib(insn); sib = insn->sib.value; + if (insn->addr_bytes != 8) + goto out; + if (X86_MODRM_MOD(insn->modrm.value) == 3) { addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); if (addr_offset < 0) @@ -971,3 +990,31 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) out: return (void __user *)linear_addr; } +#endif /* CONFIG_X86_64 */ + +/** + * insn_get_addr_ref() - Obtain the linear address referred by instruction + * @insn: Instruction structure containing ModRM byte and displacement + * @regs: Structure with register values as seen when entering kernel mode + * + * Obtain the linear address referred by the instruction's ModRM, SIB and + * displacement bytes, and segment base, as applicable. In protected mode, + * segment limits are enforced. + * + * Returns: + * + * Linear address referenced by instruction and registers on success. + * + * -1L on error. 
+ */ +void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) +{ + switch (insn->addr_bytes) { + case 4: + return get_addr_ref_32(insn, regs); + case 8: + return get_addr_ref_64(insn, regs); + default: + return (void __user *)-1L; + } +} -- 2.7.4
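For context, a rough sketch of how a caller would use this wrapper, in the spirit of the rest of this series. The function name is illustrative, the declaration of insn_get_addr_ref() is assumed to come from this series' insn-eval header, and copying the faulting instruction bytes from user space plus the error handling are omitted.

#include <linux/compiler.h>
#include <asm/insn.h>       /* struct insn, insn_init(), insn_get_length() */
#include <asm/ptrace.h>     /* struct pt_regs, user_64bit_mode() */

/* assumed to be provided by this series' insn-eval header */
void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);

/* buf[] is assumed to already hold the faulting instruction bytes
 * copied from user space */
static void __user *example_resolve_operand(struct pt_regs *regs,
                                            unsigned char *buf, int nbytes)
{
        struct insn insn;

        insn_init(&insn, buf, nbytes, user_64bit_mode(regs));
        insn_get_length(&insn);     /* decodes prefixes, ModRM, SIB, disp */

        /* dispatches to get_addr_ref_32() or get_addr_ref_64() based on
         * insn.addr_bytes; returns (void __user *)-1L on error */
        return insn_get_addr_ref(&insn, regs);
}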
[PATCH v9 03/29] ptrace,x86: Make user_64bit_mode() available to 32-bit builds
In its current form, user_64bit_mode() can only be used when CONFIG_X86_64 is selected. This implies that code built with CONFIG_X86_64=n cannot use it. If a piece of code needs to be built for both CONFIG_X86_64=y and CONFIG_X86_64=n and wants to use this function, it needs to wrap it in an #ifdef/#endif; potentially, in multiple places. This can be easily avoided with a single #ifdef/#endif pair within user_64bit_mode() itself. Suggested-by: Borislav Petkov Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Reviewed-by: Borislav Petkov Signed-off-by: Ricardo Neri --- arch/x86/include/asm/ptrace.h | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 91c04c8..e2afbf6 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -135,9 +135,9 @@ static inline int v8086_mode(struct pt_regs *regs) #endif } -#ifdef CONFIG_X86_64 static inline bool user_64bit_mode(struct pt_regs *regs) { +#ifdef CONFIG_X86_64 #ifndef CONFIG_PARAVIRT /* * On non-paravirt systems, this is the only long mode CPL 3 @@ -148,8 +148,12 @@ static inline bool user_64bit_mode(struct pt_regs *regs) /* Headers are too twisted for this to go in paravirt.h. */ return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs; #endif +#else /* !CONFIG_X86_64 */ + return false; +#endif } +#ifdef CONFIG_X86_64 #define current_user_stack_pointer() current_pt_regs()->sp #define compat_user_stack_pointer()current_pt_regs()->sp #endif -- 2.7.4
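A short sketch of what the change buys callers: common code that must build for both CONFIG_X86_64=y and =n can now query the mode directly instead of wrapping the call in #ifdefs. The helper name below is illustrative.

#include <linux/types.h>
#include <asm/ptrace.h>

/* Before this change, this helper needed #ifdef CONFIG_X86_64 around the
 * user_64bit_mode() call; now a 32-bit build simply gets 'false'. */
static bool example_uses_64bit_addressing(struct pt_regs *regs)
{
        return user_64bit_mode(regs);
}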
[PATCH v9 19/29] x86/insn-eval: Add support to resolve 32-bit address encodings
32-bit and 64-bit address encodings are identical. Thus, the same logic could be used to resolve the effective address. However, there are two key differences: address size and enforcement of segment limits. If running a 32-bit process on a 64-bit kernel, it is best to perform the address calculation using 32-bit data types. In this manner hardware is used for the arithmetic, including handling of signs and overflows. 32-bit addresses are generally used in protected mode; segment limits are enforced in this mode. This implementation obtains the limit of the segment associated with the instruction operands and prefixes. If the computed address is outside the segment limits, an error is returned. It is also possible to use 32-bit address in long mode and virtual-8086 mode by using an address override prefix. In such cases, segment limits are not enforced. The new function get_addr_ref_32() is almost identical to the existing function insn_get_addr_ref() (used for 64-bit addresses); except for the differences mentioned above. For the sake of simplicity and readability, it is better to use two separate functions. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/lib/insn-eval.c | 160 +++ 1 file changed, 160 insertions(+) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index b3aa891..945c9b7 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -712,6 +712,166 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs) return get_reg_offset(insn, regs, REG_TYPE_RM); } +/** + * get_addr_ref_32() - Obtain a 32-bit linear address + * @insn: Instruction with ModRM, SIB bytes and displacement + * @regs: Register values as seen when entering kernel mode + * + * This function is to be used with 32-bit address encodings to obtain the + * linear memory address referred by the instruction's ModRM, SIB, + * displacement bytes and segment base address, as applicable. If in protected + * mode, segment limits are enforced. + * + * Returns: + * + * Linear address referenced by instruction and registers on success. + * + * -1L on error. + */ +static void __user *get_addr_ref_32(struct insn *insn, struct pt_regs *regs) +{ + int eff_addr, base, indx, addr_offset, base_offset, indx_offset; + unsigned long linear_addr = -1L, seg_base_addr, seg_limit, tmp; + int seg_reg_indx; + insn_byte_t sib; + + insn_get_modrm(insn); + insn_get_sib(insn); + sib = insn->sib.value; + + if (insn->addr_bytes != 4) + goto out; + + if (X86_MODRM_MOD(insn->modrm.value) == 3) { + addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); + if (addr_offset < 0) + goto out; + + tmp = regs_get_register(regs, addr_offset); + /* The 4 most significant bytes must be zero. */ + if (tmp & ~0xL) + goto out; + + eff_addr = (int)(tmp & 0x); + + seg_reg_indx = resolve_seg_reg(insn, regs, addr_offset); + if (seg_reg_indx < 0) + goto out; + + seg_base_addr = insn_get_seg_base(regs, seg_reg_indx); + if (seg_base_addr == -1L) + goto out; + + seg_limit = get_seg_limit(regs, seg_reg_indx); + } else { + if (insn->sib.nbytes) { + /* +* Negative values in the base and index offset means +* an error when decoding the SIB byte. Except -EDOM, +* which means that the registers should not be used +* in the address computation. 
+*/ + base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE); + if (base_offset == -EDOM) { + base = 0; + } else if (base_offset < 0) { + goto out; + } else { + tmp = regs_get_register(regs, base_offset); + /* The 4 most significant bytes must be zero. */ + if (tmp & ~0xL) + goto out; + + base = (int)(tmp & 0x); + } + + indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX); + if (indx_offset == -EDOM) { + indx = 0; + } else if (indx_offset
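To illustrate the sign and overflow point made in the commit message above, here is a small standalone sketch with made-up register values: doing the effective-address arithmetic in 32-bit variables wraps modulo 2^32 exactly as the hardware does, while naive 64-bit arithmetic sign-extends the displacement and lands outside the 32-bit address space, so it would need explicit truncation.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t base = 0x1000;         /* made-up base register value */
        int32_t  disp = -0x2000;        /* made-up negative displacement */

        uint32_t eff32 = base + disp;              /* wraps: 0xfffff000 */
        uint64_t eff64 = (uint64_t)base + disp;    /* 0xfffffffffffff000 */

        printf("32-bit arithmetic: 0x%08x\n", eff32);
        printf("64-bit arithmetic: 0x%016llx\n", (unsigned long long)eff64);
        return 0;
}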
[PATCH v9 28/29] selftests/x86: Add tests for User-Mode Instruction Prevention
Certain user space programs that run on virtual-8086 mode may utilize instructions protected by the User-Mode Instruction Prevention (UMIP) security feature present in new Intel processors: SGDT, SIDT and SMSW. In such a case, a general protection fault is issued if UMIP is enabled. When such a fault happens, the kernel traps it and emulates the results of these instructions with dummy values. The purpose of this new test is to verify whether the impacted instructions can be executed without causing such #GP. If no #GP exceptions occur, we expect to exit virtual-8086 mode from INT3. The instructions protected by UMIP are executed in representative use cases: a) displacement-only memory addressing b) register-indirect memory addressing c) results stored directly in operands Unfortunately, it is not possible to check the results against a set of expected values because no emulation will occur in systems that do not have the UMIP feature. Instead, results are printed for verification. A simple verification is done to ensure that results of all tests are identical. Cc: Andy Lutomirski Cc: Andrew Morton Cc: Borislav Petkov Cc: Brian Gerst Cc: Chen Yucong Cc: Chris Metcalf Cc: Dave Hansen Cc: Fenghua Yu Cc: Huang Rui Cc: Jiri Slaby Cc: Jonathan Corbet Cc: Michael S. Tsirkin Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Ravi V. Shankar Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Ricardo Neri --- tools/testing/selftests/x86/entry_from_vm86.c | 73 ++- 1 file changed, 72 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/x86/entry_from_vm86.c b/tools/testing/selftests/x86/entry_from_vm86.c index d075ea0..f7d9cea 100644 --- a/tools/testing/selftests/x86/entry_from_vm86.c +++ b/tools/testing/selftests/x86/entry_from_vm86.c @@ -95,6 +95,22 @@ asm ( "int3\n\t" "vmcode_int80:\n\t" "int $0x80\n\t" + "vmcode_umip:\n\t" + /* addressing via displacements */ + "smsw (2052)\n\t" + "sidt (2054)\n\t" + "sgdt (2060)\n\t" + /* addressing via registers */ + "mov $2066, %bx\n\t" + "smsw (%bx)\n\t" + "mov $2068, %bx\n\t" + "sidt (%bx)\n\t" + "mov $2074, %bx\n\t" + "sgdt (%bx)\n\t" + /* register operands, only for smsw */ + "smsw %ax\n\t" + "mov %ax, (2080)\n\t" + "int3\n\t" ".size vmcode, . - vmcode\n\t" "end_vmcode:\n\t" ".code32\n\t" @@ -103,7 +119,7 @@ asm ( extern unsigned char vmcode[], end_vmcode[]; extern unsigned char vmcode_bound[], vmcode_sysenter[], vmcode_syscall[], - vmcode_sti[], vmcode_int3[], vmcode_int80[]; + vmcode_sti[], vmcode_int3[], vmcode_int80[], vmcode_umip[]; /* Returns false if the test was skipped. 
*/ static bool do_test(struct vm86plus_struct *v86, unsigned long eip, @@ -160,6 +176,58 @@ static bool do_test(struct vm86plus_struct *v86, unsigned long eip, return true; } +void do_umip_tests(struct vm86plus_struct *vm86, unsigned char *test_mem) +{ + struct table_desc { + unsigned short limit; + unsigned long base; + } __attribute__((packed)); + + /* Initialize variables with arbitrary values */ + struct table_desc gdt1 = { .base = 0x3c3c3c3c, .limit = 0x }; + struct table_desc gdt2 = { .base = 0x1a1a1a1a, .limit = 0xaeae }; + struct table_desc idt1 = { .base = 0x7b7b7b7b, .limit = 0xf1f1 }; + struct table_desc idt2 = { .base = 0x89898989, .limit = 0x1313 }; + unsigned short msw1 = 0x1414, msw2 = 0x2525, msw3 = 3737; + + /* UMIP -- exit with INT3 unless kernel emulation did not trap #GP */ + do_test(vm86, vmcode_umip - vmcode, VM86_TRAP, 3, "UMIP tests"); + + /* Results from displacement-only addressing */ + msw1 = *(unsigned short *)(test_mem + 2052); + memcpy(&idt1, test_mem + 2054, sizeof(idt1)); + memcpy(&gdt1, test_mem + 2060, sizeof(gdt1)); + + /* Results from register-indirect addressing */ + msw2 = *(unsigned short *)(test_mem + 2066); + memcpy(&idt2, test_mem + 2068, sizeof(idt2)); + memcpy(&gdt2, test_mem + 2074, sizeof(gdt2)); + + /* Results when using register operands */ + msw3 = *(unsigned short *)(test_mem + 2080); + + printf("[INFO]\tResult from SMSW:[0x%04x]\n", msw1); + printf("[INFO]\tResult from SIDT: limit[0x%04x]base[0x%08lx]\n", + idt1.limit, idt1.base); + printf("[INFO]\tResult from SGDT: limit[0x%04x]base[0x%08lx]\n", + gdt1.limit, gdt1.base); + + if (msw1 != msw2 || msw1 != msw3) + printf("[FAIL]\tAll the results of SMSW should be the same.\n"); + else + printf("[PASS]\tAll the results from SMSW are identical.\n"); + + if (memcmp(&gdt1, &gdt2, sizeof(gdt1))) + printf("[FAIL]\tAll the results of SGDT should be the same.\n"); + else + printf("[PASS]\tAll the results from SG
[PATCH v9 29/29] selftests/x86: Add tests for instruction str and sldt
The instructions str and sldt are not recognized when running on virtual- 8086 mode and generate an invalid operand exception. These two instructions are protected by the Intel User-Mode Instruction Prevention (UMIP) security feature. In protected mode, if UMIP is enabled, these instructions generate a general protection fault if called from CPL > 0. Linux traps the general protection fault and emulates the instructions sgdt, sidt and smsw; but not str and sldt. These tests are added to verify that the emulation code does not emulate these two instructions but the expected invalid operand exception is seen. Tests fallback to exit with int3 in case emulation does happen. Cc: Andy Lutomirski Cc: Andrew Morton Cc: Borislav Petkov Cc: Brian Gerst Cc: Chen Yucong Cc: Chris Metcalf Cc: Dave Hansen Cc: Fenghua Yu Cc: Huang Rui Cc: Jiri Slaby Cc: Jonathan Corbet Cc: Michael S. Tsirkin Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Ravi V. Shankar Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Ricardo Neri --- tools/testing/selftests/x86/entry_from_vm86.c | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/x86/entry_from_vm86.c b/tools/testing/selftests/x86/entry_from_vm86.c index f7d9cea..361466a 100644 --- a/tools/testing/selftests/x86/entry_from_vm86.c +++ b/tools/testing/selftests/x86/entry_from_vm86.c @@ -111,6 +111,11 @@ asm ( "smsw %ax\n\t" "mov %ax, (2080)\n\t" "int3\n\t" + "vmcode_umip_str:\n\t" + "str %eax\n\t" + "vmcode_umip_sldt:\n\t" + "sldt %eax\n\t" + "int3\n\t" ".size vmcode, . - vmcode\n\t" "end_vmcode:\n\t" ".code32\n\t" @@ -119,7 +124,8 @@ asm ( extern unsigned char vmcode[], end_vmcode[]; extern unsigned char vmcode_bound[], vmcode_sysenter[], vmcode_syscall[], - vmcode_sti[], vmcode_int3[], vmcode_int80[], vmcode_umip[]; + vmcode_sti[], vmcode_int3[], vmcode_int80[], vmcode_umip[], + vmcode_umip_str[], vmcode_umip_sldt[]; /* Returns false if the test was skipped. */ static bool do_test(struct vm86plus_struct *v86, unsigned long eip, @@ -226,6 +232,16 @@ void do_umip_tests(struct vm86plus_struct *vm86, unsigned char *test_mem) printf("[FAIL]\tAll the results of SIDT should be the same.\n"); else printf("[PASS]\tAll the results from SIDT are identical.\n"); + + sethandler(SIGILL, sighandler, 0); + do_test(vm86, vmcode_umip_str - vmcode, VM86_SIGNAL, 0, + "STR instruction"); + clearhandler(SIGILL); + + sethandler(SIGILL, sighandler, 0); + do_test(vm86, vmcode_umip_sldt - vmcode, VM86_SIGNAL, 0, + "SLDT instruction"); + clearhandler(SIGILL); } int main(void) -- 2.7.4
[PATCH v9 26/29] x86: Enable User-Mode Instruction Prevention
User-Mode Instruction Prevention (UMIP) is enabled by setting/clearing a bit in %cr4. It makes sense to enable UMIP at some point while booting, before user spaces come up. Like SMAP and SMEP, is not critical to have it enabled very early during boot. This is because UMIP is relevant only when there is a userspace to be protected from. Given the similarities in relevance, it makes sense to enable UMIP along with SMAP and SMEP. UMIP is enabled by default. It can be disabled by adding clearcpuid=514 to the kernel parameters. Cc: Andy Lutomirski Cc: Andrew Morton Cc: H. Peter Anvin Cc: Borislav Petkov Cc: Brian Gerst Cc: Chen Yucong Cc: Chris Metcalf Cc: Dave Hansen Cc: Fenghua Yu Cc: Huang Rui Cc: Jiri Slaby Cc: Jonathan Corbet Cc: Michael S. Tsirkin Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Ravi V. Shankar Cc: Shuah Khan Cc: Vlastimil Babka Cc: Tony Luck Cc: Paolo Bonzini Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/Kconfig | 10 ++ arch/x86/kernel/cpu/common.c | 25 - 2 files changed, 34 insertions(+), 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5442735..b7c06d8 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1803,6 +1803,16 @@ config X86_SMAP If unsure, say Y. +config X86_INTEL_UMIP + def_bool n + depends on CPU_SUP_INTEL + prompt "Intel User Mode Instruction Prevention" if EXPERT + ---help--- + The User Mode Instruction Prevention (UMIP) is a security + feature in newer Intel processors. If enabled, a general + protection fault is issued if the instructions SGDT, SLDT, + SIDT, SMSW and STR are executed in user mode. + config X86_INTEL_MPX prompt "Intel MPX (Memory Protection Extensions)" def_bool n diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 03f9a1a..45b5ec4 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -329,6 +329,28 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c) } } +static __always_inline void setup_umip(struct cpuinfo_x86 *c) +{ + /* Check the boot processor, plus build option for UMIP. */ + if (!cpu_feature_enabled(X86_FEATURE_UMIP)) + goto out; + + /* Check the current processor's cpuid bits. */ + if (!cpu_has(c, X86_FEATURE_UMIP)) + goto out; + + cr4_set_bits(X86_CR4_UMIP); + + return; + +out: + /* +* Make sure UMIP is disabled in case it was enabled in a +* previous boot (e.g., via kexec). +*/ + cr4_clear_bits(X86_CR4_UMIP); +} + /* * Protection Keys are not available in 32-bit mode. */ @@ -1147,9 +1169,10 @@ static void identify_cpu(struct cpuinfo_x86 *c) /* Disable the PN if appropriate */ squash_the_stupid_serial_number(c); - /* Set up SMEP/SMAP */ + /* Set up SMEP/SMAP/UMIP */ setup_smep(c); setup_smap(c); + setup_umip(c); /* * The vendor-specific functions might have changed features. -- 2.7.4
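A note on where the 514 in clearcpuid=514 comes from: clearcpuid= takes the X86_FEATURE_* bit number, and the cpufeatures patch in this series places UMIP at bit 2 of word 16, so the value is simply 16 * 32 + 2. A one-line compile-time check of that arithmetic:

/* UMIP is bit 2 of cpufeatures word 16 (X86_FEATURE_UMIP = 16*32 + 2),
 * hence clearcpuid=514 on the kernel command line disables it. */
_Static_assert(16 * 32 + 2 == 514, "clearcpuid value for X86_FEATURE_UMIP");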
[PATCH v9 27/29] x86/traps: Fixup general protection faults caused by UMIP
If the User-Mode Instruction Prevention CPU feature is available and enabled, a general protection fault will be issued if the instructions sgdt, sldt, sidt, str or smsw are executed from user-mode context (CPL > 0). If the fault was caused by any of the instructions protected by UMIP, fixup_umip_exception() will emulate dummy results for these instructions as follows: if running a 32-bit process, sgdt, sidt and smsw are emulated; str and sldt are not emulated. No emulation is done for 64-bit processes. If emulation is successful, the result is passed to the user space program and no SIGSEGV signal is emitted. Please note that fixup_umip_exception() also caters for the case when the fault originated while running in virtual-8086 mode. Cc: Andy Lutomirski Cc: Andrew Morton Cc: H. Peter Anvin Cc: Borislav Petkov Cc: Brian Gerst Cc: Chen Yucong Cc: Chris Metcalf Cc: Dave Hansen Cc: Fenghua Yu Cc: Huang Rui Cc: Jiri Slaby Cc: Jonathan Corbet Cc: Michael S. Tsirkin Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Ravi V. Shankar Cc: Shuah Khan Cc: Vlastimil Babka Cc: Tony Luck Cc: Paolo Bonzini Cc: x...@kernel.org Reviewed-by: Andy Lutomirski Signed-off-by: Ricardo Neri --- arch/x86/kernel/traps.c | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index a5791f3..4c0aa6c 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -60,6 +60,7 @@ #include #include #include +#include #ifdef CONFIG_X86_64 #include @@ -514,6 +515,10 @@ do_general_protection(struct pt_regs *regs, long error_code) RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); cond_local_irq_enable(regs); + if (static_cpu_has(X86_FEATURE_UMIP)) + if (user_mode(regs) && fixup_umip_exception(regs)) + return; + if (v8086_mode(regs)) { local_irq_enable(); handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code); -- 2.7.4
[PATCH v9 05/29] x86/mpx: Simplify handling of errors when computing linear addresses
When errors occur in the computation of the linear address, -1L is returned. Rather than having a separate return path for errors, the variable used to return the computed linear address can be initialized with the error value. Hence, only one return path is needed. This makes the function easier to read. While here, ensure that the error value is -1L, a 64-bit value, rather than -1, a 32-bit value. Cc: Borislav Petkov Cc: Andy Lutomirski Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Peter Zijlstra Cc: Nathan Howard Cc: Adan Hawthorn Cc: Joe Perches Cc: Ravi V. Shankar Cc: x...@kernel.org Suggested-by: Borislav Petkov Signed-off-by: Ricardo Neri --- arch/x86/mm/mpx.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c index 9ceaa95..f4c48a0 100644 --- a/arch/x86/mm/mpx.c +++ b/arch/x86/mm/mpx.c @@ -138,7 +138,7 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, */ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) { - unsigned long addr, base, indx; + unsigned long addr = -1L, base, indx; int addr_offset, base_offset, indx_offset; insn_byte_t sib; @@ -149,17 +149,17 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) if (X86_MODRM_MOD(insn->modrm.value) == 3) { addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); if (addr_offset < 0) - goto out_err; + goto out; addr = regs_get_register(regs, addr_offset); } else { if (insn->sib.nbytes) { base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE); if (base_offset < 0) - goto out_err; + goto out; indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX); if (indx_offset < 0) - goto out_err; + goto out; base = regs_get_register(regs, base_offset); indx = regs_get_register(regs, indx_offset); @@ -167,14 +167,13 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) } else { addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); if (addr_offset < 0) - goto out_err; + goto out; addr = regs_get_register(regs, addr_offset); } addr += insn->displacement.value; } +out: return (void __user *)addr; -out_err: - return (void __user *)-1; } static int mpx_insn_decode(struct insn *insn, -- 2.7.4
[PATCH 1/4] kbuild: replace $(hdr-arch) with $(SRCARCH)
Since commit 5e53879008b9 ("sparc,sparc64: unify Makefile"), hdr-arch and SRCARCH always match. Signed-off-by: Masahiro Yamada --- Makefile | 21 + scripts/Makefile.headersinst | 2 +- 2 files changed, 10 insertions(+), 13 deletions(-) diff --git a/Makefile b/Makefile index cf007a3..8d900ba 100644 --- a/Makefile +++ b/Makefile @@ -283,9 +283,6 @@ ifeq ($(ARCH),tilegx) SRCARCH := tile endif -# Where to locate arch specific headers -hdr-arch := $(SRCARCH) - KCONFIG_CONFIG ?= .config export KCONFIG_CONFIG @@ -378,8 +375,8 @@ CFLAGS_KCOV := $(call cc-option,-fsanitize-coverage=trace-pc,) # Use USERINCLUDE when you must reference the UAPI directories only. USERINCLUDE:= \ - -I$(srctree)/arch/$(hdr-arch)/include/uapi \ - -I$(objtree)/arch/$(hdr-arch)/include/generated/uapi \ + -I$(srctree)/arch/$(SRCARCH)/include/uapi \ + -I$(objtree)/arch/$(SRCARCH)/include/generated/uapi \ -I$(srctree)/include/uapi \ -I$(objtree)/include/generated/uapi \ -include $(srctree)/include/linux/kconfig.h @@ -387,8 +384,8 @@ USERINCLUDE:= \ # Use LINUXINCLUDE when you must reference the include/ directory. # Needed to be compatible with the O= option LINUXINCLUDE:= \ - -I$(srctree)/arch/$(hdr-arch)/include \ - -I$(objtree)/arch/$(hdr-arch)/include/generated \ + -I$(srctree)/arch/$(SRCARCH)/include \ + -I$(objtree)/arch/$(SRCARCH)/include/generated \ $(if $(KBUILD_SRC), -I$(srctree)/include) \ -I$(objtree)/include \ $(USERINCLUDE) @@ -1134,8 +1131,8 @@ headerdep: #Default location for installed headers export INSTALL_HDR_PATH = $(objtree)/usr -# If we do an all arch process set dst to include/arch-$(hdr-arch) -hdr-dst = $(if $(KBUILD_HEADERS), dst=include/arch-$(hdr-arch), dst=include) +# If we do an all arch process set dst to include/arch-$(SRCARCH) +hdr-dst = $(if $(KBUILD_HEADERS), dst=include/arch-$(SRCARCH), dst=include) PHONY += archheaders archheaders: @@ -1153,10 +1150,10 @@ headers_install_all: PHONY += headers_install headers_install: __headers - $(if $(wildcard $(srctree)/arch/$(hdr-arch)/include/uapi/asm/Kbuild),, \ + $(if $(wildcard $(srctree)/arch/$(SRCARCH)/include/uapi/asm/Kbuild),, \ $(error Headers not exportable for the $(SRCARCH) architecture)) $(Q)$(MAKE) $(hdr-inst)=include/uapi dst=include - $(Q)$(MAKE) $(hdr-inst)=arch/$(hdr-arch)/include/uapi $(hdr-dst) + $(Q)$(MAKE) $(hdr-inst)=arch/$(SRCARCH)/include/uapi $(hdr-dst) PHONY += headers_check_all headers_check_all: headers_install_all @@ -1165,7 +1162,7 @@ headers_check_all: headers_install_all PHONY += headers_check headers_check: headers_install $(Q)$(MAKE) $(hdr-inst)=include/uapi dst=include HDRCHECK=1 - $(Q)$(MAKE) $(hdr-inst)=arch/$(hdr-arch)/include/uapi $(hdr-dst) HDRCHECK=1 + $(Q)$(MAKE) $(hdr-inst)=arch/$(SRCARCH)/include/uapi $(hdr-dst) HDRCHECK=1 # --- # Kernel selftest diff --git a/scripts/Makefile.headersinst b/scripts/Makefile.headersinst index 343d586..5692d7a 100644 --- a/scripts/Makefile.headersinst +++ b/scripts/Makefile.headersinst @@ -30,7 +30,7 @@ __headers: $(subdirs) $(subdirs): $(Q)$(MAKE) $(hdr-inst)=$(obj)/$@ dst=$(dst)/$@ -# Skip header install/check for include/uapi and arch/$(hdr-arch)/include/uapi. +# Skip header install/check for include/uapi and arch/$(SRCARCH)/include/uapi. # We have only sub-directories there. skip-inst := $(if $(filter %/uapi,$(obj)),1) -- 2.7.4
[PATCH v9 23/29] x86/cpufeature: Add User-Mode Instruction Prevention definitions
User-Mode Instruction Prevention is a security feature present in new Intel processors that, when set, prevents the execution of a subset of instructions if such instructions are executed in user mode (CPL > 0). Attempting to execute such instructions causes a general protection exception. The subset of instructions comprises: * SGDT - Store Global Descriptor Table * SIDT - Store Interrupt Descriptor Table * SLDT - Store Local Descriptor Table * SMSW - Store Machine Status Word * STR - Store Task Register This feature is also added to the list of disabled-features to allow a cleaner handling of build-time configuration. Cc: Andy Lutomirski Cc: Andrew Morton Cc: H. Peter Anvin Cc: Borislav Petkov Cc: Brian Gerst Cc: Chen Yucong Cc: Chris Metcalf Cc: Dave Hansen Cc: Fenghua Yu Cc: Huang Rui Cc: Jiri Slaby Cc: Jonathan Corbet Cc: Michael S. Tsirkin Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Ravi V. Shankar Cc: Shuah Khan Cc: Vlastimil Babka Cc: Tony Luck Cc: Paolo Bonzini Cc: x...@kernel.org Reviewed-by: Borislav Petkov Signed-off-by: Ricardo Neri --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/disabled-features.h| 8 +++- arch/x86/include/uapi/asm/processor-flags.h | 2 ++ 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 2519c6c..d1f18f2 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -292,6 +292,7 @@ /* Intel-defined CPU features, CPUID level 0x0007:0 (ecx), word 16 */ #define X86_FEATURE_AVX512VBMI (16*32+ 1) /* AVX512 Vector Bit Manipulation instructions*/ +#define X86_FEATURE_UMIP (16*32+ 2) /* User Mode Instruction Protection */ #define X86_FEATURE_PKU(16*32+ 3) /* Protection Keys for Userspace */ #define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */ #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors of DW/QW */ diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h index c10c912..14d6d50 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -16,6 +16,12 @@ # define DISABLE_MPX (1<<(X86_FEATURE_MPX & 31)) #endif +#ifdef CONFIG_X86_INTEL_UMIP +# define DISABLE_UMIP 0 +#else +# define DISABLE_UMIP (1<<(X86_FEATURE_UMIP & 31)) +#endif + #ifdef CONFIG_X86_64 # define DISABLE_VME (1<<(X86_FEATURE_VME & 31)) # define DISABLE_K6_MTRR (1<<(X86_FEATURE_K6_MTRR & 31)) @@ -63,7 +69,7 @@ #define DISABLED_MASK130 #define DISABLED_MASK140 #define DISABLED_MASK150 -#define DISABLED_MASK16(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57) +#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP) #define DISABLED_MASK170 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18) diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h index 39946d0..cf4c876 100644 --- a/arch/x86/include/uapi/asm/processor-flags.h +++ b/arch/x86/include/uapi/asm/processor-flags.h @@ -104,6 +104,8 @@ #define X86_CR4_OSFXSR _BITUL(X86_CR4_OSFXSR_BIT) #define X86_CR4_OSXMMEXCPT_BIT 10 /* enable unmasked SSE exceptions */ #define X86_CR4_OSXMMEXCPT _BITUL(X86_CR4_OSXMMEXCPT_BIT) +#define X86_CR4_UMIP_BIT 11 /* enable UMIP support */ +#define X86_CR4_UMIP _BITUL(X86_CR4_UMIP_BIT) #define X86_CR4_LA57_BIT 12 /* enable 5-level page tables */ #define X86_CR4_LA57 _BITUL(X86_CR4_LA57_BIT) #define X86_CR4_VMXE_BIT 13 /* enable VMX virtualization */ -- 2.7.4
[RFC PATCH 4/4] kbuild: evaluate cc-option and friends only when building kernel
$(call cc-option,...) is costly because it invokes the C compiler. Given that only some targets need compiler flags, it is pointless to compute them all the time. The variable no-dot-config-targets lists the targets we can run without the .config file, such as "make clean", "make help", etc. My idea is similar here. We can add no-compiler-targets to list the targets we can run without the target compiler information. This includes no-dot-config-targets + config targets + misc. When we run only targets listed in the no-compiler-targets, we can set cc-option and friends to no-op. The hostcc-option is an exception since the host compiler is needed for building fixdep, kconfig, etc. I did not add "dtbs" and "%.dtb" to no-compiler-targets. This is intentional. Theoretically, we can build device tree blobs without the C compiler. It it true that in-kernel DT files are pre-processed by CPP before DTC, but KBUILD_CPPFLAGS is unused. However, for the reason of scripts/dtc/ location, Kbuild descends into scripts/mod/ before the DT build. I do not want to trigger the unrelated re-build of modpost when building DT. Perhaps, we can fix this with further refactoring, but not now. Signed-off-by: Masahiro Yamada --- Makefile | 10 ++ scripts/Kbuild.include | 14 +- 2 files changed, 19 insertions(+), 5 deletions(-) diff --git a/Makefile b/Makefile index a4fd682..80e1a38 100644 --- a/Makefile +++ b/Makefile @@ -224,9 +224,13 @@ no-dot-config-targets := clean mrproper distclean \ $(version_h) headers_% archheaders archscripts \ kernelversion %src-pkg +no-compiler-targets := $(no-dot-config-targets) config %config \ + kernelrelease image_name + config-targets := 0 mixed-targets := 0 dot-config := 1 +need-compiler := 1 ifneq ($(filter $(no-dot-config-targets), $(MAKECMDGOALS)),) ifeq ($(filter-out $(no-dot-config-targets), $(MAKECMDGOALS)),) @@ -234,6 +238,12 @@ ifneq ($(filter $(no-dot-config-targets), $(MAKECMDGOALS)),) endif endif +ifneq ($(filter $(no-compiler-targets), $(MAKECMDGOALS)),) + ifeq ($(filter-out $(no-compiler-targets), $(MAKECMDGOALS)),) + need-compiler := 0 + endif +endif + ifeq ($(KBUILD_EXTMOD),) ifneq ($(filter config %config,$(MAKECMDGOALS)),) config-targets := 1 diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include index 9ffd3dd..222d0a2 100644 --- a/scripts/Kbuild.include +++ b/scripts/Kbuild.include @@ -96,6 +96,13 @@ try-run = $(shell set -e;\ fi; \ rm -f "$$TMP" "$$TMPO") +# hostcc-option +# Usage: cflags-y += $(call hostcc-option,-march=winchip-c6,-march=i586) +hostcc-option = $(call __cc-option, $(HOSTCC),\ + $(HOSTCFLAGS) $(HOST_EXTRACFLAGS),$(1),$(2)) + +ifeq ($(need-compiler),1) + # as-option # Usage: cflags-y += $(call as-option,-Wa$(comma)-isa=foo,) @@ -123,11 +130,6 @@ CC_OPTION_CFLAGS = $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS)) cc-option = $(call __cc-option, $(CC),\ $(KBUILD_CPPFLAGS) $(CC_OPTION_CFLAGS),$(1),$(2)) -# hostcc-option -# Usage: cflags-y += $(call hostcc-option,-march=winchip-c6,-march=i586) -hostcc-option = $(call __cc-option, $(HOSTCC),\ - $(HOSTCFLAGS) $(HOST_EXTRACFLAGS),$(1),$(2)) - # cc-option-yn # Usage: flag := $(call cc-option-yn,-march=winchip-c6) cc-option-yn = $(call try-run,\ @@ -180,6 +182,8 @@ ld-version = $(shell $(LD) --version | $(srctree)/scripts/ld-version.sh) # Usage: $(call ld-ifversion, -ge, 22252, y) ld-ifversion = $(shell [ $(ld-version) $(1) $(2) ] && echo $(3) || echo $(4)) +endif + ## ### -- 2.7.4
[PATCH v9 24/29] x86: Add emulation code for UMIP instructions
The feature User-Mode Instruction Prevention present in recent Intel processor prevents a group of instructions (sgdt, sidt, sldt, smsw, and str) from being executed with CPL > 0. Otherwise, a general protection fault is issued. Rather than relaying to the user space the general protection fault caused by the UMIP-protected instructions (in the form of a SIGSEGV signal), it can be trapped and emulate the result of such instructions to provide dummy values. This allows to both conserve the current kernel behavior and not reveal the system resources that UMIP intends to protect (i.e., the locations of the global descriptor and interrupt descriptor tables, the segment selectors of the local descriptor table, the value of the task state register and the contents of the CR0 register). This emulation is needed because certain applications (e.g., WineHQ and DOSEMU2) rely on this subset of instructions to function. Given that sldt and str are not commonly used in programs that run on WineHQ or DOSEMU2, they are not emulated. Also, emulation is provided only for 32-bit processes; 64-bit processes that attempt to use the instructions that UMIP protects will receive the SIGSEGV signal issued as a consequence of the general protection fault. The instructions protected by UMIP can be split in two groups. Those which return a kernel memory address (sgdt and sidt) and those which return a value (sldt, str and smsw). For the instructions that return a kernel memory address, applications such as WineHQ rely on the result being located in the kernel memory space, not the actual location of the table. The result is emulated as a hard-coded value that lies close to the top of the kernel memory. The limit for the GDT and the IDT are set to zero. The instruction smsw is emulated to return the value that the register CR0 has at boot time as set in the head_32. Care is taken to appropriately emulate the results when segmentation is used. That is, rather than relying on USER_DS and USER_CS, the function insn_get_addr_ref() inspects the segment descriptor pointed by the registers in pt_regs. This ensures that we correctly obtain the segment base address and the address and operand sizes even if the user space application uses a local descriptor table. Cc: Andy Lutomirski Cc: Andrew Morton Cc: H. Peter Anvin Cc: Borislav Petkov Cc: Brian Gerst Cc: Chen Yucong Cc: Chris Metcalf Cc: Dave Hansen Cc: Fenghua Yu Cc: Huang Rui Cc: Jiri Slaby Cc: Jonathan Corbet Cc: Michael S. Tsirkin Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Ravi V. 
Shankar Cc: Shuah Khan Cc: Vlastimil Babka Cc: Tony Luck Cc: Paolo Bonzini Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/include/asm/umip.h | 12 ++ arch/x86/kernel/Makefile| 1 + arch/x86/kernel/umip.c | 309 3 files changed, 322 insertions(+) create mode 100644 arch/x86/include/asm/umip.h create mode 100644 arch/x86/kernel/umip.c diff --git a/arch/x86/include/asm/umip.h b/arch/x86/include/asm/umip.h new file mode 100644 index 000..db43f2a --- /dev/null +++ b/arch/x86/include/asm/umip.h @@ -0,0 +1,12 @@ +#ifndef _ASM_X86_UMIP_H +#define _ASM_X86_UMIP_H + +#include +#include + +#ifdef CONFIG_X86_INTEL_UMIP +bool fixup_umip_exception(struct pt_regs *regs); +#else +static inline bool fixup_umip_exception(struct pt_regs *regs) { return false; } +#endif /* CONFIG_X86_INTEL_UMIP */ +#endif /* _ASM_X86_UMIP_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index d8e2b70..bafeee1 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -125,6 +125,7 @@ obj-$(CONFIG_EFI) += sysfb_efi.o obj-$(CONFIG_PERF_EVENTS) += perf_regs.o obj-$(CONFIG_TRACING) += tracepoint.o obj-$(CONFIG_SCHED_MC_PRIO)+= itmt.o +obj-$(CONFIG_X86_INTEL_UMIP) += umip.o obj-$(CONFIG_ORC_UNWINDER) += unwind_orc.o obj-$(CONFIG_FRAME_POINTER_UNWINDER) += unwind_frame.o diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c new file mode 100644 index 000..1f338cb --- /dev/null +++ b/arch/x86/kernel/umip.c @@ -0,0 +1,309 @@ +/* + * umip.c Emulation for instruction protected by the Intel User-Mode + * Instruction Prevention feature + * + * Copyright (c) 2017, Intel Corporation. + * Ricardo Neri + */ + +#include +#include +#include +#include +#include +#include + +/** DOC: Emulation for User-Mode Instruction Prevention (UMIP) + * + * The feature User-Mode Instruction Prevention present in recent Intel + * processor prevents a group of instructions (sgdt, sidt, sldt, smsw, and str) + * from being executed with CPL > 0. Otherwise, a general protection fault is + * issued. + * + * Rather than relaying to the user space the general protection fault caused by + * the UMIP-protected instructions (in the form of a SIGSEGV signal), it can be + * trapped and emulate the result of such instructions to provide dummy
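To make the dummy-value strategy above concrete, here is a minimal userspace-style sketch of what the emulated 32-bit SGDT/SIDT result looks like: a zero 16-bit limit followed by a hard-coded 32-bit base near the top of the kernel address space. The constant UMIP_DUMMY_GDT_BASE and the helper name are hypothetical, chosen only for illustration; they are not the identifiers used in the patch.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical dummy base near the top of the 32-bit kernel address space. */
#define UMIP_DUMMY_GDT_BASE 0xfffe0000u

/* Build the 6-byte result a 32-bit SGDT/SIDT stores: 16-bit limit, 32-bit base. */
static void emulate_sgdt32(unsigned char buf[6])
{
        uint16_t limit = 0;                   /* the table limit is reported as zero */
        uint32_t base = UMIP_DUMMY_GDT_BASE;  /* dummy base, not the real table address */

        memcpy(buf, &limit, sizeof(limit));
        memcpy(buf + 2, &base, sizeof(base));
}

int main(void)
{
        unsigned char result[6];

        emulate_sgdt32(result);
        printf("limit=%02x%02x base=%02x%02x%02x%02x\n",
               result[1], result[0], result[5], result[4], result[3], result[2]);
        return 0;
}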
[PATCH 2/4] kbuild: move "_all" target out of $(KBUILD_SRC) conditional
The first "_all" occurrence around line 120 is only visible when KBUILD_SRC is unset. If O=... is specified, the working directory is relocated, so only the second occurrence around line 193 is visible, and that one is not set to PHONY. Move the first one to an always visible place. This clarifies that "_all" is our default target and that it is always set to PHONY. Signed-off-by: Masahiro Yamada --- Makefile | 8  1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Makefile b/Makefile index 8d900ba..39a7c03 100644 --- a/Makefile +++ b/Makefile @@ -10,6 +10,10 @@ NAME = Fearless Coyote # Comments in this file are targeted only to the developer, do not # expect to learn how to build the kernel reading this file. +# That's our default target when none is given on the command line +PHONY := _all +_all: + # o Do not use make's built-in rules and variables # (this increases performance and avoids hard-to-debug behaviour); # o Look for make include files relative to root of kernel src @@ -116,10 +120,6 @@ ifeq ("$(origin O)", "command line") KBUILD_OUTPUT := $(O) endif -# That's our default target when none is given on the command line -PHONY := _all -_all: - # Cancel implicit rules on top Makefile $(CURDIR)/Makefile Makefile: ; -- 2.7.4
[PATCH 3/4] kbuild: re-order the code to not parse unnecessary variables
The top Makefile is divided into some sections such as mixed targets, config targets, build targets, etc. When we build mixed targets, Kbuild just invokes submake to process them one by one. In this case, compiler-related variables like CC, KBUILD_CFLAGS, etc. are unneeded. Check what kind of targets we are building first, then parse necessary variables for building them. Signed-off-by: Masahiro Yamada --- Makefile | 233 --- 1 file changed, 118 insertions(+), 115 deletions(-) diff --git a/Makefile b/Makefile index 39a7c03..a4fd682 100644 --- a/Makefile +++ b/Makefile @@ -186,15 +186,6 @@ ifeq ("$(origin M)", "command line") KBUILD_EXTMOD := $(M) endif -# If building an external module we do not care about the all: rule -# but instead _all depend on modules -PHONY += all -ifeq ($(KBUILD_EXTMOD),) -_all: all -else -_all: modules -endif - ifeq ($(KBUILD_SRC),) # building in the source tree srctree := . @@ -206,6 +197,9 @@ else srctree := $(KBUILD_SRC) endif endif + +export KBUILD_CHECKSRC KBUILD_EXTMOD KBUILD_SRC + objtree:= . src:= $(srctree) obj:= $(objtree) @@ -214,6 +208,74 @@ VPATH := $(srctree)$(if $(KBUILD_EXTMOD),:$(KBUILD_EXTMOD)) export srctree objtree VPATH +# To make sure we do not include .config for any of the *config targets +# catch them early, and hand them over to scripts/kconfig/Makefile +# It is allowed to specify more targets when calling make, including +# mixing *config targets and build targets. +# For example 'make oldconfig all'. +# Detect when mixed targets is specified, and make a second invocation +# of make so .config is not included in this case either (for *config). + +version_h := include/generated/uapi/linux/version.h +old_version_h := include/linux/version.h + +no-dot-config-targets := clean mrproper distclean \ +cscope gtags TAGS tags help% %docs check% coccicheck \ +$(version_h) headers_% archheaders archscripts \ +kernelversion %src-pkg + +config-targets := 0 +mixed-targets := 0 +dot-config := 1 + +ifneq ($(filter $(no-dot-config-targets), $(MAKECMDGOALS)),) + ifeq ($(filter-out $(no-dot-config-targets), $(MAKECMDGOALS)),) + dot-config := 0 + endif +endif + +ifeq ($(KBUILD_EXTMOD),) +ifneq ($(filter config %config,$(MAKECMDGOALS)),) +config-targets := 1 +ifneq ($(words $(MAKECMDGOALS)),1) +mixed-targets := 1 +endif +endif +endif +# install and modules_install need also be processed one by one +ifneq ($(filter install,$(MAKECMDGOALS)),) +ifneq ($(filter modules_install,$(MAKECMDGOALS)),) + mixed-targets := 1 +endif +endif + +ifeq ($(mixed-targets),1) +# === +# We're called with mixed targets (*config and build targets). +# Handle them one by one. + +PHONY += $(MAKECMDGOALS) __build_one_by_one + +$(filter-out __build_one_by_one, $(MAKECMDGOALS)): __build_one_by_one + @: + +__build_one_by_one: + $(Q)set -e; \ + for i in $(MAKECMDGOALS); do \ + $(MAKE) -f $(srctree)/Makefile $$i; \ + done + +else + +# We need some generic definitions (do not try to remake the file). +scripts/Kbuild.include: ; +include scripts/Kbuild.include + +# Read KERNELRELEASE from include/config/kernel.release (if it exists) +KERNELRELEASE = $(shell cat include/config/kernel.release 2> /dev/null) +KERNELVERSION = $(VERSION)$(if $(PATCHLEVEL),.$(PATCHLEVEL)$(if $(SUBLEVEL),.$(SUBLEVEL)))$(EXTRAVERSION) +export VERSION PATCHLEVEL SUBLEVEL KERNELRELEASE KERNELVERSION + # SUBARCH tells the usermode build what the underlying arch is. That is set # first, and if a usermode build is happening, the "ARCH=um" on the command # line overrides the setting of ARCH below. 
If a native build is happening, @@ -308,40 +370,6 @@ HOSTCFLAGS += -Wno-unused-value -Wno-unused-parameter \ -Wno-missing-field-initializers -fno-delete-null-pointer-checks endif -# Decide whether to build built-in, modular, or both. -# Normally, just do built-in. - -KBUILD_MODULES := -KBUILD_BUILTIN := 1 - -# If we have only "make modules", don't compile built-in objects. -# When we're building modules with modversions, we need to consider -# the built-in objects during the descend as well, in order to -# make sure the checksums are up to date before we record them. - -ifeq ($(MAKECMDGOALS),modules) - KBUILD_BUILTIN := $(if $(CONFIG_MODVERSIONS),1) -endif - -# If we have "make modules", compile modules -# in addition to whatever we do anyway. -# Just "make" or "make all" shall build modules as well - -ifneq ($(filter all _all modules,$(MAKECMDGOALS)),) - KBUILD_MODULES := 1 -endif - -ifeq ($(MAKECMDGOALS),) - KBUILD
[PATCH v9 25/29] x86/umip: Force a page fault when unable to copy emulated result to user
fixup_umip_exception() will be called from do_general_protection(). If the former returns false, the latter will issue a SIGSEGV with SEND_SIG_PRIV. However, when emulation is successful but the emulated result cannot be copied to user space memory, it is more accurate to issue a SIGSEGV with SEGV_MAPERR with the offending address. A new function, inspired in force_sig_info_fault(), is introduced to model the page fault. Cc: Andy Lutomirski Cc: Andrew Morton Cc: H. Peter Anvin Cc: Borislav Petkov Cc: Brian Gerst Cc: Chen Yucong Cc: Chris Metcalf Cc: Dave Hansen Cc: Fenghua Yu Cc: Huang Rui Cc: Jiri Slaby Cc: Jonathan Corbet Cc: Michael S. Tsirkin Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Ravi V. Shankar Cc: Shuah Khan Cc: Vlastimil Babka Cc: Tony Luck Cc: Paolo Bonzini Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/kernel/umip.c | 45 +++-- 1 file changed, 43 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c index 1f338cb..6d07eb5 100644 --- a/arch/x86/kernel/umip.c +++ b/arch/x86/kernel/umip.c @@ -187,6 +187,41 @@ static int emulate_umip_insn(struct insn *insn, int umip_inst, } /** + * force_sig_info_umip_fault() - Force a SIGSEGV with SEGV_MAPERR + * @addr: Address that caused the signal + * @regs: Register set containing the instruction pointer + * + * Force a SIGSEGV signal with SEGV_MAPERR as the error code. This function is + * intended to be used to provide a segmentation fault when the result of the + * UMIP emulation could not be copied to the user space memory. + * + * Returns: none + */ +static void force_sig_info_umip_fault(void __user *addr, struct pt_regs *regs) +{ + siginfo_t info; + struct task_struct *tsk = current; + + tsk->thread.cr2 = (unsigned long)addr; + tsk->thread.error_code = X86_PF_USER | X86_PF_WRITE; + tsk->thread.trap_nr = X86_TRAP_PF; + + info.si_signo = SIGSEGV; + info.si_errno = 0; + info.si_code= SEGV_MAPERR; + info.si_addr= addr; + force_sig_info(SIGSEGV, &info, tsk); + + if (!(show_unhandled_signals && unhandled_signal(tsk, SIGSEGV))) + return; + + pr_err_ratelimited("%s[%d] umip emulation segfault ip:%lx sp:%lx error:%x in %lx\n", + tsk->comm, task_pid_nr(tsk), regs->ip, + regs->sp, X86_PF_USER | X86_PF_WRITE, + regs->ip); +} + +/** * fixup_umip_exception() - Fixup #GP faults caused by UMIP * @regs: Registers as saved when entering the #GP trap * @@ -299,8 +334,14 @@ bool fixup_umip_exception(struct pt_regs *regs) return false; nr_copied = copy_to_user(uaddr, dummy_data, dummy_data_size); - if (nr_copied > 0) - return false; + if (nr_copied > 0) { + /* +* If copy fails, send a signal and tell caller that +* fault was fixed up. +*/ + force_sig_info_umip_fault(uaddr, regs); + return true; + } } /* increase IP to let the program keep going */ -- 2.7.4
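For completeness, the sketch below shows how a user space program could observe the SEGV_MAPERR signal described above via SA_SIGINFO. It is a minimal illustration, not part of the patch; the handler only uses async-signal-safe calls.

#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Report the mapping error and exit; only async-signal-safe calls are used. */
static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
        (void)sig;
        (void)ctx;
        if (info->si_code == SEGV_MAPERR)
                write(STDERR_FILENO, "SIGSEGV with SEGV_MAPERR\n", 25);
        _exit(1);
}

int main(void)
{
        struct sigaction sa = { 0 };

        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* ... run code whose emulated UMIP result targets an unmapped address ... */
        return 0;
}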
[PATCH v9 16/29] x86/insn-eval: Add function to get default params of code segment
Obtain the default values of the address and operand sizes as specified in the D and L bits of the the segment descriptor selected by the register CS. The function can be used for both protected and long modes. For virtual-8086 mode, the default address and operand sizes are always 2 bytes. The returned parameters are encoded in a signed 8-bit data type. Auxiliar macros are provided to encode and decode such values. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/include/asm/insn-eval.h | 5 arch/x86/lib/insn-eval.c | 64 2 files changed, 69 insertions(+) diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h index 25d6e44..a5886ecc 100644 --- a/arch/x86/include/asm/insn-eval.h +++ b/arch/x86/include/asm/insn-eval.h @@ -11,8 +11,13 @@ #include #include +#define INSN_CODE_SEG_ADDR_SZ(params) ((params >> 4) & 0xf) +#define INSN_CODE_SEG_OPND_SZ(params) (params & 0xf) +#define INSN_CODE_SEG_PARAMS(oper_sz, addr_sz) (oper_sz | (addr_sz << 4)) + void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs); int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs); unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx); +char insn_get_code_seg_defaults(struct pt_regs *regs); #endif /* _ASM_X86_INSN_EVAL_H */ diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 02c4498..cb2734a 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -624,6 +624,70 @@ static unsigned long get_seg_limit(struct pt_regs *regs, int seg_reg_idx) } /** + * insn_get_code_seg_defaults() - Obtain code segment default parameters + * @regs: Structure with register values as seen when entering kernel mode + * + * Obtain the default parameters of the code segment: address and operand sizes. + * The code segment is obtained from the selector contained in the CS register + * in regs. In protected mode, the default address is determined by inspecting + * the L and D bits of the segment descriptor. In virtual-8086 mode, the default + * is always two bytes for both address and operand sizes. + * + * Returns: + * + * A signed 8-bit value containing the default parameters on success. + * + * -EINVAL on error. + */ +char insn_get_code_seg_defaults(struct pt_regs *regs) +{ + struct desc_struct *desc; + short sel; + + if (v8086_mode(regs)) + /* Address and operand size are both 16-bit. */ + return INSN_CODE_SEG_PARAMS(2, 2); + + sel = get_segment_selector(regs, INAT_SEG_REG_CS); + if (sel < 0) + return -1L; + + desc = get_desc(sel); + if (!desc) + return -EINVAL; + + /* +* The most significant byte of the Type field of the segment descriptor +* determines whether a segment contains data or code. If this is a data +* segment, return error. +*/ + if (!(desc->type & BIT(3))) + return -EINVAL; + + switch ((desc->l << 1) | desc->d) { + case 0: /* +* Legacy mode. CS.L=0, CS.D=0. Address and operand size are +* both 16-bit. +*/ + return INSN_CODE_SEG_PARAMS(2, 2); + case 1: /* +* Legacy mode. CS.L=0, CS.D=1. Address and operand size are +* both 32-bit. +*/ + return INSN_CODE_SEG_PARAMS(4, 4); + case 2: /* +* IA-32e 64-bit mode. CS.L=1, CS.D=0. Address size is 64-bit; +* operand size is 32-bit. +*/ + return INSN_CODE_SEG_PARAMS(4, 8); + case 3: /* Invalid setting. 
CS.L=1, CS.D=1 */ + /* fall through */ + default: + return -EINVAL; + } +} + +/** * insn_get_modrm_rm_off() - Obtain register in r/m part of the ModRM byte * @insn: Instruction containing the ModRM byte * @regs: Register values as seen when entering kernel mode -- 2.7.4
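To make the packed return value concrete, this small sketch reuses the INSN_CODE_SEG_PARAMS/INSN_CODE_SEG_ADDR_SZ/INSN_CODE_SEG_OPND_SZ macros from the patch to encode and decode the 64-bit code segment case (operand size 4, address size 8). It is a standalone illustration only.

#include <stdio.h>

/* Macros as defined in arch/x86/include/asm/insn-eval.h by this patch. */
#define INSN_CODE_SEG_ADDR_SZ(params) ((params >> 4) & 0xf)
#define INSN_CODE_SEG_OPND_SZ(params) (params & 0xf)
#define INSN_CODE_SEG_PARAMS(oper_sz, addr_sz) (oper_sz | (addr_sz << 4))

int main(void)
{
        int params = INSN_CODE_SEG_PARAMS(4, 8); /* IA-32e 64-bit mode: CS.L=1, CS.D=0 */

        printf("operand size: %d bytes\n", INSN_CODE_SEG_OPND_SZ(params)); /* 4 */
        printf("address size: %d bytes\n", INSN_CODE_SEG_ADDR_SZ(params)); /* 8 */
        return 0;
}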
[PATCH v9 17/29] x86/insn-eval: Indicate a 32-bit displacement if ModRM.mod is 0 and ModRM.rm is 101b
Section 2.2.1.3 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when ModRM.mod is zero and ModRM.rm is 101b, a 32-bit displacement follows the ModRM byte. This means that none of the registers are used in the computation of the effective address. A return value of -EDOM indicates callers that they should not use the value of registers when computing the effective address for the instruction. In long mode, the effective address is given by the 32-bit displacement plus the location of the next instruction. In protected mode, only the displacement is used. The instruction decoder takes care of obtaining the displacement. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/lib/insn-eval.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index cb2734a..dd84819 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -408,6 +408,14 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, switch (type) { case REG_TYPE_RM: regno = X86_MODRM_RM(insn->modrm.value); + + /* +* ModRM.mod == 0 and ModRM.rm == 5 means a 32-bit displacement +* follows the ModRM byte. +*/ + if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5) + return -EDOM; + if (X86_REX_B(insn->rex_prefix.value)) regno += 8; break; @@ -754,10 +762,21 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) eff_addr = base + indx * (1 << X86_SIB_SCALE(sib)); } else { addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); - if (addr_offset < 0) + /* +* -EDOM means that we must ignore the address_offset. +* In such a case, in 64-bit mode the effective address +* relative to the RIP of the following instruction. +*/ + if (addr_offset == -EDOM) { + if (user_64bit_mode(regs)) + eff_addr = (long)regs->ip + insn->length; + else + eff_addr = 0; + } else if (addr_offset < 0) { goto out; - - eff_addr = regs_get_register(regs, addr_offset); + } else { + eff_addr = regs_get_register(regs, addr_offset); + } } eff_addr += insn->displacement.value; -- 2.7.4
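A quick numeric sketch of the 64-bit case handled above, with made-up values for the instruction pointer and instruction length:

#include <stdio.h>

int main(void)
{
        long ip = 0x400500;      /* made-up address of the decoded instruction */
        int insn_len = 7;        /* made-up total instruction length */
        int disp32 = -0x80;      /* signed displacement parsed by the decoder */

        /* RIP-relative: the displacement is added to the next instruction's address. */
        long eff_addr = ip + insn_len + disp32;

        printf("effective address: 0x%lx\n", eff_addr); /* 0x400487 */
        return 0;
}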
[PATCH v9 21/29] x86/insn-eval: Handle 32-bit address encodings in virtual-8086 mode
It is possible to utilize 32-bit address encodings in virtual-8086 mode via an address override instruction prefix. However, the range of the effective address is still limited to [0x0000-0xffff]. In such a case, return error. Also, linear addresses in virtual-8086 mode are limited to 20 bits. Enforce such limit by truncating the most significant bytes of the computed linear address. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/lib/insn-eval.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 1d510a6..d43808c 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -862,12 +862,23 @@ static void __user *get_addr_ref_32(struct insn *insn, struct pt_regs *regs) goto out; /* +* Even though 32-bit address encodings are allowed in virtual-8086 +* mode, the address range is still limited to [0x0000-0xffff]. +*/ + if (v8086_mode(regs) && (eff_addr & ~0xffff)) + goto out; + + /* * Data type long could be 64 bits in size. Ensure that our 32-bit * effective address is not sign-extended when computing the linear * address. */ linear_addr = (unsigned long)(eff_addr & 0xffffffff) + seg_base_addr; + /* Limit linear address to 20 bits */ + if (v8086_mode(regs)) + linear_addr &= 0xfffff; + out: return (void __user *)linear_addr; } -- 2.7.4
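The following sketch illustrates, with made-up selector and offset values, the 16-bit range check and the 20-bit truncation added above:

#include <stdio.h>

int main(void)
{
        unsigned int sel = 0xffff;  /* made-up virtual-8086 segment selector */
        unsigned int off = 0xffff;  /* largest effective address that passes the check */

        /* Linear address: selector shifted left by 4, plus offset, kept to 20 bits. */
        unsigned int linear = ((sel << 4) + off) & 0xfffff;

        printf("linear address: 0x%05x\n", linear); /* 0x0ffef, wrapped at 1 MiB */
        return 0;
}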
[PATCH v9 22/29] x86/insn-eval: Add support to resolve 16-bit addressing encodings
Tasks running in virtual-8086 mode, in protected mode with code segment descriptors that specify 16-bit default address sizes via the D bit, or via an address override prefix will use 16-bit addressing form encodings as described in the Intel 64 and IA-32 Architecture Software Developer's Manual Volume 2A Section 2.1.5, Table 2-1. 16-bit addressing encodings differ in several ways from the 32-bit/64-bit addressing form encodings: ModRM.rm points to different registers and, in some cases, effective addresses are indicated by the addition of the value of two registers. Also, there is no support for SIB bytes. Thus, a separate function is needed to parse this form of addressing. A couple of functions are introduced. get_reg_offset_16() obtains the offset from the base of pt_regs of the registers indicated by the ModRM byte of the address encoding. get_addr_ref_16() computes the linear address indicated by the instructions using the value of the registers given by ModRM and the base address of the applicable segment. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/lib/insn-eval.c | 182 +++ 1 file changed, 182 insertions(+) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index d43808c..2f859a1 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -462,6 +462,80 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, } /** + * get_reg_offset_16() - Obtain offset of register indicated by instruction + * @insn: Instruction containing ModRM byte + * @regs: Register values as seen when entering kernel mode + * @offs1: Offset of the first operand register + * @offs2: Offset of the second opeand register, if applicable + * + * Obtain the offset, in pt_regs, of the registers indicated by the ModRM byte + * within insn. This function is to be used with 16-bit address encodings. The + * offs1 and offs2 will be written with the offset of the two registers + * indicated by the instruction. In cases where any of the registers is not + * referenced by the instruction, the value will be set to -EDOM. + * + * Returns: + * + * 0 on success, -EINVAL on error. + */ +static int get_reg_offset_16(struct insn *insn, struct pt_regs *regs, +int *offs1, int *offs2) +{ + /* +* 16-bit addressing can use one or two registers. Specifics of +* encodings are given in Table 2-1. "16-Bit Addressing Forms with the +* ModR/M Byte" of the Intel Software Development Manual. +*/ + static const int regoff1[] = { + offsetof(struct pt_regs, bx), + offsetof(struct pt_regs, bx), + offsetof(struct pt_regs, bp), + offsetof(struct pt_regs, bp), + offsetof(struct pt_regs, si), + offsetof(struct pt_regs, di), + offsetof(struct pt_regs, bp), + offsetof(struct pt_regs, bx), + }; + + static const int regoff2[] = { + offsetof(struct pt_regs, si), + offsetof(struct pt_regs, di), + offsetof(struct pt_regs, si), + offsetof(struct pt_regs, di), + -EDOM, + -EDOM, + -EDOM, + -EDOM, + }; + + if (!offs1 || !offs2) + return -EINVAL; + + /* Operand is a register, use the generic function. 
*/ + if (X86_MODRM_MOD(insn->modrm.value) == 3) { + *offs1 = insn_get_modrm_rm_off(insn, regs); + *offs2 = -EDOM; + return 0; + } + + *offs1 = regoff1[X86_MODRM_RM(insn->modrm.value)]; + *offs2 = regoff2[X86_MODRM_RM(insn->modrm.value)]; + + /* +* If ModRM.mod is 0 and ModRM.rm is 110b, then we use displacement- +* only addressing. This means that no registers are involved in +* computing the effective address. Thus, ensure that the first +* register offset is invalild. The second register offset is already +* invalid under the aforementioned conditions. +*/ + if ((X86_MODRM_MOD(insn->modrm.value) == 0) && + (X86_MODRM_RM(insn->modrm.value) == 6)) + *offs1 = -EDOM; + + return 0; +} + +/** * get_desc() - Obtain address of segment descriptor * @sel: Segment selector * @@ -713,6 +787,112 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs) } /** + * get_addr_ref_16() - Obtain the 16-bit address referred by instruction + * @insn: Instruction containing ModRM byte and displacement + * @regs: Register values as seen when entering kernel mode + * + * Thi
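As a concrete illustration of Table 2-1, the sketch below computes a 16-bit effective address for ModRM.rm = 000b (BX + SI plus a signed displacement); the register values are made up:

#include <stdio.h>

int main(void)
{
        unsigned short bx = 0x1000, si = 0x0200; /* made-up register values */
        short disp8 = -0x10;                     /* signed 8-bit displacement */

        /* ModRM.rm = 000b: effective address is BX + SI + displacement (mod 64K). */
        unsigned short eff_addr = (unsigned short)(bx + si + disp8);

        printf("effective address: 0x%04x\n", eff_addr); /* 0x11f0 */
        return 0;
}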
[PATCH v9 18/29] x86/insn-eval: Incorporate segment base in linear address computation
insn_get_addr_ref() returns the effective address as defined by the section 3.7.5.1 Vol 1 of the Intel 64 and IA-32 Architectures Software Developer's Manual. In order to compute the linear address, we must add to the effective address the segment base address as set in the segment descriptor. The segment descriptor to use depends on the register used as operand and segment override prefixes, if any. In most cases, the segment base address will be 0 if the USER_DS/USER32_DS segment is used or if segmentation is not used. However, the base address is not necessarily zero if a user programs defines its own segments. This is possible by using a local descriptor table. Since the effective address is a signed quantity, the unsigned segment base address is saved in a separate variable and added to the final, unsigned, effective address. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/lib/insn-eval.c | 30 +++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index dd84819..b3aa891 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -719,8 +719,8 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs) */ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) { - int addr_offset, base_offset, indx_offset; - unsigned long linear_addr = -1L; + int addr_offset, base_offset, indx_offset, seg_reg_indx; + unsigned long linear_addr = -1L, seg_base_addr; long eff_addr, base, indx; insn_byte_t sib; @@ -734,6 +734,14 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) goto out; eff_addr = regs_get_register(regs, addr_offset); + + seg_reg_indx = resolve_seg_reg(insn, regs, addr_offset); + if (seg_reg_indx < 0) + goto out; + + seg_base_addr = insn_get_seg_base(regs, seg_reg_indx); + if (seg_base_addr == -1L) + goto out; } else { if (insn->sib.nbytes) { /* @@ -760,6 +768,14 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) indx = regs_get_register(regs, indx_offset); eff_addr = base + indx * (1 << X86_SIB_SCALE(sib)); + + seg_reg_indx = resolve_seg_reg(insn, regs, base_offset); + if (seg_reg_indx < 0) + goto out; + + seg_base_addr = insn_get_seg_base(regs, seg_reg_indx); + if (seg_base_addr == -1L) + goto out; } else { addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); /* @@ -777,12 +793,20 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) } else { eff_addr = regs_get_register(regs, addr_offset); } + + seg_reg_indx = resolve_seg_reg(insn, regs, addr_offset); + if (seg_reg_indx < 0) + goto out; + + seg_base_addr = insn_get_seg_base(regs, seg_reg_indx); + if (seg_base_addr == -1L) + goto out; } eff_addr += insn->displacement.value; } - linear_addr = (unsigned long)eff_addr; + linear_addr = (unsigned long)eff_addr + seg_base_addr; out: return (void __user *)linear_addr; -- 2.7.4
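A small sketch of the final computation described above: the signed effective address is cast to an unsigned quantity and the unsigned segment base is added. The base value is made up and stands in for one read from an LDT descriptor:

#include <stdio.h>

int main(void)
{
        unsigned long seg_base = 0x10000; /* made-up base taken from an LDT descriptor */
        long eff_addr = -0x20;            /* signed effective address from the operands */

        /* The unsigned wraparound yields seg_base - 0x20, i.e. 0xffe0. */
        unsigned long linear = (unsigned long)eff_addr + seg_base;

        printf("linear address: 0x%lx\n", linear);
        return 0;
}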
[PATCH v9 04/29] uprobes/x86: Use existing definitions for segment override prefixes
Rather than using hard-coded values of the segment override prefixes, leverage the existing definitions provided in inat.h. Suggested-by: Borislav Petkov Cc: Andy Lutomirski Cc: Andrew Morton Cc: Borislav Petkov Cc: Masami Hiramatsu Cc: Denys Vlasenko Cc: Srikar Dronamraju Cc: Ravi V. Shankar Signed-off-by: Ricardo Neri --- arch/x86/kernel/uprobes.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c index 495c776..a3755d2 100644 --- a/arch/x86/kernel/uprobes.c +++ b/arch/x86/kernel/uprobes.c @@ -271,12 +271,15 @@ static bool is_prefix_bad(struct insn *insn) int i; for (i = 0; i < insn->prefixes.nbytes; i++) { - switch (insn->prefixes.bytes[i]) { - case 0x26: /* INAT_PFX_ES */ - case 0x2E: /* INAT_PFX_CS */ - case 0x36: /* INAT_PFX_DS */ - case 0x3E: /* INAT_PFX_SS */ - case 0xF0: /* INAT_PFX_LOCK */ + insn_attr_t attr; + + attr = inat_get_opcode_attribute(insn->prefixes.bytes[i]); + switch (attr) { + case INAT_MAKE_PREFIX(INAT_PFX_ES): + case INAT_MAKE_PREFIX(INAT_PFX_CS): + case INAT_MAKE_PREFIX(INAT_PFX_DS): + case INAT_MAKE_PREFIX(INAT_PFX_SS): + case INAT_MAKE_PREFIX(INAT_PFX_LOCK): return true; } } -- 2.7.4
[PATCH v9 13/29] x86/insn-eval: Add utility functions to get segment selector
When computing a linear address and segmentation is used, we need to know the base address of the segment involved in the computation. In most of the cases, the segment base address will be zero as in USER_DS/USER32_DS. However, it may be possible that a user space program defines its own segments via a local descriptor table. In such a case, the segment base address may not be zero. Thus, the segment base address is needed to calculate correctly the linear address. If running in protected mode, the segment selector to be used when computing a linear address is determined by either any of segment override prefixes in the instruction or inferred from the registers involved in the computation of the effective address; in that order. Also, there are cases when the segment override prefixes shall be ignored (i.e., code segments are always selected by the CS segment register; string instructions always use the ES segment register when using rDI register as operand). In long mode, segment registers are ignored, except for FS and GS. In these two cases, base addresses are obtained from the respective MSRs. For clarity, this process can be split into four steps (and an equal number of functions): determine if segment prefixes overrides can be used; parse the segment override prefixes, and use them if found; if not found or cannot be used, use the default segment registers associated with the operand registers. Once the segment register to use has been identified, read its value to obtain the segment selector. The method to obtain the segment selector depends on several factors. In 32-bit builds, segment selectors are saved into a pt_regs structure when switching to kernel mode. The same is also true for virtual-8086 mode. In 64-bit builds, segmentation is mostly ignored, except when running a program in 32-bit legacy mode. In this case, CS and SS can be obtained from pt_regs. DS, ES, FS and GS can be read directly from the respective segment registers. In order to identify the segment registers, a new set of #defines is introduced. It also includes two special identifiers. One of them indicates when the default segment register associated with instruction operands shall be used. Another one indicates that the contents of the segment register shall be ignored; this identifier is used when in long mode. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. 
Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/include/asm/inat.h | 10 ++ arch/x86/lib/insn-eval.c| 321 2 files changed, 331 insertions(+) diff --git a/arch/x86/include/asm/inat.h b/arch/x86/include/asm/inat.h index 02aff08..1c78580 100644 --- a/arch/x86/include/asm/inat.h +++ b/arch/x86/include/asm/inat.h @@ -97,6 +97,16 @@ #define INAT_MAKE_GROUP(grp) ((grp << INAT_GRP_OFFS) | INAT_MODRM) #define INAT_MAKE_IMM(imm) (imm << INAT_IMM_OFFS) +/* Identifiers for segment registers */ +#define INAT_SEG_REG_IGNORE0 +#define INAT_SEG_REG_DEFAULT 1 +#define INAT_SEG_REG_CS2 +#define INAT_SEG_REG_SS3 +#define INAT_SEG_REG_DS4 +#define INAT_SEG_REG_ES5 +#define INAT_SEG_REG_FS6 +#define INAT_SEG_REG_GS7 + /* Attribute search APIs */ extern insn_attr_t inat_get_opcode_attribute(insn_byte_t opcode); extern int inat_get_last_prefix_id(insn_byte_t last_pfx); diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index ac7b87c..77b48f9 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -9,6 +9,7 @@ #include #include #include +#include #undef pr_fmt #define pr_fmt(fmt) "insn: " fmt @@ -47,6 +48,326 @@ static bool is_string_insn(struct insn *insn) } } +/** + * get_overridden_seg_reg_idx() - obtain segment register override index + * @insn: Instruction with segment override prefixes + * + * Inspect the instruction prefixes and find segment overrides, if any. + * + * Returns: + * + * A constant identifying the segment register to use, among CS, SS, DS, + * ES, FS, or GS. INAT_SEG_REG_DEFAULT is returned if no segment override + * prefixes were found. + * + * -EINVAL in case of error. + */ +static int get_overridden_seg_reg_idx(struct insn *insn) +{ + int idx = INAT_SEG_REG_DEFAULT; + int sel_overrides = 0, i; + + if (!insn) + return -EINVAL; + + insn_get_prefixes(insn); + + /* Look for any segment override prefixes. */ + for (i = 0; i < insn->prefixes.nbytes; i++) { + insn_attr_t attr; + + attr = inat_get_opcode_attribute(insn->prefixes.bytes[i]); + switch (attr) { + case INAT_MAKE_PREFIX(INAT_PFX_CS): +
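As a rough summary of the selection order described above, here is a heavily simplified sketch. The real code inspects struct insn and pt_regs; the boolean inputs and the function name here are hypothetical stand-ins for that logic:

enum seg_reg { SEG_IGNORE, SEG_DEFAULT, SEG_CS, SEG_SS, SEG_DS, SEG_ES, SEG_FS, SEG_GS };

/*
 * Simplified decision order: string operands through (E)DI force ES; in long
 * mode only FS/GS overrides matter; otherwise an allowed override wins over
 * the default segment register associated with the operand register.
 */
enum seg_reg resolve_seg_reg_sketch(int long_mode, int override_allowed,
                                    enum seg_reg override, int uses_rdi_string_op)
{
        if (uses_rdi_string_op)
                return SEG_ES;

        if (long_mode)
                return (override == SEG_FS || override == SEG_GS) ? override : SEG_IGNORE;

        if (override_allowed && override != SEG_DEFAULT)
                return override;

        return SEG_DEFAULT; /* caller resolves the default for the operand register */
}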
[PATCH v9 14/29] x86/insn-eval: Add utility function to get segment descriptor
The segment descriptor contains information that is relevant to how linear addresses need to be computed. It contains the default size of addresses as well as the base address of the segment. Thus, given a segment selector, we ought to look at segment descriptor to correctly calculate the linear address. In protected mode, the segment selector might indicate a segment descriptor from either the global descriptor table or a local descriptor table. Both cases are considered in this function. This function is a prerequisite for functions in subsequent commits that will obtain the aforementioned attributes of the segment descriptor. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/lib/insn-eval.c | 57 1 file changed, 57 insertions(+) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 77b48f9..d599dc3 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -6,9 +6,13 @@ #include #include #include +#include +#include +#include #include #include #include +#include #include #undef pr_fmt @@ -450,6 +454,59 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, } /** + * get_desc() - Obtain address of segment descriptor + * @sel: Segment selector + * + * Given a segment selector, obtain a pointer to the segment descriptor. + * Both global and local descriptor tables are supported. + * + * Returns: + * + * Pointer to segment descriptor on success. + * + * NULL on error. + */ +static struct desc_struct *get_desc(unsigned short sel) +{ + struct desc_ptr gdt_desc = {0, 0}; + unsigned long desc_base; + +#ifdef CONFIG_MODIFY_LDT_SYSCALL + struct desc_struct *desc = NULL; + struct ldt_struct *ldt; + + if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT) { + /* Bits [15:3] contain the index of the desired entry. */ + sel >>= 3; + + mutex_lock(¤t->active_mm->context.lock); + ldt = current->active_mm->context.ldt; + if (ldt && sel < ldt->nr_entries) + desc = &ldt->entries[sel]; + + mutex_unlock(¤t->active_mm->context.lock); + + return desc; + } +#endif + native_store_gdt(&gdt_desc); + + /* +* Segment descriptors have a size of 8 bytes. Thus, the index is +* multiplied by 8 to obtain the memory offset of the desired descriptor +* from the base of the GDT. As bits [15:3] of the segment selector +* contain the index, it can be regarded as multiplied by 8 already. +* All that remains is to clear bits [2:0]. +*/ + desc_base = sel & ~(SEGMENT_RPL_MASK | SEGMENT_TI_MASK); + + if (desc_base > gdt_desc.size) + return NULL; + + return (struct desc_struct *)(gdt_desc.address + desc_base); +} + +/** * insn_get_modrm_rm_off() - Obtain register in r/m part of the ModRM byte * @insn: Instruction containing the ModRM byte * @regs: Register values as seen when entering kernel mode -- 2.7.4
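For reference, the sketch below decodes a segment selector into the fields get_desc() relies on: the descriptor index in bits [15:3], the table indicator (GDT/LDT) in bit 2 and the RPL in bits [1:0]. The selector value is made up:

#include <stdio.h>

int main(void)
{
        unsigned short sel = 0x004f;        /* made-up segment selector */
        unsigned int index = sel >> 3;      /* descriptor index, bits [15:3] */
        unsigned int ti = (sel >> 2) & 1;   /* table indicator: 0 = GDT, 1 = LDT */
        unsigned int rpl = sel & 3;         /* requested privilege level, bits [1:0] */

        /* The byte offset of the descriptor within its table is index * 8. */
        printf("index=%u table=%s rpl=%u offset=%u\n",
               index, ti ? "LDT" : "GDT", rpl, index * 8);
        return 0;
}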
[PATCH v9 12/29] x86/insn-eval: Add utility function to identify string instructions
String instructions are special because, in protected mode, the linear address is always obtained via the ES segment register in operands that use the (E)DI register; the DS segment register in operands that use the (E)SI register. Furthermore, segment override prefixes are ignored when calculating a linear address involving the (E)DI register; segment override prefixes can be used when calculating linear addresses involving the (E)SI register. It follows that linear addresses are calculated differently for the case of string instructions. The purpose of this utility function is to identify such instructions for callers to determine a linear address correctly. Note that this function only identifies string instructions; it does not determine what segment register to use in the address computation. That is left to callers. A subsequent commmit introduces a function to determine the segment register to use given the instruction, operands and segment override prefixes. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Reviewed-by: Borislav Petkov Signed-off-by: Ricardo Neri --- arch/x86/lib/insn-eval.c | 28 1 file changed, 28 insertions(+) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 405ffeb..ac7b87c 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -19,6 +19,34 @@ enum reg_type { REG_TYPE_BASE, }; +/** + * is_string_insn() - Determine if instruction is a string instruction + * @insn: Instruction containing the opcode to inspect + * + * Returns: + * + * true if the instruction, determined by the opcode, is any of the + * string instructions as defined in the Intel Software Development manual. + * False otherwise. + */ +static bool is_string_insn(struct insn *insn) +{ + insn_get_opcode(insn); + + /* All string instructions have a 1-byte opcode. */ + if (insn->opcode.nbytes != 1) + return false; + + switch (insn->opcode.bytes[0]) { + case 0x6c ... 0x6f: /* INS, OUTS */ + case 0xa4 ... 0xa7: /* MOVS, CMPS */ + case 0xaa ... 0xaf: /* STOS, LODS, SCAS */ + return true; + default: + return false; + } +} + static int get_reg_offset(struct insn *insn, struct pt_regs *regs, enum reg_type type) { -- 2.7.4
[PATCH v9 15/29] x86/insn-eval: Add utility functions to get segment descriptor base address and limit
With segmentation, the base address of the segment is needed to compute a linear address. This base address is obtained from the applicable segment descriptor. Such segment descriptor is referenced from a segment selector. These new functions obtain the segment base and limit of the segment selector indicated by segment register index given as argument. This index is any of the INAT_SEG_REG_* family of #define's. The logic to obtain the segment selector is wrapped in the function get_seg_selector() with the inputs described above. Once the selector is known, the base address is determined. In protected mode, the selector is used to obtain the segment descriptor and then its base address. In 64-bit user mode, the segment base address is zero except when FS or GS are used. In virtual-8086 mode, the base address is computed as the value of the segment selector shifted 4 positions to the left. In protected mode, segment limits are enforced. Thus, a function to determine the limit of the segment is added. Segment limits are not enforced in long or virtual-8086. For the latter, addresses are limited to 20 bits; address size will be handled when computing the linear address. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/include/asm/insn-eval.h | 1 + arch/x86/lib/insn-eval.c | 117 +++ 2 files changed, 118 insertions(+) diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h index 7e8c963..25d6e44 100644 --- a/arch/x86/include/asm/insn-eval.h +++ b/arch/x86/include/asm/insn-eval.h @@ -13,5 +13,6 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs); int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs); +unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx); #endif /* _ASM_X86_INSN_EVAL_H */ diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index d599dc3..02c4498 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -507,6 +507,123 @@ static struct desc_struct *get_desc(unsigned short sel) } /** + * insn_get_seg_base() - Obtain base address of segment descriptor. + * @regs: Register values as seen when entering kernel mode + * @seg_reg_idx: Index of the segment register pointing to seg descriptor + * + * Obtain the base address of the segment as indicated by the segment descriptor + * pointed by the segment selector. The segment selector is obtained from the + * input segment register index seg_reg_idx. + * + * Returns: + * + * In protected mode, base address of the segment. Zero in long mode, + * except when FS or GS are used. In virtual-8086 mode, the segment + * selector shifted 4 positions to the right. + * + * -1L in case of error. + */ +unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx) +{ + struct desc_struct *desc; + short sel; + + sel = get_segment_selector(regs, seg_reg_idx); + if (sel < 0) + return -1L; + + if (v8086_mode(regs)) + /* +* Base is simply the segment selector shifted 4 +* positions to the right. +*/ + return (unsigned long)(sel << 4); + + if (user_64bit_mode(regs)) { + /* +* Only FS or GS will have a base address, the rest of +* the segments' bases are forced to 0. 
+*/ + unsigned long base; + + if (seg_reg_idx == INAT_SEG_REG_FS) + rdmsrl(MSR_FS_BASE, base); + else if (seg_reg_idx == INAT_SEG_REG_GS) + /* +* swapgs was called at the kernel entry point. Thus, +* MSR_KERNEL_GS_BASE will have the user-space GS base. +*/ + rdmsrl(MSR_KERNEL_GS_BASE, base); + else if (seg_reg_idx != INAT_SEG_REG_IGNORE) + /* We should ignore the rest of segment registers. */ + base = -1L; + else + base = 0; + return base; + } + + /* In protected mode the segment selector cannot be null. */ + if (!sel) + return -1L; + + desc = get_desc(sel); + if (!desc) + return -1L; + + return get_desc_base(desc); +} + +/** + * get_seg_limit() - Obtain the limit of a segment descriptor + * @regs: Register values as seen when entering kernel mode + * @seg_reg_idx: Index of the segment register pointing to seg descriptor + * + * Obtain the limit of the segment as indicated by the
[PATCH v9 11/29] x86/insn-eval: Add a utility function to get register offsets
The function get_reg_offset() returns the offset to the register the argument specifies as indicated in an enumeration of type offset. Callers of this function would need the definition of such enumeration. This is not needed. Instead, add helper functions for this purpose. These functions are useful in cases when, for instance, the caller needs to decide whether the operand is a register or a memory location by looking at the rm part of the ModRM byte. As of now, this is the only helper function that is needed. Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Reviewed-by: Borislav Petkov Signed-off-by: Ricardo Neri --- arch/x86/include/asm/insn-eval.h | 1 + arch/x86/lib/insn-eval.c | 17 + 2 files changed, 18 insertions(+) diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h index 5cab1b1..7e8c963 100644 --- a/arch/x86/include/asm/insn-eval.h +++ b/arch/x86/include/asm/insn-eval.h @@ -12,5 +12,6 @@ #include void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs); +int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs); #endif /* _ASM_X86_INSN_EVAL_H */ diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 4931d92..405ffeb 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -100,6 +100,23 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, return regoff[regno]; } +/** + * insn_get_modrm_rm_off() - Obtain register in r/m part of the ModRM byte + * @insn: Instruction containing the ModRM byte + * @regs: Register values as seen when entering kernel mode + * + * Returns: + * + * The register indicated by the r/m part of the ModRM byte. The + * register is obtained as an offset from the base of pt_regs. In specific + * cases, the returned value can be -EDOM to indicate that the particular value + * of ModRM does not refer to a register and shall be ignored. + */ +int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs) +{ + return get_reg_offset(insn, regs, REG_TYPE_RM); +} + /* * return the address being referenced be instruction * for rm=3 returning the content of the rm reg -- 2.7.4
[PATCH v9 07/29] x86/mpx: Do not use SIB.index if its value is 100b and ModRM.mod is not 11b
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when ModRM.mod !=11b and ModRM.rm = 100b indexed register-indirect addressing is used. In other words, a SIB byte follows the ModRM byte. In the specific case of SIB.index = 100b, the scale*index portion of the computation of the effective address is null. To signal callers of this particular situation, get_reg_offset() can return -EDOM (-EINVAL continues to indicate that an error when decoding the SIB byte). An example of this situation can be the following instruction: 8b 4c 23 80 mov -0x80(%rbx,%riz,1),%rcx ModRM:0x4c [mod:1b][reg:1b][rm:100b] SIB: 0x23 [scale:0b][index:100b][base:11b] Displacement: 0x80 (1-byte, as per ModRM.mod = 1b) The %riz 'register' indicates a null index. In long mode, a REX prefix may be used. When a REX prefix is present, REX.X adds a fourth bit to the register selection of SIB.index. This gives the ability to refer to all the 16 general purpose registers. When REX.X is 1b and SIB.index is 100b, the index is indicated in %r12. In our example, this would look like: 42 8b 4c 23 80mov -0x80(%rbx,%r12,1),%rcx REX: 0x42 [W:0b][R:0b][X:1b][B:0b] ModRM:0x4c [mod:1b][reg:1b][rm:100b] SIB: 0x23 [scale:0b][.X: 1b, index:100b][.B:0b, base:11b] Displacement: 0x80 (1-byte, as per ModRM.mod = 1b) %r12 is a valid register to use in the scale*index part of the effective address computation. Cc: Borislav Petkov Cc: Andy Lutomirski Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Peter Zijlstra Cc: Nathan Howard Cc: Adan Hawthorn Cc: Joe Perches Cc: Ravi V. Shankar Cc: x...@kernel.org Reviewed-by: Borislav Petkov Signed-off-by: Ricardo Neri --- arch/x86/mm/mpx.c | 21 +++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c index 57e5bf5..2ad1d4a 100644 --- a/arch/x86/mm/mpx.c +++ b/arch/x86/mm/mpx.c @@ -110,6 +110,15 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, regno = X86_SIB_INDEX(insn->sib.value); if (X86_REX_X(insn->rex_prefix.value)) regno += 8; + + /* +* If ModRM.mod != 3 and SIB.index = 4 the scale*index +* portion of the address computation is null. This is +* true only if REX.X is 0. In such a case, the SIB index +* is used in the address computation. +*/ + if (X86_MODRM_MOD(insn->modrm.value) != 3 && regno == 4) + return -EDOM; break; case REG_TYPE_BASE: @@ -160,11 +169,19 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) goto out; indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX); - if (indx_offset < 0) + /* +* A negative offset generally means a error, except +* -EDOM, which means that the contents of the register +* should not be used as index. +*/ + if (indx_offset == -EDOM) + indx = 0; + else if (indx_offset < 0) goto out; + else + indx = regs_get_register(regs, indx_offset); base = regs_get_register(regs, base_offset); - indx = regs_get_register(regs, indx_offset); eff_addr = base + indx * (1 << X86_SIB_SCALE(sib)); } else { -- 2.7.4
[PATCH v9 06/29] x86/mpx: Use signed variables to compute effective addresses
Even though memory addresses are unsigned, the operands used to compute the effective address do have a sign. This is true for ModRM.rm, SIB.base, SIB.index as well as the displacement bytes. Thus, signed variables shall be used when computing the effective address from these operands. Once the signed effective address has been computed, it is casted to an unsigned long to determine the linear address. Variables are renamed to better reflect the type of address being computed. Cc: Borislav Petkov Cc: Andy Lutomirski Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Peter Zijlstra Cc: Nathan Howard Cc: Adan Hawthorn Cc: Joe Perches Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/mm/mpx.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c index f4c48a0..57e5bf5 100644 --- a/arch/x86/mm/mpx.c +++ b/arch/x86/mm/mpx.c @@ -138,8 +138,9 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, */ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) { - unsigned long addr = -1L, base, indx; int addr_offset, base_offset, indx_offset; + unsigned long linear_addr = -1L; + long eff_addr, base, indx; insn_byte_t sib; insn_get_modrm(insn); @@ -150,7 +151,8 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); if (addr_offset < 0) goto out; - addr = regs_get_register(regs, addr_offset); + + eff_addr = regs_get_register(regs, addr_offset); } else { if (insn->sib.nbytes) { base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE); @@ -163,17 +165,23 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) base = regs_get_register(regs, base_offset); indx = regs_get_register(regs, indx_offset); - addr = base + indx * (1 << X86_SIB_SCALE(sib)); + + eff_addr = base + indx * (1 << X86_SIB_SCALE(sib)); } else { addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); if (addr_offset < 0) goto out; - addr = regs_get_register(regs, addr_offset); + + eff_addr = regs_get_register(regs, addr_offset); } - addr += insn->displacement.value; + + eff_addr += insn->displacement.value; } + + linear_addr = (unsigned long)eff_addr; + out: - return (void __user *)addr; + return (void __user *)linear_addr; } static int mpx_insn_decode(struct insn *insn, -- 2.7.4
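A short numeric sketch of why the signed intermediate matters: the displacement is negative and must subtract from the base before the single cast to unsigned at the end. The values are made up:

#include <stdio.h>

int main(void)
{
        long base = 0x7fff0000; /* made-up register value used as the base */
        int disp = -0x10;       /* displacement bytes are sign-extended by the decoder */

        long eff_addr = base + disp;                    /* signed math: 0x7ffefff0 */
        unsigned long linear = (unsigned long)eff_addr; /* single cast at the end */

        printf("effective 0x%lx linear 0x%lx\n", eff_addr, linear);
        return 0;
}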
[PATCH v9 08/29] x86/mpx: Do not use SIB.base if its value is 101b and ModRM.mod = 0
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that if a SIB byte is used and SIB.base is 101b and ModRM.mod is zero, then the base part of the base part of the effective address computation is null. To signal this situation, a -EDOM error is returned to indicate callers to ignore the base value present in the register operand. In this scenario, a 32-bit displacement follows the SIB byte. Displacement is obtained when the instruction decoder parses the operands. Cc: Borislav Petkov Cc: Andy Lutomirski Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Peter Zijlstra Cc: Nathan Howard Cc: Adan Hawthorn Cc: Joe Perches Cc: Ravi V. Shankar Cc: x...@kernel.org Reviewed-by: Borislav Petkov Signed-off-by: Ricardo Neri --- arch/x86/mm/mpx.c | 28 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c index 2ad1d4a..581a960 100644 --- a/arch/x86/mm/mpx.c +++ b/arch/x86/mm/mpx.c @@ -123,6 +123,14 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, case REG_TYPE_BASE: regno = X86_SIB_BASE(insn->sib.value); + /* +* If ModRM.mod is 0 and SIB.base == 5, the base of the +* register-indirect addressing is 0. In this case, a +* 32-bit displacement follows the SIB byte. +*/ + if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5) + return -EDOM; + if (X86_REX_B(insn->rex_prefix.value)) regno += 8; break; @@ -164,16 +172,22 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) eff_addr = regs_get_register(regs, addr_offset); } else { if (insn->sib.nbytes) { + /* +* Negative values in the base and index offset means +* an error when decoding the SIB byte. Except -EDOM, +* which means that the registers should not be used +* in the address computation. +*/ base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE); - if (base_offset < 0) + if (base_offset == -EDOM) + base = 0; + else if (base_offset < 0) goto out; + else + base = regs_get_register(regs, base_offset); indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX); - /* -* A negative offset generally means a error, except -* -EDOM, which means that the contents of the register -* should not be used as index. -*/ + if (indx_offset == -EDOM) indx = 0; else if (indx_offset < 0) @@ -181,8 +195,6 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) else indx = regs_get_register(regs, indx_offset); - base = regs_get_register(regs, base_offset); - eff_addr = base + indx * (1 << X86_SIB_SCALE(sib)); } else { addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); -- 2.7.4
[PATCH v9 09/29] x86/mpx, x86/insn: Relocate insn util functions to a new insn-eval file
Other kernel submodules can benefit from using the utility functions defined in mpx.c to obtain the addresses and values of operands contained in the general purpose registers. An instance of this is the emulation code used for instructions protected by the Intel User-Mode Instruction Prevention feature. Thus, these functions are relocated to a new insn-eval.c file. The reason to not relocate these utilities into insn.c is that the latter solely analyses instructions given by a struct insn without any knowledge of the meaning of the values of instruction operands. This new utility insn- eval.c aims to be used to resolve userspace linear addresses based on the contents of the instruction operands as well as the contents of pt_regs structure. These utilities come with a separate header. This is to avoid taking insn.c out of sync from the instructions decoders under tools/obj and tools/perf. This also avoids adding cumbersome #ifdef's for the #include'd files required to decode instructions in a kernel context. Functions are simply relocated. There are not functional or indentation changes. Cc: Borislav Petkov Cc: Andy Lutomirski Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Reviewed-by: Borislav Petkov Signed-off-by: Ricardo Neri --- The checkpatch script issues the following warning with this commit: WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON() + BUG(); This warning will be fixed in a subsequent patch. --- arch/x86/include/asm/insn-eval.h | 16 arch/x86/lib/Makefile| 2 +- arch/x86/lib/insn-eval.c | 163 +++ arch/x86/mm/mpx.c| 156 + 4 files changed, 182 insertions(+), 155 deletions(-) create mode 100644 arch/x86/include/asm/insn-eval.h create mode 100644 arch/x86/lib/insn-eval.c diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h new file mode 100644 index 000..5cab1b1 --- /dev/null +++ b/arch/x86/include/asm/insn-eval.h @@ -0,0 +1,16 @@ +#ifndef _ASM_X86_INSN_EVAL_H +#define _ASM_X86_INSN_EVAL_H +/* + * A collection of utility functions for x86 instruction analysis to be + * used in a kernel context. Useful when, for instance, making sense + * of the registers indicated by operands. 
+ */ + +#include +#include +#include +#include + +void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs); + +#endif /* _ASM_X86_INSN_EVAL_H */ diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile index 34a7413..675d7b0 100644 --- a/arch/x86/lib/Makefile +++ b/arch/x86/lib/Makefile @@ -23,7 +23,7 @@ lib-y := delay.o misc.o cmdline.o cpu.o lib-y += usercopy_$(BITS).o usercopy.o getuser.o putuser.o lib-y += memcpy_$(BITS).o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o -lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o +lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c new file mode 100644 index 000..df9418c --- /dev/null +++ b/arch/x86/lib/insn-eval.c @@ -0,0 +1,163 @@ +/* + * Utility functions for x86 operand and address decoding + * + * Copyright (C) Intel Corporation 2017 + */ +#include +#include +#include +#include +#include + +enum reg_type { + REG_TYPE_RM = 0, + REG_TYPE_INDEX, + REG_TYPE_BASE, +}; + +static int get_reg_offset(struct insn *insn, struct pt_regs *regs, + enum reg_type type) +{ + int regno = 0; + + static const int regoff[] = { + offsetof(struct pt_regs, ax), + offsetof(struct pt_regs, cx), + offsetof(struct pt_regs, dx), + offsetof(struct pt_regs, bx), + offsetof(struct pt_regs, sp), + offsetof(struct pt_regs, bp), + offsetof(struct pt_regs, si), + offsetof(struct pt_regs, di), +#ifdef CONFIG_X86_64 + offsetof(struct pt_regs, r8), + offsetof(struct pt_regs, r9), + offsetof(struct pt_regs, r10), + offsetof(struct pt_regs, r11), + offsetof(struct pt_regs, r12), + offsetof(struct pt_regs, r13), + offsetof(struct pt_regs, r14), + offsetof(struct pt_regs, r15), +#endif + }; + int nr_registers = ARRAY_SIZE(regoff); + /* +* Don't possibly decode a 32-bit instructions as +* reading a 64-bit-only register. +*/ + if (IS_ENABLED(CONFIG_X86_64) && !insn->x86_64) + nr_registers -= 8; + + switch (type) { +
[PATCH v9 10/29] x86/insn-eval: Do not BUG on invalid register type
We are not in a critical failure path. An invalid register type results from trying to decode invalid instruction bytes from a user-space program. Thus, simply print an error message. To prevent this error message from being abused by user-space programs, use the rate-limited variant of pr_err() along with a descriptive prefix. Cc: Borislav Petkov Cc: Andy Lutomirski Cc: Dave Hansen Cc: Adam Buchbinder Cc: Colin Ian King Cc: Lorenzo Stoakes Cc: Qiaowei Ren Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Adrian Hunter Cc: Kees Cook Cc: Thomas Garnier Cc: Peter Zijlstra Cc: Dmitry Vyukov Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- arch/x86/lib/insn-eval.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index df9418c..4931d92 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -5,10 +5,14 @@ */ #include #include +#include #include #include #include +#undef pr_fmt +#define pr_fmt(fmt) "insn: " fmt + enum reg_type { REG_TYPE_RM = 0, REG_TYPE_INDEX, @@ -85,9 +89,8 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, break; default: - pr_err("invalid register type"); - BUG(); - break; + pr_err_ratelimited("invalid register type: %d\n", type); + return -EINVAL; } if (regno >= nr_registers) { -- 2.7.4
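As an aside, a minimal standalone sketch of the logging pattern used above (the helper name is made up for illustration): pr_fmt supplies the "insn: " prefix and the rate limit keeps a misbehaving user-space program from flooding the kernel log.

/* Prefix every pr_*() message emitted from this file with "insn: ". */
#define pr_fmt(fmt) "insn: " fmt

#include <linux/errno.h>
#include <linux/printk.h>

static int validate_reg_type(int type, int nr_types)
{
        if (type < 0 || type >= nr_types) {
                /* Rate-limited: user space cannot use this to spam the log. */
                pr_err_ratelimited("invalid register type: %d\n", type);
                return -EINVAL;
        }

        return 0;
}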
[PATCH v9 00/29] x86: Enable User-Mode Instruction Prevention
This is v9 of this series. The eight previous submissions can be found here [1], here [2], here [3], here [4], here [5], here [6], here [7] and here [8]. This version addresses the feedback comments from Borislav Petkov received on v7. Please see details in the change log. === What is UMIP? User-Mode Instruction Prevention (UMIP) is a security feature present in new Intel processors. If enabled, it prevents the execution of certain instructions if the Current Privilege Level (CPL) is greater than 0. If these instructions were executed while in CPL > 0, user space applications could have access to system-wide settings such as the global and local descriptor tables, the segment selectors to the current task state and the local descriptor table. Hiding these system resources reduces the tools available to craft privilege escalation attacks such as [9]. These are the instructions covered by UMIP: * SGDT - Store Global Descriptor Table * SIDT - Store Interrupt Descriptor Table * SLDT - Store Local Descriptor Table * SMSW - Store Machine Status Word * STR - Store Task Register If any of these instructions is executed with CPL > 0, a general protection exception is issued when UMIP is enabled. === How does it impact applications? When enabled, however, UMIP changes the behavior that certain applications expect from the operating system. For instance, programs running on WineHQ and DOSEMU2 rely on some of these instructions to function. Stas Sergeev found that Microsoft Windows 3.1 and dos4gw use the instruction SMSW when running in virtual-8086 mode [10]. SGDT and SIDT can also be used in virtual-8086 mode. In order to not change the behavior of the system, this patchset emulates SGDT, SIDT and SMSW. This should be sufficient to not break the applications mentioned above. Regarding the two remaining instructions, STR and SLDT, the WineHQ team has shown interest in catching the general protection fault and using it as a vehicle to fix broken applications [11]. Furthermore, STR and SLDT can only run in protected and long modes. DOSEMU2 emulates virtual-8086 mode via KVM. No applications will be broken unless DOSEMU2 decides to enable the CR4.UMIP bit on platforms that support it. Also, this should not pose a security risk as no system resources would be revealed. Instead, code running inside the KVM guest would only see the KVM's GDT, IDT and MSW. Please note that UMIP is always enabled for both 64-bit and 32-bit Linux builds. However, emulation of the UMIP-protected instructions is not done for 64-bit processes. 64-bit user space applications will receive the SIGSEGV signal when UMIP-protected instructions cause a general protection fault. === How are UMIP-protected instructions emulated? UMIP is kept enabled at all times when the CONFIG_X86_INTEL_UMIP option is selected. If a general protection fault caused by the instructions protected by UMIP is detected, the fault is trapped and fixed up. The return values are dummies, as follows: * SGDT and SIDT return hard-coded dummy values as the base of the global descriptor and interrupt descriptor tables. These hard-coded values correspond to memory addresses that are near the end of the kernel memory map. This is also the case for virtual-8086 mode tasks. In all my experiments with 32-bit processes, the base of GDT and IDT was always a 4-byte address, even for 16-bit operands. Thus, my emulation code does the same. In all cases, the limit of the table is set to 0. * SMSW returns the value with which the CR0 register is programmed in head_32/64.S at boot time.
That is, the following bits are enabled: CR0.0 for Protection Enable, CR0.1 for Monitor Coprocessor, CR0.4 for Extension Type, which will always be 1 in recent processors with UMIP; CR0.5 for Numeric Error, CR0.16 for Write Protect, CR0.18 for Alignment Mask and CR0.31 for Paging. As per the Intel 64 and IA-32 Architectures Software Developer's Manual, SMSW returns a 16-bit result for memory operands. However, when the operand is a register, the result can be up to CR0[63:0]. Since the emulation code only kicks in for 32-bit processes, we return up to CR0[31:0]. * The proposed emulation code handles faults that happen in both protected and virtual-8086 modes. * Again, STR and SLDT are not emulated. === How is this series laid out? ++ Preparatory work As per suggestions from Andy Lutomirski and Borislav Petkov, I moved the x86 page fault error codes to a header. Also, I made user_64bit_mode available to x86_32 builds. This helps to reuse code and reduce the number of #ifdef's in these patches. Borislav also suggested that uprobes should use the existing definitions in arch/x86/include/asm/inat.h instead of hard-coded values when checking instruction prefixes. I included this change in the series. ++ Fix bugs in MPX address decoder I found very useful the code for Intel MPX (Memory Protection Extensions) used to parse opcodes and the memory locations contained in th
[PATCH v9 02/29] x86/boot: Relocate definition of the initial state of CR0
Both head_32.S and head_64.S utilize the same value to initialize the control register CR0. Also, other parts of the kernel might want to access this initial definition (e.g., emulation code for User-Mode Instruction Prevention uses this state to provide a sane dummy value for CR0 when emulating the smsw instruction). Thus, relocate this definition to a header file from which it can be conveniently accessed. Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-a...@vger.kernel.org Cc: linux...@kvack.org Suggested-by: Borislav Petkov Reviewed-by: Borislav Petkov Signed-off-by: Ricardo Neri --- arch/x86/include/uapi/asm/processor-flags.h | 3 +++ arch/x86/kernel/head_32.S | 3 --- arch/x86/kernel/head_64.S | 3 --- 3 files changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h index 185f3d1..39946d0 100644 --- a/arch/x86/include/uapi/asm/processor-flags.h +++ b/arch/x86/include/uapi/asm/processor-flags.h @@ -151,5 +151,8 @@ #define CX86_ARR_BASE 0xc4 #define CX86_RCR_BASE 0xdc +#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \ +X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \ +X86_CR0_PG) #endif /* _UAPI_ASM_X86_PROCESSOR_FLAGS_H */ diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S index 9ed3074..c3cfc65 100644 --- a/arch/x86/kernel/head_32.S +++ b/arch/x86/kernel/head_32.S @@ -211,9 +211,6 @@ ENTRY(startup_32_smp) #endif .Ldefault_entry: -#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \ -X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \ -X86_CR0_PG) movl $(CR0_STATE & ~X86_CR0_PG),%eax movl %eax,%cr0 diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index 42e32c2..205dabc 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -152,9 +152,6 @@ ENTRY(secondary_startup_64) 1: wrmsr /* Make changes effective */ /* Setup cr0 */ -#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \ -X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \ -X86_CR0_PG) movl$CR0_STATE, %eax /* Make changes effective */ movq%rax, %cr0 -- 2.7.4
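A rough sketch (not code from this series; the function name and the store-to-user logic are assumptions) of how UMIP emulation could reuse the now-shared CR0_STATE as the dummy machine status word returned for an emulated SMSW:

#include <linux/errno.h>
#include <linux/uaccess.h>
#include <asm/processor-flags.h>

/*
 * Illustrative only: return the low 16 bits of the boot-time CR0 value
 * as the machine status word an emulated SMSW would store at the
 * user-space address of the decoded memory operand.
 */
static int emulate_smsw_store(void __user *dst)
{
        unsigned short msw = (unsigned short)CR0_STATE;

        if (copy_to_user(dst, &msw, sizeof(msw)))
                return -EFAULT;

        return 0;
}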
[PATCH v9 01/29] x86/mm: Relocate page fault error codes to traps.h
Up to this point, only fault.c used the definitions of the page fault error codes. Thus, it made sense to keep them within such file. Other portions of code might be interested in those definitions too. For instance, the User- Mode Instruction Prevention emulation code will use such definitions to emulate a page fault when it is unable to successfully copy the results of the emulated instructions to user space. While relocating the error code enumeration, the prefix X86_ is used to make it consistent with the rest of the definitions in traps.h. Of course, code using the enumeration had to be updated as well. No functional changes were performed. Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Andy Lutomirski Cc: "Kirill A. Shutemov" Cc: Josh Poimboeuf Cc: Dave Hansen Cc: Paul Gortmaker Cc: x...@kernel.org Reviewed-by: Andy Lutomirski Reviewed-by: Borislav Petkov Signed-off-by: Ricardo Neri --- arch/x86/include/asm/traps.h | 18 + arch/x86/mm/fault.c | 88 +--- 2 files changed, 52 insertions(+), 54 deletions(-) diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 5545f64..da3c3a3 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -144,4 +144,22 @@ enum { X86_TRAP_IRET = 32, /* 32, IRET Exception */ }; +/* + * Page fault error code bits: + * + * bit 0 == 0: no page found 1: protection fault + * bit 1 == 0: read access 1: write access + * bit 2 == 0: kernel-mode access 1: user-mode access + * bit 3 == 1: use of reserved bit detected + * bit 4 == 1: fault was an instruction fetch + * bit 5 == 1: protection keys block access + */ +enum x86_pf_error_code { + X86_PF_PROT = 1 << 0, + X86_PF_WRITE= 1 << 1, + X86_PF_USER = 1 << 2, + X86_PF_RSVD = 1 << 3, + X86_PF_INSTR= 1 << 4, + X86_PF_PK = 1 << 5, +}; #endif /* _ASM_X86_TRAPS_H */ diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e2baeaa..db71c73 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -29,26 +29,6 @@ #include /* - * Page fault error code bits: - * - * bit 0 == 0: no page found 1: protection fault - * bit 1 == 0: read access 1: write access - * bit 2 == 0: kernel-mode access 1: user-mode access - * bit 3 == 1: use of reserved bit detected - * bit 4 == 1: fault was an instruction fetch - * bit 5 == 1: protection keys block access - */ -enum x86_pf_error_code { - - PF_PROT = 1 << 0, - PF_WRITE= 1 << 1, - PF_USER = 1 << 2, - PF_RSVD = 1 << 3, - PF_INSTR= 1 << 4, - PF_PK = 1 << 5, -}; - -/* * Returns 0 if mmiotrace is disabled, or if the fault is not * handled by mmiotrace: */ @@ -149,7 +129,7 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr) * If it was a exec (instruction fetch) fault on NX page, then * do not ignore the fault: */ - if (error_code & PF_INSTR) + if (error_code & X86_PF_INSTR) return 0; instr = (void *)convert_ip_to_linear(current, regs); @@ -179,7 +159,7 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr) * siginfo so userspace can discover which protection key was set * on the PTE. * - * If we get here, we know that the hardware signaled a PF_PK + * If we get here, we know that the hardware signaled a X86_PF_PK * fault and that there was a VMA once we got in the fault * handler. It does *not* guarantee that the VMA we find here * was the one that we faulted on. @@ -204,7 +184,7 @@ static void fill_sig_info_pkey(int si_code, siginfo_t *info, u32 *pkey) /* * force_sig_info_fault() is called from a number of * contexts, some of which have a VMA and some of which -* do not. 
The PF_PK handing happens after we have a +* do not. The X86_PF_PK handing happens after we have a * valid VMA, so we should never reach this without a * valid VMA. */ @@ -697,7 +677,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, if (!oops_may_print()) return; - if (error_code & PF_INSTR) { + if (error_code & X86_PF_INSTR) { unsigned int level; pgd_t *pgd; pte_t *pte; @@ -779,7 +759,7 @@ no_context(struct pt_regs *regs, unsigned long error_code, */ if (current->thread.sig_on_uaccess_err && signal) {
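As a small illustration of the relocated definitions (a hypothetical helper, not part of this patch), other code can now test page fault error code bits through traps.h:

#include <linux/types.h>
#include <asm/traps.h>

/* True if the error code describes a user-mode instruction fetch fault. */
static bool is_user_fetch_fault(unsigned long error_code)
{
        return (error_code & (X86_PF_USER | X86_PF_INSTR)) ==
               (X86_PF_USER | X86_PF_INSTR);
}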
Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables
From: Santosh Shilimkar Date: Tue, 3 Oct 2017 17:42:39 -0700 > Hi Dave, > > On 10/2/2017 1:30 PM, Santosh Shilimkar wrote: >> On 10/1/2017 10:56 PM, David Miller wrote: >>> From: David Miller > [...] > >>> >>> Actually, reverted, this breaks the build. >>> >>> net/rds/rdma_transport.c:38:10: fatal error: ib.h: No such file or >>> directory >>> #include "ib.h" >>> >>> Although I can't see how in the world this patch is causing such >>> an error. >>> >> Weird indeed. Will sort this out with Avinash. Thanks Dave. >> > I tried few build combinations on net-next but couldn't > reproduce the build failure. AFAIU, the build error can only > happen if for some reason the ib.h file got deleted > accidentally. I did delete ib.h file and saw below error > > CC [M] net/rds/rdma_transport.o > net/rds/rdma_transport.c:38:16: error: ib.h: No such file or directory > > Could it be the case for some reason the ib.h file got > deleted or mangled while applying the $subject patch ? Please post the patch again, I will try to sort it out.
Re: [RFC PATCH 1/2] kbuild: Add a cache for generated variables
Hi Douglas, 2017-09-27 2:55 GMT+09:00 Douglas Anderson : > While timing a "no-op" build of the kernel (incrementally building the > kernel even though nothing changed) in the Chrome OS build system I > found that it was much slower than I expected. > > Digging into things a bit, I found that quite a bit of the time was > spent invoking the C compiler even though we weren't actually building > anything. Currently in the Chrome OS build system the C compiler is > called through a number of wrappers (one of which is written in > python!) and can take upwards of 100 ms to invoke even if we're not > doing anything difficult, so these invocations of the compiler were > taking a lot of time. Worse the invocations couldn't seem to take > advantage of the multiple cores on my system. > > Certainly it seems like we could make the compiler invocations in the > Chrome OS build system faster, but only to a point. Inherently > invoking a program as big as a C compiler is a fairly heavy > operation. Thus even if we can speed the compiler calls it made sense > to track down what was happening. > > It turned out that all the compiler invocations were coming from > usages like this in the kernel's Makefile: > > KBUILD_CFLAGS += $(call cc-option,-fno-delete-null-pointer-checks,) > > Due to the way cc-option and similar statements work the above > contains an implicit call to the C compiler. ...and due to the fact > that we're storing the result in KBUILD_CFLAGS, a simply expanded > variable, the call will happen every time the Makefile is parsed, even > if there are no users of KBUILD_CFLAGS. > > Rather than redoing this computation every time, it makes a lot of > sense to cache the result of all of the Makefile's compiler calls just > like we do when we compile a ".c" file to a ".o" file. Conceptually > this is quite a simple idea. ...and since the calls to invoke the > compiler and similar tools are centrally located in the Kbuild.include > file this doesn't even need to be super invasive. > > Implementing the cache in a simple-to-use and efficient way is not > quite as simple as it first sounds, though. To get maximum speed we > really want the cache in a format that make can natively understand > and make doesn't really have an ability to load/parse files. ...but > make _can_ import other Makefiles, so the solution is to store the > cache in Makefile format. This requires coming up with a valid/unique > Makefile variable name for each value to be cached, but that's > solvable with some cleverness. > > After this change, we'll automatically create a "Makefile-cache.o" > file that will contain our cached variables. We'll load this on each > invocation of make and will avoid recomputing anything that's already > in our cache. The cache is stored in a format that it shouldn't need > any invalidation since anything that might change should affect the > "key" and any old cached value won't be used. The cache will be > cleaned by virtue of the ".o" suffix by a "make clean". > > Signed-off-by: Douglas Anderson Thanks for the patches. The idea is interesting. I am not a Chrome developer, but cc-option could be improved somehow. I examined two approaches to mitigate the pain. [1] Skip cc-option completely when we run non-build targets such as "make help", "make clean", etc. 
[2] Cache the result of cc-option like this patch I wrote some patches for [1] https://patchwork.kernel.org/patch/9983825/ https://patchwork.kernel.org/patch/9983829/ https://patchwork.kernel.org/patch/9983833/ https://patchwork.kernel.org/patch/9983827/ Comments are welcome. :) [1] does not solve the slowness in the incremental build. Instead, it makes non-build targets faster from a clean source tree because cc-option is zero cost. [2] solves the issues in the incremental build. One funny thing is, it computes cc-option and stores the cache file even for "make clean", where we know the cache file will be immediately deleted. We can apply [1] or [2], or maybe both of them because [1] and [2] are trying to solve different issues. The cache approach seemed clever, but I do not like the implementation. The code is even more unreadable with lots of escape sequences. Here is my suggestion for a simpler implementation (including bike-shedding): Implement a new function "shell-cached". It works like $(shell ...). The difference is $(call shell-cached,...) returns the cached result, or runs the command if it is not cached. Also, add try-run-cached. This is a cached variant of try-run. I just played with it, and it seems to work. (I have not spent much time on testing; wider testing is needed.) The code looks something like this: make-cache := $(if $(KBUILD_EXTMOD),$(KBUILD_EXTMOD)/,$(if $(obj),$(obj)/)).cache.mk -include $(make-cache) __sanitize-opt = $(subst $$,_,$(subst $(right_paren),_,$(subst $(left_paren),_,$(subst \,_,$(subst =,_,$(subst $(space),_,$(subst $(comma),_,$(1 define __run-and-store ifeq ($(
Re: [RFC PATCH 1/2] kbuild: Pass HOSTCC and similar to tools Makefiles
On Tue, Oct 3, 2017 at 8:20 PM, Masahiro Yamada wrote: > 2017-10-04 5:48 GMT+09:00 Douglas Anderson : >> The main Linux Makefiles and the tools sub-Makefiles have different >> conventions for passing in CC / CFLAGS. >> >> Here's brief summary for the kernel: >> * CC: target C compiler (must be passed as an argument to make to >> override) >> * HOSTCC: host C compiler (must be passed as an argument to make to >> override) >> * CFLAGS: ignored (kernel manages its own CFLAGS) >> * KCFLAGS: extra cflags for the target (expected as an env var) >> * HOSTCFLAGS: host C compiler flags (not modifiable by the caller of >> make) >> >> Here's a brief summary for the tools: >> * CC: host C compiler (must be passed as an argument to make to >> override) >> * CFLAGS: base arguments for the host C compiler which are added to by >> sub builds (expected as an env var) >> >> When the main kernel Makefile calls into the tools Makefile, it should >> adapt things from its syntax to the tools Makefile syntax. >> >> If we don't do this, we have a few issues: >> * If someone wants to user another compiler (like clang, maybe) for >> hostcc then the tools will still be compiled with gcc. >> * If you happen to be building and you left ${CFLAGS} set to something >> random (maybe it matches ${CC}, the _target_ C compiler, not the >> _host_ C compiler) then this random value will be used to compile >> the tools. >> >> In any case, it seems like it makes sense to pass CC, CFLAGS, and >> similar properly into the tools Makefile. >> >> NOTE: in order to do this properly, we need to add some new >> definitions of HOSTAS and HOSTLD into the main kernel Makefile. If we >> don't do this and someone overrides "AS" or "LD" on the command line >> then those (which were intended for the target) will be used to build >> host side tools. We also make up some imaginary "HOSTASFLAGS" and >> "HOSTLDFLAGS". Someone could specify these, but if they don't at >> least these blank variables will properly clobber ASFLAGS and LDFLAGS >> from the kernel. >> >> This was discovered in the Chrome OS build system where CFLAGS (an >> environment variable) was accidentally left pointing to flags that >> would be an appropriate match to CC. The kernel didn't use these >> CFLAGS so it was never an issue. Turning on STACKVALIDATION, however, >> suddenly invoked a tools build that blew up. 
>> >> Signed-off-by: Douglas Anderson >> --- >> >> Makefile | 30 -- >> 1 file changed, 28 insertions(+), 2 deletions(-) >> >> diff --git a/Makefile b/Makefile >> index cf007a31d575..0d3af0677d88 100644 >> --- a/Makefile >> +++ b/Makefile >> @@ -298,7 +298,9 @@ HOST_LFS_CFLAGS := $(shell getconf LFS_CFLAGS) >> HOST_LFS_LDFLAGS := $(shell getconf LFS_LDFLAGS) >> HOST_LFS_LIBS := $(shell getconf LFS_LIBS) >> >> +HOSTAS = as >> HOSTCC = gcc >> +HOSTLD = ld >> HOSTCXX = g++ >> HOSTCFLAGS := -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 \ >> -fomit-frame-pointer -std=gnu89 $(HOST_LFS_CFLAGS) >> @@ -1616,11 +1618,35 @@ image_name: >> # Clear a bunch of variables before executing the submake >> tools/: FORCE >> $(Q)mkdir -p $(objtree)/tools >> - $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% >> -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/ >> + $(Q)ASFLAGS="$(HOSTASFLAGS)"\ >> + LDFLAGS="$(HOSTLDFLAGS)" \ >> + CFLAGS="$(HOSTCFLAGS)" \ >> + CXXFLAGS="$(HOSTCXXFLAGS)" \ >> + $(MAKE) \ >> + AS="$(HOSTAS)" \ >> + CC="$(HOSTCC)" \ >> + CXX="$(HOSTCXX)" \ >> + LD="$(HOSTLD)" \ >> + MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \ >> + O=$(abspath $(objtree)) \ >> + subdir=tools \ >> + -C $(src)/tools/ >> >> tools/%: FORCE >> $(Q)mkdir -p $(objtree)/tools >> - $(Q)$(MAKE) LDFLAGS= MAKEFLAGS="$(tools_silent) $(filter --j% >> -j,$(MAKEFLAGS))" O=$(abspath $(objtree)) subdir=tools -C $(src)/tools/ $* >> + $(Q)ASFLAGS="$(HOSTASFLAGS)"\ >> + LDFLAGS="$(HOSTLDFLAGS)" \ >> + CFLAGS="$(HOSTCFLAGS)" \ >> + CXXFLAGS="$(HOSTCXXFLAGS)" \ >> + $(MAKE) \ >> + AS="$(HOSTAS)" \ >> + CC="$(HOSTCC)" \ >> + CXX="$(HOSTCXX)" \ >> + LD="$(HOSTLD)" \ >> + MAKEFLAGS="$(tools_silent) $(filter --j% -j,$(MAKEFLAGS))" \ >> + O=$(abspath $(objtree)) \ >> + subdir=tools \ >> + -C $(src)/tools/ $* >> > > Please do not do this. > > > > CC: for building tools that run on the target machine > HOSTCC: for building tools that run on the build machine > > > IMHO, this should be consistent to avoid confusion. > > > Grepping CROSS_COMPILE under tools/ directory, > I see many Makefile expect CC for the t
[PATCH] kbuild: fix dependency of dtbs targets
The target "dtbs" should depend on "scripts" because it needs to build dtc. The "prepare" target is unneeded here. Signed-off-by: Masahiro Yamada --- arch/arm/Makefile | 2 +- arch/arm64/Makefile | 2 +- arch/nios2/Makefile | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/arm/Makefile b/arch/arm/Makefile index 47d3a1a..cdb5b55 100644 --- a/arch/arm/Makefile +++ b/arch/arm/Makefile @@ -342,7 +342,7 @@ $(INSTALL_TARGETS): PHONY += dtbs dtbs_install -dtbs: prepare scripts +dtbs: scripts $(Q)$(MAKE) $(build)=$(boot)/dts dtbs_install: diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index 939b310..fd56fbc 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -133,7 +133,7 @@ zinstall install: PHONY += dtbs dtbs_install -dtbs: prepare scripts +dtbs: scripts $(Q)$(MAKE) $(build)=$(boot)/dts dtbs_install: diff --git a/arch/nios2/Makefile b/arch/nios2/Makefile index 8673a79..abc3f5b 100644 --- a/arch/nios2/Makefile +++ b/arch/nios2/Makefile @@ -61,7 +61,7 @@ archclean: %.dtb: | scripts $(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@ -dtbs: +dtbs: scripts $(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@ $(BOOT_TARGETS): vmlinux -- 2.7.4
Re: [PATCH 3/8] ARM: dts: aspeed: Add I2C buses
On Thu, 2017-09-28 at 11:35 -0700, Brendan Higgins wrote: > > On Thu, Sep 28, 2017 at 12:51 AM, Joel Stanley wrote: > > Now with an upstream i2c bus driver, we can add the 14 i2c buses that > > exist in ASPEED G4 and G5 generation SoCs. > > > > It also adds aliases for the 14 built-in I2C busses to ensure userspace > > sees the numbering starting from zero and counting up. > > > > Signed-off-by: Joel Stanley > > --- > > > Reviewed-by: Brendan Higgins > > nit: can we make the i2c labels and the pinctrl labels match? > > For example: > > > + > > + i2c13: i2c-bus@480 { > > + #address-cells = <1>; > > + #size-cells = <0>; > > + #interrupt-cells = <1>; > > + > > + reg = <0x480 0x40>; > > + compatible = "aspeed,ast2500-i2c-bus"; > > + clocks = <&clk_apb>; > > + bus-frequency = <10>; > > + interrupts = <13>; > > + interrupt-parent = <&i2c_ic>; > > + pinctrl-names = "default"; > > + pinctrl-0 = <&pinctrl_i2c14_default>; > > + status = "disabled"; > > + }; > > "i2c13" has a pinctrl-0 of "pinctrl_i2c14_default" > > I know that pinctrl_i2c14_default is consistent with the function and > groups it uses, but I would like to see them all be consistent at some > point in the future. I doubt we'll be making them consistent: How the i2c devices are aliased by Linux is independent of how they're labelled in hardware or the datasheet. For sanity's sake, the pinmux functions and groups follow the datasheet (though if you want to maintain the Aspeed pinctrl drivers, be my guest :D) We could rename the pinmux nodes to offset them from the function and group, but it feels less self-contained if the nodes are named at the whim of the device requesting the function if there's no further configuration on top of the function and group*. It also makes it harder to auto-explode the function name into the node definitions as we'd then need some further mangling to make things line up. * The pinctrl devicetree bindings allow for definition of nested nodes to specify collections and combinations of pin function and group, and pin configuration such as drive-strength. Thus the nodes can be a lot more complex than what we've defined up until now, but what we've got enables a lot of pretty straightforward use-cases. We could alias the i2c devices in line with the hardware, but that breaks the convention of starting device numbering at 0. Ultimately Joel and I chose to make the break at the point where the i2c device requests the mux function. It's up to the device to know what hardware mux configuration it needs and that's what this approach documents. The inconsistency is irritating but I don't feel like it's the end of the world, and I think the alternatives are (slightly) worse. Cheers, Andrew