Re: Bug involving dynamic_debug and '-p' modifier
Hi Andrew, Thanks for the bug report. I need to send a patch to update the maintainers file... Haven't had a chance to look into this yet. Will get back to you. Thanks, -Jason On 03/11/2013 10:28 PM, Andrew Cooks wrote: > On Tue, Mar 12, 2013 at 9:14 AM, Andrew Cooks wrote: >> Hi Jason >> >> When I specify a '-p' modifier in the dynamic debug options in the >> kernel command line, I get a kernel panic during boot. The panic >> happens after userspace starts doing graphical stuff, but I haven't >> been able to pinpoint exactly what. >> >> If I use '=_' instead of '-p', or boot to a recovery mode, or change >> the modifier after booting using debugfs, the panic doesn't occur. For >> example, booting to runlevel 3 with '=_', then flipping the same >> modifier to first to '+p' and then to '-p', before starting X does not >> result in a panic. >> >> Booting to runlevel 3 (or even 1!), with '-p' results in a panic, but >> 'rdshell' is OK. >> >> I also tried the Red Hat kernel (just because it's installed - I >> haven't figured out how to tell the package manager that I don't want >> a kernel). This also panics, but it says that '-p' is not supported. >> >> Please let me know if you need me to do any other tests. >> >> '-p', panic: http://pastebin.com/8yAcBQdY >> '-p', rdshell, no panic: http://pastebin.com/Hi2a7BZA >> >> -- >> a. > Jason's mail bounced, so redirecting to Jim Cromie, who made the most > recent significant changes to dynamic_debug... > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] dev_ and dynamic_debug cleanups
On Wed, Sep 12, 2012 at 06:13:30PM -0700, Joe Perches wrote: > On Thu, 2012-09-06 at 13:53 -0400, Jason Baron wrote: > > On Thu, Sep 06, 2012 at 09:13:59AM -0700, Greg Kroah-Hartman wrote: > > > Jason, any ACK on these, or any of the other random dynamic debug > > > patches floating around? What am I supposed to be applying here? > [] > > I just posted some follow up comments to Joe, so let's see where that > > discussion goes. > > Jason, I think you need to ack these original 5 patches. > Hi, Yes, they are certainly an improvement. Please add to the series: Acked-by: Jason Baron Thanks! -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/4] make pr_debug() dynamic
On Sat, Feb 09, 2008 at 11:21:58PM +0100, Jan Engelhardt wrote: > >> > >> What's wrong with klogd -c 8 or equivalent? > > > >Setting the loglevel higher, will not make pr_debug() calls visible. The only > >way to make them visible right now, is by re-compiling the kernel. > > pr_debug() was IMHO meant to be a compile-time optimization > to throw out debug messages most people do not want. > > If you want to switch on/off debugging messages, use > printk(KERN_DEBUG) [with klogd -c something] and not pr_debug! true, we could turn pr_debug() calls into printk(KERN_DEBUG) calls, but there is an overhead to all these extra printk functions calls. The way I have implemented pr_debug(), there is no function call at all, in the off case. To me pr_debug() is currently a compile option, b/c there is often a lot of overhead to having it on. Being able to turn pr_debug() calls off/on in a production environment, to get debugging data seems very valuable. thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/6] dyndbg: add more info to -E2BIG log warning
On Tue, Sep 18, 2012 at 05:36:44PM -0600, Jim Cromie wrote: > Tell user how big the attempted write was, in case its not obvious. > This helped identify a missing flush in my stress-test script. > > Signed-off-by: Jim Cromie > --- > lib/dynamic_debug.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c > index 43ae319..85258a8 100644 > --- a/lib/dynamic_debug.c > +++ b/lib/dynamic_debug.c > @@ -693,7 +693,8 @@ static ssize_t ddebug_proc_write(struct file *file, const > char __user *ubuf, > if (len == 0) > return 0; > if (len > USER_BUF_PAGE - 1) { > - pr_warn("expected <%d bytes into control\n", USER_BUF_PAGE); > + pr_warn("expected <%d bytes into control, you wrote %d\n", > + USER_BUF_PAGE, (int) len); The style here, I think, is generally to leave out the space, ie: (int)len. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/6] dyndbg: in dynamic_emit_prefix, change inter-field separator
On Tue, Sep 18, 2012 at 07:56:42PM -0700, Joe Perches wrote: > On Tue, 2012-09-18 at 17:36 -0600, Jim Cromie wrote: > > dynamic_emit_prefix() currently separates modname, funcname, lineno > > with ':'. This is confounds use of cut -d: , since the field > > positions can change per callsite with dynamic-debug. So change > > inter-field separator to '.' and keep the ':' prefix terminator. > > > > This improves the situation, but doesnt solve it entirely; if > > dyndbg==p is used instead of dyndbg==p[fmlt]+, the callsite is enabled > > but no prefix is added, so theres one less ':' in the message. > > Changing the terminator to ',' would fix this, and might be warranted, > > especially since pr_fmt() typically adds a ':' as well. > > > > Joe Perches wasnt a fan of this, but his complaint was essentially > > that cut -d: was a poor way to do this kind of thing. I concede that > > point, but note that the kernel is not in the habit of needlessly > > confounding users work, at least when accommodating them is so trivial. > > And I still think this is ugly as it requires different parsing > by scripts when using combinations of +pfmlt > > If this patch doesn't 'solve it entirely', I'm reluctant to ack it, Would brackets, such as {} around the optional prefix be reasonable? Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 1/2] dev-core: fix build break when DEBUG is enabled
On 08/27/2013 12:20 PM, Joe Perches wrote: > On Tue, 2013-08-27 at 17:47 +0300, Dmitry Kasatkin wrote: >> When DEBUG is defined, dev_dbg_ratelimited uses dynamic debug data >> structures even when CONFIG_DYNAMIC_DEBUG is not defined. >> It leads to build break. >> For example, when I try to use dev_dbg_ratelimited in USB code and >> CONFIG_USB_DEBUG is enabled, but CONFIG_DYNAMIC_DEBUG is not, I get: > Jason? > > Seems mostly sensible to me but I think the first check > needs to be > > #if defined(CONFIG_DYNAMIC_DEBUG) && defined(DEBUG) Why? All the other call-sites, do it the way Dmitry has done it. > Also, it's curious why it was ever using __dynamic_pr_debug > instead of __dynamic_dev_dbg Not sure. > Maybe for completeness there should be a > dev_vdbg_ratelimited too. I could also see the rate limiting being pushed further down, such that dynamic debug could control how and whether or not its enabled. Thanks, -Jason > Additional bit below... > >> CC [M] drivers/usb/host/xhci-ring.o >> drivers/usb/host/xhci-ring.c: In function ‘xhci_queue_intr_tx’: >> drivers/usb/host/xhci-ring.c:3059:3: error: implicit declaration of >> function ‘DEFINE_DYNAMIC_DEBUG_METADATA’ >> [-Werror=implicit-function-declaration] >> drivers/usb/host/xhci-ring.c:3059:3: error: ‘descriptor’ undeclared (first >> use in this function) >> drivers/usb/host/xhci-ring.c:3059:3: note: each undeclared identifier is >> reported only once for each function it appears in >> drivers/usb/host/xhci-ring.c:3059:3: error: implicit declaration of >> function ‘__dynamic_pr_debug’ [-Werror=implicit-function-declaration] >> drivers/usb/host/xhci-ring.c: In function ‘xhci_queue_isoc_tx_prepare’: >> drivers/usb/host/xhci-ring.c:3847:3: error: ‘descriptor’ undeclared (first >> use in this function) >> cc1: some warnings being treated as errors >> make[2]: *** [drivers/usb/host/xhci-ring.o] Error 1 >> make[1]: *** [drivers/usb/host] Error 2 >> make: *** [drivers/usb/] Error 2 >> >> This patch separates definition for CONFIG_DYNAMIC_DEBUG and DEBUG cases. >> >> Signed-off-by: Dmitry Kasatkin >> --- >> include/linux/device.h | 17 ++--- >> 1 file changed, 14 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/device.h b/include/linux/device.h >> index 22b546a..d336beb 100644 >> --- a/include/linux/device.h >> +++ b/include/linux/device.h >> @@ -1099,17 +1099,28 @@ do { >> \ >> dev_level_ratelimited(dev_notice, dev, fmt, ##__VA_ARGS__) >> #define dev_info_ratelimited(dev, fmt, ...) \ >> dev_level_ratelimited(dev_info, dev, fmt, ##__VA_ARGS__) >> -#if defined(CONFIG_DYNAMIC_DEBUG) || defined(DEBUG) >> +#if defined(CONFIG_DYNAMIC_DEBUG) > I think this needs to be > > #if defined(CONFIG_DYNAMIC_DEBUG) && defined(DEBUG) > >> #define dev_dbg_ratelimited(dev, fmt, ...) \ >> do { >> \ >> static DEFINE_RATELIMIT_STATE(_rs, \ >>DEFAULT_RATELIMIT_INTERVAL, \ >>DEFAULT_RATELIMIT_BURST); \ >> DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \ >> +/* descriptor check is first to prevent flooding with \ >> + "callbacks suppressed" */\ >> if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT) &&\ >> __ratelimit(&_rs)) \ >> -__dynamic_pr_debug(&descriptor, pr_fmt(fmt),\ >> - ##__VA_ARGS__); \ >> +__dynamic_dev_dbg(&descriptor, dev, fmt,\ >> + ##__VA_ARGS__); \ >> +} while (0) >> +#elif defined(DEBUG) >> +#define dev_dbg_ratelimited(dev, fmt, ...) \ >> +do { >> \ >> +static DEFINE_RATELIMIT_STATE(_rs, \ >> + DEFAULT_RATELIMIT_INTERVAL, \ >> + DEFAULT_RATELIMIT_BURST); \ >> +if (__ratelimit(&_rs)) \ >> +dev_printk(KERN_DEBUG, dev, fmt, ##__VA_ARGS__);\ >> } while (0) >> #else >> #define dev_dbg_ratelimited(dev, fmt, ...) \ > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] dynamic debug: line queries failing due to uninitialized local variable
On 08/27/2013 01:41 PM, Greg KH wrote: > On Tue, Aug 27, 2013 at 04:50:03PM +, jba...@akamai.com wrote: >> Settings of the form, 'line x module y +p', can fail arbitrarily due to an >> uninitialized local variable. With this patch results are consistent, as >> expected. >> >> Signed-off-by: Jason Baron >> --- >> lib/dynamic_debug.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) > How far back should this patch go for older kernels? Is this a bug that > people are hitting today and causing problems, or is it only a 3.11-rc > issue? > > thanks, > > greg k-h Looks like this was introduced in >= 3.3. So its been there for quite a while. Its a bit of a corner case, since it requires, 'line' to be specified before anything else. It depends on on how you order your query. I found the bug via code inspection - no bug reports that I'm aware of. So I wouldn't think this is an urgent fix. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2 v2] epoll: optimize EPOLL_CTL_DEL using rcu
Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by converting the file->f_ep_links list into an rcu one. In this way, we can traverse the epoll network on the add path in parallel with deletes. Since deletes can't create loops or worse wakeup paths, this is safe. This patch in combination with the patch "epoll: Do not take global 'epmutex' for simple topologies", shows a dramatic performance improvement in scalability for SPECjbb. Tested-by: Nathan Zimmer Signed-off-by: Jason Baron --- fs/eventpoll.c | 65 -- 1 file changed, 41 insertions(+), 24 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 473e09d..dd9fae1 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -42,6 +42,7 @@ #include #include #include +#include /* * LOCKING: @@ -134,8 +135,12 @@ struct nested_calls { * of these on a server and we do not want this to take another cache line. */ struct epitem { - /* RB tree node used to link this structure to the eventpoll RB tree */ - struct rb_node rbn; + union { + /* RB tree node links this structure to the eventpoll RB tree */ + struct rb_node rbn; + /* Used to free the struct epitem */ + struct rcu_head rcu; + }; /* List header used to link this structure to the eventpoll ready list */ struct list_head rdllink; @@ -166,6 +171,9 @@ struct epitem { /* The structure that describe the interested events and the source fd */ struct epoll_event event; + + /* The fllink is in use. Since rcu can't do 'list_del_init()' */ + int on_list; }; /* @@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eventpoll *ep, return error; } +static void epi_rcu_free(struct rcu_head *head) +{ + struct epitem *epi = container_of(head, struct epitem, rcu); + kmem_cache_free(epi_cache, epi); +} + /* * Removes a "struct epitem" from the eventpoll RB tree and deallocates * all the associated resources. Must be called with "mtx" held. @@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) /* Remove the current item from the list of epoll hooks */ spin_lock(&file->f_lock); - if (ep_is_linked(&epi->fllink)) - list_del_init(&epi->fllink); + if (epi->on_list) { + list_del_rcu(&epi->fllink); + epi->on_list = 0; + } spin_unlock(&file->f_lock); rb_erase(&epi->rbn, &ep->rbr); @@ -705,9 +721,14 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) spin_unlock_irqrestore(&ep->lock, flags); wakeup_source_unregister(ep_wakeup_source(epi)); - - /* At this point it is safe to free the eventpoll item */ - kmem_cache_free(epi_cache, epi); + /* +* At this point it is safe to free the eventpoll item. Use the union +* field epi->rcu, since we are trying to minimize the size of +* 'struct epitem'. The 'rbn' field is no longer in use. Protected by +* ep->mtx. The rcu read side, reverse_path_check_proc(), does not make +* use of the rbn field. +*/ + call_rcu(&epi->rcu, epi_rcu_free); atomic_long_dec(&ep->user->epoll_watches); @@ -873,7 +894,6 @@ static const struct file_operations eventpoll_fops = { */ void eventpoll_release_file(struct file *file) { - struct list_head *lsthead = &file->f_ep_links; struct eventpoll *ep; struct epitem *epi; @@ -891,17 +911,12 @@ void eventpoll_release_file(struct file *file) * Besides, ep_remove() acquires the lock, so we can't hold it here. */ mutex_lock(&epmutex); - - while (!list_empty(lsthead)) { - epi = list_first_entry(lsthead, struct epitem, fllink); - + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { ep = epi->ep; - list_del_init(&epi->fllink); mutex_lock_nested(&ep->mtx, 0); ep_remove(ep, epi); mutex_unlock(&ep->mtx); } - mutex_unlock(&epmutex); } @@ -1139,7 +1154,9 @@ static int reverse_path_check_proc(void *priv, void *cookie, int call_nests) struct file *child_file; struct epitem *epi; - list_for_each_entry(epi, &file->f_ep_links, fllink) { + /* CTL_DEL can remove links here, but that can't increase our count */ + rcu_read_lock(); + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { child_file = epi->ep->file; if (is_file_epoll(child_file)) { if (list_empty(&c
[PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies
When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached directly to a wakeup source, we do not need to take the global 'epmutex', unless the epoll file descriptor is nested. The purpose of taking the 'epmutex' on add is to prevent complex topologies such as loops and deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD operations. However, for the simple case of an epoll file descriptor attached directly to a wakeup source (with no nesting), we do not need to hold the 'epmutex'. This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb performance: " On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In addition the benchmark when from scaling well on 10 sockets to scaling well on just over 40 sockets. ... Currently the benchmark stops scaling at around 40-44 sockets but it seems like I found a second unrelated bottleneck. " Tested-by: Nathan Zimmer Signed-off-by: Jason Baron --- fs/eventpoll.c | 94 ++ 1 file changed, 68 insertions(+), 26 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index dd9fae1..0f25162 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -595,8 +595,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi) static int ep_scan_ready_list(struct eventpoll *ep, int (*sproc)(struct eventpoll *, struct list_head *, void *), - void *priv, - int depth) + void *priv, int depth, int ep_locked) { int error, pwake = 0; unsigned long flags; @@ -607,7 +606,9 @@ static int ep_scan_ready_list(struct eventpoll *ep, * We need to lock this because we could be hit by * eventpoll_release_file() and epoll_ctl(). */ - mutex_lock_nested(&ep->mtx, depth); + + if (!ep_locked) + mutex_lock_nested(&ep->mtx, depth); /* * Steal the ready list, and re-init the original one to the @@ -671,7 +672,8 @@ static int ep_scan_ready_list(struct eventpoll *ep, } spin_unlock_irqrestore(&ep->lock, flags); - mutex_unlock(&ep->mtx); + if (!ep_locked) + mutex_unlock(&ep->mtx); /* We have to call this outside the lock */ if (pwake) @@ -829,15 +831,34 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, return 0; } +static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, +poll_table *pt); + +struct readyevents_arg { + struct eventpoll *ep; + int locked; +}; + static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests) { - return ep_scan_ready_list(priv, ep_read_events_proc, NULL, call_nests + 1); + struct readyevents_arg *arg = (struct readyevents_arg *)priv; + + return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL, + call_nests + 1, arg->locked); } static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) { int pollflags; struct eventpoll *ep = file->private_data; + struct readyevents_arg arg; + + /* +* During ep_insert() we already hold the ep->mtx for the tfile. +* Prevent re-aquisition. +*/ + arg.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0); + arg.ep = ep; /* Insert inside our poll wait queue */ poll_wait(file, &ep->poll_wait, wait); @@ -849,7 +870,7 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) * could re-enter here. */ pollflags = ep_call_nested(&poll_readywalk_ncalls, EP_MAX_NESTS, - ep_poll_readyevents_proc, ep, ep, current); + ep_poll_readyevents_proc, &arg, ep, current); return pollflags != -1 ? pollflags : 0; } @@ -1250,7 +1271,7 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi) * Must be called with "mtx" held. */ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, -struct file *tfile, int fd) +struct file *tfile, int fd, int full_check) { int error, revents, pwake = 0; unsigned long flags; @@ -1318,7 +1339,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, /* now check if we've created too many backpaths */ error = -EINVAL; - if (reverse_path_check()) + if (full_check && reverse_path_check()) goto error_remove_epi;
[PATCH 0/2 v2] epoll: reduce 'epmutex' lock contention
Hi, Updated series, using a 'union' for the rcu callback. Nathan Zimmer found that once we get over 10+ cpus, the scalability of SPECjbb falls over due to the contention on the global 'epmutex', which is taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations. Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path by using rcu to guard against any concurrent traversals. Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for simple topologies. IE when adding a link from an epoll file descriptor to a wakeup source, where the epoll file descriptor is not nested. Performance of SPECjbb improves considerably. From Nathan Zimmer's testing. Thread: http://marc.info/?l=linux-kernel&m=137908766013329&w=2 " On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In addition the benchmark when from scaling well on 10 sockets to scaling well on just over 40 sockets. I should also note there system responsiveness of various commands was quite improved when under the full load of the benchmark. ... Currently the benchmark stops scaling at around 40-44 sockets but it seems like I found a second unrelated bottleneck. " Thanks, -Jason Changes in v2: -use a union for rcu callback and drop patch #3 v1: http://marc.info/?l=linux-kernel&m=138057995316598&w=2 Jason Baron (2): epoll: optimize EPOLL_CTL_DEL using rcu epoll: Do not take global 'epmutex' for simple topologies fs/eventpoll.c | 153 +++-- 1 file changed, 106 insertions(+), 47 deletions(-) -- 1.8.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2 v2] epoll: Do not take global 'epmutex' for simple topologies
On 10/03/2013 05:50 PM, Andrew Morton wrote: > On Tue, 1 Oct 2013 17:08:14 + (GMT) Jason Baron > wrote: > >> When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached >> directly to a wakeup source, we do not need to take the global 'epmutex', >> unless the epoll file descriptor is nested. The purpose of taking >> the 'epmutex' on add is to prevent complex topologies such as loops and >> deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD >> operations. However, for the simple case of an epoll file descriptor >> attached directly to a wakeup source (with no nesting), we do not need >> to hold the 'epmutex'. >> >> This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves >> scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb >> performance: >> >> " >> On the 16 socket run the performance went from 35k jOPS to 125k jOPS. >> In addition the benchmark when from scaling well on 10 sockets to scaling >> well >> on just over 40 sockets. >> >> ... >> >> Currently the benchmark stops scaling at around 40-44 sockets but it seems >> like >> I found a second unrelated bottleneck. >> " > I couldn't resist fiddling. Please review > > From: Andrew Morton > Subject: epoll-do-not-take-global-epmutex-for-simple-topologies-fix > > - use `bool' for boolean variables > - remove unneeded/undesirable cast of void* > - add missed ep_scan_ready_list() kerneldoc Hi Andrew, Looks good to me. Feel free to add: Reviewed-by: Jason Baron Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] [PATCH 0/2] x86: make jump labels use int3-based breakpoint instead of stop_machine()
On 07/10/2013 04:25 PM, Jiri Kosina wrote: > Hi, > > this is a resurrection of a few years old idea to have jump labels use > synchronization based on int3 breakpoint rather than relying on > stop_machine() with all the consequences. > > ftrace has been doing exactly this kind of patching for year since > 08d636b6 ("ftrace/x86: Have arch x86_64 use breakpoints instead of stop > machine"). > > This patchset first introduces generic text_poke_bp() that provides means > to perform this method of patching in parallel to text_poke_smp(), and > then converts x86 jump label code to use it. > > If this is merged, I'll do a followup patch converting ftrace to use this > infrastructure as well, as it's doing the same thing in principle already. > > Comments welcome. > Cool. This definitely an area I've wanted to improve with jump labels. Perhaps, ftrace should be considered at this point to make sure the interface is suitable for both callers? Also, I wonder if its worth batching up updates. For example, right now we simply update each call-site one at a time even if its associated with the same control variable. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] [PATCH 1/2 v2] x86: introduce int3-based instruction patching
On 07/11/2013 10:35 AM, Steven Rostedt wrote: > On Wed, 2013-07-10 at 14:36 -0700, H. Peter Anvin wrote: >> On 07/10/2013 02:31 PM, Jiri Kosina wrote: >>> If any CPU instruction execution would collide with the patching, >>> it'd be trapped by the int3 breakpoint and redirected to the provided >>> "handler" (which would typically mean just skipping over the patched >>> region, acting as "nop" has been there, in case we are doing nop -> jump >>> and jump -> nop transitions). >>> >> I'm wondering if it would be easier/more general to just return to the >> instruction. The "more general" bit would allow this to be used for >> other things, like alternatives, and perhaps eventually dynamic call >> patching. >> >> Returning to the instruction will, in effect, be a busy-wait for the >> faulted CPU until the patch is complete; more or less what stop_machine >> would do, but only for a CPU which actually strays into the affected region. >> > Wont work for ftrace, as it patches all functions, it even patches > functions used to do the changes. Thus, it would cause a deadlock if a > breakpoint were to spin till the changes were finished. > > -- Steve > > I'm not sure this works for jump labels either. Some tracepoints (which use jump_labels) have interrupts disabled across them. Thus, they will spin with interrupts disabled, while we are trying to issue an IPI. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] MAINTAINERS: dynamic debug: Jason's not there...
Thanks Joe - feel free to add: Acked-by: Jason Baron On 07/10/2013 06:36 PM, Joe Perches wrote: > He must be too, umm, busy to update his own bouncing > email address too. > > Signed-off-by: Joe Perches > --- > > poke - prod... > > MAINTAINERS | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index e51d018..3c81917 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -2875,7 +2875,7 @@ F: drivers/media/usb/dvb-usb-v2/dvb_usb* > F: drivers/media/usb/dvb-usb-v2/usb_urb.c > > DYNAMIC DEBUG > -M: Jason Baron > +M: Jason Baron > S: Maintained > F: lib/dynamic_debug.c > F: include/linux/dynamic_debug.h > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] eventpoll: Move a kmem_cache_alloc and kmem_cache_free
On 09/13/2013 11:54 AM, Nathan Zimmer wrote: > We noticed some scaling issue in the SPECjbb benchmark. Running perf > we found that the it was spending lots of time in SYS_epoll_ctl. > In particular it is holding the epmutex. > This patch helps by moving out the kmem_cache_alloc and kmem_cache_free out > from under the lock. It improves throughput by around 15% on 16 sockets. > > While this patch should be fine as it is there are probably is more things > that can be done out side the lock, like wakeup_source_unregister, but I am > not familar with the area and I don't know of many tests. I did find the > one posted by Jason Baron at https://lkml.org/lkml/2011/2/25/297. > > Any thoughts? > Hi, Intersting - I think its also possible to completely drop taking the 'epmutex' for EPOLL_CTL_DEL by using rcu, and restricting it on add to more 'complex' topologies. That is when we have an epoll descriptor that doesn't nest with other epoll descriptors, we don't need the global 'epmutex' either. Any chance you can re-run with this? Its a bit hacky, but we can clean it up if it makes sense. Thanks, -Jason epoll: reduce usage of global 'epmutex' lock Epoll file descriptors that are 1 link from a wakeup source and are not nested within other epoll descriptors, or pointing to other epoll descriptors, don't need to check for loop creation or the creation of wakeup storms. Because of this we can avoid taking the global 'epmutex' in these cases. This state for the epoll file descriptor is marked as 'EVENTPOLL_BASIC'. Once the epoll file descriptor is attached to another epoll file descriptor it is labeled as 'EVENTPOLL_COMPLEX', and full loop checking and wakeup storm creation are checked using the the global 'epmutex'. It does not transition back. Hopefully, this is a common usecase... Also optimize EPOLL_CTL_DEL so that it does not require the 'epmutex' by converting the file->f_ep_links list into an rcu one. In this way, we can traverse the epoll network on the add path in parallel with deletes. Since deletes can't create loops or worse wakeup paths, this is safe. In order, to not expand the 'struct epitem' data structure, use the 'struct rb_node rbn', which is embedded in the 'stuct epitem' as the 'rcu_head' callback point. This is ok, since the 'rbn' is no longer in use when we go to free the epitem. There is currently a build-time error check for expanding the epi which I am trying to work around (Yes, this is a hack). In addition, we add a check to make sure that 'struct rb_node' is >= 'struct rcu_head'. Signed-off-by: Jason Baron diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 473e09d..d98105d 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -42,6 +42,7 @@ #include #include #include +#include /* * LOCKING: @@ -166,6 +167,14 @@ struct epitem { /* The structure that describe the interested events and the source fd */ struct epoll_event event; + + /* TODO: really necessary? */ + int on_list; +}; + +enum eventpoll_type { + EVENTPOLL_BASIC, + EVENTPOLL_COMPLEX }; /* @@ -215,6 +224,9 @@ struct eventpoll { /* used to optimize loop detection check */ int visited; struct list_head visited_list_link; + + /* used to optimized loop checking */ + int type; }; /* Wait structure used by the poll hooks */ @@ -587,8 +599,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi) static int ep_scan_ready_list(struct eventpoll *ep, int (*sproc)(struct eventpoll *, struct list_head *, void *), - void *priv, - int depth) + void *priv, int depth, int locked) { int error, pwake = 0; unsigned long flags; @@ -599,7 +610,9 @@ static int ep_scan_ready_list(struct eventpoll *ep, * We need to lock this because we could be hit by * eventpoll_release_file() and epoll_ctl(). */ - mutex_lock_nested(&ep->mtx, depth); + + if (!locked) + mutex_lock_nested(&ep->mtx, depth); /* * Steal the ready list, and re-init the original one to the @@ -663,7 +676,8 @@ static int ep_scan_ready_list(struct eventpoll *ep, } spin_unlock_irqrestore(&ep->lock, flags); - mutex_unlock(&ep->mtx); + if (!locked) + mutex_unlock(&ep->mtx); /* We have to call this outside the lock */ if (pwake) @@ -672,6 +686,13 @@ static int ep_scan_ready_list(struct eventpoll *ep, return err
[PATCH 0/3] epoll: reduce 'epmutex' lock contention
Hi, Nathan Zimmer found that once we get over 10+ cpus, the scalability of SPECjbb falls over due to the contention on the global 'epmutex', which is taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations. Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path by using rcu to guard against any concurrent traversals. Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for simple topologies. IE when adding a link from an epoll file descriptor to a wakeup source, where the epoll file descriptor is not nested. Patch #3, restores the size of 'struct epitem' to <= 128 bytes, which was broken in Patch #1. Its a bit hacky, so I decided to break it out as a separate patch for review purposes. Performance of SPECjbb improves considerably. From Nathan Zimmer's testing. Thread: http://marc.info/?l=linux-kernel&m=137908766013329&w=2 " On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In addition the benchmark when from scaling well on 10 sockets to scaling well on just over 40 sockets. I should also note there system responsiveness of various commands was quite improved when under the full load of the benchmark. ... Currently the benchmark stops scaling at around 40-44 sockets but it seems like I found a second unrelated bottleneck. " Jason Baron (3): epoll: optimize EPOLL_CTL_DEL using rcu epoll: Do not take global 'epmutex' for simple topologies epoll: restore 'struct epitem' size fs/eventpoll.c | 150 - 1 file changed, 105 insertions(+), 45 deletions(-) -- 1.8.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] epoll: optimize EPOLL_CTL_DEL using rcu
Optimize EPOLL_CTL_DEL so that it does not require the 'epmutex' by converting the file->f_ep_links list into an rcu one. In this way, we can traverse the epoll network on the add path in parallel with deletes. Since deletes can't create loops or worse wakeup paths, this is safe. Note that I've removed the build time check limiting 'struct epitem' to 128 bytes or less. This check is restored by subsequent patch. The reason for separating it out was to call attention to its somewhat hacky nature. This patch in combination with the patch "epoll: Do not take global 'epmutex' for simple topologies", shows a dramatic performance improvement in scalability for SPECjbb. Tested-by: Nathan Zimmer Signed-off-by: Jason Baron --- fs/eventpoll.c | 58 -- 1 file changed, 32 insertions(+), 26 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 473e09d..f4251a6 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -42,6 +42,7 @@ #include #include #include +#include /* * LOCKING: @@ -166,6 +167,13 @@ struct epitem { /* The structure that describe the interested events and the source fd */ struct epoll_event event; + + /* +* Free the epitem using rcu to enable a CTL_DEL to happen in parallel +* with reverse path checks. +*/ + struct rcu_head rcu; + int on_list; }; /* @@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eventpoll *ep, return error; } +static void epi_rcu_free(struct rcu_head *head) +{ + struct epitem *epi = container_of(head, struct epitem, rcu); + kmem_cache_free(epi_cache, epi); +} + /* * Removes a "struct epitem" from the eventpoll RB tree and deallocates * all the associated resources. Must be called with "mtx" held. @@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) /* Remove the current item from the list of epoll hooks */ spin_lock(&file->f_lock); - if (ep_is_linked(&epi->fllink)) - list_del_init(&epi->fllink); + if (epi->on_list) { + list_del_rcu(&epi->fllink); + epi->on_list = 0; + } spin_unlock(&file->f_lock); rb_erase(&epi->rbn, &ep->rbr); @@ -707,7 +723,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) wakeup_source_unregister(ep_wakeup_source(epi)); /* At this point it is safe to free the eventpoll item */ - kmem_cache_free(epi_cache, epi); + call_rcu(&epi->rcu, epi_rcu_free); atomic_long_dec(&ep->user->epoll_watches); @@ -873,7 +889,6 @@ static const struct file_operations eventpoll_fops = { */ void eventpoll_release_file(struct file *file) { - struct list_head *lsthead = &file->f_ep_links; struct eventpoll *ep; struct epitem *epi; @@ -891,17 +906,12 @@ void eventpoll_release_file(struct file *file) * Besides, ep_remove() acquires the lock, so we can't hold it here. */ mutex_lock(&epmutex); - - while (!list_empty(lsthead)) { - epi = list_first_entry(lsthead, struct epitem, fllink); - + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { ep = epi->ep; - list_del_init(&epi->fllink); mutex_lock_nested(&ep->mtx, 0); ep_remove(ep, epi); mutex_unlock(&ep->mtx); } - mutex_unlock(&epmutex); } @@ -1139,7 +1149,9 @@ static int reverse_path_check_proc(void *priv, void *cookie, int call_nests) struct file *child_file; struct epitem *epi; - list_for_each_entry(epi, &file->f_ep_links, fllink) { + /* CTL_DEL can remove links here, but that can't increase our count */ + rcu_read_lock(); + list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) { child_file = epi->ep->file; if (is_file_epoll(child_file)) { if (list_empty(&child_file->f_ep_links)) { @@ -1161,6 +1173,7 @@ static int reverse_path_check_proc(void *priv, void *cookie, int call_nests) "file is not an ep!\n"); } } + rcu_read_unlock(); return error; } @@ -1255,6 +1268,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, epi->event = *event; epi->nwait = 0; epi->next = EP_UNACTIVE_PTR; + epi->on_list = 0; if (epi->event.events & EPOLLWAKEUP) { error = ep_create_wakeup_source(epi); if (error) @@ -1287,7 +1301,8 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
[PATCH 3/3] epoll: restore 'struct epitem' size
The patch entitled 'epoll: optimize EPOLL_CTL_DEL using rcu', increases the size of 'struct epitem' over 128 bytes as it adds a 'struct rcu_head' field to 'struct epitem', in order to make use of RCU. This patch restores the size of 'struct epitem' back to <= 128 bytes. This size restriction and enforcement was orignally brought in by commit 39732ca5af4b09f4db561149041ddad7211019a5. The idea of this patch is to use the 'struct rb_node rbn', which is embedded in the 'stuct epitem' as the 'rcu_head' callback point. This is ok, since the 'rbn' is no longer in use when we schedule the item to be freed with RCU and its access is guarded by ep->mtx. Further, the RCU reader does not access the rbn field. I've also added a build-time check to ensure that 'struct rb_node' is >= 'struct rcu_head'. Note, I've kept this separate from 'epoll: optimize EPOLL_CTL_DEL using rcu' in order to make clear the hack-ish nature of this thing. Tested-by: Nathan Zimmer Signed-off-by: Jason Baron --- fs/eventpoll.c | 32 ++-- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index daaec16..2f06737 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -168,11 +168,7 @@ struct epitem { /* The structure that describe the interested events and the source fd */ struct epoll_event event; - /* -* Free the epitem using rcu to enable a CTL_DEL to happen in parallel -* with reverse path checks. -*/ - struct rcu_head rcu; + /* The fllink is in use. Since rcu can't do 'list_del_init()' */ int on_list; }; @@ -682,9 +678,10 @@ static int ep_scan_ready_list(struct eventpoll *ep, return error; } -static void epi_rcu_free(struct rcu_head *head) +static void epi_rcu_free(struct rcu_head *rcu) { - struct epitem *epi = container_of(head, struct epitem, rcu); + struct epitem *epi = (struct epitem *)((char *)rcu - + offsetof(struct epitem, rbn)); kmem_cache_free(epi_cache, epi); } @@ -723,9 +720,15 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) spin_unlock_irqrestore(&ep->lock, flags); wakeup_source_unregister(ep_wakeup_source(epi)); - - /* At this point it is safe to free the eventpoll item */ - call_rcu(&epi->rcu, epi_rcu_free); + /* +* At this point it is safe to free the eventpoll item. +* Use epi->rbn, instead of struct rcu_head, since we +* are trying to minimize the size of 'struct epitem'. +* The 'rbn' field is no longer in use. Protected +* by ep->mtx. the rcu read side, reverse_path_check_proc(), +* does not make use of the rbn field. +*/ + call_rcu((struct rcu_head *)&epi->rbn, epi_rcu_free); atomic_long_dec(&ep->user->epoll_watches); @@ -2126,6 +2129,15 @@ static int __init eventpoll_init(void) /* Initialize the structure used to perform file's f_op->poll() calls */ ep_nested_calls_init(&poll_readywalk_ncalls); + /* +* We can have many thousands of epitems, so prevent this from +* using an extra cache line on 64-bit (and smaller) CPUs +*/ + BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128); + + /* make sure the overloading continues to work */ + BUILD_BUG_ON(sizeof(struct rb_node) < sizeof(struct rcu_head)); + /* Allocates slab cache used to allocate "struct epitem" items */ epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem), 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL); -- 1.8.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/3] epoll: Do not take global 'epmutex' for simple topologies
When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached directly to a wakeup source, we do not need to take the global 'epmutex', unless the epoll file descriptor is nested. The purpose of taking the 'epmutex' on add is to prevent complex topologies such as loops and deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD operations. However, for the simple case of an epoll file descriptor attached directly to a wakeup source (with no nesting), we should not have to pay by taking the 'epmutex'. This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb performance: " On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In addition the benchmark when from scaling well on 10 sockets to scaling well on just over 40 sockets. ... Currently the benchmark stops scaling at around 40-44 sockets but it seems like I found a second unrelated bottleneck. " Tested-by: Nathan Zimmer Signed-off-by: Jason Baron --- fs/eventpoll.c | 94 ++ 1 file changed, 68 insertions(+), 26 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index f4251a6..daaec16 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -595,8 +595,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi) static int ep_scan_ready_list(struct eventpoll *ep, int (*sproc)(struct eventpoll *, struct list_head *, void *), - void *priv, - int depth) + void *priv, int depth, int ep_locked) { int error, pwake = 0; unsigned long flags; @@ -607,7 +606,9 @@ static int ep_scan_ready_list(struct eventpoll *ep, * We need to lock this because we could be hit by * eventpoll_release_file() and epoll_ctl(). */ - mutex_lock_nested(&ep->mtx, depth); + + if (!ep_locked) + mutex_lock_nested(&ep->mtx, depth); /* * Steal the ready list, and re-init the original one to the @@ -671,7 +672,8 @@ static int ep_scan_ready_list(struct eventpoll *ep, } spin_unlock_irqrestore(&ep->lock, flags); - mutex_unlock(&ep->mtx); + if (!ep_locked) + mutex_unlock(&ep->mtx); /* We have to call this outside the lock */ if (pwake) @@ -824,15 +826,34 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, return 0; } +static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, +poll_table *pt); + +struct readyevents_arg { + struct eventpoll *ep; + int locked; +}; + static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests) { - return ep_scan_ready_list(priv, ep_read_events_proc, NULL, call_nests + 1); + struct readyevents_arg *arg = (struct readyevents_arg *)priv; + + return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL, + call_nests + 1, arg->locked); } static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) { int pollflags; struct eventpoll *ep = file->private_data; + struct readyevents_arg arg; + + /* +* During ep_insert() we already hold the ep->mtx for the tfile. +* Prevent re-aquisition. +*/ + arg.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0); + arg.ep = ep; /* Insert inside our poll wait queue */ poll_wait(file, &ep->poll_wait, wait); @@ -844,7 +865,7 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) * could re-enter here. */ pollflags = ep_call_nested(&poll_readywalk_ncalls, EP_MAX_NESTS, - ep_poll_readyevents_proc, ep, ep, current); + ep_poll_readyevents_proc, &arg, ep, current); return pollflags != -1 ? pollflags : 0; } @@ -1245,7 +1266,7 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi) * Must be called with "mtx" held. */ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, -struct file *tfile, int fd) +struct file *tfile, int fd, int full_check) { int error, revents, pwake = 0; unsigned long flags; @@ -1313,7 +1334,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event, /* now check if we've created too many backpaths */ error = -EINVAL; - if (reverse_path_check()) + if (full_check && reverse_path_check()) goto error_remove_epi;
Re: [RFC] eventpoll: Move a kmem_cache_alloc and kmem_cache_free
On 09/19/2013 12:37 PM, Nathan Zimmer wrote: > On 09/18/2013 02:09 PM, Jason Baron wrote: >> On 09/13/2013 11:54 AM, Nathan Zimmer wrote: >>> We noticed some scaling issue in the SPECjbb benchmark. Running perf >>> we found that the it was spending lots of time in SYS_epoll_ctl. >>> In particular it is holding the epmutex. >>> This patch helps by moving out the kmem_cache_alloc and kmem_cache_free out >>> from under the lock. It improves throughput by around 15% on 16 sockets. >>> >>> While this patch should be fine as it is there are probably is more things >>> that can be done out side the lock, like wakeup_source_unregister, but I am >>> not familar with the area and I don't know of many tests. I did find the >>> one posted by Jason Baron at https://lkml.org/lkml/2011/2/25/297. >>> >>> Any thoughts? >>> >> Hi, >> >> Intersting - I think its also possible to completely drop taking the >> 'epmutex' for EPOLL_CTL_DEL by using rcu, and restricting it on add >> to more 'complex' topologies. That is when we have an epoll descriptor >> that doesn't nest with other epoll descriptors, we don't need the >> global 'epmutex' either. Any chance you can re-run with this? Its a bit >> hacky, but we can clean it up if it makes sense. >> >> Thanks, >> >> -Jason >> > That is working GREAT. It is scaling to 16 jobs quite well. > I will have to grab a larger machine( to see what the new scaling curve > will be. > Cool. Any specific numbers would be helpful for the changelog in support of these changes. Also, I think the move the alloc/free out of from under the locks still might be nice, since we are still taking the per-ep lock in most cases. If you want I can roll those too into a patch series for this when I resubmit. Also, if you're still testing I have a small additional optimization on top of the prior patch: diff --git a/fs/eventpoll.c b/fs/eventpoll.c index d98105d..d967fd7 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -857,7 +857,7 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) struct eventpoll *ep = file->private_data; struct readyevents_params params; - params.locked = ((wait->_qproc == ep_ptable_queue_proc) ? 1 : 0); + params.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0); params.ep = ep; /* Insert inside our poll wait queue */ @@ -1907,7 +1907,6 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, } else list_add(&tf.file->f_tfile_llink, &tfile_check_list); mutex_lock_nested(&ep->mtx, 0); - ep->type = EVENTPOLL_COMPLEX; if (is_file_epoll(tf.file)) { mutex_lock_nested(&(((struct eventpoll *)tf.file->private_data)->mtx), 1); ((struct eventpoll *)tf.file->private_data)->type = EVENTPOLL_COMPLEX; Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] eventpoll: Move a kmem_cache_alloc and kmem_cache_free
On 09/22/2013 04:41 PM, Eric Wong wrote: > Jason Baron wrote: >> epoll: reduce usage of global 'epmutex' lock >> >> Epoll file descriptors that are 1 link from a wakeup source and >> are not nested within other epoll descriptors, or pointing to >> other epoll descriptors, don't need to check for loop creation or >> the creation of wakeup storms. Because of this we can avoid taking >> the global 'epmutex' in these cases. This state for the epoll file >> descriptor is marked as 'EVENTPOLL_BASIC'. Once the epoll file >> descriptor is attached to another epoll file descriptor it is >> labeled as 'EVENTPOLL_COMPLEX', and full loop checking and wakeup >> storm creation are checked using the the global 'epmutex'. It does >> not transition back. Hopefully, this is a common usecase... > > Cool. I was thinking about doing the same thing down the line (for > EPOLL_CTL_ADD, too) > >> @@ -166,6 +167,14 @@ struct epitem { >> >> /* The structure that describe the interested events and the source fd >> */ >> struct epoll_event event; >> + >> +/* TODO: really necessary? */ >> +int on_list; > > There's some things we can overload to avoid increasing epitem size > (.ep, .ffd.fd, ...), so on_list should be unnecessary. Even with 'on_list' the size of 'epitem' stayed at 128 bytes. Not sure if there are certain compile options though that can move it over that you are concerned about...so I think that change is ok. The biggest hack here was using 'struct rb_node' instead of a proper 'struct rcu_head', so as not to increase the size of epitem. I think this is safe and I've added build time checks to ensure that 'struct rb_node' is never smaller than 'struct rcu_head'. But its rather hacky. I will probably break this change out separately when I re-post so it can be reviewed independently... Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] dev_ and dynamic_debug cleanups
On Sun, Aug 26, 2012 at 04:25:25AM -0700, Joe Perches wrote: > The recent commit to fix dynamic_debug was a bit unclean. > Neaten the style for dynamic_debug. > Reduce the stack use of message logging that uses netdev_printk > Add utility functions dev_printk_emit and dev_vprintk_emit for /dev/kmsg. > > Joe Perches (5): > dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack > netdev_printk/dynamic_netdev_dbg: Directly call printk_emit > netdev_printk/netif_printk: Remove a superfluous logging colon > dev: Add dev_vprintk_emit and dev_printk_emit > device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit > Looks Good. The one thing that is bothering me though, is that for __dynamic_dev_dbg(), __dynamic_netdev_dbg(), we are copying much of the core logic of __dev_printk(), __netdev_printk(), respectively. I would prefer have this in one place. Can we add a 'prefix' argument to __dev_printk(), and __netdev_printk() that dynamic debug can use, but is simply empty for dev_printk() and netdev_printk(). Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] dev_ and dynamic_debug cleanups
On Thu, Sep 06, 2012 at 09:13:59AM -0700, Greg Kroah-Hartman wrote: > On Thu, Aug 30, 2012 at 09:48:12PM -0600, Jim Cromie wrote: > > On Thu, Aug 30, 2012 at 11:43 AM, Jim Cromie wrote: > > > On Sun, Aug 26, 2012 at 5:25 AM, Joe Perches wrote: > > >> The recent commit to fix dynamic_debug was a bit unclean. > > >> Neaten the style for dynamic_debug. > > >> Reduce the stack use of message logging that uses netdev_printk > > >> Add utility functions dev_printk_emit and dev_vprintk_emit for /dev/kmsg. > > >> > > >> Joe Perches (5): > > >> dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack > > >> netdev_printk/dynamic_netdev_dbg: Directly call printk_emit > > >> netdev_printk/netif_printk: Remove a superfluous logging colon > > >> dev: Add dev_vprintk_emit and dev_printk_emit > > >> device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit > > >> > > > > > > Ive tested this on 2 builds differing only by DYNAMIC_DEBUG > > > It works for me on x86-64 > > > > > > However, I just booted a non-dyndbg build on x86-32, and got this. > > > > > > > Ok, transient error, went away with a clean build. > > > > tested-by: Jim Cromie > > Jason, any ACK on these, or any of the other random dynamic debug > patches floating around? What am I supposed to be applying here? > Hi Greg, I just posted some follow up comments to Joe, so let's see where that discussion goes. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] dev_ and dynamic_debug cleanups
On Thu, Sep 06, 2012 at 11:43:46AM -0700, Joe Perches wrote: > On Thu, 2012-09-06 at 13:51 -0400, Jason Baron wrote: > > On Sun, Aug 26, 2012 at 04:25:25AM -0700, Joe Perches wrote: > > > The recent commit to fix dynamic_debug was a bit unclean. > > > Neaten the style for dynamic_debug. > > > Reduce the stack use of message logging that uses netdev_printk > > > Add utility functions dev_printk_emit and dev_vprintk_emit for /dev/kmsg. > > > > > > Joe Perches (5): > > > dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack > > > netdev_printk/dynamic_netdev_dbg: Directly call printk_emit > > > netdev_printk/netif_printk: Remove a superfluous logging colon > > > dev: Add dev_vprintk_emit and dev_printk_emit > > > device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit > > > > > > > Looks Good. > > > > The one thing that is bothering me though, is that for > > __dynamic_dev_dbg(), __dynamic_netdev_dbg(), we are copying much of the core > > logic of __dev_printk(), __netdev_printk(), respectively. I would prefer > > have this in one place. Can we add a 'prefix' argument to __dev_printk(), > > and __netdev_printk() that dynamic debug can use, but is simply empty > > for dev_printk() and netdev_printk(). > > I don't think that's an improvement actually. > Why don't you try it. > Ok, below is what I was thinking. It winds up being more deletions than insertions. Thanks, -Jason dynamic debug: remove duplicate __netdev_printk() and __dev_printk() logic Add an option prefix string argument to __netdev_printk() and __dev_printk(). In this way dynamic debug does not have to contain duplicate logic. Signed-off-by: Jason Baron --- drivers/base/core.c | 13 +++-- include/linux/device.h|3 +++ include/linux/netdevice.h |3 +++ lib/dynamic_debug.c | 32 net/core/dev.c| 11 ++- 5 files changed, 27 insertions(+), 35 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 65f82e3..7b4ee6d 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1937,15 +1937,16 @@ int dev_printk_emit(int level, const struct device *dev, const char *fmt, ...) } EXPORT_SYMBOL(dev_printk_emit); -static int __dev_printk(const char *level, const struct device *dev, - struct va_format *vaf) +int __dev_printk(const char *level, const struct device *dev, +char *prefix, struct va_format *vaf) { if (!dev) return printk("%s(NULL device *): %pV", level, vaf); return dev_printk_emit(level[1] - '0', dev, - "%s %s: %pV", - dev_driver_string(dev), dev_name(dev), vaf); + "%s%s %s: %pV", + prefix, dev_driver_string(dev), + dev_name(dev), vaf); } int dev_printk(const char *level, const struct device *dev, @@ -1960,7 +1961,7 @@ int dev_printk(const char *level, const struct device *dev, vaf.fmt = fmt; vaf.va = &args; - r = __dev_printk(level, dev, &vaf); + r = __dev_printk(level, dev, "", &vaf); va_end(args); @@ -1980,7 +1981,7 @@ int func(const struct device *dev, const char *fmt, ...) \ vaf.fmt = fmt; \ vaf.va = &args; \ \ - r = __dev_printk(kern_level, dev, &vaf);\ + r = __dev_printk(kern_level, dev, "", &vaf);\ \ va_end(args); \ \ diff --git a/include/linux/device.h b/include/linux/device.h index 2da4589..573b15e 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -891,6 +891,9 @@ extern const char *dev_driver_string(const struct device *dev); #ifdef CONFIG_PRINTK +extern int __dev_printk(const char *level, const struct device *dev, + char *prefix, struct va_format *vaf); + extern int dev_vprintk_emit(int level, const struct device *dev, const char *fmt, va_list args); extern __printf(3, 4) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 5f49cc0..63f6165 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2720,6 +2720,9 @@ static inline const char *netdev_name(const struct net_device *dev) return dev->name; } +extern int __netdev_p
Re: [PATCH 0/5] dev_ and dynamic_debug cleanups
On Fri, Sep 07, 2012 at 08:12:01AM -0700, Joe Perches wrote: > On Fri, 2012-09-07 at 10:52 -0400, Jason Baron wrote: > > On Thu, Sep 06, 2012 at 11:43:46AM -0700, Joe Perches wrote: > > > On Thu, 2012-09-06 at 13:51 -0400, Jason Baron wrote: > > > > On Sun, Aug 26, 2012 at 04:25:25AM -0700, Joe Perches wrote: > > > > > The recent commit to fix dynamic_debug was a bit unclean. > > > > > Neaten the style for dynamic_debug. > > > > > Reduce the stack use of message logging that uses netdev_printk > > > > > Add utility functions dev_printk_emit and dev_vprintk_emit for > > > > > /dev/kmsg. > > > > > > > > > > Joe Perches (5): > > > > > dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack > > > > > netdev_printk/dynamic_netdev_dbg: Directly call printk_emit > > > > > netdev_printk/netif_printk: Remove a superfluous logging colon > > > > > dev: Add dev_vprintk_emit and dev_printk_emit > > > > > device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit > > > > > > > > > > > > > Looks Good. > > > > > > > > The one thing that is bothering me though, is that for > > > > __dynamic_dev_dbg(), __dynamic_netdev_dbg(), we are copying much of the > > > > core > > > > logic of __dev_printk(), __netdev_printk(), respectively. I would prefer > > > > have this in one place. Can we add a 'prefix' argument to > > > > __dev_printk(), > > > > and __netdev_printk() that dynamic debug can use, but is simply empty > > > > for dev_printk() and netdev_printk(). > > > > > > I don't think that's an improvement actually. > > > Why don't you try it. > > > > > > > Ok, below is what I was thinking. It winds up being more deletions than > > insertions. > > > > Thanks, > > > > -Jason > > > > dynamic debug: remove duplicate __netdev_printk() and __dev_printk() logic > > > > Add an option prefix string argument to __netdev_printk() and > > __dev_printk(). > > In this way dynamic debug does not have to contain duplicate logic. > > > > Signed-off-by: Jason Baron > > --- > > drivers/base/core.c | 13 +++-- > > include/linux/device.h|3 +++ > > include/linux/netdevice.h |3 +++ > > lib/dynamic_debug.c | 32 > > net/core/dev.c| 11 ++- > > 5 files changed, 27 insertions(+), 35 deletions(-) > > > > diff --git a/drivers/base/core.c b/drivers/base/core.c > > index 65f82e3..7b4ee6d 100644 > > --- a/drivers/base/core.c > > +++ b/drivers/base/core.c > > @@ -1937,15 +1937,16 @@ int dev_printk_emit(int level, const struct device > > *dev, const char *fmt, ...) > > } > > EXPORT_SYMBOL(dev_printk_emit); > > > > -static int __dev_printk(const char *level, const struct device *dev, > > - struct va_format *vaf) > > +int __dev_printk(const char *level, const struct device *dev, > > +char *prefix, struct va_format *vaf) > > { > > if (!dev) > > return printk("%s(NULL device *): %pV", level, vaf); > > > > return dev_printk_emit(level[1] - '0', dev, > > - "%s %s: %pV", > > - dev_driver_string(dev), dev_name(dev), vaf); > > + "%s%s %s: %pV", > > + prefix, dev_driver_string(dev), > > + dev_name(dev), vaf); > > } > > > > int dev_printk(const char *level, const struct device *dev, > > @@ -1960,7 +1961,7 @@ int dev_printk(const char *level, const struct device > > *dev, > > vaf.fmt = fmt; > > vaf.va = &args; > > > > - r = __dev_printk(level, dev, &vaf); > > + r = __dev_printk(level, dev, "", &vaf); > > > > va_end(args); > > > > @@ -1980,7 +1981,7 @@ int func(const struct device *dev, const char *fmt, > > ...) \ > > vaf.fmt = fmt; \ > > vaf.va = &args; \ > > \ > > - r = __dev_printk(kern_level, dev, &vaf);\ > > + r = __dev_printk(kern_level, dev, "", &vaf);\
Re: [PATCH 0/5] dev_ and dynamic_debug cleanups
On Fri, Sep 07, 2012 at 06:55:50PM -0700, Joe Perches wrote: > On Fri, 2012-09-07 at 11:35 -0400, Jason Baron wrote: > > If nobody else thinks this patch is better, let's at least add a comment in > > __dev_printk() and __netdev_printk() to fix dynamic debug, if these are > > changed. > > Or maybe make dynamic_emit_prefix a public function > and move the __dynamic__printk functions to > nearby the ___printk functions so it's easier to > see that changing one requires the other to change. > ok, that makes sense to me. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] dynamic_debug: Fix vpr_ logging styles
From: Joe Perches vpr_info_dq should be a function and vpr_info should have a do {} while (0) Add missing newlines to pr_s. Miscellaneous neatening too. braces, coalescing formats, alignments, etc... Signed-off-by: Joe Perches Signed-off-by: Jason Baron --- lib/dynamic_debug.c | 118 +++ 1 files changed, 62 insertions(+), 56 deletions(-) diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c index e7f7d99..c0869f1 100644 --- a/lib/dynamic_debug.c +++ b/lib/dynamic_debug.c @@ -59,7 +59,7 @@ struct ddebug_iter { static DEFINE_MUTEX(ddebug_lock); static LIST_HEAD(ddebug_tables); -static int verbose = 0; +static int verbose; module_param(verbose, int, 0644); /* Return the last part of a pathname */ @@ -107,24 +107,32 @@ static char *ddebug_describe_flags(struct _ddebug *dp, char *buf, return buf; } -#define vpr_info(fmt, ...) \ - if (verbose) do { pr_info(fmt, ##__VA_ARGS__); } while (0) - -#define vpr_info_dq(q, msg)\ +#define vpr_info(fmt, ...) \ do { \ - /* trim last char off format print */ \ - vpr_info("%s: func=\"%s\" file=\"%s\" " \ - "module=\"%s\" format=\"%.*s\" "\ - "lineno=%u-%u", \ - msg,\ - q->function ? q->function : "", \ - q->filename ? q->filename : "", \ - q->module ? q->module : "", \ - (int)(q->format ? strlen(q->format) - 1 : 0), \ - q->format ? q->format : "", \ - q->first_lineno, q->last_lineno); \ + if (verbose)\ + pr_info(fmt, ##__VA_ARGS__);\ } while (0) +static void vpr_info_dq(const struct ddebug_query *query, const char *msg) +{ + /* trim any trailing newlines */ + int fmtlen = 0; + + if (query->format) { + fmtlen = strlen(query->format); + while (fmtlen && query->format[fmtlen - 1] == '\n') + fmtlen--; + } + + vpr_info("%s: func=\"%s\" file=\"%s\" module=\"%s\" format=\"%.*s\" lineno=%u-%u\n", +msg, +query->function ? query->function : "", +query->filename ? query->filename : "", +query->module ? query->module : "", +fmtlen, query->format ? query->format : "", +query->first_lineno, query->last_lineno); +} + /* * Search the tables for _ddebug's which match the given `query' and * apply the `flags' and `mask' to them. Returns number of matching @@ -148,7 +156,7 @@ static int ddebug_change(const struct ddebug_query *query, if (query->module && strcmp(query->module, dt->mod_name)) continue; - for (i = 0 ; i < dt->num_ddebugs ; i++) { + for (i = 0; i < dt->num_ddebugs; i++) { struct _ddebug *dp = &dt->ddebugs[i]; /* match against the source filename */ @@ -183,10 +191,10 @@ static int ddebug_change(const struct ddebug_query *query, continue; dp->flags = newflags; vpr_info("changed %s:%d [%s]%s =%s\n", - trim_prefix(dp->filename), dp->lineno, - dt->mod_name, dp->function, - ddebug_describe_flags(dp, flagbuf, - sizeof(flagbuf))); +trim_prefix(dp->filename), dp->lineno, +dt->mod_name, dp->function, +ddebug_describe_flags(dp, flagbuf, + sizeof(flagbuf))); } } mutex_unlock(&ddebug_lock); @@ -220,12 +228,12 @@ static int ddebug_tokenize(char *buf, char *words[], int maxwords) /* find `end' of word, whitespace separated or quoted */ if (*buf == '"' || *buf == '\'') { int quote = *buf++; - for (end = buf ; *end && *end != quote ; end++) +
[PATCH 3/3] dynamic_debug: add pr_errs before -EINVALs
From: Jim Cromie Ma noted that dynamic-debug is silent about many query errors, so add pr_err()s to explain those errors, and tweak a few others. Also parse flags 1st, so that match-spec errs are slightly clearer. CC: Jianpeng Ma CC: Joe Perches CC: Greg KH Signed-off-by: Jim Cromie Signed-off-by: Jason Baron --- lib/dynamic_debug.c | 47 +++ 1 files changed, 35 insertions(+), 12 deletions(-) diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c index c0869f1..da820f2 100644 --- a/lib/dynamic_debug.c +++ b/lib/dynamic_debug.c @@ -230,8 +230,10 @@ static int ddebug_tokenize(char *buf, char *words[], int maxwords) int quote = *buf++; for (end = buf; *end && *end != quote; end++) ; - if (!*end) + if (!*end) { + pr_err("unclosed quote: %s\n", buf); return -EINVAL; /* unclosed quote */ + } } else { for (end = buf; *end && !isspace(*end); end++) ; @@ -239,8 +241,10 @@ static int ddebug_tokenize(char *buf, char *words[], int maxwords) } /* `buf' is start of word, `end' is one past its end */ - if (nwords == maxwords) + if (nwords == maxwords) { + pr_err("too many words, legal max <=%d\n", maxwords); return -EINVAL; /* ran out of words[] before bytes */ + } if (*end) *end++ = '\0'; /* terminate the word */ words[nwords++] = buf; @@ -272,7 +276,11 @@ static inline int parse_lineno(const char *str, unsigned int *val) return 0; } *val = simple_strtoul(str, &end, 10); - return end == NULL || end == str || *end != '\0' ? -EINVAL : 0; + if (end == NULL || end == str || *end != '\0') { + pr_err("bad line-number: %s\n", str); + return -EINVAL; + } + return 0; } /* @@ -352,8 +360,10 @@ static int ddebug_parse_query(char *words[], int nwords, int rc; /* check we have an even number of words */ - if (nwords % 2 != 0) + if (nwords % 2 != 0) { + pr_err("expecting pairs of match-spec \n"); return -EINVAL; + } memset(query, 0, sizeof(*query)); if (modname) @@ -374,18 +384,22 @@ static int ddebug_parse_query(char *words[], int nwords, char *first = words[i+1]; char *last = strchr(first, '-'); if (query->first_lineno || query->last_lineno) { - pr_err("match-spec:line given 2 times\n"); + pr_err("match-spec: line used 2x\n"); return -EINVAL; } if (last) *last++ = '\0'; - if (parse_lineno(first, &query->first_lineno) < 0) + if (parse_lineno(first, &query->first_lineno) < 0) { + pr_err("line-number is <0\n"); return -EINVAL; + } if (last) { /* range - */ if (parse_lineno(last, &query->last_lineno) < query->first_lineno) { - pr_err("last-line < 1st-line\n"); + pr_err("last-line:%d < 1st-line:%d\n", + query->last_lineno, + query->first_lineno); return -EINVAL; } } else { @@ -421,6 +435,7 @@ static int ddebug_parse_flags(const char *str, unsigned int *flagsp, op = *str++; break; default: + pr_err("bad flag-op %c, at start of %s\n", *str, str); return -EINVAL; } vpr_info("op='%c'\n", op); @@ -432,8 +447,10 @@ static int ddebug_parse_flags(const char *str, unsigned int *flagsp, break; } } - if (i < 0) + if (i < 0) { + pr_err("unknown flag '%c' in \"%s\"\n", *str, str); return -EINVAL; +
[PATCH 2/3] dynamic_debug: dynamic hex dump
From: Vladimir Kondratiev Introduce print_hex_dump_debug() that can be dynamically controlled, similar to pr_debug. Also, make print_hex_dump_bytes() dynamically controlled Implement only 'p' flag (_DPRINTK_FLAGS_PRINT) to keep it simple since hex dump prints multiple lines and long prefix would impact readability. To provide line/file etc. information, use pr_debug or similar before/after print_hex_dump_debug() Signed-off-by: Vladimir Kondratiev Signed-off-by: Jason Baron --- Documentation/dynamic-debug-howto.txt | 15 +-- include/linux/dynamic_debug.h | 11 +++ include/linux/printk.h| 17 + lib/hexdump.c |4 +++- 4 files changed, 44 insertions(+), 3 deletions(-) diff --git a/Documentation/dynamic-debug-howto.txt b/Documentation/dynamic-debug-howto.txt index 6e16849..72322c6 100644 --- a/Documentation/dynamic-debug-howto.txt +++ b/Documentation/dynamic-debug-howto.txt @@ -6,8 +6,16 @@ This document describes how to use the dynamic debug (dyndbg) feature. Dynamic debug is designed to allow you to dynamically enable/disable kernel code to obtain additional kernel information. Currently, if -CONFIG_DYNAMIC_DEBUG is set, then all pr_debug()/dev_dbg() calls can -be dynamically enabled per-callsite. +CONFIG_DYNAMIC_DEBUG is set, then all pr_debug()/dev_dbg() and +print_hex_dump_debug()/print_hex_dump_bytes() calls can be dynamically +enabled per-callsite. + +If CONFIG_DYNAMIC_DEBUG is not set, print_hex_dump_debug() is just +shortcut for print_hex_dump(KERN_DEBUG). + +For print_hex_dump_debug()/print_hex_dump_bytes(), format string is +its 'prefix_str' argument, if it is constant string; or "hexdump" +in case 'prefix_str' is build dynamically. Dynamic debug has even more useful features: @@ -202,6 +210,9 @@ The flags are: tInclude thread ID in messages not generated from interrupt context _No flags are set. (Or'd with others on input) +For print_hex_dump_debug() and print_hex_dump_bytes(), only 'p' flag +have meaning, other flags ignored. + For display, the flags are preceded by '=' (mnemonic: what the flags are currently equal to). diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h index 6dd4787..2fe93b2 100644 --- a/include/linux/dynamic_debug.h +++ b/include/linux/dynamic_debug.h @@ -95,6 +95,17 @@ do { \ ##__VA_ARGS__);\ } while (0) +#define dynamic_hex_dump(prefix_str, prefix_type, rowsize, \ +groupsize, buf, len, ascii)\ +do { \ + DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, \ + __builtin_constant_p(prefix_str) ? prefix_str : "hexdump");\ + if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT)) \ + print_hex_dump(KERN_DEBUG, prefix_str, \ + prefix_type, rowsize, groupsize, \ + buf, len, ascii);\ +} while (0) + #else #include diff --git a/include/linux/printk.h b/include/linux/printk.h index 9afc01e..02c95cf 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -321,8 +321,13 @@ extern void hex_dump_to_buffer(const void *buf, size_t len, extern void print_hex_dump(const char *level, const char *prefix_str, int prefix_type, int rowsize, int groupsize, const void *buf, size_t len, bool ascii); +#if defined(CONFIG_DYNAMIC_DEBUG) +#define print_hex_dump_bytes(prefix_str, prefix_type, buf, len)\ + dynamic_hex_dump(prefix_str, prefix_type, 16, 1, buf, len, true) +#else extern void print_hex_dump_bytes(const char *prefix_str, int prefix_type, const void *buf, size_t len); +#endif /* defined(CONFIG_DYNAMIC_DEBUG) */ #else static inline void print_hex_dump(const char *level, const char *prefix_str, int prefix_type, int rowsize, int groupsize, @@ -336,4 +341,16 @@ static inline void print_hex_dump_bytes(const char *prefix_str, int prefix_type, #endif +#if defined(CONFIG_DYNAMIC_DEBUG) +#define print_hex_dump_debug(prefix_str, prefix_type, rowsize, \ +groupsize, buf, len, ascii)\ + dynamic_hex_dump(prefix_str, prefix_type, rowsize, \ +groupsize, buf, len, ascii) +#else +#define print_hex_dump_debug(prefix_str, prefix_type, rowsize, \ +groupsize, buf, len, ascii)\ + print_hex_dump(KERN_DEBUG, prefix_str, prefix_type, rowsize,\ + groupsize, buf, len, ascii) +#endif /* defined(CONFIG_DYNAMIC_DEBUG) */ + #endif diff --git a/lib/hexdum
[PATCH 0/3] dynamic_debug: Add print_hex_dump_bytes/debug support and cleanups
Hi Greg, Here's a collection of the latest dyanmic debug patches that I have pending. Thanks, -Jason Jim Cromie (1): dynamic_debug: add pr_errs before -EINVALs Joe Perches (1): dynamic_debug: Fix vpr_ logging styles Vladimir Kondratiev (1): dynamic_debug: dynamic hex dump Documentation/dynamic-debug-howto.txt | 15 +++- include/linux/dynamic_debug.h | 11 ++ include/linux/printk.h| 17 lib/dynamic_debug.c | 165 +++-- lib/hexdump.c |4 +- 5 files changed, 141 insertions(+), 71 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] dynamic_debug: Add print_hex_dump_bytes/debug support and cleanups
On Wed, Dec 05, 2012 at 02:05:06PM -0800, Joe Perches wrote: > On Wed, 2012-12-05 at 16:48 -0500, Jason Baron wrote: > > Here's a collection of the latest dynamic debug patches that I have > > pending. > > Any update on the jump table support? > I have patches that implement the support, although they are a bit hacky due to header dependencies. However, the main reason I didn't post them yet is that the 'data' section increases quite a bit due to the increased table size (text section is reduced, since we are saving instructions). On x86_64, the the upper 32-bits of each table entry are always '0x', so the table could be reduced by a factor of 2. Although, I haven't figured out a simple way of doing so. Perhaps, the 'data' section bloat is ok, due to the runtime savings, though. I can dig out the % increase data. While we're on new features, I think it would be nice to have the ability to enable the core dynamic debug library - lib/dynamic_debug.c, independently from the debug statements that use it. So, having CONFIG_DYNAMIC_DEBUG would just configure the infrastructure bits, and something such as CONFIG_DYNAMIC_DEBUG_PRINTK could enable pr_debug() and friends. In that way subsystems that want to use dynamic output can be enabled/disabled independently. So they could do: #ifdef CONFIG_DRIVER_FOO_DEBUG #define drv_foo_debug() dev_dbg() #else #define drv_foo_debug() {} where CONFIG_DRIVER_FOO_DEBUG would enable CONFIG_DYNAMIC_DEBUG. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] dynamic_debug: Remove unnecessary __used
On Thu, Oct 18, 2012 at 12:07:03PM -0700, Joe Perches wrote: > The __used attribute prevents gcc from eliminating > unnecessary, otherwise optimized away, metadata for > debugging logging messages. > > Remove the __used attribute. > > Signed-off-by: Joe Perches > --- > include/linux/dynamic_debug.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h > index c18257b..6dd4787 100644 > --- a/include/linux/dynamic_debug.h > +++ b/include/linux/dynamic_debug.h > @@ -61,7 +61,7 @@ int __dynamic_netdev_dbg(struct _ddebug *descriptor, >const char *fmt, ...); > > #define DEFINE_DYNAMIC_DEBUG_METADATA(name, fmt) \ > - static struct _ddebug __used __aligned(8) \ > + static struct _ddebug __aligned(8) \ > __attribute__((section("__verbose"))) name = { \ > .modname = KBUILD_MODNAME, \ > .function = __func__, \ > > Greg, Pls pull this. Acked-by: Jason Baron -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 02/10] kvm: Check userspace_addr when modifying a memory slot
On Thu, Dec 06, 2012 at 03:20:37PM -0700, Alex Williamson wrote: > The API documents that only flags and guest physical memory space can > be modified on an existing slot, but we don't enforce that the > userspace address cannot be modified. Instead we just ignore it. > This means that a user may think they've successfully moved both the > guest and user addresses, when in fact only the guest address changed. > Check and error instead. > > Signed-off-by: Alex Williamson > --- > virt/kvm/kvm_main.c |8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index e426704..93213e1 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -779,13 +779,19 @@ int __kvm_set_memory_region(struct kvm *kvm, > > r = -ENOMEM; > > - /* Allocate if a slot is being created */ > + /* > + * Allocate if a slot is being created. If modifying a slot, > + * the userspace_addr cannot change. > + */ > if (!old.npages) { > new.user_alloc = user_alloc; > new.userspace_addr = mem->userspace_addr; > > if (kvm_arch_create_memslot(&new, npages)) > goto out_free; > + } else if (mem->userspace_addr != old.userspace_addr) { > + r = -EINVAL; > + goto out_free; > } > > /* Allocate page dirty bitmap if needed */ > hmmm...does this mean that on a 'destroy', where npages is 0, the user has to set up userspace_addr correctly? If so, that would appear to be an unwanted change in semantics here. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 02/10] kvm: Check userspace_addr when modifying a memory slot
On Fri, Dec 07, 2012 at 11:32:08AM -0700, Alex Williamson wrote: > On Fri, 2012-12-07 at 13:17 -0500, Jason Baron wrote: > > On Thu, Dec 06, 2012 at 03:20:37PM -0700, Alex Williamson wrote: > > > The API documents that only flags and guest physical memory space can > > > be modified on an existing slot, but we don't enforce that the > > > userspace address cannot be modified. Instead we just ignore it. > > > This means that a user may think they've successfully moved both the > > > guest and user addresses, when in fact only the guest address changed. > > > Check and error instead. > > > > > > Signed-off-by: Alex Williamson > > > --- > > > virt/kvm/kvm_main.c |8 +++- > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index e426704..93213e1 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -779,13 +779,19 @@ int __kvm_set_memory_region(struct kvm *kvm, > > > > > > r = -ENOMEM; > > > > > > - /* Allocate if a slot is being created */ > > > + /* > > > + * Allocate if a slot is being created. If modifying a slot, > > > + * the userspace_addr cannot change. > > > + */ > > > if (!old.npages) { > > > new.user_alloc = user_alloc; > > > new.userspace_addr = mem->userspace_addr; > > > > > > if (kvm_arch_create_memslot(&new, npages)) > > > goto out_free; > > > + } else if (mem->userspace_addr != old.userspace_addr) { > > > + r = -EINVAL; > > > + goto out_free; > > > } > > > > > > /* Allocate page dirty bitmap if needed */ > > > > > > > hmmm...does this mean that on a 'destroy', where npages is 0, the user > > has to set up userspace_addr correctly? If so, that would appear to be > > an unwanted change in semantics here. > > Good point, it does change that. We could make this be (npages && > mem->userspace_addr != old.userspace_addr) to avoid that case. Thanks, > > Alex > Yup, that works for me. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dynamic_debug: introduce debug_hex_dump()
On Tue, Dec 11, 2012 at 11:36:46AM -0800, Luis R. Rodriguez wrote: > On Tue, Dec 04, 2012 at 11:28:26AM +0200, Vladimir Kondratiev wrote: > > Hi Jason, > > > > Do you have any update on the status for patches below? > > Where is it now? When do you expect it to merge? 3.8? > > I am waiting for this to merge before I can go on > > with my driver. > > *poke* > > Luis Hi Luis, Yes, I posted this patch, along with 2 others last week for Greg to pull. I believe these are too late for the 3.8 merge window. Perhaps, Greg can comment. Here's a link to the post: http://marc.info/?l=linux-kernel&m=135474417810281&w=2 Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dynamic_debug: introduce debug_hex_dump()
On Sun, Nov 18, 2012 at 10:35:40AM -0800, Joe Perches wrote: > On Sun, 2012-11-18 at 19:01 +0200, Vladimir Kondratiev wrote: > > In case this option wins, patch follows. There is no need for > > 2-nd one to fix existing drivers. > > Thanks Vladimir. > > Yes, thanks. I've pulled in this patch, and posted a dynamic debug tree at: git clone git://github.com/jibaron/ddebug.git I will push the collected set of patches to Greg. I also added a couple of patches there to convert dynamic debug to use 'jump labels', which shows good results. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: racy jump label users
Hi Andi, Agreed. However, other users of 'static_key_enabled()', provide their own locking. For example, in kernel/tracepoint.c, 'static_key_enabled()', relies on the tracepoints_mutex. Were there any other users that are problematic? I agree a 'setstate' would be nice. Maybe something like: static_key_slow_set_true(); static_key_slow_set_false(); Thanks, -Jason On 03/22/2013 03:55 PM, Andi Kleen wrote: > Jason, > > I noticed that a lot of the jump label users are racy, > because they implement something like this > > static void sched_feat_disable(int i) > { > if (static_key_enabled(&sched_feat_keys[i])) > static_key_slow_dec(&sched_feat_keys[i]); > } > > static void sched_feat_enable(int i) > { > if (!static_key_enabled(&sched_feat_keys[i])) > static_key_slow_inc(&sched_feat_keys[i]); > } > > with no extra locking, controlled by sysfs. If two > CPUs do this in parallel the reference can be set multiple > times, which gives very unexpected semantics for a sysfs boolean. > > Most likely you need a static_key_slow_setstate() > that does the check and set inside the jump label lock. > > I understand that for inside kernel use reference > counts are the right semantics, but they are not so > good for sysfs interfaces. > > -Andi > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] staging: ozwpan: Convert printk to dev_dbg()
On 06/25/2013 01:29 PM, Joe Perches wrote: > On Tue, 2013-06-25 at 10:02 -0700, Greg KH wrote: >> On Tue, Jun 25, 2013 at 05:30:02PM +0100, Rupesh Gujare wrote: >>> convert all debug messages from printk to dev_dbg() & add kernel config to >>> enable/disable these messages during compilation. >> No, just use the built-in dynamic debug code in the kernel, no need to >> provide any new macros or functions or most importantly, no new Kconfig >> options. > I think the Kconfig option is pretty poor too but a > long needed extension to dev_dbg is to enable classes > of messages by level or mask. > > There are many existing macros like > > #define module_dbg(level, fmt, ...) > do { > if (level >= some_module_var) > debug_something(...); > } while (0) > > and > > #define module_dbg(mask, fmt, ...) > do { > if (mask & some_module_var) > debug_something(...) > } while (0) > > It'd be nice to consolidate those in dev_dbg > > I'll get 'round to it one day if Jason doesn't. > Hi, I've been a bit reluctant to add these 'flag' and 'level' to dynamic debug b/c the debug statements can already be controlled individually, and thus one could implement something pretty similar in userspace. Also, it keeps things simpler. That said, this has come up several times and might be a nice extension for both the dynamic debug and non-dynamic debug case. I think the interface could look something like: pr_debug_level(&control_var, level, fmt, ...); pr_debug_mask(&control_var, mask, fmt, ...); and then you could do something like: echo "grouping control_var level N +p" > /debugfs/dynamic_debug/control or echo "grouping control_var mask 0xN +p" > /debugfs/dynamic_debug/control So we can think of the 'control_var' as 'grouping' debug statements together. So it would provide a nice way of associating statements together other than the existing: module, func, file, format, and line selectors. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] staging: ozwpan: Convert printk to dev_dbg()
On 06/25/2013 02:03 PM, Joe Perches wrote: > (Using Jason Baron's most current email address) > > On Tue, 2013-06-25 at 10:56 -0700, Joe Perches wrote: >> On Tue, 2013-06-25 at 10:38 -0700, Greg KH wrote: >>> On Tue, Jun 25, 2013 at 10:29:50AM -0700, Joe Perches wrote: a long needed extension to dev_dbg is to enable classes of messages by level or mask. There are many existing macros like #define module_dbg(level, fmt, ...) do { if (level >= some_module_var) debug_something(...); } while (0) and #define module_dbg(mask, fmt, ...) do { if (mask & some_module_var) debug_something(...) } while (0) It'd be nice to consolidate those in dev_dbg >>> I'd almost recommend that all of them just be removed, because most of >>> them were only used for debugging the code the first time it was >>> developed, right? Only for very limited usages would this type of thing >>> be needed. >> Maybe, though that's hard to know with certainty. >> How often module options are used isn't easy to find out. >> >> These forms are sometimes used to reduce object size. > Hey Jason. Your redhat.com address is out of date. > > If you're still interested in dynamic_debug, you should > update your MAINTAINERS entry. > > Hi Joe, Thanks for asking - I'll send out a patch to update it. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] static keys: fix test/set races
On 06/29/2013 03:20 AM, Ingo Molnar wrote: * jba...@akamai.com wrote: Hi, As pointed out by Andi Kleen, some static key users can be racy because they check the value of the key->enabled, and then subsequently update the branch direction. A number of call sites have 'higher' level locking that avoids this race, but the usage in the scheduler features does not. See: http://lkml.indiana.edu/hypermail/linux/kernel/1304.2/01655.html But that's not an issue at all - switching the scheduler features is for development and debugging only, and in some cases higher level locking would be needed to solve it 'properly', beyond what the keys API could give ... So this is pretty pointless, sorry, please don't complicate this facility. Thanks, Ingo Hi Ingo, Yes, I agree that 'higher' level locking may be required for some callers of the newly proposed interface. However, I do think that the static_key_slow_set_true()/false() provides a nice abstraction for some callers, while addressing test/set() races, by making that sequence atomic. I view the proposed inteface of set_true()/set_false() as somewhat analogous to an atomic_set() call. In the same way, the current static_key_slow_inc()/dec() are analogous to atomic_inc()/dec(). It arguably makes the code code a bit more readable, transforming sequences such as: if (!static_key_enabled(&control_var)) static_key_slow_inc(&control_var); into: static_key_slow_set_true(&control_var); I see at least 3 users of static_keys in the tree which I think would benefit from this transformation. The 2 attached with this series, and the usage in kernel/tracepoint.c. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] udp: make use of static_key_slow_set_true() interface
On 06/28/2013 11:13 PM, Steven Rostedt wrote: On Fri, 2013-06-28 at 22:30 +, jba...@akamai.com wrote: Make use of the simpler API. Need to make the change log much more descriptive. Never assume someone in the future that looks up a change to this file will know about other commits that led to this. Each change log should be sufficient to stand on its own. Explain why this patch is needed. And it's not about the use of a simpler API. It actually fixes a real bug. Signed-off-by: Jason Baron --- net/ipv4/udp.c |9 - net/ipv6/udp.c |9 - 2 files changed, 8 insertions(+), 10 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 0bf5d39..b8d0029 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1421,11 +1421,10 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) } -static struct static_key udp_encap_needed __read_mostly; +static struct static_key_boolean udp_encap_needed __read_mostly; void udp_encap_enable(void) { - if (!static_key_enabled(&udp_encap_needed)) - static_key_slow_inc(&udp_encap_needed); + static_key_slow_set_true(&udp_encap_needed); } EXPORT_SYMBOL(udp_encap_enable); @@ -1450,7 +1449,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) goto drop; nf_reset(skb); - if (static_key_false(&udp_encap_needed) && up->encap_type) { + if (static_key_false(&udp_encap_needed.key) && up->encap_type) { I wonder if we should add a static_bool_key_false(), because that added ".key" is a bit confusing. -- Steve Yeah - that is sort of ugly, but it avoids introducing a new branch API call. That said, a 'static_bool_key_false()' would probably be a simple wrapper function. -Jason int (*encap_rcv)(struct sock *sk, struct sk_buff *skb); /* @@ -1773,7 +1772,7 @@ void udp_destroy_sock(struct sock *sk) bool slow = lock_sock_fast(sk); udp_flush_pending_frames(sk); unlock_sock_fast(sk, slow); - if (static_key_false(&udp_encap_needed) && up->encap_type) { + if (static_key_false(&udp_encap_needed.key) && up->encap_type) { void (*encap_destroy)(struct sock *sk); encap_destroy = ACCESS_ONCE(up->encap_destroy); if (encap_destroy) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 42923b1..94a4091 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -573,11 +573,10 @@ static __inline__ void udpv6_err(struct sk_buff *skb, __udp6_lib_err(skb, opt, type, code, offset, info, &udp_table); } -static struct static_key udpv6_encap_needed __read_mostly; +static struct static_key_boolean udpv6_encap_needed __read_mostly; void udpv6_encap_enable(void) { - if (!static_key_enabled(&udpv6_encap_needed)) - static_key_slow_inc(&udpv6_encap_needed); + static_key_slow_set_true(&udpv6_encap_needed); } EXPORT_SYMBOL(udpv6_encap_enable); @@ -590,7 +589,7 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb)) goto drop; - if (static_key_false(&udpv6_encap_needed) && up->encap_type) { + if (static_key_false(&udpv6_encap_needed.key) && up->encap_type) { int (*encap_rcv)(struct sock *sk, struct sk_buff *skb); /* @@ -1300,7 +1299,7 @@ void udpv6_destroy_sock(struct sock *sk) udp_v6_flush_pending_frames(sk); release_sock(sk); - if (static_key_false(&udpv6_encap_needed) && up->encap_type) { + if (static_key_false(&udpv6_encap_needed.key) && up->encap_type) { void (*encap_destroy)(struct sock *sk); encap_destroy = ACCESS_ONCE(up->encap_destroy); if (encap_destroy) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Dynamic debug on by default?
On 08/14/2013 12:51 PM, Joe Perches wrote: > On Wed, 2013-08-14 at 09:40 -0700, Sarah Sharp wrote: >> Hi Xenia, >> >> I'm a bit confused. I thought that debugging messages would be turned >> off by default for a module if CONFIG_DYNAMIC_DEBUG was turned on. When >> I tested your patch to remove the CONFIG_USB_XHCI_HCD_DEBUGGING and just >> use dev_dbg, the messages from the xHCI driver appeared in dmesg by >> default. >> >> That generates a lot of log spew. We can have distros add a boot >> parameter option to turn off debug messages, but that boot parameter is >> limited to 1023 characters. I'm concerned that if more drivers add >> dynamic debugging, the distros will eventually run out of space in the >> dynamic debugging boot parameter. I know Greg was ripping out debugging >> config options in other USB drivers, so this is a bit concerning. >> >> Jason, is there a way within the xHCI driver to say that dynamic >> debugging should be off by default? I've looked through the >> documentation, and I can't find anything like that documented. > #undefine DEBUG > >> I've attached my .config file, in case I have something misconfigured. > Because of: > > drivers/usb/host/Makefile:ccflags-$(CONFIG_USB_DEBUG) := -DDEBUG > > you probably want to turn this off > > CONFIG_USB_DEBUG=y > > when dynamic_debug is on. They should also be emitted in the case that dynamic_debug is off, but we have set -DDEBUG, right? That was the whole point of enabling them for dynamic debug too - to mirror the behavior of !CONFIG_DYNAMIC_DEBUG case. -Jason > CONFIG_DYNAMIC_DEBUG=y > > This will cause all pr_debug/dev_dbg statements to be > emitted without specific enabling via the dynamic_debug > control file. > > (from include/linux/dynamic_debug.h) > > #if defined DEBUG > #define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINT > #else > #define _DPRINTK_FLAGS_DEFAULT 0 > #endif > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] dynamic_debug: add wildcard support to filter files/functions/modules
On 07/25/2013 12:47 PM, Joe Perches wrote: On Thu, 2013-07-25 at 21:02 +0800, Du, Changbin wrote: From: "Du, Changbin" This patch add wildcard '*'(matches zero or more characters) and '?' (matches one character) support when qurying debug flags. Seems very useful. Caveat below. diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c [] @@ -127,6 +127,21 @@ static void vpr_info_dq(const struct ddebug_query *query, const char *msg) query->first_lineno, query->last_lineno); } +static int match_pattern(char *pat, char *str) +{ + switch (*pat) { + case '\0': + return !*str; + case '*': + return match_pattern(pat+1, str) || + (*str && match_pattern(pat, str+1)); + case '?': + return *str && match_pattern(pat+1, str+1); + default: + return *pat == *str && match_pattern(pat+1, str 1); + } +} What's the maximum length string used today? On a very long pattern, can this recursion cause stack overflow? Other than that, I like it. The recursion here is a concern - especially since the 'pat' pointer is controlled from userspace. Maybe not at pretty, but this can easily be done in userspace. For example, to set all driver/usb* one could do something like: grep "drivers/usb" control | awk -F ":| " '{ print "file " $1 " line " $2 }' | xargs -i -n1 echo {} +p > control Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] gcc feature request: Moving blocks into sections
On 08/05/2013 03:40 PM, Marek Polacek wrote: On Mon, Aug 05, 2013 at 11:34:55AM -0700, Linus Torvalds wrote: On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds wrote: Ugh. I can see the attraction of your section thing for that case, I just get the feeling that we should be able to do better somehow. Hmm.. Quite frankly, Steven, for your use case I think you actually want the C goto *labels* associated with a section. Which sounds like it might be a cleaner syntax than making it about the basic block anyway. FWIW, we also support hot/cold attributes for labels, thus e.g. if (bar ()) goto A; /* ... */ A: __attribute__((cold)) /* ... */ I don't know whether that might be useful for what you want or not though... Marek It certainly would be. That was how I wanted to the 'static_key' stuff to work, but unfortunately the last time I tried it, it didn't move the text out-of-line any further than it was already doing. Would that be expected? The change for us, if it worked would be quite simple. Something like: --- a/arch/x86/include/asm/jump_label.h +++ b/arch/x86/include/asm/jump_label.h @@ -21,7 +21,7 @@ static __always_inline bool arch_static_branch(struct static_key *key) ".popsection \n\t" : : "i" (key) : : l_yes); return false; -l_yes: +l_yes: __attribute__((cold)) return true; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] gcc feature request: Moving blocks into sections
On 08/05/2013 02:39 PM, Steven Rostedt wrote: On Mon, 2013-08-05 at 11:20 -0700, Linus Torvalds wrote: Of course, it would be good to optimize static_key_false() itself - right now those static key jumps are always five bytes, and while they get nopped out, it would still be nice if there was some way to have just a two-byte nop (turning into a short branch) *if* we can reach another jump that way..For small functions that would be lovely. Oh well. I had patches that did exactly this: https://lkml.org/lkml/2012/3/8/461 But it got dropped for some reason. I don't remember why. Maybe because of the complexity? -- Steve Hi Steve, I recall testing your patches and the text size increased unexpectedly. I believe I correctly accounted for changes to the text size *outside* of branch points. If you do re-visit the series that is one thing I'd like to double check/understand. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] gcc feature request: Moving blocks into sections
On 08/05/2013 04:35 PM, Richard Henderson wrote: On 08/05/2013 09:57 AM, Jason Baron wrote: On 08/05/2013 03:40 PM, Marek Polacek wrote: On Mon, Aug 05, 2013 at 11:34:55AM -0700, Linus Torvalds wrote: On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds wrote: Ugh. I can see the attraction of your section thing for that case, I just get the feeling that we should be able to do better somehow. Hmm.. Quite frankly, Steven, for your use case I think you actually want the C goto *labels* associated with a section. Which sounds like it might be a cleaner syntax than making it about the basic block anyway. FWIW, we also support hot/cold attributes for labels, thus e.g. if (bar ()) goto A; /* ... */ A: __attribute__((cold)) /* ... */ I don't know whether that might be useful for what you want or not though... Marek It certainly would be. That was how I wanted to the 'static_key' stuff to work, but unfortunately the last time I tried it, it didn't move the text out-of-line any further than it was already doing. Would that be expected? The change for us, if it worked would be quite simple. Something like: It is expected. One must use -freorder-blocks-and-partition, and use real profile feedback to get blocks moved completely out-of-line. Whether that's a sensible default or not is debatable. Hi Steve, I think if the 'cold' attribute on the default disabled static_key branch moved the text completely out-of-line, it would satisfy your requirement here? If you like this approach, perhaps we can make something like this work within gcc. As its already supported, but doesn't quite go far enough for our purposes. Also, if we go down this path, it means the 2-byte jump sequence is probably not going to be too useful. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 3/2] x86/jump labels: Count and display the short jumps used
On 08/07/2013 03:22 PM, Linus Torvalds wrote: > On Wed, Aug 7, 2013 at 10:54 AM, Steven Rostedt wrote: >> On another box, using a distro config, I had even better results: >> >> [2.352448] short jumps: 193 >> [2.355407] long jumps: 219 > .. well, another way of looking at this is to say that all of this > effort saves just 579 bytes. > > Yes, maybe some of those bytes are in really hot paths, but the other > side of *that* coin is that the 2-vs-5 byte jump doesn't much matter > if it's already cached. > > So I'd vote for not doing this. If we had some simple way to do the > short jumps, I think it would be lovely. Or if we had to parse the ELF > files and do instruction rewriting for various other reasons, and the > jump rewriting was just one small detail. > > But using 576 new lines (the diffstat for your patch 1/2 that adds the > infrastructure to do the rewriting) in order to same just about > exactly that many bytes in the binary - the effort just doesn't work > out, imnsho. > > Linus The whole point of the thread started with wanting to move the default 'disabled' branch further out-of-line. We could get there with better compiler support for the 'cold' label attribute. Thus, in theory the whole 2-byte jmp is just an intermediate step. (Yeah, I know that support doesn't seem to be happening anytime soon...) Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 3/2] x86/jump labels: Count and display the short jumps used
On 08/07/2013 04:47 PM, Linus Torvalds wrote: > On Wed, Aug 7, 2013 at 1:19 PM, Jason Baron wrote: >> The whole point of the thread started with wanting to move the default >> 'disabled' branch further out-of-line. > Yeah, but I always disagreed with that. > > Putting the unusual code out-of-line (as in "at the end of the > function") is a good idea, but putting it *far* out of line (as in "in > a different section") likely makes little sense. > > Now, the tracing code is admittedly specialized enough that we could > have some "really cold" attribute and move it to that kind of "even > further away" model, but most of the other uses of the static keys are > not necessarily of the kind where the non-default case is completely > or utterly unlikely - they want to use the static keys not because > some codepath is basically never taken, but because the code-path is > so critical that loading and testing a value from memory is considered > to be excessive for when the feature is turned off (ie scheduler > statistics etc). > > So the code may not even be all that cold - some people may well run > with statistics enabled all the time - it's just that the non-enabled > case really *really* doesn't want to have the overhead of even > bothering to test for this event. > > ok - I can see 2 variants here as you mentioned: 1) 'Unbiased' - we want to treat both branches equally but don't want the load/test/jmp sequence. For things like the scheduler stats. 2) 'Biased' - where the unlikely path is moved completely out-of-line. And we have a strong 'bias' to optimize the default path. If we can get the completely out-of-line thing working, we could make this distinction. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 4/4] make pr_debug() dynamic - update docs
-add documentation about pr_debug kernel-parameters.txt Signed-off-by: Jason Baron <[EMAIL PROTECTED]> --- Documentation/kernel-parameters.txt |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index cf38689..6d524bb 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1494,6 +1494,11 @@ and is between 256 and 4096 characters. It is defined in the file autoconfiguration. Ranges are in pairs (memory base and size). + pr_debug + Enables pr_debug() calls if the immediate + infrastructure has been enabled. These can also be + switched on/off via /proc/sys/debug/pr_debug. + print-fatal-signals= [KNL] debug: print fatal signals print-fatal-signals=1: print segfault info to -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 3/4] make pr_debug() dynamic - sysctl support
-add /proc/sys/debug/pr_debug, to toggle pr_debug() on/off Signed-off-by: Jason Baron <[EMAIL PROTECTED]> --- kernel/sysctl.c | 41 + 1 files changed, 41 insertions(+), 0 deletions(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 7cb1ac3..73508d7 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -155,6 +155,11 @@ static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp static int proc_dointvec_taint(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); #endif +#ifdef CONFIG_PR_DEBUG_DYNAMIC +static int proc_pr_debug_handler(struct ctl_table *table, int write, +struct file *filp, void __user *buffer, +size_t *lenp, loff_t *ppos); +#endif static struct ctl_table root_table[]; static struct ctl_table_root sysctl_table_root; @@ -1312,6 +1317,16 @@ static struct ctl_table debug_table[] = { .proc_handler = proc_dointvec }, #endif +#ifdef CONFIG_PR_DEBUG_DYNAMIC + { + .ctl_name = CTL_UNNUMBERED, + .procname = "pr_debug", + .data = NULL, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_pr_debug_handler + }, +#endif { .ctl_name = 0 } }; @@ -2499,6 +2514,32 @@ static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp return 0; } +#ifdef CONFIG_PR_DEBUG_DYNAMIC + +static int proc_pr_debug_handler(struct ctl_table *table, int write, +struct file *filp, void __user *buffer, +size_t *lenp, loff_t *ppos) +{ + int tmp, r; + + if (!write) + tmp = imv_read(pr_debug_on); + r = __do_proc_dointvec(&tmp, table, write, filp, buffer, + lenp, ppos, NULL, NULL); + if (r) + return r; + + if (write) { + if (tmp) + imv_set(pr_debug_on, 1); + else + imv_set(pr_debug_on, 0); + } + return 0; +} + +#endif + #else /* CONFIG_PROC_FS */ int proc_dostring(struct ctl_table *table, int write, struct file *filp, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 0/4] make pr_debug() dynamic
hi, make the pr_debug() function dependent upon the new immediate infrastruture. Thus, b/c of the low runtime impact, we can dynamically enable/disable pr_debug withoug recompiling. Patch allows 'pr_debug=0/1' on the command line or via /proc/sys/debug/pr_debug. thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/4] make pr_debug() dynamic
-make pr_debug() dynamic so it can be switched on/off. The off state is implemented on top of the immediate infrastructure, so as to promote more dynamic printing and debugging. depends on CONFIG_HAVE_IMMEDIATE and CONFIG_PRINTK Signed-off-by: Jason Baron <[EMAIL PROTECTED]> --- include/linux/kernel.h | 10 ++ init/Kconfig |5 + kernel/printk.c| 17 + 3 files changed, 32 insertions(+), 0 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index ff356b2..d69430a 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -276,6 +277,14 @@ extern void print_hex_dump_bytes(const char *prefix_str, int prefix_type, #define pr_info(fmt, arg...) \ printk(KERN_INFO fmt, ##arg) +#ifdef CONFIG_PR_DEBUG_DYNAMIC +DECLARE_IMV(char, pr_debug_on); +#define pr_debug(fmt, ...) \ + do { \ + if (unlikely(imv_read(pr_debug_on)))\ + printk(KERN_DEBUG fmt, ##__VA_ARGS__); \ + } while (0) +#else #ifdef DEBUG /* If you are writing a driver, please use dev_dbg instead */ #define pr_debug(fmt, arg...) \ @@ -286,6 +295,7 @@ static inline int __attribute__ ((format (printf, 1, 2))) pr_debug(const char * return 0; } #endif +#endif /* * Display an IP address in readable format. diff --git a/init/Kconfig b/init/Kconfig index 3918c2d..0bed605 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -539,6 +539,11 @@ config PRINTK very difficult to diagnose system problems, saying N here is strongly discouraged. +config PR_DEBUG_DYNAMIC + bool + depends on PRINTK && HAVE_IMMEDIATE + default y + config BUG bool "BUG() support" if EMBEDDED default y diff --git a/kernel/printk.c b/kernel/printk.c index 29ae1e9..1bc4fdb 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -33,6 +33,7 @@ #include #include #include +#include #include @@ -1334,3 +1335,19 @@ bool printk_timed_ratelimit(unsigned long *caller_jiffies, return false; } EXPORT_SYMBOL(printk_timed_ratelimit); + +#ifdef CONFIG_PR_DEBUG_DYNAMIC + +DEFINE_IMV(char, pr_debug_on) = 0; +EXPORT_IMV_SYMBOL_GPL(pr_debug_on); + +static int __init pr_debug_setup(char *str) +{ + if (str) + return -ENOENT; + imv_set(pr_debug_on, 1); + return 0; +} +early_param("pr_debug", pr_debug_setup); + +#endif -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/4] make pr_debug() dynamic
-make pptp_msg_name dependent on CONFIG_PR_DEBUG_DYNAMIC Signed-off-by: Jason Baron <[EMAIL PROTECTED]> --- net/netfilter/nf_conntrack_pptp.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/netfilter/nf_conntrack_pptp.c b/net/netfilter/nf_conntrack_pptp.c index b5cb8e8..2be92bd 100644 --- a/net/netfilter/nf_conntrack_pptp.c +++ b/net/netfilter/nf_conntrack_pptp.c @@ -65,7 +65,7 @@ void struct nf_conntrack_expect *exp) __read_mostly; EXPORT_SYMBOL_GPL(nf_nat_pptp_hook_expectfn); -#ifdef DEBUG +#if defined(DEBUG) || defined(CONFIG_PR_DEBUG_DYNAMIC) /* PptpControlMessageType names */ const char *const pptp_msg_name[] = { "UNKNOWN_MESSAGE", -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/4] make pr_debug() dynamic
On Thu, Feb 07, 2008 at 02:42:14PM -0800, Joe Perches wrote: > On Thu, 2008-02-07 at 16:03 -0500, Jason Baron wrote: > > make the pr_debug() function dependent upon the new immediate infrastruture. > > What's wrong with klogd -c 8 or equivalent? > > Setting the loglevel higher, will not make pr_debug() calls visible. The only way to make them visible right now, is by re-compiling the kernel. -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 4/4] make pr_debug() dynamic - update docs
On Thu, Feb 07, 2008 at 01:26:04PM -0800, Randy Dunlap wrote: > > --- a/Documentation/kernel-parameters.txt > > +++ b/Documentation/kernel-parameters.txt > > @@ -1494,6 +1494,11 @@ and is between 256 and 4096 characters. It is > > defined in the file > > autoconfiguration. > > Ranges are in pairs (memory base and size). > > > > + pr_debug > > + Enables pr_debug() calls if the immediate > > + infrastructure has been enabled. These can also be > > + switched on/off via /proc/sys/debug/pr_debug. > > Is this actually Enables/disables? > I.e., is this actually pr_debug= ? > If so, please document that also. If not, why not? > The intention is that pr_debug is off by default. If you pass 'pr_debug' at the command line, it is turned on early in the boot process. I orginally implemented it as pr_debug=, but I believe just the string is simpler. thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 3/4] make pr_debug() dynamic - sysctl support
On Thu, Feb 07, 2008 at 04:10:52PM -0800, Stephen Hemminger wrote: > Jason Baron <[EMAIL PROTECTED]> wrote: > > > > > -add /proc/sys/debug/pr_debug, to toggle pr_debug() on/off > > > > Signed-off-by: Jason Baron <[EMAIL PROTECTED]> > > --- > > > > kernel/sysctl.c | 41 + > > 1 files changed, 41 insertions(+), 0 deletions(-) > > > > > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > > index 7cb1ac3..73508d7 100644 > > --- a/kernel/sysctl.c > > +++ b/kernel/sysctl.c > > @@ -155,6 +155,11 @@ static int proc_do_cad_pid(struct ctl_table *table, > > int write, struct file *filp > > static int proc_dointvec_taint(struct ctl_table *table, int write, struct > > file *filp, > >void __user *buffer, size_t *lenp, loff_t *ppos); > > #endif > > +#ifdef CONFIG_PR_DEBUG_DYNAMIC > > +static int proc_pr_debug_handler(struct ctl_table *table, int write, > > +struct file *filp, void __user *buffer, > > +size_t *lenp, loff_t *ppos); > > +#endif > > > > static struct ctl_table root_table[]; > > static struct ctl_table_root sysctl_table_root; > > @@ -1312,6 +1317,16 @@ static struct ctl_table debug_table[] = { > > .proc_handler = proc_dointvec > > }, > > #endif > > +#ifdef CONFIG_PR_DEBUG_DYNAMIC > > + { > > + .ctl_name = CTL_UNNUMBERED, > > + .procname = "pr_debug", > > + .data = NULL, > > + .maxlen = sizeof(int), > > + .mode = 0644, > > + .proc_handler = &proc_pr_debug_handler > > + }, > > +#endif > > { .ctl_name = 0 } > > }; > > > > @@ -2499,6 +2514,32 @@ static int proc_do_cad_pid(struct ctl_table *table, > > int write, struct file *filp > > return 0; > > } > > > > +#ifdef CONFIG_PR_DEBUG_DYNAMIC > > + > > +static int proc_pr_debug_handler(struct ctl_table *table, int write, > > +struct file *filp, void __user *buffer, > > +size_t *lenp, loff_t *ppos) > > +{ > > + int tmp, r; > > + > > + if (!write) > > + tmp = imv_read(pr_debug_on); > > + r = __do_proc_dointvec(&tmp, table, write, filp, buffer, > > + lenp, ppos, NULL, NULL); > > + if (r) > > + return r; > > + > > + if (write) { > > + if (tmp) > > + imv_set(pr_debug_on, 1); > > + else > > + imv_set(pr_debug_on, 0); > > + } > > + return 0; > > +} > > + > > +#endif > > + > > #else /* CONFIG_PROC_FS */ > > > > int proc_dostring(struct ctl_table *table, int write, struct file *filp, > > Please no, there are many pr_debug calls that are only intended for debugging > that > particular subsystem, not a global enable/disable. It also adds to kernel > bloat. > Yes, but you are going to have to re-compile your kernel to get at them...the ability to turn on/off pr_debug() calls, at runtime, seems like a powerful runtime diagnostic tool...the CONFIG_PR_DEBUG_DYNAMIC can default to 'n', instead of being automactically set as I've proposed. In terms of bloat, the vmlinux file goes from 53649330 to 53985478, < %1 increase. thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] dyndbg: dev_dbg bugfix + 2 trivials
On Thu, Jul 19, 2012 at 01:46:19PM -0600, Jim Cromie wrote: > 3 patches here, 1st is bugfix, others are trivial. > > 1- fix __dev_printk, which broke dev_dbg() prefix under CONFIG_DYNAMIC_DEBUG. > Patch looks good, and would be really nice to get into 3.5. Kay, are you ok with this patch? > 2- change dyndbg prefix interfield separator from ':' to '.' > > for example (output from test-code, not submitted): > r8169 :02:00.0: r8169.rtl_init_one: set-drvdata pdev:880223041000 > dev:880220d6a000 > hwmon hwmon1: k10temp.k10temp_probe.180: set-drvdata pdev:88022303d000 > dev:8801dfd2a000 > > This improves usability of cut -d: for pr_debug() messages, > as field position is less volatile with various uses of dyndbg flags. > Its not perfect: > - dev_dbg on net-devices adds several more colons, > but this doesnt vary with dyndbg flags. > - dyndbg=+pfmlt still adds a field vs dyndbg==p (ie no prefix) > - pr_fmt() commonly adds another colon (unchanged with this patch) As you suggest in the patch, changing the delimiter to a non-colon character such as ',' would resolve these cases? > > 3- trivial var name change in lib/dynamic_debug.c > > > Please drop or apply 2,3 as you prefer. 2,3 are nice, but as you suggest I think we want to separate them from patch 1, which is a bugfix for 3.5. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] drivers-core: make structured logging play nice with dynamic-debug
On Thu, Jul 19, 2012 at 01:46:20PM -0600, Jim Cromie wrote: > commit c4e00daaa96d3a0786f1f4fe6456281c60ef9a16 changed __dev_printk > in a way that broke dynamic-debug's ability to control the dynamic > prefix of dev_dbg(dev,..), but not dev_dbg(NULL,..) or pr_debug(..), > which is why it wasnt noticed sooner. > > When dev==NULL, __dev_printk() just calls printk(), which just works. > But otherwise, it assumed that level was always a string like "" > and just plucked out the 'L', ignoring the rest. However, > dynamic_emit_prefix() adds "[tid] module:func:line:" to the string, > those additions all got lost. > > Signed-off-by: Jim Cromie > --- > drivers/base/core.c |9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > Acked-by: Jason Baron > diff --git a/drivers/base/core.c b/drivers/base/core.c > index 346be8b..ebdb7c5 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -1848,6 +1848,7 @@ int __dev_printk(const char *level, const struct device > *dev, >struct va_format *vaf) > { > char dict[128]; > + const char *level_extra = ""; > size_t dictlen = 0; > const char *subsys; > > @@ -1894,10 +1895,14 @@ int __dev_printk(const char *level, const struct > device *dev, > "DEVICE=+%s:%s", subsys, dev_name(dev)); > } > skip: > + if (level[3]) > + level_extra = &level[3]; /* skip past "" */ > + > return printk_emit(0, level[1] - '0', > dictlen ? dict : NULL, dictlen, > -"%s %s: %pV", > -dev_driver_string(dev), dev_name(dev), vaf); > +"%s %s: %s%pV", > +dev_driver_string(dev), dev_name(dev), > +level_extra, vaf); > } > EXPORT_SYMBOL(__dev_printk); > > -- > 1.7.10.4 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 1/1] trace: Move trace event enable from fs_initcall to early_initcall
On Fri, Aug 17, 2012 at 10:39:39AM -0400, Steven Rostedt wrote: > On Fri, 2012-08-17 at 11:04 -0300, Ezequiel Garcia wrote: > > On Fri, Aug 17, 2012 at 10:55 AM, Steven Rostedt > > wrote: > > > On Fri, 2012-08-17 at 08:01 -0300, Ezequiel Garcia wrote: > > > > > >> Regarding the 'complete solution': to be able to capture events from > > >> the very beggining... > > >> Have you thought about this? > > >> Could you give me a hint on how could I implement it? > > > > > > How far in the "beginning"? Before memory is set up? > > > > Yes. > > > > > I wouldn't do that. > > > > Yeah, perhaps it sounds crazy. It makes some sense for kmem events, though. > > It doesn't sound crazy, because I've done it before. There may be ways > to do it. > > > > > > I have in the past (set up before memory was finished being > > > initialized), but things have changed since then. > > > > > > One thing that we could do for those that want really early tracing, is > > > to add a config option to add a static temporary ring buffer, that gets > > > > Yes, something like this would be ideal. How would this ring buffer be > > allocated? > > Perhaps as static and __initdata? > > Yes. > > > This way it would be released afterwards, right? > > Correct. > > > > > > copied into the default ring buffer after memory is set up. That may be > > > the easiest way. > > > > > > Once memory is set up, the ring buffer can be allocated and events can > > > be traced, but the ring buffer needs to be set up first. All it would > > > take is some calls in init/main.c start_kernel() to the initialization. > > > > > > > Note that my main concern is on trace_events (kmem events to be precise). > > However this are registered through tracepoints and in turn this tracepoints > > depend on kmalloc and friends. So, right now is a chicken-egg problem. > > I don't think kmalloc is the issue. The big problem in front of you is > jump labels. That's what enables and disables trace points, and it gets > initialized just after memory is set up. You may have to force jump > labels off when doing early tracing :-/ Although I'm not sure it > requires allocations. > The only allocation jump labels should do is when modules are loaded. So I think you should be able to move them earlier, if need be. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] jump_label: constify jump_label_text_reserved and friends
On Mon, Jan 07, 2013 at 08:58:09AM -0500, Sasha Levin wrote: > jump_label_text_reserved() doesn't modify the memory it works on, it just > checks whether there are any jump labels there. > > Constify the parameters of it to prevent warnings when working with it. > I don't see any 'const' data passed to jump_label_text_reserved()? Where is the warning coming from? New users? Also, does this mean 'ftrace_text_reserved()' and 'alternatives_text_reserved()' need to be updated as well? Thanks, -Jason > Reviewed-by: Jamie Iles > Signed-off-by: Sasha Levin > --- > include/linux/jump_label.h | 4 ++-- > kernel/jump_label.c| 10 ++ > 2 files changed, 8 insertions(+), 6 deletions(-) > > diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h > index 0976fc4..d917a2a 100644 > --- a/include/linux/jump_label.h > +++ b/include/linux/jump_label.h > @@ -116,7 +116,7 @@ extern void arch_jump_label_transform(struct jump_entry > *entry, > enum jump_label_type type); > extern void arch_jump_label_transform_static(struct jump_entry *entry, >enum jump_label_type type); > -extern int jump_label_text_reserved(void *start, void *end); > +extern int jump_label_text_reserved(const void *start, const void *end); > extern void static_key_slow_inc(struct static_key *key); > extern void static_key_slow_dec(struct static_key *key); > extern void static_key_slow_dec_deferred(struct static_key_deferred *key); > @@ -174,7 +174,7 @@ static inline void static_key_slow_dec_deferred(struct > static_key_deferred *key) > static_key_slow_dec(&key->key); > } > > -static inline int jump_label_text_reserved(void *start, void *end) > +static inline int jump_label_text_reserved(const void *start, const void > *end) > { > return 0; > } > diff --git a/kernel/jump_label.c b/kernel/jump_label.c > index 60f48fa..6dc71c6 100644 > --- a/kernel/jump_label.c > +++ b/kernel/jump_label.c > @@ -120,7 +120,8 @@ void jump_label_rate_limit(struct static_key_deferred > *key, > } > EXPORT_SYMBOL_GPL(jump_label_rate_limit); > > -static int addr_conflict(struct jump_entry *entry, void *start, void *end) > +static int addr_conflict(struct jump_entry *entry, > + const void *start, const void *end) > { > if (entry->code <= (unsigned long)end && > entry->code + JUMP_LABEL_NOP_SIZE > (unsigned long)start) > @@ -130,7 +131,8 @@ static int addr_conflict(struct jump_entry *entry, void > *start, void *end) > } > > static int __jump_label_text_reserved(struct jump_entry *iter_start, > - struct jump_entry *iter_stop, void *start, void *end) > + struct jump_entry *iter_stop, > + const void *start, const void *end) > { > struct jump_entry *iter; > > @@ -222,7 +224,7 @@ struct static_key_mod { > struct module *mod; > }; > > -static int __jump_label_mod_text_reserved(void *start, void *end) > +static int __jump_label_mod_text_reserved(const void *start, const void *end) > { > struct module *mod; > > @@ -419,7 +421,7 @@ early_initcall(jump_label_init_module); > * > * returns 1 if there is an overlap, 0 otherwise > */ > -int jump_label_text_reserved(void *start, void *end) > +int jump_label_text_reserved(const void *start, const void *end) > { > int ret = __jump_label_text_reserved(__start___jump_table, > __stop___jump_table, start, end); > -- > 1.8.0.2 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: pty_chars_in_buffer NULL pointer (kernel oops)
On Sun, 27 Feb 2005, Linus Torvalds wrote: > > > On Sun, 27 Feb 2005, Marcelo Tosatti wrote: > > > > Alan, Linus, what correction to the which the above thread discusses has > > been deployed? > > This is the hacky "hide the problem" patch that is in my current tree (and > was discussed in the original thread some time ago). > > It's in no way "correct", in that the race hasn't actually gone away by > this patch, but the patch makes it unimportant. We may end up calling a > stale line discipline, which is still very wrong, but it so happens that > we don't much care in practice. > > I think that in a 2.4.x tree there are some theoretical SMP races with > module unloading etc (which the 2.6.x code doesn't have because module > unload stops the other CPU's - maybe that part got backported to 2.4.x?), > but quite frankly, I suspect that even in 2.4.x they are entirely > theoretical and impossible to actually hit. > > And again, in theory some line discipline might do something strange in > it's "chars_in_buffer" routine that would be problematic. In practice > that's just not the case: the "chars_in_buffer()" routine might return a > bogus _value_ for a stale line discipline thing, but none of them seem to > follow any pointers that might have become invalid (and in fact, most > ldiscs don't even have that function). > > So while this patch is wrogn in theory, it does have the advantage of > being (a) very safe minimal patch and (b) fixing the problem in practice > with no performance downside. > > I still feel a bit guilty about it, though. > > Linus > > --- > # This is a BitKeeper generated diff -Nru style patch. > # > # ChangeSet > # 2005/02/25 19:39:39-08:00 [EMAIL PROTECTED] > # Fix possible pty line discipline race. > # > # This ain't pretty. Real fix under discussion. > # > # drivers/char/pty.c > # 2005/02/25 19:39:32-08:00 [EMAIL PROTECTED] +4 -2 > # Fix possible pty line discipline race. > # > # This ain't pretty. Real fix under discussion. > # > diff -Nru a/drivers/char/pty.c b/drivers/char/pty.c > --- a/drivers/char/pty.c 2005-02-27 11:31:57 -08:00 > +++ b/drivers/char/pty.c 2005-02-27 11:31:57 -08:00 > @@ -149,13 +149,15 @@ > static int pty_chars_in_buffer(struct tty_struct *tty) > { > struct tty_struct *to = tty->link; > + ssize_t (*chars_in_buffer)(struct tty_struct *); > int count; > > - if (!to || !to->ldisc.chars_in_buffer) > + /* We should get the line discipline lock for "tty->link" */ > + if (!to || !(chars_in_buffer = to->ldisc.chars_in_buffer)) > return 0; > > /* The ldisc must report 0 if no characters available to be read */ > - count = to->ldisc.chars_in_buffer(to); > + count = chars_in_buffer(to); > > if (tty->driver->subtype == PTY_TYPE_SLAVE) return count; > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > hi, I've implented another approach to this issue, based on some previous suggestions. The idea is to lock both sides of a ptmx/pty pair during line discipline changing. As previously discussed this has the advantage of being on a less often used path, as opposed to locking the ptmx/pty pair on read/write/poll paths. The patch below, takes both ldisc locks in either order b/c the locks are both taken under the same spinlock(). I thought about locking the ptmx/pty separately, such as master always first but that introduces a 3 way deadlock. For example, process 1 does a blocking read on the slave side. Then, process 2 does an ldisc change on the slave side, which acquires the master ldisc lock but not the slave's. Finally, process 3 does a write which blocks on the process 2's ldisc reference. This patch does introduce some changes in semantics. For example, a line discipline change on side 'a' of a ptmx/pty pair, will now wait for a read/write to complete on the other side, or side 'b'. The current behavior is to simply wait for any read/writes on only side 'a', not both sides 'a' and 'b'. I think this behavior makes sense, but i wanted to point it out. I've tested the patch with a bunch of read/write/poll while changing the line discipline out from underneath. This patch obviates the need for the above "hide the problem" patch. thanks, -Jason --- linux/drivers/char/tty_io.c.bak +++ linux/drivers/char/tty_io.c @@ -458,21 +458,19 @@ static void tty_ldisc_enable(struct tty_ static int tty_set_ldisc(struct tty_struct *tty, int ldisc) { - int retval = 0; - struct tty_ldisc o_ldisc; + int retval = 0; + struct tty_ldisc o_ldisc; char buf[64]; int work; unsigned long flags; struct tty_ldisc *ld; + struct tty_struct *o_tty; if ((ldisc < N_T
pty_chars_in_buffer oops fix
hi, This is a re-posting of a fix for a race condition between changing the line discipline on a pty and and a poll on the 'other' side. The reference is: http://marc.theaimsgroup.com/?l=linux-kernel&m=111342171410005&w=2. Below is a diff against the current tree. thanks, -Jason --- linux-2.6.12/drivers/char/pty.c 2005-06-17 15:48:29.0 -0400 +++ linux-2.6.12.working/drivers/char/pty.c 2005-07-11 12:21:03.0 -0400 @@ -149,15 +149,14 @@ static int pty_chars_in_buffer(struct tty_struct *tty) { struct tty_struct *to = tty->link; - ssize_t (*chars_in_buffer)(struct tty_struct *); int count; /* We should get the line discipline lock for "tty->link" */ - if (!to || !(chars_in_buffer = to->ldisc.chars_in_buffer)) + if (!to || !to->ldisc.chars_in_buffer) return 0; /* The ldisc must report 0 if no characters available to be read */ - count = chars_in_buffer(to); + count = to->ldisc.chars_in_buffer(to); if (tty->driver->subtype == PTY_TYPE_SLAVE) return count; --- linux-2.6.12/drivers/char/tty_io.c 2005-07-11 12:13:48.0 -0400 +++ linux-2.6.12.working/drivers/char/tty_io.c 2005-07-11 12:19:21.0 -0400 @@ -470,21 +470,19 @@ static int tty_set_ldisc(struct tty_struct *tty, int ldisc) { - int retval = 0; - struct tty_ldisc o_ldisc; + int retval = 0; + struct tty_ldisc o_ldisc; char buf[64]; int work; unsigned long flags; struct tty_ldisc *ld; + struct tty_struct *o_tty; if ((ldisc < N_TTY) || (ldisc >= NR_LDISCS)) return -EINVAL; restart: - if (tty->ldisc.num == ldisc) - return 0; /* We are already in the desired discipline */ - ld = tty_ldisc_get(ldisc); /* Eduardo Blanco <[EMAIL PROTECTED]> */ /* Cyrus Durgin <[EMAIL PROTECTED]> */ @@ -495,45 +493,74 @@ if (ld == NULL) return -EINVAL; - o_ldisc = tty->ldisc; - tty_wait_until_sent(tty, 0); + if (tty->ldisc.num == ldisc) { + tty_ldisc_put(ldisc); + return 0; + } + + o_ldisc = tty->ldisc; + o_tty = tty->link; + /* * Make sure we don't change while someone holds a * reference to the line discipline. The TTY_LDISC bit * prevents anyone taking a reference once it is clear. * We need the lock to avoid racing reference takers. */ - + spin_lock_irqsave(&tty_ldisc_lock, flags); - if(tty->ldisc.refcount) - { - /* Free the new ldisc we grabbed. Must drop the lock - first. */ + if (tty->ldisc.refcount || (o_tty && o_tty->ldisc.refcount)) { + if(tty->ldisc.refcount) { + /* Free the new ldisc we grabbed. Must drop the lock + first. */ + spin_unlock_irqrestore(&tty_ldisc_lock, flags); + tty_ldisc_put(ldisc); + /* +* There are several reasons we may be busy, including +* random momentary I/O traffic. We must therefore +* retry. We could distinguish between blocking ops +* and retries if we made tty_ldisc_wait() smarter. That +* is up for discussion. +*/ + if (wait_event_interruptible(tty_ldisc_wait, tty->ldisc.refcount == 0) < 0) + return -ERESTARTSYS; + goto restart; + } + if(o_tty && o_tty->ldisc.refcount) { + spin_unlock_irqrestore(&tty_ldisc_lock, flags); + tty_ldisc_put(ldisc); + if (wait_event_interruptible(tty_ldisc_wait, o_tty->ldisc.refcount == 0) < 0) + return -ERESTARTSYS; + goto restart; + } + } + + /* if the TTY_LDISC bit is set, then we are racing against another ldisc change */ + + if (!test_bit(TTY_LDISC, &tty->flags)) { spin_unlock_irqrestore(&tty_ldisc_lock, flags); tty_ldisc_put(ldisc); - /* -* There are several reasons we may be busy, including -* random momentary I/O traffic. We must therefore -* retry. We could distinguish between blocking ops -* and retries if we made tty_ldisc_wait() smarter. That -* is up for discussion. -*/ - if(wait_event_interruptible(tty_ldisc_wait, tty->ldisc.refcount == 0) < 0) - return -ERESTARTSYS; + ld = tty_ldisc_ref_wait(tty); + tty_ldisc_deref(ld);
Re: Merging relayfs?
On Mon, 11 Jul 2005, Tom Zanussi wrote: > > Hi Andrew, can you please merge relayfs? It provides a low-overhead > logging and buffering capability, which does not currently exist in > the kernel. > One concern I had regarding relayfs, which was raised previously, was regarding its use of vmap, http://marc.theaimsgroup.com/?l=linux-kernel&m=110755199913216&w=2 On x86, the vmap space is at a premium, and this space is reserved over the entire lifetime of a 'channel'. Is the use of vmap really critical for performance? thanks, -Jason - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] tty races
There are a couple of tty race conditions, which lead to inconsistent tty reference counting and tty layer oopses. The first is a tty_open vs. tty_close race in drivers/char/tty.io.c. Basically, from the time that the tty->count is deemed to be 1 and that we are going to free it to the time that TTY_CLOSING bit is set, needs to be atomic with respect to the manipulation of tty->count in init_dev(). This atomicity was previously guarded by the BKL. However, this is no longer true with the addition of a down() call in the middle of the release_dev()'s atomic path. So either the down() needs to be moved outside the atomic patch or dropped. I would vote for simply dropping it as i don't see why it is necessary. The second race is tty_open vs. tty_open. This race I've seen when the virtual console is the tty driver. In con_open(), vc_allocate() is called if the tty->count is 1. However, this check of the tty->count is not guarded by the 'tty_sem'. Thus, it is possible for con_open(), to never see the tty->count as 1, and thus never call vc_allocate(). This leads to a NULL filp->private_data, and an oops. The test case below reproduces these problems, and the patch fixes it. The test case uses /dev/tty9, which is generally restricted to root for open(). It may be able to exploit these races using pseudo terminals, although i wasn't able to. A previous report of this issue, with an oops trace was: http://www.ussg.iu.edu/hypermail/linux/kernel/0503.2/0017.html thanks, -Jason --- linux/drivers/char/tty_io.c.bak +++ linux/drivers/char/tty_io.c @@ -1596,14 +1596,9 @@ static void release_dev(struct file * fi * each iteration we avoid any problems. */ while (1) { - /* Guard against races with tty->count changes elsewhere and - opens on /dev/tty */ - - down(&tty_sem); tty_closing = tty->count <= 1; o_tty_closing = o_tty && (o_tty->count <= (pty_master ? 1 : 0)); - up(&tty_sem); do_sleep = 0; if (tty_closing) { @@ -1640,7 +1635,6 @@ static void release_dev(struct file * fi * block, so it's safe to proceed with closing. */ - down(&tty_sem); if (pty_master) { if (--o_tty->count < 0) { printk(KERN_WARNING "release_dev: bad pty slave count " @@ -1654,7 +1648,6 @@ static void release_dev(struct file * fi tty->count, tty_name(tty, buf)); tty->count = 0; } - up(&tty_sem); /* * We've decremented tty->count, so we need to remove this file @@ -1844,9 +1837,10 @@ retry_open: } got_driver: retval = init_dev(driver, index, &tty); - up(&tty_sem); - if (retval) + if (retval) { + up(&tty_sem); return retval; + } filp->private_data = tty; file_move(filp, &tty->tty_files); @@ -1863,6 +1857,7 @@ got_driver: else retval = -ENODEV; } + up(&tty_sem); filp->f_flags = saved_flags; if (!retval && test_bit(TTY_EXCLUSIVE, &tty->flags) && !capable(CAP_SYS_ADMIN)) #include #include #include #include #include #include #include #include #include #include #include #define NTHREADS 300 void *thread_function(); int open_fail_num; int open_success; int main(int argc, char *argv[]) { int i, j; pthread_t thread_id[NTHREADS]; for(;;) { for(i=0; i < NTHREADS; i++) { pthread_create(&thread_id[i], NULL, &thread_function, NULL); } for(j=0; j < NTHREADS; j++) { pthread_join(thread_id[j], NULL); } printf("open failures: %i\n", open_fail_num); printf("open success: %i\n", open_success); } } void *thread_function() { int fd; time_t t; int val; int ret; fd = open("/dev/tty9", O_RDWR); val = 0; //call an ioctl ret = ioctl(fd, KDGETMODE, &val); if (ret != 0) { perror("ioctl error\n"); } if (fd < 0) { open_fail_num++; } else { open_success++; } /* just waste some random time */ t = (time((time_t *)0) &31L) << 6; while (t-- > 0) (void)time((time_t *)0); close(fd); } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
PATCH: fix disassociate_ctty vs. fork race
hi, Race is as follows. Process A forks process B, both being part of the same session. Then, A calls disassociate_ctty while B forks C: A B fork() copy_signal() dissasociate_ctty() attach_pid(p, PIDTYPE_SID, p->signal->session); Now, C can have current->signal->tty pointing to a freed tty structure, as it hasn't yet been added to the session group (to have its controlling tty cleared on the diassociate_ctty() call). This has shown up as an oops but could be even more serious. I haven't tried to create a test case, but a customer has verified that the patch below resolves the issue, which was occuring quite frequently. I'll try and post the test case if i can. The patch simply checks for a NULL tty *after* it has been attached to the proper session group and clears it as necessary. Alternatively, we could simply do the tty assignment after the the process is added to the proper session group. thanks, -Jason Signed-off-by: Jason Baron <[EMAIL PROTECTED]> --- linux-2.6.git/kernel/fork.c.bak 2005-09-06 13:42:14.0 -0400 +++ linux-2.6.git/kernel/fork.c 2005-09-06 13:43:47.0 -0400 @@ -1115,6 +1115,9 @@ static task_t *copy_process(unsigned lon __get_cpu_var(process_counts)++; } + if (!current->signal->tty && p->signal->tty) + p->signal->tty = NULL; + nr_threads++; total_forks++; write_unlock_irq(&tasklist_lock); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Weird code in scripts/kconfig/Makefile
On Tue, 12 Dec 2006, Pete Zaitcev wrote: > Hi, Roman & All: > > In 2.6.19 (and Linus' curent tree), I found the following: > > libpath=$$dir/lib; lib=qt; osdir=""; \ > $(HOSTCXX) -print-multi-os-directory > /dev/null 2>&1 && \ > osdir=x$$($(HOSTCXX) -print-multi-os-directory); \ > test -d $$libpath/$$osdir && libpath=$$libpath/$$osdir; \ > > What does the little 'x' do in front of $$(foo)? It looks suspiciously > like a typo to me. > > I think Jason caught it, but I didn't see a correction sent out. > yes. looks like an error to me too...i left it out of an internal patch to this code. Since we are testing for the existence of the directory, it has probably gone unnoticed. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] madvise_need_mmap_write() usage
On Sat, 16 Jun 2007, Christoph Hellwig wrote: > On Fri, Jun 15, 2007 at 11:20:31AM -0400, Jason Baron wrote: > > hi, > > > > i was just looking at the new madvise_need_mmap_write() call...can we > > avoid an extra case statement and function call as follows? > > Sounds like a good idea, but please move the assignment out of the > conditional. > ok, here's an updated version: Signed-off-by: Jason Baron <[EMAIL PROTECTED]> diff --git a/mm/madvise.c b/mm/madvise.c --- a/mm/madvise.c +++ b/mm/madvise.c @@ -287,9 +287,11 @@ asmlinkage long sys_madvise(unsigned long start, size_t len_in, int behavior) struct vm_area_struct * vma, *prev; int unmapped_error = 0; int error = -EINVAL; + int write; size_t len; - if (madvise_need_mmap_write(behavior)) + write = madvise_need_mmap_write(behavior); + if (write) down_write(¤t->mm->mmap_sem); else down_read(¤t->mm->mmap_sem); @@ -354,7 +356,7 @@ asmlinkage long sys_madvise(unsigned long start, size_t len_in, int behavior) vma = find_vma(current->mm, start); } out: - if (madvise_need_mmap_write(behavior)) + if (write) up_write(¤t->mm->mmap_sem); else up_read(¤t->mm->mmap_sem); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] madvise_need_mmap_write() usage
hi, i was just looking at the new madvise_need_mmap_write() call...can we avoid an extra case statement and function call as follows? thanks, -Jason Signed-off-by: Jason Baron <[EMAIL PROTECTED]> diff --git a/mm/madvise.c b/mm/madvise.c --- a/mm/madvise.c +++ b/mm/madvise.c @@ -287,9 +287,10 @@ asmlinkage long sys_madvise(unsigned long start, size_t len_in, int behavior) struct vm_area_struct * vma, *prev; int unmapped_error = 0; int error = -EINVAL; + int write; size_t len; - if (madvise_need_mmap_write(behavior)) + if (write = madvise_need_mmap_write(behavior)) down_write(¤t->mm->mmap_sem); else down_read(¤t->mm->mmap_sem); @@ -354,7 +355,7 @@ asmlinkage long sys_madvise(unsigned long start, size_t len_in, int behavior) vma = find_vma(current->mm, start); } out: - if (madvise_need_mmap_write(behavior)) + if (write) up_write(¤t->mm->mmap_sem); else up_read(¤t->mm->mmap_sem); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] lockdep: lock contention tracking
On Sun, 20 May 2007, Peter Zijlstra wrote: > > Add lock contention tracking to lockdep > looks really nice to me. To me, in addition to the number of times the lock is contended we also need the total number of times the lock was acquired, to make the statistics significant. I've included one such patch below, which also introduces a 'event', thiking of things like lock timings that were mentioned, but this is not necessary. thanks, -Jason Signed-off-by: Jason Baron <[EMAIL PROTECTED]> --- linux-2.6.22-rc2-lockdep/include/linux/lockdep.h.bak2007-05-21 15:55:30.0 -0400 +++ linux-2.6.22-rc2-lockdep/include/linux/lockdep.h2007-05-21 16:30:13.0 -0400 @@ -36,6 +36,17 @@ }; /* + * lock contention types + */ + +enum lock_contention_event +{ + LOCK_CONTENTION_WRITE = 0, + LOCK_CONTENTION_READ, + LOCK_CONTENTION_TOTAL +}; + +/* * Usage-state bitmasks: */ #define LOCKF_USED (1 << LOCK_USED) @@ -121,6 +132,7 @@ const char *name; int name_version; + atomic_ttotal; atomic_tread_contentions; atomic_twrite_contentions; struct lockdep_contention_point contention_point[4]; @@ -253,7 +265,7 @@ extern void lock_release(struct lockdep_map *lock, int nested, unsigned long ip); -extern void lock_contended(struct lockdep_map *lock, int read, +extern void lock_contended(struct lockdep_map *lock, enum lock_contention_event, unsigned long ip); # define INIT_LOCKDEP .lockdep_recursion = 0, --- linux-2.6.22-rc2-lockdep/kernel/lockdep_proc.c.bak 2007-05-21 16:10:12.0 -0400 +++ linux-2.6.22-rc2-lockdep/kernel/lockdep_proc.c 2007-05-21 16:11:38.0 -0400 @@ -348,16 +348,17 @@ struct lock_class *class; list_for_each_entry(class, &all_lock_classes, lock_entry) { - int r, w; + int r, w, t; r = atomic_read(&class->read_contentions); w = atomic_read(&class->write_contentions); + t = atomic_read(&class->total); if (r || w) { int i; char sym[KSYM_SYMBOL_LEN]; - seq_printf(m, "%s: %d %d", class->name, w, r); + seq_printf(m, "%s: %d %d %d", class->name, w, r, t); for (i = 0; i < ARRAY_SIZE(class->contention_point); i++) { struct lockdep_contention_point *cp = --- linux-2.6.22-rc2-lockdep/kernel/lockdep.c.bak 2007-05-21 16:02:05.0 -0400 +++ linux-2.6.22-rc2-lockdep/kernel/lockdep.c 2007-05-21 16:29:35.0 -0400 @@ -2429,7 +2429,7 @@ } static void -__lock_contended(struct lockdep_map *lock, int read, unsigned long ip) +__lock_contended(struct lockdep_map *lock, enum lock_contention_event event, unsigned long ip) { struct task_struct *curr = current; struct held_lock *hlock, *prev_hlock; @@ -2463,10 +2463,17 @@ return; found_it: - if (read) + switch (event) { + case LOCK_CONTENTION_READ: atomic_inc(&class->read_contentions); - else + break; + case LOCK_CONTENTION_WRITE: atomic_inc(&class->write_contentions); + break; + case LOCK_CONTENTION_TOTAL: + atomic_inc(&class->total); + return; + } for (i = 0; i < ARRAY_SIZE(class->contention_point); i++) { if (class->contention_point[i].ip == 0) { @@ -2530,6 +2537,7 @@ current->lockdep_recursion = 1; __lock_acquire(lock, subclass, trylock, read, check, irqs_disabled_flags(flags), ip); + __lock_contended(lock, LOCK_CONTENTION_TOTAL, 0); current->lockdep_recursion = 0; raw_local_irq_restore(flags); } @@ -2553,7 +2561,7 @@ EXPORT_SYMBOL_GPL(lock_release); -void lock_contended(struct lockdep_map *lock, int read, unsigned long ip) +void lock_contended(struct lockdep_map *lock, enum lock_contention_event event, unsigned long ip) { unsigned long flags; @@ -2563,7 +2571,7 @@ raw_local_irq_save(flags); check_flags(flags); current->lockdep_recursion = 1; - __lock_contended(lock, read, ip); + __lock_contended(lock, event, ip); current->lockdep_recursion = 0; raw_local_irq_restore(flags); } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/7] lockdep: lock contention tracking
On Wed, 23 May 2007, Peter Zijlstra wrote: > Count lock contention events per lock class. Additionally track the first four > callsites that resulted in the contention. > I think that we need the total number of locking calls, not just the total number of *contended* locking calls, in order to make the data significant. Same for waittime. Yes, this pollutes the fastpath. In the contention case, its one extra addition, and for the waittime, its a call the sched_clock(). I don't think that's too much to pay for much more meaningful data. thanks, -Jason - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/7] lockdep: lock contention tracking
On Wed, 23 May 2007, Peter Zijlstra wrote: > On Wed, 2007-05-23 at 10:40 -0400, Jason Baron wrote: > > On Wed, 23 May 2007, Peter Zijlstra wrote: > > > > > Count lock contention events per lock class. Additionally track the first > > > four > > > callsites that resulted in the contention. > > > > > > > I think that we need the total number of locking calls, not just the total > > number of *contended* locking calls, in order to make the data > > significant. Same for waittime. Yes, this pollutes the fastpath. In the > > contention case, its one extra addition, and for the waittime, its a call > > the sched_clock(). I don't think that's too much to pay for much more > > meaningful data. > > The holdtime statistics add these numbers. > ok, i see what you are saying...however, the 'waittime' statistic as implemented, is only recorded when we don't get the lock immediately. Thus, it's really measuring the waittime when there is lock contention. I understand that in the non-contended case we are only talking a a few cycles, but the fact that the non-contended case is not counted as another waittime of even zero length (so no measurement is required), might skew the stats a bit. For examples, if there was a lock that was almost never contended but one time happened to be contended for a long time, its average wait time would look really high. thanks, -Jason - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/7] lockdep: lock contention tracking
On Wed, 23 May 2007, Peter Zijlstra wrote: > On Wed, 2007-05-23 at 12:11 -0400, Jason Baron wrote: > > On Wed, 23 May 2007, Peter Zijlstra wrote: > > > > > On Wed, 2007-05-23 at 10:40 -0400, Jason Baron wrote: > > > > On Wed, 23 May 2007, Peter Zijlstra wrote: > > > > > > > > > Count lock contention events per lock class. Additionally track the > > > > > first four > > > > > callsites that resulted in the contention. > > > > > > > > > > > > > I think that we need the total number of locking calls, not just the > > > > total > > > > number of *contended* locking calls, in order to make the data > > > > significant. Same for waittime. Yes, this pollutes the fastpath. In the > > > > contention case, its one extra addition, and for the waittime, its a > > > > call > > > > the sched_clock(). I don't think that's too much to pay for much more > > > > meaningful data. > > > > > > The holdtime statistics add these numbers. > > > > > > > ok, i see what you are saying...however, the 'waittime' statistic as > > implemented, is only recorded when we don't get the lock immediately. > > Thus, it's really measuring the waittime when there is lock contention. I > > understand that in the non-contended case we are only talking a a few > > cycles, but the fact that the non-contended case is not counted as another > > waittime of even zero length (so no measurement is required), might skew > > the stats a bit. For examples, if there was a lock that was almost never > > contended but one time happened to be contended for a long time, its > > average wait time would look really high. > > I'm not seeing how or why this is a problem. The number of contentions > is reported on the same line, so it should be obvious. > > Your definition of wait-time is also obtainable from the numbers, one > would just divide waittime-total by acquisitions, instead of > contentions. > agreed, I just want to point out that under my definition of waitime, I would have to make the assumption that the waittime is 0 for a lock that is acquired without fallback to the slowpath. For a spinlock acquisition, this is likely a fair assumption, however the trylock path for semaphores can be longer. Acked-by: Jason Baron <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: strange Mac OSX RST behavior
Hi, After looking at this further we found that there is actually a rate limit on 'rst' packets sent by OSX on a closed socket. Its set to 250 per second and controlled via: net.inet.icmp.icmplim. Increasing that limit resolves the issue, but the default is apparently 250. Thanks, -Jason On 07/01/2016 02:16 PM, One Thousand Gnomes wrote: yes, we do in fact see a POLLRDHUP from the FIN in this case and read of zero, but we still have more data to write to the socket, and b/c the RST is dropped here, the socket stays in TIME_WAIT until things eventually time out... After the FIN when you send/retransmit your next segment do you then get a valid RST back from the Mac end? Alan
Re: [PATCH] fs/select: add vmalloc fallback for select(2)
Hi, On 09/23/2016 03:24 AM, Nicholas Piggin wrote: On Fri, 23 Sep 2016 14:42:53 +0800 "Hillf Danton" wrote: The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows with the number of fds passed. We had a customer report page allocation failures of order-4 for this allocation. This is a costly order, so it might easily fail, as the VM expects such allocation to have a lower-order fallback. Such trivial fallback is vmalloc(), as the memory doesn't have to be physically contiguous. Also the allocation is temporary for the duration of the syscall, so it's unlikely to stress vmalloc too much. Note that the poll(2) syscall seems to use a linked list of order-0 pages, so it doesn't need this kind of fallback. How about something like this? (untested) Eric isn't wrong about vmalloc sucking :) Thanks, Nick --- fs/select.c | 57 +++-- 1 file changed, 43 insertions(+), 14 deletions(-) diff --git a/fs/select.c b/fs/select.c index 8ed9da5..3b4834c 100644 --- a/fs/select.c +++ b/fs/select.c @@ -555,6 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, void *bits; int ret, max_fds; unsigned int size; + size_t nr_bytes; struct fdtable *fdt; /* Allocate small arguments on the stack to save memory and be faster */ long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; @@ -576,21 +577,39 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, * since we used fdset we need to allocate memory in units of * long-words. */ - size = FDS_BYTES(n); + ret = -ENOMEM; bits = stack_fds; - if (size > sizeof(stack_fds) / 6) { - /* Not enough space in on-stack array; must use kmalloc */ + size = FDS_BYTES(n); + nr_bytes = 6 * size; + + if (unlikely(nr_bytes > PAGE_SIZE)) { + /* Avoid multi-page allocation if possible */ ret = -ENOMEM; - bits = kmalloc(6 * size, GFP_KERNEL); - if (!bits) - goto out_nofds; + fds.in = kmalloc(size, GFP_KERNEL); + fds.out = kmalloc(size, GFP_KERNEL); + fds.ex = kmalloc(size, GFP_KERNEL); + fds.res_in = kmalloc(size, GFP_KERNEL); + fds.res_out = kmalloc(size, GFP_KERNEL); + fds.res_ex = kmalloc(size, GFP_KERNEL); + + if (!(fds.in && fds.out && fds.ex && + fds.res_in && fds.res_out && fds.res_ex)) + goto out; + } else { + if (nr_bytes > sizeof(stack_fds)) { + /* Not enough space in on-stack array */ + if (nr_bytes > PAGE_SIZE * 2) The 'if' looks extraneous? Also, I wonder if we can just avoid some allocations altogether by checking by if the user fd_set pointers are NULL? That can avoid failures :) Thanks, -Jason
Re: [PATCH v2 2/2] arm64: Use static keys for CPU features
On 09/05/2016 01:25 PM, Catalin Marinas wrote: This patch adds static keys transparently for all the cpu_hwcaps features by implementing an array of default-false static keys and enabling them when detected. The cpus_have_cap() check uses the static keys if the feature being checked is a constant, otherwise the compiler generates the bitmap test. Because of the early call to static_branch_enable() via check_local_cpu_errata() -> update_cpu_capabilities(), the jump labels are initialised in cpuinfo_store_boot_cpu(). Was there a reason the jump_label_init() couldn't be moved earlier in the common code? Thanks, -Jason Cc: Will Deacon Cc: Suzuki K. Poulose Signed-off-by: Catalin Marinas --- arch/arm64/include/asm/cpufeature.h | 14 +++--- arch/arm64/kernel/cpufeature.c | 3 +++ arch/arm64/kernel/smp.c | 5 + 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index 7099f26e3702..c9dfb1e4c435 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -9,6 +9,8 @@ #ifndef __ASM_CPUFEATURE_H #define __ASM_CPUFEATURE_H +#include + #include #include @@ -109,6 +111,7 @@ struct arm64_cpu_capabilities { }; extern DECLARE_BITMAP(cpu_hwcaps, ARM64_NCAPS); +extern struct static_key_false cpu_hwcap_keys[ARM64_NCAPS]; bool this_cpu_has_cap(unsigned int cap); @@ -121,16 +124,21 @@ static inline bool cpus_have_cap(unsigned int num) { if (num >= ARM64_NCAPS) return false; - return test_bit(num, cpu_hwcaps); + if (__builtin_constant_p(num)) + return static_branch_unlikely(&cpu_hwcap_keys[num]); + else + return test_bit(num, cpu_hwcaps); } static inline void cpus_set_cap(unsigned int num) { - if (num >= ARM64_NCAPS) + if (num >= ARM64_NCAPS) { pr_warn("Attempt to set an illegal CPU capability (%d >= %d)\n", num, ARM64_NCAPS); - else + } else { __set_bit(num, cpu_hwcaps); + static_branch_enable(&cpu_hwcap_keys[num]); + } } static inline int __attribute_const__ diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 62272eac1352..919b2d0d68ae 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -46,6 +46,9 @@ unsigned int compat_elf_hwcap2 __read_mostly; DECLARE_BITMAP(cpu_hwcaps, ARM64_NCAPS); +DEFINE_STATIC_KEY_ARRAY_FALSE(cpu_hwcap_keys, ARM64_NCAPS); +EXPORT_SYMBOL(cpu_hwcap_keys); + #define __ARM64_FTR_BITS(SIGNED, STRICT, TYPE, SHIFT, WIDTH, SAFE_VAL) \ { \ .sign = SIGNED, \ diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index d93d43352504..c3c08368a685 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -437,6 +437,11 @@ void __init smp_cpus_done(unsigned int max_cpus) void __init smp_prepare_boot_cpu(void) { set_my_cpu_offset(per_cpu_offset(smp_processor_id())); + /* +* Initialise the static keys early as they may be enabled by the +* cpufeature code. +*/ + jump_label_init(); cpuinfo_store_boot_cpu(); save_boot_cpu_run_el(); }
Re: [PATCH v4 2/3] livepatch: shuffle core.c function order
On 10/16/2017 10:42 AM, Petr Mladek wrote: > On Thu 2017-10-12 17:12:28, Jason Baron wrote: >> In preparation for __klp_enable_patch() to call a number of 'static' >> functions, in a subsequent patch, move them earlier in core.c. This patch >> should be a nop from a functional pov. >> >> Signed-off-by: Jason Baron >> Cc: Josh Poimboeuf >> Cc: Jessica Yu >> Cc: Jiri Kosina >> Cc: Miroslav Benes >> Cc: Petr Mladek >> --- >> kernel/livepatch/core.c | 349 >> >> 1 file changed, 173 insertions(+), 176 deletions(-) >> >> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c >> index b7f77be..f53eed5 100644 >> --- a/kernel/livepatch/core.c >> +++ b/kernel/livepatch/core.c >> @@ -283,6 +283,179 @@ static int klp_write_object_relocations(struct module >> *pmod, >> +static int klp_init_func(struct klp_object *obj, struct klp_func *func) >> +{ >> +if (!func->old_name || !func->new_func) >> +return -EINVAL; >> + >> +INIT_LIST_HEAD(&func->stack_node); >> +func->patched = false; >> +func->transition = false; > > You lost the change from the 1st patch: > > list_add(&func->func_entry, &obj->func_list); > >> + >> +/* The format for the sysfs directory is where sympos >> + * is the nth occurrence of this symbol in kallsyms for the patched >> + * object. If the user selects 0 for old_sympos, then 1 will be used >> + * since a unique symbol will be the first occurrence. >> + */ >> +return kobject_init_and_add(&func->kobj, &klp_ktype_func, >> +&obj->kobj, "%s,%lu", func->old_name, >> +func->old_sympos ? func->old_sympos : 1); >> +} > > [...] > >> +static int klp_init_object(struct klp_patch *patch, struct klp_object *obj) >> +{ >> +struct klp_func *func; >> +int ret; >> +const char *name; >> + >> +if (!obj->funcs) >> +return -EINVAL; >> + >> +obj->patched = false; >> +obj->mod = NULL; >> + >> +klp_find_object_module(obj); >> + >> +name = klp_is_module(obj) ? obj->name : "vmlinux"; >> +ret = kobject_init_and_add(&obj->kobj, &klp_ktype_object, >> + &patch->kobj, "%s", name); >> +if (ret) >> +return ret; >> + > > Same here: > > list_add(&obj->obj_entry, &patch->obj_list); > INIT_LIST_HEAD(&obj->func_list); > klp_for_each_func_static(obj, func) { Yes, thanks. I noticed that too when insertion didn't match deletion :( The final patch ends up adding it back, instead of this one, so the states of things after patch 3 is ok. Thanks, -Jason
Re: [PATCH v7 7/6] fs/epoll: scale nested callbacks
Hi, I posted a patch to completely remove the lock here from the ep_poll_safewake() patch here: http://lkml.iu.edu/hypermail/linux/kernel/1710.1/05834.html So these are going to conflict. The reason its safe to remove the lock is that there are loop and depth checks now that are performed during EPOLL_CTL_ADD. Specifically, ep_loop_check(). I would prefer to these checks once add add time as opposed to at each wakeup (even if they can be scaled better). I also have a pending patch to do something similar for poll_readywalk_ncalls, but I think that poll_safewake_ncalls is the most egregious one here? Thanks, -Jason On 10/13/2017 11:45 AM, Davidlohr Bueso wrote: > A customer reported massive contention on the ncalls->lock in which > the workload is designed around nested epolls (where the fd is already > an epoll). > > 83.49% [kernel] [k] __pv_queued_spin_lock_slowpath > 2.70% [kernel] [k] ep_call_nested.constprop.13 > 1.94% [kernel] [k] _raw_spin_lock_irqsave > 1.83% [kernel] [k] > __raw_callee_save___pv_queued_spin_unlock > 1.45% [kernel] [k] _raw_spin_unlock_irqrestore > 0.41% [kernel] [k] ep_scan_ready_list.isra.8 > 0.36% [kernel] [k] pvclock_clocksource_read > > The application running on kvm, is using a shared memory IPC communication > with a pipe wakeup mechanism, and a two level dispatcher both built around > 'epoll_wait'. There is one process per physical core and a full mesh of > pipes > between them, so on a 18 core system (the actual case), there are 18*18 > pipes > for the IPCs alone. > > This patch proposes replacing the nested calls global linked list with a > dlock > interface, for which we can benefit from pcpu lists when doing > ep_poll_safewake(), > and hashing for the current task, which provides two benefits: > > 1. Speeds up the process of loop and max-depth checks from O(N) lookups > to O(1) > (albeit possible collisions, which we have to iterate); and, > > 2. Provides smaller lock granularity. > > cpusbeforeafter diff > 114093701344804 -4.58% > 210156901337053 31.63% > 3 7210091273254 76.59% > 4 3809591128931196.33% > 5 2876031028362257.56% > 6 221104 894943304.76% > 7 173592 976395462.46% > 8 145860 922584532.51% > 9 127877 925900624.05% > 10 112603 791456602.87% > 11 97926 724296639.63% > 12 80732 730485804.82% > > With the exception of a single cpu, where the overhead of jhashing > influences), we > get some pretty good raw throughput increase. Similarly, the amount of > time spent > decreases immensely as well. > > Passes ltp related testcases. > > Signed-off-by: Davidlohr Bueso > --- > fs/eventpoll.c | 88 > +++--- > 1 file changed, 53 insertions(+), 35 deletions(-) > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c > index 2fabd19cdeea..675c97fdc5da 100644 > --- a/fs/eventpoll.c > +++ b/fs/eventpoll.c > @@ -22,7 +22,7 @@ > #include > #include > #include > -#include > +#include > #include > #include > #include > @@ -119,7 +119,7 @@ struct epoll_filefd { > * and loop cycles. > */ > struct nested_call_node { > -struct list_head llink; > +struct dlock_list_node llink; > void *cookie; > void *ctx; > }; > @@ -129,8 +129,7 @@ struct nested_call_node { > * maximum recursion dept and loop cycles. > */ > struct nested_calls { > -struct list_head tasks_call_list; > -spinlock_t lock; > +struct dlock_list_heads tasks_call_list; > }; > > /* > @@ -371,10 +370,14 @@ static inline int ep_op_has_event(int op) > } > > /* Initialize the poll safe wake up structure */ > -static void ep_nested_calls_init(struct nested_calls *ncalls) > +static inline int ep_nested_calls_init(struct nested_calls *ncalls) > +{ > +return alloc_dlock_list_heads(&ncalls->tasks_call_list, true); > +} > + > +static inline void ep_nested_calls_free(struct nested_calls *ncalls) > { > -INIT_LIST_HEAD(&ncalls->tasks_call_list); > -spin_lock_init(&ncalls->lock); > +free_dlock_list_heads(&ncalls->tasks_call_list); > } > > /** > @@ -483,45 +486,50 @@ static int ep_call_nested(struct nested_calls > *ncalls, int max_nests, > { > int error, call_nests = 0; > unsigned long flags; > -struct list_head *lsthead = &ncalls->tasks_call_list; > -struct nested_call_node *tncur; > -struct nested_call_node tnode; > +struct dlock_list_head *head; > +struct nested_call_node *tncur, tnode; > > -spin_lock_irqsave(&ncalls->lock, flags); > +head = dlock_list_hash(&ncalls->tasks_call_list, ctx); > > +spin_lock_irqsave(&head->lock, flags); > /* > * Try to see if the current task is already inside this wakeup call. > - * We use a list here, since the popula
[PATCH] ie31200_edac: add Intel Kaby Lake CPU support
Kaby Lake seems to work just like Skylake. Signed-off-by: Jason Baron Reported-and-tested-by: Doug Thompson Cc: Borislav Petkov Cc: Mauro Carvalho Chehab Cc: Tony Luck --- drivers/edac/ie31200_edac.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/edac/ie31200_edac.c b/drivers/edac/ie31200_edac.c index 2733fb5..596be0a 100644 --- a/drivers/edac/ie31200_edac.c +++ b/drivers/edac/ie31200_edac.c @@ -18,10 +18,12 @@ * 0c04: Xeon E3-1200 v3/4th Gen Core Processor DRAM Controller * 0c08: Xeon E3-1200 v3 Processor DRAM Controller * 1918: Xeon E3-1200 v5 Skylake Host Bridge/DRAM Registers + * 5918: Xeon E3-1200 Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers * * Based on Intel specification: * http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e3-1200v3-vol-2-datasheet.pdf * http://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200-family-vol-2-datasheet.html + * http://www.intel.com/content/www/us/en/processors/core/7th-gen-core-family-mobile-h-processor-lines-datasheet-vol-2.html * * According to the above datasheet (p.16): * " @@ -57,6 +59,7 @@ #define PCI_DEVICE_ID_INTEL_IE31200_HB_6 0x0c04 #define PCI_DEVICE_ID_INTEL_IE31200_HB_7 0x0c08 #define PCI_DEVICE_ID_INTEL_IE31200_HB_8 0x1918 +#define PCI_DEVICE_ID_INTEL_IE31200_HB_9 0x5918 #define IE31200_DIMMS 4 #define IE31200_RANKS 8 @@ -376,7 +379,11 @@ static int ie31200_probe1(struct pci_dev *pdev, int dev_idx) void __iomem *window; struct ie31200_priv *priv; u32 addr_decode, mad_offset; - bool skl = (pdev->device == PCI_DEVICE_ID_INTEL_IE31200_HB_8); + /* +* Kaby Lake seems to work like Skylake. Please re-visit this logic +* when adding new cpu support. +*/ + bool skl = (pdev->device >= PCI_DEVICE_ID_INTEL_IE31200_HB_8); edac_dbg(0, "MC:\n"); @@ -560,6 +567,9 @@ static const struct pci_device_id ie31200_pci_tbl[] = { PCI_VEND_DEV(INTEL, IE31200_HB_8), PCI_ANY_ID, PCI_ANY_ID, 0, 0, IE31200}, { + PCI_VEND_DEV(INTEL, IE31200_HB_9), PCI_ANY_ID, PCI_ANY_ID, 0, 0, + IE31200}, + { 0, }/* 0 terminated list. */ }; -- 2.6.1
Re: [PATCH] epoll: avoid calling ep_call_nested() from ep_poll_safewake()
On 10/28/2017 10:05 AM, Davidlohr Bueso wrote: > On Wed, 18 Oct 2017, Jason Baron wrote: > >> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html >> >> We can work through these ideas or others... > > So, unsurprisingly, I got some _really_ good results on the epoll_wait() > benchmark by removing the readywalk list altogether -- something like > 4000% in some cases. It's also survived more testing, although I'm still > waiting for customer confirmation on his real workload, which helps > further test the changes (along with this patch to move the safewake > list to debug). > > Could you please resend patch 1 in the series above? > Hi, I've resent patch 1 with some cleanups now. Thanks for the testing feedback. Thanks, -Jason
Re: [PATCH] epoll: remove ep_call_nested() from ep_eventpoll_poll()
On 11/01/2017 04:53 PM, Andrew Morton wrote: > On Tue, 31 Oct 2017 07:58:21 -0700 Davidlohr Bueso wrote: > >> On Tue, 31 Oct 2017, Jason Baron wrote: >> >>> The use of ep_call_nested() in ep_eventpoll_poll(), which is the .poll >>> routine for an epoll fd, is used to prevent excessively deep epoll >>> nesting, and to prevent circular paths. However, we are already preventing >>> these conditions during EPOLL_CTL_ADD. In terms of too deep epoll chains, >>> we do in fact allow deep nesting of the epoll fds themselves (deeper >>> than EP_MAX_NESTS), however we don't allow more than EP_MAX_NESTS when >>> an epoll file descriptor is actually connected to a wakeup source. Thus, >>> we do not require the use of ep_call_nested(), since ep_eventpoll_poll(), >>> which is called via ep_scan_ready_list() only continues nesting if there >>> are events available. Since ep_call_nested() is implemented using a global >>> lock, applications that make use of nested epoll can see large performance >>> improvements with this change. >> >> Improvements are quite obscene actually, such as for the following >> epoll_wait() >> benchmark with 2 level nesting on a 80 core IvyBridge: >> >> ncpus vanilla dirty delta >> 1 2447092 3028315 +23.75% >> 4 231265 2986954 +1191.57% >> 8 121631 2898796 +2283.27% >> 16 59749 2902056 +4757.07% >> 32 268372326314 +8568.30% >> 64 12926 1341281 +10276.61% >> >> (http://linux-scalability.org/epoll/epoll-test.c) > > This is tempting, but boy it is late in the -rc cycle. > > How important are these workloads? Would the world end if we held off > on this for 4.15? > Hi Andrew, As Davidlohr pointed out, the nested epoll case is less common and these locks have been here for a long time. I also think the patch needs to be in linux-next for a bit and validated more, before hitting mainline. Thanks, -Jason
[PATCH] jump_label: invoke jump_label_test() via early_initcall()
Fengguang Wu reported that running the rcuperf test during boot can cause the jump_label_test() to hit a WARN_ON(). The issue is that the core jump label code relies on kernel_text_address() to detect when it can no longer update branches that may be contained in __init sections. The kernel_text_address() in turn assumes that if the system_state variable is greter than or equal to SYSTEM_RUNNING then __init sections are no longer valid (since the assumption is that they have been freed). However, when rcuperf is setup to run in early boot it can call kernel_power_off() which sets the system_state to SYSTEM_POWER_OFF. Since rcuperf initialization is invoked via a module_init(), we can make the dependency of jump_label_test() needing to complete before rcuperf explicit by calling it via early_initcall(). Reported-by: Fengguang Wu Cc: "Paul E. McKenney" Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Ingo Molnar Signed-off-by: Jason Baron --- kernel/jump_label.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/jump_label.c b/kernel/jump_label.c index 0bf2e8f5..7c3774a 100644 --- a/kernel/jump_label.c +++ b/kernel/jump_label.c @@ -769,7 +769,7 @@ static __init int jump_label_test(void) return 0; } -late_initcall(jump_label_test); +early_initcall(jump_label_test); #endif /* STATIC_KEYS_SELFTEST */ #endif /* HAVE_JUMP_LABEL */ -- 2.6.1
Re: [PATCH v3 2/2] livepatch: add atomic replace
On 10/17/2017 09:50 AM, Miroslav Benes wrote: > On Tue, 17 Oct 2017, Miroslav Benes wrote: > >> On Tue, 10 Oct 2017, Jason Baron wrote: >> >>> >>> >>> On 10/06/2017 06:32 PM, Josh Poimboeuf wrote: >>>> On Wed, Sep 27, 2017 at 11:41:30PM -0400, Jason Baron wrote: >>>>> Since 'atomic replace' has completely replaced all previous livepatch >>>>> modules, it explicitly disables all previous livepatch modules. However, >>>>> previous livepatch modules that have been replaced, can be re-enabled >>>>> if they have the 'replace' flag set. In this case the set of 'nop' >>>>> functions is re-calculated, such that it replaces all others. >>>>> >>>>> For example, if kpatch-a.ko and kpatch-b.ko have the 'replace' flag set >>>>> then: >>>>> >>>>> # insmod kpatch-a.ko >>>>> # insmod kpatch-b.ko >>>>> >>>>> At this point we have: >>>>> >>>>> # cat /sys/kernel/livepatch/kpatch-a/enabled >>>>> 0 >>>>> >>>>> # cat /sys/kernel/livepatch/kpatch-b/enabled >>>>> 1 >>>>> >>>>> To revert to the kpatch-a state we can do: >>>>> >>>>> echo 1 > /sys/kernel/livepatch/kpatch-a/enabled >>>>> >>>>> And now we have: >>>>> >>>>> # cat /sys/kernel/livepatch/kpatch-a/enabled >>>>> 1 >>>>> >>>>> # cat /sys/kernel/livepatch/kpatch-b/enabled >>>>> 0 >>>> >>>> I don't really like allowing a previously replaced patch to replace the >>>> current patch. It's just more unnecessary complexity. If the user >>>> wants to atomically revert back to kpatch-a, they should be able to: >>>> >>>> rmmod kpatch-a >>>> insmod kpatch-a.ko >>>> >> >> I agree. >> >>> Right - that's how I sent v1 (using rmmod/insmod to revert), but it >>> didn't account for the fact the patch or some functions may be marked >>> 'immediate' and thus its not possible to just do 'rmmod'. Thus, since in >>> some cases 'rmmod' was not feasible, I thought it would be simpler from >>> an operational pov to just say we always revert by re-enabling a >>> previously replaced patch as opposed to rmmod/insmod. >>> >>> >>>>> Note that it may be possible to unload (rmmod) replaced patches in some >>>>> cases based on the consistency model, when we know that all the functions >>>>> that are contained in the patch may no longer be in used, however its >>>>> left as future work, if this functionality is desired. >>>> >>>> If you don't allow a previously replaced patch to be enabled again, I >>>> think it would be trivial to let it be unloaded. >>>> >>> >>> The concern is around being replaced by 'immediate' functions and thus >>> not knowing if the code is still in use. >> >> Hm. Would it make sense to remove immediate and rely only on the >> consistency model? At least for the architectures where the model is >> implemented (x86_64)? >> >> If not, then I'd keep such modules there without a possibility to remove >> them ever. If its functionality was required again, it would of course >> mean to insmod a new module with it. > > Petr pointed out (offline) that immediate could still be useful. So let me > describe how I envision the whole atomic replace semantics. > > Let's have three functions - a, b, c. Patch 1 is immediate and patches > a(). > > func a b c > patches 1i > > Patch 2 is not immediate and patches b() > > func a b c > patches 1i > 2 > > Patch 3 is atomic replace, which patches a() and c(). > > func a b c > patches 1i > 2 > 3r 3r > > With 3 applied, 3versions of a() and c() are called, original b() is > called. 2 can be rmmoded. 1 cannot, because it is immediate. 1 and 2 > cannot be reenabled. Yes, if we change back to allowing the 'atomic replace' patch to drop the module reference on previous modules when possible, then yes 2 can be rmmoded. I originally started off with that model of being able to do 'rev
Re: [PATCH v4 3/3] livepatch: add atomic replace
On 10/17/2017 10:18 AM, Petr Mladek wrote: > On Thu 2017-10-12 17:12:29, Jason Baron wrote: >> Sometimes we would like to revert a particular fix. This is currently >> This is not easy because we want to keep all other fixes active and we >> could revert only the last applied patch. >> >> One solution would be to apply new patch that implemented all >> the reverted functions like in the original code. It would work >> as expected but there will be unnecessary redirections. In addition, >> it would also require knowing which functions need to be reverted at >> build time. >> >> A better solution would be to say that a new patch replaces >> an older one. This might be complicated if we want to replace >> a particular patch. But it is rather easy when a new cummulative >> patch replaces all others. >> >> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c >> index f53eed5..d1c7a06 100644 >> --- a/kernel/livepatch/core.c >> +++ b/kernel/livepatch/core.c >> @@ -283,8 +301,21 @@ static int klp_write_object_relocations(struct module >> *pmod, >> return ret; >> } >> >> +atomic_t klp_nop_release_count; >> +static DECLARE_WAIT_QUEUE_HEAD(klp_nop_release_wait); >> + >> static void klp_kobj_release_object(struct kobject *kobj) >> { >> +struct klp_object *obj; >> + >> +obj = container_of(kobj, struct klp_object, kobj); >> +/* Free dynamically allocated object */ >> +if (!obj->funcs) { >> +kfree(obj->name); >> +kfree(obj); >> +atomic_dec(&klp_nop_release_count); >> +wake_up(&klp_nop_release_wait); > > I would slightly optimize this by > > if (atomic_dec_and_test((&klp_nop_release_count)) > wake_up(&klp_nop_release_wait); > >> +} >> } >> >> static struct kobj_type klp_ktype_object = { >> @@ -294,6 +325,16 @@ static struct kobj_type klp_ktype_object = { >> >> static void klp_kobj_release_func(struct kobject *kobj) >> { >> +struct klp_func *func; >> + >> +func = container_of(kobj, struct klp_func, kobj); >> +/* Free dynamically allocated functions */ >> +if (!func->new_func) { >> +kfree(func->old_name); >> +kfree(func); >> +atomic_dec(&klp_nop_release_count); >> +wake_up(&klp_nop_release_wait); > > Same here > > if (atomic_dec_and_test((&klp_nop_release_count)) > wake_up(&klp_nop_release_wait); > > >> +} >> } >> >> static struct kobj_type klp_ktype_func = { >> @@ -436,8 +480,14 @@ static int klp_init_object(struct klp_patch *patch, >> struct klp_object *obj) >> if (ret) >> return ret; >> >> -klp_for_each_func(obj, func) { >> -ret = klp_init_func(obj, func); >> +list_add(&obj->obj_entry, &patch->obj_list); >> +INIT_LIST_HEAD(&obj->func_list); >> + >> +if (nop) >> +return 0; > > Ah, this is something that I wanted to avoid. It makes the code > very hard to read and maintain. It forces us to duplicate > some code in klp_alloc_nop_func(). I think that I complained > about this in v2 already. > > I understand that you actually kept it because of me. > It is related to the possibility to re-enable released > patches :-( hmmm, I actually think that it is not necessary - although I didn't try removing it. I was concerned that we would call klp_init_func() for functions that were already initialized. However, this only called right after creating a brand new object, so there are no associated functions, and thus I think the above early return is not necessary. I can try removing it. > > The klp_init_*() stuff is called from __klp_enable_patch() > for the "nop" functions now. And it has already been called > for the statically defined structures in klp_register_patch(). > Therefore we need to avoid calling it twice for the static > structures. > > One solution would be to do these operations for the statically > defined structures in __klp_enable_patch() as well. But this > would mean a big redesign of the code. > > Another solution would be to give up on the idea that the replaced > patches might be re-enabled without re-loading. I am afraid > that this the only reasonable approach. It will help to avoid > also the extra klp_replaced_patches list. All this will help to > make the code much easier. > It seems like the consensus is to abandon re-enable and go back to rmmod/insmod for revert. I do like the rmmod/insmod revert path in that it keeps module list smaller. However, there are cases where that is not possible, but it sounds like in the interest of keeping the code simpler for the common case, revert would not be possible in those cases, and we would simply have to add a new livepatch module in those cases to fix things. Thanks, -Jason
Re: [PATCH] epoll: avoid calling ep_call_nested() from ep_poll_safewake()
On 10/17/2017 11:37 AM, Davidlohr Bueso wrote: > On Fri, 13 Oct 2017, Jason Baron wrote: > >> The ep_poll_safewake() function is used to wakeup potentially nested >> epoll >> file descriptors. The function uses ep_call_nested() to prevent entering >> the same wake up queue more than once, and to prevent excessively deep >> wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary >> since we are already preventing these conditions during EPOLL_CTL_ADD. >> This >> saves extra function calls, and avoids taking a global lock during the >> ep_call_nested() calls. > > This makes sense. > >> >> I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC >> case, since ep_call_nested() keeps track of the nesting level, and >> this is >> required by the call to spin_lock_irqsave_nested(). It would be nice to >> remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as >> well, however its not clear how to simply pass the nesting level through >> multiple wake_up() levels without more surgery. In any case, I don't >> think >> CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch, >> also >> apparently fixes a workload at Google that Salman Qazi reported by >> completely removing the poll_safewake_ncalls->lock from wakeup paths. > > I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as > I was tackling the nested epoll scaling issue with loop and readywalk lists > in mind. >> I'm not sure the details of the workload - perhaps Salman can elaborate further about it. It would seem that the safewake would potentially be the most contended in general in the nested case, because generally you have a few epoll fds attached to lots of sources doing wakeups. So those sources are all going to conflict on the safewake lock. The readywalk is used when performing a 'nested' poll and in general this is likely going to be called on a few epoll fds. That said, we should remove it too. I will post a patch to remove it. The loop lock is used during EPOLL_CTL_ADD to check for loops and deep wakeup paths and so I would expect this to be less common, but I wouldn't doubt there are workloads impacted by it. We can potentially, I think remove this one too - and the global 'epmutex'. I posted some ideas a while ago on it: http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html We can work through these ideas or others... Thanks, -Jason >> Signed-off-by: Jason Baron >> Cc: Alexander Viro >> Cc: Andrew Morton >> Cc: Salman Qazi > > Acked-by: Davidlohr Bueso > >> --- >> fs/eventpoll.c | 47 ++- >> 1 file changed, 18 insertions(+), 29 deletions(-) > > Yay for getting rid of some of the callback hell. > > Thanks, > Davidlohr
Re: [PATCH v7 7/6] fs/epoll: scale nested callbacks
On 10/17/2017 11:53 AM, Davidlohr Bueso wrote: > On Mon, 16 Oct 2017, Jason Baron wrote: > >> Hi, >> >> I posted a patch to completely remove the lock here from the >> ep_poll_safewake() patch here: >> >> http://lkml.iu.edu/hypermail/linux/kernel/1710.1/05834.html > > Kernel development never ceases to amaze me. Two complementary > fixes to a 8+ y/o performance issue on the same day - not that > nested epolls are that common, but it also comes from two real > workloads... > > Getting rid of the lock altogether is always the best way. > >> >> So these are going to conflict. The reason its safe to remove the lock >> is that there are loop and depth checks now that are performed during >> EPOLL_CTL_ADD. Specifically, ep_loop_check(). I would prefer to these >> checks once add add time as opposed to at each wakeup (even if they can >> be scaled better). > > Wrt conflicts, no worries, I'll rebase -- and hopefully we can get > the dlock stuff in for v4.15 as well. > >> >> I also have a pending patch to do something similar for >> poll_readywalk_ncalls, but I think that poll_safewake_ncalls is the most >> egregious one here? > > The customer's workload issues are for the loop_ncalls and readywalk_ncalls > lists, so I'd be interested in seeing your patch for the later. The reason > your patch above is likely not to help my scenario is that most of the time > is spent at a dispatcher level doing epoll_wait, not too many > EPOLL_CTL_ADDs > going on afaict. If there are not many EPOLL_CTL_ADDs, then I wouldn't think loop_ncalls would be highly contented (since it should only be called from the add path)? Thanks, -Jason > > Thanks, > Davidlohr > >> >> Thanks, >> >> -Jason >> >> On 10/13/2017 11:45 AM, Davidlohr Bueso wrote: >>> A customer reported massive contention on the ncalls->lock in which >>> the workload is designed around nested epolls (where the fd is already >>> an epoll). >>> >>> 83.49% [kernel] [k] __pv_queued_spin_lock_slowpath >>> 2.70% [kernel] [k] ep_call_nested.constprop.13 >>> 1.94% [kernel] [k] _raw_spin_lock_irqsave >>> 1.83% [kernel] [k] >>> __raw_callee_save___pv_queued_spin_unlock >>> 1.45% [kernel] [k] _raw_spin_unlock_irqrestore >>> 0.41% [kernel] [k] ep_scan_ready_list.isra.8 >>> 0.36% [kernel] [k] pvclock_clocksource_read >>> >>> The application running on kvm, is using a shared memory IPC >>> communication >>> with a pipe wakeup mechanism, and a two level dispatcher both built >>> around >>> 'epoll_wait'. There is one process per physical core and a full mesh of >>> pipes >>> between them, so on a 18 core system (the actual case), there are 18*18 >>> pipes >>> for the IPCs alone. >>> >>> This patch proposes replacing the nested calls global linked list with a >>> dlock >>> interface, for which we can benefit from pcpu lists when doing >>> ep_poll_safewake(), >>> and hashing for the current task, which provides two benefits: >>> >>> 1. Speeds up the process of loop and max-depth checks from O(N) lookups >>> to O(1) >>> (albeit possible collisions, which we have to iterate); and, >>> >>> 2. Provides smaller lock granularity. >>> >>> cpus before after diff >>> 1 1409370 1344804 -4.58% >>> 2 1015690 1337053 31.63% >>> 3 721009 1273254 76.59% >>> 4 380959 1128931 196.33% >>> 5 287603 1028362 257.56% >>> 6 221104 894943 304.76% >>> 7 173592 976395 462.46% >>> 8 145860 922584 532.51% >>> 9 127877 925900 624.05% >>> 10 112603 791456 602.87% >>> 11 97926 724296 639.63% >>> 12 80732 730485 804.82% >>> >>> With the exception of a single cpu, where the overhead of jhashing >>> influences), we >>> get some pretty good raw throughput increase. Similarly, the amount of >>> time spent >>> decreases immensely as well. >>> >>> Passes ltp related testcases. >>> >>> Signed-off-by: Davidlohr Bueso >>> --- >>> fs/eventpoll.c | 88 >>> +++--- >>> 1 file chan
Re: [PATCH v3 2/2] livepatch: add atomic replace
On 10/18/2017 07:25 AM, Petr Mladek wrote: > On Wed 2017-10-18 11:10:09, Miroslav Benes wrote: >> On Tue, 17 Oct 2017, Jason Baron wrote: >>> If the atomic replace patch does >>> not contain any immediates, then we can drop the reference on the >>> immediately preceding patch only. That is because there may have been >>> previous transitions to immediate functions in the func stack, and the >>> transition to the atomic replace patch only checks immediately preceding >>> transition. It would be possible to check all of the previous immediate >>> function transitions, but this adds complexity and seems like not a >>> common pattern. So I would suggest that we just drop the reference on >>> the previous patch if the atomic replace patch does not contain any >>> immediate functions. >> >> It is even more complicated and it is not connected only to atomic replace >> patch (I realized this while reading the first part of your email and >> then you confirmed it with this paragraph). The consistency model is >> broken with respect to immediate patches. >> >> func a >> patches 1i >> 2i >> 3 >> >> Now, when you're applying 3, only 2i function is checked. But there might >> be a task sleeping in 1i. Such task would be migrated to 3, because we do >> not check 1 in klp_check_stack_func() at all. >> >> I see three solutions. >> >> 1. Say it is an user's fault. Since it is not obvious and it is >> easy-to-make mistake, I would not go this way. >> >> 2. We can fix klp_check_stack_func() in an exact way you're proposing. >> We'd go back in func stack as long as there are immediate patches there. >> This adds complexity and I'm not sure if all the problems would be solved >> because scenarios how patches are stacked and applied to different >> functions may be quite complex. >> >> 3. Drop immediate. It causes problems only and its advantages on x86_64 >> are theoretical. You would still need to solve the interaction with atomic >> replace on other architecture with immediate preserved, but that may be >> easier. Or we can be aggressive and drop immediate completely. The force >> transition I proposed earlier could achieve the same. > > To make it clear. We currently rely on the immediate handling on > architectures without a reliable stack checking. The question > is if anyone uses it for another purpose in practice. > > A solution would be to remove the per-func immediate flag > and invert the logic of the per-patch one. We could rename > it to something like "consistency_required" or "semantic_changes". > A patch with this flag set then might be refused on systems > without reliable stacks. Otherwise, the consistency model > would be used for all patches. > > As a result, all patches would be handled either using > the consistency model or immediately. We would need to > care about any mix of these. > > Best Regards, > Petr > Hi, So for atomic replace, it seems as if we don't want to allow replaced patches to be re-enabled but instead, allow them to rmmod/insmod'ed to allow revert. In your above proposed model around immediate, if all patches are immediate then the atomic replace doesn't allow any of the previous patches to be removed from the kernel via rmmod, while if all patches are handled using the consistency model then all patches previous to the atomic replace patch could be removed via rmmod. So this would simplify the logic around when replaced patches could removed. I was thinking in the current model to only allow the one preceding patch to the atomic replace patch to be rmmod'able, if the atomic replace patch was using the consistency model (to avoid complications with immediate functions/patches). And this could be extended to all *all* patches previous to the atomic replace patch to be rmmod'able when we move to the proposed model around immediates. So I don't think this should hold up the atomic replace patches? Thanks, -Jason
Re: [PATCH v3 2/2] livepatch: add atomic replace
On 10/18/2017 05:10 AM, Miroslav Benes wrote: > On Tue, 17 Oct 2017, Jason Baron wrote: > >> >> >> On 10/17/2017 09:50 AM, Miroslav Benes wrote: >>> On Tue, 17 Oct 2017, Miroslav Benes wrote: >>> >>>> On Tue, 10 Oct 2017, Jason Baron wrote: >>>> >>>>> >>>>> >>>>> On 10/06/2017 06:32 PM, Josh Poimboeuf wrote: >>>>>> On Wed, Sep 27, 2017 at 11:41:30PM -0400, Jason Baron wrote: >>>>>>> Since 'atomic replace' has completely replaced all previous livepatch >>>>>>> modules, it explicitly disables all previous livepatch modules. However, >>>>>>> previous livepatch modules that have been replaced, can be re-enabled >>>>>>> if they have the 'replace' flag set. In this case the set of 'nop' >>>>>>> functions is re-calculated, such that it replaces all others. >>>>>>> >>>>>>> For example, if kpatch-a.ko and kpatch-b.ko have the 'replace' flag set >>>>>>> then: >>>>>>> >>>>>>> # insmod kpatch-a.ko >>>>>>> # insmod kpatch-b.ko >>>>>>> >>>>>>> At this point we have: >>>>>>> >>>>>>> # cat /sys/kernel/livepatch/kpatch-a/enabled >>>>>>> 0 >>>>>>> >>>>>>> # cat /sys/kernel/livepatch/kpatch-b/enabled >>>>>>> 1 >>>>>>> >>>>>>> To revert to the kpatch-a state we can do: >>>>>>> >>>>>>> echo 1 > /sys/kernel/livepatch/kpatch-a/enabled >>>>>>> >>>>>>> And now we have: >>>>>>> >>>>>>> # cat /sys/kernel/livepatch/kpatch-a/enabled >>>>>>> 1 >>>>>>> >>>>>>> # cat /sys/kernel/livepatch/kpatch-b/enabled >>>>>>> 0 >>>>>> >>>>>> I don't really like allowing a previously replaced patch to replace the >>>>>> current patch. It's just more unnecessary complexity. If the user >>>>>> wants to atomically revert back to kpatch-a, they should be able to: >>>>>> >>>>>> rmmod kpatch-a >>>>>> insmod kpatch-a.ko >>>>>> >>>> >>>> I agree. >>>> >>>>> Right - that's how I sent v1 (using rmmod/insmod to revert), but it >>>>> didn't account for the fact the patch or some functions may be marked >>>>> 'immediate' and thus its not possible to just do 'rmmod'. Thus, since in >>>>> some cases 'rmmod' was not feasible, I thought it would be simpler from >>>>> an operational pov to just say we always revert by re-enabling a >>>>> previously replaced patch as opposed to rmmod/insmod. >>>>> >>>>> >>>>>>> Note that it may be possible to unload (rmmod) replaced patches in some >>>>>>> cases based on the consistency model, when we know that all the >>>>>>> functions >>>>>>> that are contained in the patch may no longer be in used, however its >>>>>>> left as future work, if this functionality is desired. >>>>>> >>>>>> If you don't allow a previously replaced patch to be enabled again, I >>>>>> think it would be trivial to let it be unloaded. >>>>>> >>>>> >>>>> The concern is around being replaced by 'immediate' functions and thus >>>>> not knowing if the code is still in use. >>>> >>>> Hm. Would it make sense to remove immediate and rely only on the >>>> consistency model? At least for the architectures where the model is >>>> implemented (x86_64)? >>>> >>>> If not, then I'd keep such modules there without a possibility to remove >>>> them ever. If its functionality was required again, it would of course >>>> mean to insmod a new module with it. >>> >>> Petr pointed out (offline) that immediate could still be useful. So let me >>> describe how I envision the whole atomic replace semantics. >>> >>> Let's have three functions - a, b, c. Patch 1 is immediate and patches >>> a(). >>> >>> func
[PATCH] epoll: remove ep_call_nested() from ep_eventpoll_poll()
The use of ep_call_nested() in ep_eventpoll_poll(), which is the .poll routine for an epoll fd, is used to prevent excessively deep epoll nesting, and to prevent circular paths. However, we are already preventing these conditions during EPOLL_CTL_ADD. In terms of too deep epoll chains, we do in fact allow deep nesting of the epoll fds themselves (deeper than EP_MAX_NESTS), however we don't allow more than EP_MAX_NESTS when an epoll file descriptor is actually connected to a wakeup source. Thus, we do not require the use of ep_call_nested(), since ep_eventpoll_poll(), which is called via ep_scan_ready_list() only continues nesting if there are events available. Since ep_call_nested() is implemented using a global lock, applications that make use of nested epoll can see large performance improvements with this change. Signed-off-by: Jason Baron Cc: Davidlohr Bueso Cc: Alexander Viro Cc: Andrew Morton Cc: Salman Qazi Cc: Hou Tao --- fs/eventpoll.c | 80 +- 1 file changed, 35 insertions(+), 45 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 69acfab..2e85951 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -276,9 +276,6 @@ static DEFINE_MUTEX(epmutex); /* Used to check for epoll file descriptor inclusion loops */ static struct nested_calls poll_loop_ncalls; -/* Used to call file's f_op->poll() under the nested calls boundaries */ -static struct nested_calls poll_readywalk_ncalls; - /* Slab cache used to allocate "struct epitem" */ static struct kmem_cache *epi_cache __read_mostly; @@ -867,11 +864,33 @@ static int ep_eventpoll_release(struct inode *inode, struct file *file) return 0; } -static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt) +static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, + void *priv); +static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, +poll_table *pt); + +/* + * Differs from ep_eventpoll_poll() in that internal callers already have + * the ep->mtx so we need to start from depth=1, such that mutex_lock_nested() + * is correctly annotated. + */ +static unsigned int ep_item_poll(struct epitem *epi, poll_table *pt, int depth) { + struct eventpoll *ep; + bool locked; + pt->_key = epi->event.events; + if (!is_file_epoll(epi->ffd.file)) + return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & + epi->event.events; - return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events; + ep = epi->ffd.file->private_data; + poll_wait(epi->ffd.file, &ep->poll_wait, pt); + locked = pt && (pt->_qproc == ep_ptable_queue_proc); + + return ep_scan_ready_list(epi->ffd.file->private_data, + ep_read_events_proc, &depth, depth, + locked) & epi->event.events; } static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, @@ -879,13 +898,15 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, { struct epitem *epi, *tmp; poll_table pt; + int depth = *(int *)priv; init_poll_funcptr(&pt, NULL); + depth++; list_for_each_entry_safe(epi, tmp, head, rdllink) { - if (ep_item_poll(epi, &pt)) + if (ep_item_poll(epi, &pt, depth)) { return POLLIN | POLLRDNORM; - else { + } else { /* * Item has been dropped into the ready list by the poll * callback, but it's not actually ready, as far as @@ -899,48 +920,20 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, return 0; } -static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, -poll_table *pt); - -struct readyevents_arg { - struct eventpoll *ep; - bool locked; -}; - -static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests) -{ - struct readyevents_arg *arg = priv; - - return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL, - call_nests + 1, arg->locked); -} - static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait) { - int pollflags; struct eventpoll *ep = file->private_data; - struct readyevents_arg arg; - - /* -* During ep_insert() we already hold the ep->mtx for the tfile. -* Prevent re-aquisition. -*/ - arg.locked = wait && (wait->_qproc == ep_ptable_queue_proc); - arg.ep = ep; + int depth = 0; /* Insert i
Re: [PATCH v2 3/3] af_unix: optimize the unix_dgram_recvmsg()
On 10/05/2015 03:41 AM, Peter Zijlstra wrote: > On Fri, Oct 02, 2015 at 08:44:02PM +0000, Jason Baron wrote: >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c >> index f789423..b8ed1bc 100644 >> --- a/net/unix/af_unix.c >> +++ b/net/unix/af_unix.c > >> @@ -1079,6 +1079,9 @@ static long unix_wait_for_peer(struct sock *other, >> long timeo) >> >> prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE); >> >> +set_bit(UNIX_NOSPACE, &u->flags); >> +/* pairs with mb in unix_dgram_recv */ >> +smp_mb__after_atomic(); >> sched = !sock_flag(other, SOCK_DEAD) && >> !(other->sk_shutdown & RCV_SHUTDOWN) && >> unix_recvq_full(other); >> @@ -1623,17 +1626,22 @@ restart: >> >> if (unix_peer(other) != sk && unix_recvq_full(other)) { >> if (!timeo) { >> +set_bit(UNIX_NOSPACE, &unix_sk(other)->flags); >> +/* pairs with mb in unix_dgram_recv */ >> +smp_mb__after_atomic(); >> +if (unix_recvq_full(other)) { >> +err = -EAGAIN; >> +goto out_unlock; >> +} >> +} else { > >> @@ -1939,8 +1947,14 @@ static int unix_dgram_recvmsg(struct socket *sock, >> struct msghdr *msg, >> goto out_unlock; >> } >> >> +/* pairs with unix_dgram_poll() and wait_for_peer() */ >> +smp_mb(); >> +if (test_bit(UNIX_NOSPACE, &u->flags)) { >> +clear_bit(UNIX_NOSPACE, &u->flags); >> +wake_up_interruptible_sync_poll(&u->peer_wait, >> +POLLOUT | POLLWRNORM | >> +POLLWRBAND); >> +} >> >> if (msg->msg_name) >> unix_copy_addr(msg, skb->sk); >> @@ -2468,20 +2493,19 @@ static unsigned int unix_dgram_poll(struct file >> *file, struct socket *sock, >> if (!(poll_requested_events(wait) & (POLLWRBAND|POLLWRNORM|POLLOUT))) >> return mask; >> >> other = unix_peer_get(sk); >> +if (unix_dgram_writable(sk, other)) { >> mask |= POLLOUT | POLLWRNORM | POLLWRBAND; >> +} else { >> set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); >> +set_bit(UNIX_NOSPACE, &unix_sk(other)->flags); >> +/* pairs with mb in unix_dgram_recv */ >> +smp_mb__after_atomic(); >> +if (unix_dgram_writable(sk, other)) >> +mask |= POLLOUT | POLLWRNORM | POLLWRBAND; >> +} >> +if (other) >> +sock_put(other); >> >> return mask; >> } > > > So I must object to these barrier comments; stating which other barrier > they pair with is indeed good and required, but its not sufficient. > > A barrier comment should also explain the data ordering -- the most > important part. > > As it stands its not clear to me these barriers are required at all, but > this is not code I understand so I might well miss something obvious. > So in patch 1/3 I've added sockets that call connect() onto the 'peer_wait' queue of the server. The reason being that when the server reads from its socket (unix_dgram_recvmsg()) it can wake up n clients that may be waiting for the receive queue of the server to empty. This was previously accomplished in unix_dgram_poll() via the: sock_poll_wait(file, &unix_sk(other)->peer_wait, wait); which I've removed. The issue with that approach is that the 'other' socket or server socket can close() and go away out from under the poll() call. Thus, I've converted back unix_dgram_poll(), to simply wait on its own socket's wait queue, which is the semantics that poll()/select()/ epoll() expect. However, I still needed a way to signal the client socket, and thus I've added the callback on connect() in patch 1/3. However, now that the client sockets are added during connect(), and not poll() as was previously done, this means that any calls to unix_dgram_recvmsg() have to walk the entire wakeup list, even if nobody is sitting in poll(). Thus, I've introduced the UNIX_SOCK flag to signal that we are in poll() to the server side, such that it doesn't have to potentially walk a long list of wakeups for no reason. This makes a difference on this test case: http://www.spinics.net/lists/netdev/msg145533.html which I found while working on this issue.
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On 10/05/2015 12:31 PM, Rainer Weikusat wrote: > Jason Baron writes: >> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait >> queue associated with the socket s that we are poll'ing against, but also >> calls >> sock_poll_wait() for a remote peer socket p, if it is connected. Thus, >> if we call poll()/select()/epoll() for the socket s, there are then >> a couple of code paths in which the remote peer socket p and its associated >> peer_wait queue can be freed before poll()/select()/epoll() have a chance >> to remove themselves from the remote peer socket. >> >> The way that remote peer socket can be freed are: >> >> 1. If s calls connect() to a connect to a new socket other than p, it will >> drop its reference on p, and thus a close() on p will free it. >> >> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop >> the final reference to p, allowing it to be freed. > > Here's a more simple idea which _might_ work. The underlying problem > seems to be that the second sock_poll_wait introduces a covert reference > to the peer socket which isn't accounted for. The basic idea behind this > is to execute an additional sock_hold for the peer whenever the > sock_poll_wait is called for it and store it (the struct sock *) in a > new struct unix_sock member. Upon entering unix_dgram_poll, if the > member is not NULL, it's cleared and a sock_put for its former value is > done. The 'poll peer not NULL -> sock_put it' code is also added to the > destructor, although I'm unsure if this is really necessary. The patch > below also includes the additional SOCK_DEAD test suggested by Martin as > that seems generally sensible to me. > > NB: This has survived both Martin's and my test programs for a number > of executions/ longer periods of time than was common before without > generating list corruption warnings. The patch below is against 'my' > 3.2.54 and is here provided as a suggestion in the hope that it will be > useful for someting, not as patch submission, as I spent less time > thinking through this than I should ideally have but despite of this, > it's another 2.5 hours of my life spent on something completely > different than what I should be working on at the moment. > > -- > diff -pru linux-2-6/include/net/af_unix.h linux-2-6.p/include/net/af_unix.h > --- linux-2-6/include/net/af_unix.h 2014-01-20 21:52:53.0 + > +++ linux-2-6.p/include/net/af_unix.h 2015-10-05 15:11:20.270958297 +0100 > @@ -50,6 +50,7 @@ struct unix_sock { > struct vfsmount *mnt; > struct mutexreadlock; > struct sock *peer; > + struct sock *poll_peer; > struct sock *other; > struct list_headlink; > atomic_long_t inflight; > diff -pru linux-2-6/net/unix/af_unix.c linux-2-6.p/net/unix/af_unix.c > --- linux-2-6/net/unix/af_unix.c 2014-01-22 16:51:52.0 + > +++ linux-2-6.p/net/unix/af_unix.c2015-10-05 17:05:28.358273567 +0100 > @@ -361,6 +361,9 @@ static void unix_sock_destructor(struct > if (u->addr) > unix_release_addr(u->addr); > > + if (u->poll_peer) > + sock_put(u->poll_peer); > + > atomic_long_dec(&unix_nr_socks); > local_bh_disable(); > sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); > @@ -625,6 +628,7 @@ static struct sock *unix_create1(struct > u = unix_sk(sk); > u->dentry = NULL; > u->mnt= NULL; > + u->poll_peer = NULL; > spin_lock_init(&u->lock); > atomic_long_set(&u->inflight, 0); > INIT_LIST_HEAD(&u->link); > @@ -2135,8 +2139,16 @@ static unsigned int unix_poll(struct fil > static unsigned int unix_dgram_poll(struct file *file, struct socket *sock, > poll_table *wait) > { > - struct sock *sk = sock->sk, *other; > - unsigned int mask, writable; > + struct sock *sk = sock->sk, *other, *pp; > + struct unix_sock *u; > + unsigned int mask, writable, dead; > + > + u = unix_sk(sk); > + pp = u->poll_peer; > + if (pp) { > + u->poll_peer = NULL; > + sock_put(pp); > + } > > sock_poll_wait(file, sk_sleep(sk), wait); > mask = 0; > @@ -2170,7 +2182,20 @@ static unsigned int unix_dgram_poll(stru > other = unix_peer_get(sk); > if (other) { > if (unix_peer(other) != sk) { > - sock_poll_wait(file, &unix
[PATCH v3 2/3] net: unix: Convert gc_flags to flags
Convert gc_flags to flags in perparation for the subsequent patch, which will make use of a flag bit for a non-gc purpose. Signed-off-by: Jason Baron --- include/net/af_unix.h | 2 +- net/unix/garbage.c| 12 ++-- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/include/net/af_unix.h b/include/net/af_unix.h index 9698aff..6a4a345 100644 --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -58,7 +58,7 @@ struct unix_sock { atomic_long_t inflight; spinlock_t lock; unsigned char recursion_level; - unsigned long gc_flags; + unsigned long flags; #define UNIX_GC_CANDIDATE 0 #define UNIX_GC_MAYBE_CYCLE1 struct socket_wqpeer_wq; diff --git a/net/unix/garbage.c b/net/unix/garbage.c index a73a226..39794d9 100644 --- a/net/unix/garbage.c +++ b/net/unix/garbage.c @@ -179,7 +179,7 @@ static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *), * have been added to the queues after * starting the garbage collection */ - if (test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) { + if (test_bit(UNIX_GC_CANDIDATE, &u->flags)) { hit = true; func(u); @@ -246,7 +246,7 @@ static void inc_inflight_move_tail(struct unix_sock *u) * of the list, so that it's checked even if it was already * passed over */ - if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags)) + if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->flags)) list_move_tail(&u->link, &gc_candidates); } @@ -305,8 +305,8 @@ void unix_gc(void) BUG_ON(total_refs < inflight_refs); if (total_refs == inflight_refs) { list_move_tail(&u->link, &gc_candidates); - __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags); - __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags); + __set_bit(UNIX_GC_CANDIDATE, &u->flags); + __set_bit(UNIX_GC_MAYBE_CYCLE, &u->flags); } } @@ -332,7 +332,7 @@ void unix_gc(void) if (atomic_long_read(&u->inflight) > 0) { list_move_tail(&u->link, ¬_cycle_list); - __clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags); + __clear_bit(UNIX_GC_MAYBE_CYCLE, &u->flags); scan_children(&u->sk, inc_inflight_move_tail, NULL); } } @@ -343,7 +343,7 @@ void unix_gc(void) */ while (!list_empty(¬_cycle_list)) { u = list_entry(not_cycle_list.next, struct unix_sock, link); - __clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags); + __clear_bit(UNIX_GC_CANDIDATE, &u->flags); list_move_tail(&u->link, &gc_inflight_list); } -- 2.6.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/3] net: unix: fix use-after-free
Hi, These patches are against mainline, I can re-base to net-next, just let me know. They have been tested against: https://lkml.org/lkml/2015/9/13/195, which causes the use-after-free quite quickly and here: https://lkml.org/lkml/2015/10/2/693. Thanks, -Jason v3: -beef up memory barrier comments in 3/3 (Peter Zijlstra) -clean up unix_dgram_writable() function in 3/3 (Joe Perches) Jason Baron (3): net: unix: fix use-after-free in unix_dgram_poll() net: unix: Convert gc_flags to flags net: unix: optimize wakeups in unix_dgram_recvmsg() include/net/af_unix.h | 4 +- net/unix/af_unix.c| 117 +++--- net/unix/garbage.c| 12 +++--- 3 files changed, 101 insertions(+), 32 deletions(-) -- 2.6.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/3] net: unix: fix use-after-free in unix_dgram_poll()
The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait queue associated with the socket s that we are poll'ing against, but also calls sock_poll_wait() for a remote peer socket p, if it is connected. Thus, if we call poll()/select()/epoll() for the socket s, there are then a couple of code paths in which the remote peer socket p and its associated peer_wait queue can be freed before poll()/select()/epoll() have a chance to remove themselves from the remote peer socket. The way that remote peer socket can be freed are: 1. If s calls connect() to a connect to a new socket other than p, it will drop its reference on p, and thus a close() on p will free it. 2. If we call close on p(), then a subsequent sendmsg() from s, will drop the final reference to p, allowing it to be freed. Address this issue, by reverting unix_dgram_poll() to only register with the wait queue associated with s and register a callback with the remote peer socket on connect() that will wake up the wait queue associated with s. If scenarios 1 or 2 occur above we then simply remove the callback from the remote peer. This then presents the expected semantics to poll()/select()/ epoll(). I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll(). Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets"). Tested-by: Mathias Krause Signed-off-by: Jason Baron --- include/net/af_unix.h | 1 + net/unix/af_unix.c| 32 +++- 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/include/net/af_unix.h b/include/net/af_unix.h index 4a167b3..9698aff 100644 --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -62,6 +62,7 @@ struct unix_sock { #define UNIX_GC_CANDIDATE 0 #define UNIX_GC_MAYBE_CYCLE1 struct socket_wqpeer_wq; + wait_queue_twait; }; #define unix_sk(__sk) ((struct unix_sock *)__sk) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 03ee4d3..f789423 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion) skpair = unix_peer(sk); if (skpair != NULL) { + if (sk->sk_type != SOCK_STREAM) + remove_wait_queue(&unix_sk(skpair)->peer_wait, + &u->wait); if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) { unix_state_lock(skpair); /* No more writes */ @@ -636,6 +639,16 @@ static struct proto unix_proto = { */ static struct lock_class_key af_unix_sk_receive_queue_lock_key; +static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key) +{ + struct unix_sock *u; + + u = container_of(wait, struct unix_sock, wait); + wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key); + + return 0; +} + static struct sock *unix_create1(struct net *net, struct socket *sock, int kern) { struct sock *sk = NULL; @@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern) INIT_LIST_HEAD(&u->link); mutex_init(&u->readlock); /* single task reading lock */ init_waitqueue_head(&u->peer_wait); + init_waitqueue_func_entry(&u->wait, peer_wake); unix_insert_socket(unix_sockets_unbound(sk), sk); out: if (sk == NULL) @@ -1030,7 +1044,11 @@ restart: */ if (unix_peer(sk)) { struct sock *old_peer = unix_peer(sk); + + remove_wait_queue(&unix_sk(old_peer)->peer_wait, + &unix_sk(sk)->wait); unix_peer(sk) = other; + add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait); unix_state_double_unlock(sk, other); if (other != old_peer) @@ -1038,8 +1056,12 @@ restart: sock_put(old_peer); } else { unix_peer(sk) = other; + add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait); unix_state_double_unlock(sk, other); } + /* New remote may have created write space for us */ + wake_up_interruptible_sync_poll(sk_sleep(sk), + POLLOUT | POLLWRNORM | POLLWRBAND); return 0; out_unlock: @@ -1194,6 +1216,8 @@ restart: sock_hold(sk); unix_peer(newsk)= sk; + if (sk->sk_type == SOCK_SEQPACKET) + add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait); newsk->sk_state = TCP_ESTABLISHED; newsk->sk_type = sk->sk_type; init_peercred(new
[PATCH v3 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()
Now that connect() permanently registers a callback routine, we can induce extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up its peer_wait queue on every receive. This patch makes the wakeup there conditional on there being waiters. Signed-off-by: Jason Baron --- include/net/af_unix.h | 1 + net/unix/af_unix.c| 85 --- 2 files changed, 62 insertions(+), 24 deletions(-) diff --git a/include/net/af_unix.h b/include/net/af_unix.h index 6a4a345..cf21ffd 100644 --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -61,6 +61,7 @@ struct unix_sock { unsigned long flags; #define UNIX_GC_CANDIDATE 0 #define UNIX_GC_MAYBE_CYCLE1 +#define UNIX_NOSPACE 2 struct socket_wqpeer_wq; wait_queue_twait; }; diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index f789423..66979d4 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -326,7 +326,7 @@ found: return s; } -static inline int unix_writable(struct sock *sk) +static inline bool unix_writable(struct sock *sk) { return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf; } @@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long timeo) prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE); + set_bit(UNIX_NOSPACE, &u->flags); + /* Ensure that we either see space in the peer sk_receive_queue via the +* unix_recvq_full() check below, or we receive a wakeup when it +* empties. Pairs with the mb in unix_dgram_recvmsg(). +*/ + smp_mb__after_atomic(); sched = !sock_flag(other, SOCK_DEAD) && !(other->sk_shutdown & RCV_SHUTDOWN) && unix_recvq_full(other); @@ -1623,17 +1629,27 @@ restart: if (unix_peer(other) != sk && unix_recvq_full(other)) { if (!timeo) { - err = -EAGAIN; - goto out_unlock; - } + set_bit(UNIX_NOSPACE, &unix_sk(other)->flags); + /* Ensure that we either see space in the peer +* sk_receive_queue via the unix_recvq_full() check +* below, or we receive a wakeup when it empties. This +* makes sure that epoll ET triggers correctly. Pairs +* with the mb in unix_dgram_recvmsg(). +*/ + smp_mb__after_atomic(); + if (unix_recvq_full(other)) { + err = -EAGAIN; + goto out_unlock; + } + } else { + timeo = unix_wait_for_peer(other, timeo); - timeo = unix_wait_for_peer(other, timeo); + err = sock_intr_errno(timeo); + if (signal_pending(current)) + goto out_free; - err = sock_intr_errno(timeo); - if (signal_pending(current)) - goto out_free; - - goto restart; + goto restart; + } } if (sock_flag(other, SOCK_RCVTSTAMP)) @@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, goto out_unlock; } - wake_up_interruptible_sync_poll(&u->peer_wait, - POLLOUT | POLLWRNORM | POLLWRBAND); + /* Ensure that waiters on our sk->sk_receive_queue draining that check +* via unix_recvq_full() either see space in the queue or get a wakeup +* below. sk->sk_receive_queue is reduece by the __skb_recv_datagram() +* call above. Pairs with the mb in unix_dgram_sendmsg(), +*unix_dgram_poll(), and unix_wait_for_peer(). +*/ + smp_mb(); + if (test_bit(UNIX_NOSPACE, &u->flags)) { + clear_bit(UNIX_NOSPACE, &u->flags); + wake_up_interruptible_sync_poll(&u->peer_wait, + POLLOUT | POLLWRNORM | + POLLWRBAND); + } if (msg->msg_name) unix_copy_addr(msg, skb->sk); @@ -2432,11 +2459,19 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table return mask; } +static bool unix_dgram_writable(struct sock *sk, struct sock *other) +{ + if (other && unix_peer(other) != sk && unix_recvq_full(other)) + return false; + + return unix_writable(sk); +} + static unsigned int unix_dgram_poll(struct file *file, struct socket *sock, poll_table *wait) {
Re: [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()
On 10/09/2015 10:38 AM, Hannes Frederic Sowa wrote: > Hi, > > Jason Baron writes: > >> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait >> queue associated with the socket s that we are poll'ing against, but also >> calls >> sock_poll_wait() for a remote peer socket p, if it is connected. Thus, >> if we call poll()/select()/epoll() for the socket s, there are then >> a couple of code paths in which the remote peer socket p and its associated >> peer_wait queue can be freed before poll()/select()/epoll() have a chance >> to remove themselves from the remote peer socket. >> >> The way that remote peer socket can be freed are: >> >> 1. If s calls connect() to a connect to a new socket other than p, it will >> drop its reference on p, and thus a close() on p will free it. >> >> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop >> the final reference to p, allowing it to be freed. >> >> Address this issue, by reverting unix_dgram_poll() to only register with >> the wait queue associated with s and register a callback with the remote peer >> socket on connect() that will wake up the wait queue associated with s. If >> scenarios 1 or 2 occur above we then simply remove the callback from the >> remote peer. This then presents the expected semantics to poll()/select()/ >> epoll(). >> >> I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET >> but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll(). >> >> Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected >> DGRAM sockets"). >> >> Tested-by: Mathias Krause >> Signed-off-by: Jason Baron > > While I think this approach works, I haven't seen where the current code > leaks a reference. Assignment to unix_peer(sk) in general take spin_lock > and increment refcount. Are there bugs at the two places you referred > to? > > Is an easier fix just to use atomic_inc_not_zero(&sk->sk_refcnt) in > unix_peer_get() which could also help other places? > Hi, So we could potentially inc the refcnt on the remote peer such that the remote peer does not free before the socket that has connected to it. However, then the socket that has taken the reference against the peer socket has to potentially record a number of remote sockets (all the ones that it has connected to over its lifetime), and then drop all of their refcnt's when it finally closes. The reason for this is that with the current code when we do poll()/select()/epoll() on a socket with a peer socket, those calls take reference on the peer socket. Specifically, they record the remote peer whead, such that they can remove their callbacks when they return. So its not safe to just drop a reference on the remote peer when it closes because their might be outstanding poll()/select()/epoll() references pending. Normally, poll()/select()/epoll() are waiting on a whead associated directly with the fd/file that they are waiting for. The other point here is that the way this patch structures things is that when the socket connects to a new remote and hence disconnects from an existing remote, POLLOUT events will continue to be correctly delivered. That was not possible with the current structure of things b/c there was no way to inform poll to re-register with the remote peer whead. So, that means that the first test case here now works: https://lkml.org/lkml/2015/10/4/154 Whereas with the old code test case would just hang for ever. So yes there is a bit of code churn here, but I think it moves the code-base in a direction that not only solves this issue, but corrects additional poll() behaviors as well. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 0/3] net: unix: fix use-after-free
On 10/11/2015 07:55 AM, David Miller wrote: > From: Jason Baron > Date: Fri, 9 Oct 2015 00:15:59 -0400 > >> These patches are against mainline, I can re-base to net-next, please >> let me know. >> >> They have been tested against: https://lkml.org/lkml/2015/9/13/195, >> which causes the use-after-free quite quickly and here: >> https://lkml.org/lkml/2015/10/2/693. > Hi, > I'd like to understand how patches that don't even compile can be > "tested"? > > net/unix/af_unix.c: In function ‘unix_dgram_writable’: > net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in this > function) > net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only > once for each function it appears in > > Could you explain how that works, I'm having a hard time understanding > this? > Traveling this week, so responses a bit delayed. Yes, I screwed up the posting. I had some outstanding code in my local tree to make it compile, but I failed to refresh my patch series with this outstanding code before mailing it out. So what I tested/built was not quite what I mailed out. As soon as I noticed this issue in patch 3/3 I re-posted it here: http://marc.info/?l=linux-netdev&m=10355808472&w=2 in an attempt to avoid this confusion. I'm happy to re-post the series or whatever makes things easiest for you. > Also please address Hannes's feedback, thanks. > I've replied directly to Hannes. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On 10/12/2015 04:41 PM, Rainer Weikusat wrote: > Jason Baron writes: >> On 10/05/2015 12:31 PM, Rainer Weikusat wrote: > > [...] > >>> Here's a more simple idea which _might_ work. The underlying problem >>> seems to be that the second sock_poll_wait introduces a covert reference >>> to the peer socket which isn't accounted for. The basic idea behind this >>> is to execute an additional sock_hold for the peer whenever the >>> sock_poll_wait is called for it and store it (the struct sock *) in a >>> new struct unix_sock member. > > [...] > >> Interesting - will this work for the test case you supplied where we are in >> epoll_wait() and another thread does a connect() to a new target? In that >> case, even if we issue a wakeup to the epoll thread, its not going to have >> a non-NULL poll_table, so it wouldn't be added to the new target. IE >> the first test case here: >> >> https://lkml.org/lkml/2015/10/4/154 > > "Indeed it would not." I've also meanwhile found the time to check what > is and isn't locked here and found that Eric's "this looks racy" was > also justified. In theory, a clean solution could be based on modifying > the various polling implementations to keep a piece of data for a polled > something and provided that again on each subsequent poll call. This > could then be used to keep the peer socket alive for as long as > necessary were it possible to change the set of wait queues with every > poll call. Since this also isn't the case, the idea to increment the > reference count of the peer socket won't fly. > > OTOH, something I seriously dislike about your relaying implementation > is the unconditional add_wait_queue upon connect as this builds up a > possibly large wait queue of entities entirely uninterested in the > event which will need to be traversed even if peer_wait wakeups will > only happen if at least someone actually wants to be notified. This > could be changed such that the struct unix_sock member is only put onto > the peer's wait queue in poll and only if it hasn't already been put > onto it. The connection could then be severed if some kind of > 'disconnect' occurs. > > The code below (again based on 3.2.54) is what I'm currently running and > it has survived testing during the day (without trying the exercise in > hexadecimal as that doesn't cause failure for me, anyway). The wakeup > relaying function checks that a socket wait queue actually still exists > because I used to get null pointers oopses without every now and then > (I've also tested this with an additional printk printing 'no q' in case > the pointer was actually null to verify that this really occurs here). > Hi, What about the following race? 1) thread A does poll() on f, finds the wakeup condition low, and adds itself to the remote peer_wait queue. 2) thread B sets the wake up condition in dgram_recvmsg(), but does not execute the wakeup of threads yet. 3) thread C also does a poll() on f, finds now that the wakeup condition is set, and hence removes f from the remote peer_wait queue. Then, thread A misses the POLLOUT, and hangs. I understand your concern about POLLIN only waiters-I think we could add the 'relay callback' in dgram_poll() only for those who are looking for POLLOUT, and simply avoid the de-registration, as in practice I think its unlikely we are going to have a socket switching from POLLOUT to *only* POLLIN. I suspect that will cover most of the cases that concern you? Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] locking/static_keys: fix a silly typo
On 09/07/2015 03:18 PM, Jonathan Corbet wrote: > 412758cb2670 (jump label, locking/static_keys: Update docs) introduced a > typo that might as well get fixed. > > Signed-off-by: Jonathan Corbet > --- > Documentation/static-keys.txt | 2 +- > include/linux/jump_label.h| 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/Documentation/static-keys.txt b/Documentation/static-keys.txt > index f4cb0b2..ec91158 100644 > --- a/Documentation/static-keys.txt > +++ b/Documentation/static-keys.txt > @@ -16,7 +16,7 @@ The updated API replacements are: > DEFINE_STATIC_KEY_TRUE(key); > DEFINE_STATIC_KEY_FALSE(key); > static_key_likely() > -statick_key_unlikely() > +static_key_unlikely() > > 0) Abstract > > diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h > index 7f653e8..0684bd3 100644 > --- a/include/linux/jump_label.h > +++ b/include/linux/jump_label.h > @@ -22,7 +22,7 @@ > * DEFINE_STATIC_KEY_TRUE(key); > * DEFINE_STATIC_KEY_FALSE(key); > * static_key_likely() > - * statick_key_unlikely() > + * static_key_unlikely() > * > * Jump labels provide an interface to generate dynamic branches using > * self-modifying code. Assuming toolchain and architecture support, if we > Thanks. I actually messed this up further. That's supposed to be, 'static_branch_likely()', and 'static_branch_unlikely()'. So: s/static_key_likely()/static_branch_likely() and s/static_key_unlikely()/static_branch_unlikely() The rest of the doc appears to have it correctly. There are a few more typos in there as well: 1) s/addtion/addition 2) " The inc()/dec() interface is meant to be used exclusively from the inc()/dec() for a given key. " Was supposed to read: " The inc()/dec() interface is meant to be used exclusively from the enable()/disable() interface for a given key. " However, I think we should just delete this sentence. As the API inherently doesn't prevent this. The user just may need to be aware to properly serialize operations. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
> > X-Signed-Off-By: Rainer Weikusat > Hi, So the patches I've posted and yours both use the idea of a relaying the remote peer wakeup via callbacks that are internal to the net/unix, such that we avoid exposing the remote peer wakeup to the external poll()/select()/epoll(). They differ in when and how those callbacks are registered/unregistered. So I think your approach here will generally keep the peer wait wakeup queue to its absolute minimum, by removing from that queue when we set POLLOUT, however it requires taking the peer waitqueue lock on every poll() call. So I think there are tradeoffs here vs. what I've posted. So for example, if there are a lot of writers against one 'server' socket, there is going to be a lot of lock contention with your approach here. So I think the performance is going to depend on the workload that is tested. I don't have a specific workload that I am trying to solve here, but since you introduced the waiting on the remote peer queue, perhaps you can post numbers comparing the patches that have been posted for the workload that this was developed for? I will send you the latest version of what I have privately - so as not to overly spam the list. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()
On 10/18/2015 04:58 PM, Rainer Weikusat wrote: [...] > > The idea behind 'the wait queue' (insofar I'm aware of it) is that it > will be used as list of threads who need to be notified when the > associated event occurs. Since you seem to argue that the run-of-the-mill > algorithm is too slow for this particular case, is there anything to > back this up? > Generally the poll() routines only add to a wait queue once at the beginning, and all subsequent calls to poll() simply check the wakeup conditions. So here you are proposing to add/remove to the wait queue on subsequent invocations of poll(). So the initial patch I did, continued in the usual pattern and only added once on registration or connect(). However, I do think that this is a special case since the registration is on a shared wait queue, and thus having a long list of registered waiters is going to affect all waiters. So I am fine with doing the add/removes in the poll() routine and I agree that the patch below is more compact that what I initially posted. A couple of notes on your patch: 1) In unix_peer_wake_relay() function, 'sk_wq' is an __rcu pointer and thus it requires proper dereferencing. Something like: struct unix_sock *u; struct socket_wq *wq; u = container_of(wait, struct unix_sock, wait); rcu_read_lock(); wq = rcu_dereference(u->sk.sk_wq); if (wq_has_sleeper(wq)) wake_up_interruptible_sync_poll(&wq->wait, key); rcu_read_unlock(); 2) For the case of epoll() in edge triggered mode we need to ensure that when we return -EAGAIN from unix_dgram_sendmsg() when unix_recvq_full() is true, we need to add a unix_peer_wake_connect() call to guarantee a wakeup. Otherwise, we are going to potentially hang there. With these changes (or tell me why they are not needed), I'm happy to ack this patch. Thanks, -Jason -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/