Re: [PATCH v11 7/9] coresight: add support for CPU debug module
Hi Leo, [auto build test WARNING on next-20170522] [cannot apply to linus/master linux/master robh/for-next v4.9-rc8 v4.9-rc7 v4.9-rc6 v4.12-rc2] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Leo-Yan/coresight-enable-debug-module/20170524-075217 config: arm-allmodconfig (attached as .config) compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=arm All warnings (new ones prefixed by >>): drivers/hwtracing/coresight/coresight-cpu-debug.c: In function 'debug_adjust_pc': >> drivers/hwtracing/coresight/coresight-cpu-debug.c:263:44: warning: left >> shift count >= width of type [-Wshift-count-overflow] return (unsigned long)drvdata->edpcsr_hi << 32 | ^~ vim +263 drivers/hwtracing/coresight/coresight-cpu-debug.c 247 if (drvdata->edvidsr_present) 248 drvdata->edvidsr = readl_relaxed(drvdata->base + EDVIDSR); 249 250 out: 251 /* Restore EDPRCR register */ 252 writel_relaxed(save_edprcr, drvdata->base + EDPRCR); 253 254 CS_LOCK(drvdata->base); 255 } 256 257 static unsigned long debug_adjust_pc(struct debug_drvdata *drvdata) 258 { 259 unsigned long arm_inst_offset = 0, thumb_inst_offset = 0; 260 unsigned long pc; 261 262 if (IS_ENABLED(CONFIG_64BIT)) > 263 return (unsigned long)drvdata->edpcsr_hi << 32 | 264 (unsigned long)drvdata->edpcsr; 265 266 pc = (unsigned long)drvdata->edpcsr; 267 268 if (drvdata->pc_has_offset) { 269 arm_inst_offset = 8; 270 thumb_inst_offset = 4; 271 } --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [PATCH v4 next 1/3] modules:capabilities: allow __request_module() to take a capability argument
On Tue, May 23, 2017 at 9:19 PM, Kees Cook wrote: > On Tue, May 23, 2017 at 3:29 AM, Djalal Harouni wrote: [...] >> I think if there is an interface request_module_capable() , then code >> will use it. The DCCP code path did not check capabilities at all and >> called request_module(), other code does the same. >> >> A new interface can be abused, the result of this: we may break >> "modules_autoload_mode" in mode 0 and 1. In the long term code will >> want to change may_autoload_module() to also allow mode 1 to load a >> module with CAP_NET_ADMIN or other caps in its own userns, resulting >> in "modules_autoload_mode == 0 == 1". Without userns in the game we >> may just see request_module_capable(CAP_SYS_ADMIN, ...) . There is >> already some code maybe phonet sockets ? that require CAP_SYS_ADMIN to >> get the appropriate protocol and no one will be able to review all >> this code or track new patches with request_module_capable() callers. > > I'm having some trouble following what you're saying here, but if I > understand, you're worried about getting the kernel into a state where > autoload state 0 == 1. Autoload 0 is "business as usual", and autoload > 1 is "CAP_SYS_MODULE required to be able to trigger a module auto-load > operation, or CAP_NET_ADMIN for modules with a 'netdev-%s' alias." Indeed. > In the v4 patch, under autoload==1, CAP_NET_ADMIN is needed to load > netdev- modules: > > if (no_module && capable(CAP_NET_ADMIN)) >no_module = __request_module(true, CAP_NET_ADMIN, > "netdev-%s", name); > > and in the LSM hook, CAP_NET_ADMIN is passed as an allowable "alias" > for the CAP_SYS_MODULE requirement: > >else if (modules_autoload_mode == MODULES_AUTOLOAD_PRIVILEGED) { >/* Check CAP_SYS_MODULE then allow_cap if valid */ >if (capable(CAP_SYS_MODULE) || >(allow_cap > 0 && capable(allow_cap))) > return 0; >} > > What I see is some needless double-checking. Since you're making > changes to the request_module() API, it would be possible to have That check is *not* a double check and it is *really* needed in v4 since how may_autoload_module() was implemented. It first checks if 'autoload' == 0 == ALLOWED, if so then it allows the operation regardless of the capability. That's why I didn't want to touch current network logic and assumed that net code knows what it should do. > request_module_cap(), which could be checked instead of open-coding > it: > > if (no_module) > no_module = request_module_cap(CAP_NET_ADMIN, "netdev-%s", name); > > If I'm understanding your objection correctly, it's that you want to > ONLY ever provide this one-time alias for CAP_SYS_MODULE with the > netdev-%s things, and you don't want to risk having other module > loading start using request_module_cap() which would lead to > CAP_SYS_MODULE aliases in other places? Yes. We can't really track capabilities usage or new code. > If the goal is to make sure that only privileged processes are > autoloading, I don't think adding a well defined interface for > cap-checks (request_module_cap()) would lead to a slippery slope. The > worst case scenario (which would never happen) would be all > request_module() users would convert to request_module_cap(). This I am also concerned a bit with new code. In the documentation we explicitly say CAP_SYS_MODULE, and new code should not break that assumption. > would mean that all module loading would require specific privileges. > That seems in line with autoload==1. They would not be tied to > CAP_SYS_MODULE, though, which is, I suspect, what you're concerned > about. Indeed, it is just easy to say hey it needs CAP_SYS_MODULE. The capability usage in the module subsystem more precisely with explicit loading is clean. CAP_SYS_MODULE is not overloaded, it has clear focus. As you say it, we should be concerned if we blindly trust callers and end up *aliasing* CAP_SYS_MODULE with some other cap... > Even in the existing code, there is a sense about CAP_NET_ADMIN and > CAP_SYS_MODULE having different privilege levels, in that > CAP_NET_ADMIN can only load netdev-%s modules, but CAP_SYS_MODULE can > load any module. What about refining request_module_cap() to _require_ > an explicit string prefix instead of an arbitrary format string? e.g. > request_module_cap(CAP_NET_ADMIN, "netdev", "%s", name) which would > make requests for ("netdev-%s", name) > > I see a few options: > > 1) keep what you have for v4, and hope other places don't use > __request_module. (I'm not a fan of this.) Yes even if it is documented I wouldn't bet on it, though. :-) > 2) switch the logic on autoload==1 from OR to AND: both the specified > caps _and_ CAP_SYS_MODULE are required. (This seems like it might make > autoload==1 less useful.) That will restrict some userspace that works only with CAP_NET_ADMIN. > 3) use the request_m
Re: [PATCH v5 0/3] watchdog: allow setting deadline for opening /dev/watchdogN
On 22 May 2017 at 16:06, Rasmus Villemoes wrote: > > Rasmus Villemoes (3): > watchdog: introduce watchdog_worker_should_ping helper > watchdog: introduce watchdog.open_timeout commandline parameter > watchdog: introduce CONFIG_WATCHDOG_OPEN_TIMEOUT > > Documentation/watchdog/watchdog-parameters.txt | 10 +++ > drivers/watchdog/Kconfig | 9 +++ > drivers/watchdog/watchdog_dev.c| 37 > +++--- > 3 files changed, 52 insertions(+), 4 deletions(-) Thanks. This is definitely useful for embedded applications. If this get merged, it should be able to replace several out-of-tree hacks to accomplish the same thing. For all 3 patches: Reviewed-by: Esben Haabendal /Esben -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mm: per-cgroup memory reclaim stats
On Thu, May 11, 2017 at 08:16:23PM +0100, Roman Gushchin wrote: > Track the following reclaim counters for every memory cgroup: > PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and > PGLAZYFREED. > > These values are exposed using the memory.stats interface of cgroup v2. > > The meaning of each value is the same as for global counters, > available using /proc/vmstat. > > Also, for consistency, rename mem_cgroup_count_vm_event() to > count_memcg_event_mm(). > > Signed-off-by: Roman Gushchin > Suggested-by: Johannes Weiner > Cc: Johannes Weiner > Cc: Tejun Heo > Cc: Li Zefan > Cc: Michal Hocko > Cc: Vladimir Davydov > Cc: cgro...@vger.kernel.org > Cc: linux-doc@vger.kernel.org > Cc: linux-ker...@vger.kernel.org > Cc: linux...@kvack.org Acked-by: Johannes Weiner Andrew, if there aren't any other objections, could you pick this one up please? -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
Hello, On Mon, May 22, 2017 at 12:56:08PM -0400, Waiman Long wrote: > All controllers can use the special sub-directory if userland chooses to > do so. The problem that I am trying to address in this patch is to allow > more natural hierarchy that reflect a certain purpose, like the task > classification done by systemd. Restricting tasks only to leaf nodes > makes the hierarchy unnatural and probably difficult to manage. I see but how is this different from userland just creating the leaf cgroup? I'm not sure what this actually enables in terms of what can be achieved with cgroup. I suppose we can argue that this is more convenient but I'd like to keep the interface orthogonal as much as reasonably possible. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
Hello, Mike. On Sat, May 20, 2017 at 04:10:07AM +0200, Mike Galbraith wrote: > On Fri, 2017-05-19 at 16:38 -0400, Tejun Heo wrote: > > Hello, Waiman. > > > > On Mon, May 15, 2017 at 09:34:11AM -0400, Waiman Long wrote: > > > The rationale behind the cgroup v2 no internal process constraint is > > > to avoid resouorce competition between internal processes and child > > > cgroups. However, not all controllers have problem with internal > > > process competiton. Enforcing this rule may lead to unnatural process > > > hierarchy and unneeded levels for those controllers. > > > > This isn't necessarily something we can determine by looking at the > > current state of controllers. It's true that some controllers - pid > > and perf - inherently only care about membership of each task but at > > the same time neither really suffers from the constraint either. CPU > > which is the problematic one here... > > (+ cpuacct + cpuset) Yeah, cpuacct and cpuset are in the same boat as perf. cpuset is completely so and we can move the tree walk to the reader side or aggregate propagation for cpuacct as necessary. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
Hello, Waiman. On Fri, May 19, 2017 at 05:20:01PM -0400, Waiman Long wrote: > > This breaks the invariant that in a cgroup its resource control knobs > > control distribution of resources from its parent. IOW, the resource > > control knobs of a cgroup always belong to the parent. This is also > > reflected in how delegation is done. The delegatee assumes ownership > > of the cgroup itself and the ability to manage sub-cgroups but doesn't > > get the ownership of the resource control knobs as otherwise the > > parent would lose control over how it distributes its resources. > > One twist that I am thinking is to have a controller enabled by the > parent in subtree_control, but then allow the child to either disable it > or set it in pass-through mode by writing to controllers file. IOW, a > child cannot enable a controller without parent's permission. Once a > child has permission, it can do whatever it wants. A parent cannot force > a child to have a controller enabled. Heh, I think I need more details to follow your proposal. Anyways, what we need to guarantee is that a descendant is never allowed to pull in more resources than its ancestors want it to. > > Another aspect is that most controllers aren't that sensitive to > > nesting several levels. Expensive operations can be and already are > > aggregated and the performance overhead of several levels of nesting > > barely shows up. Skipping levels can be an interesting optimization > > approach and we can definitely support from the core side; however, > > it'd be a lot nicer if we could do that optimization transparently > > (e.g. CPU can skip multi level queueing if there usually is only one > > item at some levels). > > The trend that I am seeing is that the total number of controllers is > going to grow over time. New controllers may be sensitive to the level > of nesting like the cpu controller. I am also thinking about how systemd > is using the cgroup filesystem for task classification purpose without > any controller attached to it. With this scheme, we can accommodate all > the different needs without using different cgroup filesystems. I'm not sure about that. It's true that cgroup hierarchy is being used more but there are only so many hard / complex resources that we deal with - cpu, memory and io. Beyond those, other uses are usually about identifying membership (perf, net) or propagating and restricting attributes (cpuset). pids can be considered an exception but we have it only because pids can globally run out a lot sooner than can be controlled through memory. Even then, it's trivial. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 23/31] gcc-plugins.txt: standardize document format
On Thu, May 18, 2017 at 6:22 PM, Mauro Carvalho Chehab wrote: > Each text file under Documentation follows a different > format. Some doesn't even have titles! > > Change its representation to follow the adopted standard, > using ReST markups for it to be parseable by Sphinx: > > - promote main title; > - use the right markup for footnotes; > - use bold markup for files name; > - identify literal blocks; > - add blank lines to avoid Sphinx to complain; > - remove numeration from titles. > > Signed-off-by: Mauro Carvalho Chehab Acked-by: Kees Cook This should probably get moved under "Kernel API documentation" but may need a new sub-category, maybe "instrumentation"? Things like KASan could be put under that too. -Kees -- Kees Cook Pixel Security -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
On 05/24/2017 01:31 PM, Tejun Heo wrote: > Hello, Waiman. > > On Fri, May 19, 2017 at 05:20:01PM -0400, Waiman Long wrote: >>> This breaks the invariant that in a cgroup its resource control knobs >>> control distribution of resources from its parent. IOW, the resource >>> control knobs of a cgroup always belong to the parent. This is also >>> reflected in how delegation is done. The delegatee assumes ownership >>> of the cgroup itself and the ability to manage sub-cgroups but doesn't >>> get the ownership of the resource control knobs as otherwise the >>> parent would lose control over how it distributes its resources. >> One twist that I am thinking is to have a controller enabled by the >> parent in subtree_control, but then allow the child to either disable it >> or set it in pass-through mode by writing to controllers file. IOW, a >> child cannot enable a controller without parent's permission. Once a >> child has permission, it can do whatever it wants. A parent cannot force >> a child to have a controller enabled. > Heh, I think I need more details to follow your proposal. Anyways, > what we need to guarantee is that a descendant is never allowed to > pull in more resources than its ancestors want it to. What I am saying is as follows: / A P - B \ C # echo +memory > P/cgroups.subtree_control # echo -memory > P/A/cgroup.controllers # echo "#memory" > P/B/cgroup.controllers The parent grants the memory controller to its children - A, B and C. Child A has the memory controller explicitly disabled. Child B has the memory controller in pass-through mode, while child C has the memory controller enabled by default. "echo +memory > cgroup.controllers" is not allowed. There are 2 possible choices with regard to the '-' or '#' prefixes. We can allow them before the grant from the parent or only after that. In the former case, the state remains dormant until after the grant from the parent. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
Hello, On Wed, May 24, 2017 at 01:49:46PM -0400, Waiman Long wrote: > What I am saying is as follows: > / A > P - B >\ C > > # echo +memory > P/cgroups.subtree_control > # echo -memory > P/A/cgroup.controllers > # echo "#memory" > P/B/cgroup.controllers > > The parent grants the memory controller to its children - A, B and C. > Child A has the memory controller explicitly disabled. Child B has the > memory controller in pass-through mode, while child C has the memory > controller enabled by default. "echo +memory > cgroup.controllers" is > not allowed. There are 2 possible choices with regard to the '-' or '#' > prefixes. We can allow them before the grant from the parent or only > after that. In the former case, the state remains dormant until after > the grant from the parent. Ah, I see, you want cgroup.controllers to be able to mask available controllers by the parent. Can you expand your example with further nesting and how #memory on cgroup.controllers would affect the nested descendant? Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kernel-hardening] [PATCH v4 next 0/3] modules: automatic module loading restrictions
On Tue, May 23, 2017 at 9:48 AM, Solar Designer wrote: >> >>> On Mon, May 22, 2017 at 2:08 PM, Solar Designer >> >>> wrote: >> >>> > On Mon, May 22, 2017 at 01:57:03PM +0200, Djalal Harouni wrote: >> >>> >> *) When modules_autoload_mode is set to (2), automatic module loading >> >>> >> is >> >>> >> disabled for all. Once set, this value can not be changed. >> >>> > >> >>> > What purpose does this securelevel-like property ("Once set, this value >> >>> > can not be changed.") serve here? I think this mode 2 is needed, but >> >>> > without this extra property, which is bypassable by e.g. explicitly >> >>> > loaded kernel modules anyway (and that's OK). > > On Mon, May 22, 2017 at 04:07:56PM -0700, Kees Cook wrote: >> I'm on the fence. For modules_disabled and Yama, it was tied to >> CAP_SYS_ADMIN, basically designed to be a at-boot setting that could >> not later be undone by an attacker gaining that privilege, keeping >> them out of either kernel memory or existing user process memory. >> Here, it's CAP_SYS_MODULE... it's hard to imagine the situation where >> a CAP_SYS_MODULE-capable process could write to this sysctl but NOT >> issue direct modprobe requests, but it's _possible_ via crazy symlink >> games to trick capable processes into writing to sysctls. We've seen >> this multiple times before, and it's a way for attackers to turn a >> single privileged write into a privileged exec. > > OK, tricking a process via crazy symlink games is finally a potentially > valid reason. The question then becomes: are there perhaps so many > other important sysctl's, disk files, etc. (which the vulnerable capable > process could similarly be tricked into writing) so that specifically > resetting modules_autoload_mode isn't particularly lucrative? I think > that the answer to that is usually yes. Another related question: do we > really want to inconsistently single out a handful of sysctl's for this > kind of extra protection? I think not. > > I agree there are some other settings where being unable to reset them > makes sense, but I think this isn't one of those. > Alright, I already replied to Andy, since it was requested I will drop it. I definitely prefer that we have something merged and usable ;-) >> I might turn the question around, though: why would we want to have it >> changeable at this setting? > > Convenience for the sysadmin - being able to correct one's error (e.g., > wrong order of shell commands), respond to new findings (thought module > autoloading was unneeded after some point, then found out some software > relies on it), change one's mind, reuse a system differently than > originally intended without a forced reboot. > >> I'm fine leaving that piece off, either way. > > I'm also fine with either decision. I just thought I'd point out what > looked weird to me. > > I think this is an important patch that should get in, but primarily > for modules_autoload_mode=1, which many distros could make the default > (and maybe the kernel eventually should?) I do not think that desktop, or interactive systems will set it. Ubuntu has snaps for their app store, and they can use the per-task flag and other vendors too Flatpak, etc. > For modules_autoload_mode=2, we already seem to have the equivalent of > modprobe=/bin/true (or does it differ subtly, maybe in return values?), > which I already use at startup on a GPU box like this (preloading > modules so that the OpenCL backends wouldn't need the autoloading): > > nvidia-smi > nvidia-modprobe -u -c=0 > #modprobe nvidia_uvm > #modprobe fglrx > > sysctl -w kernel.modprobe=/bin/true > sysctl -w kernel.hotplug=/bin/true > > but it's good to also have this supported more explicitly and more > consistently through modules_autoload_mode=2 while we're at it. So I > support having this mode as well. I just question the need to have it > non-resettable. > Ok, yes with mode=2 it is clear interface, it logs what was denied, it is consistent with the per-task flag that we want for sandboxes and desktop, it will work with CONFIG_STATIC_USERMODEHELPER, more importantly it will avoid doing the usermod upcall, even if one day the kernel support upcall into namespaces, I'm pretty sure that no one will like the idea of namespaced modprobe paths, modules are global anyway, this allows to say we are safe by default from any future change that may make an upcall into the wrong context, we just avoid that. -- tixxdz -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
make pdfdoc errors with 4.12-rc2
make pdfdocs, on my Fedora 25 desktop with the latest patches, fails with Running Sphinx v1.5.2 [...lots-o-stuff...] make PDFLATEX=xelatex LATEXOPTS="-interaction=batchmode" -C Documentation/output/./latex || exit; xelatex -interaction=batchmode 'linux-input.tex' This is XeTeX, Version 3.14159265-2.6-0.6 (TeX Live 2016) (preloaded format=xelatex) restricted \write18 enabled. entering extended mode Makefile:67: recipe for target 'linux-input.pdf' failed make[2]: *** [linux-input.pdf] Error 1 Documentation/Makefile.sphinx:87: recipe for target 'pdfdocs' failed make[1]: *** [pdfdocs] Error 2 Makefile:1472: recipe for target 'pdfdocs' failed make: *** [pdfdocs] Error 2 -- Jim -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
On 05/24/2017 01:56 PM, Tejun Heo wrote: > Hello, > > On Wed, May 24, 2017 at 01:49:46PM -0400, Waiman Long wrote: >> What I am saying is as follows: >> / A >> P - B >>\ C >> >> # echo +memory > P/cgroups.subtree_control >> # echo -memory > P/A/cgroup.controllers >> # echo "#memory" > P/B/cgroup.controllers >> >> The parent grants the memory controller to its children - A, B and C. >> Child A has the memory controller explicitly disabled. Child B has the >> memory controller in pass-through mode, while child C has the memory >> controller enabled by default. "echo +memory > cgroup.controllers" is >> not allowed. There are 2 possible choices with regard to the '-' or '#' >> prefixes. We can allow them before the grant from the parent or only >> after that. In the former case, the state remains dormant until after >> the grant from the parent. > Ah, I see, you want cgroup.controllers to be able to mask available > controllers by the parent. Can you expand your example with further > nesting and how #memory on cgroup.controllers would affect the nested > descendant? > > Thanks. > I would allow enabling the controller in subtree_control if granted from the parent and not explicitly disabled. IOW, both B and C can "echo +memory" to their subtree_control to grant memory controller to their children, but not A. A has to re-enable memory controller or set it to pass-through mode before it can enable it in subtree_control. I need to clarify that "echo +memory > cgroup.controllers" is allowed to re-enable it, but not without the granting from its parent. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
On 05/24/2017 01:05 PM, Tejun Heo wrote: > Hello, > > On Mon, May 22, 2017 at 12:56:08PM -0400, Waiman Long wrote: >> All controllers can use the special sub-directory if userland chooses to >> do so. The problem that I am trying to address in this patch is to allow >> more natural hierarchy that reflect a certain purpose, like the task >> classification done by systemd. Restricting tasks only to leaf nodes >> makes the hierarchy unnatural and probably difficult to manage. > I see but how is this different from userland just creating the leaf > cgroup? I'm not sure what this actually enables in terms of what can > be achieved with cgroup. I suppose we can argue that this is more > convenient but I'd like to keep the interface orthogonal as much as > reasonably possible. > > Thanks. > I am just thinking that it is a bit more natural with the concept of the special resource domain sub-directory. You are right that the same effect can be achieved by proper placement of tasks and enabling of controllers. A (cpu,memory) [T1] - B(cpu,memory) [T2] \ cgroups.resource_domain (memory) A (cpu,memory) - B(cpu,memory) [T2] \ C (memory) [T1] With respect to the tasks T1 and T2, the above 2 configurations are the same. I am OK to drop this patch. However, I still think the current no-internal process constraint is too restricting. I will suggest either 1. Allow internal processes and document the way to avoid internal process competition as shown above from the userland, or 2. Mark only certain controllers as not allowing internal processes when they are enabled. What do you think about this? Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 23/31] gcc-plugins.txt: standardize document format
Em Wed, 24 May 2017 10:35:42 -0700 Kees Cook escreveu: > On Thu, May 18, 2017 at 6:22 PM, Mauro Carvalho Chehab > wrote: > > Each text file under Documentation follows a different > > format. Some doesn't even have titles! > > > > Change its representation to follow the adopted standard, > > using ReST markups for it to be parseable by Sphinx: > > > > - promote main title; > > - use the right markup for footnotes; > > - use bold markup for files name; > > - identify literal blocks; > > - add blank lines to avoid Sphinx to complain; > > - remove numeration from titles. > > > > Signed-off-by: Mauro Carvalho Chehab > > Acked-by: Kees Cook > > This should probably get moved under "Kernel API documentation" but > may need a new sub-category, maybe "instrumentation"? Things like > KASan could be put under that too. Yeah, I guess that most documents under Documentation/ will need to be renamed and placed into an existing or new book. Kasan documentation is currently under dev-tools, with is, currently, an unsorted book with: coccinelle sparse kcov gcov kasan ubsan kmemleak kmemcheck gdb-kernel-debugging kgdb I agree with you: it probably makes sense to split external development tools, like coccinelle/sparse from Kernel instrumentation, like kgdb, kasan, gcc-plugins, etc. So, perhaps we can change the content of Documentation/dev-tools/index.rst to something like. Development tools for the kernel This document describe tools and instrumentation features of the Linux Kernel used by developers to do quality assurance (QA). This section describes Kernel internal features designed to provide mechanisms for developers to test their code. .. toctree:: :maxdepth: 2 (add here books like kasan, printk related docs, gcc-plugins, etc) This section describes external tools used to ensure Kernel quality assurance (QA). .. toctree:: :maxdepth: 2 (add here books related external tools and robots, like coccinelle, sparse, kernel build robot, ktest, Coverity, etc) .. only:: subproject and html Indices === * :ref:`genindex` Cheers, Mauro -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
Hello, Waiman. On Mon, May 22, 2017 at 01:13:16PM -0400, Waiman Long wrote: > > Maybe I'm misunderstanding the design, but this seems to push the > > processes which belong to the threaded subtree to the parent which is > > part of the usual resource domain hierarchy thus breaking the no > > internal competition constraint. I'm not sure this is something we'd > > want. Given that the limitation of the original threaded mode was the > > required nesting below root and that we treat root special anyway > > (exactly in the way necessary), I wonder whether it'd be better to > > simply allow root to be both domain and thread root. > > Yes, root can be both domain and thread root. I haven't placed any > restriction on that. I've been playing with the proposed "make the parent resource domain". Unfortunately, the parent - child relationship becomes weird. The parent becomes the thread root, which means that its cgroup.threads file becomes writable and threads can be put in there. It's really weird to write to a child's interface and have the parent's behavior changed. This becomes weirder with delegation. If a cgroup is delegated, its cgroup.threads should be delegated too but if the child enables threaded mode, that makes the undelegated parent thread root, which means that either 1. the delegatee can't migrate threads to the thread root or 2. if the parent's cgroup.threads is writeable, the delegatee can mass with other descendants under it which shouldn't be allowed. I think the operation of making a cgroup a thread root should happen on the cgroup where that's requested; otherwise, nesting becomes too twisted. This should be solvable. Will think more about it. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
On 05/24/2017 04:36 PM, Tejun Heo wrote: > Hello, Waiman. > > On Mon, May 22, 2017 at 01:13:16PM -0400, Waiman Long wrote: >>> Maybe I'm misunderstanding the design, but this seems to push the >>> processes which belong to the threaded subtree to the parent which is >>> part of the usual resource domain hierarchy thus breaking the no >>> internal competition constraint. I'm not sure this is something we'd >>> want. Given that the limitation of the original threaded mode was the >>> required nesting below root and that we treat root special anyway >>> (exactly in the way necessary), I wonder whether it'd be better to >>> simply allow root to be both domain and thread root. >> Yes, root can be both domain and thread root. I haven't placed any >> restriction on that. > I've been playing with the proposed "make the parent resource domain". > Unfortunately, the parent - child relationship becomes weird. > > The parent becomes the thread root, which means that its > cgroup.threads file becomes writable and threads can be put in there. > It's really weird to write to a child's interface and have the > parent's behavior changed. This becomes weirder with delegation. If > a cgroup is delegated, its cgroup.threads should be delegated too but > if the child enables threaded mode, that makes the undelegated parent > thread root, which means that either 1. the delegatee can't migrate > threads to the thread root or 2. if the parent's cgroup.threads is > writeable, the delegatee can mass with other descendants under it > which shouldn't be allowed. > > I think the operation of making a cgroup a thread root should happen > on the cgroup where that's requested; otherwise, nesting becomes too > twisted. This should be solvable. Will think more about it. > > Thanks. > An alternative is to have separate enabling for thread root. For example, # echo root > cgroup.threads # echo enable > child/cgroup.threads The first statement make the current cgroup the thread root. However, setting it to a thread root doesn't make its child to be threaded. This have to be explicitly done on each of the children. Once a child cgroup is made to be threaded, all its descendants will be threaded. That will have the same effect as the current patch. With delegation, do you mean the relationship between a container and its host? Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
Hello, On Wed, May 24, 2017 at 05:17:13PM -0400, Waiman Long wrote: > An alternative is to have separate enabling for thread root. For example, > > # echo root > cgroup.threads > # echo enable > child/cgroup.threads > > The first statement make the current cgroup the thread root. However, > setting it to a thread root doesn't make its child to be threaded. This > have to be explicitly done on each of the children. Once a child cgroup > is made to be threaded, all its descendants will be threaded. That will > have the same effect as the current patch. Yeah, I'm toying with different ideas. I'll get back to you once things get more concrete. > With delegation, do you mean the relationship between a container and > its host? It can be but doesn't have to be. For example, it can be delegations to users without namespace / container being involved. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/6] kmod: preempt on kmod_umh_threads_get()
On Fri, May 19, 2017 at 03:27:12PM -0700, Dmitry Torokhov wrote: > On Thu, May 18, 2017 at 08:24:43PM -0700, Luis R. Rodriguez wrote: > > In theory it is possible multiple concurrent threads will try to > > kmod_umh_threads_get() and as such atomic_inc(&kmod_concurrent) at > > the same time, therefore enabling a small time during which we've > > bumped kmod_concurrent but have not really enabled work. By using > > preemption we mitigate this a bit. > > > > Preemption is not needed when we kmod_umh_threads_put(). > > > > Signed-off-by: Luis R. Rodriguez > > --- > > kernel/kmod.c | 24 ++-- > > 1 file changed, 22 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/kmod.c b/kernel/kmod.c > > index 563600fc9bb1..7ea11dbc7564 100644 > > --- a/kernel/kmod.c > > +++ b/kernel/kmod.c > > @@ -113,15 +113,35 @@ static int call_modprobe(char *module_name, int wait) > > > > static int kmod_umh_threads_get(void) > > { > > + int ret = 0; > > + > > + /* > > +* Disabling preemption makes sure that we are not rescheduled here > > +* > > +* Also preemption helps kmod_concurrent is not increased by mistake > > +* for too long given in theory two concurrent threads could race on > > +* atomic_inc() before we atomic_read() -- we know that's possible > > +* and but we don't care, this is not used for object accounting and > > +* is just a subjective threshold. The alternative is a lock. > > +*/ > > + preempt_disable(); > > atomic_inc(&kmod_concurrent); > > if (atomic_read(&kmod_concurrent) <= max_modprobes) > > That is very "fancy" way to basically say: > > if (atomic_inc_return(&kmod_concurrent) <= max_modprobes) Do you mean to combine the atomic_inc() and atomic_read() in one as you noted (as that is not a change in this patch), *or* that using a memory barrier here with atomic_inc_return() should suffice to address the same and avoid an explicit preemption enable / disable ? Luis -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/6] kmod: preempt on kmod_umh_threads_get()
On Thu, May 25, 2017 at 02:14:52AM +0200, Luis R. Rodriguez wrote: > On Fri, May 19, 2017 at 03:27:12PM -0700, Dmitry Torokhov wrote: > > On Thu, May 18, 2017 at 08:24:43PM -0700, Luis R. Rodriguez wrote: > > > In theory it is possible multiple concurrent threads will try to > > > kmod_umh_threads_get() and as such atomic_inc(&kmod_concurrent) at > > > the same time, therefore enabling a small time during which we've > > > bumped kmod_concurrent but have not really enabled work. By using > > > preemption we mitigate this a bit. > > > > > > Preemption is not needed when we kmod_umh_threads_put(). > > > > > > Signed-off-by: Luis R. Rodriguez > > > --- > > > kernel/kmod.c | 24 ++-- > > > 1 file changed, 22 insertions(+), 2 deletions(-) > > > > > > diff --git a/kernel/kmod.c b/kernel/kmod.c > > > index 563600fc9bb1..7ea11dbc7564 100644 > > > --- a/kernel/kmod.c > > > +++ b/kernel/kmod.c > > > @@ -113,15 +113,35 @@ static int call_modprobe(char *module_name, int > > > wait) > > > > > > static int kmod_umh_threads_get(void) > > > { > > > + int ret = 0; > > > + > > > + /* > > > + * Disabling preemption makes sure that we are not rescheduled here > > > + * > > > + * Also preemption helps kmod_concurrent is not increased by mistake > > > + * for too long given in theory two concurrent threads could race on > > > + * atomic_inc() before we atomic_read() -- we know that's possible > > > + * and but we don't care, this is not used for object accounting and > > > + * is just a subjective threshold. The alternative is a lock. > > > + */ > > > + preempt_disable(); > > > atomic_inc(&kmod_concurrent); > > > if (atomic_read(&kmod_concurrent) <= max_modprobes) > > > > That is very "fancy" way to basically say: > > > > if (atomic_inc_return(&kmod_concurrent) <= max_modprobes) > > Do you mean to combine the atomic_inc() and atomic_read() in one as you noted > (as that is not a change in this patch), *or* that using a memory barrier here > with atomic_inc_return() should suffice to address the same and avoid an > explicit preemption enable / disable ? I am saying that atomic_inc_return() will avoid situation where you have more than one threads incrementing the counter and believing that they are [not] allowed to start modprobe. I have no idea why you think preempt_disable() would help here. It only ensures that current thread will not be preempted between the point where you update the counter and where you check the result. It does not stop interrupts nor does it affect other threads that might be updating the same counter. Thanks. -- Dmitry -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 2/3] watchdog: introduce watchdog.open_timeout commandline parameter
On 05/22/2017 07:06 AM, Rasmus Villemoes wrote: The watchdog framework takes care of feeding a hardware watchdog until userspace opens /dev/watchdogN. If that never happens for some reason (buggy init script, corrupt root filesystem or whatnot) but the kernel itself is fine, the machine stays up indefinitely. This patch allows setting an upper limit for how long the kernel will take care of the watchdog, thus ensuring that the watchdog will eventually reset the machine. This is particularly useful for embedded devices where some fallback logic is implemented in the bootloader (e.g., use a different root partition, boot from network, ...). The open timeout is also used as a maximum time for an application to re-open /dev/watchdogN after closing it. A value of 0 (the default) means infinite timeout, preserving the current behaviour. Signed-off-by: Rasmus Villemoes --- Documentation/watchdog/watchdog-parameters.txt | 9 + drivers/watchdog/watchdog_dev.c| 26 +- 2 files changed, 34 insertions(+), 1 deletion(-) diff --git a/Documentation/watchdog/watchdog-parameters.txt b/Documentation/watchdog/watchdog-parameters.txt index 914518a..4801ec6 100644 --- a/Documentation/watchdog/watchdog-parameters.txt +++ b/Documentation/watchdog/watchdog-parameters.txt @@ -8,6 +8,15 @@ See Documentation/admin-guide/kernel-parameters.rst for information on providing kernel parameters for builtin drivers versus loadable modules. +The watchdog core currently understands one parameter, We have a second parameter queued, handle_boot_enabled. Please rephrase so we don't have to change the text with each new parameter. Thanks, Guenter +watchdog.open_timeout. This is the maximum time, in milliseconds, for +which the watchdog framework will take care of pinging a hardware +watchdog until userspace opens the corresponding /dev/watchdogN +device. A value of 0 (the default) means an infinite timeout. Setting +this to a non-zero value can be useful to ensure that either userspace +comes up properly, or the board gets reset and allows fallback logic +in the bootloader to try something else. + - acquirewdt: diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c index caa4b90..c807067 100644 --- a/drivers/watchdog/watchdog_dev.c +++ b/drivers/watchdog/watchdog_dev.c @@ -66,6 +66,7 @@ struct watchdog_core_data { struct mutex lock; unsigned long last_keepalive; unsigned long last_hw_keepalive; + unsigned long open_deadline; struct delayed_work work; unsigned long status; /* Internal status bits */ #define _WDOG_DEV_OPEN0 /* Opened ? */ @@ -80,6 +81,21 @@ static struct watchdog_core_data *old_wd_data; static struct workqueue_struct *watchdog_wq; +static unsigned open_timeout; +module_param(open_timeout, uint, 0644); + +static bool watchdog_past_open_deadline(struct watchdog_core_data *data) +{ + if (!open_timeout) + return false; + return time_is_before_jiffies(data->open_deadline); +} + +static void watchdog_set_open_deadline(struct watchdog_core_data *data) +{ + data->open_deadline = jiffies + msecs_to_jiffies(open_timeout); +} + static inline bool watchdog_need_worker(struct watchdog_device *wdd) { /* All variables in milli-seconds */ @@ -196,7 +212,13 @@ static bool watchdog_worker_should_ping(struct watchdog_core_data *wd_data) { struct watchdog_device *wdd = wd_data->wdd; - return wdd && (watchdog_active(wdd) || watchdog_hw_running(wdd)); + if (!wdd) + return false; + + if (watchdog_active(wdd)) + return true; + + return watchdog_hw_running(wdd) && !watchdog_past_open_deadline(wd_data); } static void watchdog_ping_work(struct work_struct *work) @@ -857,6 +879,7 @@ static int watchdog_release(struct inode *inode, struct file *file) watchdog_ping(wdd); } + watchdog_set_open_deadline(wd_data); watchdog_update_worker(wdd); /* make sure that /dev/watchdog can be re-opened */ @@ -955,6 +978,7 @@ static int watchdog_cdev_register(struct watchdog_device *wdd, dev_t devno) /* Record time of most recent heartbeat as 'just before now'. */ wd_data->last_hw_keepalive = jiffies - 1; + watchdog_set_open_deadline(wd_data); /* * If the watchdog is running, prevent its driver from being unloaded, -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/6] kmod: preempt on kmod_umh_threads_get()
On Wed, May 24, 2017 at 05:45:37PM -0700, Dmitry Torokhov wrote: > On Thu, May 25, 2017 at 02:14:52AM +0200, Luis R. Rodriguez wrote: > > On Fri, May 19, 2017 at 03:27:12PM -0700, Dmitry Torokhov wrote: > > > On Thu, May 18, 2017 at 08:24:43PM -0700, Luis R. Rodriguez wrote: > > > > In theory it is possible multiple concurrent threads will try to > > > > kmod_umh_threads_get() and as such atomic_inc(&kmod_concurrent) at > > > > the same time, therefore enabling a small time during which we've > > > > bumped kmod_concurrent but have not really enabled work. By using > > > > preemption we mitigate this a bit. > > > > > > > > Preemption is not needed when we kmod_umh_threads_put(). > > > > > > > > Signed-off-by: Luis R. Rodriguez > > > > --- > > > > kernel/kmod.c | 24 ++-- > > > > 1 file changed, 22 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/kernel/kmod.c b/kernel/kmod.c > > > > index 563600fc9bb1..7ea11dbc7564 100644 > > > > --- a/kernel/kmod.c > > > > +++ b/kernel/kmod.c > > > > @@ -113,15 +113,35 @@ static int call_modprobe(char *module_name, int > > > > wait) > > > > > > > > static int kmod_umh_threads_get(void) > > > > { > > > > + int ret = 0; > > > > + > > > > + /* > > > > +* Disabling preemption makes sure that we are not rescheduled > > > > here > > > > +* > > > > +* Also preemption helps kmod_concurrent is not increased by > > > > mistake > > > > +* for too long given in theory two concurrent threads could > > > > race on > > > > +* atomic_inc() before we atomic_read() -- we know that's > > > > possible > > > > +* and but we don't care, this is not used for object > > > > accounting and > > > > +* is just a subjective threshold. The alternative is a lock. > > > > +*/ > > > > + preempt_disable(); > > > > atomic_inc(&kmod_concurrent); > > > > if (atomic_read(&kmod_concurrent) <= max_modprobes) > > > > > > That is very "fancy" way to basically say: > > > > > > if (atomic_inc_return(&kmod_concurrent) <= max_modprobes) > > > > Do you mean to combine the atomic_inc() and atomic_read() in one as you > > noted > > (as that is not a change in this patch), *or* that using a memory barrier > > here > > with atomic_inc_return() should suffice to address the same and avoid an > > explicit preemption enable / disable ? > > I am saying that atomic_inc_return() will avoid situation where you have > more than one threads incrementing the counter and believing that they > are [not] allowed to start modprobe. > > I have no idea why you think preempt_disable() would help here. It only > ensures that current thread will not be preempted between the point > where you update the counter and where you check the result. It does not > stop interrupts nor does it affect other threads that might be updating > the same counter. The preemption was inspired by __module_get() and try_module_get(), was that rather silly ? Luis -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 20/29] sync_file.txt: standardize document format
Hi Mauro, 2017-05-18 Mauro Carvalho Chehab : > Each text file under Documentation follows a different > format. Some doesn't even have titles! > > Change its representation to follow the adopted standard, > using ReST markups for it to be parseable by Sphinx: > - Use markup for document title and authorship; > - Mark literal blocks; > - Use a numbered list for references. > > Signed-off-by: Mauro Carvalho Chehab > --- > Documentation/sync_file.txt | 23 +-- > 1 file changed, 13 insertions(+), 10 deletions(-) We went ahead and applied this to drm-misc-next. Thanks. Gustavo -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/6] kmod: preempt on kmod_umh_threads_get()
On Thu, May 25, 2017 at 03:00:17AM +0200, Luis R. Rodriguez wrote: > On Wed, May 24, 2017 at 05:45:37PM -0700, Dmitry Torokhov wrote: > > On Thu, May 25, 2017 at 02:14:52AM +0200, Luis R. Rodriguez wrote: > > > On Fri, May 19, 2017 at 03:27:12PM -0700, Dmitry Torokhov wrote: > > > > On Thu, May 18, 2017 at 08:24:43PM -0700, Luis R. Rodriguez wrote: > > > > > In theory it is possible multiple concurrent threads will try to > > > > > kmod_umh_threads_get() and as such atomic_inc(&kmod_concurrent) at > > > > > the same time, therefore enabling a small time during which we've > > > > > bumped kmod_concurrent but have not really enabled work. By using > > > > > preemption we mitigate this a bit. > > > > > > > > > > Preemption is not needed when we kmod_umh_threads_put(). > > > > > > > > > > Signed-off-by: Luis R. Rodriguez > > > > > --- > > > > > kernel/kmod.c | 24 ++-- > > > > > 1 file changed, 22 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/kernel/kmod.c b/kernel/kmod.c > > > > > index 563600fc9bb1..7ea11dbc7564 100644 > > > > > --- a/kernel/kmod.c > > > > > +++ b/kernel/kmod.c > > > > > @@ -113,15 +113,35 @@ static int call_modprobe(char *module_name, int > > > > > wait) > > > > > > > > > > static int kmod_umh_threads_get(void) > > > > > { > > > > > + int ret = 0; > > > > > + > > > > > + /* > > > > > + * Disabling preemption makes sure that we are not rescheduled > > > > > here > > > > > + * > > > > > + * Also preemption helps kmod_concurrent is not increased by > > > > > mistake > > > > > + * for too long given in theory two concurrent threads could > > > > > race on > > > > > + * atomic_inc() before we atomic_read() -- we know that's > > > > > possible > > > > > + * and but we don't care, this is not used for object > > > > > accounting and > > > > > + * is just a subjective threshold. The alternative is a lock. > > > > > + */ > > > > > + preempt_disable(); > > > > > atomic_inc(&kmod_concurrent); > > > > > if (atomic_read(&kmod_concurrent) <= max_modprobes) > > > > > > > > That is very "fancy" way to basically say: > > > > > > > > if (atomic_inc_return(&kmod_concurrent) <= max_modprobes) > > > > > > Do you mean to combine the atomic_inc() and atomic_read() in one as you > > > noted > > > (as that is not a change in this patch), *or* that using a memory barrier > > > here > > > with atomic_inc_return() should suffice to address the same and avoid an > > > explicit preemption enable / disable ? > > > > I am saying that atomic_inc_return() will avoid situation where you have > > more than one threads incrementing the counter and believing that they > > are [not] allowed to start modprobe. > > > > I have no idea why you think preempt_disable() would help here. It only > > ensures that current thread will not be preempted between the point > > where you update the counter and where you check the result. It does not > > stop interrupts nor does it affect other threads that might be updating > > the same counter. > > The preemption was inspired by __module_get() and try_module_get(), was that > rather silly ? As far as I can see prrempt_disable() was needed in __module_get() when modules user per-cpu refcounts: you did not want to move away from CPU while manipulating refcount. Now that modules use simple atomics for refcounting I think these preempt_disable() and preempt_enable() can be removed. Thanks. -- Dmitry -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 1/3] rtmutex: update rt-mutex-design
The rt-mutex-design documents didn't gotten meaningful update from its first version. Even after owner's pending bit was removed in commit 8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock") and priority list 'plist' changed to rbtree. And Peter Zijlstra did some clean up and fix for deadline task changes on tip tree. So update it to latest code and make it meaningful. Signed-off-by: Alex Shi Cc: Steven Rostedt Cc: Sebastian Siewior Cc: Mathieu Poirier Cc: Juri Lelli Cc: Thomas Gleixner To: linux-doc@vger.kernel.org To: linux-ker...@vger.kernel.org To: Jonathan Corbet To: Ingo Molnar To: Peter Zijlstra --- Documentation/locking/rt-mutex-design.txt | 418 +++--- 1 file changed, 97 insertions(+), 321 deletions(-) diff --git a/Documentation/locking/rt-mutex-design.txt b/Documentation/locking/rt-mutex-design.txt index 8666070..1a0da32 100644 --- a/Documentation/locking/rt-mutex-design.txt +++ b/Documentation/locking/rt-mutex-design.txt @@ -97,9 +97,9 @@ waiter - A waiter is a struct that is stored on the stack of a blocked a process being blocked on the mutex, it is fine to allocate the waiter on the process's stack (local variable). This structure holds a pointer to the task, as well as the mutex that - the task is blocked on. It also has the plist node structures to - place the task in the waiter_list of a mutex as well as the - pi_list of a mutex owner task (described below). + the task is blocked on. It also has a rbtree node structures to + place the task in waiters rbtree of a mutex as well as the + pi_waiters rbtree of a mutex owner task (described below). waiter is sometimes used in reference to the task that is waiting on a mutex. This is the same as waiter->task. @@ -179,53 +179,35 @@ again. | F->L5-+ - -Plist -- - -Before I go further and talk about how the PI chain is stored through lists -on both mutexes and processes, I'll explain the plist. This is similar to -the struct list_head functionality that is already in the kernel. -The implementation of plist is out of scope for this document, but it is -very important to understand what it does. - -There are a few differences between plist and list, the most important one -being that plist is a priority sorted linked list. This means that the -priorities of the plist are sorted, such that it takes O(1) to retrieve the -highest priority item in the list. Obviously this is useful to store processes -based on their priorities. - -Another difference, which is important for implementation, is that, unlike -list, the head of the list is a different element than the nodes of a list. -So the head of the list is declared as struct plist_head and nodes that will -be added to the list are declared as struct plist_node. - +If process G has the highest priority in the chain, then all the tasks up +the chain (A and B in this example), must have their priorities increased +to that of G. Mutex Waiter List - Every mutex keeps track of all the waiters that are blocked on itself. The mutex -has a plist to store these waiters by priority. This list is protected by +has a rbtree to store these waiters by priority. This tree is protected by a spin lock that is located in the struct of the mutex. This lock is called -wait_lock. Since the modification of the waiter list is never done in +wait_lock. Since the modification of the waiter tree is never done in interrupt context, the wait_lock can be taken without disabling interrupts. -Task PI List +Task PI Tree -To keep track of the PI chains, each process has its own PI list. This is -a list of all top waiters of the mutexes that are owned by the process. -Note that this list only holds the top waiters and not all waiters that are +To keep track of the PI chains, each process has its own PI rbtree. This is +a tree of all top waiters of the mutexes that are owned by the process. +Note that this tree only holds the top waiters and not all waiters that are blocked on mutexes owned by the process. -The top of the task's PI list is always the highest priority task that +The top of the task's PI tree is always the highest priority task that is waiting on a mutex that is owned by the task. So if the task has inherited a priority, it will always be the priority of the task that is -at the top of this list. +at the top of this tree. -This list is stored in the task structure of a process as a plist called -pi_list. This list is protected by a spin lock also in the task structure, +This tree is stored in the task structure of a process as a rbtree called +pi_waiters. It is protected by a spin lock also in the task structure, called pi_lock. This lock may also be taken in interrupt context, so when locking the pi_lock, inter
[PATCH v2 2/3] rtmutex: update rt-mutex
The rtmutex remove a pending owner bit in in rt_mutex::owner, in commit 8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock") But the document was changed accordingly. Updating it to a meaningful state. BTW, as 'Steven Rostedt' mentioned: There is still technically a "Pending Owner", it's just not called that anymore. The pending owner happens to be the top_waiter of a lock that has no owner and has been woken up to grab the lock. Signed-off-by: Alex Shi Cc: Steven Rostedt Cc: Sebastian Siewior Cc: Mathieu Poirier Cc: Juri Lelli Cc: Thomas Gleixner To: linux-doc@vger.kernel.org To: linux-ker...@vger.kernel.org To: Jonathan Corbet To: Ingo Molnar To: Peter Zijlstra --- Documentation/locking/rt-mutex.txt | 58 +- 1 file changed, 26 insertions(+), 32 deletions(-) diff --git a/Documentation/locking/rt-mutex.txt b/Documentation/locking/rt-mutex.txt index 243393d..35793e0 100644 --- a/Documentation/locking/rt-mutex.txt +++ b/Documentation/locking/rt-mutex.txt @@ -28,14 +28,13 @@ magic bullet for poorly designed applications, but it allows well-designed applications to use userspace locks in critical parts of an high priority thread, without losing determinism. -The enqueueing of the waiters into the rtmutex waiter list is done in +The enqueueing of the waiters into the rtmutex waiter tree is done in priority order. For same priorities FIFO order is chosen. For each rtmutex, only the top priority waiter is enqueued into the owner's -priority waiters list. This list too queues in priority order. Whenever +priority waiters tree. This tree too queues in priority order. Whenever the top priority waiter of a task changes (for example it timed out or -got a signal), the priority of the owner task is readjusted. [The -priority enqueueing is handled by "plists", see include/linux/plist.h -for more details.] +got a signal), the priority of the owner task is readjusted. The +priority enqueueing is handled by "pi_waiters". RT-mutexes are optimized for fastpath operations and have no internal locking overhead when locking an uncontended mutex or unlocking a mutex @@ -46,34 +45,29 @@ is used] The state of the rt-mutex is tracked via the owner field of the rt-mutex structure: -rt_mutex->owner holds the task_struct pointer of the owner. Bit 0 and 1 -are used to keep track of the "owner is pending" and "rtmutex has -waiters" state. +lock->owner holds the task_struct pointer of the owner. Bit 0 is used to +keep track of the "lock has waiters" state. - owner bit1bit0 - NULL 0 0 mutex is free (fast acquire possible) - NULL 0 1 invalid state - NULL 1 0 Transitional state* - NULL 1 1 invalid state - taskpointer 0 0 mutex is held (fast release possible) - taskpointer 0 1 task is pending owner - taskpointer 1 0 mutex is held and has waiters - taskpointer 1 1 task is pending owner and mutex has waiters + ownerbit0 + NULL 0 lock is free (fast acquire possible) + NULL 1 lock is free and has waiters and the top waiter + is going to take the lock* + taskpointer 0 lock is held (fast release possible) + taskpointer 1 lock is held and has waiters** -Pending-ownership handling is a performance optimization: -pending-ownership is assigned to the first (highest priority) waiter of -the mutex, when the mutex is released. The thread is woken up and once -it starts executing it can acquire the mutex. Until the mutex is taken -by it (bit 0 is cleared) a competing higher priority thread can "steal" -the mutex which puts the woken up thread back on the waiters list. +The fast atomic compare exchange based acquire and release is only +possible when bit 0 of lock->owner is 0. -The pending-ownership optimization is especially important for the -uninterrupted workflow of high-prio tasks which repeatedly -takes/releases locks that have lower-prio waiters. Without this -optimization the higher-prio thread would ping-pong to the lower-prio -task [because at unlock time we always assign a new owner]. +(*) It also can be a transitional state when grabbing the lock +with ->wait_lock is held. To prevent any fast path cmpxchg to the lock, +we need to set the bit0 before looking at the lock, and the owner may be +NULL in this small time, hence this can be a transitional state. -(*) The "mutex has waiters" bit gets set to take the lock. If the lock -doesn't already have an owner, this bit is quickly cleared if there are -no waiters. So this is a transitional state to synchronize with looking -at the owner field of the mutex and the mutex owner releasing the lock. +(**) There is a small time when bit 0 is set but there are no +waiters. This can happen when grabbing the lock in the slow path. +To prevent a cmpxchg of the owner releasing the lock