Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
On Mon, 9 Apr 2001, [EMAIL PROTECTED] wrote:
> As you've observed, with the approach of waiting for all pre-empted
> tasks to synchronize, the possibility of a task staying pre-empted for
> a long time could affect the latency of an update/synchronize (though
> it's hard for me to judge how likely that is).

It's very unlikely on a system that doesn't already have problems with
CPU starvation because of runaway real-time tasks or interrupt handlers.
First, preemption is a comparatively rare event with a mostly
timesharing load, typically from 1% to 10% of all context switches.
Second, the scheduler should not penalize the preempted task for being
preempted, so it should usually get to continue running as soon as the
preempting task is descheduled, which is at most one timeslice for
timesharing tasks.

Nigel Gamble                    [EMAIL PROTECTED]
Mountain View, CA, USA.         http://www.nrg.org/
MontaVista Software             [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
On Tue, 10 Apr 2001, Paul McKenney wrote:
> The algorithms we have been looking at need to have absolute
> guarantees that earlier activity has completed.  The most
> straightforward way to guarantee this is to have the critical-section
> activity run with preemption disabled.  Most of these code segments
> either take out locks or run with interrupts disabled anyway, so there
> is little or no degradation of latency in this case.  In fact, in many
> cases, latency would actually be improved due to removal of explicit
> locking primitives.
>
> I believe that one of the issues that pushes in this direction is the
> discovery that "synchronize_kernel()" could not be a nop in a UP
> kernel unless the read-side critical sections disable preemption
> (either in the natural course of events, or artificially if need be).
> Andi or Rusty can correct me if I missed something in the previous
> exchange...
>
> The read-side code segments are almost always quite short, and, again,
> they would almost always otherwise need to be protected by a lock of
> some sort, which would disable preemption in any event.
>
> Thoughts?

Disabling preemption is a possible solution if the critical section is
short - less than 100us - otherwise preemption latencies become a
problem.

The implementation of synchronize_kernel() that Rusty and I discussed
earlier in this thread would work in other cases, such as module
unloading, where there was a concern that it was not practical to have
any sort of lock in the read-side code path and the write side was not
time critical.

Nigel
Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
On Tue, 10 Apr 2001, Paul McKenney wrote:
> > Disabling preemption is a possible solution if the critical section
> > is short - less than 100us - otherwise preemption latencies become a
> > problem.
>
> Seems like a reasonable restriction.  Of course, this same limit
> applies to locks and interrupt disabling, right?

That's the goal I'd like to see us achieve in 2.5.  Interrupts are
already in this range (with a few notable exceptions), but there is
still the big kernel lock and a few other long-held spinlocks to deal
with.  So I want to make sure that any new locking scheme like the ones
under discussion plays nicely with the efforts to achieve low-latency
Linux, such as the preemptible kernel.

> > The implementation of synchronize_kernel() that Rusty and I
> > discussed earlier in this thread would work in other cases, such as
> > module unloading, where there was a concern that it was not
> > practical to have any sort of lock in the read-side code path and
> > the write side was not time critical.
>
> True, but only if the synchronize_kernel() implementation is applied
> to UP kernels, also.

Yes, that is the idea.

Nigel
Re: [Lse-tech] Re: [PATCH for 2.5] preemptible kernel
On Tue, 10 Apr 2001, [EMAIL PROTECTED] wrote:
> On Tue, Apr 10, 2001 at 09:08:16PM -0700, Paul McKenney wrote:
> > > Disabling preemption is a possible solution if the critical
> > > section is short - less than 100us - otherwise preemption
> > > latencies become a problem.
> >
> > Seems like a reasonable restriction.  Of course, this same limit
> > applies to locks and interrupt disabling, right?
>
> So supposing 1/2 us per update
> 	lock process list
> 	for every process update pgd
> 	unlock process list
>
> is ok if #processes < 200, but can cause some unspecified system
> failure due to a dependency on the 100us limit otherwise?

Only to a hard real-time system.

> And on a slower machine or with some heavy I/O possibilities

I'm mostly interested in Linux in embedded systems, where we have a lot
of control over the overall system, such as how many processes are
running.  This makes it easier to control latencies than on a general
purpose computer.

> We have a tiny little kernel to worry about in RTLinux and it's quite
> hard for us to keep track of all possible delays in such cases.  How's
> this going to work for Linux?

The same way everything works for Linux: with enough people around the
world interested in and working on these problems, they will be fixed.

Nigel
Scheduling bug for SCHED_FIFO and SCHED_RR
A SCHED_FIFO or SCHED_RR task with priority n+1 will not preempt a
running task with priority n.  You need to give the higher priority
task a priority of at least n+2 for it to be chosen by the scheduler.

The problem is caused by reschedule_idle(), uniprocessor version:

	if (preemption_goodness(tsk, p, this_cpu) > 1)
		tsk->need_resched = 1;

For real-time scheduling to work correctly, need_resched should be set
whenever preemption_goodness() is greater than 0, not 1.  Here is a
patch against 2.4.3:

--- 2.4.3/kernel/sched.c	Thu Apr 19 15:03:21 2001
+++ linux/kernel/sched.c	Fri Apr 20 16:45:07 2001
@@ -290,7 +290,7 @@
 		struct task_struct *tsk;
 
 		tsk = cpu_curr(this_cpu);
-		if (preemption_goodness(tsk, p, this_cpu) > 1)
+		if (preemption_goodness(tsk, p, this_cpu) > 0)
 			tsk->need_resched = 1;
 #endif
 	}

Nigel
Re: 2.4.3+ sound distortion
On Sat, 21 Apr 2001, Victor Julien wrote:
> I have a problem with kernels higher than 2.4.2: the sound distorts
> when playing a song with xmms while the seti@home client runs.  2.4.2
> did not have this problem.  I tried 2.4.3, 2.4.4-pre5 and 2.4.3-ac11.
> They all showed the same problem.

Try running xmms as root with the "Use realtime priority when
available" option checked.  If the distortion is because xmms isn't
getting enough CPU time, then running it at a realtime priority will
fix it.

Nigel
Re: [PATCH] 2.4.0-prerelease: preemptive kernel.
On Wed, 3 Jan 2001, ludovic fernandez wrote:
> For hackers,
> The following patch makes the kernel preemptable.
> It is against 2.4.0-prerelease and for i386 only.
> It should work for UP and SMP, even though I didn't validate it on
> SMP.
> Comments are welcome.

Hi Ludo,

I didn't realise you were still working on this.  Did you know that I
am also?  Our most recent version is at:

	ftp://ftp.mvista.com/pub/Area51/preemptible_kernel/

although I have yet to put up a 2.4.0-prerelease patch (coming soon).
We should probably pool our efforts on this for 2.5.

Cheers, Nigel
Re: [PATCH] 2.4.0-prerelease: preemptive kernel.
On Thu, 4 Jan 2001, Daniel Phillips wrote:
> A more ambitious way to proceed is to change spinlocks so they can
> sleep (not in interrupts of course).  There would not be any extra
> overhead for this on spin_lock (because the sleep test is handled off
> the fast path) but spin_unlock gets a little slower - it has to test
> and jump on a flag if there are sleepers.

I already have a preemption patch that also changes the longest-held
spinlocks into sleep locks, i.e. the locks that are routinely held for
more than 1ms.  This gives a kernel with very good interactive
response, good enough for most audio apps.

Nigel
Re: [PATCH] 2.4.0-prerelease: preemptive kernel.
On Thu, 4 Jan 2001, Andi Kleen wrote:
> On Thu, Jan 04, 2001 at 08:35:02AM +0100, Daniel Phillips wrote:
> > A more ambitious way to proceed is to change spinlocks so they can
> > sleep (not in interrupts of course).  There would not be any extra
> > overhead
>
> Imagine what happens when a non-sleeping spinlock in an interrupt
> waits for a "sleeping spinlock" somewhere else...
> I'm not sure if this is a good idea.  Sleeping locks everywhere would
> imply scheduled interrupts, which are nasty.

Yes, you have to make sure that you never call a sleeping lock while
holding a spinlock.  And you can't call a sleeping lock from interrupt
handlers in the current model.  But this is easy to avoid.

> I think a better way to proceed would be to make semaphores a bit
> more intelligent and turn them into something like adaptive spinlocks
> and use them more where appropriate (currently using semaphores
> usually causes lots of context switches where some could probably be
> avoided).  Problem is that for some cases like your producer-consumer
> pattern (which has been used previously in unreleased kernel code
> BTW) it would be a pessimization to spin, so such adaptive locks
> would probably need a different name.

Experience has shown that adaptive spinlocks are not worth the extra
overhead (if you mean the type that spin for a short time and then
decide to sleep).  It is better to use spin_lock_irqsave() (which, by
definition, disables kernel preemption without the need to set a
no-preempt flag) to protect regions where the lock is held for a
maximum of around 100us, and to use a sleeping mutex lock for longer
regions.  This is what I'm working towards.

Nigel
Re: [PATCH] 2.4.0-prerelease: preemptive kernel.
On Thu, 4 Jan 2001, Andi Kleen wrote:
> The problem is that current Linux semaphores are very costly locks --
> they always cause a context switch.

My preemptible kernel patch currently just uses Linux semaphores to
implement sleeping kernel mutexes, but we (at MontaVista Software) are
working on a new implementation that also does priority inheritance, to
avoid the priority inversion problem, and that does the minimum
necessary context switches.

Nigel
Re: [PATCH] 2.4.0-prerelease: preemptive kernel.
On Thu, 4 Jan 2001, Andi Kleen wrote:
> On Thu, Jan 04, 2001 at 01:39:57PM -0800, Nigel Gamble wrote:
> > Experience has shown that adaptive spinlocks are not worth the
> > extra overhead (if you mean the type that spin for a short time and
> > then decide to sleep).  It is better to use spin_lock_irqsave()
> > (which, by definition, disables kernel preemption without the need
> > to set a no-preempt flag) to protect regions where the lock is held
> > for a maximum of around 100us, and to use a sleeping mutex lock for
> > longer regions.  This is what I'm working towards.
>
> What experience?  Only real-time latency testing, or SMP scalability
> testing?

Both.  We spent a lot of time on this when I was at SGI working on
IRIX.  I think we ended up with excellent SMP scalability and good
real-time latency.  There is also some academic research that suggests
that the extra overhead of a dynamic adaptive spinlock usually
outweighs any possible gains.

> The case I was thinking about is a heavily contended lock like the
> inode semaphore of a file that is used by several threads on several
> CPUs in parallel, or the mm semaphore of an often-faulted shared mm.
>
> It's not an option to convert them to a spinlock, but often the
> delays are short enough that a short spin could make sense.

I think the first-order performance problem of a heavily contended
lock is not how it is implemented, but the fact that it is heavily
contended.  In IRIX we spent a lot of time looking for these
bottlenecks and re-architecting to avoid them.  (This would mean
minimizing the shared accesses in your examples.)

Nigel
Re: [PATCH] 2.4.0-prerelease: preemptive kernel.
On Thu, 4 Jan 2001, ludovic fernandez wrote:
> This is not the point I was trying to make.
> So far we are talking about real-time behaviour.  This is a very
> interesting/exciting thing, and we all agree it's a huge task which
> goes much further than just having a preemptive kernel.

You're right that it is more than just a preemptible kernel, but I
don't agree that it's all that huge.  But this is the third time I
have worked on enabling real-time behavior in unix-like OSes, so I may
be biased ;-)

> I'm not convinced that a preemptive kernel is interesting for apps
> using the time-sharing scheduling, mainly because it is not
> deterministic and the price of an mmu context switch is still way too
> heavy (that's my 2 cents belief anyway).

But as Roger pointed out, the number of extra context switches
introduced by having a preemptible kernel is actually very low.  If an
interrupt occurs while running in user mode, the context switch it may
cause will happen even in a non-preemptible kernel.  I think that
running a kernel compile, for example, the number of context switches
per second caused by kernel preemption is probably between 1% and 10%
of the total context switches per second.

And it's certainly interesting to me that I can listen to MP3s without
interruption now, while doing a kernel build!

Nigel
Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0
On Wed, 10 Jan 2001, David S. Miller wrote:
> Opinion: Personally, I think the approach in Andrew's patch
>          is the way to go.
>
>          Not because it can give the absolute best results.
>          But rather, it is because it says "here is where a lot
>          of time is spent".
>
>          This has two huge benefits:
>          1) It tells us where possible algorithmic improvements may
>             be possible.  In some cases we may be able to improve the
>             code to the point where the pre-emption points are no
>             longer necessary and can thus be removed.

This is definitely an important goal.  But lock-metering code in a
fully preemptible kernel can also identify spots where algorithmic
improvements are most important.

>          2) It affects only code which can burn a lot of cpu without
>             scheduling.  Compare this to schemes which make the
>             kernel fully pre-emptable, causing _EVERYONE_ to pay the
>             price of low-latency.  If we were to later find
>             algorithmic improvements to the high-latency pieces of
>             code, we couldn't then just "undo" support for
>             pre-emption because dependencies will have swept across
>             the whole kernel already.
>
>             Pre-emption, by itself, also doesn't help in situations
>             where lots of time is spent while holding spinlocks.
>             There are several other operating systems which support
>             pre-emption where you will find hard coded calls to the
>             scheduler in time-consuming code.  Heh, it's almost like,
>             "what's the frigging point of pre-emption then if you
>             still have to manually check in some spots?"

Spinlocks should not be held for lots of time.  This adversely affects
SMP scalability as well as latency.  That's why MontaVista's kernel
preemption patch uses sleeping mutex locks instead of spinlocks for the
long-held locks.

In a fully preemptible kernel that is implemented correctly, you won't
find any hard-coded calls to the scheduler in time-consuming code.
The scheduler should only be called in response to an interrupt (I/O
or timeout) when we know that a higher priority process has been made
runnable, or when the running process sleeps (voluntarily or when it
has to wait for something) or exits.  This is the case in both of the
fully preemptible kernels which I've worked on (IRIX and REAL/IX).

Nigel
Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0
On Fri, 12 Jan 2001, Tim Wright wrote:
> On Sat, Jan 13, 2001 at 12:30:46AM +1100, Andrew Morton wrote:
> > what worries me about this is the Apache-flock-serialisation saga.
> >
> > Back in -test8, kumon@fujitsu demonstrated that changing this:
> >
> > 	lock_kernel()
> > 	down(sem)
> > 	<critical section>
> > 	up(sem)
> > 	unlock_kernel()
> >
> > into this:
> >
> > 	down(sem)
> > 	<critical section>
> > 	up(sem)
> >
> > had the effect of *decreasing* Apache's maximum connection rate
> > on an 8-way from ~5,000 connections/sec to ~2,000 conn/sec.
> >
> > That's downright scary.
> >
> > Obviously, <critical section> was very quick, and the CPUs were
> > passing through this section at a great rate.
> >
> > How can we be sure that converting spinlocks to semaphores
> > won't do the same thing?  Perhaps for workloads which we
> > aren't testing?
> >
> > So this needs to be done with caution.
>
> Hmmm...
> if <critical section> is very quick, and is guaranteed not to sleep,
> then a semaphore is the wrong way to protect it.  A spinlock is the
> correct choice.  If it's always slow, and can sleep, then a semaphore
> makes more sense, although if it's highly contended, you're going to
> serialize and throughput will die.  At that point, you need to
> redesign :-)
> If it's mostly quick but occasionally needs to sleep, I don't know
> what the correct idiom would be in Linux.  DYNIX/ptx has the concept
> of atomically releasing a spinlock and going to sleep on a semaphore,
> and that would be the solution there e.g.
>
> 	p_lock(lock);
> retry:
> 	...
> 	if (condition where we need to sleep) {
> 		p_sema_v_lock(sema, lock);
> 		/* we got woken up */
> 		p_lock(lock);
> 		goto retry;
> 	}
> 	...
>
> I'm stating the obvious here, and re-iterating what you said, and
> that is that we need to carefully pick the correct primitive for the
> job.  Unless there's something very unusual in the Linux
> implementation that I've missed, a spinlock is a "cheaper" method of
> protecting a short critical section, and should be chosen.
>
> I know the BKL is semantically a little unusual (the automatic
> release on sleep stuff), but even so, isn't
>
> 	lock_kernel()
> 	down(sem)
> 	<critical section>
> 	up(sem)
> 	unlock_kernel()
>
> actually equivalent to
>
> 	lock_kernel()
> 	<critical section>
> 	unlock_kernel()
>
> If so, it's no great surprise that performance dropped given that we
> replaced a spinlock (albeit one guarding somewhat more than the
> critical section) with a semaphore.
>
> Tim
>
> --
> Tim Wright - [EMAIL PROTECTED] or [EMAIL PROTECTED] or [EMAIL PROTECTED]
> IBM Linux Technology Center, Beaverton, Oregon
> "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

Nigel
Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0
On Sat, 13 Jan 2001, Andrew Morton wrote:
> Nigel Gamble wrote:
> > Spinlocks should not be held for lots of time.  This adversely
> > affects SMP scalability as well as latency.  That's why
> > MontaVista's kernel preemption patch uses sleeping mutex locks
> > instead of spinlocks for the long held locks.
>
> Nigel,
>
> what worries me about this is the Apache-flock-serialisation saga.
>
> Back in -test8, kumon@fujitsu demonstrated that changing this:
>
> 	lock_kernel()
> 	down(sem)
> 	<critical section>
> 	up(sem)
> 	unlock_kernel()
>
> into this:
>
> 	down(sem)
> 	<critical section>
> 	up(sem)
>
> had the effect of *decreasing* Apache's maximum connection rate
> on an 8-way from ~5,000 connections/sec to ~2,000 conn/sec.
>
> That's downright scary.
>
> Obviously, <critical section> was very quick, and the CPUs were
> passing through this section at a great rate.

Yes, this demonstrates that spinlocks are preferable to sleep locks
for short sections.  However, it looks to me like the implementation
of up() may be partly to blame: it tends to prefer to context switch
to the woken-up process, instead of continuing to run the current
process.  Surrounding the semaphore with the BKL has the effect of
enforcing the latter behavior, because the semaphore itself will never
have any waiters.

> How can we be sure that converting spinlocks to semaphores
> won't do the same thing?  Perhaps for workloads which we
> aren't testing?
>
> So this needs to be done with caution.
>
> As davem points out, now we know where the problems are
> occurring, a good next step is to redesign some of those
> parts of the VM and buffercache.  I don't think this will
> be too hard, but they have to *want* to change :)

Yes, wherever the code can be redesigned to avoid long-held locks,
that would definitely be my preferred solution.  I think everyone
would be happy if we could end up with a maintainable solution using
only spinlocks that are held for no longer than a couple of hundred
microseconds.

Nigel
Re: Latency: allowing resheduling while holding spin_locks
On Sat, 13 Jan 2001, Roger Larsson wrote:
> A rethinking of the rescheduling strategy...

Actually, I think you have more-or-less described how successful
preemptible kernels have already been developed, given that your
"sleeping spin locks" are really just sleeping mutexes (or binary
semaphores).

1.  Short critical regions are protected by spin_lock_irq().  The
    maximum value of "short" is therefore bounded by the maximum time
    we are happy to disable (local) interrupts - ideally ~100us.

2.  Longer regions are protected by sleeping mutexes.

3.  Algorithms are rearchitected until all of the highly contended
    locks are of type 1, and only low-contention locks are of type 2.

This approach has the advantage that we don't need to use a no-preempt
count, and test it on exit from every spinlock to see if a preempting
interrupt has caused a need_resched, since we won't see the interrupt
until it's safe to do the preemptive resched.

Nigel
Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0
On Sat, 20 Jan 2001, [EMAIL PROTECTED] wrote:
> Let me just point out that Nigel (I think) has previously stated that
> the purpose of this approach is to bring the stunning success of IRIX
> style "RT" to Linux.  Since some of us believe that IRIX is a virtual
> handbook of OS errors, it really comes down to a design style.  I
> think that simplicity and "does the main job well" wins every time
> over "really cool algorithms" and "does everything badly".  Others
> disagree.

Let me just point out that Victor has his own commercial axe to grind
in his continual bad-mouthing of IRIX, the internals of which he knows
nothing about.

Nigel
Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0
On Sun, 21 Jan 2001, Paul Barton-Davis wrote:
> > Let me just point out that Victor has his own commercial axe to
> > grind in his continual bad-mouthing of IRIX, the internals of which
> > he knows nothing about.
>
> 1) do you actually disagree with victor ?

Yes, I most emphatically do disagree with Victor!  IRIX is used for
mission-critical audio applications - recording as well as playback -
and other low-latency applications.  The same OS scales to large
numbers of CPUs.  And it has the best desktop interactive response of
any OS I've used.  I will be very happy when Linux is as good in all
these areas, and I'm working hard to achieve this goal with negligible
impact on the current Linux "sweet-spot" applications such as web
serving.

> this discussion has the hallmarks of turning into a personal
> bash-fest, which is really pointless.  what is *not* pointless is a
> considered discussion about the merits of the IRIX "RT" approach over
> possible approaches that Linux might take which are dissimilar to the
> IRIX one.  on the other hand, as Victor said, a large part of that
> discussion ultimately comes down to a design style rather than hard
> factual or logical reasoning.

I agree.  I'm not wedded to any particular design - I just want a
low-latency Linux by whatever is the best way of achieving that.

However, I am hearing Victor say that we shouldn't try to make Linux
itself low-latency, we should just use his so-called "RTLinux"
environment for low-latency tasks.  RTLinux is not Linux; it is a
separate environment with a separate, limited set of APIs.  You can't
run XMMS, or any other existing Linux audio app, in RTLinux.  I want a
low-latency Linux, not just another RTOS living parasitically
alongside Linux.

Nigel
[PATCH] Latest preemptible kernel (low latency) patch available
MontaVista Software's latest preemptible kernel patch,
preempt-2.4.0-test11-1.patch.bz2, is now available in

	ftp://ftp.mvista.com/pub/Area51/preemptible_kernel/

Here is an extract from the README file:

  The patches in this directory, when applied to the corresponding
  kernel source, will define a new configure option, 'Preemptable
  Kernel', under the 'Processor type and features' section.  When
  enabled, and the kernel is rebuilt, it will be fully preemptable,
  subject to SMP lock areas (i.e. it uses SMP locking on a UP to
  control preemptability).  The patch can only be enabled for ix86
  uniprocessor platforms.  (Stay tuned for other platforms and SMP
  support.)

  Notes for preempt-2.4.0-test11-1.patch
  --
  - Updated to kernel 2.4.0-test11

  Notes for preempt-2.4.0-test10-1.patch
  --
  The main changes between this and previous patches are:
  - Updated to kernel 2.4.0-test10
  - Long held spinlocks changed into mutex locks, currently
    implemented using semaphores.  (We are working on a fast, priority
    inheriting, binary semaphore implementation of these locks.)

The patch gives good results on Benno's Audio-Latency test
http://www.gardena.net/benno/linux/audio/, with maximum latencies less
than a couple of milliseconds recorded using a 750MHz PIII machine.
However, there are still some >10ms non-preemptible paths that are not
exercised by this test.  The worst non-preemptible paths are now
dominated by the big kernel lock, which we hope can be completely
eliminated in 2.5 by finer-grained locks.

(I will be at the Linux Real-Time Workshop in Orlando next week, and
may not be able to access my work email address ([EMAIL PROTECTED]),
which is why I'm posting this from my personal address.)

Nigel
Locking problem in autofs4_expire(), 2.4.0-test10
dput() is called with dcache_lock already held, resulting in deadlock.
Here is a suggested fix:

===== expire.c 1.3 vs edited =====
--- 1.3/linux/fs/autofs4/expire.c	Tue Oct 31 15:14:06 2000
+++ edited/expire.c	Fri Nov  3 17:47:47 2000
@@ -223,8 +223,10 @@
 			mntput(p);
 			return dentry;
 		}
+		spin_unlock(&dcache_lock);
 		dput(d);
 		mntput(p);
+		spin_lock(&dcache_lock);
 	}
 	spin_unlock(&dcache_lock);

Nigel Gamble
MontaVista Software
Re: [RFC] Semaphores used for daemon wakeup
On Sun, 17 Dec 2000, Daniel Phillips wrote:
> This patch illustrates an alternative approach to waking and waiting
> on daemons using semaphores instead of direct operations on wait
> queues.  The idea of using semaphores to regulate the cycling of a
> daemon was suggested to me by Arjan Vos.  The basic idea is simple: on
> each cycle a daemon down's a semaphore, and is reactivated when some
> other task up's the semaphore.
>
> Is this better, worse, or lateral?

This is much better, especially from a maintainability point of view.
It is also the method that a lot of operating systems already use.

Nigel
Re: Interrupt/Sleep deadlock
You could use a semaphore for this.  Initialize it to 0, then call
down() from the ioctl, and up() from the interrupt handler.  If the
up() happens before the down(), the down() won't go to sleep.

Nigel
Re: Weightless process class
On Wed, 4 Oct 2000, Rik van Riel wrote:
> On Wed, 4 Oct 2000, LA Walsh wrote:
>
> > I had another thought regarding resource scheduling -- has the
> > idea of a "weightless" process been brought up?
>
> Yes, look for "idle priority", etc..
> It also turned out to have some problems ...
>
> > Weightless means it doesn't count toward 'load' and the class
> > strictly has lowest priority in the system and gets *no* CPU
> > unless there are "idle" cycles.  So even a process niced to -19
> > could CPU starve a weightless process.
>
> One problem here is that you might end up with a weightless
> process having grabbed a superblock lock, after which a
> normal priority CPU hog kicks in and starves the weightless
> process.
>
> The result is that that superblock lock never gets released,
> and everybody needing to grab that lock blocks forever, even
> if they have a higher priority than the CPU hog that's starving
> our idle process...
>
> The solution to this would be only starve these processes
> when they are in user space and can no longer be holding
> any kernel locks.

The general solution, which SGI implements in IRIX, is to implement priority inheritance for blocking locks. So the weightless process gets the priority of the blocked process until it releases the lock. IRIX multi-reader semaphores initially did not implement priority inheritance, until this type of starvation scenario occurred!

I'm working on making the Linux kernel fully preemptible (as I did for IRIX when I used to work at SGI), and will need priority inheritance mutexes to enable real-time behavior for SCHED_FIFO and SCHED_RR tasks. So someone at MontaVista will be looking at this in the 2.5 timeframe.

Nigel Gamble [EMAIL PROTECTED]
www.mvista.com
Re: spinlock help
On Tue, 6 Mar 2001, Manoj Sontakke wrote:
> 1. when spin_lock_irqsave() function is called the subsequent code is
> executed until spin_unlock_irqrestore() is called. is this right?

Yes. The protected code will not be interrupted, or simultaneously executed by another CPU.

> 2. is this sequence valid?
> spin_lock_irqsave(a,b);
> spin_lock_irqsave(c,d);

Yes, as long as it is followed by:

	spin_unlock_irqrestore(c, d);
	spin_unlock_irqrestore(a, b);

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
[PATCH for 2.5] preemptible kernel
Here is the latest preemptible kernel patch. It's much cleaner and smaller than previous versions, so I've appended it to this mail. This patch is against 2.4.2, although it's not intended for 2.4. I'd like comments from anyone interested in a low-latency Linux kernel solution for the 2.5 development tree.

Kernel preemption is not allowed while spinlocks are held, which means that this patch alone cannot guarantee low preemption latencies. But as long held locks (in particular the BKL) are replaced by finer-grained locks, this patch will enable lower latencies as the kernel also becomes more scalable on large SMP systems.

Notwithstanding the comments in the Configure.help section for CONFIG_PREEMPT, I think this patch has a negligible effect on throughput. In fact, I got better average results from running 'dbench 16' on a 750MHz PIII with 128MB with kernel preemption turned on (~30MB/s) than on the plain 2.4.2 kernel (~26MB/s).

(I had to rearrange three header files that are needed in sched.h before task_struct is defined, but which include inline functions that cannot now be compiled until after task_struct is defined. I chose not to move them into sched.h, like d_path(), as I don't want to make it more difficult to apply kernel patches to my kernel source tree.)

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA.
http://www.nrg.org/

diff -Nur 2.4.2/CREDITS linux/CREDITS
--- 2.4.2/CREDITS	Wed Mar 14 12:15:49 2001
+++ linux/CREDITS	Wed Mar 14 12:21:42 2001
@@ -907,8 +907,8 @@
 N: Nigel Gamble
 E: [EMAIL PROTECTED]
-E: [EMAIL PROTECTED]
 D: Interrupt-driven printer driver
+D: Preemptible kernel
 S: 120 Alley Way
 S: Mountain View, California 94040
 S: USA
diff -Nur 2.4.2/Documentation/Configure.help linux/Documentation/Configure.help
--- 2.4.2/Documentation/Configure.help	Wed Mar 14 12:16:10 2001
+++ linux/Documentation/Configure.help	Wed Mar 14 12:22:04 2001
@@ -130,6 +130,23 @@
   If you have system with several CPU's, you do not need to say Y here:
   APIC will be used automatically.
 
+Preemptible Kernel
+CONFIG_PREEMPT
+  This option reduces the latency of the kernel when reacting to
+  real-time or interactive events by allowing a low priority process to
+  be preempted even if it is in kernel mode executing a system call.
+  This allows applications that need real-time response, such as audio
+  and other multimedia applications, to run more reliably even when the
+  system is under load due to other, lower priority, processes.
+
+  This option is currently experimental if used in conjunction with SMP
+  support.
+
+  Say Y here if you are building a kernel for a desktop system, embedded
+  system or real-time system.  Say N if you are building a kernel for a
+  system where throughput is more important than interactive response,
+  such as a server system.  Say N if you are unsure.
+
 Kernel math emulation
 CONFIG_MATH_EMULATION
   Linux can emulate a math coprocessor (used for floating point
diff -Nur 2.4.2/arch/i386/config.in linux/arch/i386/config.in
--- 2.4.2/arch/i386/config.in	Wed Mar 14 12:14:18 2001
+++ linux/arch/i386/config.in	Wed Mar 14 12:20:02 2001
@@ -161,6 +161,11 @@
     define_bool CONFIG_X86_IO_APIC y
     define_bool CONFIG_X86_LOCAL_APIC y
    fi
+   bool 'Preemptible Kernel' CONFIG_PREEMPT
+else
+   if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
+      bool 'Preemptible SMP Kernel (EXPERIMENTAL)' CONFIG_PREEMPT
+   fi
 fi
 
 if [ "$CONFIG_SMP" = "y" -a "$CONFIG_X86_CMPXCHG" = "y" ]; then
diff -Nur 2.4.2/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
--- 2.4.2/arch/i386/kernel/entry.S	Wed Mar 14 12:17:37 2001
+++ linux/arch/i386/kernel/entry.S	Wed Mar 14 12:23:42 2001
@@ -72,7 +72,7 @@
  * these are offsets into the task-struct.
  */
 state		=  0
-flags		=  4
+preempt_count	=  4
 sigpending	=  8
 addr_limit	= 12
 exec_domain	= 16
@@ -80,8 +80,30 @@
 tsk_ptrace	= 24
 processor	= 52
 
+/* These are offsets into the irq_stat structure
+ * There is one per cpu and it is aligned to 32
+ * byte boundary (we put that here as a shift count)
+ */
+irq_array_shift		= CONFIG_X86_L1_CACHE_SHIFT
+
+irq_stat_softirq_active	=  0
+irq_stat_softirq_mask	=  4
+irq_stat_local_irq_count	=  8
+irq_stat_local_bh_count	= 12
+
 ENOSYS = 38
 
+#ifdef CONFIG_SMP
+#define GET_CPU_INDX	movl processor(%ebx),%eax; \
+			shll $irq_array_shift,%eax
+#define GET_CURRENT_CPU_INDX	GET_CURRENT(%ebx); \
+			GET_CPU_INDX
+#define CPU_INDX (,%eax)
+#else
+#define GET_CPU_INDX
+#define GET_CURRENT_CPU_INDX	GET_CURRENT(%ebx)
+#define CPU_INDX
+#endif
+
 #define SAVE_ALL \
 	cld; \
@@ -270,16 +292,44 @@
Locking question (was: [CHECKER] 9 potential copy_*_user bugs in2.4.1)
On Thu, 15 Mar 2001, Dawson Engler wrote:
> 2. And, unrelated: given the current locking discipline, is
> it bad to hold any type of lock (not just a spin lock) when you
> call a potentially blocking function?  (It at least seems bad
> for performance, since you'll hold the lock for milliseconds.)

In general, yes. The lock may be held for much longer than milliseconds if the potentially blocking function is waiting for I/O from a network, or a terminal, potentially causing all threads to block on the lock until someone presses a key, in this extreme example. If the lock is a spinlock, then complete deadlock can occur.

You're probably aware that semaphores are used both as blocking mutex locks, where the down (lock) and up (unlock) calls are made by the same thread to protect critical data, and as a synchronization mechanism, where the down and up calls are made by different threads. The former use is a "lock", while the latter down() use is a "potentially blocking function" in terms of your question. I don't know how easy it would be for your analysis tools to distinguish between them.

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
Hi Pavel,

Thanks for your comments.

On Sat, 17 Mar 2001, Pavel Machek wrote:
> > diff -Nur 2.4.2/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
> > --- 2.4.2/arch/i386/kernel/traps.c	Wed Mar 14 12:16:46 2001
> > +++ linux/arch/i386/kernel/traps.c	Wed Mar 14 12:22:45 2001
> > @@ -973,7 +973,7 @@
> >  	set_trap_gate(11,&segment_not_present);
> >  	set_trap_gate(12,&stack_segment);
> >  	set_trap_gate(13,&general_protection);
> > -	set_trap_gate(14,&page_fault);
> > +	set_intr_gate(14,&page_fault);
> >  	set_trap_gate(15,&spurious_interrupt_bug);
> >  	set_trap_gate(16,&coprocessor_error);
> >  	set_trap_gate(17,&alignment_check);
>
> Are you sure about this piece? At least add a comment, because it
> *looks* strange.

With a preemptible kernel, we need to enter the page fault handler with interrupts disabled to protect the cr2 register. The interrupt state is restored immediately after cr2 has been saved. Otherwise, an interrupt could cause the faulting thread to be preempted, and the new thread could also fault, clobbering the cr2 register for the preempted thread. See the diff for linux/arch/i386/mm/fault.c.

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Tue, 20 Mar 2001, Roger Larsson wrote:
> One little readability thing I found.
> The prev->state TASK_ value is mostly used as a plain value
> but the new TASK_PREEMPTED is or:ed together with whatever was there.
> Later when we switch to check the state it is checked against TASK_PREEMPTED
> only. Since TASK_RUNNING is 0 it works OK but...

Yes, you're right. I had forgotten that TASK_RUNNING is 0 and I think I was assuming that there could be (rare) cases where a task was preempted while prev->state was in transition such that no other flags were set. This is, of course, impossible given that TASK_RUNNING is 0. So your change makes the common case more obvious (to me, at least!)

> --- sched.c.nigel	Tue Mar 20 18:52:43 2001
> +++ sched.c.roger	Tue Mar 20 19:03:28 2001
> @@ -553,7 +553,7 @@
>  #endif
>  		del_from_runqueue(prev);
>  #ifdef CONFIG_PREEMPT
> -	case TASK_PREEMPTED:
> +	case TASK_RUNNING | TASK_PREEMPTED:
>  #endif
>  	case TASK_RUNNING:
>  	}
>
> We could add all/(other common) combinations as cases
>
> switch (prev->state) {
> 	case TASK_INTERRUPTIBLE:
> 		if (signal_pending(prev)) {
> 			prev->state = TASK_RUNNING;
> 			break;
> 		}
> 	default:
> #ifdef CONFIG_PREEMPT
> 		if (prev->state & TASK_PREEMPTED)
> 			break;
> #endif
> 		del_from_runqueue(prev);
> #ifdef CONFIG_PREEMPT
> 	case TASK_RUNNING | TASK_PREEMPTED:
> 	case TASK_INTERRUPTIBLE | TASK_PREEMPTED:
> 	case TASK_UNINTERRUPTIBLE | TASK_PREEMPTED:
> #endif
> 	case TASK_RUNNING:
> 	}
>
> Then the break in default case could almost be replaced with a BUG()...
> (I have not checked the generated code)

The other cases are not very common, as they only happen if a task is preempted during the short time that it is running while in the process of changing state while going to sleep or waking up, so the default case is probably OK for them; and I'd be happier to leave the default case for reliability reasons anyway.

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA.
http://www.nrg.org/ MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Tue, 20 Mar 2001, Rusty Russell wrote:
> I can see three problems with this approach, only one of which
> is serious.
>
> The first is code which is already SMP unsafe is now a problem for
> everyone, not just the 0.1% of SMP machines.  I consider this a good
> thing for 2.5 though.

So do I.

> The second is that there are "manual" locking schemes which are used
> in several places in the kernel which rely on non-preemptability;
> de-facto spinlocks if you will.  I consider all these uses flawed: (1)
> they are often subtly broken anyway, (2) they make reading those parts
> of the code much harder, and (3) they break when things like this are
> done.

Likewise.

> The third is that preemptivity conflicts with the naive
> quiescent-period approach proposed for module unloading in 2.5, and
> useful for several other things (eg. hotplugging CPUs).  This method
> relies on knowing that when a schedule() has occurred on every CPU, we
> know noone is holding certain references.  The simplest example is a
> single linked list: you can traverse without a lock as long as you
> don't sleep, and then someone can unlink a node, and wait for a
> schedule on every other CPU before freeing it.  The non-SMP case is a
> noop.  See synchronize_kernel() below.

So, to make sure I understand this, the code to free a node would look like:

	prev->next = node->next;	/* assumed to be atomic */
	synchronize_kernel();
	free(node);

So that any other CPU concurrently traversing the list would see a consistent state, either including or not including "node" before the call to synchronize_kernel(); but after synchronize_kernel() all other CPUs are guaranteed to see a list that no longer includes "node", so it is now safe to free it.

It looks like there are also implicit assumptions to this approach, like no other CPU is trying to use the same approach simultaneously to free "prev".

So my initial reaction is that this approach is, like the manual locking schemes you commented on above, open to being subtly broken when people don't understand all the implicit assumptions and subsequently invalidate them.

> This, too, is soluble, but it means that synchronize_kernel() must
> guarantee that each task which was running or preempted in kernel
> space when it was called, has been non-preemptively scheduled before
> synchronize_kernel() can exit.  Icky.

Yes, you're right.

> Thoughts?

Perhaps synchronize_kernel() could take the run_queue lock, mark all the tasks on it and count them. Any task marked when it calls schedule() voluntarily (but not if it is preempted) is unmarked and the count decremented. synchronize_kernel() continues until the count is zero. As you said, "Icky."

> /* We could keep a schedule count for each CPU and make idle tasks
>    schedule (some don't unless need_resched), but this scales quite
>    well (eg. 64 processors, average time to wait for first schedule =
>    jiffie/64.  Total time for all processors = jiffie/63 + jiffie/62...
>
>    At 1024 cpus, this is about 7.5 jiffies.  And that assumes noone
>    schedules early. --RR */
> void synchronize_kernel(void)
> {
> 	unsigned long cpus_allowed, policy, rt_priority;
>
> 	/* Save current state */
> 	cpus_allowed = current->cpus_allowed;
> 	policy = current->policy;
> 	rt_priority = current->rt_priority;
>
> 	/* Create an unreal time task. */
> 	current->policy = SCHED_FIFO;
> 	current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);
>
> 	/* Make us schedulable on all CPUs. */
> 	current->cpus_allowed = (1UL << smp_num_cpus) - 1;
>
> 	/* Eliminate current cpu, reschedule */
> 	while ((current->cpus_allowed &= ~(1 << smp_processor_id())) != 0)
> 		schedule();
>
> 	/* Back to normal. */
> 	current->cpus_allowed = cpus_allowed;
> 	current->policy = policy;
> 	current->rt_priority = rt_priority;
> }

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA.
http://www.nrg.org/ MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Tue, 20 Mar 2001, Keith Owens wrote: > The preemption patch only allows preemption from interrupt and only for > a single level of preemption. That coexists quite happily with > synchronize_kernel() which runs in user context. Just count user > context schedules (preempt_count == 0), not preemptive schedules. I'm not sure what you mean by "only for a single level of preemption." It's possible for a preempting process to be preempted itself by a higher priority process, and for that process to be preempted by an even higher priority one, limited only by the number of processes waiting for interrupt handlers to make them runnable. This isn't very likely in practice (kernel preemptions tend to be rare compared to normal calls to schedule()), but it could happen in theory. If you're looking at preempt_schedule(), note the call to ctx_sw_off() only increments current->preempt_count for the preempted task - the higher priority preempting task that is about to be scheduled will have a preempt_count of 0. Nigel Gamble [EMAIL PROTECTED] Mountain View, CA, USA. http://www.nrg.org/ MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Wed, 21 Mar 2001, Keith Owens wrote: > I misread the code, but the idea is still correct. Add a preemption > depth counter to each cpu, when you schedule and the depth is zero then > you know that the cpu is no longer holding any references to quiesced > structures. A task that has been preempted is on the run queue and can be rescheduled on a different CPU, so I can't see how a per-CPU counter would work. It seems to me that you would need a per run queue counter, like the example I gave in a previous posting. Nigel Gamble [EMAIL PROTECTED] Mountain View, CA, USA. http://www.nrg.org/ MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Wed, 21 Mar 2001, Andrew Morton wrote:
> It's a problem for uniprocessors as well.
>
> Example:
>
> #define current_cpu_data boot_cpu_data
> #define pgd_quicklist (current_cpu_data.pgd_quick)
>
> extern __inline__ void free_pgd_fast(pgd_t *pgd)
> {
> 	*(unsigned long *)pgd = (unsigned long) pgd_quicklist;
> 	pgd_quicklist = (unsigned long *) pgd;
> 	pgtable_cache_size++;
> }
>
> Preemption could corrupt this list.

Thanks, Andrew, for pointing this out. I've added fixes to the patch for this problem and the others in pgalloc.h. If you know of any other similar problems on uniprocessors, please let me know.

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
MontaVista Software [EMAIL PROTECTED]
lock_kernel() usage and sync_*() functions
Why is the kernel lock held around sync_supers() and sync_inodes() in sync_old_buffers() and fsync_dev(), but not in sync_dev()? Is it just to serialize calls to these functions, or is there some other reason? Since this use of the BKL is one of the causes of high preemption latency in a preemptible kernel, I'm hoping it would be OK to replace them with a semaphore. Please let me know if this is not the case. Thanks! Nigel Gamble [EMAIL PROTECTED] Mountain View, CA, USA. http://www.nrg.org/ MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Thu, 22 Mar 2001, Rusty Russell wrote: > Nigel's "traverse the run queue and mark the preempted" solution is > actually pretty nice, and cheap. Since the runqueue lock is grabbed, > it doesn't require icky atomic ops, either. You'd have to mark both the preempted tasks, and the tasks currently running on each CPU (which could become preempted before reaching a voluntary schedule point). > Despite Nigel's initial belief that this technique is fragile, I > believe it will become an increasingly fundamental method in the > kernel, so (with documentation) it will become widely understood, as > it offers scalability and efficiency. Actually, I agree with you now that I've had a chance to think about this some more. Nigel Gamble [EMAIL PROTECTED] Mountain View, CA, USA. http://www.nrg.org/ MontaVista Software [EMAIL PROTECTED]
Re: Use semaphore for producer/consumer case...
On Fri, 23 Mar 2001, Stelian Pop wrote:
> I want to use a semaphore for the classic producer/consumer case
> (put the consumer to wait until X items are produced, where X != 1).
>
> If X is 1, the semaphore is a simple MUTEX, ok.
>
> But if the consumer wants to wait for several items, it doesn't
> seem to work (or something is bad in my code).
>
> What is wrong in the following ?
>
> DECLARE_MUTEX(sem);

For the producer/consumer case, you want to initialize the semaphore to 0, not 1 which DECLARE_MUTEX(sem) does. So I would use

	__DECLARE_SEMAPHORE_GENERIC(sem, 0)

The count is then the number of items produced but not yet consumed.

> producer() {
> 	/* One item produced */
> 	up(&sem);
> }
>
> consumer() {
> 	/* Let's wait for 10 items */
> 	atomic_set(&sem->count, -10);
>
> 	/* This starts the producers, they will call producer()
> 	   some time in the future */
> 	start_producers();
>
> 	/* Wait for completion */
> 	down(&sem);
> }

Then consumer could be:

	consumer() {
		int i;

		start_producers();

		/* Wait for 10 items to be produced */
		for (i = 0; i < 10; i++)
			down(&sem);
	}

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Wed, 21 Mar 2001, David S. Miller wrote: > Basically, anything which uses smp_processor_id() would need to > be holding some lock so as to not get pre-empted. Not necessarily. Another solution for the smp_processor_id() case is to ensure that the task can only be scheduled on the current CPU for the duration that the value of smp_processor_id() is used. Or, if the critical region is very short, to disable interrupts on the local CPU. Nigel Gamble [EMAIL PROTECTED] Mountain View, CA, USA. http://www.nrg.org/ MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Tue, 20 Mar 2001, Nigel Gamble wrote:
> On Tue, 20 Mar 2001, Rusty Russell wrote:
> > Thoughts?
>
> Perhaps synchronize_kernel() could take the run_queue lock, mark all the
> tasks on it and count them.  Any task marked when it calls schedule()
> voluntarily (but not if it is preempted) is unmarked and the count
> decremented.  synchronize_kernel() continues until the count is zero.

Hi Rusty,

Here is an attempt at a possible version of synchronize_kernel() that should work on a preemptible kernel. I haven't tested it yet.

static int sync_count = 0;
static struct task_struct *syncing_task = NULL;
static DECLARE_MUTEX(synchronize_kernel_mtx);

void synchronize_kernel()
{
	struct list_head *tmp;
	struct task_struct *p;

	/* Guard against multiple calls to this function */
	down(&synchronize_kernel_mtx);

	/* Mark all tasks on the runqueue */
	spin_lock_irq(&runqueue_lock);
	list_for_each(tmp, &runqueue_head) {
		p = list_entry(tmp, struct task_struct, run_list);
		if (p == current)
			continue;
		if (p->state == TASK_RUNNING ||
		    (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
			p->flags |= PF_SYNCING;
			sync_count++;
		}
	}

	if (sync_count == 0)
		goto out;

	syncing_task = current;
	spin_unlock_irq(&runqueue_lock);

	/*
	 * Cause a schedule on every CPU, as for a non-preemptible
	 * kernel
	 */

	/* Save current state */
	cpus_allowed = current->cpus_allowed;
	policy = current->policy;
	rt_priority = current->rt_priority;

	/* Create an unreal time task. */
	current->policy = SCHED_FIFO;
	current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);

	/* Make us schedulable on all CPUs. */
	current->cpus_allowed = (1UL << smp_num_cpus) - 1;

	/* Eliminate current cpu, reschedule */
	while ((current->cpus_allowed &= ~(1 << smp_processor_id())) != 0)
		schedule();

	/* Back to normal. */
	current->cpus_allowed = cpus_allowed;
	current->policy = policy;
	current->rt_priority = rt_priority;

	/*
	 * Wait, if necessary, until all preempted tasks
	 * have reached a sync point.
	 */
	spin_lock_irq(&runqueue_lock);
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (sync_count == 0)
			break;
		spin_unlock_irq(&runqueue_lock);
		schedule();
		spin_lock_irq(&runqueue_lock);
	}
	current->state = TASK_RUNNING;
	syncing_task = NULL;
out:
	spin_unlock_irq(&runqueue_lock);
	up(&synchronize_kernel_mtx);
}

And add this code to the beginning of schedule(), just after the runqueue_lock is taken (the flags field is probably not the right place to put the synchronize mark; and the test should be optimized for the fast path in the same way as the other tests in schedule(), but you get the idea):

	if ((prev->flags & PF_SYNCING) && !(prev->state & TASK_PREEMPTED)) {
		prev->flags &= ~PF_SYNCING;
		if (--sync_count == 0) {
			syncing_task->state = TASK_RUNNING;
			if (!task_on_runqueue(syncing_task))
				add_to_runqueue(syncing_task);
			syncing_task = NULL;
		}
	}

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Sat, 31 Mar 2001, george anzinger wrote:
> I think this should be:
> 	if (p->has_cpu || p->state & TASK_PREEMPTED)) {
> to catch tasks that were preempted with other states.

But the other states are all part of the state change that happens at a non-preemptive schedule() point, aren't they, so those tasks are already safe to access the data we are protecting.

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
MontaVista Software [EMAIL PROTECTED]
Re: [PATCH for 2.5] preemptible kernel
On Sat, 31 Mar 2001, Rusty Russell wrote:
> > 		if (p->state == TASK_RUNNING ||
> > 		    (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
> > 			p->flags |= PF_SYNCING;
>
> Setting a running task's flags brings races, AFAICT, and checking
> p->state is NOT sufficient, consider wait_event(): you need p->has_cpu
> here I think.

My thought here was that if p->state is anything other than TASK_RUNNING or TASK_RUNNING|TASK_PREEMPTED, then that task is already at a synchronize point, so we don't need to wait for it to arrive at another one - it will get a consistent view of the data we are protecting. wait_event() qualifies as a synchronize point, doesn't it? Or am I missing something?

> The only way I can see is to have a new element in "struct
> task_struct" saying "syncing now", which is protected by the runqueue
> lock.  This looks like (and I prefer wait queues, they have such nice
> helpers):
>
> static DECLARE_WAIT_QUEUE_HEAD(syncing_task);
> static DECLARE_MUTEX(synchronize_kernel_mtx);
> static int sync_count = 0;
>
> schedule():
> 	if (!(prev->state & TASK_PREEMPTED) && prev->syncing)
> 		if (--sync_count == 0) wake_up(&syncing_task);

Don't forget to reset prev->syncing.

I agree with you about wait queues, but didn't use them here because of the problem of avoiding deadlock on the runqueue lock, which the wait queues also use. The above code in schedule() needs the runqueue lock to protect sync_count.

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
MontaVista Software [EMAIL PROTECTED]
Re: reschedule_idle changes in ac kernels
On Mon, 4 Jun 2001, Mike Kravetz wrote:
> I just noticed the changes to reschedule_idle() in the 2.4.5-ac
> kernel.  I suspect these are the changes made for:
>
> o	Fix off by one on real time pre-emption in scheduler
>
> I'm curious if anyone has run any benchmarks before and after
> applying this fix.

I was running realtime benchmarks, which was how I found the bug.

> The reason I ask is that during the development of my multi-queue
> scheduler, I 'accidently' changed reschedule_idle code to trigger
> a preemption if preemption_goodness() was greater than 0, as
> opposed to greater than 1.  I believe this is the same change made
> to the ac kernel.  After this change, we saw a noticeable drop in
> performance for some benchmarks.
>
> The drop in performance I saw could have been the result of a
> combination of the change, and my multi-queue scheduler.  However,
> in any case aren't we now going to trigger more preemptions?
>
> I understand that we need to make the fix to get the realtime
> semantics correct, but we also need to be aware of performance in
> the non-realtime case.

The realtime bug was caused by whoever decided, sometime in 2.4, that the result of preemption_goodness() should be compared to 1 instead of 0 (without changing the comment above that function).

An alternative fix for the realtime bug would be

	weight = 1000 + (p->rt_priority * 2);

in goodness(), so that two realtime tasks with priorities that differ by 1 would have goodness values that differ by more than one. However, before anyone rushes to implement this, I'd like to suggest that any performance problems that may be found with the SCHED_OTHER goodness calculation should be fixed in goodness(), if at all possible, and not leak out as an undocumented magic number into reschedule_idle().

Nigel Gamble [EMAIL PROTECTED]
Mountain View, CA, USA. http://www.nrg.org/
MontaVista Software [EMAIL PROTECTED]
Re: Scheduling bug for SCHED_FIFO and SCHED_RR
On Fri, 20 Apr 2001, Nigel Gamble wrote:
> A SCHED_FIFO or SCHED_RR task with priority n+1 will not preempt a
> running task with priority n.  You need to give the higher priority task
> a priority of at least n+2 for it to be chosen by the scheduler.
>
> The problem is caused by reschedule_idle(), uniprocessor version:
>
>	if (preemption_goodness(tsk, p, this_cpu) > 1)
>		tsk->need_resched = 1;
>
> For real-time scheduling to work correctly, need_resched should be set
> whenever preemption_goodness() is greater than 0, not 1.

This bug is also in the SMP version of reschedule_idle().  The
corresponding fix (against 2.4.3-ac14) is:

--- 2.4.3-ac14/kernel/sched.c	Tue Apr 24 18:40:15 2001
+++ linux/kernel/sched.c	Tue Apr 24 18:41:32 2001
@@ -246,7 +246,7 @@
 	 */
 	oldest_idle = (cycles_t) -1;
 	target_tsk = NULL;
-	max_prio = 1;
+	max_prio = 0;
 
 	for (i = 0; i < smp_num_cpus; i++) {
 		cpu = cpu_logical_map(i);

Nigel Gamble                                    [EMAIL PROTECTED]
Mountain View, CA, USA.                         http://www.nrg.org/
MontaVista Software                             [EMAIL PROTECTED]
Re: #define HZ 1024 -- negative effects?
On Tue, 24 Apr 2001, Michael Rothwell wrote:
> Are there any negative effects of editing include/asm/param.h to change
> HZ from 100 to 1024?  Or any other number?  This has been suggested as a
> way to improve the responsiveness of the GUI on a Linux system.  Does it
> throw off anything else, like serial port timing, etc.?

Why not just run the X server at a realtime priority?  Then it will get
to respond to existing events, such as keyboard and mouse input,
promptly without creating lots of superfluous extra clock interrupts.
I think you will find this is a better solution.

Nigel Gamble                                    [EMAIL PROTECTED]
Mountain View, CA, USA.                         http://www.nrg.org/
MontaVista Software                             [EMAIL PROTECTED]
Viewing SCHED_FIFO, SCHED_RR stats in /proc
I've just noticed that the priority and nice values listed in
/proc/<pid>/stat aren't very useful for SCHED_FIFO or SCHED_RR tasks.
I'd like to be able to distinguish tasks with these policies from
SCHED_OTHER tasks, and to view task->rt_priority.  Am I correct that
this information is not currently available through /proc?

Here is one way to expose this information that should be compatible
with existing tools like top and ps.  For SCHED_OTHER, the values are
unchanged.  For SCHED_RR and SCHED_FIFO, the priority value displayed is
(20 + task->rt_priority), which distinguishes them from SCHED_OTHER
priorities, which can't be greater than 20.  And SCHED_FIFO tasks, whose
nice value is ignored by the scheduler, are distinguished from SCHED_RR
tasks by being displayed with a nice value of -99.

diff -u -r1.2 array.c
--- linux/fs/proc/array.c	2001/04/16 23:26:41	1.2
+++ linux/fs/proc/array.c	2001/04/26 22:37:56
@@ -336,11 +336,18 @@
 
 	collect_sigign_sigcatch(task, &sigign, &sigcatch);
 
-	/* scale priority and nice values from timeslices to -20..20 */
-	/* to make it look like a "normal" Unix priority/nice value */
-	priority = task->counter;
-	priority = 20 - (priority * 10 + DEF_COUNTER / 2) / DEF_COUNTER;
-	nice = task->nice;
+	if (task->policy == SCHED_OTHER) {
+		/* scale priority and nice values from timeslices to -20..20 */
+		/* to make it look like a "normal" Unix priority/nice value */
+		priority = task->counter;
+		priority = 20 - (priority * 10 + DEF_COUNTER / 2) / DEF_COUNTER;
+	} else {
+		priority = 20 + task->rt_priority;
+	}
+	if (task->policy == SCHED_FIFO)
+		nice = -99;
+	else
+		nice = task->nice;
 
 	read_lock(&tasklist_lock);
 	ppid = task->p_opptr->pid;

Can anyone think of a better way of doing this?

Nigel Gamble                                    [EMAIL PROTECTED]
Mountain View, CA, USA.
                                                http://www.nrg.org/
MontaVista Software                             [EMAIL PROTECTED]
Re: #define HZ 1024 -- negative effects?
On Fri, 27 Apr 2001, Mike Galbraith wrote:
> > Rubbish.  Whenever a higher-priority thread than the current
> > thread becomes runnable the current thread will get preempted,
> > regardless of whether its timeslice is over or not.
>
> What about SCHED_YIELD and allocating during vm stress times?
>
> Say you have only two tasks.  One is the gui and is allocating,
> the other is a pure compute task.  The compute task doesn't do
> anything which will cause preemption except use up its slice.
> The gui may yield the cpu but the compute job never will.
>
> (The gui won't _become_ runnable if that matters.  It's marked
> as running, has yielded its remaining slice and gone to sleep..
> with its eyes open;)

A well-written GUI should not be using SCHED_YIELD.  If it is
"allocating" anything, it won't be using SCHED_YIELD or be marked
runnable; it will be blocked, waiting until the resource becomes
available.  When that happens, it will preempt the compute task (if its
priority is high enough, which is very likely - and can be assured if
it's running at a real-time priority, as I suggested earlier).

Nigel Gamble                                    [EMAIL PROTECTED]
Mountain View, CA, USA.                         http://www.nrg.org/
MontaVista Software                             [EMAIL PROTECTED]
Re: #define HZ 1024 -- negative effects?
On Fri, 27 Apr 2001, Mike Galbraith wrote:
> On Fri, 27 Apr 2001, Nigel Gamble wrote:
> > > What about SCHED_YIELD and allocating during vm stress times?
> >
> > snip
> >
> > A well-written GUI should not be using SCHED_YIELD.  If it is
>
> I was referring to the gui (or other tasks) allocating memory during
> vm stress periods, and running into the yield in __alloc_pages()..
> not a voluntary yield.

Oh, I see.  Well, if this were causing the problem, then running the GUI
at a real-time priority would be a better solution than increasing the
clock frequency, since SCHED_YIELD has no effect on real-time tasks
unless there are other runnable real-time tasks at the same priority.
The call to schedule() would just reschedule the real-time GUI task
itself immediately.

However, in times of vm stress it is more likely that GUI performance
problems would be caused by parts of the GUI having been paged out,
rather than by anything which could be helped by scheduling differences.

Nigel Gamble                                    [EMAIL PROTECTED]
Mountain View, CA, USA.                         http://www.nrg.org/
MontaVista Software                             [EMAIL PROTECTED]
Re: [PATCH] x86 page fault handler not interrupt safe
On Mon, 7 May 2001, Linus Torvalds wrote:
> On Mon, 7 May 2001, Brian Gerst wrote:
> > This patch will still cause the user process to seg fault:  The error
> > code on the stack will not match the address in %cr2.
>
> You've convinced me.  Good thinking.  Let's do the irq thing.

I've actually seen user processes seg faulting because of this with the
fully preemptible kernel patch applied.  The fix we used in that patch
was to use an interrupt gate for the fault handler, then to simply
restore the interrupt state:

diff -Nur 2.4.2/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c
--- 2.4.2/arch/i386/kernel/traps.c	Mon Mar 26 18:41:05 2001
+++ linux/arch/i386/kernel/traps.c	Tue Mar 27 15:13:33 2001
@@ -973,7 +973,7 @@
 	set_trap_gate(11,&segment_not_present);
 	set_trap_gate(12,&stack_segment);
 	set_trap_gate(13,&general_protection);
-	set_trap_gate(14,&page_fault);
+	set_intr_gate(14,&page_fault);
 	set_trap_gate(15,&spurious_interrupt_bug);
 	set_trap_gate(16,&coprocessor_error);
 	set_trap_gate(17,&alignment_check);
diff -Nur 2.4.2/arch/i386/mm/fault.c linux/arch/i386/mm/fault.c
--- 2.4.2/arch/i386/mm/fault.c	Mon Mar 26 18:41:06 2001
+++ linux/arch/i386/mm/fault.c	Tue Mar 27 15:13:33 2001
@@ -117,6 +117,9 @@
 	/* get the address */
 	__asm__("movl %%cr2,%0":"=r" (address));
 
+	/* It's safe to allow preemption after cr2 has been saved */
+	local_irq_restore(regs->eflags);
+
 	tsk = current;
 
 	/*

Nigel Gamble                                    [EMAIL PROTECTED]
Mountain View, CA, USA.                         http://www.nrg.org/
MontaVista Software                             [EMAIL PROTECTED]
Re: [PATCH] x86 page fault handler not interrupt safe
On Mon, 7 May 2001, Brian Gerst wrote:
> Nigel Gamble wrote:
> >
> > On Mon, 7 May 2001, Linus Torvalds wrote:
> > > On Mon, 7 May 2001, Brian Gerst wrote:
> > > > This patch will still cause the user process to seg fault:  The error
> > > > code on the stack will not match the address in %cr2.
> > >
> > > You've convinced me.  Good thinking.  Let's do the irq thing.
> >
> > I've actually seen user processes seg faulting because of this with the
> > fully preemptible kernel patch applied.  The fix we used in that patch
> > was to use an interrupt gate for the fault handler, then to simply
> > restore the interrupt state:
>
> Keep in mind that regs->eflags could be from user space, and could have
> some undesirable flags set.  That's why I did a test/sti instead of

Good point.

> reloading eflags.  Plus my patch leaves interrupts disabled for the
> minimum time possible.

I'm not sure that it makes much difference, as interrupts are disabled
for such a short time anyway.  I'd prefer to put the test/sti in
do_page_fault(), and reduce the complexity needed in assembler routines
as much as possible, for maintainability reasons.

Nigel Gamble                                    [EMAIL PROTECTED]
Mountain View, CA, USA.                         http://www.nrg.org/
MontaVista Software                             [EMAIL PROTECTED]