Re: [Kgdb-bugreport] [PATCH 1/5] KGDB: improve early init

2008-01-31 Thread George Anzinger

On 01/31/2008 01:36 AM,  Jan Kiszka was caught saying:
> Jan Kiszka wrote:
>> George Anzinger wrote:
>>> On 01/30/2008 04:08 PM,  Jan Kiszka was caught saying:
>>>> [Here comes a rebased version against latest x86/mm]
>>>>
>>>> In case "kgdbwait" is passed as kernel parameter, KGDB tries to set up
>>>> and connect to the front-end already during early_param evaluation.
>>>> This
>>>> fails on x86 as the exception stack is not yet initialized, 
effectively

>>>> delaying kgdbwait until late-init.
>>>
>>> I wonder how much work it would take to just set up the exception
>>> stack and proceed.  After all the kgdbwait is there to help debug
>>> very early kernel code...
>>
>> In principle a valid question, but I'm not the one to answer it. I
>> would not feel very well if I had to reorder this critical setup code.
>> Look, we would have to move trap_init in start_kernel before
>> parse_early_param, and that would affect _every_ arch...

I cannot speak for other archs, but for x86 I called trap_init from the 
code that caught the kgdbwait.  At the time (since I retired, I have not 
looked at the actual kernel code) it could safely be called again later 
by the normal kernel start-up.  I.e. I did not try to reorder the kernel 
bring-up code, but just added an extra call to trap_init, and then only 
in the case of finding a kgdbwait.


As such, this would need to be arch specific...

>>
>
> BTW, do you know if EXCEPTION_STACK_READY fails for other archs in
> parse_early_param as well? It should, because my understanding of
> trap_init is that it's the function that arms things like... exception
> handlers. And that raises the question of the deeper purpose of this
> check (and the invocation of kgdb_early_init from the argument parsing
> function). Sigh, KGDB is still a quite improvable piece of code.

Likely.  Once you get it into the mainline kernel, one would hope that 
other arch code would be forthcoming, as many more "eyes" will be in play.

>
> Jan
>
> PS: Can we move this to some public list?

Sure, sorry; I picked the wrong reply button and never intended it to be 
private.

>

--
George Anzinger   [EMAIL PROTECTED]




Re: [QUESTION] 2.4.x nice level

2001-04-09 Thread george anzinger

SodaPop wrote:
> 
> I too have noticed that nicing processes does not work nearly as
> effectively as I'd like it to.  I run on an underpowered machine,
> and have had to stop running things such as seti because it steals too
> much cpu time, even when maximally niced.
> 
> As an example, I can run mpg123 and a kernel build concurrently without
> trouble; but if I add a single maximally niced seti process, mpg123 runs
> out of gas and will start to skip while decoding.
> 
> Is there any way we can make nice levels stronger than they currently are
> in 2.4?  Or is this perhaps a timeslice problem, where once seti gets cpu
> time it runs longer than it should since it makes relatively few system
> calls?
> 
In kernel/sched.c, for HZ < 200 the nice-to-tick adjustment is set up
as nice>>2 (i.e. nice/4).  This gives the ratio of nice to time
slice.  Adjustments are made so that the MOST niced task still yields 1
jiffy; using this scale, and remembering that nice ranges from -20 to 19,
the least niced task gets 40/4 or 10 ticks.  This implies that if only
two tasks are running and they are the most and least niced, one will get
1/11 of the processor and the other 10/11 (about 10% and 90%).  If one is
niced and the other is not, the time slices are 1 and 5, or 1/6 and 5/6
(17% and 83%).  
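
To make the arithmetic concrete, here is a stand-alone sketch of that
mapping (an illustration only, not the kernel's exact NICE_TO_TICKS
macro; nice_to_ticks() below is a made-up helper):

/*
 * Stand-alone illustration of the mapping described above: roughly
 * (20 - nice) / 4 ticks, clamped so the most niced task still gets
 * one jiffy.
 */
#include <stdio.h>

static int nice_to_ticks(int nice)          /* illustrative helper only */
{
        int ticks = (20 - nice) >> 2;       /* nice/4 over the 40-step range */
        return ticks > 0 ? ticks : 1;       /* most niced task keeps 1 jiffy */
}

int main(void)
{
        int least = nice_to_ticks(-20);     /* least niced: 10 ticks */
        int most  = nice_to_ticks(19);      /* most niced:   1 tick  */

        printf("slices %d and %d -> %.0f%% vs %.0f%% of the CPU\n",
               least, most,
               100.0 * least / (least + most),
               100.0 * most  / (least + most));
        return 0;
}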

In 2.2.x systems the full range of nice was mapped one-to-one, giving
time slices of 1 and 39 (or 40), i.e. about 2.5% and 97.5%, for a most
niced task versus a least niced one.  For most niced versus normal you
would get 1 and 20, or 4.7% and 95.3%.

The comments say the objective is to come up with a time slice of 50 ms,
presumably for the normal nice value of zero.  After translating the
range this is a value of 20 and, yes, 20/4 gives 5 jiffies or 50 ms.  It
sure puts a crimp in the min-to-max range, however.

George



Re: [QUESTION] 2.4.x nice level

2001-04-10 Thread george anzinger

Rik van Riel wrote:
> 
> On Mon, 9 Apr 2001, george anzinger wrote:
> > SodaPop wrote:
> > >
> > > I too have noticed that nicing processes does not work nearly as
> > > effectively as I'd like it to.  I run on an underpowered machine,
> > > and have had to stop running things such as seti because it steals too
> > > much cpu time, even when maximally niced.
> 
> > In kernel/sched.c for HZ < 200 an adjustment of nice to tick is set up
> > to be nice>>2 (i.e. nice /4).  This gives the ratio of nice to time
> > slice.  Adjustments are made to make the MOST nice yield 1 jiffy, so
> [snip 2.4 nice scale is too limited]
> 
> I'll try to come up with a recalculation change that will make
> this thing behave better, while still retaining the short time
> slices for multiple normal-priority tasks and the cache footprint
> schedule() and friends currently have...
> 
> [I've got some vague ideas ... give me a few hours to put them
> into code ;)]

You might check out this:

http://rtsched.sourceforge.net/

I did some work on leveling out the recalculation overhead.  I think, as
the code shows, that it can be done without dropping the run queue lock.

I wonder if the wave nature of the recalculation cycle is a problem.  By
this I mean that just after a recalculation tasks run for relatively long
times (50 ms today), but as the next recalculation approaches the time
drops toward 10 ms.  It gets one thinking about a way to come up with a
mix that is more uniform over time.

George



Re: No 100 HZ timer !

2001-04-10 Thread george anzinger

Just for your information we have a project going that is trying to come
up with a good solution for all of this:

http://sourceforge.net/projects/high-res-timers

We have a mailing list there where we have discussed much of the same
stuff.  The mailing list archives are available at sourceforge.

Let's separate this into findings and tentative conclusions :)

Findings:

a) The University of Kansas and others have done a lot of work here.

b) High resolution timer events can be had with or without changing HZ.

c) High resolution timer events can be had with or without eliminating
the 1/HZ tick.

d) The organization of the timer list should reflect the existence of
the 1/HZ tick or not.  The current structure is not optimal for a "tick
less" implementation.  Better would be strict expire order with indexes
to "interesting times".

e) The current organization of the timer list generates a hiccup every
2.56 seconds to handle "cascading".  Hiccups are bad.

f) As noted, the account timers (task user/system times) would be much
more accurate with the tick less approach.  The cost is added code in
both the system call and the schedule path.  

Tentative conclusions:

Currently we feel that the tick less approach is not acceptable due to
(f).  We felt that this added code would NOT be welcome AND would, in a
reasonably active system, have much higher overhead than any savings in
not having a tick.  Also (d) implies a list organization that will, at
the very least, be harder to understand.  (We have some thoughts here,
but abandoned the effort because of (f).)  We are, of course, open to
discussion on this issue and all others related to the project
objectives.

We would reorganize the current timer list structure to eliminate the
cascade (e) and to add higher resolution entries.  The higher resolution
entries would carry an additional word which would be the fraction of a
jiffie that needs to be added to the jiffie value for the timer.  This
fraction would be in units defined by the platform to best suit the sub
jiffie interrupt generation code.  Each of the timer lists would then be
ordered by time based on this sub jiffie value.  In addition, in order
to eliminate the cascade, each timer list would carry all timers for
times that expire on the (jiffie mod (size of list)).  Thus, with the
current 256 first order lists, all timers with the same (jiffies & 255)
would be in the same list, again in expire order.  We also think that
the list size should be configurable to some power of two.  Again we
welcome discussion of these issues.
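
For what it's worth, here is a rough sketch of what such a structure
might look like (names, field layout, and locking below are illustrative
only, not a design):

/*
 * Rough sketch of the proposed layout: one list per (expires mod
 * TVEC_SIZE) slot, each kept fully sorted by jiffie and sub-jiffie
 * value, so nothing ever needs to cascade.
 */
#include <linux/list.h>
#include <linux/spinlock.h>

#define TVEC_SIZE 256                   /* would be a config-time power of two */

struct hr_timer {
        struct list_head list;
        unsigned long expires;          /* absolute expiry, in jiffies */
        unsigned long sub_expires;      /* fraction of a jiffie, platform units */
        void (*function)(unsigned long);
        unsigned long data;
};

static struct list_head hr_vec[TVEC_SIZE];
static spinlock_t hr_lock = SPIN_LOCK_UNLOCKED;

/* ordered insert into the slot that owns this expiry's jiffie value */
static void hr_timer_insert(struct hr_timer *t)
{
        struct list_head *head = &hr_vec[t->expires & (TVEC_SIZE - 1)];
        struct list_head *pos;

        spin_lock(&hr_lock);
        list_for_each(pos, head) {
                struct hr_timer *cur = list_entry(pos, struct hr_timer, list);

                if (cur->expires > t->expires ||
                    (cur->expires == t->expires &&
                     cur->sub_expires > t->sub_expires))
                        break;
        }
        list_add_tail(&t->list, pos);   /* insert just before the later entry */
        spin_unlock(&hr_lock);
}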

George

Alan Cox wrote:

>> It's also all interrupts, not only syscalls, and also context switch if you
>> want to be accurate.

>We dont need to be that accurate. Our sample rate is currently so low the
>data is worthless anyway



Re: No 100 HZ timer !

2001-04-10 Thread george anzinger

mark salisbury wrote:
> 
> george anzinger wrote:
> 
> > f) As noted, the account timers (task user/system times) would be much
> > more accurate with the tick less approach.  The cost is added code in
> > both the system call and the schedule path.
> >
> > Tentative conclusions:
> >
> > Currently we feel that the tick less approach is not acceptable due to
> > (f).  We felt that this added code would NOT be welcome AND would, in a
> > reasonably active system, have much higher overhead than any savings in
> > not having a tick.  Also (d) implies a list organization that will, at
> > the very least, be harder to understand.  (We have some thoughts here,
> > but abandoned the effort because of (f).)  We are, of course, open to
> > discussion on this issue and all others related to the project
> > objectives.
> 
> f does not imply tick-less is not acceptable, it implies that better process time
> accounting is not acceptable.

My thinking is that a timer implementation that forced (f) would have
problems gaining acceptance (even with me :).  I think a tick less
system does force this and this is why we have, at least for the moment,
abandoned it.  In no way does this preclude (f) as it is compatible with
either ticks or tick less time keeping.  On the other hand, the stated
project objectives do not include (f) unless, of course we do a tick
less time system.
> 
> list organization is not complex, it is a sorted absolute time list.  I would
> argue that this is a hell of a lot easier to understand than ticks + offsets.

The complexity comes in when you want to maintain indexes into the list
for quick insertion of new timers.  To get the current insert
performance, for example, you would need pointers to (at least) each of
the next 256 centasecond boundaries in the list.  But the list ages, so
these pointers need to be continually updated.  The thought I had was to
update needed pointers (and only those needed) each time an insert was
done and a needed pointer was found to be missing or stale.  Still it
adds complexity that the static structure used now doesn't have.
> 
> still, better process time accounting should be a compile CONFIG option, not
> ignored and ruled out because someone thinks that it is too expensive in the
> general case.

As I said above, we are not ruling it out, but rather, we are not
requiring it by going tick less.
> 
> the whole point of linux and CONFIG options is to get you the kernel with the
> features you want, not what someone else wants.
> 
> there should be a whole range of config options associated with this issue:
> 
> CONFIG_JIFFIES   == old jiffies implementation
> CONFIG_TICKLESS  == tickless
> CONFIG_HYBRID  == old jiffies plus a tickless high-res timer system on
> the side but not assoc w/ process and global
> timekeeping
> 
> CONFIG_USELESS_PROCESS_TIME_ACCOUNTING = old style, cheap time acctg
> CONFIG_USEFUL_BUT_COSTS_TIME_ACCOUNTING = accurate but expensive time accounting
> 
> this way, users who want tickless and lousy time acctg can have it AND people who
> want jiffies and good time acctg could have it.

As I said, it is not clear how you could get
CONFIG_USELESS_PROCESS_TIME_ACCOUNTING unless you did a tick every
jiffie.  What did you have in mind?
> 
> these features are largely disjoint and easily seperable.  it is also relatively
> trivial to do this in such a way that drivers depending on the jiffie abstraction
> can be supported without modification no matter what the configuration.
> 
For the most part, I agree.  I am not sure that it makes a lot of sense
to mix some of these options, however.  I think it comes down to the
question of benefit vs cost.  Keeping an old version around that is
not any faster or more efficient in any way seems too costly to
me.  We would like to provide a system that is better in every way and
thus eliminate the need to keep the old one around.  We could leave it
in as a compile option so folks would have a fall back, I suppose.

An issue no one has raised is that the tick less system would need to
start a timer each time it scheduled a task.  This would lead to either
slow but very precise time slicing or about what we have today with more
schedule overhead.

George


> Mark Salisbury



Re: No 100 HZ timer !

2001-04-10 Thread george anzinger

Mark Salisbury wrote:
> 
> > mark salisbury wrote:
> > >
> > > george anzinger wrote:
> > >
> > > > f) As noted, the account timers (task user/system times) would be much
> > > > more accurate with the tick less approach.  The cost is added code in
> > > > both the system call and the schedule path.
> > > >
> > > > Tentative conclusions:
> > > >
> > > > Currently we feel that the tick less approach is not acceptable due to
> > > > (f).  We felt that this added code would NOT be welcome AND would, in
> a
> > > > reasonably active system, have much higher overhead than any savings
> in
> > > > not having a tick.  Also (d) implies a list organization that will, at
> > > > the very least, be harder to understand.  (We have some thoughts here,
> > > > but abandoned the effort because of (f).)  We are, of course, open to
> > > > discussion on this issue and all others related to the project
> > > > objectives.
> > >
> > > f does not imply tick-less is not acceptable, it implies that better
> process time
> > > accounting is not acceptable.
> >
> > My thinking is that a timer implementation that forced (f) would have
> > problems gaining acceptance (even with me :).  I think a tick less
> > system does force this and this is why we have, at least for the moment,
> > abandoned it.  In no way does this preclude (f) as it is compatible with
> > either ticks or tick less time keeping.  On the other hand, the stated
> > project objectives do not include (f) unless, of course we do a tick
> > less time system.
> > >
> > > list organization is not complex, it is a sorted absolute time list.  I
> would
> > > argue that this is a hell of a lot easier to understand that ticks +
> offsets.
> >
> > The complexity comes in when you want to maintain indexes into the list
> > for quick insertion of new timers.  To get the current insert
> > performance, for example, you would need pointers to (at least) each of
> > the next 256 centasecond boundaries in the list.  But the list ages, so
> > these pointers need to be continually updated.  The thought I had was to
> > update needed pointers (and only those needed) each time an insert was
> > done and a needed pointer was found to be missing or stale.  Still it
> > adds complexity that the static structure used now doesn't have.
> 
> actually, I think a head and tail pointer would be sufficient for most
> cases. (most new timers are either going to be a new head of list or a new
> tail, i.e. long duration timeouts that will never be serviced or short
> duration timers that are going to go off "real soon now (tm)")  the oddball
> cases would be mostly coming from user-space, i.e. nanosleep, where a longer
> list insertion disappears in the block/wakeup/context switch overhead
> 
> > >
> > > still, better process time accounting should be a compile CONFIG option,
> not
> > > ignored and ruled out because some one thinks that is is to expensive in
> the
> > > general case.
> >
> > As I said above, we are not ruling it out, but rather, we are not
> > requiring it by going tick less.
> > As I said, it is not clear how you could get
> > CONFIG_USELESS_PROCESS_TIME_ACCOUNTING unless you did a tick every
> > jiffie.  What did you have in mind?
> 
> time accounting can be limited to the quantum expiration and voluntary
> yields in the tickless/useless case.
> 
> > For the most part, I agree.  I am not sure that it makes a lot of sense
> > to mix some of these options, however.  I think it comes down to the
> > question of benefit vs cost.  If keeping an old version around that is
> > not any faster or more efficient in any way would seem too costly to
> > me.  We would like to provide a system that is better in every way and
> > thus eliminate the need to keep the old one around.  We could leave it
> > in as a compile option so folks would have a fall back, I suppose.
> 
> I agree that some combinations don't make much sense _TO_ME_ but that
> doesn't mean they don't meet sombody's needs.
> 
> in my case (embedded, medium hard real time, massively parallel
> multicomputer)  the only choices that makes sense to my customers is
> tickless/useless in deployment and tickless/useful in
> development/profiling/optimization.

I suspect you might go for ticked if its overhead was less.  The thing
that makes me think the overhead is high for tick less is the accounting
and time slice stuff.  This has to be handled each

Re: [test-PATCH] Re: [QUESTION] 2.4.x nice level

2001-04-11 Thread george anzinger

One rule of optimization is to move any code you can outside the loop. 
Why isn't the nice_to_ticks calculation done when nice is changed
instead of at EVERY recalc?  I guess another way to ask this is, who needs
to see the original nice?  Would it be worth another task_struct entry
to move this calculation out of the loop?
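
To illustrate what I mean (hypothetical names, not a patch; "tick_bonus"
is a made-up task_struct field, and this is only a sketch against 2.4
kernel/sched.c):

/*
 * Convert nice to ticks once, where nice is set, and cache it, so the
 * recalculation loop becomes a pure add.
 */

/* would be called from sys_nice()/sys_setpriority() */
static inline void set_task_nice(struct task_struct *p, long nice)
{
        p->nice = nice;
        p->tick_bonus = NICE_TO_TICKS(nice);    /* computed once, outside the loop */
}

/* the recalculation then no longer needs the nice value at all */
static void recalculate_counters(void)
{
        struct task_struct *p;

        read_lock(&tasklist_lock);
        for_each_task(p)
                p->counter = (p->counter >> 1) + p->tick_bonus;
        read_unlock(&tasklist_lock);
}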

George

Rik van Riel wrote:
> 
> On Tue, 10 Apr 2001, Rik van Riel wrote:
> 
> > I'll try to come up with a recalculation change that will make
> > this thing behave better, while still retaining the short time
> > slices for multiple normal-priority tasks and the cache footprint
> > schedule() and friends currently have...
> 
> OK, here it is. It's nothing like montavista's singing-dancing
> scheduler patch that does all, just a really minimal change that
> should stretch the nice levels to yield the following CPU usage:
> 
> Nice    0    5   10   15   19
> %CPU  100   56   25    6    1
> 
> Note that the code doesn't change the actual scheduling code,
> just the recalculation. Care has also been taken to not increase
> the cache footprint of the scheduling and recalculation code.
> 
> I'd love to hear some test results from people who are interested
> in wider nice levels. How does this run on your system?  Can you
> trigger bad behaviour in any way?
> 
> regards,
> 
> Rik
> --
> Virtual memory is like a game you can't win;
> However, without VM there's truly nothing to lose...
> 
> http://www.surriel.com/
> http://www.conectiva.com/   http://distro.conectiva.com.br/
> 
> --- linux-2.4.3-ac4/kernel/sched.c.orig Tue Apr 10 21:04:06 2001
> +++ linux-2.4.3-ac4/kernel/sched.c  Wed Apr 11 06:18:46 2001
> @@ -686,8 +686,26 @@
> struct task_struct *p;
> spin_unlock_irq(&runqueue_lock);
> read_lock(&tasklist_lock);
> -   for_each_task(p)
> +   for_each_task(p) {
> +   if (p->nice <= 0) {
> +   /* The normal case... */
> p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
> +   } else {
> +   /*
> +* Niced tasks get less CPU less often, leading to
> +* the following distribution of CPU time:
> +*
> +* Nice    0    5   10   15   19
> +* %CPU  100   56   25    6    1
> +*/
> +   short prio = 20 - p->nice;
> +   p->nice_calc += prio;
> +   if (p->nice_calc >= 20) {
> +   p->nice_calc -= 20;
> +   p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
> +   }
> +   }
> +   }
> read_unlock(&tasklist_lock);
> spin_lock_irq(&runqueue_lock);
> }
> --- linux-2.4.3-ac4/include/linux/sched.h.orig  Tue Apr 10 21:04:13 2001
> +++ linux-2.4.3-ac4/include/linux/sched.h   Wed Apr 11 06:26:47 2001
> @@ -303,7 +303,8 @@
>   * the goodness() loop in schedule().
>   */
> long counter;
> -   long nice;
> +   short nice_calc;
> +   short nice;
> unsigned long policy;
> struct mm_struct *mm;
> int has_cpu, processor;



Re: No 100 HZ timer !

2001-04-11 Thread george anzinger

Jamie Lokier wrote:
> 
> Mark Salisbury wrote:
> > > The complexity comes in when you want to maintain indexes into the list
> > > for quick insertion of new timers.  To get the current insert
> > > performance, for example, you would need pointers to (at least) each of
> > > the next 256 centasecond boundaries in the list.  But the list ages, so
> > > these pointers need to be continually updated.  The thought I had was to
> > > update needed pointers (and only those needed) each time an insert was
> > > done and a needed pointer was found to be missing or stale.  Still it
> > > adds complexity that the static structure used now doesn't have.
> >
> > actually, I think a head and tail pointer would be sufficient for most
> > cases. (most new timers are either going to be a new head of list or a new
> > tail, i.e. long duration timeouts that will never be serviced or short
> > duration timers that are going to go off "real soon now (tm)")  the oddball
> > cases would be mostly coming from user-space, i.e. nanosleep, where a longer
> > list insertion disappears in the block/wakeup/context switch overhead
> 
> A pointer-based priority queue is really not a very complex thing, and
> there are ways to optimise them for typical cases like the above.
> 
Please do enlighten me.  The big problem in my mind is that the
pointers, pointing at points in time, are perishable.

George



Re: [Lse-tech] Bug in sys_sched_yield

2001-04-12 Thread george anzinger

Hubertus Franke wrote:
> 
> In the recent optimizations to sys_sched_yield a bug was introduced.
> In the current implementation of sys_sched_yield()
> the aligned_data and idle_tasks are indexed by logical cpu-#.
> 
> They should however be indexed by physical cpu-#.
> Since logical==physical on the x86 platform, it doesn't matter there,
> for other platforms where this is not true it will matter.
> Below is the fix.
> 
Uh...  I do know about this map, but I wonder if it is at all needed. 
What is the real difference between a logical cpu and the physical one? 
Or is this only interesting if the machine is not SMP, i.e. all the cpus
are not the same?  It just seems to me that introducing an additional
mapping just slows things down and, if all the cpus are the same, does
not really do anything.  Of course, I am assuming that ALL usage would
be to the logical :)

George



Re: [Lse-tech] Bug in sys_sched_yield

2001-04-12 Thread george anzinger

Walt Drummond wrote:
> 
> george anzinger writes:
> > Uh...  I do know about this map, but I wonder if it is at all needed.
> > What is the real difference between a logical cpu and the physical one.
> > Or is this only interesting if the machine is not Smp, i.e. all the cpus
> > are not the same?  It just seems to me that introducing an additional
> > mapping just slows things down and, if all the cpus are the same, does
> > not really do anything.  Of course, I am assuming that ALL usage would
> > be to the logical :)
> 
> Right.  That is not always the case.  IA32 is somewhat special. ;) The
> logical mapping allows you to, among other things, easily enumerate
> over the set of active processors without having to check if a
> processor exists at the current processor address.
> 
> The difference is apparent when the physical CPU ID is, say, an
> address on a processor bus, or worse, an address on a set of processor
> busses.  Take a look at the IA-64's smp.h.  The IA64 physical
> processor ID is a 64-bit structure that has two 8-bit IDs; an EID for
> what amounts to a "processor bus" ID and an ID that corresponds to a
> specific processor on a processor bus.  Together, they're a system
> global ID for a specific processor.  But there is no guarantee that
> the set of global ID's will be contiguous.
> 
> It's possible to have disjoint (non-contiguous) physical processor
> ID's if a processor bus is not completely populated, or there is an
> empty processor slot or odd processor numbering in firmware, or
> whatever.
> 
All that is cool.  Still, in most places we don't really address the
processor, so the logical cpu number is all we need.  Places like
sched_yield, for example, should be using this, not the actual number,
which IMO should only be used when, for some reason, we NEED the hard
address of the cpu.  I don't think this ever has to leak out to the
common kernel code, or am I missing something here?

George



Re: No 100 HZ timer!

2001-04-12 Thread george anzinger

Bret Indrelee wrote:
> 
> Mikulas Patocka ([EMAIL PROTECTED]) wrote:
> > Adding and removing timers happens much more frequently than PIT tick,
> > so
> > comparing these times is pointless.
> >
> > If you have some device and timer protecting it from lockup on buggy
> > hardware, you actually
> >
> > send request to device
> > add timer
> >
> > receive interrupt and read reply
> > remove timer
> >
> > With the current timer semantics, the cost of add timer and del timer is
> > nearly zero. If you had to reprogram the PIT on each request and reply,
> > it
> > would slow things down.
> >
> > Note that you call mod_timer also on each packet received - and in worst
> > case (which may happen), you end up reprogramming the PIT on each
> > packet.
> 
> You can still have nearly zero cost for the normal case. Avoiding worst
> case behaviour is also pretty easy.
> 
> You only reprogram the PIT if you have to change the interval.
> 
> Keep all timers in a sorted double-linked list. Do the insert
> intelligently, adding it from the back or front of the list depending on
> where it is in relation to existing entries.

I think this is too slow, especially for a busy system, but there are
solutions...
> 
> You only need to reprogram the interval timer when:
> 1. You've got a new entry at the head of the list
> AND
> 2. You've reprogrammed the interval to something larger than the time to
> the new head of list.

Uh, I think 1. IMPLIES 2.  If it is at the head, it must be closer in
than what the system is waiting for (unless, of course, its time has
already passed, but let's not consider that here).
> 
> In the case of a device timeout, it is usually not going to be inserted at
> the head of the list. It is very seldom going to actually timeout.

Right, and if the system doesn't have many timers, thus putting this at
the head, it has the time to do the extra work.
> 
> Choose your interval wisely, only increasing it when you know it will pay
> off. The best way of doing this would probably be to track some sort
> of LCD for timeouts.

I wonder about a more relaxed device timeout semantic that says
something like: wait X + next timer interrupt.  This would cause the
timer insert code to find an entry at least X out and put this timer
just after it.  There are other ways to do this of course.  The question
here is: Would this be worthwhile?
> 
> The real trick is to do a lot less processing on every tick than is
> currently done. Current generation PCs can easily handle 1000s of
> interrupts a second if you keep the overhead small.

I don't see the logic here.  Having taken the interrupt, one would tend
to want to do as much as possible, rather than schedule another
interrupt to continue the processing.  Rather, I think you are trying to
say that we can afford to take more interrupts for time keeping.  Still,
I think what we are trying to get with tick less timers is a system that
takes FEWER interrupts, not more.
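
Just to pin down what is being compared, the "only touch the PIT when
the head changes" insert could be sketched roughly like this
(ordered_insert(), earliest_timer() and set_next_interrupt() are
hypothetical helpers):

/*
 * The interval hardware is only reprogrammed when the new timer
 * becomes the earliest pending expiry.
 */
static void add_timer_tickless(struct timer_list *timer)
{
        unsigned long flags;

        spin_lock_irqsave(&timerlist_lock, flags);
        ordered_insert(timer);                   /* sorted insert, no hardware access */
        if (earliest_timer() == timer)           /* new head of the list ...          */
                set_next_interrupt(timer->expires); /* ... is the only case that reprograms */
        spin_unlock_irqrestore(&timerlist_lock, flags);
}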

George



Re: No 100 HZ timer!

2001-04-12 Thread george anzinger

Bret Indrelee wrote:
> 
> On Thu, 12 Apr 2001, george anzinger wrote:
> > Bret Indrelee wrote:
> > > Keep all timers in a sorted double-linked list. Do the insert
> > > intelligently, adding it from the back or front of the list depending on
> > > where it is in relation to existing entries.
> >
> > I think this is too slow, especially for a busy system, but there are
> > solutions...
> 
> It is better than the current solution.

Uh, what are we comparing here?  The current timer list insert is real
close to O(1) and never more than O(5).
> 
> The insert takes the most time, having to scan through the list. If you
> had to scan the whole list it would be O(n) with a simple linked list. If
> you insert it from the end, it is almost always going to be less than
> that.

Right, but compared to the current O(5) max, this is just too long.
> 
> The time to remove is O(1).
> 
> Fetching the first element from the list is also O(1), but you may have to
> fetch several items that have all expired. Here you could do something
> clever. Just make sure it is O(1) to determine if the list is empty.
> 
I would hope to move expired timers to another list or just process
them.  In any case they should not be a problem here.

One of the posts that started all this mentioned a tick less system (on
a 360 I think) that used the current time list.  They had to scan
forward in time to find the next event, and every 10 ms was a new list to
look at.  Conclusion: the current list structure is NOT organized for
tick less time keeping.

George



Re: Linux-Kernel Archive: No 100 HZ timer !

2001-04-12 Thread george anzinger

Andre Hedrick wrote:
> 
> On Fri, 13 Apr 2001, Alan Cox wrote:
> 
> > > Okay but what will be used for a base for hardware that has critical
> > > timing issues due to the rules of the hardware?
> >
> > > #define WAIT_MIN_SLEEP  (2*HZ/100)  /* 20msec - minimum sleep time */
> > >
> > > Give me something for HZ or a rule for getting a known base so I can have
> > > your storage work and not corrupt.
> >
> >
> > The same values would be valid with add_timer and friends regardless. Its just
> > that people who do
> >
> >   while(time_before(jiffies, started+DELAY))
> >   {
> >   if(poll_foo())
> >   break;
> >   }
> >
> > would need to either use add_timer or we could implement get_jiffies()

Actually we could do the same thing they did for errno, i.e.

#define jiffies get_jiffies()
extern unsigned get_jiffies(void);
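
In a tick less system such a get_jiffies() might look roughly like this
(every name below - last_jiffies, last_count, read_hw_counter,
COUNTS_PER_JIFFIE - is an assumption, used only to show the shape):

/*
 * The jiffies value recorded at the last timer interrupt plus whatever
 * has elapsed on a free-running hardware counter since then.
 */
unsigned get_jiffies(void)
{
        unsigned long count = read_hw_counter();

        return last_jiffies + (count - last_count) / COUNTS_PER_JIFFIE;
}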

> 
> Okay regardless of the call what is it going to be or do we just random
> and go oh-crap data!?!?
> 
> Since HZ!==100 on all archs that have ATA/ATAPI support, it is a miracle
> that FS corruption and system death is not more rampant, except for the
> fact that hardware is quick by a factor of 10+ so that 1000 does not quite
> do as much harm but the associated mean of HZ changes and that is a
> problem with slower hardware.

No, not really.  HZ still defines the units of jiffies and almost all the
timing is still related to it.  It's just that interrupts are only "set
up" when a "real" time event is due.

George



POSIX 52 53? 54

2001-04-12 Thread george anzinger

Anyone know anything about a POSIX draft 52 or 53 or 54?  I think they
are supposed to have something to do with real time.

Where can they be found?  What do they imply for the kernel?

George



Re: Linux-Kernel Archive: No 100 HZ timer !

2001-04-13 Thread george anzinger

"Eric W. Biederman" wrote:
> 
> Andre Hedrick <[EMAIL PROTECTED]> writes:
> 
> > On Thu, 12 Apr 2001, george anzinger wrote:
> >
> > > Actually we could do the same thing they did for errno, i.e.
> > >
> > > #define jiffies get_jiffies()
> > > extern unsigned get_jiffies(void);
> >
> > > No, not really.  HZ still defines the units of jiffies and most all the
> > > timing is still related to it.  Its just that interrupts are only "set
> > > up" when a "real" time event is due.
> >
> > Great HZ always defines units of jiffies, but that is worthless if there
> > is not a ruleset that tells me a value to divide by to return it to a
> > specific quantity of time.

The definition of HZ is the number of jiffies in a second.  Thus if HZ is
100 we are talking about a 10 ms jiffie.
> 
> Actually in rearranging it.  jiffies should probably be redefined as
> the smallest sleep we are prepared to take.  And then HZ because the
> number of those smallest sleeps per second.  So we might see HZ values
> up in the millions but otherwise things should be pretty much as
> normal.

Actually I think it is more useful to define it as something like 100
for the following reasons.  

I think it makes the most sense to keep jiffie as a simple unsigned
int.  If we leave drivers, and other code as is they can deal with
single word (32 bit) values and get reasonable results.  If we make HZ
too high (say 1,000,000 to get microsecond resolution) we will start
limiting the max time we can handle, in this case to about 71.5 minutes.
(Actually 1/2 this value would start giving us trouble.)  HZ only
affects the kernel internals (the user API is either seconds/micro
seconds or seconds/nano seconds).  For those cases where we want a higher
resolution we just add a sub jiffie component.  Another way of looking
at this is to set up HZ as the "normal" resolution.  This would be the
resolution (as it is today) of the usual API calls.  Only calls to the
POSIX 1b timers would be allowed to have higher resolution.  I propose
that we use the POSIX standard to define "CLOCKS" with various
resolution, with the understanding that the higher resolutions will have
higher overhead.  And yet another consideration: to get high resolution
with a reasonable maximum timer interval we will need to use two words
in the timer.  I think it makes sense to use the jiffie (i.e. 1/HZ) as
the high order part of the timer's time.  

Note that all of these considerations on jiffie size hold with or with
out a tick less system.
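
A quick back-of-the-envelope check of the wrap argument (plain
user-space C, assuming a 32-bit jiffies counter; nothing kernel
specific):

#include <stdio.h>

int main(void)
{
        unsigned long hz[] = { 100, 1000, 10000, 1000000 };
        int i;

        for (i = 0; i < 4; i++) {
                double wrap = 4294967296.0 / hz[i];     /* 2^32 jiffies, in seconds */

                printf("HZ=%-8lu wraps after %12.0f s (%8.1f hours); "
                       "signed compares break at half that\n",
                       hz[i], wrap, wrap / 3600.0);
        }
        return 0;
}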

George



Re: No 100 HZ timer!

2001-04-13 Thread george anzinger

Ben Greear wrote:
> 
> Bret Indrelee wrote:
> >
> > On Thu, 12 Apr 2001, george anzinger wrote:
> > > Bret Indrelee wrote:
> > > >
> > > > On Thu, 12 Apr 2001, george anzinger wrote:
> > > > > Bret Indrelee wrote:
> > > > > > Keep all timers in a sorted double-linked list. Do the insert
> > > > > > intelligently, adding it from the back or front of the list depending on
> > > > > > where it is in relation to existing entries.
> > > > >
> > > > > I think this is too slow, especially for a busy system, but there are
> > > > > solutions...
> > > >
> > > > It is better than the current solution.
> > >
> > > Uh, what are we comparing here?  The current timer list insert is real
> > > close to O(1) and never more than O(5).
> >
> > I don't like the cost of the cascades every (as I recall) 256
> > interrupts. This is more work than is done in the rest of the interrupt
> > processing, happens during the tick interrupt, and results in a rebuild of
> > much of the table.

Right, it needs to go, we need to eliminate the "lumps" in time :)
> >
> > -Bret
> 
> Wouldn't a heap be a good data structure for a list of timers?  Insertion
> is log(n) and finding the one with the least time is O(1), ie pop off the
> front  It can be implemented in an array which should help cache
> coherency and all those other things they talked about in school :)
> 
I must be missing something here.  You get log(n) from what?  B-trees? 
How would you manage them to get the needed balance?  Stopping the world
to re-balance is worse than the cascade.  I guess I need to read up on
this stuff.  A good pointer would be much appreciated. 

Meanwhile, I keep thinking of a simple doubly linked list in time
order.  To speed it up keep an array of pointers to the first N whole
jiffie points and maybe pointers to coarser points beyond the first N. 
Make N, say 512.  Build the pointers as needed.  This should give
something like O(n/N) insertion and O(1) removal.
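
Very roughly, and with hypothetical names throughout, the idea is
something like the following.  Stale hints are exactly the "perishable
pointer" problem mentioned before; a real version would have to
validate or age them:

/*
 * One time-ordered doubly linked list of all timers, plus a lazily
 * maintained array of "hint" pointers, one per whole jiffie for the
 * next N_HINTS jiffies.
 */
#include <linux/list.h>
#include <linux/timer.h>
#include <linux/sched.h>        /* for jiffies */

#define N_HINTS 512

static struct list_head timer_chain;                    /* all timers, expire order */
static struct list_head *jiffie_hint[N_HINTS];          /* first entry at each jiffie, or NULL */

static void hinted_insert(struct timer_list *t)
{
        unsigned long slot = t->expires - jiffies;      /* jiffies until expiry */
        struct list_head *pos, *start;

        /* begin the scan at the hint for this jiffie if we have one */
        if (slot < N_HINTS && jiffie_hint[slot])
                start = jiffie_hint[slot];
        else
                start = timer_chain.next;

        for (pos = start; pos != &timer_chain; pos = pos->next) {
                struct timer_list *cur = list_entry(pos, struct timer_list, list);

                if (cur->expires > t->expires)
                        break;
        }
        list_add_tail(&t->list, pos);                   /* insert before the later entry */

        if (slot < N_HINTS)
                jiffie_hint[slot] = &t->list;           /* remember for the next insert */
}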

George



Re: Linux-Kernel Archive: No 100 HZ timer !

2001-04-13 Thread george anzinger

Mark Salisbury wrote:
> 
> > I think it makes the most sense to keep jiffie as a simple unsigned
> > int.  If we leave drivers, and other code as is they can deal with
> > single word (32 bit) values and get reasonable results.  If we make HZ
> > too high (say 1,000,000 to get microsecond resolution) we will start
> > limiting the max time we can handle, in this case to about 71.5 minutes.
> > (Actually 1/2 this value would start giving us trouble.)  HZ only
> > affects the kernel internals (the user API is either seconds/micro
> > seconds or seconds/nano seconds).  For those cases where we want a higer
> > resolution we just add a sub jiffie component.  Another way of looking
> > at this is to set up HZ as the "normal" resolution.  This would be the
> > resolution (as it is today) of the usual API calls.  Only calls to the
> > POSIX 1b timers would be allowed to have higher resolution.  I propose
> > that we use the POSIX standard to define "CLOCKS" with various
> > resolution, with the understanding that the higher resolutions will have
> > higher overhead.  And yet another consideration: to get high resolution
> > with a reasonable maximum timer interval we will need to use two words
> > in the timer.  I think it makes sense to use the jiffie (i.e. 1/HZ) as
> > the high order part of the timer's time.
> >
> > Note that all of these considerations on jiffie size hold with or with
> > out a tick less system.
> 
> inner loop, i.e. interrupt timer code should never have to convert from some
> real time value into number of decrementer ticks in order to set up the next
> interrupt as that requires divides (and 64 bit ones at that) in a tickless
> system.

A good point!  If we keep jiffies and sub jiffies in the timer
structure, then the sub jiffies should be in hardware defined units
(i.e. what we need to directly talk to the hardware).  This argues for
the jiffie to be longer than the longest expected inter event interval,
i.e. for it to be larger than 10 ms.  Seconds and sub seconds?  At least
this eliminates the mpy from the high order part.  

I think we need to balance the units used in the time list with the
needs of the list management code.  It may not be a good idea to tie it
too tightly to the jiffie.  Thinking aloud here...

Another thought: if we keep a small conversion table to convert from
jiffies to machine units for say the first 10 jiffies, then force an
"non event" timer interrupt if needed.  This would, I guess, get us back
to a tick system, but with tick every 10 jiffies...  Still thinking...
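
A sketch of that table thought (set_decrementer() and the other names
below are made up, just to show the idea):

/*
 * Precompute the hardware counts for 0..MAX_TABLE_JIFFIES jiffies at
 * boot, so the interrupt path never multiplies or divides; anything
 * further out just becomes a "non event" tick at the table's horizon.
 */
#define MAX_TABLE_JIFFIES 10

static unsigned long jiffie_to_count[MAX_TABLE_JIFFIES + 1];

void init_jiffie_table(unsigned long counts_per_jiffie)
{
        int i;

        for (i = 0; i <= MAX_TABLE_JIFFIES; i++)
                jiffie_to_count[i] = i * counts_per_jiffie;
}

static void arm_decrementer(unsigned long jiffies_to_next_event)
{
        unsigned long n = jiffies_to_next_event;

        if (n > MAX_TABLE_JIFFIES)
                n = MAX_TABLE_JIFFIES;          /* force a "non event" interrupt instead */
        set_decrementer(jiffie_to_count[n]);    /* hypothetical hardware hook */
}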


George
> 
> this is why any variable interval list/heap/tree/whatever should be kept in
> local ticks.  frequently used values can be pre-computed at boot time to
> speed up certain calculations (like how many local ticks/proc quantum)



Re: No 100 HZ timer!

2001-04-13 Thread george anzinger

Jamie Lokier wrote:
> 
> george anzinger wrote:
> > > Wouldn't a heap be a good data structure for a list of timers?  Insertion
> > > is log(n) and finding the one with the least time is O(1), ie pop off the
> > > front  It can be implemented in an array which should help cache
> > > coherency and all those other things they talked about in school :)
> > >
> > I must be missing something here.  You get log(n) from what?  B-trees?
> > How would you manage them to get the needed balance?  Stopping the world
> > to re-balance is worse than the cascade.  I guess I need to read up on
> > this stuff.  A good pointer would be much appreciated.
> 
> Look for "priority queues" and "heaps".  In its simplest form, you use a
> heap-ordered tree, which can be implemented using an array (that's
> usually how it's presented), or having the objects in the heap point to
> each other.
> 
> A heap-ordered tree is not as strictly ordered as, well, an ordered tree
> :-)  The rule is: if A is the parent of B and C, then A expires earlier
> than B, and A expires earlier than C.  There is no constraint on the
> relative expiry times of B and C.
> 
> There is no "stop the world" to rebalance, which you may consider an
> advantage over the present hierarchy of tables.  On the other hand, each
> insertion or deletion operation takes O(log n) time, where n is the
> number of items in the queue.  Although fairly fast, this bound can be
> improved if you know the typical insertion/deletion patterns, to O(1)
> for selected cases.  Also you should know that not all priority queues
> are based on heap-ordered trees.
> 
> Linux' current hierarchy of tables is a good example of optimisation: it
> is optimised for inserting and actually running short term timers, as
> well as inserting and deleting (before running) long term timers.  These
> extremes take O(1) for insertion, removal and expiry, including the
> "stop the world" time.  This should be considered before and move to a
> heap-based priority queue, which may turn out slower.
> 
> > Meanwhile, I keep thinking of a simple doubly linked list in time
> > order.  To speed it up keep an array of pointers to the first N whole
> > jiffie points and maybe pointers to coarser points beyond the first N.
> > Make N, say 512.  Build the pointers as needed.  This should give
> > something like O(n/N) insertion and O(1) removal.
> 
> You've just described the same as the current implementation, but with
> lists for longer term timers.  I.e. slower.  With your coarser points,
> you have to sort the front elements of the coarse list into a fine one
> from time to time.
> 
> The idea of having jiffie-point pointers into a data structure for fast
> insertion is a good one for speeding up any data structure for that
> common case, though.
> 
If we are to do high-res-timers, I think we will always have to do some
sort of a sort on insert.  If we can keep the sort to a VERY short list
and only do it on sub jiffie timers, we will have something that is VERY
close to what we have now.

I think that the density of timers tends to increase as they get closer
to "now".  This should allow coarser pointers for times more removed
from "now".  Still if we sort on insert we will not have to resort
later.

The pointers into the list, however, are perishable and need to be
refreshed as time passes.  My thought was to do this only when a needed
pointer was found to be "expired" and then only for the needed pointer. 
On first need we would have a small overhead, but would remember for
next time.

Thanks for the good input

George



Re: No 100 HZ timer!

2001-04-13 Thread george anzinger

Horst von Brand wrote:
> 
> Ben Greear <[EMAIL PROTECTED]> said:
> 
> [...]
> 
> > Wouldn't a heap be a good data structure for a list of timers?  Insertion
> > is log(n) and finding the one with the least time is O(1), ie pop off the
> > front  It can be implemented in an array which should help cache
> > coherency and all those other things they talked about in school :)
> 
> Insertion and deleting the first are both O(log N). Plus the array is fixed
> size (bad idea) and the jumping around in the array thrashes the caches.
> --
And your solution is?

George



Re: No one wants to help me :-(

2001-04-13 Thread george anzinger

Brian Gerst wrote:
> 
> Mircea Damian wrote:
> >
> > Hello,
> >
> > I was expecting to receive some replies to my last desperate messages:
> > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg35446.html
> > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg36591.html
> >
> > My machine is dying in add_timer(). It seems to happen only on SMP
> > machines and is something related to the network driver. For some reason
> >  one of the timer lists gets broken so we (we are two people trying to
> >  solve this issue) wrote a "safe" timer.c which tries to rebuild the chain
> >  in case it hits a NULL pointer.
> >
> > The machine is (ofcourse) slower with this patch but at least it works.
> >
> > Maybe someone can see which is the real bug and fix it.
> >
> > Please help!
> 
> I found (at least part of) the problem.  In detach_timer() we test if
> the timer is pending.  If it is not the function does not remove the
> timer from the list and returns 0.  The functions that call
> detach_timer() do not check the return value and unconditionally set the
> list pointers to NULL, even though the timer is still on the list.
> Patch against 2.4.3 attached, but there may be a better solution.
> 
Uh, but the pending test is to check for NULL pointers, so while your
change consolidates some code, I don't think it changes anything, unless
you can find a place that calls detach_timer and doesn't clear the
pointers...

For what it's worth, my look at this problem seems to indicate that the
new timer pointer is zero.  This would be a problem in the network code
somewhere.  I would guess that the whole structure is being released by
cpu X while cpu Y is trying to set up a timer.  But then I don't really
know the network code (at all).  Just going by the error, a deref of zero
in add_timer(), which only looks at the timer to verify that the pointers
are zero.

George


> --
> 
> Brian Gerst
> 
>   
> diff -urN linux-2.4.3/kernel/timer.c linux/kernel/timer.c
> --- linux-2.4.3/kernel/timer.c  Thu Dec 14 20:52:22 2000
> +++ linux/kernel/timer.cFri Apr 13 13:26:08 2001
> @@ -194,6 +194,7 @@
> if (!timer_pending(timer))
> return 0;
> list_del(&timer->list);
> +   timer->list.next = timer->list.prev = NULL;
> return 1;
>  }
> 
> @@ -217,7 +218,6 @@
> 
> spin_lock_irqsave(&timerlist_lock, flags);
> ret = detach_timer(timer);
> -   timer->list.next = timer->list.prev = NULL;
> spin_unlock_irqrestore(&timerlist_lock, flags);
> return ret;
>  }
> @@ -246,7 +246,6 @@
> 
> spin_lock_irqsave(&timerlist_lock, flags);
> ret += detach_timer(timer);
> -   timer->list.next = timer->list.prev = 0;
> running = timer_is_running(timer);
> spin_unlock_irqrestore(&timerlist_lock, flags);
> 
> @@ -309,7 +308,6 @@
> data= timer->data;
> 
> detach_timer(timer);
> -   timer->list.next = timer->list.prev = NULL;
> timer_enter(timer);
> spin_unlock_irq(&timerlist_lock);
> fn(data);



Re: Linux-Kernel Archive: No 100 HZ timer !

2001-04-15 Thread george anzinger

Roger Larsson wrote:
> 
> On Thursday 12 April 2001 23:52, Andre Hedrick wrote:
> > Okay but what will be used for a base for hardware that has critical
> > timing issues due to the rules of the hardware?
> >
> > I do not care but your drives/floppy/tapes/cdroms/cdrws do:
> >
> > /*
> >  * Timeouts for various operations:
> >  */
> > #define WAIT_DRQ        (5*HZ/100)  /* 50msec - spec allows up to 20ms */
> > #ifdef CONFIG_APM
> > #define WAIT_READY      (5*HZ)      /* 5sec - some laptops are very slow */
> > #else
> > #define WAIT_READY      (3*HZ/100)  /* 30msec - should be instantaneous */
> > #endif /* CONFIG_APM */
> > #define WAIT_PIDENTIFY  (10*HZ) /* 10sec - should be less than 3ms (?), if all ATAPI CD is closed at boot */
> > #define WAIT_WORSTCASE  (30*HZ) /* 30sec - worst case when spinning up */
> > #define WAIT_CMD        (10*HZ) /* 10sec - maximum wait for an IRQ to happen */
> > #define WAIT_MIN_SLEEP  (2*HZ/100)  /* 20msec - minimum sleep time */
> >
> > Give me something for HZ or a rule for getting a known base so I can have
> > your storage work and not corrupt.
> >
> 
> Wouldn't it make sense to define these in real world units?
> And to use that to determine requested accuracy...
> 
> Those who wait for seconds will probably not have a problem with up to (half)
> a second longer wait - or...?
> Those in range of the current jiffie should be able to handle up to one
> jiffie longer...
> 
> Requesting a wait in ms gives you ms accuracy...

The POSIX standard seems to point to a "CLOCK" for this sort of thing. 
A "CLOCK" has a resolution.  One might define CLOCK_10MS, CLOCK_1US, or
CLOCK_1SEC, for example.  Then the request for a delay would pass the
CLOCK to use as an additional parameter.  Of course, CLOCK could also
wrap other characteristics of the timer.  For example, the jiffies
variable in the system could be described as a CLOCK which has a
resolution of 10 ms and is the uptime.  Another CLOCK might return
something related to GMT or wall time (which, by the way, is allowed to
slip around a bit relative to uptime to account for leap seconds,
daylight time, and even the date command).

Now to make this real for the kernel we would need to define a set of
CLOCKs, to meet the kernel as well as the user needs.  POSIX timers
requires the CLOCK construct and doesn't limit it very much.  Once
defined to meet the standard, it is easy to extend the definition to fit
the apparent needs.  It is also easy to make the definition extensible
and we (the high-res-timers project
http://sourceforge.net/projects/high-res-timers) intend to do so.

George



Re: [test-PATCH] Re: [QUESTION] 2.4.x nice level

2001-04-16 Thread george anzinger

Rik van Riel wrote:
> 
> On Thu, 12 Apr 2001, Pavel Machek wrote:
> 
> > > One rule of optimization is to move any code you can outside the loop.
> > > Why isn't the nice_to_ticks calculation done when nice is changed
> > > instead of EVERY recalc.?  I guess another way to ask this is, who needs
> >
> > This way change is localized very nicely, and it is "obviously right".
> 
> Except for two obvious things:
> 
> 1. we need to load the nice level anyway
> 2. a shift takes less cycles than a load on most
>CPUs
> 
Gosh, what am I missing here?  I think "top" and "ps" want to see the
"nice" value, so it needs to be available, and since the NICE_TO_TICK()
function loses information (i.e. is not reversible) we cannot compute
it from ticks.  Still, yes, we need to load something, but must it be nice?
Why not the result of NICE_TO_TICK()?  

A shift and a subtract are fast, yes, but this loop runs over all tasks
(not just the run list).  This loop can put a real dent in preemption
times AND the notion of turning on interrupts while it is done can run
into some interesting race conditions.  (This is why the MontaVista
scheduler does the loop without dropping the lock, AFTER optimizing the
h... out of it.)

What am I missing?

George



Re: No 100 HZ timer!

2001-04-16 Thread george anzinger

Mark Salisbury wrote:
> 
> all this talk about which data structure to use and how to allocate memory is
> way premature.
> 
> there needs to be a clear definition of the requirements that we wish to meet,
> including whether we are going to do ticked, tickless, or both
> 
> a func spec, for lack of a better term needs to be developed
> 
> then, when we go to design the thing, THEN is when we decide on the particular
> flavor of list/tree/heap/array/dbase that we use.
> 
> let's engineer this thing instead of hacking it.

Absolutely, find first draft attached.

Comments please.

George


~snip~

Functional Specification for the high-res-timers project.

http://sourceforge.net/projects/high-res-timers

We are developing code to implement the POSIX clocks & timers as defined
by IEEE Std 1003.1b-1993 Section 14.  (For an on line reference see our
home page: http://high-res-timers.sourceforge.net/ )
  
The API specifies the following functions (for details please see the spec):

int clock_settime(clockid_t clock_id, const struct timespec *tp);
int clock_gettime(clockid_t clock_id, struct timespec *tp);
int clock_getres(clockid_t clock_id, struct timespec *res);

int timer_create(clockid_t clock_id, struct sigevent *evp,
                timer_t *timerid);
int timer_delete(timer_t timerid);

int timer_settime(timer_t timerid, int flags, 
                  const struct itimerspec *value,
                  struct itimerspec *ovalue);
int timer_gettime(timer_t timerid, struct itimerspec *value);
int timer_getoverrun(timer_t timerid);

int nanosleep(const struct timespec *rqtp, struct timespec *rmtp);
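
A minimal usage sketch of this interface (error handling omitted; the
clock and signal choices here are only illustrative):

#include <signal.h>
#include <string.h>
#include <time.h>

void one_shot_example(void)
{
        struct sigevent ev;
        struct itimerspec its;
        timer_t tid;

        memset(&ev, 0, sizeof(ev));
        ev.sigev_notify = SIGEV_SIGNAL;         /* deliver SIGALRM on expiry */
        ev.sigev_signo  = SIGALRM;

        timer_create(CLOCK_REALTIME, &ev, &tid);

        its.it_value.tv_sec     = 0;
        its.it_value.tv_nsec    = 500000;       /* fire once, 500 us from now */
        its.it_interval.tv_sec  = 0;
        its.it_interval.tv_nsec = 0;            /* no reload: one shot */

        timer_settime(tid, 0, &its, NULL);
}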

In addition we expect that we will provide a high resolution timer for
kernel use (heck, we may provide several).

In all this code we will allow resolutions down to 1 nanosecond (the
maximum the timespec structure can express).  The actual resolution of
any given clock will be fixed at compile time and the code will do its
work at a higher resolution to avoid round off errors as much as
possible.

We will provide several "clocks" as defined by the standard.  In
particular, the following capabilities will be attached to some clock,
regardless of the actual clock "name" we end up using:

CLOCK_10MS a wall clock supporting timers with 10 ms resolution (same as
linux today). 

CLOCK_HIGH_RES a wall clock supporting timers with the highest
resolution the hardware supports.

CLOCK_1US a wall clock supporting timers with 1 micro second resolution
(if the hardware allows it).

CLOCK_UPTIME a clock that gives the system up time.  (Does this clock
need to support timers?)

CLOCK_REALTIME a wall clock supporting timers with 1/HZ resolution.

At the same time we will NOT support the following clocks:

CLOCK_VIRTUAL a clock measuring the elapsed execution time (real or
wall) of a given task.  

CLOCK_PROFILE a clock used to generate profiling events. 

CLOCK_???  any clock keyed to a task.

(Note that this does not mean that the clock system will not support the
virtual and profile clocks, but that they will not be accessible thru
the POSIX timer interface.)

It would be NICE if we could provide a way to hook other time support
into the system.  In particular a

CLOCK_WWV or CLOCK_GPS

might be nice.  The problem with these sorts of clocks is that they
imply an array of function pointers for each clock and function pointers
slow the code down because of their unpredictability.  Nevertheless,
we will attempt to allow easy expansion in this direction.

Implications on the current kernel:

The high resolution timers will require a fast clock access with the
maximum supported resolution in order to convert relative times to
absolute times.  This same fast clock will be used to support the
various user and system time requests.

There are two ways to provide timers to the kernel.  For lack of a
better term we will refer to them as "ticked" and "tick less".  Until we
have performance information that implies that one or the other of these
methods is better in all cases we will provide both ticked and tick less
systems.  The variety to be used will be selected at configure time.

For tick less systems we will need to provide code to collect execution
times.  For the ticked system the current method of collecting these
times will be used.  This project will NOT attempt to improve the
resolution of these timers, however, the high speed, high resolution
access to the current time will allow others to augment the system in
this area.

For the tick less system the project will also provide a time slice
expiration interrupt.

The timer list(s) (all pending timers) need to be organized so that the
following operations are fast:

a.) list insertion of an arbitrary timer,
b.) removal of canceled and expired timers, and
c.) finding the timer for "NOW" and its immediate followers.

Times in the timer list will be absolute and related to system up time.
These times will be converted to wall time as needed.

The POSIX interface prov

Re: No 100 HZ timer!

2001-04-16 Thread george anzinger

"Albert D. Cahalan" wrote:
> 
> > CLOCK_10MS a wall clock supporting timers with 10 ms resolution (same as
> > linux today).
> 
> Except on the Alpha, and on some ARM systems, etc.
> The HZ constant varies from 10 to 1200.

I suspect we will want to use 10 ms resolution for a clock named
CLOCK_10MS :)
On the other hand we could have a CLOCK_1_OVER_HZ...  Actually with
high-res-timers the actual HZ value becomes a bit less important.  It
would be "nice" to keep 1/HZ == jiffie, however.
> 
> > At the same time we will NOT support the following clocks:
> >
> > CLOCK_VIRTUAL a clock measuring the elapsed execution time (real or
> > wall) of a given task.
> ...
> > For tick less systems we will need to provide code to collect execution
> > times.  For the ticked system the current method of collection these
> > times will be used.  This project will NOT attempt to improve the
> > resolution of these timers, however, the high speed, high resolution
> > access to the current time will allow others to augment the system in
> > this area.
> ...
> > This project will NOT provide higher resolution accounting (i.e. user
> > and system execution times).
> 
> It is nice to have accurate per-process user/system accounting.
> Since you'd be touching the code anyway...

Yeah sure... and will you pick up the ball on all the platform dependent
code to get high-res-timers on all the other platforms?  On second
thought I am reminded of the corollary to the old saw:  "The squeaking
wheel gets the grease."  Which is: "He who complains most about the
squeaking gets to do the greasing."  Hint  hint.
> 
> > The POSIX interface provides for "absolute" timers relative to a given
> > clock.  When these timers are related to a "wall" clock they will need
> > adjusting when the wall clock time is adjusted.  These adjustments are
> > done for "leap seconds" and the date command.
> 
> This is a BIG can of worms. You have UTC, TAI, GMT, and a loosely
> defined POSIX time that is none of the above. This is a horrid mess,
> even ignoring gravity and speed. :-)
> 
> Can a second be 2 billion nanoseconds?
> Can a nanosecond be twice as long as normal?
> Can a second appear twice, with the nanoseconds getting reset?
> Can a second never appear at all?
> Can you compute times more than 6 months into the future?
> How far does time deviate from solar time? Is this constrained?
> 
> If you deal with leap seconds, you have to have a table of them.
> This table grows with time, with adjustments being made with only
> about 6 months notice. So the user upgrades after a year or two,
> and the installer discovers that the user has been running a
> system that is unaware of the most recent leap second. Arrrgh.
> 
> Sure you want to touch this? The Austin group argued over it for
> a very long time and never did find a really good solution.
> Maybe you should just keep the code simple and fast, without any
> concern for clock adjustments.

There is a relatively simple way to handle this, at least from the
high-res-timers point of view.  First we convert all timers to absolute
uptime.  This is a nice well behaved time.  At boot time we peg the wall
clock to the uptime.  Then at any given time, wall time is boot wall
time + uptime.  Then date, leap seconds, etc. affect the pegged value of
boot wall time.  Using the POSIX CLOCK id we allow the user to ask for
either version of time.  Now if we define an array of struct clock_id
which contains pointers to such things as functions to return time, any
algorithm you might want can be plugged in to bend time as needed.  The
only fly in this mess is the NTP rate adjustment stuff.  This code is
supposed to allow the system to adjust its ticker to produce accurate
seconds and so gets at the root of the uptime counter be it in hardware
or software or some combination of the two.  But then that's what makes
life interesting :)
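
A minimal user-space model of the pegging idea, with CLOCK_MONOTONIC
standing in for the proposed uptime counter (illustration only):

/* wall_at_boot is the pegged value that date(1), leap seconds, NTP
 * steps, etc. would adjust; the uptime clock itself is never set. */
#include <stdio.h>
#include <time.h>

static struct timespec wall_at_boot;		/* set once at "boot" */

static struct timespec wall_time(void)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);	/* uptime-like, never stepped */
	now.tv_sec  += wall_at_boot.tv_sec;
	now.tv_nsec += wall_at_boot.tv_nsec;
	if (now.tv_nsec >= 1000000000L) {	/* normalize the carry */
		now.tv_nsec -= 1000000000L;
		now.tv_sec++;
	}
	return now;
}

int main(void)
{
	struct timespec real, up;

	/* peg boot wall time once: wall_at_boot = CLOCK_REALTIME - uptime */
	clock_gettime(CLOCK_REALTIME, &real);
	clock_gettime(CLOCK_MONOTONIC, &up);
	wall_at_boot.tv_sec = real.tv_sec - up.tv_sec;
	wall_at_boot.tv_nsec = 0;		/* close enough for a sketch */

	printf("wall: %ld\n", (long)wall_time().tv_sec);
	return 0;
}
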
> 
> > In either a ticked or tick less system, it is expected that resolutions
> > higher than 1/HZ will come with some additional overhead.  For this
> > reason, the CLOCK resolution will be used to round up times for each
> > timer.  When the CLOCK provides 1/HZ (or coarser) resolution, the
> > project will attempt to meet or exceed the current systems timer
> > performance.
> 
> Within the kernel at least, it would be good to let drivers specify
> desired resolution. Then a near-by value could be selected, perhaps
> with some consideration for event type. (for cache reasons)

This could be done, however, I would prefer to provide CLOCK_s to do
this as per the standard.  What does the community say?  In either case
you get different resolutions, but in the latter case the possible
values are fixed at least at configure time.

George



Re: No 100 HZ timer!

2001-04-16 Thread george anzinger

Mark Salisbury wrote:
> 
> > Given a system speed, there is a repeating timer rate which will consume
> > 100% of the system in handling the timer interrupts.  An attempt will
> > be made to detect this rate and adjust the timer to prevent system
> > lockup.  This adjustment will look like timer overruns to the user
> > (i.e. we will take a percent of the interrupts and record the untaken
> > interrupts as overruns)
> 
> just at first blush, there are some things in general but I need to read
> this again and more closely
> 
> but, with POSIX timers, there is a nifty little restriction/protection built
> into the spec regarding the re-insertion of short interval repeating timers.
> that is: a repeating timer will not be re-inserted until AFTER the
> associated signal handler has been handled.

Actually what it says is: "Only a single signal shall be queued to the
process for a given timer at any point in time.  When a timer for which
a signal is still pending expires, no signal shall be queued, and a
timer overrun shall occur."

It then goes on to talk about the overrun count and how it is to be
managed.

What I am suggesting is that the system should detect when these
interrupts would come so fast as to stall the system and just set up a
percent of them while bumping the overrun count as if they had all
occurred.

George

> 
> this has some interesting consequences for signal handling and signal
> delivery implementations, but importantly, it ensures that even a flood of
> POSIX timers with very short repeat intervals will be handled cleanly.
> 
> I will get more detailed comments to you tomorrow.
> 
> Mark Salisbury
> 



Re: No 100 HZ timer!

2001-04-17 Thread george anzinger

Mark Salisbury wrote:
> 
> > Functional Specification for the high-res-timers project.
> >
> > In addition we expect that we will provide a high resolution timer for
> > kernel use (heck, we may provide several).
> 
> what we do here determines what we can do for the user..

I was thinking that it might be good to remove the POSIX API for the
kernel and allow a somewhat simplified interface.  For example, the user
gets to resolution by specifying a CLOCK, where we MAY want to allow the
kernel call to directly specify the resolution.  This has already been
suggested.  I suppose you could say the functional spec should define
the kernel PI (KPI?) as well as the user API, but that is a bit fuzzy at
this time as I think it will depend on how we actually code the user
functionality.  Another example: I am leaning toward using a two word
uptime composed of jiffies (i.e. 1/HZ since boot) and a machine
dependent sub jiffie unit.  Each ARCH would then define this unit as
well as the conversion routines to move it back and forth to
nanoseconds, microseconds, and 1/HZ.  I think this format would play
well in task time accounting, as well as in the timer management.  
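
Something like the following, where the field names are purely illustrative
and each arch picks its own sub-jiffie unit:

/* Sketch only -- each arch would define the sub-jiffie unit and supply
 * the conversion routines to and from ns, us and 1/HZ. */
struct uptime {
	unsigned long jiffies;		/* whole ticks since boot (1/HZ) */
	unsigned long sub_jiffie;	/* arch-defined fraction of a tick */
};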

For calls to something like delay(), however, it sucks.  I think these
calls want a single word, possibly microsecond, time specification. 
This gives a 16 or 17 minutes max and 1 microsecond min. which probably
covers 99.99% of all kernel delay needs.  

Another kernel internal interface should allow the user specified
structures (timespec and timeval).  The notion is to put all the
conversion routines in the timer code to maintain the specified
resolution, and (more importantly), to avoid conversion to a format that
just needs an additional conversion.

To summarize,

I think there is a need for two classes of timer interfaces in the
kernel:

a.) For drivers and others that need "delay()" sorts of things, and
b.) For system calls that handle user specified times.
> 
> > We will provide several "clocks" as defined by the standard.  In
> > particular, the following capabilities will be attached to some clock,
> > regardless of the actual clock "name" we end up using:
> >
> > CLOCK_10MS a wall clock supporting timers with 10 ms resolution (same as
> > linux today).
> >
> > CLOCK_HIGH_RES a wall clock supporting timers with the highest
> > resolution the hardware supports.
> >
> > CLOCK_1US a wall clock supporting timers with 1 micro second resolution
> > (if the hardware allows it).
> >
> > CLOCK_UPTIME a clock that give the system up time.  (Does this clock
> > need to support timers?)
> >
> > CLOCK_REALTIME a wall clock supporting timers with 1/HZ resolution.
> >
> 
> Too many clocks.  we should have CLOCK_REALTIME and CLOCK_UPTIME for sure, but
> the others are just fluff.  we should have 1 single clock mechanism for the
> whole system with it's resolution and tick/tickless characteristics determined
> at compile time.

I think you already have let the nose of the camel into the tent :) 
Here is what I am thinking:

Suppose an array of structures of type clock.  Clock_id is an index into
this array.  Here is what is in the structure:

struct clock {
	int resolution;
	int (*gettime)();
	int (*settime)();
	int (*convert_to_uptime)();
	int (*convert_from_uptime)();
	/* ... */
};

Now the difference between CLOCK_UPTIME and CLOCK_REALTIME is surely in
the gettime/settime and possibly in the resolution.  But the difference
between CLOCK_REALTIME and CLOCK_1US, CLOCK_HIGH_RES, CLOCK_10MS is JUST
the resolution!  In other words, all they cost is the table entry.  Note
that CLOCK_GMT, CLOCK_UST, and CLOCK_GPS, etc. all fit nicely into this
same structure.

We should also provide a way to "register" a new clock so the user can
easily configure in additional clocks.  There are ways of doing this
that are really easy to use, e.g. the module_init() macro.
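
A rough sketch of what such a registration hook might look like
(register_clock(), MAX_CLOCKS and the clocks[] table are invented names for
illustration, not an existing interface):

/* Sketch only: the names here are hypothetical. */
#define MAX_CLOCKS 16

static struct clock *clocks[MAX_CLOCKS];

int register_clock(int clock_id, struct clock *clk)
{
	if (clock_id < 0 || clock_id >= MAX_CLOCKS || clocks[clock_id])
		return -1;			/* bad id or slot already taken */
	clocks[clock_id] = clk;
	return 0;
}
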
> 
> also CLOCK_UPTIME should be the "true" clock, with CLOCK_REALTIME just a
> convenience/compliance offset.

If you mean by "true" that this clock can not be set, starts at 0 at
boot time and can only be affected by rate adjustments to get it to pass
a real second every second, I agree.
> 
> > At the same time we will NOT support the following clocks:
> >
> > CLOCK_VIRTUAL a clock measuring the elapsed execution time (real or
> > wall) of a given task.
> >
> > CLOCK_PROFILE a clock used to generate profiling events.
> >
> > CLOCK_???  any clock keyed to a task.
> 
> we could do some KOOL things here but they are more related to process time
> accounting and should be dealt with in that context and as a separate project.
> 
> however our design should take these concepts into account and allow for easy
> integration of these types of functionality.

I agree.
> 
> >
> > (Note that this does not mean that the clock system will not support the
> > virtual and profile clocks, but that they will not be accessible thru
> > the POSIX timer interface.)
> 
> I think that should sombody choose t

Re: schedule() seems to have changed.

2001-04-18 Thread george anzinger

"Richard B. Johnson" wrote:
> 
> It seems that the nature of schedule() has changed in recent
> kernels. I am trying to update my drivers to correspond to
> the latest changes. Here is an example:
> 
> This waits for some hardware (interrupt sets flag), time-out in one
> second. This is in an ioctl(), i.e., user context:
> 
> set_current_state(TASK_INTERRUPTIBLE);
> current->policy = SCHED_YIELD;
> timer = jiffies + HZ;
> while(time_before(jiffies, timer))
> {
> if(flag) break;
> schedule();
> }
> set_current_state(TASK_RUNNING);
> 
> The problem is that schedule() never returns!!! If I use
> schedule_timeout(1), it returns, but the granularity
> of the timeout interval is such that it slows down the
> driver (0.1 ms).
> 
> So, is there something that I'm not doing that is preventing
> schedule() from returning?  It returns on a user-interrupt (^C),
> but otherwise gives control to the kernel forever.
> 
When schedule() is entered with TASK_INTERRUPTIBLE (actually with
current state !=TASK_RUNNING) it takes the task out of the run_list.  It
has been this way for a long time.  The normal way for the task to move
back to the run_list is for wake_up to be called, which, of course (^C)
does.  In your case it would be best if you could get what ever sets
"flag" to call wake_up.

If what you really want to do is to spin in a SCHED_YIELD waiting for
"flag" you need to a.) move the setting of SCHED_YIELD inside the while,
and b.) eliminate the setting of current_state (both of them).
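
In other words, something along these lines (a sketch against the 2.4-era
interfaces, mirroring the loop above):

	/* Sketch of the busy-yield variant; the wake_up() approach is
	 * still the better fix. */
	timer = jiffies + HZ;
	while (time_before(jiffies, timer)) {
		if (flag)
			break;
		current->policy |= SCHED_YIELD;	/* re-arm it each pass */
		schedule();			/* task stays TASK_RUNNING */
	}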

George



Re: What is the precision of usleep ?

2001-04-23 Thread george anzinger

george anzinger wrote:
> 
> Marcus Ramos wrote:
> >
> > Hello,
> >
> > I am using usleep in an application under RH7 kernel 2.4.2. However,
> > when I bring its argument down to 20 miliseconds (20.000 microseconds)
> > or less, this seems to be ignored by the function (or the machine's hw
> > timer), which behaves as if 20 ms where its lowest acceptable value. How
> > can I measure the precision of usleep in my box ? I am currently using
> > an Dell GX110 PIII 866 MHz.
> >
> > Thanks in advance.
> > Marcus.
> 
> Well, first, your issue is resolution, not precision.  Current
> resolution on most all timers is 1/HZ.  So this should get a min.
> nanosleep of 10 ms.
> 
> So, could someone explain this line from sys_nanosleep() (
> kernel/timer.c):
> 
> expire = timespec_to_jiffies(&t) + (t.tv_sec || t.tv_nsec);
> 
> It seems to me this should just be:
> 
> expire = timespec_to_jiffies(&t)
> 
Oh darn!  Must NOT do posts at 4AM!  

The standard says nanosleep MUST wait at LEAST the requested time, and
that is what the extra '(t.tv_sec || t.tv_nsec)' term is for: a non-zero
request rounds up to at least one jiffy, and the extra jiffy covers the
partial tick already in progress.  Since we are dealing with a 1/HZ time
resolution (tick), the actual time waited for the shortest request MUST
fall between 10 and 20 ms.  Depending on whether your code is synced to
the system clock you may see times closer to one end of this range.  If
you are not synced to the clock the average wait should be about 15 ms.
(Note, locking to the clock in some way is relatively hard to get away
from.  After all, this is the same clock that is used for time slicing.)

George



high-res-timers start code.

2001-04-23 Thread george anzinger

"Robert H. de Vries" wrote:
> 
> On Monday 23 April 2001 19:45, you wrote:
> 
> > By the way, is the user land stuff the same for all "arch"s?
> 
> Not if you plan to handle the CPU cycle counter in user space. That is at
> least what I would propose.

Just got interesting, lets let the world look in.

What did you have in mind here?  I suspect that on some archs the cycle
counter is not available to user code.  I know that on parisc it is
optionally available (kernel can set a bit to make it available), but by
itself it is only good for intervals.  You need to peg some value to a
CLOCK to use it to get timeofday, for instance.

On the other hand, if there is an area of memory that both users and
system can read but only system can write, one might put the soft clock
there.  This would allow gettimeofday (with the cycle counter) to work
without a system call.  To the best of my knowledge the system does not
have such an area as yet.

comments?

George

> System call stuff, yes. There may be gotcha's in the area of 32/64
> interfaces. Almost all 64 bit archs also support 32 bit interfaces (check out
> the stuff in my patch regarding the SPARC, kindly donated by Jakub Jelinek).
> 
> Robert
> 
> --
> Robert de Vries
> [EMAIL PROTECTED]



Re: high-res-timers start code.

2001-04-24 Thread george anzinger

Gabriel Paubert wrote:
> 
> On Mon, 23 Apr 2001, george anzinger wrote:
> 
> > "Robert H. de Vries" wrote:
> > >
> > > On Monday 23 April 2001 19:45, you wrote:
> > >
> > > > By the way, is the user land stuff the same for all "arch"s?
> > >
> > > Not if you plan to handle the CPU cycle counter in user space. That is at
> > > least what I would propose.
> >
> > Just got interesting, lets let the world look in.
> >
> > What did you have in mind here?  I suspect that on some archs the cycle
> > counter is not available to user code.  I know that on parisc it is
> > optionally available (kernel can set a bit to make it available), but by
> > it self it is only good for intervals.  You need to peg some value to a
> > CLOCK to use it to get timeofday, for instance.
> 
> On Intel there is a also bit to disable unprivileged RDTSC, IIRC. On PPC
> the timebase is always available (but the old 601 needs spacial casing: it
> uses different registers and does not count in binary :-().
> 
> > On the other hand, if there is an area of memory that both users and
> > system can read but only system can write, one might put the soft clock
> > there.  This would allow gettimeofday (with the cycle counter) to work
> > without a system call.  To the best of my knowledge the system does not
> > have such an area as yet.
> >
> > comments?
> 
> Well, there may be work in this area, since x86-64 will not enter kernel
> mode for gettimeofday() if I understand correctly what Andrea said. Linus
> hinted once at exporting (kernel) code to user space.
> 
> Some data also will also need to be accessible but as long as you don't
> guarantee compatibility on data layout, only AFAIU on interface for these
> calls (it was not clear to me if it would be a fixed address forever or
> dynamic linking with kernel exported symbols), it's not a problem.

HPUX passes some kernel addresses to programs on the initial start up,
i.e. the primary entry point where exec starts the task.  In this case
the addresses are in registers.  I think HPUX also passes addresses back
to the kernel at this time, but those are done with a system call.  In
any case, such a change requires a user land relink, something that may
want the kernel to move from 2.x.x to 3.x.x (depending on version
conventions :).

I think the real problem has more to do with the "minor" variations on
various archs, for example the 601 ppc above and the i386 TSC is only
available in the "newer" machines.  This would require user land
configuration based on fine points of the arch, stuff "only the kernel
knows for sure".

So what do we do in the meantime?

Comments?

George
> 
> Of course it will SIGSEGV instead of returning -EFAULT but this is a good
> thing IMHO, nobody checks for -EFAULT from gettimeofday(). I think
> that system calls should rather force SIGSEGV than return -EFAULT anyway,
> to make syscalls indistinguishable from pure library calls.
> 
> Regards,
> Gabriel.



Re: Major Clock Drift

2001-02-12 Thread george anzinger

I may be off base here, but the problem as described below does _NOT_
seem to be OT so I removed that from the subject line.  A clock drift
change with an OS update is saying _something_ about the OS, not the
hardware.  In this case it seems to be the 2.4.x OS that is loosing
time.  I suspect the cause is some driver that is not being nice to the
hardware, either by abusing the interrupt off code or locking up the bus
or some such.  In any case I think it should _not_ be considered OT.

George



"Michael B. Trausch" wrote:
> 
> On Sun, 4 Feb 2001, Tom Eastep wrote:
> 
> > Thus spoke Michael B. Trausch:
> >
> > > On Sat, 3 Feb 2001, Josh Myer wrote:
> > >
> > > > Hello all,
> > > >
> > > > I know this _really_ isn't the forum for this, but a friend of mine has
> > > > noticed major, persistent clock drift over time. After several weeks, the
> > > > clock is several minutes slow (always slow). Any thoughts on the
> > > > cause? (Google didn't show up anything worthwhile in the first couple of
> > > > pages, so i gave up).
> > > >
> > >
> > > I'm having the same problem here.  AMD K6-II, 450 MHz, VIA Chipset, Kernel
> > > 2.4.1.
> > >
> >
> >
> > The video on this system is an onboard ATI 3D Rage LT Pro; I use vesafb
> > rather than atyfb because the latter screws up X.
> >
> 
> I'm not using any framebuffer on my machine (I have an ATI 3D Rage 128
> Pro, myself).  I use the standard 80x50 console, and X when I need
> it.  I'm about to put Debian on the system and see how that works and if I
> like it, I just got the .ISO of disc 1 downloaded (after about a week) and
> now it's burning.  (I hate having a 33.6 connection!)
> 
> However the clock drift didn't happen as much, if at all, with 2.2.xx
> kernels.  It's kept itself pretty well sane.  But now I'm losing something
> on the order of a half hour a week - that didn't happen before.
> 
> - Mike
> 
> ===
> Michael B. Trausch[EMAIL PROTECTED]
> Avid Linux User since April, '96!   AIM:  ML100Smkr
> 
>   Contactable via IRC (DALNet) or AIM as ML100Smkr
> ===
> 



Re: [PATCH] guard mm->rss with page_table_lock (241p11)

2001-02-12 Thread george anzinger

Excuse me if I am off base here, but wouldn't an atomic operation be
better here?  There are atomic inc/dec and add/sub macros for this.  It
just seems that that is all that is needed here (from inspection of the
patch).
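
That is, something like the following fragment (illustration only; it
assumes mm->rss were redeclared as an atomic_t, which the patch below does
not do):

	/* page added: no spinlock needed */
	atomic_inc(&mm->rss);

	/* teardown path */
	if (atomic_read(&mm->rss) >= freed)
		atomic_sub(freed, &mm->rss);
	else
		atomic_set(&mm->rss, 0);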

George


Rasmus Andersen wrote:
> 
> On Mon, Jan 29, 2001 at 07:30:01PM -0200, Rik van Riel wrote:
> > On Mon, 29 Jan 2001, Rasmus Andersen wrote:
> >
> > > Please comment. Or else I will continue to sumbit it :)
> >
> > The following will hang the kernel on SMP, since you're
> > already holding the spinlock here. Try compiling with
> > CONFIG_SMP and see what happens...
> 
> You are right. Sloppy research by me :(
> 
> New patch below with the vmscan part removed.
> 
> diff -aur linux-2.4.1-pre11-clean/mm/memory.c linux/mm/memory.c
> --- linux-2.4.1-pre11-clean/mm/memory.c Sun Jan 28 20:53:13 2001
> +++ linux/mm/memory.c   Sun Jan 28 22:43:04 2001
> @@ -377,7 +377,6 @@
> address = (address + PGDIR_SIZE) & PGDIR_MASK;
> dir++;
> } while (address && (address < end));
> -   spin_unlock(&mm->page_table_lock);
> /*
>  * Update rss for the mm_struct (not necessarily current->mm)
>  * Notice that rss is an unsigned long.
> @@ -386,6 +385,7 @@
> mm->rss -= freed;
> else
> mm->rss = 0;
> +   spin_unlock(&mm->page_table_lock);
>  }
> 
> 
> @@ -1038,7 +1038,9 @@
> flush_icache_page(vma, page);
> }
> 
> +   spin_lock(&mm->page_table_lock);
> mm->rss++;
> +   spin_unlock(&mm->page_table_lock);
> 
> pte = mk_pte(page, vma->vm_page_prot);
> 
> @@ -1072,7 +1074,9 @@
> return -1;
> clear_user_highpage(page, addr);
> entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
> +   spin_lock(&mm->page_table_lock);
> mm->rss++;
> +   spin_unlock(&mm->page_table_lock);
> flush_page_to_ram(page);
> }
> set_pte(page_table, entry);
> @@ -,7 +1115,9 @@
> return 0;
> if (new_page == NOPAGE_OOM)
> return -1;
> +   spin_lock(&mm->page_table_lock);
> ++mm->rss;
> +   spin_unlock(&mm->page_table_lock);
> /*
>  * This silly early PAGE_DIRTY setting removes a race
>  * due to the bad i386 page protection. But it's valid
> diff -aur linux-2.4.1-pre11-clean/mm/mmap.c linux/mm/mmap.c
> --- linux-2.4.1-pre11-clean/mm/mmap.c   Sat Dec 30 18:35:19 2000
> +++ linux/mm/mmap.c Sun Jan 28 22:43:04 2001
> @@ -879,8 +879,8 @@
> spin_lock(&mm->page_table_lock);
> mpnt = mm->mmap;
> mm->mmap = mm->mmap_avl = mm->mmap_cache = NULL;
> -   spin_unlock(&mm->page_table_lock);
> mm->rss = 0;
> +   spin_unlock(&mm->page_table_lock);
> mm->total_vm = 0;
> mm->locked_vm = 0;
> while (mpnt) {
> diff -aur linux-2.4.1-pre11-clean/mm/swapfile.c linux/mm/swapfile.c
> --- linux-2.4.1-pre11-clean/mm/swapfile.c   Fri Dec 29 23:07:24 2000
> +++ linux/mm/swapfile.c Sun Jan 28 22:43:04 2001
> @@ -231,7 +231,9 @@
> set_pte(dir, pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
> swap_free(entry);
> get_page(page);
> +   spin_lock(&vma->vm_mm->page_table_lock);
> ++vma->vm_mm->rss;
> +   spin_unlock(&vma->vm_mm->page_table_lock);
>  }
> 
>  static inline void unuse_pmd(struct vm_area_struct * vma, pmd_t *dir,
> 
> --
> Rasmus([EMAIL PROTECTED])



[ANNOUNCEMENT] High resolution timer mailing list/ project

2001-02-14 Thread george anzinger

An open source project is starting at: 

http://sourceforge.net/projects/high-res-timers/

Currently the project is collecting ideas, requirements, etc.

A mailing list has been set up for the project.  To join:

http://lists.sourceforge.net/lists/listinfo/high-res-timers-discourse

To mail to the list (member or not) mail to:

[EMAIL PROTECTED]

Come help us design and build high resolution timers for linux.

George



Re: Kernel timers and jiffie wrap-around

2001-02-18 Thread george anzinger

Jamie wrote:
> 
> Hi !
> 
> I've been trying to determine the reliability of kernel timers when a box
> has been up for a while. Now as everyone is aware (for HZ=100 (default)),
> when the uptime of the kernel reaches (approx.) 1.3 years the clock tick
> count (jiffies) wraps-around. Now if a kernel timer is added just before
> the wrap-around then from the source I get the impression the kernel timer
> will be run immediately instead of after the specified delay. Here's my
> reasoning:
> 
> When adding a timer the internal_add_timer() function is (eventually)
> called. Given that the current jiffies is close to maximum for an unsigned
> long value then the following index value is computed:
> 
> // jiffies = ULONG_MAX - 10, say.
> // so timer_jiffies is close to jiffies.
> // timer.expires = jiffies + TIMEOUT_VALUE, where TIMEOUT_VALUE=200, say.
> 
> index = expires - timer_jiffies;
> 
> Thus index is a large negative number resulting in the timer being added
> to tv1.vec[tv1.index] which means that the timer is run on the next
> execution of run_timer_list().

Now just how did you arrive at this?  What value _is_ ULONG_MAX - 10 +
200?  In unsigned arithmetic it wraps around to 189, and timer_jiffies
(at ULONG_MAX - 10) behaves like -11, so index = 189 - (-11) wraps right
back to the desired 200.  No need to tweak the kernel, just try some
simple C code.  It all works until the requested time out is greater
than ULONG_MAX/2 (about .68 years).
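
A quick user-space demonstration of the unsigned wrap-around that
internal_add_timer() relies on (not kernel code):

#include <limits.h>
#include <stdio.h>

int main(void)
{
	unsigned long timer_jiffies = ULONG_MAX - 10;	/* "now" */
	unsigned long expires = timer_jiffies + 200;	/* wraps to 189 */
	unsigned long index = expires - timer_jiffies;	/* wraps right back */

	printf("expires = %lu, index = %lu\n", expires, index);  /* index == 200 */
	return 0;
}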

George

   snip~
> 
> Surely I've misunderstood something in the timer code ?

Now you have a good interview question for that next potential new hire
:)

snip~



Re: strange nonmonotonic behavior of gettimeoftheday -- seen similar problem on PPC

2001-03-02 Thread george anzinger

"Richard B. Johnson" wrote:
> 
> On Fri, 2 Mar 2001, Christopher Friesen wrote:
> 
> > John Being wrote:
> >
> > > gives following result on box in question
> > > root@**:# ./clo
> > > Leap found: -1687 msec
> > > and prints nothing on all other  my boxes.
> > > This gives me bunch of troubles with occasional hang ups and I found nothing
> > > in kernel archives at
> > > http://www.uwsg.indiana.edu/hypermail/linux/kernel/index.html
> > > just some notes about smth like this for SMP boxes with ntp. Is this issue
> > > known, and how can I fix it?
> >
> > I've run into non-monotonic gettimeofday() on a PPC system with 2.2.17, but it
> > always seemed to be almost exactly a jiffy out, as though it was getting
> > hundredths of a second from the old tick, and microseconds from the new tick.
> > Your leap seems to be more unusual, and the first one I've seen on an x86 box.
> >
> > Have you considered storing the results to see what happens on the next call?
> > Does it make up the difference, or do you just lose that time?
> >
> > Chris
> 
> I think it's a math problem in the test code. Try this:
> 
> #include <stdio.h>
> #include <sys/time.h>
> 
> #define DEB(f)
> 
> int main()
> {
> struct timeval t;
> double start_us;
> double stop_us;
> for(;;)
> {
> gettimeofday(&t, NULL);
> start_us  = (double) t.tv_sec * 1e6;
> start_us += (double) t.tv_usec;
> gettimeofday(&t, NULL);
> stop_us  = (double) t.tv_sec * 1e6;
> stop_us += (double) t.tv_usec;
> if(stop_us <= start_us)
> break;
> DEB(fprintf(stdout, "Start = %f, Stop = %f\n", start_us, stop_us));
> }
> fprintf(stderr, "Start = %f, Stop = %f\n", start_us, stop_us);
> return 0;
> }
> 
> Note that two subsequent calls to gettimeofday() must not return the
> same time even if your CPU runs infinitely fast. I haven't seen any
> kernel in the past few years that fails this test.

Oh!  With only micro second resolution how is this avoided?  The only
"legal" thing to do to avoid this is for the fast boxes to loop until
the requirement is satisfied.  Is this really done?

George



Re: strange nonmonotonic behavior of gettimeoftheday -- seen similar problem on PPC

2001-03-02 Thread george anzinger

"Richard B. Johnson" wrote:
> 
> On Fri, 2 Mar 2001, george anzinger wrote:
> 
> > "Richard B. Johnson" wrote:
> 
~snip~

> > > Note that two subsequent calls to gettimeofday() must not return the
> > > same time even if your CPU runs infinitely fast. I haven't seen any
> > > kernel in the past few years that fails this test.
> >
> > Oh!  With only micro second resolution how is this avoided?  The only
> > "legal" thing to do to avoid this is for the fast boxes to loop until
> > the requirement is satisfied.  Is this really done?
> >
> > George
> >
> 
> Yes and no. It takes microseconds to call the kernel for anything (time
> getpid() ), so it seldom loops. All the kernel has to do is remember
> the last value returned. If the time isn't past that time yet, bump
> that value and return it instead of waiting.
> 
Well, "has to do" and "does" are two different animals.  My reading of
the code shows that it does not.  I have a bit of code that does
gettimeofday() calls as fast as possible and on some boxes (ix86) have
seen the difference as low as 1 micro second.  It is not beyond
imagination that a box might return the same time two times in a row
given the processors performance increases we are seeing.  I, for one,
don't find this objectionable.  I WILL take exception to time running
backward, however.  (I don't see how this is avoided on the leap second
delete, but I have just started looking at this issue.)  As to returning
a time in the future as you suggest, I think this is a bad policy.  If
the box can actually do two gettimeofdays in one micro second or less,
it SHOULD return the same time (given the resolution can not resolve the
difference).  If this becomes an issue, and it will, those that care
should use the clock_gettime() call which should return time in
nanoseconds.  This is part of the POSIX standard support we are
working on at:


http://sourceforge.net/projects/high-res-timers/

George



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-04 Thread george anzinger

Roger Larsson wrote:
> 
> On Thursday 04 January 2001 09:43, ludovic fernandez wrote:
> > Daniel Phillips wrote:
> > > The key idea here is to disable preemption on spin lock and reenable on
> > > spin unlock.  That's a practical idea, highly compatible with the
> > > current way of doing things.  Its a fairly heavy hit on spinlock
> > > performance, but maybe the overall performance hit is small.  Benchmarks
> > > are needed.
> >
> > I'm not sure the hit on spinlock is this heavy (one increment for lock
> > and one dec + test on unlock), but I completely agree (and volonteer)
> > for benchmarking.
> 
> And the conditional jump is usually predicted correctly :-)
> +static inline void enable_preempt(void)
> 
> +{
> +if (atomic_read(&current->preemptable) <= 0) {
> +BUG();
> +}
> +if (atomic_read(&current->preemptable) == 1) {
> 
> This part can probably be put in a proper non inline function.
> Cache issues...
> +/*
> +* At that point a scheduling is healthy iff:
> +* - a scheduling request is pending.
> +* - the task is in running state.
> +* - this is not an interrupt context.
> +* - local interrupts are enabled.
> +*/
> +if (current->need_resched == 1 &&
> +   current->state == TASK_RUNNING &&
> +   !in_interrupt()&&
> +   local_irq_are_enabled())
> +{
> +schedule();
> +}
>
Actually the MontaVista Patch cleverly removes the tests for
in_interrupt() and local_irq_are_enabled() AND the state ==
TASK_RUNNING.  In actual fact these states can be considered way points
on the system status vector.  For example the interrupts off state
implies all the rest, the in_interrupt() implies not preemptable and
finally, not preemptable is one station away from fully preemptable.  

TASK_RUNNING is easily solved by making schedule() aware that it is
being called for preemption.  See the MontaVista patch for details.


ftp://ftp.mvista.com/pub/Area51/preemptible_kernel/


 
> +}
> +atomic_dec(&current->preemptable);
> 
> What if something happens during the schedule() that would require
> another thread...?
> 
> +}
> 
> I have been discussing different layout with George on Montavista
> also doing this kind of work... (different var and value range)
> 
> static inline void enable_preempt(void) {
> if (!--current->preempt_count) {
> smp_mb(); /* not sure if needed... */
> preempt_schedule();
> }
> }
> 
> in sched.c (some smp_mb might be needed here too...)
> void preempt_schedule() {
> while (current->need_resched) {
> current->preempt_count++; /* prevent competition with IRQ code */
> if (current->need_resched)
> schedule();
> current->preempt_count--;
> }
> }
> 
> > I'm not convinced a full preemptive kernel is something
> > interesting mainly due to the context switch cost (actually mmu contex
> > switch).
> 
> It will NOT be fully, it will be mostly.
> You will only context switch when a higher prio thread gets runnable, two
> ways:
> 1) external intterupt waking higher prio process, same context swithes as
> when running in user code. We won't get more interrupts.
> 2) wake up due to something we do. Not that many places, mostly due to
> releasing syncronization objects (spinlocks does not count).
> 
> If this still is a problem, we can select to only preemt to processes running
> RT stuff. SCHED_FIFO and SCHED_RR by letting them set need_resched to 2...

The preemption usually just switches earlier.  The switch would happen
soon anyway.  That is what need_resched =1; means.
> 
> > Benchmarking is a good way to get a global overview on this.
> 
> Remember to benchmark with stuff that will make the positive aspects visible
> too. Playing audio (with smaller buffers), more reliably burning CD ROMs,
> less hichups while playing video [if run with higher prio...]
> Plain throuput tests won't tell the whole story!
> 
> see
>  http://www.gardena.net/benno/linux/audio
>  http://www.linuxdj.com/latency-graph/
> 
> > What about only preemptable kernel threads ?
> 
> No, it won't help enough.
> 
> --
> --
> Home page:
>   http://www.norran.net/nra02596/



Re: [PATCH] 2.4.0-prerelease: preemptive kernel.

2001-01-05 Thread george anzinger

ludovic fernandez wrote:
> 
> george anzinger wrote:
> 
> > Roger Larsson wrote:
> > >
> >
> > > This part can probably be put in a proper non inline function.
> > > Cache issues...
> > > +/*
> > > +* At that point a scheduling is healthy iff:
> > > +* - a scheduling request is pending.
> > > +* - the task is in running state.
> > > +* - this is not an interrupt context.
> > > +* - local interrupts are enabled.
> > > +*/
> > > +if (current->need_resched == 1 &&
> > > +   current->state == TASK_RUNNING &&
> > > +   !in_interrupt()&&
> > > +   local_irq_are_enabled())
> > > +{
> > > +schedule();
> > > +}
> > >
> > Actually the MontaVista Patch cleverly removes the tests for
> > in_interrupt() and local_irq_are_enabled() AND the state ==
> > TASK_RUNNING.  In actual fact these states can be considered way points
> > on the system status vector.  For example the interrupts off state
> > implies all the rest, the in_interrupt() implies not preemptable and
> > finally, not preemptable is one station away from fully preemptable.
> >
> > TASK_RUNNING is easily solved by making schedule() aware that it is
> > being called for preemption.  See the MontaVista patch for details.
> >
> 
> Humm, I'm just curious,
> Regarding in_interrupt(). How do you deal with soft interrupts?
> Guys calling cpu_bh_disable() or even incrementing the count on
> their own.
 
#define cpu_bh_disable(cpu) do { ctx_sw_off(); local_bh_count(cpu)++;
barrier(); } while (0)
#define cpu_bh_enable(cpu)  do { barrier();
local_bh_count(cpu)--;ctx_sw_on(); } while (0)

I don't know if this is acceptable but it definitely can be done,
> I prefer to rely on fact than on API.

Yes, of course anything CAN be done, but then they would be SOL with the
movement of the flag location (as was done on the way from 2.3 to
2.4.0).  If we encounter such problems, we just fix them.

> Regarding local_irq_enabled(). How do you handle the code that
> call local_irq_disable(), then spin_lock(), spin_unlock() and only
> re-enable the interruptions ? 

Good question, as this is exactly what spin_lock_irq()/spin_unlock_irq()
do.  In this case it is not a problem as the intent was the same anyway,
but we can fix the code to handle this.  If you read the patch, you will
find that we call preempt_schedule() which calls schedule().  We could
easily put a test of the interrupt off state here and reject the
preemption.  The real issue here is how to catch the preemption when
local_irq_enable() is called.  If the system has an interrupt dedicated
to scheduling we could use this, however, while this is available in SMP
systems it is usually not available in UP systems.

On the other hand I have not seen any code do this.  I have, however,
seen code that:
spin_lock_irq()
  :
local_irq_enable()
  :
spin_unlock()

We would rather not mess with the preemption count while irq is disabled
but this sort of code messes up the pairing we need to make this work.

> In this case, you preempt code that
> is supposed to run interruptions disabled.
> Finally, regarding the test on the task state, there may be a cache issue
> but calling schedule() has also some overhead.
> 
I am not sure what you are getting at here.  The task state will be
looked at by schedule() in short order in any case so a cache miss is
not the issue.  We don't look at the state but on the way to schedule()
(in preempt_schedule()) we add a flag to the state to indicate that it
is a preemption call.  schedule() is then changed to treat this task as
running, regardless of the state.
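
Roughly like this (a sketch of the idea only; TASK_PREEMPTING is an
illustrative name and the real MontaVista patch differs in detail):

#define TASK_PREEMPTING	0x100		/* hypothetical flag value */

void preempt_schedule(void)
{
	current->state |= TASK_PREEMPTING;	/* tell schedule() why we came */
	schedule();	/* schedule() sees the flag, keeps the task runnable,
			 * and clears the flag once the switch is set up */
}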

George

> Ludo.



Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0

2001-01-12 Thread george anzinger

Andrew Morton wrote:
> 
> Nigel Gamble wrote:
> >
> > Spinlocks should not be held for lots of time.  This adversely affects
> > SMP scalability as well as latency.  That's why MontaVista's kernel
> > preemption patch uses sleeping mutex locks instead of spinlocks for the
> > long held locks.
> 
> Nigel,
> 
> what worries me about this is the Apache-flock-serialisation saga.
> 
> Back in -test8, kumon@fujitsu demonstrated that changing this:
> 
> lock_kernel()
> down(sem)
> 
> up(sem)
> unlock_kernel()
> 
> into this:
> 
> down(sem)
> 
> up(sem)
> 
> had the effect of *decreasing* Apache's maximum connection rate
> on an 8-way from ~5,000 connections/sec to ~2,000 conn/sec.
> 
> That's downright scary.
> 
> Obviously,  was very quick, and the CPUs were passing through
> this section at a great rate.

If  was that fast, maybe the down/up should have been a spinlock
too.  But what if it is changed to:

  BKL_enter_mutex()
  down(sem)
  
  up(sem)
  BKL_exit_mutex()
> 
> How can we be sure that converting spinlocks to semaphores
> won't do the same thing?  Perhaps for workloads which we
> aren't testing?

The key is to keep the fast stuff on the spinlock and the slow stuff on
the mutex.  Otherwise you WILL eat up the cpu with the overhead.
> 
> So this needs to be done with caution.
> 
> As davem points out, now we know where the problems are
> occurring, a good next step is to redesign some of those
> parts of the VM and buffercache.  I don't think this will
> be too hard, but they have to *want* to change :)

They will *want* to change if they pop up due to other work :)
> 
> Some of those algorithms are approximately O(N^2), for huge
> values of N.
> 



Re: Latency: allowing resheduling while holding spin_locks

2001-01-13 Thread george anzinger

Nigel Gamble wrote:
> 
> On Sat, 13 Jan 2001, Roger Larsson wrote:
> > A rethinking of the rescheduling strategy...
> 
> Actually, I think you have more-or-less described how successful
> preemptible kernels have already been developed, given that your
> "sleeping spin locks" are really just sleeping mutexes (or binary
> semaphores).
> 
> 1.  Short critical regions are protected by spin_lock_irq().  The maximum
> value of "short" is therefore bounded by the maximum time we are happy
> to disable (local) interrupts - ideally ~100us.
> 
> 2.  Longer regions are protected by sleeping mutexes.
> 
> 3.  Algorithms are rearchitected until all of the highly contended locks
> are of type 1, and only low contention locks are of type 2.
> 
> This approach has the advantage that we don't need to use a no-preempt
> count, and test it on exit from every spinlock to see if a preempting
> interrupt that has caused a need_resched has occurred, since we won't
> see the interrupt until it's safe to do the preemptive resched.

I agree that this was true in days of yore.  But these days the irq
instructions introduce serialization points and, methinks, may be much
more time consuming than the "++, --, if (false)" that a preemption
count implementation introduces.  Could someone with a knowledge of the
hardware comment on this?

I am not suggesting that the "++, --, if (false)" is faster than an
interrupt, but that it is faster than cli, sti.  Of course we are
assuming that there is  between the cli and the sti as there is
between the ++ and the -- if (false).

George

> 
> Nigel Gamble[EMAIL PROTECTED]
> Mountain View, CA, USA. http://www.nrg.org/



Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0

2001-01-14 Thread george anzinger

"David S. Miller" wrote:
> 
> Nigel Gamble writes:
>  > That's why MontaVista's kernel preemption patch uses sleeping mutex
>  > locks instead of spinlocks for the long held locks.
> 
> Anyone who uses sleeping mutex locks is asking for trouble.  Priority
> inversion is an issue I dearly hope we never have to deal with in the
> Linux kernel, and sleeping SMP mutex locks lead to exactly this kind
> of problem.
> 
Exactly why we are going to us priority inherit mutexes.  This handles
the inversion nicely.

George



Re: Latency: allowing resheduling while holding spin_locks

2001-01-15 Thread george anzinger

Roger Larsson wrote:
> 
> On Sunday 14 January 2001 01:06, george anzinger wrote:
> > Nigel Gamble wrote:
> > > On Sat, 13 Jan 2001, Roger Larsson wrote:
> > > > A rethinking of the rescheduling strategy...
> > >
> > > Actually, I think you have more-or-less described how successful
> > > preemptible kernels have already been developed, given that your
> > > "sleeping spin locks" are really just sleeping mutexes (or binary
> > > semaphores).
> > >
> > > 1.  Short critical regions are protected by spin_lock_irq().  The maximum
> > > value of "short" is therefore bounded by the maximum time we are happy
> > > to disable (local) interrupts - ideally ~100us.
> > >
> > > 2.  Longer regions are protected by sleeping mutexes.
> > >
> > > 3.  Algorithms are rearchitected until all of the highly contended locks
> > > are of type 1, and only low contention locks are of type 2.
> > >
> > > This approach has the advantage that we don't need to use a no-preempt
> > > count, and test it on exit from every spinlock to see if a preempting
> > > interrupt that has caused a need_resched has occurred, since we won't
> > > see the interrupt until it's safe to do the preemptive resched.
> >
> > I agree that this was true in days of yore.  But these days the irq
> > instructions introduce serialization points and, me thinks, may be much
> > more time consuming than the "++, --, if (false)" that a preemption
> > count implemtation introduces.  Could some one with a knowledge of the
> > hardware comment on this?
> >
> > I am not suggesting that the "++, --, if (false)" is faster than an
> > interrupt, but that it is faster than cli, sti.  Of course we are
> > assuming that there is  between the cli and the sti as there is
> > between the ++ and the -- if (false).
> >
> 
> The problem with counting scheme is that you can not schedule inside any
> spinlock - you have to split them up. Maybe you will have to do that anyway.
> But if your RT process never needs more memory - it should be quite safe.
> 
> The difference with a sleeping mutex is that it can be made lazier - keep it
> in the runlist, there should be very few...
> 
Nigel and I agree on the approach he has layed out with the possible
exception of just how to handle the short spinlocks.  It is agreed that
we can not preempt a task that has a spinlock.  He suggests that the
overhead of testing for preemption on the exit of a spinlock protected
with the preempt_count is higher than the cost of turning off and on the
interrupt system.  He may well be right, and surly was right 5 or 10
years ago.  Today the cost of an cli or sti is much higher relative to
the memory references, especially if we don't need to make the result
visible to other processors (and we don't).  We only have to serialize
WRT our own interrupt system, but the interrupt itself will do this, and
only when we need it.

snip

WRT your patch, A big problem with simple sleeping mutexes is that of
priority inversion.  An example:

Given tasks L of low priority, M of medium, and H of high and X a mutex.
If L is holding X when it is preempted by M and
M wants to run a long time
Then when H preempts M and tries to get X it will have to wait while M
does his thing, just because L can not get the cycles needed to get out
of X.

A priority inherit mutex (pi_mutex) handles this by, when H tries to get
X, boosting the priority of L (the holder of X) to its own priority
until L releases X.  At this point L reverts to its prior priority and H
continues, now having succeeded in getting X.  This is all complicated,
of course, by remembering that a task can hold several mutexes at a time
and each can have several waiters.
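
A very rough sketch of the boost step (illustrative only, not the
MontaVista code; locking of the mutex internals and the nesting of several
held mutexes are ignored):

struct pi_mutex {
	struct task_struct *owner;	/* NULL when free */
	/* wait queue, spinlock guarding these fields, ... */
};

static void pi_boost(struct pi_mutex *m, struct task_struct *waiter)
{
	/* H blocks on X: lend L (the owner) H's priority until it
	 * releases X, where the saved priority is put back. */
	if (m->owner && m->owner->rt_priority < waiter->rt_priority)
		m->owner->rt_priority = waiter->rt_priority;
}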

From a real time point of view, we would NEVER want to scan the task
list looking for someone to wake up.  We should know who to wake up from
the getgo.  Likewise, clutter in the run_list adds wasted cycles and
cache lines to the schedule process.

George



Re: Coding Style

2001-01-21 Thread george anzinger

Anton Altaparmakov wrote:

> At 06:29 20/01/2001, Linus Torvalds wrote:
> 
>> On Fri, 19 Jan 2001, Mark I Manning IV wrote:
> 
> [snip]
> 
>>  > > And two spaces is not enough. If you write code that needs 
>> comments at
>>  > > the end of a line, your code is crap.
>>  >
>>  > Might i ask you to qualify that statement ?
>> 
>> Ok. I'll qualify it. Add a "unless you write in assembly language" to the
>> end. I have to admit that most assembly languages are terse and hard to
>> read enough that you often want to comment at the end. In assembly you
>> just don't tend to have enough flexibility to make the code readable, so
>> you must live with unreadable code that is commented.
> 
> 
> Would you not also add "unless you are defining structure definitions 
> and want to explain what each of the struct members means"?
> 
> [snip]
> 
>> Notice? Not AFTER the statements.
>> 
>> Why? Because you are likely to want to change the statements. You don't
>> want to keep moving the comments around to line them up. And trying to
>> have a multi-line comment with code is just HORRIBLE:
> 
> 
> And structs are not likely to change so this argument would not longer 
> apply?
> 
> Just curious.

I am curious about another style issue.  In particular _inline_ in ".h" 
files.  The "style" for this as practiced today is about to run into a 
brick wall.  Try, for example, referring to "current->need_resched" 
within the spin_lock() macro.  I don't think you can get this to work 
with the current kernel, and if you can, a) let me know how, and b) it 
is much too hard.  Imho it is time to rethink how (or where) we put 
_inline_ code.

I think the problem could be fixed by a convention in the compiler that 
says something like "_inline_" code will be compiled when a reference to 
it is encountered, but this is outside the standard (i.e. the standard 
allows it to be as it is and does not disallow my proposed convention, 
as I understand the standard).

George







Re: test11-pre2 compile error undefined reference to `bust_spinlocks' WHAT?!

2000-11-13 Thread George Anzinger

Andrew Morton wrote:
> 
> George Anzinger wrote:
> >
> > The notion of releasing a spin lock by initializing it seems IMHO, on
> > the face of it, way off.  Firstly the protected area is no longer
> > protected which could lead to undefined errors/ crashes and secondly,
> > any future use of spinlocks to control preemption could have a lot of
> > trouble with this, principally because the locker is unknown.
> >
> > In the case at hand, it would seem that an unlocked path to the console
> > is a more correct answer that gives the system a far better chance of
> > actually remaining viable.
> >
> 
> Does bust_spinlocks() muck up the preemptive kernel's spinlock
> counting?  Would you prefer spin_trylock()/spin_unlock()?
> It doesn't matter - if we call bust_spinlocks() the kernel is
> known to be dead meat and there is a fsck in your near future.

Well, actually this fails just as badly, since the locker is not unlocking
and the preemption counts are task local... BUT, see below.
> 
> We are still trying to find out why kumon@fujitsu's 8-way is
> crashing on the test10-pre5 sched.c.  Looks like it's fixed
> in test11-pre2 but we want to know _why_ it's fixed.  And at
> present each time he hits the bug, his printk() deadlocks.
> 
> So bust_spinlocks() is a RAS feature :)  A very important one -
> it's terrible when your one-in-a-trillion bug happens and there
> are no diagnostics.
>
I agree, this is why, in the preemption patch, we have an "unlocked"
printk.  Attached is the relevant portion of the preemption patch for
test9.

I think it still suffers from the console lock, but it is a bit further
down the road.

The patch also illustrates why I am looking for a way to pass var args
to the next function down the line.  If I had this, the patch would be
WAY simpler and would not duplicate the body of printk.
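
A minimal sketch of that "pass the va_list down" idea (vprintk_core is
a hypothetical name; the real printk body would move into it):

#include <stdarg.h>

int vprintk_core(const char *fmt, va_list args, int take_locks);

int printk(const char *fmt, ...)
{
        va_list args;
        int r;

        va_start(args, fmt);
        r = vprintk_core(fmt, args, 1);         /* locked path   */
        va_end(args);
        return r;
}

int printk_unlocked(const char *fmt, ...)
{
        va_list args;
        int r;

        va_start(args, fmt);
        r = vprintk_core(fmt, args, 0);         /* lockless path */
        va_end(args);
        return r;
}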

George
 
> It's a work-in-progress.  There are a lot of things which
> can cause printk to deadlock:
> 
> - console_lock
> - timerlist_lock
> - global_irq_lock (console code does global_cli)
> - log_wait.lock
> - tasklist_lock (printk does wake_up) (*)
> - runqueue_lock (printk does wake_up)
> 
> I'll be proposing a better patch for this in a few days.
> 
> (*) Keith: this explains why you can't do a printk() in
> __wake_up_common: printk calls wake_up().  Duh.

diff -urP -X patch.exclude linux-2.4.0-test9-kb-rts/kernel/printk.c 
linux/kernel/printk.c
--- linux-2.4.0-test9-kb-rts/kernel/printk.cWed Jul  5 11:00:21 2000
+++ linux/kernel/printk.c   Thu Nov  2 10:17:20 2000
@@ -312,6 +312,64 @@
return i;
 }
 
+#if defined(CONFIG_KGDB) && defined(CONFIG_PREEMPT)
+asmlinkage int printk_unlocked(const char *fmt, ...)
+{
+   va_list args;
+   int i;
+   char *msg, *p, *buf_end;
+   int line_feed;
+   static signed char msg_level = -1;
+
+   va_start(args, fmt);
+   i = vsprintf(buf + 3, fmt, args); /* hopefully i < sizeof(buf)-4 */
+   buf_end = buf + 3 + i;
+   va_end(args);
+   for (p = buf + 3; p < buf_end; p++) {
+   msg = p;
+   if (msg_level < 0) {
+   if (
+   p[0] != '<' ||
+   p[1] < '0' || 
+   p[1] > '7' ||
+   p[2] != '>'
+   ) {
+   p -= 3;
+   p[0] = '<';
+   p[1] = default_message_loglevel + '0';
+   p[2] = '>';
+   } else
+   msg += 3;
+   msg_level = p[1] - '0';
+   }
+   line_feed = 0;
+   for (; p < buf_end; p++) {
+   log_buf[(log_start+log_size) & LOG_BUF_MASK] = *p;
+   if (log_size < LOG_BUF_LEN)
+   log_size++;
+   else
+   log_start++;
+
+   logged_chars++;
+   if (*p == '\n') {
+   line_feed = 1;
+   break;
+   }
+   }
+   if (msg_level < console_loglevel && console_drivers) {
+   struct console *c = console_drivers;
+   while(c) {
+   if ((c->flags & CON_ENABLED) && c->write)
+   c->write(c, msg, p - msg + line_feed);
+   c = c->next;
+   }
+   }
+   if (line_feed)
+   msg_level = -1;
+   }
+   return i;
+}
+#endif
 void console_print(const char *s)
 {
struct console *c;



In line ASM magic? What is this?

2000-11-15 Thread George Anzinger

I am trying to understand what is going on in the following code.  The
reference for %2, i.e. "m"(*__xg(ptr)) seems like magic (from
.../include/i386/system.h).  At the same time, the code "m" (*mem) from
the second __asm__ below (my code) seems to generate the required asm
code.  Before I go with the simple version, could someone tell me why? 
Inquiring minds want to know.

struct __xchg_dummy { unsigned long a[100]; };
#define __xg(x) ((struct __xchg_dummy *)(x))

__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %b1,%2"
 : "=a"(prev)
 : "q"(new), "m"(*__xg(ptr)), "0"(old)
 : "memory");


__asm__ __volatile__(
 LOCK "cmpxchgl %1,%2\n\t"
 :"=a" (result)
 :"r" (new),
  "m" (*mem),
  "a0" (test)
 : "memory");


George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: In line ASM magic? What is this?

2000-11-15 Thread George Anzinger

Timur Tabi wrote:
> 
> ** Reply to message from George Anzinger <[EMAIL PROTECTED]> on Wed, 15 Nov
> 2000 12:55:46 -0800
> 
> > I am trying to understand what is going on in the following code.  The
> > reference for %2, i.e. "m"(*__xg(ptr)) seems like magic (from
> > .../include/i386/system.h).  At the same time, the code "m" (*mem) from
> > the second __asm__ below (my code) seems to generate the required asm
> > code.  Before I go with the simple version, could someone tell me why?
> > Inquiring minds want to know.
> >
> > struct __xchg_dummy { unsigned long a[100]; };
> > #define __xg(x) ((struct __xchg_dummy *)(x))
> >
> >   __asm__ __volatile__(LOCK_PREFIX "cmpxchgl %b1,%2"
> >: "=a"(prev)
> >: "q"(new), "m"(*__xg(ptr)), "0"(old)
> >: "memory");
> >
> >
> >   __asm__ __volatile__(
> >  LOCK "cmpxchgl %1,%2\n\t"
> >  :"=a" (result)
> >  :"r" (new),
> >   "m" (*mem),
> >   "a0" (test)
> >  : "memory");
> 
> I've been a lot of gcc inline asm recently, and I still consider it a black
> art.  There are times when I just throw in what I think makes sense, and then
> look at the code the compiler generated.  If it's wrong, I try something else.
> 
> Both versions look correct to me.  The "m" simply tells the compiler that
> __xg(ptr) is a memory location, and the contents of that memory location should
> NOT be copied to a register.  The confusion occurs because its unintuitive that
> the "*" is required.  Otherwise, it would have been "r", which basically tells
> the compiler to copy the contents to a register first.
> 
I know the feeling.  I am currently struggling with "inconsistent
constraints".  Still, I must assume that form 1 was used instead of 2
for some reason.
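
For reference, a minimal standalone sketch of the simpler form, wrapped
as a function (names here are illustrative, not kernel API):

static inline unsigned long
my_cmpxchg(volatile unsigned long *mem, unsigned long old, unsigned long new)
{
        unsigned long prev;

        __asm__ __volatile__("lock; cmpxchgl %1,%2"
                             : "=a" (prev)        /* old *mem comes out   */
                             : "r" (new),         /* value to store       */
                               "m" (*mem),        /* memory target        */
                               "0" (old)          /* expected, in eax     */
                             : "memory");
        return prev;                              /* == old -> it swapped */
}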

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: test9: running tasks not in run-queue

2000-11-09 Thread George Anzinger

"David S. Miller" wrote:
> 
>Date:Wed, 8 Nov 2000 15:11:49 -0800
>From: Mike Kravetz <[EMAIL PROTECTED]>
> 
>The following code in __wake_up_common() is then
>executed:
> 
>if (best_exclusive)
>best_exclusive->state = TASK_RUNNING;
>wq_write_unlock_irqrestore(&q->lock, flags);
> 
> test10 fixes this error, now it sets TASK_RUNNING and
> adds the task back to the runqueue all under the runqueue
> lock.

In our preemptable kernel work we often put (or leave) tasks on the run
queue that are not in state TASK_RUNNING and want to treat them as if
they are in state TASK_RUNNING.  We thus changed the test in schedule()
to "task_on_runqueue(prev)"

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: *_trylock return on success?

2000-12-04 Thread george anzinger

So what is a coder to do?  We need to define the pi_mutex_trylock().  If
I understand this thread, it should return 0 on success.  Is this
correct?

George


On Saturday 25 November 2000 22:05, Roger Larsson wrote: 
> On Saturday 25 November 2000 20:22, Philipp Rumpf wrote: 
> > On Sat, Nov 25, 2000 at 08:03:49PM +0100, Roger Larsson wrote: 
> > > > _trylock functions return 0 for success. 
> > > 
> > > Not spin_trylock 
> > 
> > Argh, I missed the (recent ?) change to make x86 spinlocks use 1 to mean 
> > unlocked. You're correct, and obviously this should be fixed. 

Have looked more into this now... 
tasklet_trylock is also wrong (but there are only four of them) 
Is this 2.4 only, or were there spin locks earlier too? 

My suggestion now is a few steps: 
1) to release a kernel version that has corrected _trylocks; 
spin2_trylock and tasklet2_trylock. 
[with corresponding updates in as many places as possible: 
  s/!spin_trylock/spin2_trylock/g 
  s/spin_trylock/!spin2_trylock/g 
  . . .] 
(ready for spin trylock, not done for tasklet yet..., attached, 
 hope it got included OK - not fully used to kmail) 

2) This will only affect in-house drivers or compilations that in some 
strange way use these calls... 

3a) (DANGEROUS) global rename spin2_trylock to spin_trylock 
 [no logic change this time - only name] 
3b) (dangerous) add compatibility interface 
 #define spin_trylock(L) (!spin2_trylock(L)) 
 Probably not necessary since it can not be linked against. 
 Binary modules will contain their own compatibility code :-) 
 Probably preferred by those who maintain drivers for several 
 releases; 2.2, 2.4, ... 
3c) do not do anything more... 

Alternative: 
1b) do nothing at all - suffer later 

/RogerL
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Kernel Debugging

2000-08-29 Thread George Anzinger

"Amit S. Kale" wrote:
> 
> George Anzinger wrote:
> > I took a look at this and it looks very messy.  The whole notion is that
> > the stack is available to put the call on and then proceed.  The
> > question then is what does the stack look like after the call to the
> > inferior function which, one way or another get back to the kgdb.  I
> > think the back trace should show the frames that kgdb put on the stack
> > and then trace back to the orgional stack.  If gdb is some how not on
> > the same stack, we will not see it.  In fact the inferior trap/break
> > would leave a bit of a mess to clean up.  We would have to move the
> > stack again, but to a new place, etc.
> 
> Yes. putting new parameters etc. on top of the stack inside kgdb
> makes back trace show called function, kgdb function calls and then
> original kernel function calls. This creates
> 1. Reentrancy problems.
> 2. Called function can corrupt kgdb data.
> 
> > An alternative solution is to put the call in a seperate memory area and
> > put the parameters, etc. on the stack.  This would unwind cleanly and
> > gdb already drops the inferior call from being useful when it ends other
> > than by returning.  Some systems do not allow execution from the stack,
> > so code already exists in gdb to do this.  Problem is a) how to turn it
> > on, and b) how to tell gdb where the real sp is so it can lay down the
> > call parameters correctly.  Of course, even this may be a problem as the
> > stub still needs to make calls to get characters from the interface, but
> > this could be covered by boosting the stack address given to gdb for the
> > inverior function call.
> 
> In this case the stack would be broken.
> 1. It again has reentrancy problems
> 2. back trace is not clean it's broken.
> 
> IMO a better option is to make kgdb intelligent to recognize the fact
> that gdb is pushing parameters and calling another function. Save all
> this in some other area and put parameters and the call just before
> returning from kgdb. This solution does not have reentrancy problems
> as artificially constructed call looks similar to a normal call from
> called function. No kgdb frames appear in between original function.
> kgdb does not need to remember that it has executed a function call.

Well I guess I am up for the challenge.  Is there a clue when this is
happening other than a write to the stack area?  I assume gdb expects,
on return, to be able to just continue to resume the program.  How is
the completion of the inferior function recognized?  Could be a
breakpoint or some return to the gdb.  Hm, let's see, in a ptrace
environment the only available option is a breakpoint.  Right?

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Drivers that potentially leave state as TASK_{UN}INTERRUPTIBLE

2000-09-06 Thread George Anzinger

John Levon wrote:
> 
> Am I right ? against test8pre1
> 
> Also, is it a bug to not set TASK_{UN}INTERRUPTIBLE before doing a
> schedule_timeout() ? What will happen ?
> 
Well, first the "timeout" call will return immediately.  Next, when the
time out actually happens, if the task is not TASK_RUNNING (i.e. it is
waiting for some other thing) it will wake_up.  So the sleep is lost and
it is possible to have a false wake up (could even wake up a zombie). 
If the actual timeout happens while the task is TASK_RUNNING it is
ignored.
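
A minimal sketch of the usual idiom (2.4 style), for contrast:

        current->state = TASK_INTERRUPTIBLE;    /* set the state FIRST   */
        schedule_timeout(HZ);                   /* now really sleeps ~1s */

        /* The buggy pattern discussed above: calling schedule_timeout(HZ)
         * while still TASK_RUNNING returns immediately. */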

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: scheduler policy question

2000-09-06 Thread George Anzinger

Hubert Tonneau wrote:
> 
> Benchmarking Pliant (http://pliant.cx/) semaphores led to unexpectedly
> low results. The problem to either kernel bad features or bugs in my
> program since Pliant uses no glue library such as glibc: it calls
> directly kernel funtions.
> 
> Not enough scheduling problem:
> -
> I traced the problem to the fact that when the semaphore is locked,
> I call sched_yield kernel function expecting that it will switch
> to another thread, whereas my results seem to mean that it does not.
> On the other hand, calling nanosleep kernel function with 0 as tv_sec and
> tv_nsec does exactly what I would have expected from sched_yield.
> Did I misunderstood sched_yield semantic ?

NO.  Yield has been broken more often than not.  It is broken in 2.4.0-test6
for example.

> So, sched_yield function seems to do nothing on my 2.2.15 UP box. On the other
> hand, on my 2.0.38 SMP box, it seems to work as expected (same as nanosleep 0)
> 
> Too much scheduling problem:
> ---
> In order to restart a stopped thread (which sent a SIGSTOP signal to itself),
> I send a SIGCONT signal to the target thread using kill kernel function.
> The problem here is that it seems (also what I deduced from benchmarks) that
> the calling thread will be preempted immediatly.
> This is a big problem in the case I want to restart a set of threads
> at once (in case they are waiting for read access on a semaphore) and also a
> small one in the very general case since they is too much ping pong between
> the various threads dealing with the semaphore resulting in too much wasted
> time in the kernel scheduler.
> So my question is how can I restart another thread and continue running
> the current one until it's time slice ends ?

One answer is to jump to a real time priority.  Fall back when you want
to yield.  Problem is you need root priv to do the call.
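
A user-space sketch of that boost/fall-back trick (assumes root; error
checking omitted):

#include <sched.h>

void rt_boost(void)                     /* jump to a real time priority */
{
        struct sched_param p;

        p.sched_priority = 1;
        sched_setscheduler(0, SCHED_FIFO, &p);
}

void rt_unboost(void)                   /* fall back, letting others run */
{
        struct sched_param p;

        p.sched_priority = 0;
        sched_setscheduler(0, SCHED_OTHER, &p);
}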

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[patch]2.4.0-test6 "spinlock" preemption patch

2000-09-06 Thread George Anzinger
+   if (msg_level < console_loglevel && console_drivers) {
+   struct console *c = console_drivers;
+   while(c) {
+   if ((c->flags & CON_ENABLED) && c->write)
+   c->write(c, msg, p - msg + line_feed);
+   c = c->next;
+   }
+   }
+   if (line_feed)
+   msg_level = -1;
+   }
+   return i;
+}
+#endif
 void console_print(const char *s)
 {
struct console *c;
diff -urP -X patch.exclude linux-2.4.0-test6-org/kernel/sched.c linux/kernel/sched.c
--- linux-2.4.0-test6-org/kernel/sched.cFri Aug  4 16:15:37 2000
+++ linux/kernel/sched.cTue Sep  5 17:43:24 2000
@@ -4,12 +4,14 @@
  *  Kernel scheduler and related syscalls
  *
  *  Copyright (C) 1991, 1992  Linus Torvalds
+ *  Preemption code  Copyright (C) 2000 MontaVista Software Inc.
  *
  *  1996-12-23  Modified by Dave Grothe to fix bugs in semaphores and
  *  make semaphores SMP safe
  *  1998-11-19 Implemented schedule_timeout() and related stuff
  * by Andrea Arcangeli
  *  1998-12-28  Implemented better SMP scheduling by Ingo Molnar
+ *  2000-9-5Preemption changes by George Anzinger [EMAIL PROTECTED]
  */
 
 /*
@@ -29,6 +31,12 @@
 #include 
 #include 
 
+#ifdef CONFIG_PREEMPT
+#ifdef DEBUG_PREEMPT
+int in_scheduler=0;
+#endif
+#endif
+
 extern void timer_bh(void);
 extern void tqueue_bh(void);
 extern void immediate_bh(void);
@@ -456,7 +464,11 @@
 {
current->need_resched |= prev->need_resched;
 #ifdef CONFIG_SMP
+#ifdef CONFIG_PREEMPT
+   if ((task_on_runqueue(prev)) &&
+#else
if ((prev->state == TASK_RUNNING) &&
+#endif
(prev != idle_task(smp_processor_id()))) {
unsigned long flags;
 
@@ -472,6 +484,14 @@
 
 void schedule_tail(struct task_struct *prev)
 {
+#ifdef CONFIG_PREEMPT
+#ifdef DEBUG_PREEMPT
+   ASSERT(in_scheduler,
+   "in_scheduler == 0 in schedule_tail");
+   in_scheduler = 0; 
+#endif
+   ctx_sw_on();
+#endif
__schedule_tail(prev);
 }
 
@@ -492,6 +512,18 @@
struct list_head *tmp;
int this_cpu, c;
 
+#ifdef CONFIG_PREEMPT
+   ctx_sw_off(); 
+#ifdef DEBUG_PREEMPT
+if ( in_scheduler || in_ctx_sw_off()<1){
+printk("Recursive sched call, count=%d\n",in_ctx_sw_off());
+printk("Called from 0x%x\n",(int)__builtin_return_address(0));
+}
+   ASSERT(!in_scheduler++,
+   "in_scheduler == 1 in schedule()");
+#endif
+#endif
+
if (!current->active_mm) BUG();
if (tq_scheduler)
goto handle_tq_scheduler;
@@ -526,10 +558,17 @@
switch (prev->state & ~TASK_EXCLUSIVE) {
case TASK_INTERRUPTIBLE:
if (signal_pending(prev)) {
+#ifdef CONFIG_PREEMPT
+case TASK_PREEMPTING:
+#endif
prev->state = TASK_RUNNING;
break;
}
default:
+#ifdef CONFIG_PREEMPT
+if (prev->state & TASK_PREEMPTING )
+break;
+#endif
del_from_runqueue(prev);
case TASK_RUNNING:
}
@@ -545,7 +584,11 @@
 */
next = idle_task(this_cpu);
c = -1000;
+#ifdef CONFIG_PREEMPT
+if (task_on_runqueue(prev))
+#else
if (prev->state == TASK_RUNNING)
+#endif
goto still_running;
 
 still_running_back:
@@ -630,6 +673,15 @@
}
}
 
+#ifdef CONFIG_PREEMPT
+#ifdef DEBUG_PREEMPT
+if (in_ctx_sw_off() < 1) {
+   current->comm[15] = 0;
+PANIC2("inside schedule() : switch_off_count = %d, \n", 
+   in_ctx_sw_off());
+}
+#endif
+#endif
/*
 * This just switches the register state and the
 * stack.
@@ -639,6 +691,14 @@
 
 same_process:
reacquire_kernel_lock(current);
+#ifdef CONFIG_PREEMPT
+#ifdef DEBUG_PREEMPT
+   ASSERT(in_scheduler,
+   "in_scheduler == 0 in schedule_tail");
+   in_scheduler = 0; 
+#endif
+   ctx_sw_on_no_preempt(); 
+#endif
return;
 
 recalculate:
@@ -1018,9 +1078,9 @@
spin_lock_irq(&runqueue_lock);
if (current->policy == SCHED_OTHER)
current->policy |= SCHED_YIELD;
-   current->need_resched = 1;
move_last_runqueue(current);
spin_unlock_irq(&runqueue_lock);
+schedule();
return 0;
 }
 
@@ -1242,3 +1302,72 @@
atomic_inc(&init_mm.mm_count);
enter_lazy_tlb(&init_mm, current, cpu);
 }
+#ifdef CONFIG_PREEMPT
+asmlinkage void preempt_schedule(void)
+{
+current->state |= TASK_PREEMPTING;

Re: Drivers that potentially leave state as TASK_{UN}INTERRUPTIBLE

2000-09-06 Thread George Anzinger

John Levon wrote:
> 
> On Wed, 6 Sep 2000, George Anzinger wrote:
> 
> > John Levon wrote:
> > >
> > > Am I right ? against test8pre1
> > >
> > > Also, is it a bug to not set TASK_{UN}INTERRUPTIBLE before doing a
> > > schedule_timeout() ? What will happen ?
> > >
> > Well, first the "timeout" call will return immediately.  Next, when the
> > time out actually happens, if the task is not TASK_RUNNING (i.e. it is
> > waiting for some other thing) it will wake_up.  So the sleep is lost and
> > it is possible to have a false wake up (could even wake up a zombie).
> > If the actual timeout happens while the task is TASK_RUNNING it is
> > ignored.
> >
> > George
> >
> 
> So it seems to be a bug at least in terms of timing. Unfortunately I only
> got about 4 replies to the patches that touched 20+ drivers. I suppose I
> should just hassle maintainers until they fix it or tell me where I've
> gone wrong ...
>
Actually I was not quite correct.  The call to schedule_timeout() WILL
return immediately; however, the timeout code will clean up the timer,
so there should be no worry there.  It is a bug in that the sleep does
not happen as expected.  I saw at least one place where there were
comments about it not working... and he did not set the state.  For
what it's worth, I think schedule_timeout() should be changed to
__schedule_timeout() and two new timeout calls,
{un}interruptible_sleep_on_timeout(), be added; these calls would do
the setup of the state.  In fact the interruptible version already
exists.  Of course things like select would have to use
__schedule_timeout() as they are waiting for any of several events.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [OT] Re: Availability of kdb

2000-09-07 Thread George Anzinger

Chris Wedgwood wrote:
> 
> On Wed, Sep 06, 2000 at 12:52:29PM -0700, Linus Torvalds wrote:
> 
> [... words of wisdom removed for brevity ...]
> 
>  I'm a bastard, and proud of it!
> 
> Linus
> 
> Anyone else think copyleft could make a shirt from this?

I like this one better:

"And I'm right.  I'm always right, but in this case I'm just a bit more
right than I usually am." -- Linus Torvalds, Sunday Aug 27, 2000.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Drivers that potentially leave state as TASK_{UN}INTERRUPTIBLE

2000-09-07 Thread George Anzinger

David Woodhouse wrote:
> 
> [EMAIL PROTECTED] said:
> >  So it seems to be a bug at least in terms of timing. Unfortunately I
> > only got about 4 replies to the patches that touched 20+ drivers. I
> > suppose I should just hassle maintainers until they fix it or tell me
> > where I've gone wrong ...
> 
> Actually, I was quite happy calling schedule_timeout in the flash drivers
> without changing current->state. I'm waiting to something to happen, and
> just to be considerate, I'm asking to be put to sleep for the 'expected'
> amount of time for whatever's happening. If there's other stuff on the run
> queue, it won't return immediately, will it? 

It most likely will return immediately.  The only case it would not is
if, while you were running:

A tick happened and some other task in the run queue then had a higher
count of remaining ticks, or
Some event happened to cause the system to want to run some other
task (i.e. need_resched is set).

In no case will the system wait for anything related to the time you
send to schedule_timeout() (which, I think, should be called something
like xxx_sleep()).

>If there isn't other stuff on
> the run queue, I'll just busy wait till the flash chip's finished.

No, you will _almost always_ busy wait.

> 
> Otherwise, it would be TASK_UNINTERRUPTIBLE.

UNINTERRUPTIBLE refers to being open to signals.  A task waiting
UNINTERRUPTIBLE can not be killed (or otherwise signaled) while it is
waiting.  Signals that come in against the task are just queued until
the task allows them to be recognized (by exiting to user land or
explicitly calling signal delivery (which very few code paths do)).  By
waiting INTERRUPTIBLE, the system will wake_up() the task when a signal
is posted against it.  The task still has to respond to it in one of the
above two ways.
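
A sketch of the usual interruptible wait loop this implies (2.4-era
style; "wq" and "condition" are placeholders):

        DECLARE_WAITQUEUE(wait, current);

        add_wait_queue(&wq, &wait);
        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (condition)                  /* the event we wait for  */
                        break;
                if (signal_pending(current))    /* a signal also wakes us */
                        break;                  /* caller can then return -ERESTARTSYS */
                schedule();
        }
        set_current_state(TASK_RUNNING);
        remove_wait_queue(&wq, &wait);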

For what it is worth, I think the system needs a "deliver_signal()"
function just for this.  I think that this function should have:

a.) A "task being killed" call back function, so callers can clean up if
the call is not going to return, but instead is killing the task.  This
would allow drivers to test for being killed as opposed to just being
handed some other relatively unimportant signal and to clean up their
act if needed.

b.) An indication (return value) that tells if a user handler was
called.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch]2.4.0-test6 "spinlock" preemption patch

2000-09-12 Thread George Anzinger

Rik van Riel wrote:
> 
> On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> > On Wed, 6 Sep 2000, George Anzinger wrote:
> >
> > >The times a kernel is not preemptable under this patch are:
> > >
> > >While handling interrupts.
> > >While doing "bottom half" processing.
> > >While holding a spinlock, writelock or readlock.
> > >
> > >At all other times the algorithm allows preemption.
> >
> > So it can deadlock if somebody is doing:
> >
> >   while (test_and_set_bit(0, &something)) {
> >   /* critical section */
> >   mb();
> >   clear_bit(0, &something);
> >   }
> 
> > The above construct it's discouraged of course when you can do
> > the same thing with a spinlock but some place is doing that.
> 
> Hmmm, maybe the Montavista people can volunteer to clean
> up all those places in the kernel code? ;)
> 
> cheers,

Well, I think that is what we are saying.  We are trying to understand
the lay of the land and which way the wind is blowing so that our work
is accepted into the kernel.  Thus, for example, even now both
preemption and rtsched are configuration options that, when not chosen,
give you back the same old kernel (with possibly a more readable debug
option in spinlock.h  and a more reliable exit from entry.S :)

Along these lines, we do want to thank Andrea for pointing out this code
to us.  It is always better to have someone point out the tar pits prior
to our trying to walk across them (and verily, he was never seen again
:)

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: about time-slice

2000-10-20 Thread George Anzinger

Dan Maas wrote:
> 
> > I have a question about the time-slice of linux, how do I know it, or how
> > can I test it?
> 
> First look for the (platform-specific) definition of HZ in
> include/asm/param.h. This is how many timer interrups you get per second (eg
> on i386 it's 100). Then look at include/linux/sched.h for the definition of
> DEF_COUNTER. This is the number of timer interrupts between mandatory
> schedules. By default it's HZ/10, meaning that the time-slice is 100ms (10
> schedules/sec). (of course the interval could be longer if kernel code is
> hogging the CPU; the scheduler won't run until the process leaves the kernel
> or sleeps explicitly...)
> 
> Experts, please correct me if I'm wrong.

Not really an expert, but...  

In the 2.4.0... version the slice time is derived from the task's NICE
value.  Also, in the new system the call sys_sched_rr_get_interval()
returns the value.  (In older systems NICE was also involved in a "not
so clear" way and the call returned nonsense.)  

On the other hand, the system manages these slices in an interesting and
non-intuitive way.  First, the task that has the longest remaining slice
gets (usually) the processor.
Second, when the slice is consumed a new one is not given to the task
until all tasks in the run queue have consumed all of their slices. 
When this happens, all tasks on the system are given new slices.  Tasks
that have some value left will get 1/2 of that value plus the new
slice.  (So a task that is waiting for something... a keystroke... will
accumulate more slice time as it waits.)
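
Roughly, the recalculation described above is (a sketch of the 2.4
recalculate loop):

        for_each_task(p)
                p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);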

Hope this helps.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Full preemption issues

2000-10-27 Thread George Anzinger

Dear Linus,

As you know we at MontaVista are working on a fully preemptable kernel. 
In this work we have come up with a couple of issues we would like your
comments on.

First, as you know, we have added code to the spinlock macros to count
up and down a preemption lock counter.  We would like to not do this if
the macro also turns off local interrupts.  The issue here is that in
some places in the code, spin_lock_irq() or spin_lock_irqsave() is
called but spin_unlock_irq() or spin_lock_irqrestore() is not.  This, of
course, confuses the preemption count.  Attached is a patch that
addresses this issue.  At this time we are not asking you to apply this
patch, but to indicate if we are moving in an acceptable direction.

The second issue revolves around the naming conventions used in the
kernel.  We want to extend this work to include the SMP kernel, but to
do this we need to have several levels of names for the spinlock
macros.  We note that the kernel uses "_" and "__" prefixes in some
macros, but can not, by inspection, figure out when to use these
prefixes.  Could you explain this convention, or is this wisdom written
down somewhere?

To clarify the intent here is a bit of proto code:

#ifdef CONFIG_PREEMPT
#define preempt_lock() ... definition...
#define preempt_unlock() ...definition...
#else
#define preempt_lock()
#define preempt_unlock()
#endif

#ifdef CONFIG_SMP
#define _spin_lock(x) __spin_lock(x)      /* __spin_lock(x) to be today's SMP definition */
#define _spin_unlock(x) __spin_unlock(x)  /* __spin_unlock(x) to be today's SMP definition */
#else
#define _spin_lock()
#define _spin_unlock()
#endif

#define spin_lock(x) do{ preempt_lock(); _spin_lock(x);} while (0)
#define spin_unlock(x) do{ _spin_unlock(x); preempt_unlock();} while (0)
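
For illustration only, preempt_lock()/preempt_unlock() might expand to
something like this on a preemptible kernel (the field and function
names here are hypothetical):

#define preempt_lock()      do { current->preempt_count++; barrier(); } while (0)
#define preempt_unlock()                                                \
        do {                                                            \
                barrier();                                              \
                if (--current->preempt_count == 0 &&                    \
                    current->need_resched)                              \
                        preempt_schedule();                             \
        } while (0)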

George

diff -urP -X patch.exclude linux-2.4.0-test6-kb-p-r-i-6-1.4/drivers/ide/ide.c 
linux/drivers/ide/ide.c
--- linux-2.4.0-test6-kb-p-r-i-6-1.4/drivers/ide/ide.c  Thu Jul 27 16:40:57 2000
+++ linux/drivers/ide/ide.c Fri Oct 20 13:06:45 2000
@@ -1354,7 +1354,7 @@
 */
if (masked_irq && hwif->irq != masked_irq)
disable_irq_nosync(hwif->irq);
-   spin_unlock(&io_request_lock);
+   spin_unlock_noirqrestore(&io_request_lock);
ide__sti(); /* allow other IRQs while we start this request */
startstop = start_request(drive);
spin_lock_irq(&io_request_lock);
@@ -1438,7 +1438,7 @@
 * the handler() function, which means we need to globally
 * mask the specific IRQ:
 */
-   spin_unlock(&io_request_lock);
+   spin_unlock_noirqrestore(&io_request_lock);
hwif  = HWIF(drive);
 #if DISABLE_IRQ_NOSYNC
disable_irq_nosync(hwif->irq);
@@ -1599,7 +1599,7 @@
}
hwgroup->handler = NULL;
del_timer(&hwgroup->timer);
-   spin_unlock(&io_request_lock);
+   spin_unlock_noirqrestore(&io_request_lock);
 
if (drive->unmask)
ide__sti(); /* local CPU only */
--- linux-2.4.0-test6-org/include/linux/spinlock.h  Wed Aug  9 18:57:54 2000
+++ linux/include/linux/spinlock.h  Fri Oct 27 09:48:47 2000
@@ -7,29 +7,107 @@
  * These are the generic versions of the spinlocks and read-write
  * locks..
  */
-#define spin_lock_irqsave(lock, flags)      do { local_irq_save(flags);  spin_lock(lock); } while (0)
-#define spin_lock_irq(lock)                 do { local_irq_disable();    spin_lock(lock); } while (0)
-#define spin_lock_bh(lock)                  do { local_bh_disable();     spin_lock(lock); } while (0)
-
-#define read_lock_irqsave(lock, flags)      do { local_irq_save(flags);  read_lock(lock); } while (0)
-#define read_lock_irq(lock)                 do { local_irq_disable();    read_lock(lock); } while (0)
-#define read_lock_bh(lock)                  do { local_bh_disable();     read_lock(lock); } while (0)
-
-#define write_lock_irqsave(lock, flags)     do { local_irq_save(flags);  write_lock(lock); } while (0)
-#define write_lock_irq(lock)                do { local_irq_disable();    write_lock(lock); } while (0)
-#define write_lock_bh(lock)                 do { local_bh_disable();     write_lock(lock); } while (0)
-
-#define spin_unlock_irqrestore(lock, flags) do { spin_unlock(lock);  local_irq_restore(flags); } while (0)
-#define spin_unlock_irq(lock)               do { spin_unlock(lock);  local_irq_enable();       } while (0)
-#define spin_unlock_bh(lock)                do { spin_unlock(lock);  local_bh_enable();        } while (0)
-
-#define read_unlock_irqrestore(lock, flags) do { read_unlock(lock);  local_irq_restore(flags); } while (0)
-#define read_unlock_irq(lock)               do { read_unlock(lock);  local_irq_enable();       } while (0)

Locking question, is this cool?

2000-10-31 Thread George Anzinger

At line 1073 of ../drivers/char/i2lib.c (2.4.0-test9) we find:

WRITE_LOCK_IRQSAVE(...

this is followed by:

COPY_FROM_USER(...

It seems to me that this could result in a page fault with interrupts
off.  Is this ok?

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Locking question, is this cool?

2000-10-31 Thread George Anzinger

Alan Cox wrote:
> 
> > At line 1073 of ../drivers/char/i2lib.c (2.4.0-test9) we find:
> >
> > WRITE_LOCK_IRQSAVE(...
> >
> > this is followed by:
> >
> > COPY_FROM_USER(...
> >
> > It seems to me that this could result in a page fault with interrupts
> > off.  Is this ok?
> 
> It wont do what you want - it'll re-enable irqs and may then deadlock. It might
> need to copy the buffer to a temporary space then take the lock >
> -
I suspected as much.  I see the same error (bug?) at line 978 of
../drivers/char/riotty.c

Seems like a common problem.
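
A sketch of the fix Alan suggests: copy to a bounce buffer first, then
take the lock (all names here are illustrative):

        char tmp[64];

        if (count > sizeof(tmp))
                return -EINVAL;
        if (copy_from_user(tmp, user_buf, count))
                return -EFAULT;                 /* may fault: irqs still on */
        write_lock_irqsave(&my_lock, flags);
        memcpy(kernel_buf, tmp, count);         /* no faults possible here  */
        write_unlock_irqrestore(&my_lock, flags);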

george
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 1.2.45 Linux Scheduler

2000-11-01 Thread George Anzinger

Anonymous wrote:
> 
> In the Linux scheduler they use a circular queue implementation with round
> robin. What is the advantage of this over just using a normal queue with a
> back and front. Also does anyone know what a test plan for such a design
> would even begin to look like. This is a project for a proposal going around
> in my neighborhood and I am wondering why in the world someone would want to
> modify the Linux scheduler to this extent.
> 
The advantages to the circular bi-directional list are:

1.) You can insert AND remove entries at any point in the list with
simple code that does not have to a) test to see if it is dealing with
an end point or b) know ANYTHING about the rest of the list.
2.) You have access to each end of the queue without searching (great
for RR stuff).
3.) It is easy to get the compiler to do as good a job at insert and
delete as you can do in assembly (see 2.4.0-testX code).

In fact Linux uses this list structure for almost all of its lists, not
just the run list.
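
The structure in question is essentially include/linux/list.h; a short
sketch of why insert and delete need no end-of-list special cases:

struct list_head { struct list_head *next, *prev; };

static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
}

static inline void list_del(struct list_head *entry)
{
        entry->next->prev = entry->prev;        /* works anywhere in the ring */
        entry->prev->next = entry->next;
}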

The problems with the scheduler list management are not so much circular
bi-directional issues as the fact that the actual dispatch priorities
are so dynamic that you (the scheduler) can not predict at list
insertion time the best task dispatch order.  A real-time scheduler with
fixed priorities has a much easier go of it in this regard.  See, for
example, the real-time scheduler at .

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Possible cause of SegFaults

2000-11-02 Thread George Anzinger

Linus,

In doing the full preemption testing we found and fixed this little
segfault window.  Seems that interrupts are left on for the page fault. 
This allows an interrupt prior to fetching the faulting info from CR2. 
Result, illegal memory reference, i.e. segfault. I don't know what
interrupt code might touch CR2, but better safe than sorry.  Of course,
preemption in this window _IS_ a problem.

Attached find a patch to fix the problem.  The patch is on 2.4.0-test9
but should apply to most of 2.4.0 versions.

George

diff -urP -X patch.exclude linux-2.4.0-test9-kb-rts/arch/i386/kernel/traps.c 
linux/arch/i386/kernel/traps.c
--- linux-2.4.0-test9-kb-rts/arch/i386/kernel/traps.c   Mon Oct 30 21:04:29 2000
+++ linux/arch/i386/kernel/traps.c  Wed Nov  1 12:42:40 2000
@@ -1028,7 +1028,7 @@
set_trap_gate(11,&segment_not_present);
set_trap_gate(12,&stack_segment);
set_trap_gate(13,&general_protection);
-   set_trap_gate(14,&page_fault);
+   set_intr_gate(14,&page_fault);
set_trap_gate(15,&spurious_interrupt_bug);
set_trap_gate(16,&coprocessor_error);
set_trap_gate(17,&alignment_check);
diff -urP -X patch.exclude linux-2.4.0-test9-kb-rts/arch/i386/mm/fault.c 
linux/arch/i386/mm/fault.c
--- linux-2.4.0-test9-kb-rts/arch/i386/mm/fault.c   Mon Oct 30 21:04:29 2000
+++ linux/arch/i386/mm/fault.c  Thu Nov  2 09:57:02 2000
@@ -130,7 +130,7 @@
 
/* get the address */
__asm__("movl %%cr2,%0":"=r" (address));
-
+__sti();
tsk = current;
mm = tsk->mm;
info.si_code = SEGV_MAPERR;



Re: Installing kernel 2.4

2000-11-08 Thread George Anzinger

But, here the customer did run the configure code (he said he did not
change anything).  Isn't this where the machine should be diagnosed and
the right options chosen?  Need a way to say it is a cross build, but
that shouldn't be too hard.

My $.02 worth.

George


"James A. Sutherland" wrote:
> 
> On Wed, 08 Nov 2000, Horst von Brand wrote:
> > "Jeff V. Merkey" <[EMAIL PROTECTED]> said:
> >
> > [...]
> >
> > > Your way out in the weeds.  What started this thread was a customer who
> > > ended up loading the wrong arch on a system and hanging.  I have to
> > > post a kernel RPM for our release, and it's onerous to make customers
> > > recompile kernels all the time and be guinea pigs for arch ports.
> >
> > I'd prefer to be a guinea pig for one of 3 or 4 generic kernels distributed
> > in binary than of one of the hundreds of possibilities of patching a kernel
> > together at boot, plus the (presumamby rather complex and fragile)
> > machinery to do so *before* the kernel is booted, thank you very much.
> 
> Hmm... some mechanism for selecting the appropriate *module* might be nice,
> after boot...
> 
> > Plus I'm getting pissed off by how long a boot takes as it stands today...
> 
> Yep: slowing down boottimes is not an attractive idea.
> 
> > > They just want it to boot, and run with the same level of ease of use
> > > and stability they get with NT and NetWare and other stuff they are used
> > > to.   This is an easy choice from where I'm sitting.
> >
> > Easy: i386. Or i486 (I very much doubt your customers run on less, and this
> > should be geneic enough).
> 
> I think there are better options. Jeff could, for example, *optimise* for
> Pentium II/III, without using PII specific instructions, in the main kernel,
> then have multiple target binaries for modules.
> 
> James.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Installing kernel 2.4

2000-11-08 Thread George Anzinger

"James A. Sutherland" wrote:
> 
> On Wed, 08 Nov 2000, George Anzinger wrote:
> > But, here the customer did run the configure code (he said he did not
> > change anything).  Isn't this where the machine should be diagnosed and
> > the right options chosen?  Need a way to say it is a cross build, but
> > that shouldn't be too hard.
> 
> Why default to incompatibility?! If the user explicitly says "I really do want
> a kernel which only works on this specific machine as it is now, and I want it
> to break otherwise", fine. Don't make it a default!

I could go along with this.  The user, however, had the default break,
and, to my knowledge, there are no tools to diagnose the current (or any
other) machine anywhere in the kernel.  Maybe it is time to do such a
tool with exports that the configure programs could use as defaults.  My
thought is that the tool could run independently on the target system
(be it local or otherwise) with the results fed back to configure.

(Oops, corollary to the rule that "The squeaking wheel gets the grease."
is "S/he who complains most about the squeaking gets to do the
greasing."  I better keep quiet :)

> 
> BTW: Has anyone benchmarked the different optimizations - i.e. how much
> difference does optimizing for a Pentium make when running on a PII? More to
> the point, how about optimizing non-exclusively for a Pentium, so the code
> still runs on earlier CPUs?
> 
> James.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: fpu now a must in kernel

2000-11-09 Thread George Anzinger

Andi Kleen wrote:
> 
> On Thu, Nov 09, 2000 at 12:27:29PM +1300, david wrote:
> >
> > 2 . put the save / restore code in my code (NOT! GOOD! i do not wont to
> > do it this way it is not the right way)
> 
> It is the right way because it only penalizes your code, not everybody else.
> 
This is a MAJOR drag on preemptability.  MUCH better to keep it out of
the kernel.  Barring that, since context switch does not (and should
not) save/restore fp state, the using code must be preemption locked. 
Sound folks won't like this.  

Maybe you could explain why you think you need this and the community
here could suggest an alternative way to do the same or better.

By the way, since the kernel is not yet preemptable, you could use empty
macros to lock preemption.  This way, when preemption comes (2.5) your
code will be easily found.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: getting a process name from task struct

2000-11-09 Thread George Anzinger

Chris Swiedler wrote:
> 
> Is it possible to get a process's name / full execution path (from
> kernelspace) given only a task struct? I can't find any pointers to this
> information in the task struct, and I don't know where else it might be. ps
> seems to be able to get the process name, but that's from userspace.
> Apologies in advance if this is a stupid question.
> 
> chris
> 
Try the "comm" member of task_struct.  (Clear name, right?)

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Where is it written?

2000-11-10 Thread George Anzinger

I thought this would be simple, but...

Could someone point me at the info on calling conventions to be used
with
x86 processors.  I need this to write asm code correctly and I suspect
that it is a bit more formal than the various comments I have found in
the sources.  Is it, perhaps an Intel doc?  Or a gcc thing?
 
George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Patch generation

2000-11-10 Thread George Anzinger

Dan Aloni wrote:
> 
> On Thu, 9 Nov 2000, Ivan Passos wrote:
> 
> > Where in the src tree can I find (or what is) the command to generate a
> > patch file from two Linux kernel src trees, one being the original and the
> > other being the newly changed one??
> 
> The syntex looks like this one:
> 
> diff -urN old_tree new_tree > your_patch_file
> 
> > I've tried 'diff -ruN', but that does diff's on several files that could
> > stay out of the comparison (such as the files in include/config, .files,
> > etc.).
> 
> You can use the --exclude switch of diff, or make mrproper before you
> diff, or you can cp -al a clean source tree before you build the kernel
> on top of it.
> 
> Another way, is to use *one* source tree, copying the files you change -
> adding them the '.orig' extention to their name.
> 
> Then you run this script (I got it when Riel pasted it on IRC)
> 
> for i in `find ./ -name \*.orig` ; do diff -u $i `dirname $i`/`basename $i
> .orig` ; done
> 
> About the other method: cp -al is fast, creating a copy of tree without
> taking much diskspace, it copies the tree by hard linking the files.
> 
> BTW, 'patch' unlinks files before modifing so you can have lots of kernel
> trees from different releases with little diskspace waste:
> 
> [karrde@callisto ~/usr/src/kernel/work]$ ls -1
> linux-2.4.0-test10
> linux-2.4.0-test10.build
> linux-2.4.0-test11-pre1
> linux-2.4.0-test6
> 
> I did 'cp -al linux-2.4.0-test10 linux-2.4.0-test10.build', and on
> linux-2.4.0-test10.build I did 'make bzImage' and all the rest.
> 
> When Linus releasd test11-pre1 I did 'cp -al' from test10 to test11-pre1
> and patched the test11-pre1 dir with the patch Linus released. the test10
> dir remained intact.
> 
> [karrde@callisto ~/usr/src/kernel/work]$ du . -s
> 193004  .
> 
> 4 kernel trees, one after make dep ; make bzImage, and all taking together
> just 193MB, instead of about 400MB... hard links, gotta love'em.

Ok, this is cool, but suppose I have the same file linked to all these
and want to change it in all the trees, i.e. still have one file.  Is
there an editor that doesn't unlink?  Or maybe cp the edited file?
How would you do this?  (I prefer EMACS, which likes to unlink.)

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: test11-pre2 compile error undefined reference to `bust_spinlocks' WHAT?!

2000-11-10 Thread George Anzinger

The notion of releasing a spin lock by initializing it seems IMHO, on
the face of it, way off.  Firstly, the protected area is no longer
protected, which could lead to undefined errors/crashes, and secondly,
any future use of spinlocks to control preemption could have a lot of
trouble with this, principally because the locker is unknown.

In the case at hand, it would seem that an unlocked path to the console
is a more correct answer that gives the system a far better chance of
actually remaining viable.

George

Keith Owens wrote:
> 
> On Fri, 10 Nov 2000 00:32:49 -0500,
> John Kacur <[EMAIL PROTECTED]> wrote:
> >When attempting to compile test11-pre2, I get the following compile
> >error.
> >
> >arch/i386/mm/mm.o: In function `do_page_fault':
> >arch/i386/mm/mm.o(.text+0x781): undefined reference to `bust_spinlocks'
> >make: *** [vmlinux] Error 1
> 
> Oops, wrong patch.
> 
> Index: 0-test11-pre2.1/arch/i386/kernel/traps.c
> --- 0-test11-pre2.1/arch/i386/kernel/traps.c Fri, 10 Nov 2000 13:10:37 +1100 kaos 
>(linux-2.4/A/c/1_traps.c 1.1.2.2.1.1.2.1.2.3.1.2.3.1.1.2 644)
> +++ 0-test11-pre2.1(w)/arch/i386/kernel/traps.c Fri, 10 Nov 2000 16:06:48 +1100 kaos 
>(linux-2.4/A/c/1_traps.c 1.1.2.2.1.1.2.1.2.3.1.2.3.1.1.2 644)
> @@ -382,6 +382,18 @@ static void unknown_nmi_error(unsigned c
> printk("Do you have a strange power saving mode enabled?\n");
>  }
> 
> +extern spinlock_t console_lock, timerlist_lock;
> +/*
> + * Unlock any spinlocks which will prevent us from getting the
> + * message out (timerlist_lock is acquired through the
> + * console unblank code)
> + */
> +void bust_spinlocks(void)
> +{
> +   spin_lock_init(&console_lock);
> +   spin_lock_init(&timerlist_lock);
> +}
> +
>  #if CONFIG_X86_IO_APIC
> 
>  int nmi_watchdog = 1;
> @@ -394,19 +406,7 @@ static int __init setup_nmi_watchdog(cha
> 
>  __setup("nmi_watchdog=", setup_nmi_watchdog);
> 
> -extern spinlock_t console_lock, timerlist_lock;
>  static spinlock_t nmi_print_lock = SPIN_LOCK_UNLOCKED;
> -
> -/*
> - * Unlock any spinlocks which will prevent us from getting the
> - * message out (timerlist_lock is aquired through the
> - * console unblank code)
> - */
> -void bust_spinlocks(void)
> -{
> -   spin_lock_init(&console_lock);
> -   spin_lock_init(&timerlist_lock);
> -}
> 
>  inline void nmi_watchdog_tick(struct pt_regs * regs)
>  {
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Patch generation

2000-11-10 Thread George Anzinger

Dan Aloni wrote:
> 
> On Fri, 10 Nov 2000, George Anzinger wrote:
> 
> > > 4 kernel trees, one after make dep ; make bzImage, and all taking together
> > > just 193MB, instead of about 400MB... hard links, gotta love'em.
> >
> > Ok, this is cool, but suppose I have the same file linked to all these
> > and want to change it in all the trees, i.e. still have one file.  Is
> > there an editor that doesn't unlink.  Or maybe cp of the edited file??
> > How would you do this?  (I prefer EMACS, which likes to unlink.)
> 
> I know mcedit doesn't unlink (but mcedit kinda sucks), I think nedit
> doesn't unlink too.
> 
> I prefer an editor that unlinks, since in most cases I don't want to
> modify the source trees that I'm not working on, so diff can do what it's
> supposed to do later.

Oh, I agree, but I am working on several things at once so my
development trees are cascaded, usually with a kgdb patch in all of
them.  If I make a change to kgdb, for example, it would be nice to only
have to change it once, so occasionally, I want to do it differently.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Lock ordering, inquiring minds want to know.

2000-12-07 Thread george anzinger

In looking over sched.c I find:

spin_lock_irq(&runqueue_lock);
read_lock(&tasklist_lock);


This seems to me to be the wrong order of things.  The read lock can be
unavailable (someone holds a write lock) for relatively long periods of
time; for example, wait holds it in a while loop.  On the other hand the
runqueue_lock, being an "irq" lock, will always be held for short periods
of time.  It would seem better to wait for the runqueue lock while
holding the read_lock with interrupts on than to wait for the
read_lock with interrupts off.  As near as I can tell this is the only
place in the system where both of these locks are held (of course, in all
cases of two locks being held at the same time, both lockers must use the
same order).  So...
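
Put concretely, the ordering being argued for would look like (a sketch
only, not a patch):

        read_lock(&tasklist_lock);              /* may wait, irqs still on */
        spin_lock_irq(&runqueue_lock);          /* short hold, irqs off    */
        ...
        spin_unlock_irq(&runqueue_lock);
        read_unlock(&tasklist_lock);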


What am I missing here? 

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Lock ordering, inquiring minds want to know.

2000-12-08 Thread george anzinger

Mike Kravetz wrote:
> 
> George,
> 
> I can't answer your question.  However, have  you noticed that this
> lock ordering has changed in the test11 kernel.  The new sequence is:
> 
> read_lock_irq(&tasklist_lock);
> spin_lock(&runqueue_lock);
> 
> Perhaps the person who made this change could provide their reasoning.
> 
> An additional question I have is:  Is it really necessary to hold
> the runqueue lock (with interrupts disabled) for as long as we do
> in this routine (setscheduler())?  I suspect we only need the
> tasklist_lock while calling find_process_by_pid().  Isn't it
> possible to do the error checking (parameter validation) with just
> the tasklist_lock held?  Seems that we would only need to acquire
> the runqueue_lock (and disable interrupts) if we are in fact
> changing the task's scheduling policy.

Yes, I think this is true.  The runqueue_lock should only be needed
after the error checks.  Still, the error checks don't take all that
long...

George
> -
> Mike
> 
> On Thu, Dec 07, 2000 at 03:07:18PM -0800, george anzinger wrote:
> > In looking over sched.c I find:
> >
> >   spin_lock_irq(&runqueue_lock);
> >   read_lock(&tasklist_lock);
> >
> >
> > This seems to me to be the wrong order of things.  The read lock
> > unavailable (some one holds a write lock) for relatively long periods of
> > time, for example, wait holds it in a while loop.  On the other hand the
> > runqueue_lock, being a "irq" lock will always be held for short periods
> > of time.  It would seem better to wait for the runqueue lock while
> > holding the read_lock with the interrupts on than to wait for the
> > read_lock with interrupts off.  As near as I can tell this is the only
> > place in the system that both of these locks are held (of course, all
> > cases of two locks being held at the same time, both locker must use the
> > same order).  So...
> >
> >
> > What am I missing here?
> >
> > George
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: lock_kernel() / unlock_kernel inconsistency Don't do this!

2000-12-15 Thread george anzinger

Jason Wohlgemuth wrote:
> 
> In an effort to stay consistent with the community, I migrated some code
> to a driver to use the daemonize() routine in the function specified by
> the kernel_thread() call.
> 
> However, in looking at a few drivers in the system (drivers/usb/hub.c ,
> drivers/md/md.c, drivers/media/video/msp3400.c), I noticed some
> inconsistencies.  Specifically with the use of lock_kernel() /
> unlock_kernel().
> 
> drivers/md/md.c looks like:
> int md_thread(void * arg)
> {
>md_lock_kernel();
> 
>daemonize();
>.
>.
>.
>//md_unlock_kernel();
> }
> 
> this is similiar to drivers/usb/hub.c (which doesn't call unlock_kernel
> following lock_kernel)
> 
> however drivers/media/video/msp3400.c looks like:
> static int msp3400c_thread(void *data)
> {
>.
>.
>.
> #ifdef CONFIG_SMP
>lock_kernel();
> #endif
>daemonize();
>.
>.
>.
> #ifdef CONFIG_SMP
>unlock_kernel();
> #endif
> }
> 
> The latter example seems logically correct to me.  Does this imply that
> after the CPU that is responsible for starting the thread in md.c or
> hub.c claims the global lock it will never be released to any other CPU?
> 
> If I am incorrect here please just point out my error, however, I
> figured I would bring this to the mailing list's attention if in fact
> this is truely in error.

Both of these methods have problems, especially with the proposed
preemption changes.  The first case causes the thread to run with the
BKL held the whole time.  This means that any other task that wants the
BKL will be blocked.  Surely the needed protections don't require this. 
These locks should be replaced with fine-grained locking and, once taken,
they should be released ASAP.

The second practice will not provide the needed protection in a
preemptable UP system.  The BKL on a UP is just a NOP anyway.  On the
other hand we want to use these lock points to disable preemption. 
Letting the defining code for the lock decide the SMP/UP issue allows
the preemption code to do the right thing.  This said, still, the BKL
should go away, see above.
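
A sketch of the narrower pattern being suggested (hold the BKL only
around the setup that needs it):

int my_thread(void *arg)
{
        lock_kernel();
        daemonize();
        unlock_kernel();

        for (;;) {
                /* per-iteration work, protected by fine-grained locks */
        }
        return 0;
}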

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: UP 2.2.18 makes kernels 3% faster than UP 2.4.0-test12

2000-12-15 Thread george anzinger

Russell King wrote:
> 
> Rogier Wolff writes:
> > Alan Cox wrote:
> > > What better interactivity ;)
> > Thus to me, 2.4 FEELS much less interactive. When I move windows they
> > don't follow the mouse in real-time.
> 
> Interesting observation: in a scrolling rxvt, kernel 2.0 is smoother than
> 2.2, which is smoother than 2.4.  I hope this trend isn't going to
> continue to 2.6. ;(

Could this be due to the shorter times calculated by the scheduler
recalculate code with the change that moved "nice" into the task_struct? 

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: lock_kernel() / unlock_kernel inconsistency Don't do this!

2000-12-15 Thread george anzinger

Alan Cox wrote:
> 
> > Both of these methods have problems, especially with the proposed
> > preemptions changes.  The first case causes the thread to run with the
> > BKL for the whole time.  This means that any other task that wants the
> > BKL will be blocked.  Surly the needed protections don't require this.
> 
> The BKL is dropped on rescheduling of that task. Its an enforcement of the
> old unix guarantees against other code making the same assumptions. Its also
> the standard 2.4 locking for several things still
> 
Yes, I am aware of the drop on schedule, but a preemptive schedule call
should not (and can not) do this.  Result: no preemption, i.e. the thread
does not let anyone else in.  Somehow I don't think a long-term hold such
as this is needed.  Of course, if the code blocks (i.e. calls
schedule()) often... but then we find folks using such code as a pattern
and learning tool.  Remember, this thread was started by just such a
study.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [prepatch] 2.4 waitqueues

2000-12-27 Thread george anzinger

Andrew Morton wrote:
> 
> It's been quiet around here lately...
> 
> This is a rework of the 2.4 wakeup code based on the discussions Andrea
> and I had last week.  There were two basic problems:
> 
> - If two tasks are on a waitqueue in exclusive mode and one gets
>   woken, it will put itself back into TASK_[UN]INTERRUPTIBLE state for a
>   few instructions and can soak up another wakeup which should have
>   gone to the other task.
> 
> - If a task is on two waitqueues and two CPUs simultaneously run a
>   wakeup, one on each waitqueue, they can both try to wake the same
>   task which may be a lost wakeup, depending upon how the waitqueue
>   users are coded.
> 
> The first problem is the most serious.  The second is kinda baroque...
> 
> The approach taken by this patch is the one which Andrea is proposing
> for 2.2: if a task was already on the runqueue, continue scanning for
> another exclusive task to wake up.
> 
> It ended up getting complicated because of the "find a process affine
> to this CPU" thing.  Plus I did go slightly berzerk, but I believe the
> result is pretty good.
> 
> - wake_up_process() now returns a success value if it managed to move
>   something to the runqueue.
> 
>   Tidied up the code here a bit as well.
> 
>   wake_up_process_synchronous() is no more.
> 
> - Got rid of all the debugging ifdefs - these have been folded into
>   wait.h
> 
> - Removed all the USE_RW_WAIT_QUEUE_SPINLOCK code and just used
>   spinlocks.
> 
>   The read_lock option was pretty questionable anyway.  It hasn't had
>   the widespread testing and, umm, the kernel is using wq_write_lock
>   *everywhere* anyway, so enabling USE_RW_WAIT_QUEUE_SPINLOCK wouldn't
>   change anything, except for using a more expensive spinlock!
> 
>   So it's gone.
> 
> - Introduces a kernel-wide macro `SMP_KERNEL'.  This is designed to
>   be used as a `compiled ifdef' in place of `#ifdef CONFIG_SMP'.  There
>   are a few examples in __wake_up_common().
> 
>   People shouldn't go wild with this, because gcc's dead code
>   elimination isn't perfect.  But it's nice for little things.
> 
> - This patch's _use_ of SMP_KERNEL in __wake_up_common is fairly
>   significant.  There was quite a lot of code in that function which
>   was an unnecessary burden for UP systems.  All gone now.
> 
> - This patch shrinks sched.o by 100 bytes (SMP) and 300 bytes (UP).
>   Note that try_to_wake_up() is now only expanded in a single place
>   in __wake_up_common().  It has a large footprint.
> 
> - I have finally had enough of printk() deadlocking or going
>   infinitely mutually recursive on me so printk()'s wake_up(log_wait)
>   call has been moved into a tq_timer callback.
> 
> - SLEEP_ON_VAR, SLEEP_ON_HEAD and SLEEP_ON_TAIL have been changed.  I
>   see no valid reason why these functions were, effectively, doing
>   this:
> 
> spin_lock_irqsave(lock, flags);
> spin_unlock(lock);
> schedule();
> spin_lock(lock);
> spin_unlock_irqrestore(lock, flags);
> 
>   What's the point in saving the interrupt status in `flags'? If the
>   caller _wants_ interrupt status preserved then the caller is buggy,
>   because schedule() enables interrupts.  2.2 does the same thing.
> 
>   So this has been changed to:
> 
> spin_lock_irq(lock);
> spin_unlock(lock);
> schedule();
> spin_lock(lock);
> spin_unlock_irq(lock);
> 
>   Or did I miss something?
> 

Um, well, here is a consideration.  For preemption work it is desirable
to not burden the xxx_lock_irq(y) code with preemption locks, as the irq
effectively does this for far less cost.  The problem is code like this
that does not pair the xxx_lock_irq(y) with a xxx_unlock_irq(y).  The
worst offenders are bits of code that do something like:

spin_lock_irq(y);
  :
sti();
  :
spin_unlock(y);

(Yes, this does happen!)

I suspect that most of this has good reason behind it, but for the
preemptive effort it would be nice to introduce a set of macros like:

xxx_unlock_noirq(), which would acknowledge that the lock used irq but
the unlock does not.  And for the above case:

xxx_lock_noirq(), which would acknowledge that the irq "lock" is already
held.
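
Roughly like this (sketch only -- nothing of the sort exists in the
tree; the names, and the assumption that plain spin_lock()/spin_unlock()
maintain a preemption count while the _irq variants rely on disabled
interrupts instead, are the preemption patch's, not mainline's;
__raw_spin_lock()/__raw_spin_unlock() are hypothetical stand-ins for
"just the spinlock, no preemption bookkeeping"):

/* Take a lock while interrupts are already disabled: the irq "lock" is
 * already providing the preemption protection, so don't count it twice. */
#define spin_lock_noirq(lock)		__raw_spin_lock(lock)

/* Release a lock that was taken with spin_lock_irq() but whose caller
 * restores interrupts separately (e.g. the sti() above): no preemption
 * count was taken at lock time, so none is dropped here. */
#define spin_unlock_noirq(lock)		__raw_spin_unlock(lock)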

Oh, and my reading of sched.c seems to indicate that interrupts are on
at schedule() return (or at least, related to the task being switched out
and not the new task) so the final spin_unlock_irq(lock) above should
not be changing the interrupt state.  (This argument is lost in the
driver issue that Andrea points out, of course.)

Or did I miss something?

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Preemption exit code

2000-12-29 Thread george anzinger

As you know we at MontaVista are working on a preemptable kernel that
takes advantage of the spin_lock() macros.  One of the "tricks" we use
is to bump a preemption counter on a spin_lock() and to decrement it on
spin_unlock().  The question, here posed, has to do with the test that
needs to be done on the decrement.  If the result is zero AND
need_resched is set we want to set the preemption count to one (to avoid
an interrupt race to schedule()) and call schedule().  This test and set
needs to be atomic (again to avoid the interrupt race to schedule()).  I
have been thinking of putting the preemption count and the need_resched
flag in shorts with a long union to combine them into one word.  Leaving
the endian problem aside, this would then allow me to use the cmpxchg
instruction to test and set the required bit.  Thus the spin_unlock()
would generate something like:

preempt_count--;
if (__cmpxchg(&current->resched_preempt,
              RESCHED_ONLY,
              RESCHED_PREEMPT,
              4) == RESCHED_ONLY) {
        do_call_schedule();
}

Where __cmpxchg() is from .../include/asm/system.h and
do_call_schedule() would decrement the preemption count on return
(actually it would return to the decrement above).  (Note that I am
using C here but the actual code would most likely be in asm.)
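
For reference, a sketch of the combined word described above (the union
and field names are only illustrative; RESCHED_ONLY/RESCHED_PREEMPT
match the pseudo code, and a little-endian i386 layout with need_resched
in the low half is assumed):

/* Sketch only: need_resched and the preemption count folded into one
 * 32-bit word so cmpxchg can test and set them atomically with respect
 * to an interrupt on this cpu. */
union resched_preempt {
	struct {
		unsigned short need_resched;	/* set by wakeup/timer code          */
		unsigned short preempt_count;	/* bumped/dropped by spin_lock/unlock */
	} f;
	unsigned long whole;			/* the operand handed to cmpxchg     */
};

#define RESCHED_ONLY	0x00000001UL	/* need_resched set, count == 0 */
#define RESCHED_PREEMPT	0x00010001UL	/* need_resched set, count == 1 */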

And then I found the following on the l-k list today:

>Subject:  Re: test13-pre5
>Date:  Thu, 28 Dec 2000 15:15:01 -0800 (PST)
>From:Linus Torvalds <[EMAIL PROTECTED]>
>
   snip
>FreeBSD doesn't try to be portable any more, but Linux does, and there 
>are architectures where 8- and 16-bit accesses aren't atomic but have to be 
>done with read-modify-write cycles.

>And even for fields like "age", where we don't care whether the age 
>itself is 100% accurate, we _do_ care that the fields close-by don't get 
>strange effects from updating "age". We used to have exactly this problem on
>alpha back in the 2.1.x timeframe.

>This is why a lot of fields are 32-bit, even though we wouldn't need 
>more than 8 or 16 bits of them.
   snip

So, what is recommended here?

Other considerations:  

We would like to not have to find and modify all accesses to
need_resched.  Currently it is set to one in several places and tested
for non-zero in quite a few more places.  Combining the two flags in one
word would change established usage, but would solve the problem.

The above exit code is fast and tight, being a dec, a cmpxchg, and a
conditional jump (inline) (in asm we can use the Z-flag that the cmpxchg
sets to eliminate the compare used above).  Note that the atomic
requirement is with respect to an interrupt, not another cpu, so the
lock modifier is not needed.  If another cpu sets need_resched, it will
also send an interrupt for us, so we can safely ignore it here.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



procfs info

2000-10-05 Thread George Anzinger

Where is the internal interface to procfs documented?

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: procfs docs...

2000-10-05 Thread George Anzinger

Great start.  To complete it we need some info on the read/write
interface.  Is it the same as for other drivers?  I have heard that read
is called once after returning an EOF.  Is this so?  I suppose there are
other interfaces also, e.g. ioctl, etc.

George

Jeff Garzik wrote:
> 
> On Thu, 5 Oct 2000, George Anzinger wrote:
> > Where is the internal interface to procfs documented?
> 
> There is no documentation for the -exported- procfs interface as far as
> I know.  As for internal interfaces, who knows what you are asking...
> 
> Here's a rough outline:  (maybe somebody should clean this up and stick
> it into Documentation/*)
> 
> * Drivers without MAJOR /proc interfaces should stick their procfs
> files/directories into /proc/driver/*
> 
> * Use proc_mkdir to create directories.  For symlinks, proc_symlink, for
> device nodes, proc_mknod.  Note that only proc_mknod takes a permission
> (mode_t) argument.  If you need special permissions on directories, use
> create_proc_entry with S_IFDIR in mode_t arg.  Otherwise directories
> will be mode 0755.
> 
> * Use create_proc_read_entry for your procfs "files."  For anything more
> complex than simply reading, use create_proc_entry.  If you pass '0' for
> mode_t, it will have mode 0644 (ie. normal file permissions).
> 
> * Use remove_proc_entry for removing entries.
> 
> * Pass NULL for the parent dir, if you are based off of /proc root.
> 
> * You don't need to keep around pointers to your procfs directories and
> files.  Just call remove_proc_entry with the correct (full) path,
> relative, to procfs root, and the right thing will happen.
> 
> Cheesy init example:
> 
> if (!proc_mkdir("driver/my_driver", NULL))
> /* error */
> if (!create_proc_read_entry("driver/my_driver/foo", 0, NULL,
> foo_read_proc, NULL))
> /* error */
> if (!create_proc_read_entry("driver/my_driver/bar", 0, NULL,
> bar_read_proc, NULL))
> /* error */
> 
> Cheesy remove example:
> 
> remove_proc_entry ("driver/my_driver/bar", NULL);
> remove_proc_entry ("driver/my_driver/foo", NULL);
> remove_proc_entry ("driver/my_driver", NULL);
> 
> In the above examples, I'm pretty sure that the proc_mkdir call,
> and final remove_proc_entry, can be skipped, too
> 
> Jeff
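
On the read side asked about above: in the 2.4 procfs interface a read
handler such as the foo_read_proc named in the example has this shape
(minimal sketch; the output is made up):

static int foo_read_proc(char *page, char **start, off_t off,
                         int count, int *eof, void *data)
{
	int len;

	/* Format everything into the single page procfs hands us. */
	len = sprintf(page, "foo is alive\n");

	/* Small output: say we are done so a further read() returns 0 (EOF). */
	*eof = 1;
	return len;
}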
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: static scheduling - SCHED_IDLE?

2001-03-09 Thread george anzinger

Rik van Riel wrote:
> 
> On Thu, 8 Mar 2001, Boris Dragovic wrote:
> 
> > > Of course. Now we just need the code to determine when a task
> > > is holding some kernel-side lock  ;)
> >
> > couldn't it just be indicated on actual locking the resource?
> 
> It could, but I doubt we would want this overhead on the locking...
> 
> Rik

Seems like you are sneaking up on priority inheritance mutexes.  The
locking overhead is not so bad (same as a spinlock, except in the UP
case, where it is the same as the SMP case).  The unlock is, however,
the same as the lock overhead.  It is hard to beat the store of zero
which is the spin_unlock.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: nanosleep question

2001-03-09 Thread george anzinger

Michael Reinelt wrote:
> 
> Hi,
> 
> I've got a question regarding the nanosleep() system call.
> 
> I'm writing a little tool called lcd4linux
> (http://lcd4linux.sourceforge.net), where I have to drive displays
> connected to the parallel port. I'm doing this in userland, using
> outb().
> 
> Some of this displays require quite short delays (e.g. 40 microseconds),
> which cannot be done with normal nanosleep() because of the 10 msec
> timer resolution.
> 
> At the moment I implemented by own delay loop using a small assembler
> loop similar to the one used in the kernel. This has two disadvantages:
> assembler isn't that portable, and the loop has to be calibrated.

Why not use C?  As long as you calibrate it, it should do just fine.  On
the other hand, since you are looping anyway, why not loop on a system
time of day call and have the loop exit when you have the required time
in hand.  These calls have microsecond resolution.
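
A minimal userland sketch of that suggestion (busy-waiting on
gettimeofday() instead of a calibrated assembler loop):

#include <sys/time.h>

/* Busy-wait for roughly usec microseconds using the system clock.
 * Sketch only: it burns CPU for the whole delay, exactly as a
 * calibrated loop would, but needs no calibration and no assembler. */
static void udelay_busy(long usec)
{
	struct timeval start, now;
	long elapsed;

	gettimeofday(&start, NULL);
	do {
		gettimeofday(&now, NULL);
		elapsed = (now.tv_sec - start.tv_sec) * 1000000L
			+ (now.tv_usec - start.tv_usec);
	} while (elapsed < usec);
}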
> 
> I took a look at the nanosleep() implementation in the kernel, and found
> that it is possible to get very small delays, but only if I set the
> scheduling type to SCHED_RR or SCHED_FIFO.
> 
> Here are my questions:
> 
> - why are small delays only possible up to 2 msec? what if I needed a
> delay of say 5msec? I can't get it?

The system does these delays by looping in the task.  I.e. NO one else
gets to use the time.  I, for one, would like to see this go away.  See
: http://sourceforge.net/projects/high-res-timers/
In any case the limit is to put "some" bound on the "time out" from
doing useful work.

If you want other times, you can always make more than one call to
nanosleep.
> 
> - how dangerous is it to run a process with SCHED_RR? As far as I
> understood the nanosleep man page, it _is_ dangerous (if the process
> gets stuck in an endless loop, you can't even kill it if you don't have
> a shell which has a higher static priority than the stuck process
> itself).

That is the nature of real time.  You could code your task to ensure
that the parent process was of higher priority.  This, of course,
assumes that you can continue to communicate with that task via, e.g., X,
which is usually running SCHED_OTHER.  In other words, to keep control,
you need to have all the tasks in the communication loop at higher (or
at least equal, for SCHED_RR) priority than the bad guy.
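
For example, the controlling (parent) task could put itself one notch
above the worker before starting it -- a sketch, error handling omitted:

#include <sched.h>

/* Give the calling (parent) task a SCHED_RR priority above the one the
 * worker will use, so it can still run to kill a runaway worker. */
static int raise_parent_priority(int worker_prio)
{
	struct sched_param sp;

	sp.sched_priority = worker_prio + 1;
	return sched_setscheduler(0, SCHED_RR, &sp);	/* 0 == this process */
}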
> 
> - is it possible to switch between different scheduling modes? I cound
> run the program with normal SCHED_OTHER, and switch to SCHED_RR whenever
> I need to write data to the parallel port? Does this make sense?

Depends.  It is certainly possible.  The question is: Can your task
stand the loss of the processor to another task?  This is what happens
at normal SCHED_OTHER priority.
> 
> - what's the reason why these small delays is not possible with
> SCHED_OTHER?

Just a guess, but I would say because SCHED_OTHER tasks should not be so
time critical.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: nanosleep question

2001-03-10 Thread george anzinger

Michael Reinelt wrote:
> 
> george anzinger wrote:
> >
> > Michael Reinelt wrote:
> > >
> > > At the moment I implemented by own delay loop using a small assembler
> > > loop similar to the one used in the kernel. This has two disadvantages:
> > > assembler isn't that portable, and the loop has to be calibrated.
> >
> > Why not use C?  As long as you calibrate it, it should do just fine.
> Because the compiler might optimize it away.

Not if you use volatile on the data type.
> 
> > On
> > the other hand, since you are looping anyway, why not loop on a system
> > time of day call and have the loop exit when you have the required time
> > in hand.  These calls have microsecond resolution.
> I'm afraid they don't (at least with kernel 2.0, I didn't try this with
> 2.4). 

Gosh, I started with 2.2.14 and it does full microsecond resolution.

> They have microsecond resolution, but increment only every 1/HZ.
> 
> Someone gave me a hint to loop on rdtsc. I will look into this.

This ticks at 1/"cpu MHz", which can be found by: "cat /proc/cpuinfo"
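
A sketch of the rdtsc idea for i386 with gcc (cpu_mhz would come from
the "cpu MHz" line of /proc/cpuinfo; parsing it is left out):

/* Read the time stamp counter (i386, gcc inline asm). */
static inline unsigned long long rdtsc(void)
{
	unsigned long long t;

	__asm__ __volatile__("rdtsc" : "=A" (t));
	return t;
}

/* Busy-wait for usec microseconds, given the CPU clock in MHz. */
static void udelay_tsc(long usec, unsigned long cpu_mhz)
{
	unsigned long long end = rdtsc() + (unsigned long long)usec * cpu_mhz;

	while (rdtsc() < end)
		;	/* spin */
}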
> 
> > > - why are small delays only possible up to 2 msec? what if I needed a
> > > delay of say 5msec? I can't get it?
> >
> > If you want other times, you can always make more than one call to
> > nanosleep.
> Good point!

~snip~

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-11 Thread george anzinger

Keith Owens wrote:
> 
> On Sun, 11 Mar 2001 08:44:24 +0100 (CET),
> Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >Andrew,
> >
> >your patch looks too complex, and doesnt cover the case of the serial
> >driver deadlocking. Why not add a "touch_nmi_watchdog_counter()" function
> >that just changes last_irq_sums instead of adding locking? This way
> >deadlocks will be caught in the serial code too. (because touch_nmi() will
> >only "postpone" the NMI watchdog lockup event, not disable it.)
> 
> kdb has to completely disable the nmi counter while it is in control.
> All interrupts are disabled, all but one cpus are spinning, the control
> cpu does busy wait while it polls the input devices.  With that model
> there is no alternative to a complete disable.
> 
Consider this: why not use the NMI to sync the cpus?  Kdb would have a
function that is called on each NMI.  If it is doing nothing, just return
false; else, if waiting for this cpu, well here it is, put it in a spin
AFTER saving where it came from so the operator can figure out what it
is doing.  In kgdb I just put the interrupt registers in the task_struct
where they are put when a context switch is done.  Then the debugger can
do a trace, etc. on that task.  A global var that the debugger can see
is also set to the cpu's "current".  

If the cpu is already spinning, return to the nmi code with a true flag
which will cause it to ignore the nmi.  Same thing if it is the cpu that
is doing debug i/o.

I went to this for kgdb after the system failed to return from the call
to force the other cpus to execute a function (which means they have to
be alive).  For extra safety I also time the sync.  If one or more
expected cpus don't show up while looping reading the cycle counter, the
code just continues without the sync.
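
A sketch of that timed sync (illustrative only, not the actual kgdb
code; get_cycles() is the kernel's cycle-counter read):

#include <asm/timex.h>		/* get_cycles() */
#include <asm/atomic.h>

/* Spin until all expected cpus have checked in, or give up after
 * timeout_cycles and carry on without the sync. */
static int wait_for_cpus(atomic_t *cpus_in, int expected, cycles_t timeout_cycles)
{
	cycles_t start = get_cycles();

	while (atomic_read(cpus_in) < expected)
		if (get_cycles() - start > timeout_cycles)
			return 0;	/* continue without the stragglers */
	return 1;
}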

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-12 Thread george anzinger

Keith Owens wrote:
> 
> On Sun, 11 Mar 2001 20:43:16 -0800,
> george anzinger <[EMAIL PROTECTED]> wrote:
> >Consider this.  Why not use the NMI to sync the cpus.  Kdb would have a
> >function that is called each NMI.
> 
> kdb uses NMI IPI to get the other cpu's attention.  One cpu is in
> control and may or may not be accepting NMI, it depends on the event
> that entered kdb.  The other cpus end up in kdb code, spinning waiting
> for a cpu switch.  Initially they are not receiving NMI because they
> were invoked via NMI which is masked until they exit.  However if the
> user does a cpu switch then single steps the interrupted code, the cpu
> has to return from the NMI handler to the interrupted code at which
> time this cpu starts receiving NMI again.

Are you actually twiddling the hardware, or just changing what happens
on NMI?
> 
> The kdb context can change from ignoring NMI to accepting NMI.  It is
> easier to bring all the cpus into kdb and let the kdb code decide if it
> ignores any NMI that is being received.

Yes. Exactly.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: system call for process information?

2001-03-14 Thread george anzinger

Rik van Riel wrote:
> 
> On Wed, 14 Mar 2001, Martin Dalecki wrote:
> 
> > Not the embedded folks!!! The server folks laugh histerically all
> > times they go via ssh to a trashing busy box to see what's wrong and
> > then they see top or ps auxe under linux never finishing they job:
> 
> That's a separate issue.
> 
> I guess the pagefault path should have _2_ locks.
> 
> One mmap_sem protecting read-only access to the address space
> and another one for write access to the adress space (to stop
> races with swapout, other page faults, ...).
> 
> At the point where the pagefault sleeps on IO, it could release
> the read-only lock, so vmstat, top, etc can get the statistics
> they need. Only during the time the pagefaulting code is actually
> messing with the address space could it block read access (to
> prevent others from seeing an inconsistent state).
> 
Is it REALLY necessary to prevent them from seeing an inconsistent
state?  Seems to me that in the total picture (i.e. system wide) they
will never see a consistent state, so why be concerned with a small
corner of the system?  Let them figure it out, possibly by consistency
checks, if they care.  It just seems unhealthy to demand consistency at
the cost of delays that will only make other data even more
inconsistent.  And if the delay is _forever_ from a tool that may be used
to diagnose system problems...  I would rather have a tool that repeatedly
showed the same inconsistent state than one that hangs because it cannot
get a consistent one.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: system call for process information?

2001-03-14 Thread george anzinger

Rik van Riel wrote:
> 
> On Wed, 14 Mar 2001, george anzinger wrote:
> 
> > Is it REALLY necessary to prevent them from seeing an
> > inconsistent state?  Seems to me that in the total picture (i.e.
> > system wide) they will never see a consistent state, so why be
> > concerned with a small corner of the system.
> 
> You're right. All we need to make sure of is that the address
> space we want to print info about doesn't go away while we're
> reading the stats ...
> 
> (I think ... but we'll need to look at the procfs code in more
> detail)
> 
For what it's worth:
On the last system I worked on we had a status program that maintained a
screen with interesting things such as context switches per sec, disc
i/o/sec, lan traffic/sec, ready queue length, next task (printed as
current task) and... well, a whole 26X80 screen full of stuff.  The
program gathered all the data by reading system tables as quickly as
possible and THEN did the formatting/screen update.  Having to deal
with preformatted data would have a.) widened the capture window and
b.) been a real drag to reformat and move to the right screen location. 
We allowed programs that had the savvy to have read-only access to the
kernel area to make this as fast as possible.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Who did the time list insert code?

2001-03-17 Thread george anzinger

At https://high-res-timers.sourceforge.net we are trying to define a
high resolution timer patch for linux (please join us if you are
interested).  We would like to know who wrote the time list management
code that is currently in the kernel.

Or

Any help on any studies done on the nature of the timer list.  The code
seems to indicate that most entries are in the first 2.56 seconds from
NOW.  Has this been verified?  Are there other hidden issues we should
know about?

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [CHECKER] blocking w/ spinlock or interrupt's disabled

2001-03-20 Thread george anzinger

Dawson Engler wrote:
> 
> > Is it difficult to split it into "interrupts disabled" and "spin lock
> > held"?
> 
Is it difficult to test for matching spinlock pairs such as
spin_lock_irq/spin_unlock_irq?  Sometimes a spin_lock_irq is followed by
a spin_unlock and a separate interrupt re-enable.  This sort of usage,
while not strictly wrong, does make it hard to use the spin_lock/unlock
macros to do preemption.  This said, pairing information would be very
helpful.  Note, there are several flavors here, not just the one I
cited.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH for 2.5] preemptible kernel

2001-03-20 Thread george anzinger

Nigel Gamble wrote:
> 
> On Tue, 20 Mar 2001, Roger Larsson wrote:
> > One little readability thing I found.
> > The prev->state TASK_ value is mostly used as a plain value
> > but the new TASK_PREEMPTED is or:ed together with whatever was there.
> > Later when we switch to check the state it is checked against TASK_PREEMPTED
> > only. Since TASK_RUNNING is 0 it works OK but...
> 
> Yes, you're right.  I had forgotten that TASK_RUNNING is 0 and I think I
> was assuming that there could be (rare) cases where a task was preempted
> while prev->state was in transition such that no other flags were set.
> This is, of course, impossible given that TASK_RUNNING is 0.  So your
> change makes the common case more obvious (to me, at least!)
> 
> > --- sched.c.nigel   Tue Mar 20 18:52:43 2001
> > +++ sched.c.roger   Tue Mar 20 19:03:28 2001
> > @@ -553,7 +553,7 @@
> >  #endif
> > del_from_runqueue(prev);
> >  #ifdef CONFIG_PREEMPT
> > -   case TASK_PREEMPTED:
> > +   case TASK_RUNNING | TASK_PREEMPTED:
> >  #endif
> > case TASK_RUNNING:
> > }
> >
> >
> > We could add all/(other common) combinations as cases
> >
> >   switch (prev->state) {
> >   case TASK_INTERRUPTIBLE:
> >   if (signal_pending(prev)) {
> >   prev->state = TASK_RUNNING;
> >   break;
> >   }
> >   default:
> > #ifdef CONFIG_PREEMPT
> >   if (prev->state & TASK_PREEMPTED)
> >   break;
> > #endif
> >   del_from_runqueue(prev);
> > #ifdef CONFIG_PREEMPT
> >   case TASK_RUNNING   | TASK_PREEMPTED:
> >   case TASK_INTERRUPTIBLE | TASK_PREEMPTED:
> >   case TASK_UNINTERRUPTIBLE   | TASK_PREEMPTED:
> > #endif
> >   case TASK_RUNNING:
> >   }
> >
> >
> > Then the break in default case could almost be replaced with a BUG()...
> > (I have not checked the generated code)
> 
> The other cases are not very common, as they only happen if a task is
> preempted during the short time that it is running while in the process
> of changing state while going to sleep or waking up, so the default case
> is probably OK for them; and I'd be happier to leave the default case
> for reliability reasons anyway.

Especially since he forgot:

TASK_ZOMBIE
TASK_STOPPED
TASK_SWAPPING

I don't know about the last two but TASK_ZOMBIE must be handled
correctly or the task will never clear.

In general, a task must run till it gets to schedule() before the actual
state is "real", hence the need for TASK_PREEMPTED.  

The actual code generated with what you propose should be the same (even
if TASK_RUNNING != 0, except for the constant).

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH for 2.5] preemptible kernel

2001-03-21 Thread george anzinger

Nigel Gamble wrote:
> 
> On Wed, 21 Mar 2001, Keith Owens wrote:
> > I misread the code, but the idea is still correct.  Add a preemption
> > depth counter to each cpu, when you schedule and the depth is zero then
> > you know that the cpu is no longer holding any references to quiesced
> > structures.
> 
> A task that has been preempted is on the run queue and can be
> rescheduled on a different CPU, so I can't see how a per-CPU counter
> would work.  It seems to me that you would need a per run queue
> counter, like the example I gave in a previous posting.

Exactly so.  The method does not depend on the sum of preemption being
zip, but on each potential reader (writers take locks) passing through a
"sync point".  Your notion of waiting for each task to arrive
"naturally" at schedule() would work.  It is, in fact, overkill, as you
could also add arrival at sys call exit as a (the) "sync point".  In
fact, for module unload, isn't this the real "sync point"?  After all, a
module can call schedule, or did I miss a usage counter somewhere?

By the way, there is a paper on this somewhere on the web.  Anyone
remember where?

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: current->need_reshed, can it be a global flag ?

2001-03-22 Thread george anzinger

Parity Error wrote:
> 
> instead of need_reshed being a per-task flag, could it be
> as a global flag ?, since every time current->need_reshed
> is checked, schedule() is just called to pick another
> process.
> 
> ---
But for which cpu?  Really this is a shortcut to provide a per-cpu area
that I think works very well, thank you.

Putting it in a real cpu data area would make access slower.  The
"current" pointer is either very quickly computed or preloaded in a
register (depends on the platform) so it is about as fast as it can get
as it is.

Also, the flag is often checked by selective preemption code in the
kernel.  Even more often by the full preemption patch.

Nuf said
George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] Prevent OOM from killing init

2001-03-23 Thread george anzinger

What happens if you just make swap VERY large?  Does the system thrash
itself to a virtual standstill?  Is this a possible answer?  Supposedly
you could then sneak in and blow away the bad guys manually ...

George

Paul Jakma wrote:
> 
> On Fri, 23 Mar 2001, Szabolcs Szakacsits wrote:
> 
> > About the "use resource limits!". Yes, this is one solution. The
> > *expensive* solution (admin time, worse resource utilization, etc).
> 
> traditional user limits have worse resource utilisation? think what
> kind of utilisation a guaranteed allocation system would have. instead
> of 128MB, you'd need maybe a GB of RAM and many many GB of swap for
> most systems.
> 
> some hopefully non-ranting points:
> 
> - setting up limits on a RH system takes 1 minute by editing
> /etc/security/limits.conf.
> 
> - Rik's current oom killer may not do a good job now, but it's
> impossible for it to do a /perfect/ job without implementing
> kernel/esp.c.
> 
> - with limits set you will have:
>  - /possible/ underutilisation on some workloads.
>  - chance of hitting Rik's OOM killer reduced to almost nothing.
> 
> no matter how good or bad Rik's killer is, i'd much rather set limits
> and just about /never/ have it invoked.
> 
> more beancounting will make limits more useful (eg global?) and maybe
> dists can start setting up some kind of limits by default at install
> time based on the RAM installed and whether user selected
> server/workstation/etc.. install.
> 
> Then hopefully we can be a little less concerned about how close Rik
> gets to the impossible task of implementing esp.c.
> 
> > Szaka
> 
> --paulj
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH for 2.5] preemptible kernel

2001-03-28 Thread george anzinger

Dipankar Sarma wrote:
> 
> Nigel Gamble wrote:
> >
> > On Wed, 21 Mar 2001, Keith Owens wrote:
> > > I misread the code, but the idea is still correct.  Add a preemption
> > > depth counter to each cpu, when you schedule and the depth is zero then
> > > you know that the cpu is no longer holding any references to quiesced
> > > structures.
> >
> > A task that has been preempted is on the run queue and can be
> > rescheduled on a different CPU, so I can't see how a per-CPU counter
> > would work.  It seems to me that you would need a per run queue
> > counter, like the example I gave in a previous posting.
> 
> Also, a task could be preempted and then rescheduled on the same cpu
> making
> the depth counter 0 (right ?), but it could still be holding references
> to data
> structures to be updated using synchronize_kernel(). There seems to be
> two
> approaches to tackle preemption -
> 
> 1. Disable pre-emption during the time when references to data
> structures
> updated using such Two-phase updates are held.

Doesn't this fly in the face of the whole Two-phase system?  It seems to
me that the point was to not require any locks.  Preemption disable IS a
lock.  Not as strong as some, but a lock nonetheless.
> 
> Pros: easy to implement using a flag (ctx_sw_off() ?)
> Cons: not so easy to use since critical sections need to be clearly
> identified and interfaces defined. also affects preemptive behavior.
> 
> 2. In synchronize_kernel(), distinguish between "natural" and preemptive
> schedules() and ignore preemptive ones.
> 
> Pros: easy to use
> Cons: Not so easy to implement. Also a low priority task that keeps
> getting
> preempted often can affect update side performance significantly.

Actually it is fairly easy to distinguish the two (see TASK_PREEMPTED in
state).  Don't you also have to have some sort of task flag that
indicates that the task is one that needs to sync?  Something that gets
set when it enters the area of interest and cleared when it hits the
sync point?  

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH for 2.5] preemptible kernel

2001-03-31 Thread george anzinger

Rusty Russell wrote:
> 
> In message <[EMAIL PROTECTED]> you write:
> > Here is an attempt at a possible version of synchronize_kernel() that
> > should work on a preemptible kernel.  I haven't tested it yet.
> 
> It's close, but...
> 
> Those who suggest that we don't do preemtion on SMP make this much
> easier (synchronize_kernel() is a NOP on UP), and I'm starting to
> agree with them.  Anyway:
> 
> >   if (p->state == TASK_RUNNING ||
> >   (p->state == (TASK_RUNNING|TASK_PREEMPTED))) {
> >   p->flags |= PF_SYNCING;
> 
> Setting a running task's flags brings races, AFAICT, and checking
> p->state is NOT sufficient, consider wait_event(): you need p->has_cpu
> here I think.  You could do it for TASK_PREEMPTED only, but you'd have
> to do the "unreal priority" part of synchronize_kernel() with some
> method to say "don't preempt anyone", but it will hurt latency.
> Hmmm...
> 
> The only way I can see is to have a new element in "struct
> task_struct" saying "syncing now", which is protected by the runqueue
> lock.  This looks like (and I prefer wait queues, they have such nice
> helpers):
> 
> static DECLARE_WAIT_QUEUE_HEAD(syncing_task);
> static DECLARE_MUTEX(synchronize_kernel_mtx);
> static int sync_count = 0;
> 
> schedule():
> if (!(prev->state & TASK_PREEMPTED) && prev->syncing)
> if (--sync_count == 0) wake_up(&syncing_task);
> 
> synchronize_kernel():
> {
> struct list_head *tmp;
> struct task_struct *p;
> 
> /* Guard against multiple calls to this function */
> down(&synchronize_kernel_mtx);
> 
> /* Everyone running now or currently preempted must
>voluntarily schedule before we know we are safe. */
> spin_lock_irq(&runqueue_lock);
> list_for_each(tmp, &runqueue_head) {
> p = list_entry(tmp, struct task_struct, run_list);
> if (p->has_cpu || p->state == (TASK_RUNNING|TASK_PREEMPTED)) {
I think this should be:
  if (p->has_cpu || (p->state & TASK_PREEMPTED)) {
to catch tasks that were preempted with other states.  The LSE Multi
Queue scheduler folks are going to love this.

George

> p->syncing = 1;
> sync_count++;
> }
> }
> spin_unlock_irq(&runqueue_lock);
> 
> /* Wait for them all */
> wait_event(syncing_task, sync_count == 0);
> up(&synchronize_kernel_mtx);
> }
> 
> Also untested 8),
> Rusty.
> --
> Premature optmztion is rt of all evl. --DK
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH for 2.5] preemptible kernel

2001-04-02 Thread george anzinger

Nigel Gamble wrote:
> 
> On Sat, 31 Mar 2001, george anzinger wrote:
> > I think this should be:
> > if (p->has_cpu || p->state & TASK_PREEMPTED)) {
> > to catch tasks that were preempted with other states.
> 
> But the other states are all part of the state change that happens at a
> non-preemtive schedule() point, aren't they, so those tasks are already
> safe to access the data we are protecting.
> 
If you're saying that the task's "thinking" about a state change is
sufficient, ok.  The point is that a task changes it state prior to
calling schedule() and then, sometimes, doesn't call schedule and just
changes its state back to running.  Preemption can happen at any of
these times, after all that is what the TASK_PREEMPTED flag is used for.

On a higher level, I think the scanning of the run list to set flags and
counters is a bit off.  If these things need to be counted and kept
track of, the tasks should be doing it "in the moment" not some other
task at some distant time.  For example if what is being protected is a
data structure, a counter could be put in the structure that keeps track
of the number of tasks that are interested in seeing it stay around.  As
I understand the objective of the method being explored, a writer wants
to change the structure, but old readers can continue to use the old
while new readers get the new structure.  The issue then is when to
"garbage collect" the no longer used structures.  It seems to me that
the pointer to the structure can be changed to point to the new one when
the writer has it set up.  Old users continue to use the old.  When they
are done, they decrement the use count.  When the use count goes to
zero, it is time to "garbage collect".  At this time, the "garbage man"
is called (one simple one would be to check if the structure is still
the one a "new" task would get).  Various methods exist for determining
how and if the "garbage man" should be called, but this sort of thing,
IMNSHO, does NOT belong in schedule().
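
A sketch of that use-count idea (purely illustrative, not the
synchronize_kernel() scheme under discussion and not existing kernel
code; the spinlock keeps the count and the current pointer consistent):

#include <linux/spinlock.h>
#include <linux/slab.h>

struct my_data {
	int users;		/* readers currently using this version */
	/* ... payload ... */
};

static struct my_data *cur_data;	/* the version new readers get */
static spinlock_t my_data_lock = SPIN_LOCK_UNLOCKED;

static struct my_data *my_data_get(void)
{
	struct my_data *p;

	spin_lock(&my_data_lock);
	p = cur_data;
	p->users++;
	spin_unlock(&my_data_lock);
	return p;
}

static void my_data_put(struct my_data *p)
{
	int collect;

	spin_lock(&my_data_lock);
	/* Garbage collect only old versions nobody uses any more. */
	collect = (--p->users == 0 && p != cur_data);
	spin_unlock(&my_data_lock);
	if (collect)
		kfree(p);
}

/* The writer installs a new version; old readers keep the old one. */
static void my_data_replace(struct my_data *new)
{
	struct my_data *old;
	int collect;

	spin_lock(&my_data_lock);
	old = cur_data;
	cur_data = new;
	collect = (old && old->users == 0);
	spin_unlock(&my_data_lock);
	if (collect)
		kfree(old);
}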

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [linux-audio-dev] low-latency scheduling patch for 2.4.0

2001-01-30 Thread george anzinger

Joe deBlaquiere wrote:

~snip~

> The logical answer is run with HZ=10000 so you get 100us intervals, 
> right ;o). 

Let's not assume we need the overhead of HZ=10000 to get 100us 
alarm/timer resolution.  How about a timer that ticks when we need the 
next tick...

> On systems with multiple hardware timers you could kick off a 
> single event at 200us, couldn't you? I've done that before with the 
> extra timer assigned exclusively to a resource. 

With the right hardware resource, one high res counter can give you all 
the various tick resolutions you need. BTDT on HPRT.

George

> It's not a giant time 
> slice, but at least you feel like you're allowing something to happen, 
> right?
> 
>> 
>> -- 
>> dwmw2

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] Maintainers list update: linux-net -> netdev

2005-04-12 Thread George Anzinger
Horms wrote:
On Sat, Apr 09, 2005 at 03:52:05PM +0200, Jörn Engel wrote:
On Fri, 8 April 2005 22:16:07 +0200, Pavel Machek wrote:
More importantly, it is still listed as "the list" for network
drivers...
NETWORK DEVICE DRIVERS
P:  Andrew Morton
M:  [EMAIL PROTECTED]
P:  Jeff Garzik
M:  [EMAIL PROTECTED]
L:  linux-net@vger.kernel.org
S:  Maintained
Maybe one of the two maintainers might want to change that? ;)

Use netdev as the mailing list contact instead of the mostly dead
linux-net list.
~
 PHRAM MTD DRIVER
@@ -1795,7 +1795,7 @@
 POSIX CLOCKS and TIMERS
 P:	George Anzinger
 M:	george@mvista.com
-L:	linux-net@vger.kernel.org
+L:	netdev@oss.sgi.com
 S:	Supported
 
I don't really know about the rest of them, but I think this should be:
L: linux-kernel@vger.kernel.org
Leastwise, that is where I look...
~
--
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Maintainers list update: linux-net -> netdev

2005-04-13 Thread George Anzinger
Horms wrote:
On Tue, Apr 12, 2005 at 12:14:56PM -0700, George Anzinger wrote:
Horms wrote:
Use netdev as the mailing list contact instead of the mostly dead
linux-net list.
~
PHRAM MTD DRIVER
@@ -1795,7 +1795,7 @@
POSIX CLOCKS and TIMERS
P:  George Anzinger
M:  george@mvista.com
-L: linux-net@vger.kernel.org
+L: netdev@oss.sgi.com
S:  Supported
I don't really know about the rest of them, but I think this should be:
L: linux-kernel@vger.kernel.org
Least wise that is where I look...

Yes, I was wondering about that one. Here is a patch that
adds to my previous patch. Trivial to say the least. 
I can re-diff the whole thing if that is more convenient.
Looks good to me.

--
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] i386: Selectable Frequency of the Timer Interrupt

2005-07-11 Thread George Anzinger

Martin J. Bligh wrote:

Lots of people have switched from 2.4 to 2.6 (100 Hz to 1000 Hz) with no impact 
in
stability, AFAIK. (I only remember some weird warning about HZ with debian 
woody's
ps).



Yes, that's called "progress" so no one complained.  Going back is
called a "regression".  People don't like those as much.



That's a very subjective viewpoint. Realize that this is a balancing
act between latency and overhead ... and you're firmly only looking
at one side of the argument, instead of taking a compromise in the
middle ...

If I start arguing for 100HZ on the grounds that it's much more efficient,
will that make 250/300 look much better to you? ;-)


I would like to interject an additional data point, and I will NOT be subjective. 
 The nature of the PIT is that it can _hit_ some frequencies better than 
others.  We have had complaints about repeating timers not keeping good time. 
These are not jitter issues, but drift issues.  The standard says we may not 
return early from a timer so any timer will either be on time or late.  The 
amount of lateness depends very much on the HZ value.  Here is what the values 
are for the standard CLOCK_TICK_RATE:


HZ    TICK RATE   jiffie(ns)    second(ns)    error (ppbillion)
 100    1193180     10000000    1000000000           0
 200    1193180      5000098    1000019600       19600
 250    1193180      4000250    1000062500       62500
 500    1193180      1999703    1001851203     1851203
1000    1193180       999848    1000847848      847848

The jiffie values here are exactly what the kernel uses and are based on the 
best one can do with the PIT hardware.


I am not suggesting any given default HZ, but rather an augmentation of the 
help text that goes with it.  For those who want timers to repeat at one second 
(or multiples thereof) this is useful info.


For your enjoyment I have attached the program used to print this.  It allows you 
to try additional values...



--
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/


#define NSEC_PER_SEC  1000000000
//#define CLOCK_TICK_RATE /*1 */ 1193180
#define LATCH(CLOCK_TICK_RATE,HZ)  ((CLOCK_TICK_RATE + HZ/2) / HZ)
#define SH_DIV(NOM,DEN,LSH) (	((NOM / DEN) << LSH)			\
			 + (((NOM % DEN) << LSH) + DEN / 2) / DEN)
#define ACTHZ (SH_DIV (CLOCK_TICK_RATE, LATCH(CLOCK_TICK_RATE,HZ), 8))
#define TICK_NSEC (SH_DIV (1000000UL * 1000, ACTHZ, 8))


struct {
	int hz;
	int clocktickrate;
} vals[] = {{100, 1193180}, {200, 1193180}, {250, 1193180}, {500, 1193180}, {1000, 1193180},{0,0}};

void do_it(int hz,int tickrate)
{
	int HZ = hz;
	int CLOCK_TICK_RATE = tickrate;
	int tick_nsec = TICK_NSEC;
	int ticks_per_sec = NSEC_PER_SEC/tick_nsec;
	int sec_size = ticks_per_sec * tick_nsec;
	int one_sec_p;
	int err;

	if (sec_size < NSEC_PER_SEC)
		sec_size += tick_nsec;
	one_sec_p = sec_size;
	err = one_sec_p - NSEC_PER_SEC;
	printf( "%4d\t%8d\t%8d\t%10d\t%8d\n",hz, tickrate, tick_nsec, 
		one_sec_p, err);
}
	
void bail(void)
{
	printf("run as: as [hz [clock_tick_rate]]\n");
	exit(1);
}

main(int argc, char** argv)
{
	int i = 0;
	int phz = 0;
	int pcr = vals[0].clocktickrate;

	if (argc > 1) { 
		phz = atoi(argv[1]);
		if (!phz)
			bail();
	}
	if (argc > 2) {
		pcr = atoi(argv[2]);
		if (!pcr)
			bail();
	}

	printf("HZ  \tTICK RATE\tjiffie(ns)\tsecond(ns)\t error (ppbillion)\n");
	while(vals[i].hz) {
		do_it(vals[i].hz, vals[i].clocktickrate);
		i++;
	}
	if (phz)
		do_it(phz, pcr);
}


Re: [PATCH] i386: Selectable Frequency of the Timer Interrupt

2005-07-12 Thread George Anzinger

Con Kolivas wrote:

On Tue, 12 Jul 2005 22:39, Con Kolivas wrote:


On Tue, 12 Jul 2005 22:10, Vojtech Pavlik wrote:


The PIT crystal runs at 14.3181818 MHz (CGA dotclock, found on ISA, ...)
and is divided by 12 to get PIT tick rate

14.3181818 MHz / 12 = 1193182 Hz


Yes, but the current code uses 1193180.  Wonder why that is...



The reality is that the crystal is usually off by 50-100 ppm from the
standard value, depending on temperature.

   HZ   ticks/jiffie    1 second      error (ppm)
---
  100      11932      1.000015238      15.2
  200       5966      1.000015238      15.2
  250       4773      1.000057143      57.1
  300       3977      0.999931429     -68.6
  333       3583      0.999964114     -35.9
  500       2386      0.999847619    -152.4
 1000       1193      0.999847619    -152.4


If we are following the standard and trying to set up a timer, the 1 second time 
MUST be >= 1 second.  Thus the values for 300 and above in this table don't fly. 
 If we are trying to keep system time, well we do just fine at that by using 
the actual value of the jiffie (NOT 1/HZ) when we update time (one of the 
reasons for going to nanoseconds in xtime).  The observable thing the user sees 
is best seen by setting up an itimer to repeat every second.  Then you will see 
the drift AND it will be against the system clock which itself is quite accurate 
(the 50-100ppm you mention), even without ntp.  And the error really is in the 
range of 848ppm for HZ=1000 BECAUSE we need to follow the standard.  You can 
easily see this with the current 2.6 kernel.  We even have a bug report on it:


http://bugzilla.kernel.org/show_bug.cgi?id=3289
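
A sketch of that observation -- a one-second repeating ITIMER_REAL
compared against the system clock, so the per-second lateness (about
848us per second with HZ=1000, per the table above) shows up as
accumulating drift:

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

static volatile sig_atomic_t ticks;

static void on_alarm(int sig)
{
	ticks++;
}

int main(void)
{
	struct itimerval it;
	struct timeval tv;

	signal(SIGALRM, on_alarm);
	it.it_interval.tv_sec = 1;  it.it_interval.tv_usec = 0;
	it.it_value = it.it_interval;
	setitimer(ITIMER_REAL, &it, NULL);

	for (;;) {
		pause();
		gettimeofday(&tv, NULL);
		/* Each expiration lands a bit later than the previous one
		 * plus one second, and the drift accumulates. */
		printf("tick %d at %ld.%06ld\n", (int)ticks,
		       (long)tv.tv_sec, (long)tv.tv_usec);
	}
}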
~
--
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

