from:"Roman Zippel"

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-01-30 Thread Roman Zippel

Hi,

On Tue, 29 Jan 2008, john stultz wrote:

> +/* Because using NSEC_PER_SEC would be too easy */
> +#define NTP_INTERVAL_LENGTH ((((s64)TICK_USEC*NSEC_PER_USEC*USER_HZ)+CLOCK_TICK_ADJUST)/NTP_INTERVAL_FREQ)

Why are you using USER_HZ? Did you test this with HZ!=100?
Anyway, please don't make it more complicated than it already is.
What I said previously about the update interval is still valid, so the 
correct solution is to use the simpler NTP_INTERVAL_LENGTH calculation 
from my last mail and to omit the correction for NO_HZ.

bye, Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-01-30 Thread Roman Zippel

Hi,

On Wed, 30 Jan 2008, john stultz wrote:

> My concern is we state the accumulation interval is X ns long. Then
> current_tick_length() is to return (X + ntp_adjustment), so each
> accumulation interval we can keep track of the error and adjust our
> interval length.
> 
> So if ntp_update_frequency() sets tick_length_base to be:
> 
>   u64 second_length = (u64)(tick_usec * NSEC_PER_USEC * USER_HZ)
>   << TICK_LENGTH_SHIFT;
>   second_length += (s64)CLOCK_TICK_ADJUST << TICK_LENGTH_SHIFT;
>   second_length += (s64)time_freq
>   << (TICK_LENGTH_SHIFT - SHIFT_NSEC);
> 
>   tick_length_base = second_length;
>   do_div(tick_length_base, NTP_INTERVAL_FREQ);
> 
> 
> The above is basically (X + part of ntp_adjustment)

CLOCK_TICK_ADJUST is based on LATCH and HZ, if the update frequency isn't 
based on HZ, there is no point in using it!

Let's look at what actually needs to be done:

1. initializing clock interval:

clock_cycle_interval = timer_cycle_interval * clock_frequency / 
timer_frequency

It's simply about converting timer cycles into clock cycles, so they're 
about the same time interval.
We already make it a bit more complicated than necessary as we go via 
nsec:

ntp_interval = timer_cycle_interval * 10^9nsec / timer_frequency

and in clocksource_calculate_interval() basically:

clock_cycle_interval = ntp_interval * clock_frequency / 10^9nsec

Without a fixed timer tick it's actually even easier, then we use the same 
frequency for clock and timer and the cycle interval is simply:

clock_cycle_interval = timer_cycle_interval = clock_frequency / 
NTP_INTERVAL_FREQ

There is no need to use the adjustment here, you'll only cause a mismatch 
between the clock and timer cycle interval, which had to be corrected by 
NTP.
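
The conversion described in point 1 can be sketched in a few lines of C. This is only an illustration of the arithmetic, not kernel code: the function names are made up, and the example frequencies (a PIT at 1193182 Hz programmed with 1193 cycles, an acpi_pm clock at 3579545 Hz) are assumptions chosen to match numbers used elsewhere in this thread.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Length of one timer interval in nsec:
 * ntp_interval = timer_cycle_interval * 10^9nsec / timer_frequency */
static int64_t ntp_interval_ns(int64_t timer_cycle_interval,
			       int64_t timer_frequency)
{
	return timer_cycle_interval * NSEC_PER_SEC / timer_frequency;
}

/* The same interval expressed in clock cycles:
 * clock_cycle_interval = ntp_interval * clock_frequency / 10^9nsec */
static int64_t clock_cycle_interval(int64_t ntp_interval,
				    int64_t clock_frequency)
{
	return ntp_interval * clock_frequency / NSEC_PER_SEC;
}

/* Without a fixed timer tick the clock acts as its own timer:
 * clock_cycle_interval = clock_frequency / NTP_INTERVAL_FREQ */
static int64_t nohz_cycle_interval(int64_t clock_frequency,
				   int64_t ntp_interval_freq)
{
	return clock_frequency / ntp_interval_freq;
}
```

With these assumed inputs, ntp_interval_ns(1193, 1193182) gives 999847 nsec and clock_cycle_interval(999847, 3579545) gives 3578 cycles. No adjustment term appears anywhere in the conversion.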

2. initializing clock adjustment:

clock_adjust = timer_cycle_interval * NTP_INTERVAL_FREQ / 
timer_frequency - 1sec

This adjustment is used to make up for the fact that the timer 
frequency isn't evenly divisible by HZ, so that the clock is advanced by 
1sec after timer_frequency cycles.

Like above the clock frequency is used for the timer frequency for this 
calculation for CONFIG_NO_HZ, so it would be incorrect to use 
CLOCK_TICK_RATE/LATCH/HZ here and since NTP_INTERVAL_FREQ is quite small 
the resulting adjustment would be rather small, it's easier not to bother 
in this case.
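
As a rough check of the formula in point 2, the sketch below computes the adjustment in nsec per second. The function name is hypothetical. The difference is computed in cycles first and then scaled, which is algebraically the same as the formula above but avoids losing a nanosecond to integer truncation; the PIT values are assumptions taken from later mails in this thread.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* clock_adjust = timer_cycle_interval * NTP_INTERVAL_FREQ / timer_frequency
 *                - 1sec, expressed in nsec per second. */
static int64_t clock_adjust_ns(int64_t timer_cycle_interval,
			       int64_t ntp_interval_freq,
			       int64_t timer_frequency)
{
	/* cycles that get counted as one second */
	int64_t counted = timer_cycle_interval * ntp_interval_freq;

	return (counted - timer_frequency) * NSEC_PER_SEC / timer_frequency;
}
```

For a PIT at 1193182 Hz with 1193 cycles per tick and 1000 ticks per second this yields -152533 nsec, and with 11932 cycles per tick and 100 ticks per second it yields 15085 nsec, i.e. about 15ppm.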

What you're basically trying to do is add an error to the clock 
initialization, so that we can later compensate for it. The correct 
solution is really to not add the error in the first place, so that there is 
no need to compensate for it.

bye. Roman

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-02-12 Thread Roman Zippel

Hi,

On Mon, 11 Feb 2008, john stultz wrote:

> > I don't want to just send a patch, I want you to understand why your 
> > approach is wrong.
> 
> With all due respect, it also keeps the critique in one direction and
> makes your review less collaborative and more confrontational than I
> suspect (or maybe just hope) you intend.

I don't think that's necessarily a contradiction, if we keep it to 
confronting the problem. A simple patch wouldn't have provided any further 
understanding of the problem compared to what I already said. You would 
have seen what the patch does (which I described already differently), but 
not really why it does that.

In this sense I prefer to force the confrontation of the problem. I'm 
afraid a working patch would encourage simply ignoring the problem, as 
your problem at hand would be solved without completely understanding it.
The point is that I'd really like you to understand the problem, so I'm 
not the only one who understands this code :) and in the end it might 
allow better collaboration to further improve this code.

To make it very clear this is just about understanding the problem, I 
don't want to force a specific solution (which a patch would practically 
do). If we both understand the problem, we can also discuss the solution 
and maybe we find something better, but maybe I'm also totally wrong, 
which would be a little embarrassing :), but that would be fine too. There 
may be better ways to go about this problem, but IMO it would still be 
better than just ignoring the problem and forcing it with a patch.

> This fine grained error accounting is where the bug I'm trying to
> address is cropping up from. In order to have the comparison we need to
> have two values:
>  A: The clocksource's notion of how long the fixed interval is.
>  B: NTP's notion of how long the fixed interval is.
> 
> When no NTP adjustment is being made, these two values should be equal,
> but currently they are not. This is what causes the 280ppm error seen on
> my test system.
> 
> Part A is calculated in the following fashion:
>   #define NTP_INTERVAL_LENGTH (NSEC_PER_SEC/NTP_INTERVAL_FREQ)
> 
>   Which is then functionally shifted up by TICK_LENGTH_SHIFT, but for the
> course of this discussion, let's ignore that.
> 
> Part B is calculated in ntp_update_frequency() as:
> 
>   u64 second_length = (u64)(tick_usec * NSEC_PER_USEC * USER_HZ)
>   << TICK_LENGTH_SHIFT;
>   second_length += (s64)CLOCK_TICK_ADJUST << TICK_LENGTH_SHIFT;
>   second_length += (s64)time_freq << (TICK_LENGTH_SHIFT - SHIFT_NSEC);
> 
>   tick_length_base = second_length;
>   do_div(tick_length_base, NTP_INTERVAL_FREQ);
> 
> 
> If we're assuming there is no NTP adjustment, and avoiding the
> TICK_LENGTH_SHIFT, this can be shortened to:
>   B = ((TICK_USEC * NSEC_PER_USEC * USER_HZ)
>   + CLOCK_TICK_ADJUST)/NTP_INTERVAL_FREQ
> 
> 
> The A vs B comparison can be shortened further to:
>   NSEC_PER_SEC != (TICK_USEC * NSEC_PER_USEC * USER_HZ)
>   + CLOCK_TICK_ADJUST
> 
> So now on to what seems to be the point of contention:
>   If A != B, which side do we fix?
> 
> 
> My patches fix the A side so that it matches B, which on its face isn't
> terribly complicated, but you seem to be suggesting we fix the B side
> instead (Now I'm assuming here, because there's no patch. So I can only
> speak to your emails, which were not clear to me).

If we go from your base assumption above "there is no NTP adjustment", I 
would actually agree with you and it wouldn't matter much on which side 
to correct the equation.

The question is now what happens, if there are NTP adjustments, i.e. when 
the time_freq value isn't zero. Based on this initialization we tell the 
NTP daemon the base frequency, although not directly but it knows the 
length freq1 == 1 sec. If the clock now needs adjustment, the NTP daemon 
tells the kernel via time_freq how to change the speed so that freq1 == 
1 sec + time_freq.

The problem is now that by using CLOCK_TICK_ADJUST we are cheating and we 
don't tell the NTP daemon the real frequency. We define 1 sec as freq2 + 
tick_adj and with the NTP adjustment we have freq2 + tick_adj == 1 sec 
+ time_freq. The above initialization now calculates the base time length for 
an update interval of freq2 / NTP_INTERVAL_FREQ, this means the requested 
time_freq adjustment isn't applied to (freq2 + tick_adj) cycles but to 
freq2 cycles, so this finally means any adjustment made by the NTP daemon 
is slightly off.

To demonstrate this let's look at some real values and let's use the PIT 
for it (as this is where this originated and what CLOCK_TICK_ADJUST is 
based on). With freq1=1193182 and HZ=1000 we program the timer with 1193 
cycles and the actual update frequency is freq2=1193000. To adjust for 
this difference we change the length of a timer tick:

(NSEC_PER_SEC + CLOCK_TICK_ADJUST) / NTP_INTERVAL_FREQ

Re: Question on timekeeping subsystem

2008-02-13 Thread Roman Zippel

Hi,

On Wednesday 13. February 2008, Francis Moreau wrote:

> First I tried to find some documentation on the current implementation
> but haven't found anything really useful. Especially there's nothing about
> it in Documentation/ directory. Please correct me if I'm already wrong.
>
> Actually I read the implementation of update_wall_time() and I really fail
> to understand how it works. This is probably because I don't know
> what "xtime_nsec" and "error" fields in clocksource struct are for.
> These fields are not documented anywhere in the source code so it
> should be obvious but unfortunately not for me.

These mails should help to understand, what this code does:

http://lkml.org/lkml/2006/3/4/61
http://lkml.org/lkml/2006/4/3/205

bye, Roman

Re: distributed module configuration [Was: Announce: Linux-next (Or Andrew's dream :-))]

2008-02-13 Thread Roman Zippel

Hi,

On Wednesday 13. February 2008, Sam Ravnborg wrote:

> config foo
>   tristate "do you want foo?"
>   depends on USB && BAR
>   module
> obj-$(CONFIG_FOO) += foo.o
> foo-y := file1.o file2.o
>   help
> foo will allow you to explode your PC

I'm more thinking about something like this:

module foo [FOO]
tristate "do you want foo?"
depends on USB && BAR
source file1.c
source file2.c if BAZ

Avoiding direct Makefile fragments would give us far more flexibility in the 
final Makefile output.

> And we could introduce support for
>
> source "drivers/net/Kconfig.*"
>
> But then we would have to make the kconfig step mandatory
> for each build as we would otherwise not know if there
> were added any Kconfig files.

That's a real problem and it would be a step back of what we have right now, 
so I'm not exactly comfortable with it.

bye, Roman

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-02-15 Thread Roman Zippel

Hi,

On Wed, 13 Feb 2008, john stultz wrote:

> Oh! So your issue is that since time_freq is stored in ppm, or in effect
> a usecs per sec offset, when we add it to something other than 1 second
> we mis-apply what NTP tells us to. Is that right?

Pretty much everything is centered around that 1sec, so the closer the 
update frequency is to it the better.

> Right, so if NTP has us apply a 10ppm adjustment, instead of doing:
>   NSEC_PER_SEC + 10,000
> 
> We're doing:
>   NSEC_PER_SEC + CLOCK_TICK_ADJUST + 10,000
> 
> Which, if I'm doing my math right, results in a 10.002ppm adjustment
> (using the 999847467ns number above), instead of just a 10ppm
> adjustment.
> 
> Now, true, this is an error, but it is a pretty small one. Even at the
> maximum 500ppm value, it only results in an error of 76 parts per
> billion. As you'll see below, that tends to be less error than what we
> get from the clock granularity. Is there something else I'm missing here
> or is this really the core issue you're concerned with?

The error accumulates and there is no good reason to do this for the 
common case.
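
The figures quoted above (10.002ppm instead of 10ppm, and 76 parts per billion at the 500ppm maximum) can be reproduced with a short calculation. The helper names below are hypothetical, and the base interval of 999847467 nsec is simply 1 sec plus the CLOCK_TICK_ADJUST value of -152533 nsec mentioned in this thread.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Effective adjustment in parts per billion when adj_ns nanoseconds are
 * added to an interval that is base_ns nanoseconds long. */
static int64_t effective_ppb(int64_t adj_ns, int64_t base_ns)
{
	return adj_ns * NSEC_PER_SEC / base_ns;
}

/* Difference between the applied and the requested adjustment, in ppb,
 * when the base interval is 1 sec + tick_adjust_ns instead of 1 sec. */
static int64_t misapplied_ppb(int64_t adj_ns, int64_t tick_adjust_ns)
{
	return effective_ppb(adj_ns, NSEC_PER_SEC + tick_adjust_ns) -
	       effective_ppb(adj_ns, NSEC_PER_SEC);
}
```

With integer truncation, misapplied_ppb(500000, -152533) evaluates to 76 and misapplied_ppb(10000, -152533) to 1. Without a tick adjustment the difference is zero for any requested value.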

> > In consequence this means, if we want to improve timekeeping, we first set 
> > the (update_cycles * NTP_INTERVAL_FREQ) interval as close as possible to 
> > the real frequency. It doesn't have to be perfect as we usually don't know 
> > the real frequency with 100% certainty anyway. 
> 
> This might need some more explanation, as I'm not certain I know what
> update_cycles refers to. Do you mean cycle_interval? I guess I'm not
> completely sure how you're suggesting we change things here.

clock->cycle_interval

> > Second, we drop the tick 
> > adjustment if possible and leave the adjustments to the NTP daemon and as 
> > long as the drift is within the 500ppm limit it has no problem to manage 
> > this.
> 
> Dropping the tick adjustment? By that do you mean the tick_usec value
> set by adjtimex()? I don't quite see why we want that. Could you expand
> here?

CLOCK_TICK_ADJUST

> HZ=1000 CLOCK_TICK_ADJUST=-152533
> jiffies   467 ppb error
> jiffies NOHZ  467 ppb error
> pit   0 ppb error
> pit NOHZ  0 ppb error
> acpi_pm   -280 ppb error
> acpi_pm NOHZ  279 ppb error
> 
> HZ=1000 CLOCK_TICK_ADJUST=0
> jiffies   153000 ppb error
> jiffies NOHZ  153000 ppb error
> pit   152533 ppb error
> pit NOHZ  0 ppb error
> acpi_pm   -127112 ppb error
> acpi_pm NOHZ  279 ppb error
> 
> So you are right, w/ pit & NO_HZ, the granularity error is always very
> small both with or without CLOCK_TICK_ADJUST. 

If you change the frequency of acpi_pm to 3579000 you'll get this:

HZ=1000 CLOCK_TICK_ADJUST=0
jiffies         153000 ppb error
jiffies NOHZ    153000 ppb error
pit             152533 ppb error
pit NOHZ        0 ppb error
acpi_pm         0 ppb error
acpi_pm NOHZ    0 ppb error

HZ=1000 CLOCK_TICK_ADJUST=-152533
jiffies         0 ppb error
jiffies NOHZ    466 ppb error
pit             -467 ppb error
pit NOHZ        -1 ppb error
acpi_pm         126407 ppb error
acpi_pm NOHZ    22 ppb error

CLOCK_TICK_ADJUST only has any meaning for PIT (and indirectly for 
jiffies). For every other clock you just add some random value, where 
it doesn't do _any_ good.
The worst case error there will always be (ntp_hz/freq/2*10^9nsec), all 
you do with CLOCK_TICK_ADJUST is to shift it around, but it doesn't 
actually fix the error - it's still there.
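
To make the granularity error concrete, here is a minimal C sketch. It assumes the cycle interval is rounded to the nearest whole number of cycles and reads the worst case expression above as "half a cycle per update interval"; both are assumptions of this sketch, as are the function names and the sign convention (positive means more cycles are counted per second than the clock really has), which need not match the tables in this thread.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Cycles per update interval, rounded to the nearest integer. */
static int64_t rounded_cycle_interval(int64_t freq, int64_t ntp_hz)
{
	return (freq + ntp_hz / 2) / ntp_hz;
}

/* Relative error in ppb from counting a whole number of cycles
 * per update interval. */
static int64_t granularity_error_ppb(int64_t freq, int64_t ntp_hz)
{
	int64_t counted = rounded_cycle_interval(freq, ntp_hz) * ntp_hz;

	return (counted - freq) * NSEC_PER_SEC / freq;
}

/* Worst case, half a cycle per interval: ntp_hz / freq / 2, in ppb. */
static int64_t worst_case_error_ppb(int64_t freq, int64_t ntp_hz)
{
	return ntp_hz * NSEC_PER_SEC / freq / 2;
}
```

For a clock at 3579545 Hz and 1000 updates per second the rounded interval is 3580 cycles and the error is 127111 ppb, below the worst case of 139682 ppb. At 3579000 Hz the error is zero, and for 1193182 Hz it is -152533 ppb. A constant offset moves these numbers around but the rounding itself stays.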

> However, without CLOCK_TICK_ADJUST, the jiffies error increases for all
> values of HZ except 100 (which at first seems odd, but seems to be due
> to loss from rounding in the ACTHZ calculation).

jiffies depends on the timer resolution, so it will practically produce 
the same results as PIT (assuming it's used to generate the timer tick).

> One interesting surprise in the data: With CLOCK_TICK_ADJUST=0, the
> acpi_pm's error frequency shot up in the !NO_HZ cases. This ends up
> being due to the acpi_pm being very close to a multiple (3x) of the
> pit frequency, so CLOCK_TICK_ADJUST helps it as well.

What exactly does it help with?
All you are doing is number cosmetics, it has _no_ practical value and 
only decreases the quality of timekeeping.

> Further it seems to point that if we are going to be chasing down small
> sub-100ppb errors (which I think would be great to do, but let's not make
> users endure 200+ppm errors while we debate the fine-tuning :) we
> might want to consider a method where we let ntp_update_freq take into
> account the current clocksource's interval length, so it becomes the
> base value against which we apply adjustments (scaled appropriately).

The error at least is real, the practical value of CLOCK_TICK_ADJUST for the 
common case is nonexistent.

> There are 3 sources of error that we've discussed here:
> 1) The large (280ppm) error seen with no-NTP adjustment, caused by the
> inconsistent (A!=B) interval comparisons which started this discussion,
> which my patch does address.

Part of the error is caused by CLOCK

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-02-18 Thread Roman Zippel

Hi,

On Mon, 18 Feb 2008, john stultz wrote:

> If we are building a train station, and each train car is 60ft, it
> doesn't make much sense to build 1000ft stations, right? Instead you'll
> do better if you build a 1020ft station.

That would assume trains are always 60ft long, which is the basic error in 
your assumption.

Since we're using analogies: What you're doing is to put your winter 
clothes on your weight scale and reset the scale to zero to offset for the 
weight of the clothes. If you now stand on the scale in your bathing 
clothes, does that mean you lost weight?
That's all you do - you change the scale and slightly screw the scale for 
everyone else trying to use it.

To keep in mind what time adjusting is supposed to do:

freq = 1sec + time_freq

What we do instead is:

freq + tick_adj = 1sec + time_freq

Where exactly is now the problem to integrate tick_adj into time_freq? The 
result is _exactly_ the same. The only visible difference is a slightly 
higher time_freq value and as long as it is within the 500 ppm limit there 
is simply no problem.
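
A small sketch of that last point: treating the tick adjustment as ordinary base drift and checking it against the 500 ppm limit. The function names are hypothetical and the example drift of -152533 nsec per second is the PIT value discussed in this thread; 1 nsec per second is taken as 1 ppb.

```c
#include <stdint.h>
#include <stdlib.h>

#define NTP_MAX_FREQ_PPB 500000LL	/* the 500 ppm limit, in ppb */

/* Percentage of the 500 ppm budget used up by a base drift given in
 * nsec per second. */
static int64_t drift_budget_percent(int64_t base_drift_ns)
{
	return llabs(base_drift_ns) * 100 / NTP_MAX_FREQ_PPB;
}

/* Non-zero if the drift could be absorbed through time_freq alone. */
static int drift_within_limit(int64_t base_drift_ns)
{
	return llabs(base_drift_ns) <= NTP_MAX_FREQ_PPB;
}
```

For the assumed -152533 nsec per second, drift_budget_percent() returns 30 and drift_within_limit() is true, so the drift would simply show up as a larger time_freq value.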

> And yes, if we remove CLOCK_TICK_ADJUST, that would also resolve the 
> (A!=B) issue, but it doesn't address the error from #2 below.
> [..]
> 2) We need a solution that handles granularity error well, as this is a
> moderate source of error for coarse clocksources such as jiffies.
> CLOCK_TICK_ADJUST does cover this fairly well in most cases. I suspect
> we could do even better, but that will take some deeper changes.

How exactly does CLOCK_TICK_ADJUST address this problem? The error due to 
insufficient resolution is still there, all it does is shift the scale, 
so it's not immediately visible.

> My understanding of your approach (removing CLOCK_TICK_ADJUST),
> addresses issues #1 and #3, but hurts issue #2.

What exactly is hurt?

bye, Roman

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-02-20 Thread Roman Zippel

ly based on the PIT anymore. The whole 
reason for the original patch is pretty much gone by now.

If you really need some kind of adjustment for your extremely broken 
hardware, below is the absolute maximum you need, which doesn't inflict 
more insanity on all the sane hardware.

bye, Roman


Revert bbe4d18ac2e058c56adb0cd71f49d9ed3216a405 and 
e13a2e61dd5152f5499d2003470acf9c838eab84 and remove CLOCK_TICK_ADJUST 
completely. Add an optional kernel parameter ntp_tick_adj instead to allow 
adjusting of a large base drift and thus keeping ntpd happy.
The CLOCK_TICK_ADJUST mechanism was introduced at a time when PIT was the 
primary clock, but we have a variety of clock sources now, so a global PIT 
specific adjustment makes little sense anymore.

Signed-off-by: Roman Zippel <[EMAIL PROTECTED]>


---
 include/linux/timex.h |9 +
 kernel/time/ntp.c |   11 ++-
 kernel/time/timekeeping.c |6 ++
 3 files changed, 13 insertions(+), 13 deletions(-)

Index: linux-2.6/include/linux/timex.h
===
--- linux-2.6.orig/include/linux/timex.h
+++ linux-2.6/include/linux/timex.h
@@ -232,14 +232,7 @@ static inline int ntp_synced(void)
 #else
 #define NTP_INTERVAL_FREQ  (HZ)
 #endif
-
-#define CLOCK_TICK_OVERFLOW    (LATCH * HZ - CLOCK_TICK_RATE)
-#define CLOCK_TICK_ADJUST      (((s64)CLOCK_TICK_OVERFLOW * NSEC_PER_SEC) / \
-                               (s64)CLOCK_TICK_RATE)
-
-/* Because using NSEC_PER_SEC would be too easy */
-#define NTP_INTERVAL_LENGTH ((((s64)TICK_USEC * NSEC_PER_USEC * USER_HZ) + \
-                             CLOCK_TICK_ADJUST) / NTP_INTERVAL_FREQ)
+#define NTP_INTERVAL_LENGTH (NSEC_PER_SEC/NTP_INTERVAL_FREQ)
 
 /* Returns how long ticks are at present, in ns / 2^(SHIFT_SCALE-10). */
 extern u64 current_tick_length(void);
Index: linux-2.6/kernel/time/ntp.c
===
--- linux-2.6.orig/kernel/time/ntp.c
+++ linux-2.6/kernel/time/ntp.c
@@ -42,12 +42,13 @@ long time_esterror = NTP_PHASE_LIMIT;   /*
 long time_freq;                /* frequency offset (scaled ppm)*/
 static long time_reftime;  /* time at last adjustment (s)  */
 long time_adjust;
+long ntp_tick_adj;
 
 static void ntp_update_frequency(void)
 {
u64 second_length = (u64)(tick_usec * NSEC_PER_USEC * USER_HZ)
<< TICK_LENGTH_SHIFT;
-   second_length += (s64)CLOCK_TICK_ADJUST << TICK_LENGTH_SHIFT;
+   second_length += (s64)ntp_tick_adj << TICK_LENGTH_SHIFT;
second_length += (s64)time_freq << (TICK_LENGTH_SHIFT - SHIFT_NSEC);
 
tick_length_base = second_length;
@@ -400,3 +401,11 @@ leave: if ((time_status & (STA_UNSYNC|ST
notify_cmos_timer();
return(result);
 }
+
+static int __init ntp_tick_adj_setup(char *str)
+{
+   ntp_tick_adj = simple_strtol(str, NULL, 0);
+   return 1;
+}
+
+__setup("ntp_tick_adj=", ntp_tick_adj_setup);
Index: linux-2.6/kernel/time/timekeeping.c
===
--- linux-2.6.orig/kernel/time/timekeeping.c
+++ linux-2.6/kernel/time/timekeeping.c
@@ -187,8 +187,7 @@ static void change_clocksource(void)
 
clock->error = 0;
clock->xtime_nsec = 0;
-   clocksource_calculate_interval(clock,
-   (unsigned long)(current_tick_length()>>TICK_LENGTH_SHIFT));
+   clocksource_calculate_interval(clock, NTP_INTERVAL_LENGTH);
 
tick_clock_notify();
 
@@ -245,8 +244,7 @@ void __init timekeeping_init(void)
ntp_clear();
 
clock = clocksource_get_next();
-   clocksource_calculate_interval(clock,
-   (unsigned long)(current_tick_length()>>TICK_LENGTH_SHIFT));
+   clocksource_calculate_interval(clock, NTP_INTERVAL_LENGTH);
clock->cycle_last = clocksource_read(clock);
 
xtime.tv_sec = sec;
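
The effect of the patch on the two interval lengths can be illustrated outside the kernel. The sketch below is a simplified model, not the kernel code: time_freq is taken as zero, TICK_LENGTH_SHIFT is assumed to be 32, tick_usec is assumed to be 10000 with USER_HZ=100, and the function names are made up.

```c
#include <stdint.h>

#define NSEC_PER_SEC      1000000000LL
#define NSEC_PER_USEC     1000LL
#define TICK_LENGTH_SHIFT 32	/* assumed value, for illustration only */

/* Simplified tick_length_base with time_freq == 0, shifted back to nsec. */
static int64_t base_tick_length_ns(int64_t tick_usec, int64_t user_hz,
				   int64_t tick_adj_ns,
				   int64_t ntp_interval_freq)
{
	int64_t scale = (int64_t)1 << TICK_LENGTH_SHIFT;
	int64_t second_length = tick_usec * NSEC_PER_USEC * user_hz * scale;

	/* multiply instead of shifting, since tick_adj_ns may be negative */
	second_length += tick_adj_ns * scale;

	return (second_length / ntp_interval_freq) >> TICK_LENGTH_SHIFT;
}

/* NTP_INTERVAL_LENGTH as defined by the patch. */
static int64_t ntp_interval_length_ns(int64_t ntp_interval_freq)
{
	return NSEC_PER_SEC / ntp_interval_freq;
}
```

With a zero adjustment and 1000 updates per second both functions return 1000000 nsec, so the clocksource and NTP agree on the interval. With the old PIT-derived adjustment of -152533 nsec the first one returns 999847 nsec while the second stays at 1000000.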

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-02-08 Thread Roman Zippel

Hi,

On Fri, 1 Feb 2008, John Stultz wrote:

> > CLOCK_TICK_ADJUST is based on LATCH and HZ, if the update frequency isn't 
> > based on HZ, there is no point in using it!
> 
> Hey Roman,
> 
> Again, I'm sorry I don't seem to be following your objections. If you
> want to suggest a different patch to fix the issue, it might help.

I already gave you the necessary details for how to set 
NTP_INTERVAL_LENGTH and in the previous mail I explained the basis for it. 
I really don't understand what's your problem with it. Why do you try to 
make it more complex than necessary?

> The big issue for me, is that we have to initialize the clocksource
> cycle interval so it matches the base tick_length that NTP uses.
> 
> To be clear, the issue I'm trying to address is only this:
> Assuming there is no NTP adjustment yet to be made, if we initialize the
> clocksource interval to X, then compare it with Y when we accumulate, we
> introduce error if X and Y are not the same.
> 
> It really doesn't matter how long the length is, if we're including
> CLOCK_TICK_ADJUST, or if it really matches the actual HZ tick length or
> not. The issue is that we have to be consistent. If we're not, then we
> introduce error that ntpd has to additionally correct for.

You don't create consistency by adding corrections all over the place 
until it adds up to the right sum.
The current correction is already somewhat of a hack and I'd rather get 
rid of it than to let it spread all over the place (it's really only 
needed so that people with weird HZ settings don't hit the 500ppm limit 
and we're basically cheating to the ntpd by not telling it the real 
frequency). Please keep the knowledge about this crutch at a single place 
and don't spread it.
Anyway, for NO_HZ this correction is completely irrelevant, so again 
there's no point in adding random values all over the place until you get 
the correct result.

The only other alternative would be to calculate this correction 
dynamically. For this you leave NTP_INTERVAL_LENGTH as is and when 
changing clocks you check whether "abs(((cs->xtime_interval * 
NTP_INTERVAL_FREQ) >> cs->shift) - NSEC_PER_SEC)" exceeds a certain limit 
(e.g. 200usec) and in this case you print a warning message, that the 
clock has a large base drift value and is a bad ntp source and apply a 
correction value. This way the correction only hits the very few systems 
which might need it and it would be the preferred solution, but it also 
requires a few more changes.
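
A user-space sketch of such a check might look as follows. Everything here is an assumption made for illustration: struct clock_sketch is a cut-down stand-in and not the kernel's struct clocksource, the helper names are invented, the cycle interval is rounded to the nearest cycle, and the limit is the 200usec suggested above.

```c
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC        1000000000LL
#define BASE_DRIFT_LIMIT_NS 200000LL	/* the "e.g. 200usec" limit */

/* Hypothetical stand-in for the few clocksource fields used here. */
struct clock_sketch {
	const char *name;
	uint64_t xtime_interval;	/* shifted nsec per update interval */
	unsigned int shift;
};

/* Derive the shifted interval from a clock frequency in Hz. */
static struct clock_sketch clock_sketch_init(const char *name, uint64_t freq,
					     unsigned int shift,
					     uint64_t ntp_interval_freq)
{
	uint64_t mult = ((uint64_t)NSEC_PER_SEC << shift) / freq;
	uint64_t cycles = (freq + ntp_interval_freq / 2) / ntp_interval_freq;
	struct clock_sketch cs = { name, cycles * mult, shift };

	return cs;
}

/* ((xtime_interval * NTP_INTERVAL_FREQ) >> shift) - NSEC_PER_SEC */
static int64_t base_drift_ns(const struct clock_sketch *cs,
			     uint64_t ntp_interval_freq)
{
	uint64_t second = (cs->xtime_interval * ntp_interval_freq) >> cs->shift;

	return (int64_t)second - NSEC_PER_SEC;
}

/* Warn and return 1 if the base drift exceeds the limit. */
static int needs_correction(const struct clock_sketch *cs,
			    uint64_t ntp_interval_freq)
{
	int64_t drift = base_drift_ns(cs, ntp_interval_freq);

	if (drift < 0)
		drift = -drift;
	if (drift <= BASE_DRIFT_LIMIT_NS)
		return 0;
	fprintf(stderr, "%s: large base drift (%lld nsec/sec), bad ntp source\n",
		cs->name, (long long)drift);
	return 1;
}
```

A 3579545 Hz clock updated 1000 times per second stays below the limit in this model, so nothing is printed and no correction is applied. A hypothetical 32768 Hz clock at the same update rate would be several milliseconds per second off and trigger the warning.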

bye, Roman

Re: [patch 01/26] mount options: add documentation

2008-02-08 Thread Roman Zippel

Hi,

On Wed, 30 Jan 2008, Miklos Szeredi wrote:

> > How does this deal with certain special cases:
> > - chroot: how will mount/df show only the mounts relevant for the chroot?
> 
> That is a very good question.  Andreas Gruenbacher had some patches
> for fixing behavior of /proc/mounts under a chroot, but people are
> paranoid about userspace ABI changes (unwarranted in this case, IMO).
> 
>   http://lkml.org/lkml/2007/4/20/147
> 
> Anyway, if we are going to have a new 'mountinfo' file, this could be
> easily fixed as well.
> 
> > - loop: how is the connection between file and loop device maintained?
> 
> We also discussed this with Karel, maybe it didn't make it onto lkml.
> 
> The proposed solution was to store the "loop" flag separately in a
> file under /var.  It could just be an empty file for each such loop
> device:
> 
>   /var/lib/mount/loops/loop0
> 
> This file is created by mount(8) if the '-oloop' option is given.  And
> umount(8) automatically tears down the loop device if it finds this
> file.

My question was maybe a little short. I don't doubt that we can shove a 
lot into the kernel, the question is rather how much of this will be 
unnecessary information, which the kernel doesn't really need itself.

> > Could you also please explain why you want to go via user mounts. Other OSes use a
> > daemon for that, which e.g. can maintain access controls. How do you want to
> > manage this?
> 
> The unprivileged mounts patches do contain a simple form of access
> control.  I don't think anything more is needed, but of course, having
> unprivileged mounts in the kernel does not prevent the use of a more
> sophisticated access control daemon in userspace, if that becomes
> necessary.

An "I don't think anything more is needed" sets off all sorts of warning 
lights. Most things start out simple, so IMO it's well worth checking 
where it might go, to know the limits beforehand. The main question here 
is why should a kernel based solution be preferable over a daemon based 
solution?

If we look for example at OS X, it has no need for user mounts but 
has a daemon instead, which also provides an interesting notification 
system for new devices, mounts or unmount requests. All this could also be 
done in the kernel, but where would be the advantage in doing so? The 
kernel implementation would be either rather limited or only bloat the 
kernel. What is the feature that would make user mounts more than just a 
cool kernel hack?

bye, Roman

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-02-08 Thread Roman Zippel

Hi,

On Fri, 8 Feb 2008, john stultz wrote:

>  
>   clock = clocksource_get_next();
> - clocksource_calculate_interval(clock,
> - (unsigned long)(current_tick_length()>>TICK_LENGTH_SHIFT));
> + clocksource_calculate_interval(clock, NTP_INTERVAL_LENGTH);
>   clock->cycle_last = clocksource_read(clock);
>  

Only now I noticed that the first patch had been merged without any 
further question. :-(
What point is there in reviewing patches, if everything is merged anyway. :-(

bye, Roman

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-02-09 Thread Roman Zippel

Hi,

On Fri, 8 Feb 2008, Andrew Morton wrote:

> > Only now I noticed that the first patch had been merged without any 
> > further question. :-(
> > What point is there in reviewing patches, if everything is merged anyway. :-(
> > 
> 
> oops, mistake, sorry.  There's plenty of time to fix it though.

It has been signed off by both Ingo and Thomas and neither noticed 
anything? This makes me very afraid of the merging process...

bye, Roman

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-02-10 Thread Roman Zippel

Hi,

On Fri, 8 Feb 2008, john stultz wrote:

> -ENOPATCH
> 
> We're taking weeks to critique a fairly small bug fix. I'm sure we both
> have better things to do than continue to misunderstand each other. I'll
> do the testing and will happily ack it if it resolves the issue.

I don't want to just send a patch, I want you to understand why your 
approach is wrong.

> Now, If you're disputing that I'm correcting the wrong side of the
> equation, then we're getting somewhere. But it's still not clear to me
> how you're suggesting the other side (which is calculated in
> ntp_update_frequency) be changed.
> [..]
> You keep on bringing up NO_HZ, and again, the bug I'm trying to fix
> occurs with or without NO_HZ. The fix I proposed resolves the issue with
> or without NO_HZ.

The correction is incorrect for NO_HZ.
Let's try it the other way around, as my explanation seems to lack 
something.
Please try to explain what this correction actually means and why it 
should be correct for NO_HZ as well.

> > The only other alternative would be to calculate this correction 
> > dynamically. For this you leave NTP_INTERVAL_LENGTH as is and when 
> > changing clocks you check whether "abs(((cs->xtime_interval * 
> > NTP_INTERVAL_FREQ) >> cs->shift) - NSEC_PER_SEC)" exceeds a certain limit 
> > (e.g. 200usec) and in this case you print a warning message, that the 
> > clock has large base drift value and is a bad ntp source and apply a 
> > correction value. This way the correction only hits the very few systems 
> > which might need it and it would be the preferred solution, but it also 
> > requires a few more changes.
> 
> Uh, that seems to be just checking if the xtime_interval is off base, or
> if the ntp correction has gone too far. I just don't see how this
> connects to the issue at hand.

Above is the key to understanding the problem, if this difference is small 
enough there is no need to correct anything.

This is the original patch which introduced the correction:

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=69c1cd9218e4cf3016b0f435d6ef3dffb5a53860

Keep in mind that at that time PIT was used as the base clock (even if the 
tsc was used, it was relative to PIT). So how much of those assumptions 
are still valid today (especially with NO_HZ enabled)?

bye, Roman

Re: [PATCH] correct inconsistent ntp interval/tick_length usage

2008-02-25 Thread Roman Zippel

Hi,

On Thu, 21 Feb 2008, john stultz wrote:

> > Again, what kind of crappy hardware do you expect? Aren't clocks supposed 
> > to get better and not worse?
> 
> Well, while I've seen much worse, I consider crappy hardware to be
> 100+ ppm error. So if the hardware is perfect and the system results in
> 153ppm error, I'd consider that pretty crappy, especially if it's not the
> hardware's fault.

Nevertheless this error is real, why are you trying to hide it?
This isn't an error we can't handle, it's still perfectly within the 
limit and except that NTP reports a somewhat larger drift than you'd like 
to see, everything works fine.

> > Where do you get this idea that the 500ppm are exclusively for hardware 
> > errors? If you have such bad hardware, there is another simple solution: 
> > change HZ to 100 and the error is reduced to 15ppm.
> 
> True, it's not exclusively for hardware errors, and if we were talking
> about only 15ppm I wouldn't really worry about it. But when we're saying
> the system is adding 30% of the maximum error, that's just not good.

Another 30% is required for normal to crappy hardware clocks and then 
there is still enough room left.

> > I would see the point if this problem actually had any practical 
> > relevance, but this error is not a problem for pretty much all existing 
> > standard hardware. Why are you insisting on redesigning timekeeping for 
> > broken hardware?
> 
> Remember my earlier data? Where I was talking about the acpi_pm being a
> multiple of the PIT frequency? By removing CLOCK_TICK_ADJUST we got a
> 127ppm error when HZ=1000. NO_HZ drops that down to where we don't care,
> but this _does_ affect current hardware, so I'd call it relevant.

How exactly does it affect current hardware in a way that breaks it? 
Despite this error everything still works fine, the hardware doesn't care.

> > There's nothing 'injected', that resolution error is very real and the 
> > 500ppm limit is more than enough to deal with this. _Nobody_ is hurt by 
> > this.
> 
> Sure, 500ppm is enough for most people with good hardware. But remember
> the alpha example you brought up earlier? The HZ=1200 case, with the
> CLOCK_TICK_RATE=32768? If we don't take CLOCK_TICK_ADJUST into account,
> we end up with a **11230ppm** error from the granularity issue. NTP just
> won't work on those systems.
> 
> Now granted, the three types of alpha systems that actually use that HZ
> value is probably as close to "nobody" as you're going to get, but I
> don't think we can just throw the granularity issue aside.

That's actually a good example of why it's irrelevant. First, it's using a 
cycle based clock, thus the rounding error is irrelevant. Second, in the 
common case they already use 1024 as HZ to reduce this error, so something 
similar could be done for the HZ=1200 case, and I suspect that it was 
already done and only CLOCK_TICK_RATE is just wrong. This mail 
http://consortiumlibrary.org/axp-list/archive/2002-11/0101.html suggests 
that this is the right thing to do.

There is _no_ reason to artificially optimize this error value, there are 
still enough other ways to improve timekeeping. The granularity error is 
there no matter what you do and as long as it's within a reasonable limit 
there is nothing that needs fixing.
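
The error figures in this exchange can be reproduced from the LATCH rounding
alone. A rough sketch, assuming the usual definition
LATCH = (CLOCK_TICK_RATE + HZ/2) / HZ and assuming a PIT rate of 1193182 Hz
(the mails do not state the rate); latch() and granularity_error_ppm() are
made-up helper names.

```c
/* Timer counts per tick, rounded to nearest (the LATCH definition). */
static inline long latch(long clock_tick_rate, long hz)
{
	return (clock_tick_rate + hz / 2) / hz;
}

/*
 * Relative difference, in ppm, between the counts programmed for hz ticks
 * (latch * hz) and the counts the timer really produces in one second.
 */
static inline double granularity_error_ppm(long clock_tick_rate, long hz)
{
	long counted = latch(clock_tick_rate, hz) * hz;

	return (double)(counted - clock_tick_rate) * 1e6
		/ (double)clock_tick_rate;
}
```

Under these assumptions the function gives about -152.5 ppm for
(1193182, 1000), about +15.1 ppm for (1193182, 100), about -11230 ppm for
(32768, 1200) and exactly 0 for (32768, 1024), which matches the magnitudes
quoted in the thread.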

bye, Roman

Re: [PATCH] Make the kernel NTP code hand 64-bit unsigned values to do_div()

2008-02-25 Thread Roman Zippel

Hi,

On Thu, 21 Feb 2008, David Howells wrote:

> The kernel NTP code shouldn't hand 64-bit *signed* values to do_div().  Make
> it instead hand 64-bit unsigned values.  This gets rid of a couple of warnings.

I would actually prefer to introduce an explicit API for signed 64-bit 
divides to get rid of the temps completely, something like below.
Right now it uses do_div as a fallback. When all archs are converted, do_div 
can be a single compatibility define and perhaps we can get rid of it
completely.
Bonus feature: implement the x86 version without the asm casts, allowing 
gcc to generate better code.
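
For readers without the full patch at hand, here is a userspace sketch of the
idea. The function names and signatures follow the patch, but the bodies are
assumptions: the unsigned divide is written as two 32-bit steps in portable C
(the same schoolbook scheme a divl-based version would use), and the signed
variant is one plausible way to build on it, since the lib/div64.c part of
the patch is not shown in this archive.

```c
#include <stdint.h>

/* 64-by-32 unsigned divide done as two 32-bit long-division steps. */
static inline uint64_t div_u64_rem(uint64_t dividend, uint32_t divisor,
				   uint32_t *remainder)
{
	uint32_t high = (uint32_t)(dividend >> 32);
	uint32_t low = (uint32_t)dividend;
	uint32_t qhigh = high / divisor;
	/* high % divisor < divisor, so part / divisor fits in 32 bits */
	uint64_t part = ((uint64_t)(high % divisor) << 32) | low;

	*remainder = (uint32_t)(part % divisor);
	return ((uint64_t)qhigh << 32) | (uint32_t)(part / divisor);
}

/*
 * Signed variant: divide the magnitudes, then restore the signs.
 * Truncates toward zero like C's '/', and the remainder takes the sign
 * of the dividend like C's '%'.
 */
static inline int64_t div_s64_rem(int64_t dividend, int32_t divisor,
				  int32_t *remainder)
{
	uint64_t udividend = dividend < 0 ? -(uint64_t)dividend
					  : (uint64_t)dividend;
	uint32_t udivisor = divisor < 0 ? -(uint32_t)divisor
					: (uint32_t)divisor;
	uint32_t urem;
	uint64_t quotient = div_u64_rem(udividend, udivisor, &urem);

	*remainder = dividend < 0 ? -(int32_t)urem : (int32_t)urem;
	if ((dividend < 0) != (divisor < 0))
		return (int64_t)-quotient;
	return (int64_t)quotient;
}

static inline int64_t div_s64(int64_t dividend, int32_t divisor)
{
	int32_t remainder;

	return div_s64_rem(dividend, divisor, &remainder);
}
```

A caller then needs no temporary for do_div(), e.g.
div_s64(-1000000007LL, 1000) yields -1000000. The sketch does not guard
against a zero divisor or the overflowing case INT64_MIN / -1.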

bye, Roman

---
 include/asm-generic/div64.h |   14 ++
 include/asm-i386/div64.h|   20 
 include/linux/calc64.h  |   28 
 kernel/time.c   |   26 +++---
 kernel/time/ntp.c   |   21 +
 lib/div64.c |   21 -
 6 files changed, 94 insertions(+), 36 deletions(-)

Index: linux-2.6/include/asm-generic/div64.h
===
--- linux-2.6.orig/include/asm-generic/div64.h
+++ linux-2.6/include/asm-generic/div64.h
@@ -35,6 +35,20 @@ static inline uint64_t div64_64(uint64_t
return dividend / divisor;
 }
 
+static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
+{
+   *remainder = dividend % divisor;
+   return dividend / divisor;
+}
+#define div_u64_rem div_u64_rem
+
+static inline s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder)
+{
+   *remainder = dividend % divisor;
+   return dividend / divisor;
+}
+#define div_s64_rem div_s64_rem
+
 #elif BITS_PER_LONG == 32
 
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
Index: linux-2.6/include/asm-i386/div64.h
===
--- linux-2.6.orig/include/asm-i386/div64.h
+++ linux-2.6/include/asm-i386/div64.h
@@ -48,5 +48,25 @@ div_ll_X_l_rem(long long divs, long div,
 
 }
 
+static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
+{
+   union {
+   u64 v64;
+   u32 v32[2];
+   } d = { dividend };
+   u32 upper;
+
+   upper = d.v32[1];
+   if (upper) {
+   upper = d.v32[1] % divisor;
+   d.v32[1] = d.v32[1] / divisor;
+   }
+   asm ("divl %2" : "=a" (d.v32[0]), "=d" (*remainder) :
+   "rm" (divisor), "0" (d.v32[0]), "1" (upper));
+   return d.v64;
+}
+#define div_u64_rem div_u64_rem
+
 extern uint64_t div64_64(uint64_t dividend, uint64_t divisor);
+
 #endif
Index: linux-2.6/include/linux/calc64.h
===
--- linux-2.6.orig/include/linux/calc64.h
+++ linux-2.6/include/linux/calc64.h
@@ -46,4 +46,32 @@ static inline long div_long_long_rem_sig
return res;
 }
 
+#ifndef div_u64_rem
+static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
+{
+   *remainder = do_div(dividend, divisor);
+   return dividend;
+}
+#endif
+
+#ifndef div_u64
+static inline u64 div_u64(u64 dividend, u32 divisor)
+{
+   u32 remainder;
+   return div_u64_rem(dividend, divisor, &remainder);
+}
+#endif
+
+#ifndef div_s64_rem
+extern s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder);
+#endif
+
+#ifndef div_s64
+static inline s64 div_s64(s64 dividend, s32 divisor)
+{
+   s32 remainder;
+   return div_s64_rem(dividend, divisor, &remainder);
+}
+#endif
+
 #endif
Index: linux-2.6/kernel/time.c
===
--- linux-2.6.orig/kernel/time.c
+++ linux-2.6/kernel/time.c
@@ -661,9 +661,7 @@ clock_t jiffies_to_clock_t(long x)
 #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
return x / (HZ / USER_HZ);
 #else
-   u64 tmp = (u64)x * TICK_NSEC;
-   do_div(tmp, (NSEC_PER_SEC / USER_HZ));
-   return (long)tmp;
+   return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
 #endif
 }
 EXPORT_SYMBOL(jiffies_to_clock_t);
@@ -675,16 +673,12 @@ unsigned long clock_t_to_jiffies(unsigne
return ~0UL;
return x * (HZ / USER_HZ);
 #else
-   u64 jif;
-
/* Don't worry about loss of precision here .. */
if (x >= ~0UL / HZ * USER_HZ)
return ~0UL;
 
/* .. but do try to contain it here */
-   jif = x * (u64) HZ;
-   do_div(jif, USER_HZ);
-   return jif;
+   return div_u64((u64)x * HZ, USER_HZ);
 #endif
 }
 EXPORT_SYMBOL(clock_t_to_jiffies);
@@ -692,17 +686,15 @@ EXPORT_SYMBOL(clock_t_to_jiffies);
 u64 jiffies_64_to_clock_t(u64 x)
 {
 #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
-   do_div(x, HZ / USER_HZ);
+   return div_u64(x, HZ / USER_HZ);
 #else
/*
 * There are better ways that don't overflow early,
 * but even this doesn't overflow in hundreds of years
 * in 64 bits, so..

Re: amiga affs support broken in 2.4.x kernels??

2001-04-15 Thread Roman Zippel


Hi,

Mark Hounschell wrote:

>  I'm not a list member so IF you respond to this mail please CC me.
> I've been looking at the archives and see some problems with the 2.3.x
> kernel versions and affs support.

I've put a new version at
http://www.xs4all.nl/~zippel/affs.010414.tar.gz

bye, Roman

Re: amiga affs support broken in 2.4.x kernels??

2001-04-16 Thread Roman Zippel


Hi,

Mark Hounschell wrote:

> Thanks, I can now mount affs filesystems. However when I try to write
> to it via "cp somefile /amiga/somefile" I get a segmentation fault. If
> I then do a "df -h" it hangs the system very much like the mount command
> did before I installed your tar-ball. Was write support expected from
> it?

Yes, it should work.
What sort of filesystem is it (ffs or ofs)? Did you check the dmesg
output for an oops? Which kernel version did you use?

> Are you the NEW maintainer of the affs stuff?

Yes, and as soon as this problem is solved, I'm sending the changes to Linus
and Alan.

bye, Roman

Re: amiga affs support broken in 2.4.x kernels??

2001-04-17 Thread Roman Zippel


Hi,

Mark Hounschell wrote:

>  Sorry I didn't get back to you yesterday afternoon. I was out of town.
>  Attached is the output from dmesg and the relevant info from
> /var/log/messages.

Could you try the attached patch? I forgot to initialize a variable
correctly.
(I also put a new version at
http://www.xs4all.nl/~zippel/affs.010417.tar.gz)
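
A minimal sketch of why the initial value matters, assuming (this is a
reading of the patch below, not of the full affs source) that s_last_bmap
caches the index of the bitmap block held in s_bmap_bh and that 0 is a valid
index. All names in the sketch are hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_BMAP (~(uint32_t)0)	/* sentinel: no bitmap block is cached */

struct bmap_cache {
	uint32_t last_bmap;	/* index of the cached bitmap block */
	const char *bh;		/* stand-in for the buffer head, NULL if none */
	int reads;		/* number of simulated block reads */
};

static inline void bmap_cache_init(struct bmap_cache *c)
{
	c->last_bmap = NO_BMAP;	/* never equal to a real index */
	c->bh = NULL;
	c->reads = 0;
}

/* Return the buffer for bitmap block 'bmap', reading it on a cache miss. */
static inline const char *bmap_get(struct bmap_cache *c, uint32_t bmap)
{
	if (c->last_bmap != bmap) {
		c->bh = "bitmap data";	/* simulated read */
		c->last_bmap = bmap;
		c->reads++;
	}
	return c->bh;
}
```

If last_bmap started out as 0 while bh was still NULL, the first request for
bitmap block 0 would look like a cache hit and hand back a NULL buffer. With
the all-ones sentinel the first request always misses and reads the block.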

> I believe the filesystem is ffs
> but not exactly sure. How do I tell?

It's printed if you mount with '-overbose', but it shouldn't be needed
anymore. :)

bye, Roman

--- fs/affs/bitmap.c.org  Sat Apr  7 04:23:41 2001
+++ fs/affs/bitmap.c  Tue Apr 17 19:49:18 2001
@@ -124,7 +124,7 @@
 err_bh_read:
affs_error(sb,"affs_free_block","Cannot read bitmap block %u", bm->bm_key);
AFFS_SB->s_bmap_bh = NULL;
-   AFFS_SB->s_last_bmap = 0;
+   AFFS_SB->s_last_bmap = ~0;
up(&AFFS_SB->s_bmlock);
return;
 
@@ -262,7 +262,7 @@
 err_bh_read:
affs_error(sb,"affs_read_block","Cannot read bitmap block %u", bm->bm_key);
AFFS_SB->s_bmap_bh = NULL;
-   AFFS_SB->s_last_bmap = 0;
+   AFFS_SB->s_last_bmap = ~0;
 err_full:
pr_debug("failed\n");
up(&AFFS_SB->s_bmlock);
@@ -288,6 +288,8 @@
return 0;
}
 
+   AFFS_SB->s_last_bmap = ~0;
+   AFFS_SB->s_bmap_bh = NULL;
AFFS_SB->s_bmap_bits = sb->s_blocksize * 8 - 32;
AFFS_SB->s_bmap_count = (AFFS_SB->s_partition_size - AFFS_SB->s_reserved +
 AFFS_SB->s_bmap_bits - 1) / AFFS_SB->s_bmap_bits;

Re: Races in affs_unlink(), affs_rmdir() and affs_rename()

2001-04-21 Thread Roman Zippel

Hi,

Alexander Viro wrote:

> unlink("/B/b") locks /B, removes "b" and unlocks /B. Then it calls
> affs_remove_link(), which blocks.
> 
> unlink("/A/a") locks /A, removes "a" and unlocks /A. Then it calls
> affs_remove_link(). Which locks /B, renames removed entry into "b",
> removes old "b" and inserts renamed "a" into /B.
> 
> The rest is irrelevant - we're already in it.

Thanks for finding that one, but it should be easy to fix. I can remove
the parent pointer in affs_remove_hash and check for that before I try to
rename that entry.

> Since you don't lock /B for affs_empty_dir(), you can hit the
> window between removing old /B/a and inserting renamed /A/a into /B.
> Notice that VFS _does_ lock /B (->i_zombie), but affs_remove_link()
> for /A/a doesn't even look at it.

I thought about that one and I know it should be locked. The reason I
don't do it right now is that affs supports hardlinks to dirs. The problem
is especially recursive links, e.g.:

mkdir A
ln A A/B
rm A/B

This is possible with affs, but will already deadlock in vfs.

mkdir A
mkdir A/B
ln A A/B/C
rm A/B/C/A &
rm A/B/C &
rm A/B

Every rm already takes the hash lock of the parent and then I can't
simply also take the hash lock of the dir itself. What I actually want
to do is to insert a reverse is_subdir() check before taking the lock.
On the other hand I was thinking whether I should allow links to dirs at
all and just show them as empty/readonly dirs. For 2.4 that's probably
safer, as it would require a lot of locking changes in vfs and the other
fs to support this properly, particularly moving most of the locking
from vfs into the fs.

bye, Roman

Re: Linux 2.4.3-ac12

2001-04-22 Thread Roman Zippel


Hi,

Jes Sorensen wrote:

> In principle you just need 2.7.2.3 for m68k, but someone decided to
> raise the bar for all architectures by putting a check in a common
> header file.

IIRC 2.7.2.3 has problems with labeled initializers for structures,
which makes 2.7.2.3 unusable for all archs under 2.4.

bye, Roman

Re: problems with reiserfs + nfs using 2.4.2-pre4

2001-02-19 Thread Roman Zippel


Hi,

On Tue, 20 Feb 2001, Neil Brown wrote:

> 2/ lookup("..").

A small question:
Why exactly is this needed?

bye, Roman


Re: [NFS] Re: problems with reiserfs + nfs using 2.4.2-pre4

2001-02-20 Thread Roman Zippel

Hi,

On 20 Feb 2001, Trond Myklebust wrote:

> IIRC several NFS implementations (not Linux though) rely on being able
> to walk back up the directory tree in order to discover the path at
> any given moment.

If I read the source correctly, namespace operations are done with dir file
handle + file name. I'm playing with the idea of relaxing the rule
that all dentries must be connected to the root. Inode to dentry lookups
are really evil, e.g. the current code ignores that there might be a fs
that supports links to dirs (besides that, vfs doesn't support that very
well either).
What IMO knfsd needs is only a file handle <-> inode operation, and as long
as the inode is not connected to a dcache entry (i_dentry is empty) it
gets a dummy dentry, which is used for further lookups. As soon as a real
dentry looks up that inode, we can flush the dummy dentry (small change to
d_instantiate()).
This would make it possible to support a fs that can't look up "..", or it
would avoid extra checks for a fs that doesn't have real ".." dir entries.
All a fs needs to do is to generate a 16(?) byte cookie, which can be
used to find the inode again (with the default being i_ino + i_generation).
This is nothing for 2.4, but IMO something that could be tried with 2.5.

bye, Roman

PS: /me is searching his fire proof underwear. :)


Re: [NFS] Re: problems with reiserfs + nfs using 2.4.2-pre4

2001-02-20 Thread Roman Zippel

Hi,

On Tue, 20 Feb 2001, Trond Myklebust wrote:

> If I read the code correctly, we set the dentry d_flag
> DCACHE_NFSD_DISCONNECTED on such dummy dentries.  We only force a
> lookup of the full path if the inode represents a directory or the
> NFSEXP_NOSUBTREECHECK export flag is not set.

IMO you can't safely delay the release of the dummy entry without the help of
vfs. Are these dummy entries always properly released?
It seems I forgot about the subtree check, so it seems a fs that can't
provide a get_parent can only be exported completely?

> It doesn't seem like a major change to delay that full path lookup of
> the dentry until nfsd_lookup('..') is actually called (in the case
> where the 'subtree_check' flag isn't used).
> However, outright banning lookups of '..' by any one filesystem isn't
> an option: path lookups are used for a lot more than just
> `getcwd'. Imagine for instance trying to follow a relative soft link
> across such a filesystem.

AFAIK this is already done in the generic code (in path_walk(), which is 
also called by vfs_follow_link()).

bye, Roman


Re: [CFT][PATCH] Re: fat problem in 2.4.2

2001-03-01 Thread Roman Zippel


Hi,

On Thu, 1 Mar 2001, Alexander Viro wrote:

> +static int generic_vm_expand(struct address_space *mapping, loff_t size)
> +{
> + struct page *page;
> + unsigned long index, offset;
> + int err;
> +
> + if (!mapping->a_ops->prepare_write || !mapping->a_ops->commit_write)
> + return -ENOSYS;
> +
> + offset = (size & (PAGE_CACHE_SIZE-1)); /* Within page */
> + index = size >> PAGE_CACHE_SHIFT;

For affs I did basically the same with a small difference:

offset = ((size-1) & (PAGE_CACHE_SIZE-1)) + 1;
index = (size-1) >> PAGE_CACHE_SHIFT;

That works fine here and allocates a page in the cache more likely to be
used.
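
The difference between the two calculations shows up when the new size is an
exact multiple of the page size. A small sketch, assuming 4 KiB pages;
struct page_pos, pos_generic() and pos_affs() are names made up for this
illustration.

```c
#include <stdint.h>

#define PAGE_CACHE_SHIFT 12	/* assumption: 4 KiB pages */
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)

struct page_pos {
	unsigned long index;	/* page index in the mapping */
	unsigned long offset;	/* byte offset within that page */
};

/* Calculation from the quoted generic_vm_expand(). */
static inline struct page_pos pos_generic(uint64_t size)
{
	struct page_pos p;

	p.offset = (unsigned long)(size & (PAGE_CACHE_SIZE - 1));
	p.index = (unsigned long)(size >> PAGE_CACHE_SHIFT);
	return p;
}

/* The affs variant; only meaningful for size > 0. */
static inline struct page_pos pos_affs(uint64_t size)
{
	struct page_pos p;

	p.offset = (unsigned long)((size - 1) & (PAGE_CACHE_SIZE - 1)) + 1;
	p.index = (unsigned long)((size - 1) >> PAGE_CACHE_SHIFT);
	return p;
}
```

For size 8192 the first variant picks page 2 at offset 0, a page that holds
no file data yet, while the second picks page 1 at offset 4096, the last page
that belongs to the file. For sizes that are not page aligned (e.g. 5000)
both give the same result.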

bye, Roman


Re: [CFT][PATCH] Re: fat problem in 2.4.2

2001-03-01 Thread Roman Zippel


Hi,

On Thu, 1 Mar 2001, Alexander Viro wrote:

>   IOW, if it's worth doing at all it probably should be
> on expanding path in vmtruncate() - limit checks are already
> done, but old i_size is still not lost...

The fs where it's important have mmu_private, that's what I use to decide
whether to expand or truncate.

bye, Roman


Re: console spin_lock

2001-01-17 Thread Roman Zippel

Hi,

On Thu, 18 Jan 2001, Andrew Morton wrote:

> - Get rid of the special printk buffer - share the
>   log buffer.  (Implies writes to console
>   devices will be broken into two writes when they
>   wrap around).
> - Teach vsprintf to print into a circular buffer
>   (snprintf thus comes for free).

I have a different vsprintf variant - vpprintf(). It takes a function and
a data pointer. This function is called with the print buffer, and within
that function you can take care of the locking. The only problem is that
%n doesn't work anymore, but it's not used anyway in the kernel (as far as
I can grep :) ).
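
The real vpprintf() is not shown in this thread, so the following is only a
userspace sketch of the callback idea with made-up names (print_fn,
vpprintf_sketch, pprintf_sketch, struct ring). It formats into a small local
buffer with vsnprintf() and hands the result to the callback once, which is
simpler than a formatter that would call the function piece by piece. The
example callback writes into a circular buffer.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Callback that receives formatted output plus a caller-supplied pointer. */
typedef void (*print_fn)(void *data, const char *buf, size_t len);

static inline int vpprintf_sketch(print_fn fn, void *data,
				  const char *fmt, va_list args)
{
	char chunk[128];
	int len = vsnprintf(chunk, sizeof(chunk), fmt, args);

	if (len < 0)
		return len;
	if ((size_t)len >= sizeof(chunk))
		len = (int)sizeof(chunk) - 1;	/* output was truncated */
	fn(data, chunk, (size_t)len);
	return len;
}

static inline int pprintf_sketch(print_fn fn, void *data,
				 const char *fmt, ...)
{
	va_list args;
	int len;

	va_start(args, fmt);
	len = vpprintf_sketch(fn, data, fmt, args);
	va_end(args);
	return len;
}

/* Example consumer: a tiny circular log buffer. */
struct ring {
	char buf[8];
	size_t head;	/* next position to write */
};

static inline void ring_write(void *data, const char *buf, size_t len)
{
	struct ring *r = data;
	size_t i;

	/* a real consumer would take its lock here */
	for (i = 0; i < len; i++) {
		r->buf[r->head] = buf[i];
		r->head = (r->head + 1) % sizeof(r->buf);
	}
	/* ... and drop the lock here */
}
```

Usage would look like pprintf_sketch(ring_write, &log, "%s-%d", name, nr).
The formatter never needs to know whether the target wraps around or how it
is locked, which is the point of passing a function instead of a buffer.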

bye, Roman


Re: Is sendfile all that sexy?

2001-01-18 Thread Roman Zippel

Hi,

On Thu, 18 Jan 2001, Linus Torvalds wrote:

> > Actually, this is a great example, because at one point I was working
> > on a device interface which would offload all of the disk-disk copying
> > overhead to the disks themselves, and not involve the CPU/RAM at all.
> 
> It's a horrible example.
> 
> device-to-device copies sound like the ultimate thing. 
> 
> They suck. They add a lot of complexity and do not work in general. And,
> if your "normal" usage pattern really is to just move the data without
> even looking at it, then you have to ask yourself whether you're doing
> something worthwhile in the first place.
> 
> Not going to happen.

device-to-device is not the same as disk-to-disk. A better example would
be a streaming file server. Slowly the pci bus becomes a bottleneck, so why
would you want to move the data twice over the pci bus if once is enough
and the data is very likely not needed afterwards? Sure you can use a more
expensive 64bit/66MHz bus, but why should you if the 32bit/33MHz bus is
theoretically fast enough for your application?
So I'm not advising it as "the ultimate thing", but I don't understand
why it shouldn't happen.

bye, Roman


Re: Is sendfile all that sexy?

2001-01-18 Thread Roman Zippel

Hi,

On Thu, 18 Jan 2001, Linus Torvalds wrote:

> It's too damn device-dependent, and it's not worth it. There's no way to
> make it general with any current hardware, and there probably isn't going
> to be for at least another decade or so. And because it's expensive and
> slow to do even on a hardware level, it probably won't be done even then.
> 
> [...]
> 
> An important point in interface design is to know when you don't know
> enough. We do not have the internal interfaces for doing anything like
> this, and I seriously doubt they'll be around soon.

I agree, it's device dependent, but such hardware exists. It needs of
course its own memory, but then you can see it as a NUMA architecture and
we already have the support for this. Create a new memory zone for the
device memory and keep the pages reserved. Now you can use it almost like
other memory, e.g. reading from/writing to it using address_space_ops.

An application where I'd like to use it is audio recording/playback
(24bit, 96kHz on 144 channels). Although it's possible to copy that amount
of data around, you can't do much besides that then. All the data is
most of the time only needed on the soundcard, so why should I copy it
first to the main memory?
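
To put a number on "that amount of data", here is a small calculation,
assuming packed 24-bit samples (3 bytes each); audio_bytes_per_sec() is a
made-up helper.

```c
#include <stdint.h>

/* Raw PCM data rate; bits_per_sample is assumed to be a multiple of 8. */
static inline uint64_t audio_bytes_per_sec(unsigned int bits_per_sample,
					   unsigned int sample_rate_hz,
					   unsigned int channels)
{
	return (uint64_t)(bits_per_sample / 8) * sample_rate_hz * channels;
}
```

audio_bytes_per_sec(24, 96000, 144) is 41472000, i.e. roughly 41.5 MB/s in
one direction. If every sample goes to main memory first and from there to
the card, the same data crosses the bus twice, roughly 83 MB/s.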

Right now I'm stuck with accessing a scsi device directly, but I would love
to use the generic file/address_space interface for that, so you can
directly stream to/from any filesystem. The only problem is that the fs
interface is still too slow.

That's btw the reason I suggested to split the get_block function. If you
record into a file, you first just want to allocate any block from the fs
for that file. A bit later when you start the write, you need a real
block. And again a bit later you can still update the inode. These three
stages have completely different locking requirements (except the page
lock) and you can use the same mechanism for delayed writes.

Anyway, now with the zerocopy network patches, there are basically already
all the needed interfaces and you don't have to wait for 10 years, so I
think you need to polish your crystal ball. :-)

bye, Roman


Re: Is sendfile all that sexy?

2001-01-19 Thread Roman Zippel

Hi,

On Thu, 18 Jan 2001, Linus Torvalds wrote:

> > I agree, it's device dependent, but such hardware exists.
> 
> Show me any practical case where the hardware actually exists.

http://www.augan.com

> I do not know of _any_ disk controllers that let you map the controller
> buffers over PCI. Which means that with current hardware, you have to
> assume that the disk is the initiator of the PCI-PCI DMA requests. Agreed?

Yes.

> I'm sure there are sound cards that just expose their buffers directly.
> Fine. Make a special user-space driver for it. Don't try to make it into a
> design.
> [..]
> You need to have a damn special sound card to do the above.

That's true. "Soundcard" is actually a small understatement. :)
Why should I make a new design for it, when it fits nicely into the
current design?

> And you wouldn't need a new memory zone - the kernel wouldn't ever touch
> the memory anyway, you'd just ioremap() it if you needed to access it
> programmatically in addition to the streaming of data off disk.

ioremapped memory is not the same (that's what we do right now), you have
to fake some virtual address to get the data to the right physical
location.

> Also, even when you happen to have the 1% card combination where it would
> work in the first place, you'd better make sure that they are on the same
> PCI bus. That's usually true on most PC's today, but that's probably going
> to be an issue eventually. 

I agree, it's a special setup.

> > Anyway, now with the zerocopy network patches, there are basically already
> > all the needed interfaces and you don't have to wait for 10 years, so I
> > think you need to polish your crystal ball. :-)
> 
> The zero-copy network patches have _none_ of the interfaces you think you
> need. They do not fix the fact that hardware usually doesn't even _allow_
> for what you are hoping for. And what you want is probably going to be
> less likely in the future than more likely.

It's about direct i/o from/to pages, for that you need a page struct (so
the ioremapping doesn't work). See the memory on the pci card as normal
memory, except that you can't allocate it normally, but you can still
organize it like normal memory. All you need to do is to set up this memory
area, then you can use it like normal memory, e.g. I can put it into the
page cache and I can do a normal read/write with it. The changes are very
minor, but it would solve so many other problems (especially alias
issues).

I know that this isn't possible with every hardware combination,
nonetheless it's not that big a problem to support it where it's possible.

bye, Roman


Re: Is sendfile all that sexy?

2001-01-20 Thread Roman Zippel

Hi,

On Sat, 20 Jan 2001, Linus Torvalds wrote:

> There's no no-no here: you can even create the "struct page"s on demand,
> and create a dummy local zone that contains them that they all point back
> to. It should be trivial - nobody else cares about those pages or that
> zone anyway.

AFAIK as long as that dummy page struct is only used in the page cache,
that should work, but you get new problems as soon as you map the page
also into a user process (grep for CONFIG_DISCONTIGMEM under
include/asm-mips64 to see the needed changes). In the worst case one
might need reverse mapping to get the page back. :)

> That said, nobody has actually done this in practice yet, so there may be
> details to work out, of course. I don't see any fundamental reasons it
> wouldn't easily work, but..

I hope I'll soon have the time to experiment with this, so I'll know for sure.
I don't see major problems, except I don't know yet how the performance
will be.

bye, Roman


Re: Is sendfile all that sexy?

2001-01-20 Thread Roman Zippel

Hi,

On Sat, 20 Jan 2001, Linus Torvalds wrote:

> But point-to-point also means that you don't get any real advantage from
> doing things like device-to-device DMA. Because the links are
> asynchronous, you need buffers in between them anyway, and there is no
> bandwidth advantage of not going through the hub if the topology is a
> pretty normal "star" kind of thing. And you _do_ want the star topology,
> because in the end most of the bandwidth you want concentrated at the
> point that uses it.

I agree, but who says that the buffer always has to be the main memory?
That might be true especially for embedded devices. The cpu is then just
the local controller that manages several devices, each with its own buffer.
Let's take a file server with multiple disks and multiple network cards,
each with its own buffer. For stuff like this you don't want to go through the
main memory, on the other hand you still need to synchronize all the data.
Although I don't know of such hardware, I don't see a reason not to do it
under Linux. :-)

bye, Roman


Re: Is sendfile all that sexy?

2001-01-20 Thread Roman Zippel

Hi,

On Sat, 20 Jan 2001, Linus Torvalds wrote:

> Now, there are things to look out for: when you do these kinds of dummy
> "struct page" tricks, some macros etc WILL NOT WORK. In particular, we do
> not currently have a good "page_to_bus/phys()" function. That means that
> anybody trying to do DMA to this page is currently screwed, simply because
> he has no good way of getting the physical address.
> 
> This is a limitation in general: the PTE access functions would also like
> to have "page_to_phys()" and "phys_to_page()" functions. It gets even
> worse with IO mappings, where "CPU-physical" is NOT necessarily the same
> as "bus-physical".

That's why I want to avoid dummy struct page and use a real mem_map
instead. I have two options:
1. I map everything together in one mem_map, like it's still done for
m68k, the overhead here is in the phys_to_virt()/virt_to_phys() function.
2. I use several nodes like mips64/arm and virt_to_page() gets more
complex, but this usually assumes a specific memory layout to keep it
fast.
Once that problem is solved, I can manage the memory on the card like the
main memory and use it however I want. I'll probably do something like ia64
and use the highest bits as an offset into a table.
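
A rough sketch of the "highest bits as an offset into a table" idea, assuming
32-bit physical addresses split into 16 regions of 256 MiB. NODE_SHIFT,
struct mem_node, node_table and this phys_to_page() are hypothetical and only
illustrate the lookup; they are not taken from any kernel tree.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define NODE_SHIFT 28	/* assumption: one table entry per 256 MiB */
#define MAX_NODES  16	/* 32-bit address >> NODE_SHIFT gives 0..15 */

struct page {
	int dummy;	/* stand-in for the real page struct */
};

struct mem_node {
	struct page *mem_map;	/* page array of this region, NULL if empty */
	unsigned long nr_pages;	/* number of entries in mem_map */
};

static struct mem_node node_table[MAX_NODES];

/* Map a physical address to its page struct, or NULL if none exists. */
static inline struct page *phys_to_page(uint32_t phys)
{
	struct mem_node *node = &node_table[phys >> NODE_SHIFT];
	unsigned long pfn = (phys & ((1UL << NODE_SHIFT) - 1)) >> PAGE_SHIFT;

	if (!node->mem_map || pfn >= node->nr_pages)
		return NULL;
	return &node->mem_map[pfn];
}
```

Main memory and the memory on a card would each get their own table entry, so
the lookup costs one shift, one table access and one bounds check, regardless
of how far apart the regions are in the address space.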

bye, Roman


Re: Is sendfile all that sexy?

2001-01-20 Thread Roman Zippel

Hi,

On Sat, 20 Jan 2001, Linus Torvalds wrote:

> But think like a good hardware designer.
> 
> In 99% of all cases, where do you want the results of a read to end up?
> Where do you want the contents of a write to come from?
> 
> Right. Memory.
> 
> Now, optimize for the common case. Make the common case go as fast as you
> can, with as little latency and as high bandwidth as you can.
> 
> What kind of hardware would _you_ design for the point-to-point link?
> 
> I'm claiming that you'd do a nice DMA engine for each link point. There
> wouldn't be any reason to have any other buffers (except, of course,
> minimal buffers inside the IO chip itself - not for the whole packet, but
> for just being able to handle cases where you don't have 100% access to
> the memory bus all the time - and for doing things like burst reads and
> writes to memory etc).
> 
> I'm _not_ seeing the point for a high-performance link to have a generic
> packet buffer. 

I completely agree, if we are talking about standard pc hardware. I was
more thinking about some dedicated hardware, where you want to get the
data directly to the correct place. If the hardware does a bit more with
the data you need large buffers. In a standard pc the main cpu does most
of the data processing, but in dedicated hardware you might have several
cards, each with its own logic and memory, and here the cpu only manages
that stuff. You can do all this of course from user space, but this
means you have to copy the data around, which you don't want with such
hardware, when the kernel can help you a bit.

bye, Roman


hfs support for blocksize != 512

2000-08-29 Thread Roman Zippel


Hi,

Here is a patch for anyone who needs to access HFS on e.g. an MO drive.
It's only for 2.2.16, but I was able to do it as part of my job, as we
need that functionality. Anyway, I've also read a bit through the HFS+
spec, and IMO most of the current hfs needs to be rewritten for 2.4:
e.g. its special files should go into the page cache, and hfs assumes
512 byte blocks basically everywhere, which isn't true anymore with
HFS+. This 512 byte block problem is also the reason that the
performance of this patch will suck badly on MOs, since _every_ write (of
a 512 byte block) requires a read (of a 1024 byte sector).
Anyway, I'm happy about any bug reports that you can't reproduce with
hfs on a drive with 512 byte sectors (for that I'm still trying to fully
understand hfs btrees :-) ). I don't think this patch should be included
in standard 2.2, but on the other hand it also shouldn't make anything
worse than it already is.
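
A minimal sketch of the read-modify-write cycle described above, assuming a
toy in-memory device that can only transfer whole 1024 byte sectors. This is
not code from the patch: `struct mo_dev`, `sector_read()`, `sector_write()`
and `write_block512()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <string.h>

#define HFS_BLOCK_SIZE  512   /* block size hfs assumes */
#define MO_SECTOR_SIZE  1024  /* sector size of the (hypothetical) MO drive */
#define NUM_SECTORS     4

/* Toy in-memory "device" that only transfers whole sectors. */
struct mo_dev {
    unsigned char data[NUM_SECTORS][MO_SECTOR_SIZE];
    unsigned long reads;   /* sector reads issued so far */
    unsigned long writes;  /* sector writes issued so far */
};

static void sector_read(struct mo_dev *dev, unsigned sector, unsigned char *buf)
{
    assert(sector < NUM_SECTORS);
    memcpy(buf, dev->data[sector], MO_SECTOR_SIZE);
    dev->reads++;
}

static void sector_write(struct mo_dev *dev, unsigned sector,
                         const unsigned char *buf)
{
    assert(sector < NUM_SECTORS);
    memcpy(dev->data[sector], buf, MO_SECTOR_SIZE);
    dev->writes++;
}

/*
 * Write one 512 byte block: read the enclosing sector, patch the half
 * that belongs to the block, and write the whole sector back.
 * Returns 0 on success, -1 if the block lies outside the device.
 */
static int write_block512(struct mo_dev *dev, unsigned block,
                          const unsigned char *src)
{
    unsigned char sector_buf[MO_SECTOR_SIZE];
    unsigned per_sector = MO_SECTOR_SIZE / HFS_BLOCK_SIZE;
    unsigned sector = block / per_sector;
    unsigned offset = (block % per_sector) * HFS_BLOCK_SIZE;

    if (sector >= NUM_SECTORS)
        return -1;
    sector_read(dev, sector, sector_buf);          /* the extra read */
    memcpy(sector_buf + offset, src, HFS_BLOCK_SIZE);
    sector_write(dev, sector, sector_buf);
    return 0;
}
```

In this model every block write costs one sector read plus one sector
write, which is the overhead the paragraph above complains about.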

bye, Roman
 hfs1024.diff.gz

Re: hfs support for blocksize != 512

2000-08-29 Thread Roman Zippel


Hi,

> Darnit, documentation on filesystem locking is there for purpose. First
> folks complain about its absence, then they don't bother to read the
> bloody thing once it is there. Furrfu...

It's great that it's there, but it still doesn't tell you everything.

> Said that, handling of indirect blocks used to be badly b0rken on all
> normal filesystems and it had been fixed only on ext2, so I wouldn't be
> amazed if regular files were bad on B-tree style filesystems. Directories
> are easy - all requests are process-synchronous (no pageout), no
> truncate() in sight, so the life is better.

I don't think that files are that easy, at least from what I know now from
hfs. For example, reading from a file might require a read from a btree
file (the extent file), which another file write may be busy with at the
same time (e.g. reordering the btree nodes).
I really would prefer that a fs could sleep _and_ use semaphores; that
would keep locking simple, otherwise it only becomes a fscking mess.

bye, Roman


Re: hfs support for blocksize != 512

2000-08-29 Thread Roman Zippel


Hi,

> > hfs. For example reading from a file might require a read from a btree
> > file (extent file), with what another file write can be busy with (e.g.
> > reordering the btree nodes).
> 
> And?

The point is: the thing I like about Linux is its simple interfaces, it's
the basic idea of unix - keep it simple. That is true for most parts - the
basic idea is simple and the real complexity is hidden behind it. But
that's currently not true for the vfs interface: a fs maintainer right now
has to fight with a fscking complex vfs interface and with a possibly
fscking complex fs implementation. E2fs or affs have a pretty simple
structure and I believe you that they're not that hard to fix; maybe there
is also a simple solution for hfs. But I'd like you to forget about that
and think about the big picture (as Linus nicely puts it). What we should
aim at with the vfs interface is simplicity. I want to use a fscking
simple semaphore to protect something, like anywhere else; I don't want to
juggle with lots of blocks which have to be updated atomically. Maybe you
get it right once, but it will follow you as a nightmare: you add one
feature (e.g. quota), you add another feature (like btrees), are you still
so damned fscking sure to get it right and keep it right?
So and? What I'd really like to see from you is a bit more support for
other people's problems. I really don't expect you to solve these
problems, but if someone approaches a different solution, you're pretty
quick to reject it.
So let's get back to the vfs interface. Filesystems currently have to do
pretty much all their changes atomically: they have to grab all the
buffers they need and do all changes at once. How can you be sure that
this is possible for every possible fs? How do you make sure you don't
create other problems like livelocks? We currently have the problem that
things like kswapd require an asynchronous interface, but filesystems
prefer to synchronize. Currently you're pushing all the burden of an
asynchronous interface into the fs, which would rather avoid that. Why
don't you think in the other direction for a moment? Currently I'm playing
with the idea of a kernel thread for asynchronous io (maybe one per fs):
that thread takes the io requests e.g. from kswapd and can safely sleep on
them, while kswapd can continue its job. I don't know yet where to put it,
whether in the fs specific part or whether it can be made generic enough
to be put into the generic part. Can we please think in that direction for
a moment? At some point you have to synchronize the io anyway (at the
latest when it hits the device), but I would very much prefer it if a fs
got some help at some earlier point.
(Anyway, I need some sleep now as well... :) )

bye, Roman


Re: hfs support for blocksize != 512

2000-08-30 Thread Roman Zippel


Hi,

> Yes? And it will become simpler if you will put each and every locking
> scheme into the API?

No, I didn't say that. I want the API to be less restrictive and make
the job for the fs a bit easier. IMO the current API is inconsistent
and/or incomplete and I'm still trying to find out what exactly is
missing. The VFS is becoming more and more multithreaded, locks are
(re)moved, but nothing was added for the fs.

> We have ext2 with indirect blocks, inode bitmaps and block bitmaps, one
> per cylinder group + counters in each cylinder group. Should VFS know
> about the internal locking rules? Should it be aware of the fact that
> inodes foo and bar belong to the same cylinder group and if we remove them
> we will need to protect the bitmap for a while?

Ok, let's take ext2 as an example. Of course vfs should only be the
abstraction layer, but it shouldn't enforce locking rules like the ones
you added in ext2. I know the races have existed for longer, so you don't
have to argue about that, but earlier I suggested a simpler solution; the
problem is that it requires holding an exclusive lock while sleeping.
It wouldn't even be in the fast path and would only affect write access
to the indirect blocks of a single file; it doesn't affect reads and it
doesn't affect access to other files - that really shouldn't be a problem
even in a multithreaded environment. But currently this is not possible,
and all I'm trying to do now is to explore ways to make it possible, as
it would make life for ext2 and every other fs a lot easier.

> We have AFFS with totally fscked directory structures.

Sorry? Why is that? Because it's not UNIX friendly? It was designed for
a completely different os and is very simple. The problems I know of are
mostly shared with every other fs that has a more dynamic directory
structure than ext2.

> It's insane - protection of purely internal data structures belongs to the
> module that knows about them.

I absolutely don't argue against that!

Anyway, somehow you skipped a lot of my mail, so it seems I have to
continue to discuss that with myself (hopefully without permanent
damage).
The major problem right now is that writepage() is supposed to be
asynchronous, especially for kswapd, but the fs might have to
synchronize something _internal_. I think one problem here is that we
still have a synchronous buffer API, which makes it very hard to
implement an asynchronous interface. That's why I suggested an I/O
thread, which can sleep on behalf of the caller. Another possibility is
to make the already existing asynchronous interface in buffer.c available
to the fs. Anyway, if we want an asynchronous fs interface, we need an
asynchronous buffer interface, so that e.g. writepage() in ext2 can lock
the indirect block, start the I/O and get called back later; another
writepage() call in the same area has to detect that lock (with a simple
down_trylock()) and schedule the complete I/O for later. With some help
from the buffer interface this should be possible pretty easily, and ext2
would actually become much simpler again. Something like this would also
be great for real AIO support in userspace with great latencies.
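
A rough single-threaded model of the writepage() idea above: the first
caller takes the lock on the indirect block and starts I/O, a second caller
in the same area sees the lock via a try-lock and queues its page instead of
sleeping, and the completion callback starts the deferred work. All names
here (`struct sketch_indirect`, `sketch_writepage()`, `sketch_io_done()`) are
hypothetical; only down_trylock() is mentioned in the mail, and it is
imitated here by a plain flag rather than a real kernel semaphore.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEFERRED 8

/* Toy stand-in for an indirect block guarded by a try-lock. */
struct sketch_indirect {
    int locked;                  /* 1 while I/O on this block is in flight */
    int deferred[MAX_DEFERRED];  /* page numbers waiting for the lock */
    size_t ndeferred;
    unsigned long io_started;    /* number of page writes started */
};

/* Imitates down_trylock(): returns 0 on success, nonzero if already held. */
static int sketch_trylock(struct sketch_indirect *b)
{
    if (b->locked)
        return 1;
    b->locked = 1;
    return 0;
}

/* Pretend to submit the write of one page; requires the block lock. */
static void start_io(struct sketch_indirect *b, int page)
{
    (void)page;
    assert(b->locked);
    b->io_started++;
}

/*
 * Never sleeps. Returns 1 if I/O was started, 0 if the page was deferred
 * because the block is busy, -1 if the deferral queue is full.
 */
static int sketch_writepage(struct sketch_indirect *b, int page)
{
    if (sketch_trylock(b)) {
        if (b->ndeferred == MAX_DEFERRED)
            return -1;
        b->deferred[b->ndeferred++] = page;
        return 0;
    }
    start_io(b, page);
    return 1;
}

/* Completion callback: start the deferred pages, then release the lock. */
static void sketch_io_done(struct sketch_indirect *b)
{
    size_t i;

    for (i = 0; i < b->ndeferred; i++)
        start_io(b, b->deferred[i]);
    b->ndeferred = 0;
    b->locked = 0;
}
```

A real implementation would have to wait for the deferred writes to
complete as well before dropping the lock; the sketch only shows that the
caller (e.g. kswapd) never has to block.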

bye, Roman

Re: hfs support for blocksize != 512

2000-08-30 Thread Roman Zippel

Hi,

Tony Mantler wrote:

> For those of you who would rather not have read through this entire email,
> here's the condensed version: VFS is inherintly a wrong-level API, QNX does
> it much better. Flame on. :)

VFS isn't really wrong; the problem is that it moved from an almost
single threaded API to a multithreaded API, and that development isn't
complete yet. I don't really expect fs programming to become easier,
but it should stay sane. For example, I want to protect certain state
changes properly, and not do that insane "check all possible states at
all possible times and before and after every change" which Al is
currently doing in ext2.

bye, Roman

Re: hfs support for blocksize != 512

2000-08-30 Thread Roman Zippel


Hi,

> It sounds to me like different FSes have different needs.  Maybe the best
> approach is to have two or three fs APIs, according to the needs of the
> fs.

No, having several fs APIs is a maintenance nightmare, I think that's
something everyone agrees on. What is needed is to modify the API to
meet all requirements of vfs and the needs of the fs. (The problem is we
don't agree on what the fs needs...)

bye, Roman

Re: hfs support for blocksize != 512

2000-08-30 Thread Roman Zippel


Hi,

>   Show me these removed locks. The only polite explanation I see is
> that you have serious reading comprehension problems. Let me say it once
> more, hopefully that will sink in:
> 
>   Your repeated claims of VFS becoming more multi-threaded in ways
> that are not transparent to fs drivers wrt locking are false.

For example, the usage of the inode lock changed quite a bit and was partly
replaced with the page lock? I can still remember times when all of the
fs stuff happened under the BKL; for me that means only a _single_ thread
of execution could be busy in the whole fs layer. IMHO that's not really a
prime example of multi-threaded programming; if you have a different
definition, please let me know.

> What? You've proposed locking on pageout. If _that_ isn't the fast path...

No, I suggested a lock (not necessarily the inode lock) during allocation
of indirect blocks (and defer truncation of them).

> > The major problem right now is that writepage() is supposed to be
> > asynchronous especially for kswapd, but the fs might have to
> > synchronized something _internal_. I think one problem here is that we
> > still have a synchronous buffer API, what makes it very hard to
> > implement a asynchronous interface. That's why I suggested an I/O
> 
> Wrong. As the matter of fact, we could trivially get rid of _any_ use of
> bread() and friends on ext2.

Excuse my stupidity, but could you please outline me how?

> _One_ thread? For the whole fs? So you would pass the dirty pages from
> kswapd to that guy. Fine. It attempts to acquire the inode semaphore (in
> your proposal, as far as I could parse it). It blocks. kswapd keeps
> pumping dirty pages into the queue of that thread. Wonderful...

Sorry, but did you read my mail? The purpose of that thread is to sleep
and to get woken up to continue the IO. Not very much changes, except that
this thread can safely sleep, whereas kswapd can't.
Excuse my ignorance, but what currently stops kswapd from starting lots
of IO?

>   b) doesn't help AFFS directory problems

Why the hell do you always come up with this? I _never_ mentioned it.

>   Talk is cheap. If you can show the patch that would simplify ext2,
> I'm sure that Ted will be glad to see it. Same for maintainers of other
> filesystems. The only requirement is that it should work. Excuse me, but
> the longer I read your postings the more it looks like you have no idea of
> the things you are talking about. I would be glad to be proven wrong on
> that one too ;-/

I'm very sorry to waste your precious time, but your fscking arrogance
makes me sick. What's your problem? Shall I first worship you as our fs
god who saved us from all races?
Sorry, but from time to time I prefer to think about a problem _first_
and try to understand it. One way to do this is to post questions and/or
suggestions to a mailing list (at least I thought so). If you have
another suggestion, please enlighten me.

bye, Roman


Re: hfs support for blocksize != 512

2000-08-31 Thread Roman Zippel



Re: hfs support for blocksize != 512

2000-08-31 Thread Roman Zippel

Hi,

(Sorry for the previous empty mail, I was a bit too fast with sending and
couldn't stop it completely.)

On Wed, 30 Aug 2000, Alexander Viro wrote:

I concentrate on the most interesting part:

> As for AFFS directory format - fine, please describe the data
> manipulations required by unlink("foo"); done after the
> link("foo","bar/baz");. Both operations are supported on AmigaOS, so
> references to UNIX are utterly irrelevant. On the block level, please.
> Only for directory blocks. Now, tell me what kind of protection (pageout
> has nothing to directories, so all async problems are irrelevant) would
> you provide. Or what protection should VFS/core kernel/exec/whatever
> provide to filesystem.

Disclaimer: I know that the following doesn't match the current
implementation, it's just how I would intuitively do it:

- get dentry foo
- get dentry baz
- lock inode foo
- mark dentry foo as deleted
- getblk file header foo
- mark file header foo as deleted
- getblk file header baz
- update file header baz from file header foo
- brelse file header baz
- update inode foo
- unlock inode foo
- put dentry baz
- lock foo's parent
- getblk and update dir header parent
- getblk file headers from foo's chain until file header of predecessor of
  foo found
- update predecessor to point to successor of foo
- brelse everything
- unlock foo's parent
- put and invalidate dentry foo
- last user of foo frees file header foo in bitmap

I probably forgot something, but you will surely tell me. Two things I
want to mention anyway. First, I only lock something when needed, which of
course breaks with current conventions. Second (and most important), I use
the dentry to block a possible lookup of an inode, so no one can open or
create foo or do anything else with it. A rename would work similarly,
except that the new dentry would be marked as not complete yet.

> On that specific operation. When you are done with
> that, I have a rename() for you, but I think that even simpler example
> (unlink()) will be sufficient.

Please post it, I know there are some interesting examples, but I don't
have them at hand. I wanted to keep that flamewar for later, but if we're
already in it...

> Again, we are talking about the data structure and operations it has to
> deal with _according to its designers_. I claim that due to a bad data
> structure design (single-linked lists in hash chains, requirement to have
> all entries belonging to some chain) unlink() (one of the operations it
> was designed to deal with) becomes very complicated  and requires rather
> hairy exclusion rules.  On Amiga. Linux has nothing with the problem.

To be fair, it should be mentioned that links were added to affs later.

bye, Roman


Re: hfs support for blocksize != 512

2000-08-31 Thread Roman Zippel

Hi,

On Wed, 30 Aug 2000, Alexander Viro wrote:

>   c) ->i_sem on pageout? When?

For 2.2.16:

filemap_write_page() <- filemap_swapout() <- try_to_swap_out() <- ... <-
swap_out() <- do_try_to_free_pages() <- kswapd()

filemap_write_page() takes i_sem and calls do_write_page(). What did I
miss?

>   BKL matters only in the areas where you do not block. Moreover,
> fs code is still under the BKL, so it's totally moot.

Let me state it differently, what I'm trying to say:
Past: Lots of filesystem code wasn't designed/written with multiple
threads in mind. The result is lots of races.
Future: We want to experiment with a preemptible kernel. Maybe that
experiment will fail, but I'm certainly interested in it. But the result
here will be a wonderful world of new races and I'm pretty sure your ext2
fixes will break here, one more reason I'm so keen to use semaphores.

All I wanted to say is that the level of threading is changing. How that
is visible in the fs layer is a different problem.

> > > Wrong. As the matter of fact, we could trivially get rid of _any_ use of
> > > bread() and friends on ext2.
> > 
> > Excuse my stupidity, but could you please outline me how?
> 
> Using kiovec, for one thing.

Huh? You said "trivially".

> One thing that became really obvious is that current documentation
> is either not enough or not read. Hell knows what to do about the latter,
> but the former can be helped.

Documentation is one (good) thing (I really tried to find as much as
possible), but my point is that I tried to discuss design issues, I didn't
want to know how it works now (for that I can and do read the source), I
want to discuss the possibility of alternative solutions, is that really
impossible?
Anyway, after I've discussed that enough with myself, I think I can try to
code up something as soon as I find the time for it.

bye, Roman


Re: hfs support for blocksize != 512

2000-08-31 Thread Roman Zippel


Hi,

> > - get dentry foo
> > - get dentry baz
> 
> How? OK, you've found block of baz. You know the name, all right.

Links are chained together and all point back to the original, so if you
remove the original, you have quite something to do with lots of links.

> Now
> you've got to do the full tracing all the way back to root.

All file headers have a pointer to the dir header, so it's not that
difficult, but that makes links to directories so interesting. :)

Anyway, I'd better try to describe the idea more generally:
The basic idea is to introduce transient states to vfs and to move the
locking into the fs, which probably knows better what needs to be
protected. This would avoid the current locking overkill. Let's take a
rename: first we mark the object as to be moved; there is no need to keep
it locked after this. An open on this object would either fail or have to
wait (on a separate queue). Next we mark the destination dir as not
removable. This is basically the job of vfs so far; the next steps happen
in the fs. (I use affs here as an example.) First we lock the source dir,
remove the object from the chain and unlock the dir. Now I can lock the
destination, insert the object there and unlock the dir. (Back to vfs.)
All we have to do now is to restore the state of the destination dir and
the object, and wake up anyone who's waiting.
Back to the original example of removing a file with links. I have to get
the dentry of baz, as I have to prevent a lookup of that link while I'm
modifying its block. But I think it's enough to lock that block and check
only the cached aliases. Then I can modify that block and unlock it again.

> > - update file header baz from file header foo
> 
> If it would be that simple... Extent blocks refer to foo, unfortunately.
> Yes, copying the thing would be easier. Too bad, data structure prohibits
> that.

Which data structure prohibits that?
Updating the extent blocks isn't that difficult as the back links are not
needed for general operation, it's just wasting I/O. A bit more
problematic are concurrent readers of foo, so I can't simply trash the
buffer of foo's file header, but I can simply keep it allocated till the
file is closed (keeps also the inode number constant and unique).

> Well, consider rename over the primary link and there you go... Keep in
> mind that extent blocks contain the reference to header block, so unless
> you want to update them all you've got to move the header into donor's
> chain ;-/

Oops, I just read rename(2) and noticed that I forgot about a small detail.
Ok, the rename operation above gets slightly more difficult. Basically it's
only a variation of the unlink problem: I first unlink the old file and
then insert the new file. As I do less locking, I shouldn't have a
locking problem, or what am I missing? I just might have to update lots of
back links, but that is not a critical part.

[I can skip the affs history part, I just see you already got a better
answer than I could give.]

bye, Roman


Re: hfs support for blocksize != 512

2000-09-01 Thread Roman Zippel

Hi,

On Thu, 31 Aug 2000, Alexander Viro wrote:

> Go ahead, write it. IMNSHO it's going to be much more complicated and
> race-prone, but code talks. If you will manage to write it in clear and
> race-free way - fine. Frankly, I don't believe that it's doable.

It will be more complicated insofar as I want to use a more complex state
machine than "locked <-> unlocked"; on the other hand I can avoid such
funny constructions as triple_down() and obscure locking order rules.

At any time the object will be either locked or in a well defined state,
and at any time a thread holds the lock on only a single object. (I hope
some pseudo code will do for a start?) Most namespace operations work
simply like a semaphore:

restart:
    lock(dentry);
    if (dentry is busy) {
        unlock(dentry);
        sleep();
        goto restart;
    }
    dentry->state = busy;
    unlock(dentry);

If the operation is finished, the state is reset and everyone sleeping is
woken up. Ok, let's come to the most interesting operation - rename():

restart:
    lock(olddentry);
    if (olddentry is busy) {
        unlock(olddentry);
        sleep();
        goto restart;
    }
    olddentry->state = moving;
    unlock(olddentry);

restart2:
    lock(newdentry);
    if (newdentry->state == moving) {
        lock(renamelock);
        if (olddentry->state == deleted) {
            unlock(renamelock);
            unlock(newdentry);
            sleep();
            goto restart;
        }
        newdentry->state = deleted;
        unlock(renamelock);
    } else if (newdentry is busy) {
        unlock(newdentry);
        sleep();
        goto restart2;
    } else
        newdentry->state = deleted;
    unlock(newdentry);

    if (!rename_valid(olddentry, newdentry)) {
        lock(newdentry);
        newdentry->state = idle;
        unlock(newdentry);
        lock(olddentry);
        olddentry->state = idle;
        unlock(olddentry);
        wakeup_sleepers();
        return;
    }

    if (newdentry exists)
        unlink(newdentry);
    do_rename(olddentry, newdentry);

    lock(newdentry);
    newdentry->state = idle;
    unlock(newdentry);
    lock(olddentry);
    olddentry->state = deleted;
    unlock(olddentry);
    wakeup_sleepers();
    return;

Note that I don't touch any inode here, everything happens in the dcache.
That means I move the complete inode locking into the fs, all I do here is
to make sure, that while operation("foo") is busy, no other operation will
use "foo".
IMO this should work, I tried it with a rename("foo", "bar") and 
rename("bar", "foo"):
case 1: one rename gets both dentries busy, the other rename will wait
till it's finished.
case 2: both mark the old dentry as moving and find the new dentry also
moving. To make the rename atomic the global rename lock is needed, one
rename will find the old dentry isn't moving anymore and has to restart
and wait, the other rename will complete.
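
A minimal single-threaded model of the simpler half of this scheme (case 1:
a dentry that is already in a non-idle state makes every other operation
wait). It is only a sketch under stated assumptions: `struct sketch_dentry`,
`dentry_claim()` and `dentry_release()` are hypothetical names, the per-dentry
lock is omitted, and "sleeping" is represented by returning false and
counting a waiter. The cross-rename case with the global rename lock is not
modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum dstate { D_IDLE, D_BUSY, D_MOVING, D_DELETED };

struct sketch_dentry {
    enum dstate state;
    unsigned waiters;   /* callers that would be sleeping on this dentry */
};

/*
 * Try to move an idle dentry into state `want`. Returns false in the
 * place where the pseudo code would unlock, sleep and restart.
 * A real version would take the dentry lock around the check.
 */
static bool dentry_claim(struct sketch_dentry *d, enum dstate want)
{
    if (d->state != D_IDLE) {
        d->waiters++;
        return false;
    }
    d->state = want;
    return true;
}

/*
 * Finish an operation: reset the state and report how many sleepers
 * have to be woken up so that they can retry.
 */
static unsigned dentry_release(struct sketch_dentry *d)
{
    unsigned woken = d->waiters;

    assert(d->state != D_IDLE);
    d->state = D_IDLE;
    d->waiters = 0;
    return woken;
}
```

Because the state is a value rather than a plain lock bit, a second
operation can see _why_ the dentry is unavailable (busy, moving, deleted)
and react to it, which a semaphore alone cannot express.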

Other operations will keep only one dentry busy, so I don't see a
problem here. If you don't find any major problem here, I'm going to try
this, since if it works, it will have some other advantages:
- a user space fs becomes possible that can't even deadlock the
system. The first restart loop can easily be made interruptible, so it can
be safely killed. (I don't really want to know what a
triple_down_interruptable() looks like, not to mention the other three
locks (+ BKL) taken during a rename.)
- I can imagine better support for hfs. It can access the other fork
without excessive locking (I think currently it doesn't even try to).
The order in which the forks can be created can then change too.

> BTW, I really wonder what kind of locks are you going to have on _blocks_
> (you've mentioned that, unless I've misparsed what you've said). IMO that
> way lies the horror, but hey, code talks.

I thought about catching a bread, but while thinking about it, there
should also be other ways. But that's fs specific, let's concentrate on
the generic part first.

> You claim that it's doable. I seriously doubt it. Nobody knows your ideas
> better than you do, so... come on, demonstrate the patch.

I think the above example should do basically the same as some nothing
doing patch within affs.
I hope that example shows two important ideas (no idea if they will save
the world, but I'm willing to learn):
- I use the dcache instead of the inode to synchronize namespace
operations, which IMO makes quite a lot of sense, since it is our
(cached) representation of the fs.
- Using states instead of a semaphore makes it easily possible to detect
e.g. a rename loop.

bye, Roman


Re: What the Heck? [Fwd: Returned mail: User unknown]

2000-09-05 Thread Roman Zippel

Hi,

On Mon, 4 Sep 2000, Alan Cox wrote:

> Then they need more competant admins. It isnt _hard_ to transproxy outgoing
> smtp traffic via a spamtrapper that checks for valid src/destination and
> headers.

You're getting into dangerous territory here. If you start arguing like
this, how do you explain to a politician the difference between this and
content based filtering?

bye, Roman


Re: PROBLEM: mounting affs over loop hangs in syscall (x86 only?)

2000-12-20 Thread Roman Zippel

Hi,

On Mon, 18 Dec 2000, Bernardo Innocenti wrote:

> [1.] One line summary of the problem:
> mounting affs over loop hangs in syscall (x86 only?)

affs plays some games with the superblock lock. I have a patch that plays
even worse games, but it works. I hope to finish a major cleanup of affs
over Christmas.

bye, Roman


Re: [RFC] Generic deferred file writing

2000-12-30 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.  
> 
> Yes, I was thinking at this callback too. Such a callback is nearly the only
> support we need from the filesystem to provide allocate on flush.

Actually the getblock function could be split into 3 functions:
- alloc_block: mostly just decrementing a counter (and quota)
- get_block: allocating a block from the bitmap
- commit_block: inserting the new block into the inode

This would be really useful for streaming: one could get the block number
as fast as possible and the data could be written very quickly, while
keeping the cache usage low. Streaming directly from a device to disk
also wants to get rid of the data as fast as possible.
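
The three-way split could be sketched roughly like this. The function names
alloc_block, get_block and commit_block come from the list above; everything
else (the signatures, `struct sketch_fs`, `struct sketch_inode`, the toy
bitmap) is an assumption made only for illustration and does not match any
real kernel interface.

```c
#include <assert.h>

#define NBLOCKS          16
#define MAX_FILE_BLOCKS  8

struct sketch_fs {
    unsigned free_count;            /* blocks not yet reserved by anyone */
    unsigned char bitmap[NBLOCKS];  /* 1 = block is in use */
};

struct sketch_inode {
    int blocks[MAX_FILE_BLOCKS];    /* blocks that belong to the file */
    unsigned nblocks;
    unsigned reserved;              /* reserved, but no block chosen yet */
};

/* Step 1: only reserve space (decrement a counter), so the write is
 * guaranteed to succeed later. Quota would be charged here as well. */
static int alloc_block(struct sketch_fs *fs, struct sketch_inode *inode)
{
    if (fs->free_count == 0)
        return -1;
    fs->free_count--;
    inode->reserved++;
    return 0;
}

/* Step 2: turn one reservation into a concrete block from the bitmap.
 * Returns the block number, or -1 without a reservation or free bit. */
static int get_block(struct sketch_fs *fs, struct sketch_inode *inode)
{
    int i;

    if (inode->reserved == 0)
        return -1;
    for (i = 0; i < NBLOCKS; i++) {
        if (!fs->bitmap[i]) {
            fs->bitmap[i] = 1;
            inode->reserved--;
            return i;
        }
    }
    return -1;
}

/* Step 3: insert the block into the inode, e.g. after the data is written. */
static int commit_block(struct sketch_inode *inode, int block)
{
    assert(block >= 0);
    if (inode->nblocks == MAX_FILE_BLOCKS)
        return -1;
    inode->blocks[inode->nblocks++] = block;
    return 0;
}
```

The point of the split is that step 1 is cheap enough to run at write time,
while steps 2 and 3 can be delayed until the data is actually flushed.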

bye, Roman


Re: [RFC] Generic deferred file writing

2000-12-31 Thread Roman Zippel

Hi,

On Sat, 30 Dec 2000, Linus Torvalds wrote:

> In fact, in a properly designed filesystem just a bit of low-level caching
> would easily make the average "get_block()" be very fast indeed. The fact
> that right now ext2 has not been optimized for this is _not_ a reason to
> design the VFS layer around a slow get_block() operation.
> [..]
> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..
> 
> I really think that the #2 thing is the more interesting one, and that
> anybody looking at ext2 should look at just improving the locking and
> making the block allocation functions run faster. Which should not be all
> that difficult - the last time I looked at the thing it was quite
> horrible.

What makes the get_block business complicated now is that it can be called
recursively: get_block needs to allocate something, which might start new
i/o, which again calls get_block.
Writing dirty pages should be a really asynchronous process, but it isn't
right now, as get_block is synchronous. Making get_block asynchronous is
almost impossible, so one usually does it in a separate thread.
So IMO something like this should happen: dirty pages should be put on a
separate list, and a thread takes these pages, allocates the buffers for
them and starts the i/o. This has another advantage: get_block wouldn't
really need to do preallocation anymore; the get_block function could work
on a number of pages instead (preallocation would instead happen in the
page cache).
This could make the get_block function and the needed locking very simple,
e.g. one could use a simple semaphore instead of kernel_lock to protect
getting multiple blocks instead of only one. Also, splitting it into
several tasks can make it faster: in one step we just do the resource
allocation to guarantee the write, and in a separate step we do the real
allocation. If this is done for several pages at once, it can be very
fast and simple.

bye, Roman


Re: [RFC] Generic deferred file writing

2000-12-31 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> Let me repeat myself one more time:
> 
>  I do not believe that "get_block()" is as big of a problem as people make
>  it out to be.

The real problem is that get_block() doesn't scale, and making it scale is
very hard. A recursive per-inode semaphore might help, but it's still a
pain to get it right.

bye, Roman


Re: [RFC] Generic deferred file writing

2000-12-31 Thread Roman Zippel


Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

>     cached_allocation = NULL;
> 
> repeat:
>     spin_lock();
>     result = try_to_find_existing();
>     if (!result) {
>         if (!cached_allocation) {
>             spin_unlock();
>             cached_allocation = allocate_block();
>             goto repeat;
>         }
>         result = cached_allocation;
>         add_to_datastructures(result);
>     }
>     spin_unlock();
>     return result;
> 
> This is quite standard, and Linux does it in many places. It doesn't have
> to be complex or ugly.

No problem with that.
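
For reference, the quoted pattern (allocate outside the lock, then retry
the lookup) could look like this in userspace C. This is a sketch under
assumptions: a pthread mutex stands in for the spinlock, malloc() for
allocate_block(), and the names entry, table and lookup_or_create are made
up. Freeing an unused allocation after losing the race is an addition that
the quoted pseudocode does not show.

```c
#include <pthread.h>
#include <stdlib.h>

struct entry { long key; struct entry *next; };

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *table;             /* protected by table_lock */

static struct entry *find_locked(long key)
{
    for (struct entry *e = table; e; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}

struct entry *lookup_or_create(long key)
{
    struct entry *cached = NULL, *result;

repeat:
    pthread_mutex_lock(&table_lock);
    result = find_locked(key);
    if (!result) {
        if (!cached) {
            /* Drop the lock before the allocation, which may block. */
            pthread_mutex_unlock(&table_lock);
            cached = malloc(sizeof(*cached));
            if (!cached)
                return NULL;
            cached->key = key;
            goto repeat;                /* somebody may have raced us */
        }
        result = cached;
        cached = NULL;
        result->next = table;
        table = result;
    }
    pthread_mutex_unlock(&table_lock);
    free(cached);                       /* non-NULL only if we lost the race */
    return result;
}
```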

> Also, I don't see why you claim the current get_block() is recursive and
> hard to use: it obviously isn't. If you look at the current ext2
> get_block(), the way it protects most of its data structures is by the
> super-block-global lock. That wouldn't work if your claims of recursive
> invocation were true. 

I just rechecked that, but I don't see any superblock lock here; it uses
the kernel_lock instead. Al could give the definitive answer on this,
though, since he wrote it. :)

> The way the Linux MM works, if the lower levels need to do buffer
> allocations, they will use GFP_BUFFER (which "bread()" does internally),
> which will mean that the VM layer will _not_ call down recursively to
> write stuff out while it's already trying to write something else. This is
> exactly so that filesystems don't have to release and re-try if they don't
> want to.
> 
> In short, I don't see any of your arguments.

Then I must have misunderstood Al. Al?
If you were right here, I would see absolutely no reason for the current
complexity. (Me is a bit confused here.)

bye, Roman


Re: [RFC] Generic deferred file writing

2001-01-01 Thread Roman Zippel

Hi,

On Sun, 31 Dec 2000, Alexander Viro wrote:

> Reread the original thread. GFP_BUFFER protects us from buffer cache
> allocations triggering pageouts. It has nothing to do with the deadlock scenario
> that would come from grabbing ->i_sem on pageout.

I don't want to grab i_sem. It was a very, very early idea... :)

> Sheesh... "Complexity" of ext2_get_block() (down to the ext2_new_block()
> calls) is really, really not a problem. Normally it just gives you the
> straightforward path. All unrolls are for contention cases and they
> are precisely what we have to do there.

Maybe complexity is the wrong word; of course the logic in there is
straightforward (once one has understood it :) ).
Let me ask it differently, and now it's only about indirect block handling:
Is it possible to use a per-inode-indirect-block-semaphore?
The reason for the question is that I may be seeing a different sort of
contention here: livelocks. I don't mind getting resources and then
rechecking whether everything went well. The problem is how many resources
you need to get (and to release, if something failed). There is always a
point somewhere where two threads can't make any progress or one thread can
stall the progress of a second.
To get back to ext2_get_block: IMO such a scenario could happen in the
double or triple indirect block case, when two or more threads try to
allocate/truncate a block here. Maybe my concerns are baseless, but I'd
just like to know that there isn't a possibility for a DoS attack here.
(BTW that's what I mean by complexity; it's less the logical complexity
and more the "runtime complexity".)

The other reason for the question is that I'm currently reworking the block
handling in affs, especially the extended block handling, where I'm
implementing a new extended block cache that I would very much prefer
to protect with a semaphore. I could probably do it without the semaphore
and use a spinlock plus rechecking, but the semaphore would keep it so much
simpler. (I can post more details about this part on fsdevel if needed /
wanted.)

bye, Roman


Re: [RFC] Generic deferred file writing

2001-01-01 Thread Roman Zippel

Hi,

On Mon, 1 Jan 2001, Alexander Viro wrote:

> But... But with AFFS you _have_ exclusion between block-allocation and
> truncate(). It has no sparse files, so pageout will never allocate
> anything. I.e. all allocations come from write(2). And both write(2) and
> truncate(2) hold i_sem.
> 
> Problem with AFFS is on the directory side of that business and there it's
> really scary. Block allocation is trivial...

Block allocation is not my problem right now (and even directory handling
is not that difficult), but I will post something about this on fsdevel
later.
But one question is still open that I'd really like an answer to:
Is it possible to use a per-inode-indirect-block-semaphore?

bye, Roman


Re: [RFC] Generic deferred file writing

2001-01-02 Thread Roman Zippel

Hi,

On Tue, 2 Jan 2001, Alexander Viro wrote:

> Depends on a filesystem. Generally you don't want asynchronous operations
> to grab semaphores shared with something else. kswapd knows to skip the locked
> pages, but that's it - if writepage() is going to block on a semaphore you
> will not know what had hit you. And while buffer-cache operations will not
> trigger writepage() (grep for GFP_BUFFER and GFP_IO and you'll see) you have
> no such warranties for other sources of memory pressure. If one of them
> hits while you are holding such semaphore - you are toast.

I just checked that and you're right, sorry for causing confusion and
thanks for clearing this up.

> We probably could pull it off for ext2_truncate() vs. ext2_get_block()
> but it would not do us any good. It would give excessive exclusion for
> operations that can be done in parallel just fine (example: we have
> a hole from 100Kb to 200Kb. Pageouts in that area can be trivially
> done in parallel - current code will not even try to do unrolls. With
> your locking they will be serialized for no good reason). What for?

Let me come back to the three phases I mentioned earlier:
alloc_block: does only a read-only check whether a block needs to be
allocated or not, this can be done in parallel and only needs the page
lock.
get_block: blocks are now really allocated and this needs locking of the
bitmap.
commit_block: write the allocated blocks to the inode and this now would
use an inode specific semaphore to protect the updates of indirect blocks.

The only problem I see is truncate, but if we move the release of unneeded
indirect blocks to file_close, only new indirect blocks can appear while
the file is open, but they won't change anymore, which would make lots of
the checks easier.
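
A toy userspace model of the three phases might look as follows. The
function names follow the mail, but the bodies, the toy_fs/toy_inode types
and the recheck in commit_block() are assumptions made for illustration.
The locks are only indicated in comments.

```c
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 16
#define NSLOTS  8

struct toy_fs    { bool bitmap[NBLOCKS]; };   /* true = block in use */
struct toy_inode { long slot[NSLOTS]; };      /* 0 = hole, block 0 is never handed out */

/* Phase 1: read-only check, only the page lock would be needed. */
static bool alloc_block(const struct toy_inode *ino, size_t idx)
{
    return ino->slot[idx] == 0;
}

/* Phase 2: really allocate a block; the bitmap lock would be held here. */
static long get_block(struct toy_fs *fs)
{
    for (long b = 1; b < NBLOCKS; b++) {
        if (!fs->bitmap[b]) {
            fs->bitmap[b] = true;
            return b;
        }
    }
    return -1;                                /* no free block */
}

/* Phase 3: publish the block in the inode; a per-inode semaphore would be
 * held here. If somebody else filled the slot first, give our block back. */
static long commit_block(struct toy_fs *fs, struct toy_inode *ino,
                         size_t idx, long blk)
{
    if (ino->slot[idx] != 0) {
        fs->bitmap[blk] = false;
        return ino->slot[idx];
    }
    ino->slot[idx] = blk;
    return blk;
}
```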

bye, Roman


Re: [PATCH] move xchg/cmpxchg to atomic.h

2001-01-02 Thread Roman Zippel

Hi,

On Tue, 2 Jan 2001, David S. Miller wrote:

>We really can't.  We _only_ have load-and-zero.  And it has to be
>16-byte aligned.  xchg() is just not something the CPU implements.
> 
> Oh bugger... you do have real problems.

For 2.5 we could move all the atomic functions from atomic.h, bitops.h,
system.h and give them a common interface. We could also give them a new
argument atomic_spinlock_t, which is a normal spinlock, but only used on
architectures which need it, everyone else can "optimize" it away. I think
one such lock per major subsystem should be enough, as the lock is only
held for a very short time, so contention should be no problem.
Anyway, this would have the huge advantage that we could use the complete
32/64 bits of the atomic value, e.g. for pointer operations.

bye, Roman


Re: Meaning of blk_size

2000-10-02 Thread Roman Zippel

Hi,

On Mon, 2 Oct 2000, Andries Brouwer wrote:

> These days I have as background activity the construction
> of the corresponding patch for 2.4. Maybe we can start 2.5
> without these arrays and with large device numbers.

I started something like this a few months ago; I got to the point of
booting a usermode kernel up to the fsck, which failed. Currently I have no
time to continue it, as there is more important work pending.
But I didn't create a generic kdev_t; I changed the block device part to
use a bdev_t. I also started a few cleanups that make e.g. the partition
stuff a bit easier. On the other hand, it of course breaks every block
device driver.

bye, Roman


Re: SA_INTERRUPT

2000-10-02 Thread Roman Zippel


Hi,

On Sun, 1 Oct 2000, Andrea Arcangeli wrote:

> Comments?

When that is done, please don't call __sti() directly and use some macro
that can be overridden by the architectures.

bye, Roman


Re: SA_INTERRUPT

2000-10-02 Thread Roman Zippel

Hi,

On Mon, 2 Oct 2000, Andrea Arcangeli wrote:

> > When that is done, please don't call __sti() directly and use some macro
> > that can be overridden by the architectures.
> 
> What do you have in mind while making this suggestion? The irq highlevel layer
> is pretty much architecture independent. Just run a diff between the irq.c in
> the IA32 and alpha ports. Also what about the drivers that are just using
> __sti() at the start of the irq handler right now?

m68k uses interrupt levels, so an interrupt with a higher priority can
interrupt another interrupt with a lower priority. To make things more
interesting, several m68k machines don't have a separate external interrupt
controller, so they rely on the interrupt level not being lowered during
an interrupt (the ide driver has an ide__sti() because of this).
I can imagine that newer low-end embedded targets have similar problems
(on the other end, I'm just looking at the s390 interrupt code, which also
looks interesting).

bye, Roman


[PATCH] initdata and bss

2000-10-05 Thread Roman Zippel


Hi,

A few bss changes (to remove zero initialization) in test9 were not
completely correct. Init data must be initialized if you want it to end up
in the init section (this is also mentioned in the gcc documentation).
The following patch fixes what I was able to find with grep and also adds
a note about it in init.h.

bye, Roman

diff -ur linux-2.4.org/arch/arm/mm/init.c linux-2.4-initdata/arch/arm/mm/init.c
--- linux-2.4.org/arch/arm/mm/init.c   Thu Oct  5 19:35:19 2000
+++ linux-2.4-initdata/arch/arm/mm/init.c   Thu Oct  5 20:22:00 2000
@@ -56,7 +56,7 @@
  * The sole use of this is to pass memory configuration
  * data from paging_init to mem_init.
  */
-static struct meminfo __initdata meminfo;
+static struct meminfo meminfo __initdata = { 0, };
 
 /*
  * empty_bad_page is the page that is used for page faults when
diff -ur linux-2.4.org/arch/ia64/kernel/smp.c linux-2.4-initdata/arch/ia64/kernel/smp.c
--- linux-2.4.org/arch/ia64/kernel/smp.c   Thu Aug 24 19:30:52 2000
+++ linux-2.4-initdata/arch/ia64/kernel/smp.c   Thu Oct  5 20:23:31 2000
@@ -49,8 +49,8 @@
 
 spinlock_t kernel_flag = SPIN_LOCK_UNLOCKED;
 
-struct smp_boot_data __initdata smp;
-char __initdata no_int_routing = 0;
+struct smp_boot_data smp __initdata = { 0, };
+char no_int_routing __initdata = 0;
 
 unsigned char smp_int_redirect; /* are INT and IPI redirectable by the chipset? */
 volatile int __cpu_number_map[NR_CPUS] = { -1, }; /* SAPIC ID -> Logical ID */
diff -ur linux-2.4.org/arch/m68k/kernel/setup.c linux-2.4-initdata/arch/m68k/kernel/setup.c
--- linux-2.4.org/arch/m68k/kernel/setup.c  Wed Jul  5 01:04:12 2000
+++ linux-2.4-initdata/arch/m68k/kernel/setup.c Thu Oct  5 20:20:01 2000
@@ -68,13 +68,13 @@
 
 char m68k_debug_device[6] = "";
 
-void (*mach_sched_init) (void (*handler)(int, void *, struct pt_regs *)) __initdata;
+void (*mach_sched_init) (void (*handler)(int, void *, struct pt_regs *)) __initdata = 
+NULL;
 /* machine dependent keyboard functions */
-int (*mach_keyb_init) (void) __initdata;
+int (*mach_keyb_init) (void) __initdata = NULL;
 int (*mach_kbdrate) (struct kbd_repeat *) = NULL;
 void (*mach_kbd_leds) (unsigned int) = NULL;
 /* machine dependent irq functions */
-void (*mach_init_IRQ) (void) __initdata;
+void (*mach_init_IRQ) (void) __initdata = NULL;
 void (*(*mach_default_handler)[]) (int, void *, struct pt_regs *) = NULL;
 void (*mach_get_model) (char *model) = NULL;
 int (*mach_get_hardware_list) (char *buffer) = NULL;
diff -ur linux-2.4.org/arch/ppc/kernel/apus_setup.c linux-2.4-initdata/arch/ppc/kernel/apus_setup.c
--- linux-2.4.org/arch/ppc/kernel/apus_setup.c  Thu Oct  5 19:35:21 2000
+++ linux-2.4-initdata/arch/ppc/kernel/apus_setup.c Thu Oct  5 20:19:41 2000
@@ -82,13 +82,13 @@
 
 extern void amiga_init_IRQ(void);
 
-void (*mach_sched_init) (void (*handler)(int, void *, struct pt_regs *)) __initdata;
+void (*mach_sched_init) (void (*handler)(int, void *, struct pt_regs *)) __initdata = 
+NULL;
 /* machine dependent keyboard functions */
-int (*mach_keyb_init) (void) __initdata;
+int (*mach_keyb_init) (void) __initdata = NULL;
 int (*mach_kbdrate) (struct kbd_repeat *) __apusdata = NULL;
 void (*mach_kbd_leds) (unsigned int) __apusdata = NULL;
 /* machine dependent irq functions */
-void (*mach_init_IRQ) (void) __initdata;
+void (*mach_init_IRQ) (void) __initdata = NULL;
 void (*(*mach_default_handler)[]) (int, void *, struct pt_regs *) __apusdata = NULL;
 void (*mach_get_model) (char *model) __apusdata = NULL;
 int (*mach_get_hardware_list) (char *buffer) __apusdata = NULL;
diff -ur linux-2.4.org/arch/ppc/kernel/prep_setup.c linux-2.4-initdata/arch/ppc/kernel/prep_setup.c
--- linux-2.4.org/arch/ppc/kernel/prep_setup.c  Thu Oct  5 19:35:22 2000
+++ linux-2.4-initdata/arch/ppc/kernel/prep_setup.c Thu Oct  5 20:18:55 2000
@@ -384,8 +384,8 @@
  * 2 following ones measure the interval. The precision of the method
  * is still doubtful due to the short interval sampled.
  */
-static __initdata volatile int calibrate_steps = 3;
-static __initdata unsigned tbstamp;
+static volatile int calibrate_steps __initdata = 3;
+static unsigned tbstamp __initdata = 0;
 
 void __init
 prep_calibrate_decr_handler(int irq,
diff -ur linux-2.4.org/drivers/block/xd.c linux-2.4-initdata/drivers/block/xd.c
--- linux-2.4.org/drivers/block/xd.c   Thu Oct  5 19:35:27 2000
+++ linux-2.4-initdata/drivers/block/xd.c   Thu Oct  5 20:14:58 2000
@@ -142,9 +142,9 @@
 static DECLARE_WAIT_QUEUE_HEAD(xd_wait_open);
 static u_char xd_valid[XD_MAXDRIVES] = { 0,0 };
 static u_char xd_drives, xd_irq = 5, xd_dma = 3, xd_maxsectors;
-static u_char xd_override __initdata, xd_type __initdata;
+static u_char xd_override __initdata = 0, xd_type __initdata = 0;
 static u_short xd_iobase = 0x320;
-static int xd_geo[XD_MAXDRIVES*3] __initdata;
+static int xd_geo[XD_MAXDRIVES*3] __initdata = { 0, };
 
 static volatile int xdc_busy;
 static DECLARE_WAIT_QUEUE_HEAD(xdc_wait);
diff -ur linux-2.

Re: Calling current() from interrupt context

2000-10-09 Thread Roman Zippel


Hi,

> The m68k port which has a interrupt stack solves the problem by 
> loading current into a global register variable on all kernel entries.

Not all m68k cpus have an interrupt stack and it can be turned off, so we
don't use it.

bye, Roman


Re: [OT] linux article with kernel references

2000-10-10 Thread Roman Zippel

Hi,

On Wed, 11 Oct 2000, Alan Cox wrote:

> > http://www.osopinion.com/Opinions/MontyManley/MontyManley15.html
> > 
> > good article, several unfortunate truths within.
> 
> Really, must be a wrong URL you posted then 8)
> 
> The average Linux kernel hacker right now is late 20's to early 30's with
> a degree and working professionally on the kernel

The article isn't that wrong; that several people now get paid for hacking
doesn't change much. It's more about proper software engineering: some
people call it "a matter of taste" and are born with it, but for other
people it's a hard learning process.

bye, Roman


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Roman Zippel


Hi,

> How?  If you compile with egcs-2.91.66 without frame pointers on ix86 then
> __builtin_return_address() yields garbage.  Does anybody have a generic
> solution to this problem, other than "compile with frame pointers"?  Or is
> it fixed in newer versions of gcc?

Are you sure? I just tried it with 2.91.66 and it works. With
-fomit-frame-pointer only __builtin_return_address(0) works, but that is
true for any version.
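
A minimal example of the level-0 case, assuming GCC or a compatible
compiler. caller_address() is a made-up helper name, and noinline is used
so that the function really has a caller to return to:

```c
/* Returns the address the current function will return to.
 * Level 0 does not need a frame pointer; higher levels walk the
 * frame chain, which is why they are the unreliable ones. */
__attribute__((noinline)) static void *caller_address(void)
{
    return __builtin_return_address(0);
}
```

Calling caller_address() from some function yields an address inside that
function, which can then be printed with "%p" or resolved against the
symbol table.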

bye, Roman


Re: Updated Linux 2.4 Status/TODO List (from the ALS show)

2000-10-14 Thread Roman Zippel

Hi,

On Fri, 13 Oct 2000, Richard Henderson wrote:

> Either that or adjust how we do atomic operations.  I can do
> 64-bit atomic widgetry, but not with the code as written.

It's probably more something for 2.5, but what about adding a lock
argument to the atomic operations, then sparc could use that explicit lock
and everyone else simply optimizes that away. That would allow us to use
the full 32/64 bit. What we could get is a nice generic atomic exchange
command like:

atomic_exchange(lock, ptr, old, new);

Where new can be a (simple) expression which can include old. Especially
on RISC systems every atomic operation in atomic.h can be replaced with
this. Or, if you need more flexibility, the same can be written as:

atomic_enter(lock);
__atomic_init(ptr, old);
do {
	__atomic_reserve(ptr, old);
} while (!__atomic_update(ptr, old, new));
atomic_leave(lock);

atomic_enter/atomic_leave are either normal spinlocks or (in most cases)
dummies. The other macros use either RMW instructions or special
load/store instructions.
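
As a userspace illustration only (this is not the code from the tarball
mentioned below), the single-macro form could be approximated like this.
The sketch assumes the GCC/Clang __atomic builtins and a pthread mutex as
the lock. locked_atomic_exchange, atomic_lock_t and ARCH_NEEDS_ATOMIC_LOCK
are hypothetical names.

```c
#include <pthread.h>

typedef pthread_mutex_t atomic_lock_t;    /* stand-in for a spinlock */

#ifdef ARCH_NEEDS_ATOMIC_LOCK
/* No usable RMW instruction: the explicit lock protects the update. */
#define locked_atomic_exchange(lock, ptr, old, new_expr)                 \
    do {                                                                 \
        pthread_mutex_lock(lock);                                        \
        (old) = *(ptr);                                                  \
        *(ptr) = (new_expr);                                             \
        pthread_mutex_unlock(lock);                                      \
    } while (0)
#else
/* The lock argument is optimized away; a compare-and-swap loop retries
 * until *ptr still held the value that new_expr was computed from. */
#define locked_atomic_exchange(lock, ptr, old, new_expr)                 \
    do {                                                                 \
        (void)(lock);                                                    \
        (old) = __atomic_load_n((ptr), __ATOMIC_RELAXED);                \
        while (!__atomic_compare_exchange_n((ptr), &(old), (new_expr),   \
                                            0, __ATOMIC_SEQ_CST,         \
                                            __ATOMIC_RELAXED))           \
            ;                                                            \
    } while (0)
#endif
```

In both variants old ends up holding the value that was replaced, and
new_expr may refer to old, e.g. locked_atomic_exchange(&l, &v, old, old + 5).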

Using a lock makes it a bit more difficult to use and especially the last
construction must never be required in normal drivers. On the other hand
it gets way more flexible as we are not limited to a single atomic_t
anymore. If anyone is interested in how it could look, I've put an
example at http://zeus.fh-brandenburg.de/~zippel/linux/bin/atomic.tar.gz
(It also includes a bit more documentation and some (a bit outdated)
examples). Somewhere I also have a patch where I use this to write a
spinlock-free printk implementation, which is still interrupt and SMP
safe.
There are still some open issues (like ordering), but I'd like to know if
there is general interest in this.

bye, Roman


Re: rotr32 / rotl32 (wordops.h) in 2.4.x ?

2000-10-15 Thread Roman Zippel

Hi,

On Sun, 15 Oct 2000, Andi Kleen wrote:

> You can just use the coded out variant ((x<<n)|(x>>(sizeof(x)*8-n)))
> gcc is clever enough to turn it into an rotate when the CPU supports it.

Hmm, I just tried it, and there are two things one should take care of
here: 1. x must be unsigned, of course, and 2. only gcc 2.95 can do this
with a nonconstant n.
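
Written out for 32-bit values, the idiom looks like this. The function
names are taken from the subject line. Masking the shift counts is an
addition here, so that n == 0 or n >= 32 never produces an undefined shift.

```c
#include <stdint.h>

/* Rotate left; x is unsigned, so the right shift fills with zeros. */
static inline uint32_t rotl32(uint32_t x, unsigned int n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Rotate right, the mirror image of rotl32(). */
static inline uint32_t rotr32(uint32_t x, unsigned int n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}
```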

bye, Roman


BLKSSZGET change will break fdisk

2000-10-16 Thread Roman Zippel


Hi,

I noticed that the behaviour of BLKSSZGET changed between 2.2 and 2.4. One
of the users will be fdisk, as soon as it is compiled with 2.4 kernel
headers, but then fdisk will no longer be usable under 2.2!
My question now is: wouldn't it be better to use a new ioctl (like
BLKHSSZGET) and keep the old behaviour of BLKSSZGET?

bye, Roman


Re: BLKSSZGET change will break fdisk

2000-10-16 Thread Roman Zippel


Hi,

> Concerning fdisk, luckily you are mistaken - its source says
> 
> #if defined(BLKSSZGET) && defined(HAVE_blkpg_h)
> 
> so that it will not use the broken BLKSSZGET of 2.2.

??? BLKSSZGET has exactly the same ioctl number in 2.2 and 2.4, so if I
compile fdisk under 2.4 and try to use it under 2.2, it will break. I saw
the above test, but that is a compile time check not a run time check. Am
I missing something here?
BTW the problem is a bit bigger: I tried to partition a 4GB MO disk, which
breaks horribly with current fdisk under 2.2. It results in an odd
partition offset and you can't access any partition. So I need a fixed
fdisk. As a quick hack I simply reused BLKSSZGET, but now I have an fdisk
that only works with a fixed kernel.

> [now that you make me look at this, there is a flaw in fdisk there;
> fixed in 2.10p]

BLKSSZGET isn't defined for fdisk.c? :)
BTW sfdisk isn't fixed at all for different sector sizes.

bye, Roman





Re: BLKSSZGET change will break fdisk

2000-10-16 Thread Roman Zippel


Hi,

> - BLKSSZGET added in common.h

Why don't we give BLKSSZGET a new number and make the old one obsolete? I
don't think it's used anywhere, as its result is pretty useless in
userspace (and even if it's used somewhere, they have to copy the define
anyway). This way we don't need that version check.

bye, Roman



Re: BLKSSZGET change will break fdisk

2000-10-17 Thread Roman Zippel

Hi,

On Tue, 17 Oct 2000, Andries Brouwer wrote:

> But you see that one would need a new name as well,
> otherwise the value associated with BLKSSZGET would
> depend on the kernel version, and one would need
> version checks anyway.

We do rename structures too, and this would be similar. I'm more
concerned about binary compatibility. If anyone uses BLKSSZGET (for
whatever reason), its behaviour should not suddenly change.
Why should one need version checks? If someone really needs that ioctl, he
has to copy that define anyway, so his copy will still have the old value
(and behaviour).

> I think all this is too unimportant. Almost nobody uses
> this stuff, and 2.4 is correct, and we may fix 2.2 one
> of these days, perhaps for 2.2.19, and we may fix *fdisk
> to correctly use it.

I also want to patch 2.2.17 or earlier kernels, though, and I don't want an
fdisk that breaks on these kernels.

> (By the way, have you checked that replacing get_sectorsize
> by an empty routine, and specifying a -b option, works well?)

No, not yet.

> (Do you know which disks have unusual sector size?
> So far I had only seen reports on a Fujitsu 640 MB.
> Have you seen other sectorsizes than 512, 1024, 2048
> on non-IBM disks?)

No, I didn't see anything else.

bye, Roman


Re: 2.4.2 ext2 filesystem corruption ? (was 2.4.2: What happened?(No

2001-03-08 Thread Roman Zippel

Hi,

On Thu, 8 Mar 2001, God wrote:

> Look at some of the confirmation requests in windows, some ask you twice
> if you wish to perform an action.  Even Red Hat (that I know of, others
> may as well), has an alias for "rm" that by
> default turns on confirmation.  Why?  Because not ALL users will know
> better.  Sure there are warnings that you can put in a man page somewhere,
> but the truth is few users are actually going to READ the page.  Is it
> their fault?  Yes.  But should it be so easy to lose their data over
> it rather than writing code to detect if said feature will work or
> not? ...  

This is getting off topic; it has nothing to do with the kernel. You are
free to do whatever you want in userspace, if you have the right
capabilities. You're also free to write your own userspace tools which
protect the user from any danger, but that belongs in userspace, not in
the kernel. So please go to the KDE/Gnome/... guys and whine there.

bye, Roman


Re: ioremap_nocache problem?

2001-01-23 Thread Roman Zippel


Hi,

On Tue, 23 Jan 2001, Mark Mokryn wrote:

> ioremap_nocache does the following:
>   return __ioremap(offset, size, _PAGE_PCD);
> 
> However, in drivers/char/mem.c (2.4.0), we see the following:
> 
>   /* On PPro and successors, PCD alone doesn't always mean 
>   uncached because of interactions with the MTRRs. PCD | PWT
>   means definitely uncached. */ 
>   if (boot_cpu_data.x86 > 3)
>   prot |= _PAGE_PCD | _PAGE_PWT;
> 
> Does this mean ioremap_nocache() may not do the job?

ioremap creates a new mapping that shouldn't interfere with MTRR, whereas
you can map a MTRR mapped area into userspace. But I'm not sure if it's
correct that no flag is set for boot_cpu_data.x86 <= 3...

bye, Roman


Re: ioremap_nocache problem?

2001-01-25 Thread Roman Zippel

Hi,

Timur Tabi wrote:

> I mark the page as reserved when I ioremap() it.  However, if I leave it marked
> reserved, then iounmap() will not unmap it.  If I mark it "unreserved" (i.e.
> reset the reserved bit), then iounmap will unmap it, but it will decrement the
> page counter to -1 and the whole system will crash soon thereafter.
> 
> I've been asking about this problem for months, but no one has bothered to help
> me out.

The order is important:

get_free_page();
set_bit(PG_reserved, &page->flags);
ioremap();
...
iounmap();
clear_bit(PG_reserved, &page->flags);
free_page();

Alternatively something like this should also be possible:

get_free_page();
ioremap();
...
iounmap();

nopage() {
...
atomic_inc(&page->count);
return page;
}

But I never tried this version, so I can't guarantee anything. :)

bye, Roman

Re: [ANNOUNCE] Kernel Janitor's TODO list

2001-01-28 Thread Roman Zippel


Hi,

On Sun, 28 Jan 2001, Manfred Spraul wrote:

> And one more point for the Janitor's list:
> Get rid of superflous irqsave()/irqrestore()'s - in 90% of the cases
> either spin_lock_irq() or spin_lock() is sufficient. That's both faster
> and better readable.
> 
> spin_lock_irq(): you know that the function is called with enabled
> interrupts.
> spin_lock(): can be used in hardware interrupt handlers when only one
> hardware interrupt uses that spinlocks (most hardware drivers), or when
> all hardware interrupt handler set the SA_INTERRUPT flag (e.g. rtc and
> timer interrupt)

This is not a bug and only helps to make drivers nonportable. Please,
don't do this.

bye, Roman


Re: [ANNOUNCE] Kernel Janitor's TODO list

2001-01-29 Thread Roman Zippel


Hi,

On Mon, 29 Jan 2001, Andi Kleen wrote:

> You can miss wakeups. The standard pattern is:
> 
>   get locks
> 
>   add_wait_queue(&waitqueue, &wait);
>   for (;;) { 
>   if (condition you're waiting for is true) 
>   break; 
>   unlock any non sleeping locks you need for condition
>   __set_task_state(current, TASK_UNINTERRUPTIBLE); 
>   schedule(); 
>   __set_task_state(current, TASK_RUNNING); 
>   reaquire locks
>   }
>   remove_wait_queue(&waitqueue, &wait); 

You still miss wakeups. :)
Always set the task state first, then check the condition. See the
wait_event*() macros you mentioned for the right order.
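
Why the order matters can be shown with a deterministic toy model in
userspace C. It is not kernel code: toy_task, toy_wake_up() and
toy_schedule_would_sleep() are hypothetical stand-ins, and the waker is
simply called at the point where the race window would be.

```c
#include <stdbool.h>

enum toy_state { TOY_RUNNING, TOY_SLEEPING };

struct toy_task { enum toy_state state; bool condition; };

/* The waker makes the condition true and marks the task runnable. */
static void toy_wake_up(struct toy_task *t)
{
    t->condition = true;
    t->state = TOY_RUNNING;
}

/* schedule() only puts the task to sleep if it is not marked runnable. */
static bool toy_schedule_would_sleep(const struct toy_task *t)
{
    return t->state != TOY_RUNNING;
}

/* Wrong order: check the condition, then set the state. A wakeup that
 * arrives in between is overwritten. Returns true if the task sleeps
 * although it has already been woken. */
static bool check_then_set_state(struct toy_task *t)
{
    bool seen = t->condition;
    toy_wake_up(t);                 /* wakeup hits the race window */
    t->state = TOY_SLEEPING;
    return !seen && toy_schedule_would_sleep(t);
}

/* Right order: set the state first, then check. The same wakeup now
 * resets the state, so schedule() returns immediately. */
static bool set_state_then_check(struct toy_task *t)
{
    t->state = TOY_SLEEPING;
    bool seen = t->condition;
    toy_wake_up(t);                 /* same wakeup, same place */
    return !seen && toy_schedule_would_sleep(t);
}
```

In this model the first variant reports a lost wakeup and the second one
does not, although the wakeup arrives at the same point in both.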

bye, Roman


Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-05 Thread Roman Zippel

Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> This all proves that the lowest level of layering should be pretty much
> nothing but the vectors. No callbacks, no crap like that. That's already a
> level of abstraction away, and should not get tacked on. Your lowest level
> of abstraction should be just the "area". Something like
> 
>   struct buffer {
>   struct page *page;
>   u16 offset, length;
>   };
> 
>   int nr_buffers;
>   struct buffer *array;
> 
> should be the low-level abstraction. 

Does it have to be vectors? What about lists? I've been thinking about this
for some time now and I think lists are more flexible. At a higher level we
can easily generate a list of pages, and at a lower level you can still
split them up as needed. It would be basically the same structure, but you
could use it everywhere with the same kind of operations.
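
A rough userspace sketch of what such a list element and a split operation
could look like. buf_seg, buf_seg_split() and buf_list_bytes() are made-up
names, and a void pointer stands in for struct page *:

```c
#include <stdlib.h>

/* Same fields as the quoted struct buffer, plus a list link. */
struct buf_seg {
    void *page;
    unsigned short offset, length;
    struct buf_seg *next;
};

/* Split seg after `at` bytes. The new second half is linked in right
 * behind seg and returned; NULL means nothing was split. */
static struct buf_seg *buf_seg_split(struct buf_seg *seg, unsigned short at)
{
    struct buf_seg *tail;

    if (at == 0 || at >= seg->length)
        return NULL;
    tail = malloc(sizeof(*tail));
    if (!tail)
        return NULL;
    tail->page = seg->page;
    tail->offset = (unsigned short)(seg->offset + at);
    tail->length = (unsigned short)(seg->length - at);
    tail->next = seg->next;
    seg->length = at;
    seg->next = tail;
    return tail;
}

/* Total number of bytes described by a list. */
static unsigned long buf_list_bytes(const struct buf_seg *seg)
{
    unsigned long sum = 0;

    for (; seg; seg = seg->next)
        sum += seg->length;
    return sum;
}
```

A lower layer can split an element without touching the rest of the list,
which an array only allows by copying or reallocating.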

bye, Roman


Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Roman Zippel

Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> > Does it has to be vectors? What about lists?
> 
> I'd prefer to avoid lists unless there is some overriding concern, like a
> real implementation issue. But I don't care much one way or the other -
> what I care about is that the setup and usage time is as low as possible.
> I suspect arrays are better for that.

I was thinking more about the higher layers. Here it's simpler to set up a
list of pages which can be sent to a lower layer. In the page cache we
already have per-address-space lists, so it would be very easy to use
them. A lower layer can of course generate anything it wants out of this,
e.g. sublists or vectors.

bye, Roman


Re: [PATCH 1/4] create mm/Kconfig for arch-independent memory options

2005-04-06 Thread Roman Zippel

Hi,

Dave Hansen wrote:

> diff -puN mm/Kconfig~A6-mm-Kconfig mm/Kconfig
> --- memhotplug/mm/Kconfig~A6-mm-Kconfig   2005-04-04 09:04:48.0 -0700
> +++ memhotplug-dave/mm/Kconfig   2005-04-04 10:15:23.0 -0700
> @@ -0,0 +1,25 @@
> +choice
> + prompt "Memory model"
> + default FLATMEM
> + default SPARSEMEM if ARCH_SPARSEMEM_DEFAULT
> + default DISCONTIGMEM if ARCH_DISCONTIGMEM_DEFAULT

Does this really have to be a user visible option and can't it be
derived from other values? The help text entries are really no help at all.

bye, Roman

Re: [PATCH 1/4] create mm/Kconfig for arch-independent memory options

2005-04-06 Thread Roman Zippel

Hi,

On Wed, 6 Apr 2005, Dave Hansen wrote:

> On Wed, 2005-04-06 at 22:58 +0200, Roman Zippel wrote:
> > Dave Hansen wrote:
> > > --- memhotplug/mm/Kconfig~A6-mm-Kconfig   2005-04-04 09:04:48.0 -0700
> > > +++ memhotplug-dave/mm/Kconfig   2005-04-04 10:15:23.0 -0700
> > > @@ -0,0 +1,25 @@
> > > +choice
> > > + prompt "Memory model"
> > > + default FLATMEM
> > > + default SPARSEMEM if ARCH_SPARSEMEM_DEFAULT
> > > + default DISCONTIGMEM if ARCH_DISCONTIGMEM_DEFAULT
> > 
> > Does this really have to be a user visible option and can't it be
> > derived from other values? The help text entries are really no help at all.
> 
> I hope that this selection will replace the current DISCONTIGMEM prompts
> in the individual architectures.  That way, you won't get a net increase
> in the number of prompts.

Why is this choice needed at all? Why would one choose SPARSEMEM over 
DISCONTIGMEM? Help texts such as "If unsure, choose " make 
the complete config option pretty useless.

bye, Roman

Re: [PATCH 1/4] create mm/Kconfig for arch-independent memory options

2005-04-06 Thread Roman Zippel

Hi,

On Wed, 6 Apr 2005, Dave Hansen wrote:

> > Why is this choice needed at all? Why would one choose SPARSEMEM over 
> > DISCONTIGMEM?
> 
> For now, it's only so people can test either one, and we don't have to
> try to toss DICONTIGMEM out of the kernel in fell swoop.  When the
> memory hotplug options are enabled, the DISCONTIG option goes away, and
> SPARSEMEM is selected as the only option.
> 
> I hope to, in the future, make the options more like this:
> 
> config MEMORY_HOTPLUG...
> config NUMA...
> 
> config DISCONTIGMEM
>   depends on NUMA && !MEMORY_HOTPLUG
> 
> config SPARSEMEM
>   depends on MEMORY_HOTPLUG || OTHER_ARCH_THING
> 
> config FLATMEM
>   depends on !DISCONTIGMEM && !SPARSEMEM

I was hoping for this too, in the meantime can't you simply make it a 
suboption of DISCONTIGMEM? So an extra option is only visible when it's 
enabled and most people can ignore it completely by just disabling a 
single option.

> > Help texts such as "If unsure, choose " make 
> > the complete config option pretty useless.
> 
> They don't make it useless, they just guide a clueless user to the right
> place, without them having to think about it at all.  Those of us that
> need to test the various configurations are quite sure of what we're
> doing, and can ignore the messages. :)
> 
> I'm not opposed to creating some better help text for those things, I'm
> just not sure that we really need it, or that it will help end users get
> to the right place.  I guess more explanation never hurt anyone.

Some basic explanation with a link for more information can't hurt.

bye, Roman

Re: Kernel SCM saga..

2005-04-08 Thread Roman Zippel

Hi,

On Thu, 7 Apr 2005, Linus Torvalds wrote:

> I really disliked that in BitKeeper too originally. I argued with Larry
> about it, but Larry (correctly, I believe) argued that efficient and
> reliable distribution really requires the concept of "history is
> immutable". It makes replication much easier when you know that the known
> subset _never_ shrinks or changes - you only add on top of it.

The problem is you pay a price for this. There must be a reason developers 
were adding another GB of memory just to run BK.
Preserving the complete merge history does indeed make repeated merges 
simpler, but it builds up complex meta data, which has to be managed 
forever. I doubt that this is really an advantage in the long term. I 
expect that we would be better off serializing changesets in the main 
repository. For example bk does something like this:

A1 -> A2 -> A3 -> BM
  \-> B1 -> B2 --^

and instead of creating the merge changeset, one could merge them like 
this:

A1 -> A2 -> A3 -> B1 -> B2

This results in a simpler repository, which is more scalable and which 
is easier for users to work with (e.g. binary bug search).
The disadvantage would be it will cause more minor conflicts, when changes 
are pulled back into the original tree, but which should be easily 
resolvable most of the time.
I'm not saying with this that the bk model is bad, but I think it's a 
problem if it's the only model applied to everything.
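
To illustrate why a serialized history makes binary bug search straightforward, here is a minimal sketch. It assumes changesets are numbered 0..n-1 in the serialized order, that the first is known good and the last known bad, and that a bug stays present once introduced; is_bad_fn and bad_from() are hypothetical stand-ins for "build and test this tree".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test hook: nonzero if the tree at changeset idx is broken. */
typedef int (*is_bad_fn)(size_t idx, void *ctx);

/* Example predicate: history breaks at the changeset index stored in ctx. */
static int bad_from(size_t idx, void *ctx)
{
	return idx >= *(size_t *)ctx;
}

/*
 * Binary search over a serialized history 0..n-1 where changeset 0 is
 * known good and changeset n-1 is known bad. Returns the index of the
 * first bad changeset after about log2(n) tests.
 */
static size_t first_bad_changeset(size_t n, is_bad_fn is_bad, void *ctx)
{
	size_t good = 0, bad;

	assert(n >= 2 && is_bad);
	bad = n - 1;
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (is_bad(mid, ctx))
			bad = mid;
		else
			good = mid;
	}
	return bad;
}
```

With the history A1 -> A2 -> A3 -> B1 -> B2 mapped to indexes 0..4, every midpoint is a tree that can be checked out and tested directly, which is the property the search relies on.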

> The thing is, cherry-picking very much implies that the people "up" the 
> foodchain end up editing the work of the people "below" them. The whole 
> reason you want cherry-picking is that you want to fix up somebody elses 
> mistakes, ie something you disagree with.
> 
> That sounds like an obviously good thing, right? Yes it does.
> 
> The problem is, it actually results in the wrong dynamics and psychology 
> in the system. First off, it makes the implicit assumption that there is 
> an "up" and "down" in the food-chain, and I think that's wrong.

These dynamics do exist and our tools should be able to represent them.
For example when people post patches, they get reviewed and often need 
more changes and bk doesn't really help them to redo the patches.
Bk helped you to offload the cherry-picking process to other people, so 
that you only had to do cherry-collecting very efficiently.
Another prime example of cherry-picking is Andrew's mm tree, he picks a 
number of patches which are ready for merging and forwards them to you.
Our current basic development model (at least until a few days ago) looks 
something like this:

linux-mm -> linux-bk -> linux-stable

Ideally most changes would get into the tree via linux-mm and, depending 
on various conditions (e.g. urgency, review state), they would get 
into the stable tree. In practice linux-mm is more an aggregation of 
patches which need testing and since most bk users were developing 
against linux-bk, it got a lot less testing and a lot of problems are 
only caught at the next stage. Changes from the stable tree would even 
flow in the opposite direction.
Bk supports certain aspects of the kernel development process very well, 
but due to its closed nature it was practically impossible to really 
integrate it fully into this process (at least for anyone outside BM). 
In the short term we probably are in for a tough ride and we take whatever 
works best for you, but in the long term we need to think about how SCM 
fits into our kernel development model, which includes development, 
review, testing and releasing of kernel changes. This is more than just 
pulling and merging kernel trees. I'm aiming at a tool that can also 
support Andrew's work, so that he can also better offload some of this 
work (and take a break sometimes :) ). Unfortunately every existing tool I 
know of is lacking in its own way, so we still have some way to go...

bye, Roman

Re: Kernel SCM saga..

2005-04-08 Thread Roman Zippel

Hi,

On Fri, 8 Apr 2005, Tupshin Harper wrote:

> > A1 -> A2 -> A3 -> B1 -> B2
> > 
> > This results in a simpler repository, which is more scalable and which is
> > easier for users to work with (e.g. binary bug search).
> > The disadvantage would be it will cause more minor conflicts, when changes
> > are pulled back into the original tree, but which should be easily
> > resolvable most of the time.
> > 
> Both darcs and arch (and arch's siblings) have ways of maintaining the
> complete history but speeding up operations.

Please show me how you would do a binary search with arch.

I don't really like the arch model, it's far too restrictive and it's 
jumping through hoops to get to an acceptable speed.
What I expect from a SCM is that it maintains both a version index of the 
directory structure and a version index of the individual files. Arch 
makes it especially painful to extract this data quickly. For the common 
cases it throws disk space at the problem and does a lot of caching, but 
there are still enough problems (e.g. annotate), which require scanning of 
lots of tarballs.

bye, Roman

Re: Kernel SCM saga..

2005-04-09 Thread Roman Zippel

Hi,

On Fri, 8 Apr 2005, Linus Torvalds wrote:

> Also, I suspect that BKCVS actually bothers to get more details out of a
> BK tree than I cared about. People have pestered Larry about it, so BKCVS
> exports a lot of the nitty-gritty (per-file comments etc) that just
> doesn't actually _matter_, but people whine about. Me, I don't care. My
> sparse-conversion just took the important parts.

As soon as you want to synchronize and merge two trees, you will know why 
this information does matter.
(/me looks closer at the sparse-conversion...)
It seems you exported the complete parent information and this is exactly 
the "nitty-gritty" I was "whining" about and which is not available via 
bkcvs or bkweb and it's the most crucial information to make the bk data 
useful outside of bk. Larry was previously very clear that he 
considers this proprietary bk meta data and that anyone attempting to export 
this information is in violation of the free bk licence, so you indeed 
just took the important parts and this is/was explicitly verboten for 
normal bk users.

bye, Roman

Re: Kernel SCM saga..

2005-04-09 Thread Roman Zippel

Hi,

On Fri, 8 Apr 2005, Linus Torvalds wrote:

> Yes.  Per-file history is expensive in git, because if the way it is 
> indexed. Things are indexed by tree and by changeset, and there are no 
> per-file indexes.
> 
> You could create per-file _caches_ (*) on top of git if you wanted to make
> it behave more like a real SCM, but yes, it's all definitely optimized for
> the things that _I_ tend to care about, which is the whole-repository
> operations.

Per file history is also expensive for another reason. The basic reason is 
that I think that a hash based storage is not the best approach for SCM. 
It's lacking locality, so the more it grows the more it has to seek to 
collect all the data.
To reduce the space usage you could replace the parent file with a sha1 
reference + delta to the new file. This is basically what monotone does 
and might cause performance problems if you need to restore old versions 
(e.g. if you want to annotate a file).

bye, Roman

Re: Kernel SCM saga..

2005-04-09 Thread Roman Zippel

Hi,

On Sat, 9 Apr 2005, Eric D. Mudama wrote:

> > For example bk does something like this:
> > 
> > A1 -> A2 -> A3 -> BM
> >   \-> B1 -> B2 --^
> > 
> > and instead of creating the merge changeset, one could merge them like
> > this:
> > 
> > A1 -> A2 -> A3 -> B1 -> B2
> > 
> > This results in a simpler repository, which is more scalable and which
> > is easier for users to work with (e.g. binary bug search).
> > The disadvantage would be it will cause more minor conflicts, when changes
> > are pulled back into the original tree, but which should be easily
> > resolvable most of the time.
> 
> The kicker comes that B1 was developed based on A1, so any test
> results were based on B1 being a single changeset delta away from A1. 
> If the resulting 'BM' fails testing, and you've converted into the
> linear model above where B2 has failed, you lose the ability to
> isolate B1's changes and where they came from, to revalidate the
> developer's results.

What good does it do if you can revalidate the original B1? The important 
point is that the end result works and if it only fails in the merged 
version you have a big problem. The serialized version gives you the 
chance to test whether it fails in B1 or B2.

> I believe that flattening the change graph makes history reproduction
> impossible, or alternately, you are imposing on each developer to test
> the merge results at B1 + A1..3 before submission, but in doing so,
> the test time may require additional test periods etc and with
> sufficient velocity, might never close.

The merge result has to be tested either way, so I'm not exactly sure, 
what you're trying to say.

bye, Roman

Re: Code snippet to reconstruct ancestry graph from bk repo

2005-04-10 Thread Roman Zippel

Hi,

On Sun, 10 Apr 2005, Paul P Komkoff Jr wrote:

> (borrowed from Tommi Virtanen)
> 
> Code snippet to reconstruct ancestry graph from bk repo:
> bk changes -end':I: $if(:PARENT:){:PARENT:$if(:MPARENT:){ :MPARENT:}} 
> $unless(:PARENT:){-}' |tac
> 
> format is:
> newrev parent1 [parent2]
> parent2 present if merge occurs.

I know that this is possible and Larry's response would have been 
something like this:
http://www.ussg.iu.edu/hypermail/linux/kernel/0502.1/0248.html

bye, Roman

Re: share/private/slave a subtree - define vs enum

2005-07-11 Thread Roman Zippel

Hi,

On Mon, 11 Jul 2005, Horst von Brand wrote:

> > I don't generally disagree with that, I just think that defines are not 
> > part of that list.
> 
> Covered in "bad coding style" and "hard to read code", at least.

Somehow I missed the last lkml debate where simple defines were a 
problem.

> > Look, it's great that you do reviews, but please keep in mind it's the 
> > author who has to work with code and he has to be primarily happy with, 
> > so you don't have to point out every minor issue.
> 
> Wrong. The author has to work with the code, but there are much more people
> that have to read it now and fix it in the future. It doesn't make sense
> having everybody using their own indentation style, variable naming scheme,
> and ways of defining constants.

I didn't say this, I said "minor issues". Please read more carefully.

bye, Roman

Re: dependency bug in gconfig?

2005-07-13 Thread Roman Zippel

Hi,

On Tue, 12 Jul 2005, randy_dunlap wrote:

> This appears to be a dependency bug in gconfig to me.
> 
> If I enable NETCONSOLE to y, NETPOLL becomes y.  (OK)
> If I then disable NETCONSOLE to n, NETPOLL remains y.
> 
> If I enable NETCONSOLE to m, NETPOLL remains n.  Why is that?
> 
> config NETPOLL
>   def_bool NETCONSOLE
> 
> Should this cause NETCONSOLE to track NETPOLL?

It should (although it doesn't show it immediately).
Did you compare the saved config files?

bye, Roman

Re: [PATCH 18/19] Kconfig I18N: LKC: whitespace removing

2005-07-13 Thread Roman Zippel

Hi,

On Wed, 13 Jul 2005, Egry Gábor wrote:

> diff -puN scripts/kconfig/zconf.l~kconfig-i18n-18-whitespace-fix scripts/kconfig/zconf.l
> --- linux-2.6.13-rc3-i18n-kconfig/scripts/kconfig/zconf.l~kconfig-i18n-18-whitespace-fix   2005-07-13 18:32:20.0 +0200
> +++ linux-2.6.13-rc3-i18n-kconfig-gabaman/scripts/kconfig/zconf.l   2005-07-13 18:32:20.0 +0200
> @@ -57,6 +57,17 @@ void append_string(const char *str, int 
>  	*text_ptr = 0;
>  }
>  
> +void append_helpstring(const char *str, int size)
> +{
> +	while (size) {
> +		if ((str[size-1] != ' ') && (str[size-1] != '\t'))
> +			break;
> +		size--;
> +	}
> +
> +	append_string (str, size);
> +}
> +
>  void alloc_string(const char *str, int size)
>  {
>  	text = malloc(size + 1);
> @@ -225,7 +236,7 @@ n [A-Za-z0-9_]
>  		append_string("\n", 1);
>  	}
>  	[^ \t\n].* {
> -		append_string(yytext, yyleng);
> +		append_helpstring(yytext, yyleng);
>  		if (!first_ts)
>  			first_ts = last_ts;
>  	}

Simply integrate the function into the caller.
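
A rough sketch of what integrating the trim into the caller could look like. This is plain C standing in for the lex action: help_line_action() models the rule body, and the text buffer and append_string() here are simplified stand-ins (with size_t lengths), not the real zconf.l definitions.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the zconf.l text buffer; sizes and names are illustrative. */
static char text[256];
static size_t text_len;

static void append_string(const char *str, size_t size)
{
	assert(text_len + size < sizeof(text));
	memcpy(text + text_len, str, size);
	text_len += size;
	text[text_len] = '\0';
}

/*
 * Models the body of the help-text rule: trailing blanks and tabs are
 * dropped inline, then the existing append_string() is called, so no
 * separate append_helpstring() is needed.
 */
static void help_line_action(const char *yytext, size_t yyleng)
{
	while (yyleng && (yytext[yyleng - 1] == ' ' || yytext[yyleng - 1] == '\t'))
		yyleng--;
	append_string(yytext, yyleng);
}
```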

bye, Roman

Re: [PATCH 4/19] Kconfig I18N: lxdialog: multibyte character support

2005-07-13 Thread Roman Zippel

Hi,

On Wed, 13 Jul 2005, Egry Gábor wrote:

> UTF-8 support for lxdialog with wchar. The installed wide ncurses 
> (ncursesw) is optional because some languages (ex. English, Italian) 
> and ISO 8859-xx charsets don't require this patch.

This is ugly, this just adds lots of #ifdefs with practically duplicated 
code. Please use some wrapper functions/macros.
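
One possible shape for such a wrapper, as a sketch only: CONFIG_WIDE_CHARS, dlg_strlen() and title_column() are hypothetical names, and the real lxdialog code would wrap ncurses calls rather than string length. The point is that the #ifdef appears once and callers stay identical in both builds.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical wrapper: the wide/narrow decision lives in one place.
 * CONFIG_WIDE_CHARS is an assumed build switch, not an existing symbol.
 */
#ifdef CONFIG_WIDE_CHARS
static size_t dlg_strlen(const char *s)
{
	/* Character count in the current locale's multibyte encoding. */
	size_t n = mbstowcs(NULL, s, 0);

	return n == (size_t)-1 ? strlen(s) : n;
}
#else
static size_t dlg_strlen(const char *s)
{
	return strlen(s);
}
#endif

/* Caller code is the same for both builds: start column of a centered title. */
static size_t title_column(const char *title, size_t width)
{
	size_t len;

	assert(title);
	len = dlg_strlen(title);
	return len >= width ? 0 : (width - len) / 2;
}
```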

bye, Roman

Re: [PATCH 0/19] Kconfig I18N completion

2005-07-13 Thread Roman Zippel

Hi,

On Wed, 13 Jul 2005, Egry Gábor wrote:

> The following patches complete the "Kconfig I18N support" patch by
> Arnaldo. 

First I'd really like to see some documentation on this, which describes 
the interface how tools/distributions can provide Kconfig I18N support.

> - answering (Y/M/N)

This one is just silly. Provide a nice helptext, which describes what that
means, for xconfig I'm also accepting nice descriptive icons.

bye, Roman

Re: Merging relayfs?

2005-07-14 Thread Roman Zippel

Hi,

On Mon, 11 Jul 2005, Andrew Morton wrote:

> > > Hi Andrew, can you please merge relayfs?  It provides a low-overhead
> > > logging and buffering capability, which does not currently exist in
> > > the kernel.
> > 
> > While the code is pretty nicely in shape it seems rather pointless to
> > merge until an actual user goes with it.
> 
> Ordinarily I'd agree.  But this is a bit like kprobes - it's a funny thing
> which other kernel features rely upon, but those features are often ad-hoc
> and aren't intended for merging.

I agree with Christoph, I'd like to see a small (and useful) example 
included, which can be used as reference. relayfs clients still need some 
code of their own to communicate with user space. If I look at the example 
code I'm not really sure netlink is a good way to go as control channel.
kprobes has a rather simple interface, relayfs is more complex and I think 
it's a good idea to provide some sane and complete example code to copy 
from.

Looking through the patch there are still a few areas I'm concerned about:
- the usage of atomic_t looks a little silly, there is only a single 
writer and it probably needs some cache line optimisations
- I would prefer "unsigned int" over just "unsigned"
- the padding/commit arrays can be easily managed by the client
- overwrite mode can be implemented via the buffer switch callback

In general I'm not against merging, but I have a few ideas for further 
cleanups/optimisations and it really would help to have some useful 
example code (e.g. a _simple_ event tracer).

bye, Roman

Re: [RFC][PATCH 0/4] new human-time soft-timer subsystem

2005-07-14 Thread Roman Zippel

Hi,

On Thu, 14 Jul 2005, Nishanth Aravamudan wrote:

> We no longer use jiffies (the variable) as the basis for determining
> what "time" a timer should expire or when it should be added. Instead,
> we use a new function, do_monotonic_clock(), which is simply a wrapper
> for getnstimeofday().

And suddenly a simple 32bit integer becomes a complex 64bit integer, which 
requires hardware access to read a timer and additional conversion into ns.
Why is suddenly everyone so obsessed with molesting something simple and 
cute as jiffies?

bye, Roman

Re: [PATCH] [0/5+1] menu -> menuconfig part 1

2005-07-17 Thread Roman Zippel

Hi,

On Sun, 17 Jul 2005, Bodo Eggert wrote:

> These patches change some menus into menuconfig options.
> 
> Reworked to apply to linux-2.6.13-rc3-git3

I like it, but I would prefer to first give it a bit more exposure in -mm, 
as it does change the menu structure and the behaviour is a little 
different, so I'd like to see if there's some feedback first from people 
using it.

bye, Roman

Re: Merging relayfs?

2005-07-17 Thread Roman Zippel

Hi,

On Thu, 14 Jul 2005, Tom Zanussi wrote:

> The netlink control channel seems to work very well, but I can
> certainly change the examples to use something different.  Could you
> suggest something?

It just looks like a complicated way to do an ioctl, a control file that 
you can read/write would be a lot simpler and faster.

>  > Looking through the patch there are still a few areas I'm concerned about:
>  > - the usage of atomic_t look a little silly, there is only a single 
>  > writer and probably needs some cache line optimisations
> 
> The only things that are atomic are the counts of produced and
> consumed buffers and these are only ever updated or read in the slow
> buffer-switch path.  They're atomic because if they weren't, wouldn't
> it be possible for the client to read an unfinished value if the
> producer was in the middle of updating it?

No.

>  > - I would prefer "unsigned int" over just "unsigned"
>  > - the padding/commit arrays can be easily managed by the client
> 
> Yes, I can move them out and update the examples to reflect that, but
> I thought that if this was something that most clients would need to
> do, it made some sense to keep it in relayfs and avoid duplication in
> the clients.

If a lot of clients need this, there are different ways to do this, e.g. by 
introducing some helper functions that clients can use. This way you can 
keep the core simple and allow the client to modify its behaviour.

>  > - overwrite mode can be implemented via the buffer switch callback
> 
> The buffer switch callback is already where this is handled, unless
> you're thinking of something else - one of the first checks in the
> buffer switch is relay_buf_full(), which always returns 0 if the
> buffer is in overwrite mode.

I mean, relayfs doesn't have to know about this, the client itself can do 
it (e.g. via helper functions).
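
A minimal sketch of that split, under stated assumptions: struct chan, its fields and the two helpers are hypothetical and not the relayfs interface. The core buffer switch carries no mode flag; the client picks a policy by installing one of the helper callbacks (or its own).

```c
#include <assert.h>
#include <stddef.h>

struct chan;

/* Called at each sub-buffer switch; nonzero means the writer may continue. */
typedef int (*switch_cb)(const struct chan *ch);

/* Hypothetical channel state; field names are illustrative only. */
struct chan {
	size_t n_subbufs;	/* sub-buffers in the ring */
	size_t produced;	/* sub-buffers completed by the writer */
	size_t consumed;	/* sub-buffers released by the reader */
	switch_cb on_switch;	/* client policy */
};

/* Helper for clients that want overwrite behaviour: always continue. */
static int switch_overwrite(const struct chan *ch)
{
	(void)ch;
	return 1;
}

/*
 * Helper for clients that must not lose unread data: the next sub-buffer
 * is free only if the unread ones plus the one being written do not
 * already fill the ring.
 */
static int switch_no_overwrite(const struct chan *ch)
{
	return ch->produced - ch->consumed + 1 < ch->n_subbufs;
}

/* Core switch path: the policy is entirely in the callback. */
static int chan_switch_subbuf(struct chan *ch)
{
	assert(ch && ch->on_switch);
	if (!ch->on_switch(ch))
		return 0;	/* writer stays put, caller drops the event */
	ch->produced++;
	return 1;
}
```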

bye, Roman

Re: [RFC - 0/12] NTP cleanup work (v. B4)

2005-07-17 Thread Roman Zippel

Hi,

On Fri, 15 Jul 2005, john stultz wrote:

>   In my attempts to rework the timeofday subsystem, it was suggested I
> try to avoid mixing cleanups with functional changes. In response to the
> suggestion I've tried to break out the majority of the NTP cleanups I've
> been working out of my larger patch and try to feed it in piece meal. 
> 
> The goal of this patch set is to isolate the in kernel NTP state machine
> in the hope of simplifying the current timeofday code.

I don't really like where you've taken it with ntp_advance(). With these 
patches you put half the ntp state machine in there and execute it at 
every single tick.
From the previous patches I can guess where you want to go with this, but 
I think it's the wrong direction. The code is currently as is for a 
reason, it's optimized for tick based systems and I don't see a reason to 
change this for tick based systems.
If you want to change this for cycle based systems, you have to give more 
control to the arch/timer source, which simply calls a different set of 
functions and the ntp core system basically just acts as a library to the 
timer source.
Tick based timer sources continue to update xtime and cycle based systems 
will modify the cycle multiplier (e.g. what ppc64 does). Don't force 
everything to use the same set of functions, you'll make it only more 
complex. Larger ntp state updates don't have to be done more than once a 
second and leave the details of how the clock is updated to the clock 
source (just provide some library functions it can use).

bye, Roman

Re: Merging relayfs?

2005-07-18 Thread Roman Zippel

Hi,

On Mon, 18 Jul 2005, Steven Rostedt wrote:

> I'm actually very much against this. Looking at a point of view from the
> logdev device. Having a callback to know to continue at every buffer
> switch would just be slowing down something that is expected to be very
> fast.

What exactly would be slowed down?
It would just move around some code and even avoid the overwrite mode 
check.

> I don't see the problem with having an overwrite mode or not. Why
> can't relayfs know this?

The point is to design a simple and flexible relayfs layer, which means 
not every possible function has to be done in the relayfs layer, as long 
as it's flexible enough to build additional functionality on top of it (for 
which it can again provide some library functions).

bye, Roman

Re: Merging relayfs?

2005-07-18 Thread Roman Zippel

Hi,

On Mon, 18 Jul 2005, Steven Rostedt wrote:

> > What exactly would be slowed down?
> > It would just move around some code and even avoid the overwrite mode 
> > check.
> 
> Yes, you're adding a jump to another function via a function pointer,
> that would kill the cache line of execution, to avoid a simple check, or
> some other way of handling it.

RTFS. (deliver_default_callback)

> Since I don't want to know the internals
> of relayfs,

You have to anyway, currently relayfs clients need some knowledge about how 
buffers are managed.

> the overwrite mode could be implemented in a more officient way.

I wouldn't call the buffer switch routine efficient, yet.

> > > I don't see the problem with having an overwrite mode or not. Why
> > > can't relayfs know this?
> > 
> > The point is to design a simple and flexible relayfs layer, which means 
> > not every possible function has to be done in the relayfs layer, as long 
> > it's flexible enough to build additional functionality on top of it (for 
> > which it can again provide some library functions).
> 
> The overwrite mode isn't that complex.  You don't want to make something
> so flexible that it becomes more complex.  Assembly is more flexible
> than C but I wouldn't want to code a lot with it.  A library function
> for me is out of the question, since what I build on top of relayfs is
> mostly in the kernel.  The overwrite mode would then have to be
> implemented through another kernel activity.  I might as well keep my
> own ring buffers and forget about using relayfs, and all my points in
> which I argue for it being merged is mute.

I must admit I have no clue, what you're talking about here...
The keywords above are "_simple_ _and_ _flexible_".

bye, Roman

Re: Merging relayfs?

2005-07-18 Thread Roman Zippel

Hi,

On Mon, 18 Jul 2005, Karim Yaghmour wrote:

> I guess I just don't get the point here. Why cut something away if many
> users will need it. If it's that popular that you're ready to provide a
> library function to do it, then why not just leave it to boot? One of the
> goals of relayfs is to avoid code duplication with regards to buffering
> in general.

The road to bloat is paved with lots of little features.
There aren't that many users anyway (none of the examples use that 
feature). I'd prefer to concentrate on a simple and correct relayfs layer 
and we can still think about other features as more users appear.
Starting a design by implementing every little feature which _might_ be 
needed is a really bad idea.

bye, Roman

Re: [RFC - 0/12] NTP cleanup work (v. B4)

2005-07-21 Thread Roman Zippel

Hi,

On Wed, 20 Jul 2005, john stultz wrote:

> I really don't think the NTP changes I've mailed is very complex.
> Please, be specific and point to something you think is an issue and
> I'll do my best to fix it.

Maybe I should explain in what direction I would take it.
Let's first only take tick based updates, one property I don't want to see 
go away (and which you remove in the last patch), is to basically update 
xtime at every tick by (tick_nsec+time_adj) (and maybe fold time_adjust 
into time_adj), no multiply/divide just adds/shifts. Every second (or 
maybe even less frequently) we update time_adj, where we even might 
integrate a better way to add previous errors due to SHIFT_HZ.

To add support for continuous time sources, the generic ntp code would just 
provide [tick,frequency,offset] values and the time source converts it 
into its internal values. A tick based source calculates [tick_nsec, 
time_adj] and a continuous source calculates the [offset,multiplier]. These 
values should be recalculated as infrequently as possible and not every 
single tick as you do with ppc_adjtimex. This also means a continuous 
source updates xtime basically by calling gettimeofday (what ppc64 already 
almost does) and doesn't use update_wall_time() at all.
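
The two update styles can be sketched roughly as follows. This is a user-space illustration under assumptions, not the kernel's timekeeping code: the struct layouts, FRAC_SHIFT and the function names are made up, and the cycle based read assumes the cycle delta is small enough that the multiplication does not overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
#define FRAC_SHIFT	20	/* fractional nanosecond bits; illustrative value */

/* Tick based source: everything is precalculated into one scaled length. */
struct tick_clock {
	uint64_t sec;
	uint64_t nsec_scaled;	/* nanoseconds << FRAC_SHIFT */
	uint64_t tick_scaled;	/* (tick_nsec << FRAC_SHIFT) + time_adj */
};

/* Recalculated rarely (e.g. once a second) from the NTP parameters. */
static void tick_clock_set_rate(struct tick_clock *c, uint64_t tick_nsec,
				int64_t time_adj)
{
	c->tick_scaled = (tick_nsec << FRAC_SHIFT) + (uint64_t)time_adj;
}

/* Per-tick work: one add and one compare, no multiply or divide. */
static void tick_clock_advance(struct tick_clock *c)
{
	c->nsec_scaled += c->tick_scaled;
	if (c->nsec_scaled >= (NSEC_PER_SEC << FRAC_SHIFT)) {
		c->nsec_scaled -= NSEC_PER_SEC << FRAC_SHIFT;
		c->sec++;
	}
}

/* Continuous source: the same parameters become an [offset, multiplier] pair. */
struct cycle_clock {
	uint64_t offset_nsec;	/* time in ns at cycle count 'base' */
	uint64_t base;		/* cycle count at the last recalculation */
	uint32_t mult;		/* (ns per cycle) << shift */
	uint32_t shift;
};

static uint64_t cycle_clock_read(const struct cycle_clock *c, uint64_t cycles)
{
	assert(cycles >= c->base);
	return c->offset_nsec + (((cycles - c->base) * c->mult) >> c->shift);
}
```

In both cases the expensive part (turning tick, frequency and offset into tick_scaled or into mult/offset_nsec) happens outside the hot path, which is the precalculation argued for above.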

Maybe I'm missing something, but I don't see a reason to forcibly merge 
both ways to update the clock, keep them separate and let the generic ntp 
code provide the basic parameters which the time source uses to update the 
clock. The important thing is to precalculate as much as possible, so that 
the runtime overhead is as low as possible and these precalculations 
differ between time sources, so what your patches basically do is to 
remove all of these precalculations and I can't convince myself to see 
this as a good thing.

BTW do you have any user space test code for this? This might be useful to 
verify that the changes are really correct and a prototype might be a good 
way to demonstrate the kernel changes.

bye, Roman

Re: [PATCH] Add schedule_timeout_{interruptible,uninterruptible}{,_msecs}() interfaces

2005-07-23 Thread Roman Zippel

Hi,

On Fri, 22 Jul 2005, Arjan van de Ven wrote:

> Also I'd rather not add the non-msec ones... either you're raw and use
> HZ, or you are "cooked" and use the msec variant.. I dont' see the point
> of adding an "in the middle" one. (Yes this means that several users
> need to be transformed to msecs but... I consider that progress ;)

What's wrong with using jiffies? It's simple and the current timeout 
system is based on it. Calling it something else doesn't suddenly give you 
more precision.

bye, Roman

Re: [PATCH] Add schedule_timeout_{interruptible,uninterruptible}{,_msecs}() interfaces

2005-07-23 Thread Roman Zippel

Hi,

On Sat, 23 Jul 2005, Arjan van de Ven wrote:

> > What's wrong with using jiffies? 
> 
> A lot of the (driver) users want a wallclock based timeout. For that,
> miliseconds is a more obvious API with less chance to get the jiffies/HZ
> conversion wrong by the driver writer.

We have helper functions for that. The point about using jiffies is to 
make it _very_ clear that the timeout is imprecise, and for most users 
this is sufficient.
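
To illustrate the imprecision, here is a small user space sketch of a rounding-up milliseconds-to-ticks conversion. It is in the spirit of helpers such as msecs_to_jiffies(), but it is not the kernel implementation; sketch_msecs_to_ticks and its hz parameter are hypothetical names used for this example.

```c
#include <assert.h>

/* Convert milliseconds to timer ticks, rounding up so that the timeout is
 * never shorter than requested. hz is the tick rate in ticks per second. */
static unsigned long sketch_msecs_to_ticks(unsigned long ms, unsigned long hz)
{
    assert(hz > 0);
    return (ms * hz + 999UL) / 1000UL;
}
```

At a tick rate of 100, a 15 ms request becomes 2 ticks, i.e. 20 ms; at 250, a 10 ms request becomes 3 ticks, i.e. 12 ms. The unit in the function name does not change the granularity of the underlying timer.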

bye, Roman
