Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi,

On Tue, 29 Jan 2008, john stultz wrote:

> +/* Because using NSEC_PER_SEC would be too easy */
> +#define NTP_INTERVAL_LENGTH ((((s64)TICK_USEC*NSEC_PER_USEC*USER_HZ)+CLOCK_TICK_ADJUST)/NTP_INTERVAL_FREQ)

Why are you using USER_HZ? Did you test this with HZ!=100?

Anyway, please don't make it more complicated than it already is. What I said previously about the update interval is still valid, so the correct solution is to use the simpler NTP_INTERVAL_LENGTH calculation from my last mail and to omit the correction for NO_HZ.

bye, Roman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi,

On Wed, 30 Jan 2008, john stultz wrote:

> My concern is we state the accumulation interval is X ns long. Then
> current_tick_length() is to return (X + ntp_adjustment), so each
> accumulation interval we can keep track of the error and adjust our
> interval length.
>
> So if ntp_update_frequency() sets tick_length_base to be:
>
> 	u64 second_length = (u64)(tick_usec * NSEC_PER_USEC * USER_HZ)
> 						<< TICK_LENGTH_SHIFT;
> 	second_length += (s64)CLOCK_TICK_ADJUST << TICK_LENGTH_SHIFT;
> 	second_length += (s64)time_freq
> 				<< (TICK_LENGTH_SHIFT - SHIFT_NSEC);
>
> 	tick_length_base = second_length;
> 	do_div(tick_length_base, NTP_INTERVAL_FREQ);
>
> The above is basically (X + part of ntp_adjustment)

CLOCK_TICK_ADJUST is based on LATCH and HZ, if the update frequency isn't based on HZ, there is no point in using it!

Let's look at what actually needs to be done:

1. initializing clock interval:

	clock_cycle_interval = timer_cycle_interval * clock_frequency / timer_frequency

It's simply about converting timer cycles into clock cycles, so they're about the same time interval. We already make it a bit more complicated than necessary as we go via nsec:

	ntp_interval = timer_cycle_interval * 10^9nsec / timer_frequency

and in clocksource_calculate_interval() basically:

	clock_cycle_interval = ntp_interval * clock_frequency / 10^9nsec

Without a fixed timer tick it's actually even easier, then we use the same frequency for clock and timer and the cycle interval is simply:

	clock_cycle_interval = timer_cycle_interval = clock_frequency / NTP_INTERVAL_FREQ

There is no need to use the adjustment here, you'll only cause a mismatch between the clock and timer cycle interval, which would then have to be corrected by NTP.

2. initializing clock adjustment:

	clock_adjust = timer_cycle_interval * NTP_INTERVAL_FREQ / timer_frequency - 1sec

This adjustment is used to make up for the fact that the timer frequency isn't evenly divisible by HZ, so that the clock is advanced by 1sec after timer_frequency cycles. As above, for CONFIG_NO_HZ the clock frequency is used as the timer frequency in this calculation, so it would be incorrect to use CLOCK_TICK_RATE/LATCH/HZ here. Since NTP_INTERVAL_FREQ is quite small, the resulting adjustment would be rather small as well, so it's easier not to bother in this case.

What you're basically trying to do is to add an error to the clock initialization, so that we can later compensate for it. The correct solution is really to not add the error in the first place, so that there is no need to compensate for it.

bye, Roman
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi,

On Mon, 11 Feb 2008, john stultz wrote:

> > I don't want to just send a patch, I want you to understand why your
> > approach is wrong.
>
> With all due respect, it also keeps the critique in one direction and
> makes your review less collaborative and more confrontational than I
> suspect (or maybe just hope) you intend.

I don't think that's necessarily a contradiction, if we keep it to confronting the problem. A simple patch wouldn't have provided any further understanding of the problem compared to what I already said. You would have seen what the patch does (which I already described differently), but not really why it does that. In this sense I prefer to force the confrontation with the problem. I'm afraid a working patch would encourage simply ignoring the problem, as your problem at hand would be solved without completely understanding it.

The point is that I'd really like you to understand the problem, so I'm not the only one who understands this code :) and in the end it might allow better collaboration to further improve this code.

To make it very clear, this is just about understanding the problem, I don't want to force a specific solution (which a patch would practically do). If we both understand the problem, we can also discuss the solution and maybe we find something better, but maybe I'm also totally wrong, which would be a little embarrassing :), but that would be fine too.

There may be better ways to go about this problem, but IMO it would still be better than just ignoring the problem and forcing it with a patch.

> This fine grained error accounting is where the bug I'm trying to
> address is cropping up from. In order to have the comparison we need to
> have two values:
> A: The clocksource's notion of how long the fixed interval is.
> B: NTP's notion of how long the fixed interval is.
>
> When no NTP adjustment is being made, these two values should be equal,
> but currently they are not. This is what causes the 280ppm error seen on
> my test system.
>
> Part A is calculated in the following fashion:
> 	#define NTP_INTERVAL_LENGTH (NSEC_PER_SEC/NTP_INTERVAL_FREQ)
>
> Which is then functionally shifted up by TICK_LENGTH_SHIFT, but for the
> course of this discussion, let's ignore that.
>
> Part B is calculated in ntp_update_frequency() as:
>
> 	u64 second_length = (u64)(tick_usec * NSEC_PER_USEC * USER_HZ)
> 						<< TICK_LENGTH_SHIFT;
> 	second_length += (s64)CLOCK_TICK_ADJUST << TICK_LENGTH_SHIFT;
> 	second_length += (s64)time_freq << (TICK_LENGTH_SHIFT - SHIFT_NSEC);
>
> 	tick_length_base = second_length;
> 	do_div(tick_length_base, NTP_INTERVAL_FREQ);
>
> If we're assuming there is no NTP adjustment, and avoiding the
> TICK_LENGTH_SHIFT, this can be shortened to:
> 	B = ((TICK_USEC * NSEC_PER_USEC * USER_HZ)
> 		+ CLOCK_TICK_ADJUST)/NTP_INTERVAL_FREQ
>
> The A vs B comparison can be shortened further to:
> 	NSEC_PER_SEC != (TICK_USEC * NSEC_PER_USEC * USER_HZ)
> 		+ CLOCK_TICK_ADJUST
>
> So now on to what seems to be the point of contention:
> If A != B, which side do we fix?
>
> My patches fix the A side so that it matches B, which on its face isn't
> terribly complicated, but you seem to be suggesting we fix the B side
> instead (Now I'm assuming here, because there's no patch. So I can only
> speak to your emails, which were not clear to me).

If we go from your base assumption above "there is no NTP adjustment", I would actually agree with you and it wouldn't matter much on which side to correct the equation.

The question is now what happens if there are NTP adjustments, i.e. when the time_freq value isn't zero. Based on this initialization we tell the NTP daemon the base frequency, although not directly, but it knows the length freq1 == 1 sec. If the clock now needs adjustment, the NTP daemon tells the kernel via time_freq how to change the speed so that freq1 == 1 sec + time_freq.

The problem is now that by using CLOCK_TICK_ADJUST we are cheating and we don't tell the NTP daemon the real frequency. We define 1 sec as freq2 + tick_adj and with the NTP adjustment we have freq2 + tick_adj == 1 sec + time_freq. The above initialization now calculates the base time length for an update interval of freq2 / NTP_INTERVAL_FREQ, this means the requested time_freq adjustment isn't applied to (freq2 + tick_adj) cycles but to freq2 cycles, so this finally means any adjustment made by the NTP daemon is slightly off.

To demonstrate this let's look at some real values and let's use the PIT for it (as this is where this originated and what CLOCK_TICK_ADJUST is based on). With freq1=1193182 and HZ=1000 we program the timer with 1193 cycles and the actual update frequency is freq2=1193000. To adjust for this difference we change the length of a timer tick:

	(NSEC_PER_SEC + CLOCK_TICK_ADJUST) / NTP_INTERVAL_FREQ
Re: Question on timekeeping subsystem
Hi,

On Wednesday 13. February 2008, Francis Moreau wrote:

> First I tried to find some documentation on the current implementation
> but haven't found anything really useful. Especially there's nothing about
> it in the Documentation/ directory. Please correct me if I'm already wrong.
>
> Actually I read the implementation of update_wall_time() and I really fail
> to understand how it works. This is probably because I don't know
> what the "xtime_nsec" and "error" fields in the clocksource struct are for.
> These fields are not documented anywhere in the source code so it
> should be obvious but unfortunately not for me.

These mails should help to understand what this code does:

http://lkml.org/lkml/2006/3/4/61
http://lkml.org/lkml/2006/4/3/205

bye, Roman
Re: distributed module configuration [Was: Announce: Linux-next (Or Andrew's dream :-))]
Hi,

On Wednesday 13. February 2008, Sam Ravnborg wrote:

> config foo
> 	tristate "do you want foo?"
> 	depends on USB && BAR
> 	module
> 	  obj-$(CONFIG_FOO) += foo.o
> 	  foo-y := file1.o file2.o
> 	help
> 	  foo will allow you to explode your PC

I'm more thinking about something like this:

module foo [FOO]
	tristate "do you want foo?"
	depends on USB && BAR
	source file1.c
	source file2.c if BAZ

Avoiding direct Makefile fragments would give us far more flexibility in the final Makefile output.

> And we could introduce support for
>
> 	source "drivers/net/Kconfig.*"
>
> But then we would have to make the kconfig step mandatory
> for each build as we would otherwise not know if there
> were added any Kconfig files.

That's a real problem and it would be a step back from what we have right now, so I'm not exactly comfortable with it.

bye, Roman
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi,

On Wed, 13 Feb 2008, john stultz wrote:

> Oh! So your issue is that since time_freq is stored in ppm, or in effect
> a usecs per sec offset, when we add it to something other than 1 second
> we mis-apply what NTP tells us to. Is that right?

Pretty much everything is centered around that 1sec, so the closer the update frequency is to it the better.

> Right, so if NTP has us apply a 10ppm adjustment, instead of doing:
> 	NSEC_PER_SEC + 10,000
>
> We're doing:
> 	NSEC_PER_SEC + CLOCK_TICK_ADJUST + 10,000
>
> Which, if I'm doing my math right, results in a 10.002ppm adjustment
> (using the 999847467ns number above), instead of just a 10ppm
> adjustment.
>
> Now, true, this is an error, but it is a pretty small one. Even at the
> maximum 500ppm value, it only results in an error of 76 parts per
> billion. As you'll see below, that tends to be less error than what we
> get from the clock granularity. Is there something else I'm missing here
> or is this really the core issue you're concerned with?

The error accumulates and there is no good reason to do this for the common case.

> > In consequence this means, if we want to improve timekeeping, we first set
> > the (update_cycles * NTP_INTERVAL_FREQ) interval as close as possible to
> > the real frequency. It doesn't have to be perfect as we usually don't know
> > the real frequency with 100% certainty anyway.
>
> This might need some more explanation, as I'm not certain I know what
> update_cycles refers to. Do you mean cycle_interval? I guess I'm not
> completely sure how you're suggesting we change things here.

clock->cycle_interval

> > Second, we drop the tick
> > adjustment if possible and leave the adjustments to the NTP daemon and as
> > long as the drift is within the 500ppm limit it has no problem to manage
> > this.
>
> Dropping the tick adjustment? By that do you mean the tick_usec value
> set by adjtimex()? I don't quite see why we want that. Could you expand
> here?

CLOCK_TICK_ADJUST

> HZ=1000 CLOCK_TICK_ADJUST=-152533
> jiffies		467 ppb error
> jiffies NOHZ	467 ppb error
> pit		0 ppb error
> pit NOHZ	0 ppb error
> acpi_pm		-280 ppb error
> acpi_pm NOHZ	279 ppb error
>
> HZ=1000 CLOCK_TICK_ADJUST=0
> jiffies		153000 ppb error
> jiffies NOHZ	153000 ppb error
> pit		152533 ppb error
> pit NOHZ	0 ppb error
> acpi_pm		-127112 ppb error
> acpi_pm NOHZ	279 ppb error
>
> So you are right, w/ pit & NO_HZ, the granularity error is always very
> small both with or without CLOCK_TICK_ADJUST.

If you change the frequency of acpi_pm to 3579000 you'll get this:

HZ=1000 CLOCK_TICK_ADJUST=0
jiffies		153000 ppb error
jiffies NOHZ	153000 ppb error
pit		152533 ppb error
pit NOHZ	0 ppb error
acpi_pm		0 ppb error
acpi_pm NOHZ	0 ppb error

HZ=1000 CLOCK_TICK_ADJUST=-152533
jiffies		0 ppb error
jiffies NOHZ	466 ppb error
pit		-467 ppb error
pit NOHZ	-1 ppb error
acpi_pm		126407 ppb error
acpi_pm NOHZ	22 ppb error

CLOCK_TICK_ADJUST only has any meaning for the PIT (and indirectly for jiffies). For every other clock you just add some random value, where it doesn't do _any_ good. The worst case error there will always be (ntp_hz/freq/2*10^9nsec), all you do with CLOCK_TICK_ADJUST is to shift it around, but it doesn't actually fix the error - it's still there.

> However, without CLOCK_TICK_ADJUST, the jiffies error increases for all
> values of HZ except 100 (which at first seems odd, but seems to be due
> to loss from rounding in the ACTHZ calculation).

jiffies depends on the timer resolution, so it will practically produce the same results as the PIT (assuming it's used to generate the timer tick).

> One interesting surprise in the data: With CLOCK_TICK_ADJUST=0, the
> acpi_pm's error frequency shot up in the !NO_HZ cases. This ends up
> being due to the acpi_pm being very close to a multiple (3x) of the
> pit frequency, so CLOCK_TICK_ADJUST helps it as well.

What exactly does it help with? All you are doing is number cosmetics, it has _no_ practical value and only decreases the quality of timekeeping.

> Further it seems to point that if we are going to be chasing down small
> sub-100ppb errors (which I think would be great to do, but let's not make
> users endure 200+ppm errors while we debate the fine-tuning :) we
> might want to consider a method where we let ntp_update_freq take into
> account the current clocksource's interval length, so it becomes the
> base value against which we apply adjustments (scaled appropriately).

The error at least is real, the practical value of CLOCK_TICK_ADJUST for the common case is nonexistent.

> There are 3 sources of error that we've discussed here:
> 1) The large (280ppm) error seen with no-NTP adjustment, caused by the
> inconsistent (A!=B) interval comparisons which started this discussion,
> which my patch does address.

Part of the error is caused by CLOCK
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi,

On Mon, 18 Feb 2008, john stultz wrote:

> If we are building a train station, and each train car is 60ft, it
> doesn't make much sense to build 1000ft stations, right? Instead you'll
> do better if you build a 1020ft station.

That would assume trains are always 60ft long, which is the basic error in your assumption.

Since we're using analogies: What you're doing is to put your winter clothes on your weight scale and reset the scale to zero to offset for the weight of the clothes. If you now stand on the scale in your bathing clothes, does that mean you lost weight? That's all you do - you change the scale and slightly screw the scale for everyone else trying to use it.

To keep in mind what time adjusting is supposed to do:

	freq = 1sec + time_freq

What we do instead is:

	freq + tick_adj = 1sec + time_freq

Where exactly is the problem with integrating tick_adj into time_freq? The result is _exactly_ the same. The only visible difference is a slightly higher time_freq value and as long as it is within the 500 ppm limit there is simply no problem.

> And yes, if we remove CLOCK_TICK_ADJUST, that would also resolve the
> (A!=B) issue, but it doesn't address the error from #2 below.
> [..]
> 2) We need a solution that handles granularity error well, as this is a
> moderate source of error for coarse clocksources such as jiffies.
> CLOCK_TICK_ADJUST does cover this fairly well in most cases. I suspect
> we could do even better, but that will take some deeper changes.

How exactly does CLOCK_TICK_ADJUST address this problem? The error due to insufficient resolution is still there, all it does is shift the scale, so it's not immediately visible.

> My understanding of your approach (removing CLOCK_TICK_ADJUST),
> addresses issues #1 and #3, but hurts issue #2.

What exactly is hurt?

bye, Roman
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
ly based on the PIT anymore. The whole reason for the original patch is pretty much gone by now. If you really need some kind of adjustment for your extremely broken hardware, below is the absolute maximum you need, which doesn't inflict more insanity on all the sane hardware.

bye, Roman


Revert bbe4d18ac2e058c56adb0cd71f49d9ed3216a405 and e13a2e61dd5152f5499d2003470acf9c838eab84 and remove CLOCK_TICK_ADJUST completely. Add an optional kernel parameter ntp_tick_adj instead to allow adjusting of a large base drift and thus keeping ntpd happy.
The CLOCK_TICK_ADJUST mechanism was introduced at a time when the PIT was the primary clock, but we have a variety of clock sources now, so a global PIT specific adjustment makes little sense anymore.

Signed-off-by: Roman Zippel <[EMAIL PROTECTED]>

---
 include/linux/timex.h     |    9 +
 kernel/time/ntp.c         |   11 ++-
 kernel/time/timekeeping.c |    6 ++
 3 files changed, 13 insertions(+), 13 deletions(-)

Index: linux-2.6/include/linux/timex.h
===
--- linux-2.6.orig/include/linux/timex.h
+++ linux-2.6/include/linux/timex.h
@@ -232,14 +232,7 @@ static inline int ntp_synced(void)
 #else
 #define NTP_INTERVAL_FREQ  (HZ)
 #endif
-
-#define CLOCK_TICK_OVERFLOW	(LATCH * HZ - CLOCK_TICK_RATE)
-#define CLOCK_TICK_ADJUST	(((s64)CLOCK_TICK_OVERFLOW * NSEC_PER_SEC) / \
-					(s64)CLOCK_TICK_RATE)
-
-/* Because using NSEC_PER_SEC would be too easy */
-#define NTP_INTERVAL_LENGTH ((((s64)TICK_USEC * NSEC_PER_USEC * USER_HZ) + \
-			     CLOCK_TICK_ADJUST) / NTP_INTERVAL_FREQ)
+#define NTP_INTERVAL_LENGTH (NSEC_PER_SEC/NTP_INTERVAL_FREQ)

 /* Returns how long ticks are at present, in ns / 2^(SHIFT_SCALE-10). */
 extern u64 current_tick_length(void);
Index: linux-2.6/kernel/time/ntp.c
===
--- linux-2.6.orig/kernel/time/ntp.c
+++ linux-2.6/kernel/time/ntp.c
@@ -42,12 +42,13 @@ long time_esterror = NTP_PHASE_LIMIT; /*
 long time_freq;				/* frequency offset (scaled ppm)*/
 static long time_reftime;		/* time at last adjustment (s) */
 long time_adjust;
+long ntp_tick_adj;

 static void ntp_update_frequency(void)
 {
 	u64 second_length = (u64)(tick_usec * NSEC_PER_USEC * USER_HZ)
 				<< TICK_LENGTH_SHIFT;
-	second_length += (s64)CLOCK_TICK_ADJUST << TICK_LENGTH_SHIFT;
+	second_length += (s64)ntp_tick_adj << TICK_LENGTH_SHIFT;
 	second_length += (s64)time_freq << (TICK_LENGTH_SHIFT - SHIFT_NSEC);

 	tick_length_base = second_length;
@@ -400,3 +401,11 @@ leave:	if ((time_status & (STA_UNSYNC|ST
 	notify_cmos_timer();
 	return(result);
 }
+
+static int __init ntp_tick_adj_setup(char *str)
+{
+	ntp_tick_adj = simple_strtol(str, NULL, 0);
+	return 1;
+}
+
+__setup("ntp_tick_adj=", ntp_tick_adj_setup);
Index: linux-2.6/kernel/time/timekeeping.c
===
--- linux-2.6.orig/kernel/time/timekeeping.c
+++ linux-2.6/kernel/time/timekeeping.c
@@ -187,8 +187,7 @@ static void change_clocksource(void)

 	clock->error = 0;
 	clock->xtime_nsec = 0;
-	clocksource_calculate_interval(clock,
-		(unsigned long)(current_tick_length()>>TICK_LENGTH_SHIFT));
+	clocksource_calculate_interval(clock, NTP_INTERVAL_LENGTH);

 	tick_clock_notify();

@@ -245,8 +244,7 @@ void __init timekeeping_init(void)
 	ntp_clear();

 	clock = clocksource_get_next();
-	clocksource_calculate_interval(clock,
-		(unsigned long)(current_tick_length()>>TICK_LENGTH_SHIFT));
+	clocksource_calculate_interval(clock, NTP_INTERVAL_LENGTH);
 	clock->cycle_last = clocksource_read(clock);

 	xtime.tv_sec = sec;
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi,

On Fri, 1 Feb 2008, John Stultz wrote:

> > CLOCK_TICK_ADJUST is based on LATCH and HZ, if the update frequency isn't
> > based on HZ, there is no point in using it!
>
> Hey Roman,
>
> Again, I'm sorry I don't seem to be following your objections. If you
> want to suggest a different patch to fix the issue, it might help.

I already gave you the necessary details for how to set NTP_INTERVAL_LENGTH and in the previous mail I explained the basis for it. I really don't understand what your problem with it is. Why do you try to make it more complex than necessary?

> The big issue for me, is that we have to initialize the clocksource
> cycle interval so it matches the base tick_length that NTP uses.
>
> To be clear, the issue I'm trying to address is only this:
> Assuming there is no NTP adjustment yet to be made, if we initialize the
> clocksource interval to X, then compare it with Y when we accumulate, we
> introduce error if X and Y are not the same.
>
> It really doesn't matter how long the length is, if we're including
> CLOCK_TICK_ADJUST, or if it really matches the actual HZ tick length or
> not. The issue is that we have to be consistent. If we're not, then we
> introduce error that ntpd has to additionally correct for.

You don't create consistency by adding corrections all over the place until it adds up to the right sum.

The current correction is already somewhat of a hack and I'd rather get rid of it than let it spread all over the place (it's really only needed so that people with weird HZ settings don't hit the 500ppm limit, and we're basically cheating ntpd by not telling it the real frequency). Please keep the knowledge about this crutch in a single place and don't spread it.

Anyway, for NO_HZ this correction is completely irrelevant, so again there's no point in adding random values all over the place until you get the correct result.

The only other alternative would be to calculate this correction dynamically. For this you leave NTP_INTERVAL_LENGTH as is and when changing clocks you check whether "abs(((cs->xtime_interval * NTP_INTERVAL_FREQ) >> cs->shift) - NSEC_PER_SEC)" exceeds a certain limit (e.g. 200usec), and in this case you print a warning message that the clock has a large base drift value and is a bad ntp source, and apply a correction value. This way the correction only hits the very few systems which might need it. It would be the preferred solution, but it also requires a few more changes.

bye, Roman
Re: [patch 01/26] mount options: add documentation
Hi,

On Wed, 30 Jan 2008, Miklos Szeredi wrote:

> > How does this deal with certain special cases:
> > - chroot: how will mount/df only show the mounts relevant for the chroot?
>
> That is a very good question. Andreas Gruenbacher had some patches
> for fixing behavior of /proc/mounts under a chroot, but people are
> paranoid about userspace ABI changes (unwarranted in this case, IMO).
>
> http://lkml.org/lkml/2007/4/20/147
>
> Anyway, if we are going to have a new 'mountinfo' file, this could be
> easily fixed as well.
>
> > - loop: how is the connection between file and loop device maintained?
>
> We also discussed this with Karel, maybe it didn't make it onto lkml.
>
> The proposed solution was to store the "loop" flag separately in a
> file under /var. It could just be an empty file for each such loop
> device:
>
> /var/lib/mount/loops/loop0
>
> This file is created by mount(8) if the '-oloop' option is given. And
> umount(8) automatically tears down the loop device if it finds this
> file.

My question was maybe a little short. I don't doubt that we can shove a lot into the kernel, the question is rather how much of this will be unnecessary information, which the kernel doesn't really need itself.

> > Could you also please explain why you want to go via user mounts. Other OS use
> > a daemon for that, which e.g. can maintain access controls. How do you want
> > to manage this?
>
> The unprivileged mounts patches do contain a simple form of access
> control. I don't think anything more is needed, but of course, having
> unprivileged mounts in the kernel does not prevent the use of a more
> sophisticated access control daemon in userspace, if that becomes
> necessary.

An "I don't think anything more is needed" sets off all sorts of warning lights. Most things start out simple, so IMO it's well worth checking where it might go, to know the limits beforehand.

The main question here is why should a kernel based solution be preferable over a daemon based solution? If we look for example at OS X, it has no need for user mounts but has a daemon instead, which also provides an interesting notification system for new devices, mounts or unmount requests. All this could also be done in the kernel, but where would be the advantage in doing so? The kernel implementation would be either rather limited or only bloat the kernel.

What is the feature that would make user mounts more than just a cool kernel hack?

bye, Roman
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi,

On Fri, 8 Feb 2008, john stultz wrote:

>
> 	clock = clocksource_get_next();
> -	clocksource_calculate_interval(clock,
> -		(unsigned long)(current_tick_length()>>TICK_LENGTH_SHIFT));
> +	clocksource_calculate_interval(clock, NTP_INTERVAL_LENGTH);
> 	clock->cycle_last = clocksource_read(clock);
>

Only now I noticed that the first patch had been merged without any further question. :-(
What point is there in reviewing patches, if everything is merged anyway. :-(

bye, Roman
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi,

On Fri, 8 Feb 2008, Andrew Morton wrote:

> > Only now I noticed that the first patch had been merged without any
> > further question. :-(
> > What point is there in reviewing patches, if everything is merged anyway.
> > :-(
> >
>
> oops, mistake, sorry. There's plenty of time to fix it though.

It has been signed off by both Ingo and Thomas and neither noticed anything? This makes me very afraid of the merging process...

bye, Roman
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi,

On Fri, 8 Feb 2008, john stultz wrote:

> -ENOPATCH
>
> We're taking weeks to critique a fairly small bug fix. I'm sure we both
> have better things to do than continue to misunderstand each other. I'll
> do the testing and will happily ack it if it resolves the issue.

I don't want to just send a patch, I want you to understand why your approach is wrong.

> Now, If you're disputing that I'm correcting the wrong side of the
> equation, then we're getting somewhere. But it's still not clear to me
> how you're suggesting the other side (which is calculated in
> ntp_update_frequency) be changed.
> [..]
> You keep on bringing up NO_HZ, and again, the bug I'm trying to fix
> occurs with or without NO_HZ. The fix I proposed resolves the issue with
> or without NO_HZ.

The correction is incorrect for NO_HZ.

Let's try it the other way around, as my explanation seems to lack something. Please try to explain what this correction actually means and why it should be correct for NO_HZ as well.

> > The only other alternative would be to calculate this correction
> > dynamically. For this you leave NTP_INTERVAL_LENGTH as is and when
> > changing clocks you check whether "abs(((cs->xtime_interval *
> > NTP_INTERVAL_FREQ) >> cs->shift) - NSEC_PER_SEC)" exceeds a certain limit
> > (e.g. 200usec) and in this case you print a warning message, that the
> > clock has large base drift value and is a bad ntp source and apply a
> > correction value. This way the correction only hits the very few system
> > which might need it and it would be the prefered solution, but it also
> > requires a few more changes.
>
> Uh, that seems to be just checking if the xtime_interval is off base, or
> if the ntp correction has gone too far. I just don't see how this
> connects to the issue at hand.

The above is the key to understanding the problem, if this difference is small enough there is no need to correct anything.

This is the original patch which introduced the correction:
http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=69c1cd9218e4cf3016b0f435d6ef3dffb5a53860

Keep in mind that at that time the PIT was used as the base clock (even if the tsc was used, it was relative to the PIT). So how many of those assumptions are still valid today (especially with NO_HZ enabled)?

bye, Roman
Re: [PATCH] correct inconsistent ntp interval/tick_length usage
Hi, On Thu, 21 Feb 2008, john stultz wrote: > > Again, what kind of crappy hardware do you expect? Aren't clocks supposed > > to get better and not worse? > > Well, while I've seen much worse, I consider crappy hardware to be 100 > +ppm error. So if the hardware is perfect and the system results in > 153ppm error, I'd consider that pretty crappy, especially if its not the > hardware's fault. Nevertheless this error is real, why are you trying to hide it? This isn't an error we can't handle, it's still perfectly within the limit and except that NTP reports a somewhat larger drift than you'd like to see, everything works fine. > > Where do you get this idea that the 500ppm are exclusively for hardware > > errors? If you have such bad hardware, there is another simple solution: > > change HZ to 100 and the error is reduced to 15ppm. > > True its not exclusively for hardware errors, and if we were talking > about only 15ppm I wouldn't really worry about it. But when we're saying > the system is adding 30% of the maximum error, that's just not good. Another 30% is required for normal to crappy hardware clocks and then there is still enough room left. > > I would see the point if this problem had actually any practically > > relevance, but this error is not a problem for pretty much all existing > > standard hardware. Why are you insisting on redesigning timekeeping for > > broken hardware? > > Remember my earlier data? Where I was talking about the acpi_pm being a > multiple of the PIT frequency? By removing CLOCK_TICK_ADJUST we got a > 127ppm error when HZ=1000. NO_HZ drops that down to where we don't care, > but this _does_ effect current hardware, so I'd call it relevant. How exactly does it affect current hardware in a way that breaks it? Despite this error everything still works fine, the hardware doesn't care. > > There's nothing 'injected', that resolution error is very real and the > > 500ppm limit is more than enough to deal with this. _Nobody_ is hurt by > > this. > > Sure, 500ppm is enough for most people with good hardware. But remember > the alpha example you brought up earlier? The HZ=1200 case, with the > CLOCK_TICK_RATE=32768? If we don't take CLOCK_TICK_ADJUST into account, > we end up with a **11230ppm** error from the granularity issue. NTP just > won't work on those systems. > > Now granted, the three types of alpha systems that actually use that HZ > value is probably as close to "nobody" as you're going to get, but I > don't think we can just throw the granularity issue aside. That's actually a good example of why it's irrelevant. First, it's using a cycle based clock, thus the rounding error is irrelevant. Second, in the common case they already use 1024 as HZ to reduce this error, so something similar could be done for the HZ=1200 case and I suspect that it was already done and that only CLOCK_TICK_RATE is wrong. This mail http://consortiumlibrary.org/axp-list/archive/2002-11/0101.html suggests that this is the right thing to do. There is _no_ reason to artificially optimize this error value, there are still enough other ways to improve timekeeping. The granularity error is there no matter what you do and as long as it's within a reasonable limit there is nothing that needs fixing. bye, Roman
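The ppm figures traded in this mail can be reproduced with a few lines of arithmetic. The sketch below assumes LATCH is computed as the clock rate divided by HZ, rounded to nearest (the way the kernel headers of that time defined it), and assumes a PIT rate of 1193182 Hz for the x86 cases; the function names are made up for illustration.

```c
#include <assert.h>

/* Timer cycles per tick, rounded to the nearest integer.
 * Assumption: this mirrors the LATCH definition of that era. */
static long latch(long clock_tick_rate, long hz)
{
	assert(clock_tick_rate > 0 && hz > 0);
	return (clock_tick_rate + hz / 2) / hz;
}

/* Relative difference, in ppm, between the tick the timer really
 * generates (LATCH cycles) and the nominal tick of 1/HZ seconds.
 * Negative means the real tick is shorter than the nominal one. */
static double granularity_error_ppm(long clock_tick_rate, long hz)
{
	long cycles = latch(clock_tick_rate, hz) * hz;

	return (double)(cycles - clock_tick_rate) * 1e6 /
	       (double)clock_tick_rate;
}
```

With these assumptions, HZ=1000 on the PIT gives a magnitude of roughly 153ppm, HZ=100 roughly 15ppm, and the alpha case (32768 Hz, HZ=1200, LATCH=27) roughly 11230ppm, which matches the numbers quoted above.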
Re: [PATCH] Make the kernel NTP code hand 64-bit *unsigned* values to do_div()
Hi, On Thu, 21 Feb 2008, David Howells wrote: > The kernel NTP code shouldn't hand 64-bit *signed* values to do_div(). Make > it > instead hand 64-bit unsigned values. This gets rid of a couple of warnings. I would actually prefer to introduce an explicit API for signed 64-bit divides to get rid of the temps completely, something like below. Right now it uses do_div as a fallback. When all archs are converted, do_div can be a single compatibility define and perhaps we can get rid of it completely. Bonus feature: implement the x86 version without the asm casts, allowing gcc to generate better code. bye, Roman

---
 include/asm-generic/div64.h |   14 ++
 include/asm-i386/div64.h    |   20
 include/linux/calc64.h      |   28
 kernel/time.c               |   26 +++---
 kernel/time/ntp.c           |   21 +
 lib/div64.c                 |   21 -
 6 files changed, 94 insertions(+), 36 deletions(-)

Index: linux-2.6/include/asm-generic/div64.h
===================================================================
--- linux-2.6.orig/include/asm-generic/div64.h
+++ linux-2.6/include/asm-generic/div64.h
@@ -35,6 +35,20 @@ static inline uint64_t div64_64(uint64_t
 	return dividend / divisor;
 }
 
+static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
+{
+	*remainder = dividend % divisor;
+	return dividend / divisor;
+}
+#define div_u64_rem	div_u64_rem
+
+static inline s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder)
+{
+	*remainder = dividend % divisor;
+	return dividend / divisor;
+}
+#define div_s64_rem	div_s64_rem
+
 #elif BITS_PER_LONG == 32
 
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
Index: linux-2.6/include/asm-i386/div64.h
===================================================================
--- linux-2.6.orig/include/asm-i386/div64.h
+++ linux-2.6/include/asm-i386/div64.h
@@ -48,5 +48,25 @@ div_ll_X_l_rem(long long divs, long div,
 
 }
 
+static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
+{
+	union {
+		u64 v64;
+		u32 v32[2];
+	} d = { dividend };
+	u32 upper;
+
+	upper = d.v32[1];
+	if (upper) {
+		upper = d.v32[1] % divisor;
+		d.v32[1] = d.v32[1] / divisor;
+	}
+	asm ("divl %2" : "=a" (d.v32[0]), "=d" (*remainder) :
+		"rm" (divisor), "0" (d.v32[0]), "1" (upper));
+	return d.v64;
+}
+#define div_u64_rem	div_u64_rem
+
 extern uint64_t div64_64(uint64_t dividend, uint64_t divisor);
+
 #endif
Index: linux-2.6/include/linux/calc64.h
===================================================================
--- linux-2.6.orig/include/linux/calc64.h
+++ linux-2.6/include/linux/calc64.h
@@ -46,4 +46,32 @@ static inline long div_long_long_rem_sig
 	return res;
 }
 
+#ifndef div_u64_rem
+static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
+{
+	*remainder = do_div(dividend, divisor);
+	return dividend;
+}
+#endif
+
+#ifndef div_u64
+static inline u64 div_u64(u64 dividend, u32 divisor)
+{
+	u32 remainder;
+	return div_u64_rem(dividend, divisor, &remainder);
+}
+#endif
+
+#ifndef div_s64_rem
+extern s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder);
+#endif
+
+#ifndef div_s64
+static inline s64 div_s64(s64 dividend, s32 divisor)
+{
+	s32 remainder;
+	return div_s64_rem(dividend, divisor, &remainder);
+}
+#endif
+
 #endif
Index: linux-2.6/kernel/time.c
===================================================================
--- linux-2.6.orig/kernel/time.c
+++ linux-2.6/kernel/time.c
@@ -661,9 +661,7 @@ clock_t jiffies_to_clock_t(long x)
 #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
 	return x / (HZ / USER_HZ);
 #else
-	u64 tmp = (u64)x * TICK_NSEC;
-	do_div(tmp, (NSEC_PER_SEC / USER_HZ));
-	return (long)tmp;
+	return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
 #endif
 }
 EXPORT_SYMBOL(jiffies_to_clock_t);
@@ -675,16 +673,12 @@ unsigned long clock_t_to_jiffies(unsigne
 		return ~0UL;
 	return x * (HZ / USER_HZ);
 #else
-	u64 jif;
-
 	/* Don't worry about loss of precision here .. */
 	if (x >= ~0UL / HZ * USER_HZ)
 		return ~0UL;
 
 	/* .. but do try to contain it here */
-	jif = x * (u64) HZ;
-	do_div(jif, USER_HZ);
-	return jif;
+	return div_u64((u64)x * HZ, USER_HZ);
 #endif
 }
 EXPORT_SYMBOL(clock_t_to_jiffies);
@@ -692,17 +686,15 @@ EXPORT_SYMBOL(clock_t_to_jiffies);
 u64 jiffies_64_to_clock_t(u64 x)
 {
 #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
-	do_div(x, HZ / USER_HZ);
+	return div_u64(x, HZ / USER_HZ);
 #else
 	/*
 	 * There are better ways that don't overflow early,
 	 * but even this doesn't overflow in hundreds of years
 	 * in 64 bits, so..
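The patch declares div_s64_rem() as an extern for the fallback case, but the lib/div64.c part is not included in the text above. As a rough user-space sketch of one way a signed 64-by-32 divide can be layered on the unsigned helper (this is an illustration under that assumption, not the code from the patch), using <stdint.h> types in place of the kernel's u64/s64:

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 64-by-32 divide returning quotient and remainder,
 * same shape as the generic div_u64_rem() in the patch. */
static uint64_t div_u64_rem(uint64_t dividend, uint32_t divisor,
			    uint32_t *remainder)
{
	assert(divisor != 0);
	*remainder = (uint32_t)(dividend % divisor);
	return dividend / divisor;
}

/* Signed variant built on the unsigned one (illustrative only).
 * Follows C semantics: the quotient truncates toward zero and the
 * remainder takes the sign of the dividend. */
static int64_t div_s64_rem(int64_t dividend, int32_t divisor,
			   int32_t *remainder)
{
	/* Negate in unsigned arithmetic so the most negative values
	 * do not overflow. */
	uint64_t udividend = dividend < 0 ? 0 - (uint64_t)dividend
					  : (uint64_t)dividend;
	uint32_t udivisor = divisor < 0 ? 0 - (uint32_t)divisor
					: (uint32_t)divisor;
	uint32_t urem;
	uint64_t uquot = div_u64_rem(udividend, udivisor, &urem);

	*remainder = dividend < 0 ? -(int32_t)urem : (int32_t)urem;
	if ((dividend < 0) != (divisor < 0))
		uquot = 0 - uquot;
	return (int64_t)uquot;
}
```

The point of such a helper is the one made in the mail: callers like the NTP code no longer need an unsigned temporary and a manual sign fixup around do_div().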
Re: amiga affs support broken in 2.4.x kernels??
Hi, Mark Hounschell wrote: > I'm not a list member so IF you respond to this mail please CC me. > I've been looking at the archives and see some problems with the 2.3.x > kernel versions and affs support. I've put a new version at http://www.xs4all.nl/~zippel/affs.010414.tar.gz bye, Roman
Re: amiga affs support broken in 2.4.x kernels??
Hi, Mark Hounschell wrote: > Thanks, I can now mount affs filesystems. However when I try to write > to it via "cp somefile /amiga/somefile" I get a segmentation fault. If > I then do a "df -h" it hangs the system very much like the mount command > did before I installed your tar-ball. Was write support expected from > it. Yes, it should work. What sort of filesystem is it (ffs or ofs)? Did you check the dmesg output for an oops? Which kernel version did you use? > Are you the NEW maintainer of the affs stuff. Yes, and as soon as this problem is solved, I'm sending the changes to Linus and Alan. bye, Roman
Re: amiga affs support broken in 2.4.x kernels??
Hi, Mark Hounschell wrote: > Sorry I didn't get back to you yesterday afternoon. I was out of town. > Attached is the output from dmesg and the relavent info from > /var/log/messages. Could you try the attached patch? I forgot to initialize a variable correctly. (I also put a new version at http://www.xs4all.nl/~zippel/affs.010417.tar.gz) > I beleive the filesystem is ffs > but not exactly sure. How do I tell? It's printed if you mount with '-overbose', but it shouldn't be needed anymore. :) bye, Roman

--- fs/affs/bitmap.c.org	Sat Apr 7 04:23:41 2001
+++ fs/affs/bitmap.c	Tue Apr 17 19:49:18 2001
@@ -124,7 +124,7 @@
 err_bh_read:
 	affs_error(sb,"affs_free_block","Cannot read bitmap block %u", bm->bm_key);
 	AFFS_SB->s_bmap_bh = NULL;
-	AFFS_SB->s_last_bmap = 0;
+	AFFS_SB->s_last_bmap = ~0;
 	up(&AFFS_SB->s_bmlock);
 	return;
 
@@ -262,7 +262,7 @@
 err_bh_read:
 	affs_error(sb,"affs_read_block","Cannot read bitmap block %u", bm->bm_key);
 	AFFS_SB->s_bmap_bh = NULL;
-	AFFS_SB->s_last_bmap = 0;
+	AFFS_SB->s_last_bmap = ~0;
 err_full:
 	pr_debug("failed\n");
 	up(&AFFS_SB->s_bmlock);
@@ -288,6 +288,8 @@
 		return 0;
 	}
 
+	AFFS_SB->s_last_bmap = ~0;
+	AFFS_SB->s_bmap_bh = NULL;
 	AFFS_SB->s_bmap_bits = sb->s_blocksize * 8 - 32;
 	AFFS_SB->s_bmap_count = (AFFS_SB->s_partition_size - AFFS_SB->s_reserved +
 			AFFS_SB->s_bmap_bits - 1) / AFFS_SB->s_bmap_bits;
Re: Races in affs_unlink(), affs_rmdir() and affs_rename()
Hi, Alexander Viro wrote: > unlink("/B/b") locks /B, removes "b" and unlocks /B. Then it calls > affs_remove_link(), which blocks. > > unlink("/A/a") locks /A, removes "a" and unlocks /A. Then it calls > affs_remove_link(). Which locks /B, renames removed entry into "b", > removes old "b" and inserts renamed "a" into /B. > > The rest is irrelevant - we're already in it. Thanks for finding that one, but it should be easy to fix. I can remove the parent pointer in affs_remove_hash and check for that before I try to rename that entry. > Since you don't lock /B for affs_empty_dir(), you can hit the > window between removing old /B/a and inserting renamed /A/a into /B. > Notice that VFS _does_ lock /B (->i_zombie), but affs_remove_link() > for /A/a doesn't even look at it. I thought about that one and I know it should be locked. The reason I don't do it right now is that affs supports hardlinks to dirs. The problem is especially recursive links, e.g.:

	mkdir A
	ln A A/B
	rm A/B

This is possible with affs, but will already deadlock in vfs.

	mkdir A
	mkdir A/B
	ln A A/B/C
	rm A/B/C/A & rm A/B/C & rm A/B

Every rm already takes the hash lock of the parent and then I can't simply also take the hash lock of the dir itself. What I actually want to do is to insert a reverse is_subdir() check before taking the lock. On the other hand I was thinking whether I should allow links to dirs at all and just show them as empty/readonly dirs. For 2.4 that's probably safer, as it would require a lot of locking changes in vfs and the other fs to support this properly, particularly moving most of the locking from vfs into the fs. bye, Roman
Re: Linux 2.4.3-ac12
Hi, Jes Sorensen wrote: > In principle you just need 2.7.2.3 for m68k, but someone decided to > raise the bar for all architectures by putting a check in a common > header file. IIRC 2.7.2.3 has problems with labeled initializers for structures, which makes 2.7.2.3 unusable for all archs under 2.4. bye, Roman
Re: problems with reiserfs + nfs using 2.4.2-pre4
Hi, On Tue, 20 Feb 2001, Neil Brown wrote: > 2/ lookup(".."). A small question: Why exactly is this needed? bye, Roman
Re: [NFS] Re: problems with reiserfs + nfs using 2.4.2-pre4
Hi, On 20 Feb 2001, Trond Myklebust wrote: > IIRC several NFS implementations (not Linux though) rely on being able > to walk back up the directory tree in order to discover the path at > any given moment. If I read the source correctly, namespace operations are done with dir file handle + file name. I'm playing with the idea of whether we could relax the rule that all dentries must be connected to the root. Inode to dentry lookups are really evil, e.g. the current code ignores that there might be a fs that supports links to dirs (besides that vfs doesn't support that very well either). What IMO knfsd needs is only a file handle <-> inode operation and as long as the inode is not connected to a dcache entry (i_dentry is empty) it gets a dummy dentry, which is used for further lookups. As soon as a real dentry lookup finds that inode, we can flush the dummy dentry (small change to d_instantiate()). This would make it possible to support fs that can't look up "..", or it would avoid extra checks for fs that don't have real ".." dir entries. All a fs needs to do is to generate a 16(?) byte cookie, which can be used to find the inode again (with the default being i_ino + i_generation). This is nothing for 2.4, but IMO something that could be tried with 2.5. bye, Roman PS: /me is searching his fire proof underwear. :)
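To make the cookie idea a bit more concrete, here is a small user-space sketch of what a default encoding from inode number plus generation could look like. Everything in it (struct fh_cookie, fh_encode(), fh_decode(), fh_matches(), the 16 byte size and the field layout) is hypothetical and only illustrates the proposal; it is not an existing kernel interface.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FH_COOKIE_SIZE 16	/* the "16(?) byte" size from the mail */

/* Opaque per-filesystem cookie that would travel inside the file handle. */
struct fh_cookie {
	unsigned char bytes[FH_COOKIE_SIZE];
};

/* Default encoding: inode number followed by generation, rest zeroed. */
static struct fh_cookie fh_encode(uint32_t ino, uint32_t generation)
{
	struct fh_cookie c;

	memset(&c, 0, sizeof(c));
	memcpy(c.bytes, &ino, sizeof(ino));
	memcpy(c.bytes + sizeof(ino), &generation, sizeof(generation));
	return c;
}

static void fh_decode(const struct fh_cookie *c, uint32_t *ino,
		      uint32_t *generation)
{
	memcpy(ino, c->bytes, sizeof(*ino));
	memcpy(generation, c->bytes + sizeof(*ino), sizeof(*generation));
}

/* A handle is stale if the inode number was reused with a new generation. */
static int fh_matches(const struct fh_cookie *c, uint32_t ino,
		      uint32_t cur_generation)
{
	uint32_t c_ino, c_gen;

	fh_decode(c, &c_ino, &c_gen);
	return c_ino == ino && c_gen == cur_generation;
}
```

A filesystem with different needs (for example one that has to remember a parent block) would fill the remaining bytes itself; the server side would only ever pass the cookie back unchanged.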
Re: [NFS] Re: problems with reiserfs + nfs using 2.4.2-pre4
Hi, On Tue, 20 Feb 2001, Trond Myklebust wrote: > If I read the code correctly, we set the dentry d_flag > DCACHE_NFSD_DISCONNECTED on such dummy dentries. We only force a > lookup of the full path if the inode represents a directory or the > NFSEXP_NOSUBTREECHECK export flag is not set. IMO you can't safely delay the release of the dummy entry without help of vfs. Are these dummy entries always properly released? It seems I forgot about the subtree check, so it seems a fs that can't provide a get_parent, can only be exported completely? > It doesn't seem like a major change to delay that full path lookup of > the dentry until nfsd_lookup('..') is actually called (in the case > where the 'subtree_check' flag isn't used). > However, outright banning lookups of '..' by any one filesystem isn't > an option: path lookups are used for a lot more than just > `getcwd'. Imagine for instance trying to follow a relative soft link > across such a filesystem. AFAIK this is already done in the generic code (in path_walk(), which is also called by vfs_follow_link()). bye, Roman
Re: [CFT][PATCH] Re: fat problem in 2.4.2
Hi, On Thu, 1 Mar 2001, Alexander Viro wrote: > +static int generic_vm_expand(struct address_space *mapping, loff_t size) > +{ > + struct page *page; > + unsigned long index, offset; > + int err; > + > + if (!mapping->a_ops->prepare_write || !mapping->a_ops->commit_write) > + return -ENOSYS; > + > + offset = (size & (PAGE_CACHE_SIZE-1)); /* Within page */ > + index = size >> PAGE_CACHE_SHIFT; For affs I did basically the same with a small difference: offset = ((size-1) & (PAGE_CACHE_SIZE-1)) + 1; index = (size-1) >> PAGE_CACHE_SHIFT; That works fine here and allocates a page in the cache more likely to be used. bye, Roman
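The difference between the two calculations only shows up when the new size is an exact multiple of the page size. A small stand-alone sketch, assuming 4096 byte pages (PAGE_CACHE_SHIFT of 12) and with made-up helper names:

```c
#include <assert.h>

#define PAGE_CACHE_SHIFT 12	/* assumption: 4096 byte pages */
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)

struct page_pos {
	unsigned long index;	/* which page of the file */
	unsigned long offset;	/* end offset within that page */
};

/* Calculation from the quoted generic_vm_expand() patch. */
static struct page_pos pos_quoted(unsigned long long size)
{
	struct page_pos p;

	p.offset = (unsigned long)(size & (PAGE_CACHE_SIZE - 1));
	p.index = (unsigned long)(size >> PAGE_CACHE_SHIFT);
	return p;
}

/* The affs variant: address the page holding the last byte of the file.
 * Only meaningful for size > 0. */
static struct page_pos pos_affs(unsigned long long size)
{
	struct page_pos p;

	assert(size > 0);
	p.offset = (unsigned long)((size - 1) & (PAGE_CACHE_SIZE - 1)) + 1;
	p.index = (unsigned long)((size - 1) >> PAGE_CACHE_SHIFT);
	return p;
}
```

For size 8192 the quoted version yields page 2 with offset 0, a page that holds no file data yet, while the affs version yields page 1 with offset 4096, the page containing the last byte. For sizes that are not page aligned both give the same result.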
Re: [CFT][PATCH] Re: fat problem in 2.4.2
Hi, On Thu, 1 Mar 2001, Alexander Viro wrote: > IOW, if it's worth doing at all it probably should be > on expanding path in vmtruncate() - limit checks are already > done, but old i_size is still not lost... The fs where it's important have mmu_private, that's what I use to decide whether to expand or truncate. bye, Roman
Re: console spin_lock
Hi, On Thu, 18 Jan 2001, Andrew Morton wrote: > - Get rid of the special printk buffer - share the > log buffer. (Implies writes to console > devices will be broken into two writes when they > wrap around). > - Teach vsprintf to print into a circular buffer > (snprintf thus comes for free). I have a different vsprintf variant - vpprintf(). It takes a function and a data pointer, this function is called with the print buffer and within that function you can take care of the locking. The only problem is that %n doesn't work anymore, but it's not used anyway in the kernel (as far as I can grep :) ). bye, Roman
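The mail only describes vpprintf() in one sentence, so the following is a guess at the general shape rather than the actual code: a user-space sketch in which the formatter hands its output to a caller-supplied function together with a data pointer. The signature, the print_fn type, the fixed 256 byte scratch buffer and the ring sink are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Output callback: receives the formatted bytes plus the caller's data. */
typedef void (*print_fn)(void *data, const char *buf, size_t len);

/* Simplification: format into a scratch buffer, then call the sink once.
 * Output longer than the scratch buffer is truncated. */
static int vpprintf(print_fn fn, void *data, const char *fmt, va_list args)
{
	char buf[256];
	int n = vsnprintf(buf, sizeof(buf), fmt, args);

	if (n < 0)
		return n;
	if ((size_t)n >= sizeof(buf))
		n = (int)sizeof(buf) - 1;
	fn(data, buf, (size_t)n);
	return n;
}

static int pprintf(print_fn fn, void *data, const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = vpprintf(fn, data, fmt, args);
	va_end(args);
	return n;
}

/* Example sink: a tiny circular log buffer. */
struct ring {
	char buf[16];
	size_t head;
};

static void ring_put(void *data, const char *buf, size_t len)
{
	struct ring *r = data;

	/* A real sink would take its lock here and drop it afterwards. */
	for (size_t i = 0; i < len; i++) {
		r->buf[r->head] = buf[i];
		r->head = (r->head + 1) % sizeof(r->buf);
	}
}
```

With this split the formatter knows nothing about circular buffers or locks; wrapping and locking live entirely in the sink, which is the property the mail is after.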
Re: Is sendfile all that sexy?
Hi, On Thu, 18 Jan 2001, Linus Torvalds wrote: > > Actually, this is a great example, because at one point I was working > > on a device interface which would offload all of the disk-disk copying > > overhead to the disks themselves, and not involve the CPU/RAM at all. > > It's a horrible example. > > device-to-device copies sound like the ultimate thing. > > They suck. They add a lot of complexity and do not work in general. And, > if your "normal" usage pattern really is to just move the data without > even looking at it, then you have to ask yourself whether you're doing > something worthwhile in the first place. > > Not going to happen. device-to-device is not the same as disk-to-disk. A better example would be a streaming file server. Slowly the pci bus becomes a bottleneck, why would you want to move the data twice over the pci bus if once is enough and the data is very likely not needed afterwards? Sure you can use a more expensive 64bit/60MHz bus, but why should you if the 32bit/30MHz bus is theoretically fast enough for your application? So I'm not advising it as "the ultimate thing", but I don't understand why it shouldn't happen. bye, Roman
Re: Is sendfile all that sexy?
Hi, On Thu, 18 Jan 2001, Linus Torvalds wrote: > It's too damn device-dependent, and it's not worth it. There's no way to > make it general with any current hardware, and there probably isn't going > to be for at least another decade or so. And because it's expensive and > slow to do even on a hardware level, it probably won't be done even then. > > [...] > > An important point in interface design is to know when you don't know > enough. We do not have the internal interfaces for doing anything like > this, and I seriously doubt they'll be around soon. I agree, it's device dependent, but such hardware exists. It needs of course its own memory, but then you can see it as a NUMA architecture and we already have the support for this. Create a new memory zone for the device memory and keep the pages reserved. Now you can use it almost like other memory, e.g. reading from/writing to it using address_space_ops. An application where I'd like to use it is audio recording/playback (24bit, 96kHz on 144 channels). It's possible to copy that amount of data around, but then you can't do much besides this. All the data is most of the time only needed on the soundcard, so why should I copy it first to the main memory? Right now I'm stuck with accessing a scsi device directly, but I would love to use the generic file/address_space interface for that, so you can directly stream to/from any filesystem. The only problem is that the fs interface is still too slow. That's btw the reason I suggested to split the get_block function. If you record into a file, you first just want to allocate any block from the fs for that file. A bit later when you start the write, you need a real block. And again a bit later you can still update the inode. These three stages have completely different locking requirements (except the page lock) and you can use the same mechanism for delayed writes. Anyway, now with the zerocopy network patches, there are basically already all the needed interfaces and you don't have to wait for 10 years, so I think you need to polish your crystal ball. :-) bye, Roman
Re: Is sendfile all that sexy?
Hi, On Thu, 18 Jan 2001, Linus Torvalds wrote: > > I agree, it's device dependent, but such hardware exists. > > Show me any practical case where the hardware actually exists. http://www.augan.com > I do not know of _any_ disk controllers that let you map the controller > buffers over PCI. Which means that with current hardware, you have to > assume that the disk is the initiator of the PCI-PCI DMA requests. Agreed? Yes. > I'm sure there are sound cards that just expose their buffers directly. > Fine. Make a special user-space driver for it. Don't try to make it into a > design. > [..] > You need to have a damn special sound card to do the above. That's true. "Soundcard" is actually a small understatement. :) Why should I make a new design for it, when it fits nicely into the current design? > And you wouldn't need a new memory zone - the kernel wouldn't ever touch > the memory anyway, you'd just ioremap() it if you needed to access it > programmatically in addition to the streaming of data off disk. ioremapped memory is not the same (that's what we do right now), you have to fake some virtual address to get the data to the right physical location. > Also, even when you happen to have the 1% card combination where it would > work in the first place, you'd better make sure that they are on the same > PCI bus. That's usually true on most PC's today, but that's probably going > to be an issue eventually. I agree, it's a special setup. > > Anyway, now with the zerocopy network patches, there are basically already > > all the needed interfaces and you don't have to wait for 10 years, so I > > think you need to polish your crystal ball. :-) > > The zero-copy network patches have _none_ of the interfaces you think you > need. They do not fix the fact that hardware usually doesn't even _allow_ > for what you are hoping for. And what you want is probably going to be > less likely in the future than more likely. It's about direct i/o from/to pages, for that you need a page struct (so the ioremapping doesn't work). See the memory on the pci card as normal memory, except that you can't allocate it normally, but you can still organize it like normal memory. All you need to do is to set up this memory area, then you can use it like normal memory, e.g. I can put it into the page cache and I can do a normal read/write with it. The changes are very minor, but it would solve so many other problems (especially alias issues). I know that this isn't possible with every hardware combination, nonetheless it's not that big a problem to support it where it's possible. bye, Roman
Re: Is sendfile all that sexy?
Hi, On Sat, 20 Jan 2001, Linus Torvalds wrote: > There's no no-no here: you can even create the "struct page"s on demand, > and create a dummy local zone that contains them that they all point back > to. It should be trivial - nobody else cares about those pages or that > zone anyway. AFAIK as long as that dummy page struct is only used in the page cache, that should work, but you get new problems as soon as you map the page also into a user process (grep for CONFIG_DISCONTIGMEM under include/asm-mips64 to see the needed changes). In the worst case one might need reverse mapping to get the page back. :) > That said, nobody has actually done this in practice yet, so there may be > details to work out, of course. I don't see any fundamental reasons it > wouldn't easily work, but.. I hope I'll soon have the time to experiment with this, so I'll know for sure. I don't see major problems, except I don't know yet how the performance will be. bye, Roman
Re: Is sendfile all that sexy?
Hi, On Sat, 20 Jan 2001, Linus Torvalds wrote: > But point-to-point also means that you don't get any real advantage from > doing things like device-to-device DMA. Because the links are > asynchronous, you need buffers in between them anyway, and there is no > bandwidth advantage of not going through the hub if the topology is a > pretty normal "star" kind of thing. And you _do_ want the star topology, > because in the end most of the bandwidth you want concentrated at the > point that uses it. I agree, but who says that the buffer always has to be the main memory? That might be true especially for embedded devices. The cpu is then just the local controller that manages several devices, each with its own buffer. Let's take a file server with multiple disks and multiple network cards, each with its own buffer. For stuff like this you don't want to go through the main memory, on the other hand you still need to synchronize all the data. I don't know such hardware, but I don't see a reason not to do it under Linux. :-) bye, Roman
Re: Is sendfile all that sexy?
Hi, On Sat, 20 Jan 2001, Linus Torvalds wrote: > Now, there are things to look out for: when you do these kinds of dummy > "struct page" tricks, some macros etc WILL NOT WORK. In particular, we do > not currently have a good "page_to_bus/phys()" function. That means that > anybody trying to do DMA to this page is currently screwed, simply because > he has no good way of getting the physical address. > > This is a limitation in general: the PTE access functions would also like > to have "page_to_phys()" and "phys_to_page()" functions. It gets even > worse with IO mappings, where "CPU-physical" is NOT necessarily the same > as "bus-physical". That's why I want to avoid dummy struct page and use a real mem_map instead. I have two options: 1. I map everything together in one mem_map, like it's still done for m68k, the overhead here is in the phys_to_virt()/virt_to_phys() function. 2. I use several nodes like mips64/arm and virt_to_page() gets more complex, but this usually assumes a specific memory layout to keep it fast. Once that problem is solved, I can manage the memory on the card like the main memory and use it however I want. I'll probably do something like ia64 and use the highest bits as an offset into a table. bye, Roman
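The "highest bits as an offset into a table" idea can be sketched in user space as follows. This is only an illustration of the lookup scheme under invented assumptions (32-bit physical addresses, 4096 byte pages, the top four address bits selecting a region, and a region index stored in each struct page); none of the names are real kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT   12		/* assumption: 4096 byte pages */
#define REGION_SHIFT 28		/* assumption: top 4 bits select a region */
#define NR_REGIONS   (1u << (32 - REGION_SHIFT))

struct page {
	unsigned int region;	/* which table entry this page belongs to */
};

struct region {
	uint32_t start;		/* first physical address covered */
	uint32_t nr_pages;
	struct page *mem_map;	/* one struct page per physical page */
};

static struct region regions[NR_REGIONS];

/* Register a memory area, e.g. main memory or the buffer on a PCI card. */
static void region_add(unsigned int idx, uint32_t start, uint32_t nr_pages,
		       struct page *map)
{
	uint32_t i;

	assert(idx < NR_REGIONS && (start >> REGION_SHIFT) == idx);
	regions[idx].start = start;
	regions[idx].nr_pages = nr_pages;
	regions[idx].mem_map = map;
	for (i = 0; i < nr_pages; i++)
		map[i].region = idx;
}

/* Constant-time lookup: the high bits pick the table entry. */
static struct page *phys_to_page(uint32_t phys)
{
	const struct region *r = &regions[phys >> REGION_SHIFT];
	uint32_t off;

	if (!r->mem_map || phys < r->start)
		return NULL;
	off = (phys - r->start) >> PAGE_SHIFT;
	return off < r->nr_pages ? &r->mem_map[off] : NULL;
}

static uint32_t page_to_phys(const struct page *page)
{
	const struct region *r = &regions[page->region];

	return r->start + ((uint32_t)(page - r->mem_map) << PAGE_SHIFT);
}
```

Both directions stay cheap because no search over the regions is needed, which is the concern raised for option 2 in the mail.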
Re: Is sendfile all that sexy?
Hi, On Sat, 20 Jan 2001, Linus Torvalds wrote: > But think like a good hardware designer. > > In 99% of all cases, where do you want the results of a read to end up? > Where do you want the contents of a write to come from? > > Right. Memory. > > Now, optimize for the common case. Make the common case go as fast as you > can, with as little latency and as high bandwidth as you can. > > What kind of hardware would _you_ design for the point-to-point link? > > I'm claiming that you'd do a nice DMA engine for each link point. There > wouldn't be any reason to have any other buffers (except, of course, > minimal buffers inside the IO chip itself - not for the whole packet, but > for just being able to handle cases where you don't have 100% access to > the memory bus all the time - and for doing things like burst reads and > writes to memory etc). > > I'm _not_ seeing the point for a high-performance link to have a generic > packet buffer. I completely agree, if we are talking about standard pc hardware. I was thinking more about some dedicated hardware, where you want to get the data directly to the correct place. If the hardware does a bit more with the data you need large buffers. In a standard pc the main cpu does most of the data processing, but in dedicated hardware you might have several cards, each with its own logic and memory, and here the cpu only manages that stuff. You can do all this of course from user space, but this means you have to copy the data around, which you don't want with such hardware when the kernel can help you a bit. bye, Roman
hfs support for blocksize != 512
Hi, Here is a patch for anyone who needs to access HFS on e.g. an MO drive. It's only for 2.2.16, but I was able to do that as part of my job as we need that functionality. Anyway, I've also read a bit through the HFS+ spec and IMO basically most of the current hfs needs to be rewritten for 2.4, e.g. its special files should better go into the page cache and hfs basically assumes 512 byte blocks everywhere, which isn't true anymore with hfs+. This 512 byte block problem is also the reason that the performance of this patch will suck badly on MOs, since _every_ write (of a 512 byte block) requires a read (of a 1024 byte sector). Anyway, I'm happy about any bug reports that you can't reproduce with hfs on a drive with 512 byte sectors (for that I'm still trying to fully understand hfs btrees :-) ). I don't think this patch should be included into standard 2.2, but on the other hand it also shouldn't make anything worse than it already is. bye, Roman hfs1024.diff.gz
Re: hfs support for blocksize != 512
Hi, > Darnit, documentation on filesystem locking is there for purpose. First > folks complain about its absence, then they don't bother to read the > bloody thing once it is there. Furrfu... It's great that it's there, but it still doesn't tell you everything. > Said that, handling of indirect blocks used to be badly b0rken on all > normal filesystems and it had been fixed only on ext2, so I wouldn't be > amazed if regular files were bad on B-tree style filesystems. Directories > are easy - all requests are process-synchronous (no pageout), no > truncate() in sight, so the life is better. I don't think that files are that easy, at least from what I know now from hfs. For example reading from a file might require a read from a btree file (extent file), which another file write may be busy with (e.g. reordering the btree nodes). I really would prefer that a fs could sleep _and_ use semaphores, that would keep locking simple, otherwise it only gets to be a fscking mess. bye, Roman
Re: hfs support for blocksize != 512
Hi, > > hfs. For example reading from a file might require a read from a btree > > file (extent file), with what another file write can be busy with (e.g. > > reordering the btree nodes). > > And? The point is: the thing I like about Linux is its simple interfaces, it's the basic idea of unix - keep it simple. That is true for most parts - the basic idea is simple and the real complexity is hidden behind it. But that's currently not true for the vfs interface, a fs maintainer has to fight right now with a fscking complex vfs interface and with a possibly fscking complex fs implementation. E2fs or affs have a pretty simple structure and I believe you that it's not that hard to fix, maybe there is also a simple solution for hfs. But I'd like you to forget about that and think about the big picture (as Linus nicely states it). What we should aim at with the vfs interface is simplicity, I want to use a fscking simple semaphore to protect something like anywhere else, I don't want to juggle with lots of blocks which have to be updated atomically. Maybe you get it right once, but it will follow you as a nightmare, you add one feature (e.g. quota), you add another feature (like btrees), are you still so damned fscking sure to get it right and keep it right? So and? What I'd really like to see from you is to be a bit more supportive of other people's problems, I really don't expect you to solve these problems, but if someone approaches a different solution, you're pretty quick to refuse it. So let's get back to the vfs interface, fs currently have to do pretty much all their changes atomically, they have to grab all the buffers they need and do all changes at once. How can you be sure that this is possible for every possible fs? How do you make sure you don't create other problems like livelocks? We currently have the problem that things like kswapd require an asynchronous interface, but fs prefer to synchronize it. Currently you're pushing all the burden of an asynchronous interface into the fs, which would rather avoid that. Why don't you think for a moment in the other direction? Currently I'm playing with the idea of a kernel thread for asynchronous io (maybe one per fs), that thread takes the io requests e.g. from kswapd and the io thread can safely sleep on it, while kswapd can continue its job, but I don't know yet where to put it, whether in the fs specific part or whether it can be made generic enough to be put into the generic part. Can we please think for a moment in that direction? At some point you have to synchronize the io anyway (at the latest when it hits the device), but I would pretty much prefer if a fs would get some help at some earlier point. (Anyway, I need some sleep now as well... :) ) bye, Roman
Re: hfs support for blocksize != 512
Hi,

> Yes? And it will become simpler if you will put each and every locking
> scheme into the API?

No, I didn't say that. I want the API to be less restrictive and to make the job of the fs a bit easier. IMO the current API is inconsistent and/or incomplete and I'm still trying to find out what exactly is missing. The VFS is becoming more and more multithreaded, locks are (re)moved, but nothing was added for the fs.

> We have ext2 with indirect blocks, inode bitmaps and block bitmaps, one
> per cylinder group + counters in each cylinder group. Should VFS know
> about the internal locking rules? Should it be aware of the fact that
> inodes foo and bar belong to the same cylinder group and if we remove them
> we will need to protect the bitmap for a while?

Ok, let's take ext2 as an example. Of course vfs should only be the abstraction layer, but it shouldn't enforce locking rules like the ones you added in ext2. I know the races have existed for longer, so you don't have to argue about that, but earlier I suggested a simpler solution; the problem is that it requires holding an exclusive lock while sleeping. It wouldn't even be in the fast path and would only affect write access to the indirect blocks of a single file; it doesn't affect reads and it doesn't affect access to other files - that really shouldn't be a problem even in a multithreaded environment. But currently this is not possible, and all I'm trying to do now is explore ways to make it possible, as it would make life a lot easier for ext2 and every other fs.

> We have AFFS with totally fscked directory structures.

Sorry? Why is that? Because it's not UNIX friendly? It was designed for a completely different os and is very simple. The problems I know of are mostly shared with every other fs that has a more dynamic directory structure than ext2.

> It's insane - protection of purely internal data structures belongs to the
> module that knows about them.

I absolutely don't argue against that!

Anyway, somehow you skipped a lot of my mail, so it seems I have to continue to discuss that with myself (hopefully without permanent damage). The major problem right now is that writepage() is supposed to be asynchronous, especially for kswapd, but the fs might have to synchronize something _internal_. I think one problem here is that we still have a synchronous buffer API, which makes it very hard to implement an asynchronous interface. That's why I suggested an I/O thread, which can sleep for the caller. Another possibility is to make the already existing asynchronous interface in buffer.c available to the fs. Anyway, if we want an asynchronous fs interface, we need an asynchronous buffer interface, so that e.g. writepage() in ext2 can lock the indirect block, start the I/O and get called back later; another writepage() call in the same area has to detect that lock (with a simple down_trylock()) and schedule the complete I/O for later. With some help from the buffer interface this should be possible pretty easily, and ext2 would actually become much simpler again. Something like this would also be great for real AIO support in userspace with great latencies.

bye, Roman
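The trylock-and-defer idea from the message above can be sketched roughly as follows. This is a single-threaded user-space model, not kernel code and not anything that was posted to the list: the names `inode_model`, `try_writepage` and `writepage_done` are hypothetical, the lock is a C11 `atomic_flag` standing in for a semaphore with `down_trylock()`, and "starting I/O" is reduced to bumping a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEFERRED 16

/* Hypothetical per-inode state: one lock for the indirect blocks and a
 * small queue of page indexes whose write-out had to be postponed. */
struct inode_model {
    atomic_flag indirect_lock;            /* stands in for a semaphore */
    unsigned long deferred[MAX_DEFERRED]; /* postponed page indexes */
    size_t ndeferred;
    size_t nstarted;                      /* writes whose I/O was started */
};

void inode_model_init(struct inode_model *ino)
{
    atomic_flag_clear(&ino->indirect_lock);
    ino->ndeferred = 0;
    ino->nstarted = 0;
}

/* Non-blocking attempt, in the spirit of down_trylock(): returns true
 * when the lock was taken, false when somebody else already holds it. */
static bool indirect_trylock(struct inode_model *ino)
{
    return !atomic_flag_test_and_set(&ino->indirect_lock);
}

static void indirect_unlock(struct inode_model *ino)
{
    atomic_flag_clear(&ino->indirect_lock);
}

/* Called on behalf of a caller that must not sleep (e.g. kswapd).
 * Returns 1 if I/O was started (the lock stays held until completion),
 * 0 if the page was queued for later, -1 if the queue is full. */
int try_writepage(struct inode_model *ino, unsigned long page)
{
    if (indirect_trylock(ino)) {
        ino->nstarted++;
        return 1;
    }
    if (ino->ndeferred == MAX_DEFERRED)
        return -1;
    ino->deferred[ino->ndeferred++] = page;
    return 0;
}

/* I/O completion callback: drop the lock, then start the oldest
 * postponed write, if there is one. */
void writepage_done(struct inode_model *ino)
{
    indirect_unlock(ino);
    if (ino->ndeferred > 0 && indirect_trylock(ino)) {
        for (size_t i = 1; i < ino->ndeferred; i++)
            ino->deferred[i - 1] = ino->deferred[i];
        ino->ndeferred--;
        ino->nstarted++;
    }
}
```

The point of the sketch is only that the caller never blocks: a second `try_writepage()` on the same area returns immediately and the work is picked up again from the completion path.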
Re: hfs support for blocksize != 512
Hi,

Tony Mantler wrote:

> For those of you who would rather not have read through this entire email,
> here's the condensed version: VFS is inherintly a wrong-level API, QNX does
> it much better. Flame on. :)

VFS isn't really wrong; the problem is that it moved from an almost single threaded API to a multithreaded API and that development isn't complete yet. I don't really expect fs programming to become easier, but it should stay sane. For example, I want to protect certain state changes properly, not with that insane "check all possible states at all possible times and before and after every change" approach that Al is currently using in ext2.

bye, Roman
Re: hfs support for blocksize != 512
Hi,

> It sounds to me like different FSes have different needs. Maybe the best
> approach is to have two or three fs APIs, according to the needs of the
> fs.

No, having several fs APIs is a maintenance nightmare; I think that's something everyone agrees on. What is needed is to modify the API to meet all the requirements of vfs and the needs of the fs. (The problem is we don't agree on what the fs needs...)

bye, Roman
Re: hfs support for blocksize != 512
Hi,

> Show me these removed locks. The only polite explanation I see is
> that you have serious reading comprehension problems. Let me say it once
> more, hopefully that will sink in:
>
> Your repeated claims of VFS becoming more multi-threaded in ways
> that are not transparent to fs drivers wrt locking are false.

For example, the usage of the inode lock changed quite a bit and was partly replaced with the page lock? I can still remember times when all of the fs stuff happened under the BKL; for me that means only a _single_ thread of execution could be busy in the whole fs layer. IMHO that's not really a prime example of multi-threaded programming; if you have a different definition please let me know.

> What? You've proposed locking on pageout. If _that_ isn't the fast path...

No, I suggested a lock (not necessarily the inode lock) during allocation of indirect blocks (and deferring their truncation).

> > The major problem right now is that writepage() is supposed to be
> > asynchronous especially for kswapd, but the fs might have to
> > synchronized something _internal_. I think one problem here is that we
> > still have a synchronous buffer API, what makes it very hard to
> > implement a asynchronous interface. That's why I suggested an I/O
>
> Wrong. As the matter of fact, we could trivially get rid of _any_ use of
> bread() and friends on ext2.

Excuse my stupidity, but could you please outline how?

> _One_ thread? For the whole fs? So you would pass the dirty pages from
> kswapd to that guy. Fine. It attempts to acquire the inode semaphore (in
> your proposal, as far as I could parse it). It blocks. kswapd keeps
> pumping dirty pages into the queue of that thread. Wonderful...

Sorry, but did you read my mail? The purpose of that thread is to sleep and to get woken up to continue the IO. Not very much changes, except that this thread can safely sleep, whereas kswapd can't. Excuse my ignorance, but what currently stops kswapd from starting lots of IO?

> b) doesn't help AFFS directory problems

Why the hell do you always come up with this? I _never_ mentioned it.

> Talk is cheap. If you can show the patch that would simplify ext2,
> I'm sure that Ted will be glad to see it. Same for maintainers of other
> filesystems. The only requirement is that it should work. Excuse me, but
> the longer I read your postings the more it looks like you have no idea of
> the things you are talking about. I would be glad to be proven wrong on
> that one too ;-/

I'm very sorry to waste your precious time, but your fscking arrogance makes me sick. What's your problem? Shall I first worship you as our fs god who saved us from all races? Sorry, but from time to time I prefer to think about a problem _first_ and try to understand it. One way to do this is to post questions and/or suggestions to a mailing list (at least I thought so). If you have another suggestion please enlighten me.

bye, Roman
Re: hfs support for blocksize != 512
Re: hfs support for blocksize != 512
Hi,

(Sorry for the previous empty mail, I was a bit too fast with sending and couldn't stop it completely.)

On Wed, 30 Aug 2000, Alexander Viro wrote:

I'll concentrate on the most interesting part:

> As for AFFS directory format - fine, please describe the data
> manipulations required by unlink("foo"); done after the
> link("foo","bar/baz");. Both operations are supported on AmigaOS, so
> references to UNIX are utterly irrelevant. On the block level, please.
> Only for directory blocks. Now, tell me what kind of protection (pageout
> has nothing to directories, so all async problems are irrelevant) would
> you provide. Or what protection should VFS/core kernel/exec/whatever
> provide to filesystem.

Disclaimer: I know that the following doesn't match the current implementation, it's just how I would intuitively do it:

- get dentry foo
- get dentry baz
- lock inode foo
- mark dentry foo as deleted
- getblk file header foo
- mark file header foo as deleted
- getblk file header baz
- update file header baz from file header foo
- brelse file header baz
- update inode foo
- unlock inode foo
- put dentry baz
- lock foo's parent
- getblk and update dir header parent
- getblk file headers from foo's chain until file header of predecessor of foo found
- update predecessor to point to successor of foo
- brelse everything
- unlock foo's parent
- put and invalidate dentry foo
- last user of foo frees file header foo in bitmap

I probably forgot something, but you will surely tell me. Two things I want to mention anyway. First, I only lock something when needed, which of course breaks with current conventions. Second (and most important), I use the dentry to block a possible lookup of an inode, so no one can open or create foo or do anything else with it. A rename would work similarly, only that the new dentry would be marked as not complete yet.

> On that specific operation. When you are done with
> that, I have a rename() for you, but I think that even simpler example
> (unlink()) will be sufficient.

Please post it; I know there are some interesting examples, but I don't have them at hand.

I wanted to keep that flamewar for later, but if we're already in it...

> Again, we are talking about the data structure and operations it has to
> deal with _according to its designers_. I claim that due to a bad data
> structure design (single-linked lists in hash chains, requirement to have
> all entries belonging to some chain) unlink() (one of the operations it
> was designed to deal with) becomes very complicated and requires rather
> hairy exclusion rules. On Amiga. Linux has nothing with the problem.

To be fair it should be mentioned that links were added to affs later.

bye, Roman
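The chain walk in the step list above ("getblk file headers from foo's chain until file header of predecessor of foo found", then "update predecessor to point to successor of foo") is the usual cost of a singly-linked hash chain. A minimal in-memory sketch, assuming a toy layout in which block number 0 means "end of chain" and an array lookup stands in for getblk(); `dir_model`, `chain_insert` and `chain_remove` are hypothetical names, not AFFS code:

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS   32
#define HASH_SIZE 8

/* Toy model: block 0 means "no block" and terminates a chain. */
struct header_model {
    uint32_t hash_chain;   /* block number of the next header in the chain */
};

struct dir_model {
    uint32_t table[HASH_SIZE];            /* first header of every chain */
    struct header_model blocks[NBLOCKS];  /* array lookup replaces getblk() */
};

/* Insert a header at the front of the chain for the given hash slot. */
void chain_insert(struct dir_model *dir, unsigned slot, uint32_t blk)
{
    assert(slot < HASH_SIZE && blk > 0 && blk < NBLOCKS);
    dir->blocks[blk].hash_chain = dir->table[slot];
    dir->table[slot] = blk;
}

/* Remove blk from its chain. Returns how many other headers had to be
 * walked to reach the predecessor, or -1 if blk is not on the chain. */
int chain_remove(struct dir_model *dir, unsigned slot, uint32_t blk)
{
    uint32_t *link = &dir->table[slot];
    int walked = 0;

    while (*link != 0) {
        if (*link == blk) {
            /* Predecessor (or the dir header) now points to the successor. */
            *link = dir->blocks[blk].hash_chain;
            dir->blocks[blk].hash_chain = 0;
            return walked;
        }
        link = &dir->blocks[*link].hash_chain;
        walked++;
    }
    return -1;
}
```

In the on-disk case every step of the walk is a block read, which is why the parent directory has to stay locked for the whole loop.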
Re: hfs support for blocksize != 512
Hi,

On Wed, 30 Aug 2000, Alexander Viro wrote:

> c) ->i_sem on pageout? When?

For 2.2.16:

    filemap_write_page() <- filemap_swapout() <- try_to_swap_out() <- ...
        <- swap_out() <- do_try_to_free_pages() <- kswapd()

filemap_write_page() takes i_sem and calls do_write_page(). What did I miss?

> BKL matters only in the areas where you do not block. Moreover,
> fs code is still under the BKL, so it's totally moot.

Let me state differently what I'm trying to say:

Past: Lots of filesystem code wasn't designed/written with multiple threads in mind. The result is lots of races.

Future: We want to experiment with a preempting kernel. Maybe that experiment will fail, but I'm certainly interested in it. But the result here will be a wonderful world of new races and I'm pretty sure your ext2 fixes will break here, one more reason I'm so keen to use semaphores.

All I wanted to say is that the level of threading is changing. How that is visible in the fs layer is a different problem.

> > > Wrong. As the matter of fact, we could trivially get rid of _any_ use of
> > > bread() and friends on ext2.
> >
> > Excuse my stupidity, but could you please outline me how?
>
> Using kiovec, for one thing.

Huh? You said "trivially".

> One thing that became really obvious is that current documentation
> is either not enough or not read. Hell knows what to do about the latter,
> but the former can be helped.

Documentation is one (good) thing (I really tried to find as much as possible), but my point is that I tried to discuss design issues. I didn't want to know how it works now (for that I can and do read the source); I want to discuss the possibility of alternative solutions. Is that really impossible?

Anyway, after I discussed that enough with myself, I think I can try to code up something as soon as I find the time for it.

bye, Roman
Re: hfs support for blocksize != 512
Hi,

> > - get dentry foo
> > - get dentry baz
>
> How? OK, you've found block of baz. You know the name, all right.

Links are chained together and all point back to the original, so if you remove the original, you have quite a lot to do with lots of links.

> Now
> you've got to do the full tracing all the way back to root.

All file headers have a pointer to the dir header, so it's not that difficult, but that is what makes links to directories so interesting. :)

Anyway, I'd better try to describe the idea more generally: the basic idea is to introduce transient states to vfs and to move the locking into the fs, which probably knows better what needs to be protected. This would avoid the current locking overkill.

Let's take a rename. First we mark the object as to be moved; there is no need to keep it locked after this. An open on this object would either fail or have to wait (on a separate queue). Next we mark the destination dir as not removable. This is basically the job of vfs so far; the next steps happen in the fs. (I use affs here as an example.) First we lock the source dir, remove the object from the chain and unlock the dir. Now I can lock the destination, insert the object there and unlock the dir. (Back to vfs.) All we have to do now is to restore the state of the destination dir and the object, and wake up anyone who's waiting.

Back to the original example of removing a file with links. I have to get the dentry of baz as I have to prevent a lookup of that link while I'm modifying its block. But I think it's enough to lock that block and check only the cached aliases. Then I can modify that block and unlock it again.

> > - update file header baz from file header foo
>
> If it would be that simple... Extent blocks refer to foo, unfortunately.
> Yes, copying the thing would be easier. Too bad, data structure prohibits
> that.

Which data structure prohibits that? Updating the extent blocks isn't that difficult as the back links are not needed for general operation; it just wastes I/O. A bit more problematic are concurrent readers of foo, so I can't simply trash the buffer of foo's file header, but I can simply keep it allocated till the file is closed (which also keeps the inode number constant and unique).

> Well, consider rename over the primary link and there you go... Keep in
> mind that extent blocks contain the reference to header block, so unless
> you want to update them all you've got to move the header into donor's
> chain ;-/

Oops, I just read rename(2) and noticed that I forgot about a small detail. Ok, the rename operation above gets slightly more difficult. Basically it's only a variation of the unlink problem: I first unlink the old file and then insert the new file. As I do less locking, I shouldn't have a locking problem, or what am I missing? I just might have to update lots of back links, but that is not a critical part.

[I can skip the affs history part, I just saw that you already got a better answer than I could give.]

bye, Roman
Re: hfs support for blocksize != 512
Hi,

On Thu, 31 Aug 2000, Alexander Viro wrote:

> Go ahead, write it. IMNSHO it's going to be much more complicated and
> race-prone, but code talks. If you will manage to write it in clear and
> race-free way - fine. Frankly, I don't believe that it's doable.

It will be more complicated insofar as I want to use a more complex state machine than "locked <-> unlocked"; on the other hand I can avoid such funny constructions as triple_down() and obscure locking order rules. At any time the object will be either locked or in a well defined state, and at any time only a single object is locked by a thread. (I hope some pseudo code will do for the beginning, too?)

Most namespace operations work simply like a semaphore:

    restart:
        lock(dentry);
        if (dentry is busy) {
            unlock(dentry);
            sleep();
            goto restart;
        }
        dentry->state = busy;
        unlock(dentry);

When the operation is finished, the state is reset and everyone sleeping is woken up.

Ok, let's come to the most interesting operation - rename():

    restart:
        lock(olddentry);
        if (olddentry is busy) {
            unlock(olddentry);
            sleep();
            goto restart;
        }
        olddentry->state = moving;
        unlock(olddentry);

    restart2:
        lock(newdentry);
        if (newdentry->state == moving) {
            lock(renamelock);
            if (olddentry->state == deleted) {
                unlock(renamelock);
                unlock(newdentry);
                sleep();
                goto restart;
            }
            newdentry->state = deleted;
            unlock(renamelock);
        } else if (newdentry is busy) {
            unlock(newdentry);
            sleep();
            goto restart2;
        } else
            newdentry->state = deleted;
        unlock(newdentry);

        if (!rename_valid(olddentry, newdentry)) {
            lock(newdentry);
            newdentry->state = idle;
            unlock(newdentry);
            lock(olddentry);
            olddentry->state = idle;
            unlock(olddentry);
            wakeup_sleepers();
            return;
        }

        if (newdentry exists)
            unlink(newdentry);
        do_rename(olddentry, newdentry);

        lock(newdentry);
        newdentry->state = idle;
        unlock(newdentry);
        lock(olddentry);
        olddentry->state = deleted;
        unlock(olddentry);
        wakeup_sleepers();
        return;

Note that I don't touch any inode here; everything happens in the dcache. That means I move the complete inode locking into the fs. All I do here is make sure that while operation("foo") is busy, no other operation will use "foo". IMO this should work. I tried it with a rename("foo", "bar") and a rename("bar", "foo"):

case 1: one rename gets both dentries busy, the other rename will wait till it's finished.

case 2: both mark the old dentry as moving and find the new dentry also moving. To make the rename atomic the global rename lock is needed; one rename will find that the old dentry isn't moving anymore and has to restart and wait, the other rename will complete.

Other operations will keep only one dentry busy, so I don't see a problem here. If you don't find any major problem here, I'm going to try this. If this works, it will have some other advantages:

- a user space fs will become possible that can't even deadlock the system. The first restart loop can easily be made interruptible, so it can be safely killed. (I don't really want to know what a triple_down_interruptible() would look like, not to mention the other three locks (+ BKL) taken during a rename.)
- I can imagine better support for hfs. It can access the other fork without excessive locking (I think currently it doesn't even try to). The order in which the forks can be created can change then too.

> BTW, I really wonder what kind of locks are you going to have on _blocks_
> (you've mentioned that, unless I've misparsed what you've said). IMO that
> way lies the horror, but hey, code talks.

I thought about catching a bread, but while thinking about it, I realized there should also be other ways. But that's fs specific; let's concentrate on the generic part first.

> You claim that it's doable. I seriously doubt it. Nobody knows your ideas
> better than you do, so... come on, demonstrate the patch.

I think the above example should do basically the same as some do-nothing patch within affs. I hope that example shows two important ideas (no idea if they will save the world, but I'm willing to learn):

- I use the dcache instead of the inode to synchronize namespace operations, which IMO makes quite a lot of sense, since it is our (cached) representation of the fs.
- Using states instead of a semaphore makes it easy to detect e.g. a rename loop.

bye, Roman
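The simple "works like a semaphore" case from the pseudo code above can be modelled in user space roughly like this. It is only a sketch under several assumptions: "busy" is taken to mean "any state other than idle", the short-term lock is a C11 `atomic_flag` spinlock, and instead of sleeping the claim function returns false at the point where the pseudo code would unlock, sleep and goto restart. `dentry_model`, `dentry_try_claim` and `dentry_release` are hypothetical names, and the two-dentry rename() arbitration with the global rename lock is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* States named after the ones used in the pseudo code. */
enum dentry_state { DS_IDLE, DS_BUSY, DS_MOVING, DS_DELETED };

struct dentry_model {
    atomic_flag lock;          /* short-term lock protecting 'state' */
    enum dentry_state state;
};

void dentry_model_init(struct dentry_model *d)
{
    atomic_flag_clear(&d->lock);
    d->state = DS_IDLE;
}

static void dentry_lock(struct dentry_model *d)
{
    while (atomic_flag_test_and_set(&d->lock))
        ;  /* spin; the lock is only held for a few instructions */
}

static void dentry_unlock(struct dentry_model *d)
{
    atomic_flag_clear(&d->lock);
}

/* Try to move an idle dentry into 'wanted' (busy, moving, ...).
 * Returns false where the pseudo code would sleep and restart. */
bool dentry_try_claim(struct dentry_model *d, enum dentry_state wanted)
{
    bool ok;

    dentry_lock(d);
    ok = (d->state == DS_IDLE);
    if (ok)
        d->state = wanted;
    dentry_unlock(d);
    return ok;
}

/* End of the operation: publish the final state. A real implementation
 * would wake up the sleepers here. */
void dentry_release(struct dentry_model *d, enum dentry_state final)
{
    dentry_lock(d);
    d->state = final;
    dentry_unlock(d);
}
```

Note that the lock is never held across the operation itself; only the state field keeps other operations away from the name, which is the property the message argues for.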
Re: What the Heck? [Fwd: Returned mail: User unknown]
Hi,

On Mon, 4 Sep 2000, Alan Cox wrote:

> Then they need more competant admins. It isnt _hard_ to transproxy outgoing
> smtp traffic via a spamtrapper that checks for valid src/destination and
> headers.

You are getting into a dangerous field here. If you start arguing like this, how do you explain to a politician the difference between this and content based filtering?

bye, Roman
Re: PROBLEM: mounting affs over loop hangs in syscall (x86 only?)
Hi,

On Mon, 18 Dec 2000, Bernardo Innocenti wrote:

> [1.] One line summary of the problem:
> mounting affs over loop hangs in syscall (x86 only?)

affs plays some games with the superblock lock; I have a patch that plays even worse games, but it works. I hope to finish a major cleanup of affs over christmas.

bye, Roman
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.
>
> Yes, I was thinking at this callback too. Such a callback is nearly the only
> support we need from the filesystem to provide allocate on flush.

Actually the getblock function could be split into 3 functions:

- alloc_block: mostly just decrementing a counter (and quota)
- get_block: allocating a block from the bitmap
- commit_block: inserting the new block into the inode

This would be really useful for streaming: one could get the block number as fast as possible and the data could be written very quickly, while keeping the cache usage low. Streaming directly from a device to disk also wants to get rid of the data as fast as possible.

bye, Roman
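The three-way split proposed above might look roughly like this in a toy user-space model. Only the three function names come from the message; the signatures, the `fs_model` and `file_model` structures, the flat block map and the first-fit bitmap scan are assumptions made for illustration, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FS_BLOCKS   64
#define FILE_BLOCKS 8
#define NO_BLOCK    UINT32_MAX

struct fs_model {
    uint32_t free_count;          /* blocks not yet promised to anyone */
    bool     in_use[FS_BLOCKS];   /* toy allocation bitmap */
};

struct file_model {
    uint32_t map[FILE_BLOCKS];    /* logical block -> physical block */
};

void fs_model_init(struct fs_model *fs)
{
    fs->free_count = FS_BLOCKS;
    for (uint32_t b = 0; b < FS_BLOCKS; b++)
        fs->in_use[b] = false;
}

void file_model_init(struct file_model *f)
{
    for (uint32_t i = 0; i < FILE_BLOCKS; i++)
        f->map[i] = NO_BLOCK;
}

/* Phase 1: only reserve space, so a later write cannot fail with ENOSPC.
 * This is the cheap step: a counter decrement (quota would go here too). */
bool alloc_block(struct fs_model *fs)
{
    if (fs->free_count == 0)
        return false;
    fs->free_count--;
    return true;
}

/* Phase 2: pick a concrete block from the bitmap (first fit here).
 * Expected to be called only after a successful alloc_block(). */
uint32_t get_block(struct fs_model *fs)
{
    for (uint32_t b = 0; b < FS_BLOCKS; b++) {
        if (!fs->in_use[b]) {
            fs->in_use[b] = true;
            return b;
        }
    }
    return NO_BLOCK;
}

/* Phase 3: make the block visible in the file's block map. */
void commit_block(struct file_model *f, uint32_t logical, uint32_t physical)
{
    assert(logical < FILE_BLOCKS);
    f->map[logical] = physical;
}
```

With this split, the data can be sent to the block returned by phase 2 before phase 3 has touched the inode, which is the streaming case the message has in mind.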
Re: [RFC] Generic deferred file writing
Hi,

On Sat, 30 Dec 2000, Linus Torvalds wrote:

> In fact, in a properly designed filesystem just a bit of low-level caching
> would easily make the average "get_block()" be very fast indeed. The fact
> that right now ext2 has not been optimized for this is _not_ a reason to
> design the VFS layer around a slow get_block() operation.
> [..]
> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..
>
> I really think that the #2 thing is the more interesting one, and that
> anybody looking at ext2 should look at just improving the locking and
> making the block allocation functions run faster. Which should not be all
> that difficult - the last time I looked at the thing it was quite
> horrible.

What makes the get_block business complicated now is that it can be called recursively: get_block needs to allocate something, which might start new i/o, which calls get_block again. Writing dirty pages should be a truly asynchronous process, but it isn't right now, as get_block is synchronous. Making get_block asynchronous is almost impossible, so one usually does it in a separate thread.

So IMO something like this should happen: dirty pages should be put on a separate list, and a thread takes these pages, allocates the buffers for them and starts the i/o. This would have another advantage: get_block wouldn't really need to do preallocation anymore; the get_block function could instead work on a number of pages (preallocation would instead happen in the page cache). This could make the get_block function and the needed locking very simple, e.g. one could use a simple semaphore instead of kernel_lock to protect getting multiple blocks instead of only one.

Also, splitting it into several tasks can make it faster: in one step we just do the resource allocation to guarantee the write, and in a separate step we do the real allocation. If this is done for several pages at once, it can be very fast and simple.

bye, Roman
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> Let me repeat myself one more time:
>
> I do not believe that "get_block()" is as big of a problem as people make
> it out to be.

The real problem is that get_block() doesn't scale and is very hard to get right. A recursive per-inode semaphore might help, but it's still a pain to get it right.

bye, Roman
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> cached_allocation = NULL;
>
> repeat:
>     spin_lock();
>     result = try_to_find_existing();
>     if (!result) {
>         if (!cached_allocation) {
>             spin_unlock();
>             cached_allocation = allocate_block();
>             goto repeat;
>         }
>         result = cached_allocation;
>         add_to_datastructures(result);
>     }
>     spin_unlock();
>     return result;
>
> This is quite standard, and Linux does it in many places. It doesn't have
> to be complex or ugly.

No problem with that.

> Also, I don't see why you claim the current get_block() is recursive and
> hard to use: it obviously isn't. If you look at the current ext2
> get_block(), the way it protects most of its data structures is by the
> super-block-global lock. That wouldn't work if your claims of recursive
> invocation were true.

I just rechecked that, but I don't see any superblock lock here; it uses the kernel_lock instead. Al could give the definitive answer on this though, he wrote it. :)

> The way the Linux MM works, if the lower levels need to do buffer
> allocations, they will use GFP_BUFFER (which "bread()" does internally),
> which will mean that the VM layer will _not_ call down recursively to
> write stuff out while it's already trying to write something else. This is
> exactly so that filesystems don't have to release and re-try if they don't
> want to.
>
> In short, I don't see any of your arguments.

Then I must have misunderstood Al. Al? If you are right here, I see absolutely no reason for the current complexity. (I'm a bit confused here.)

bye, Roman
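The quoted "drop the lock, allocate, retry" pattern can be written out as a small user-space sketch. This is a stand-in under stated assumptions, not kernel code: the lock is a C11 `atomic_flag` spinlock, `malloc()` plays the role of the possibly blocking `allocate_block()`, and `block_model`, `table` and `find_or_create` are hypothetical names. The quoted fragment does not show what happens to an unused preallocation; freeing it is this sketch's choice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOTS 16

struct block_model {
    unsigned long index;
};

static atomic_flag table_lock = ATOMIC_FLAG_INIT;
static struct block_model *table[SLOTS];   /* the shared data structure */

static void lock_table(void)
{
    while (atomic_flag_test_and_set(&table_lock))
        ;  /* spin */
}

static void unlock_table(void)
{
    atomic_flag_clear(&table_lock);
}

/* Look the block up; if it is missing, allocate it with the lock dropped
 * and recheck under the lock before inserting. */
struct block_model *find_or_create(unsigned long index)
{
    struct block_model *cached = NULL;
    struct block_model *result;

    assert(index < SLOTS);
    for (;;) {
        lock_table();
        result = table[index];
        if (result != NULL || cached != NULL)
            break;
        /* Not found and nothing preallocated yet: the allocation may
         * block, so it must not happen while the spinlock is held. */
        unlock_table();
        cached = malloc(sizeof(*cached));
        if (cached == NULL)
            return NULL;
        cached->index = index;
    }
    if (result == NULL) {
        /* Still missing after the recheck: publish our allocation. */
        result = cached;
        table[index] = result;
        cached = NULL;
    }
    unlock_table();
    free(cached);   /* non-NULL only if someone else inserted first */
    return result;
}
```

The recheck after retaking the lock is what makes the pattern safe: whoever inserts first wins and every caller returns the same object.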
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Alexander Viro wrote:

> Reread the original thread. GFP_BUFFER protects us from buffer cache
> allocations triggering pageouts. It has nothing to the deadlock scenario
> that would come from grabbing ->i_sem on pageout.

I don't want to grab i_sem. It was a very, very early idea... :)

> Sheesh... "Complexity" of ext2_get_block() (down to the ext2_new_block()
> calls) is really, really not a problem. Normally it just gives you the
> straightforward path. All unrolls are for contention cases and they
> are precisely what we have to do there.

Maybe complexity is the wrong word; of course the logic in there is straightforward (once one has understood it :) ). Let me ask it differently, and now it's only about indirect block handling. Is it possible to use a per-inode-indirect-block-semaphore?

The reason for the question is that I may see a different sort of contention here - livelocks. I don't mind getting resources and rechecking whether everything went well. The problem is how many resources you need to get (and to release, if something failed). Somewhere there is always a point where two threads can't make any progress or one thread can stall the progress of a second. To get back to ext2_get_block: IMO such a scenario could happen in the double or triple indirect block case, when two or more threads try to allocate/truncate a block there. Maybe my concerns are baseless, but I'd just like to know that there isn't a possibility for a DoS attack here. (BTW that's what I mean by complexity; it's less the logical complexity, it's more the "runtime complexity".)

The other reason for the question is that I'm currently reworking the block handling in affs, especially the extended block handling, where I'm implementing a new extended block cache, and I would very much prefer to use a semaphore to protect it. I could probably do it without the semaphore and use a spinlock plus rechecking, but the semaphore would keep it so much simpler. (I can post more details about this part on fsdevel if needed / wanted.)

bye, Roman
Re: [RFC] Generic deferred file writing
Hi,

On Mon, 1 Jan 2001, Alexander Viro wrote:

> But... But with AFFS you _have_ exclusion between block-allocation and
> truncate(). It has no sparse files, so pageout will never allocate
> anything. I.e. all allocations come from write(2). And both write(2) and
> truncate(2) hold i_sem.
>
> Problem with AFFS is on the directory side of that business and there it's
> really scary. Block allocation is trivial...

Block allocation is not my problem right now (and even directory handling is not that difficult), but I will post something about this on fsdevel later. But one question is still open that I'd really like an answer to: is it possible to use a per-inode-indirect-block-semaphore?

bye, Roman
Re: [RFC] Generic deferred file writing
Hi,

On Tue, 2 Jan 2001, Alexander Viro wrote:

> Depends on a filesystem. Generally you don't want asynchronous operations to grab semaphores shared with something else. kswapd knows to skip the locked pages, but that's it - if writepage() is going to block on a semaphore you will not know what had hit you. And while buffer-cache operations will not trigger writepage() (grep for GFP_BUFFER and GFP_IO and you'll see) you have no such warranties for other sources of memory pressure. If one of them hits while you are holding such semaphore - you are toast.

I just checked that and you're right; sorry for causing confusion and thanks for clearing this up.

> We probably could pull it off for ext2_truncate() vs. ext2_get_block() but it would not do us any good. It would give excessive exclusion for operations that can be done in parallel just fine (example: we have a hole from 100Kb to 200Kb. Pageouts in that area can be trivially done i parallel - current code will not even try to do unrolls. With your locking they will be serialized for no good reason). What for?

Let me come back to the three phases I mentioned earlier:

alloc_block: does only a read-only check whether a block needs to be allocated or not; this can be done in parallel and only needs the page lock.
get_block: blocks are now really allocated, and this needs locking of the bitmap.
commit_block: writes the allocated blocks to the inode; this would now use an inode-specific semaphore to protect the updates of indirect blocks.

The only problem I see is truncate, but if we move the release of unneeded indirect blocks to file_close, only new indirect blocks can appear while the file is open, and they won't change anymore, which would make lots of the checks easier.

bye, Roman
Re: [PATCH] move xchg/cmpxchg to atomic.h
Hi,

On Tue, 2 Jan 2001, David S. Miller wrote:

>    We really can't. We _only_ have load-and-zero. And it has to be 16-byte aligned. xchg() is just not something the CPU implements.
>
> Oh bugger... you do have real problems.

For 2.5 we could move all the atomic functions from atomic.h, bitops.h, system.h and give them a common interface. We could also give them a new argument atomic_spinlock_t, which is a normal spinlock, but only used on architectures which need it; everyone else can "optimize" it away. I think one such lock per major subsystem should be enough, as the lock is only held for a very short time, so contention should be no problem. Anyway, this would have the huge advantage that we could use the complete 32/64 bits of the atomic value, e.g. for pointer operations.

bye, Roman
Re: Meaning of blk_size
Hi,

On Mon, 2 Oct 2000, Andries Brouwer wrote:

> These days I have as background activity the construction of the corresponding patch for 2.4. Maybe we can start 2.5 without these arrays and with large device numbers.

I started something like this a few months ago; I got to the point of booting a usermode kernel up to the fsck, which failed. Currently I have no time to continue it, as there is more important work pending. But I didn't create a generic kdev_t; I changed the block device part to use a bdev_t, and I also started a few cleanups that make e.g. the partition stuff a bit easier. On the other hand it of course breaks every block device driver.

bye, Roman
Re: SA_INTERRUPT
Hi,

On Sun, 1 Oct 2000, Andrea Arcangeli wrote:

> Comments?

When that is done, please don't call __sti() directly; use some macro that can be overridden by the architectures.

bye, Roman
Re: SA_INTERRUPT
Hi,

On Mon, 2 Oct 2000, Andrea Arcangeli wrote:

> > When that is done, please don't call __sti() directly and use some macro that can be overridden by the architectures.
>
> What do you have in mind while making this suggestion? The irq highlevel layer is pretty much architectural indipendent. Just run a diff between the irq.c in the IA32 and alpha ports. Also what about the drivers that are just using __sti() at the start of the irq handler right now?

m68k uses interrupt levels, so an interrupt with a higher priority can interrupt another interrupt with a lower priority. To make things more interesting, several m68k machines don't have a separate external interrupt controller, so they rely on the interrupt level not being lowered during an interrupt (the ide driver has an ide__sti() because of this). I can imagine that newer low-end embedded targets have similar problems (on the other end I'm just looking at the s390 interrupt code, which also looks interesting).

bye, Roman
[PATCH] initdata and bss
Hi,

A few bss changes (to remove zero initialization) in test9 were not completely correct. Init data must be initialized if you want it to end up in the init section (this is also mentioned in the gcc documentation). The following patch fixes what I was able to find with grep and also adds a note about it in init.h.

bye, Roman

diff -ur linux-2.4.org/arch/arm/mm/init.c linux-2.4-initdata/arch/arm/mm/init.c
--- linux-2.4.org/arch/arm/mm/init.c	Thu Oct 5 19:35:19 2000
+++ linux-2.4-initdata/arch/arm/mm/init.c	Thu Oct 5 20:22:00 2000
@@ -56,7 +56,7 @@
  * The sole use of this is to pass memory configuration
  * data from paging_init to mem_init.
  */
-static struct meminfo __initdata meminfo;
+static struct meminfo meminfo __initdata = { 0, };
 
 /*
  * empty_bad_page is the page that is used for page faults when
diff -ur linux-2.4.org/arch/ia64/kernel/smp.c linux-2.4-initdata/arch/ia64/kernel/smp.c
--- linux-2.4.org/arch/ia64/kernel/smp.c	Thu Aug 24 19:30:52 2000
+++ linux-2.4-initdata/arch/ia64/kernel/smp.c	Thu Oct 5 20:23:31 2000
@@ -49,8 +49,8 @@
 
 spinlock_t kernel_flag = SPIN_LOCK_UNLOCKED;
 
-struct smp_boot_data __initdata smp;
-char __initdata no_int_routing = 0;
+struct smp_boot_data smp __initdata = { 0, };
+char no_int_routing __initdata = 0;
 
 unsigned char smp_int_redirect;	/* are INT and IPI redirectable by the chipset? */
 volatile int __cpu_number_map[NR_CPUS] = { -1, };	/* SAPIC ID -> Logical ID */
diff -ur linux-2.4.org/arch/m68k/kernel/setup.c linux-2.4-initdata/arch/m68k/kernel/setup.c
--- linux-2.4.org/arch/m68k/kernel/setup.c	Wed Jul 5 01:04:12 2000
+++ linux-2.4-initdata/arch/m68k/kernel/setup.c	Thu Oct 5 20:20:01 2000
@@ -68,13 +68,13 @@
 
 char m68k_debug_device[6] = "";
 
-void (*mach_sched_init) (void (*handler)(int, void *, struct pt_regs *)) __initdata;
+void (*mach_sched_init) (void (*handler)(int, void *, struct pt_regs *)) __initdata = NULL;
 /* machine dependent keyboard functions */
-int (*mach_keyb_init) (void) __initdata;
+int (*mach_keyb_init) (void) __initdata = NULL;
 int (*mach_kbdrate) (struct kbd_repeat *) = NULL;
 void (*mach_kbd_leds) (unsigned int) = NULL;
 /* machine dependent irq functions */
-void (*mach_init_IRQ) (void) __initdata;
+void (*mach_init_IRQ) (void) __initdata = NULL;
 void (*(*mach_default_handler)[]) (int, void *, struct pt_regs *) = NULL;
 void (*mach_get_model) (char *model) = NULL;
 int (*mach_get_hardware_list) (char *buffer) = NULL;
diff -ur linux-2.4.org/arch/ppc/kernel/apus_setup.c linux-2.4-initdata/arch/ppc/kernel/apus_setup.c
--- linux-2.4.org/arch/ppc/kernel/apus_setup.c	Thu Oct 5 19:35:21 2000
+++ linux-2.4-initdata/arch/ppc/kernel/apus_setup.c	Thu Oct 5 20:19:41 2000
@@ -82,13 +82,13 @@
 
 extern void amiga_init_IRQ(void);
 
-void (*mach_sched_init) (void (*handler)(int, void *, struct pt_regs *)) __initdata;
+void (*mach_sched_init) (void (*handler)(int, void *, struct pt_regs *)) __initdata = NULL;
 /* machine dependent keyboard functions */
-int (*mach_keyb_init) (void) __initdata;
+int (*mach_keyb_init) (void) __initdata = NULL;
 int (*mach_kbdrate) (struct kbd_repeat *) __apusdata = NULL;
 void (*mach_kbd_leds) (unsigned int) __apusdata = NULL;
 /* machine dependent irq functions */
-void (*mach_init_IRQ) (void) __initdata;
+void (*mach_init_IRQ) (void) __initdata = NULL;
 void (*(*mach_default_handler)[]) (int, void *, struct pt_regs *) __apusdata = NULL;
 void (*mach_get_model) (char *model) __apusdata = NULL;
 int (*mach_get_hardware_list) (char *buffer) __apusdata = NULL;
diff -ur linux-2.4.org/arch/ppc/kernel/prep_setup.c linux-2.4-initdata/arch/ppc/kernel/prep_setup.c
--- linux-2.4.org/arch/ppc/kernel/prep_setup.c	Thu Oct 5 19:35:22 2000
+++ linux-2.4-initdata/arch/ppc/kernel/prep_setup.c	Thu Oct 5 20:18:55 2000
@@ -384,8 +384,8 @@
  * 2 following ones measure the interval. The precision of the method
  * is still doubtful due to the short interval sampled.
  */
-static __initdata volatile int calibrate_steps = 3;
-static __initdata unsigned tbstamp;
+static volatile int calibrate_steps __initdata = 3;
+static unsigned tbstamp __initdata = 0;
 
 void __init
 prep_calibrate_decr_handler(int irq,
diff -ur linux-2.4.org/drivers/block/xd.c linux-2.4-initdata/drivers/block/xd.c
--- linux-2.4.org/drivers/block/xd.c	Thu Oct 5 19:35:27 2000
+++ linux-2.4-initdata/drivers/block/xd.c	Thu Oct 5 20:14:58 2000
@@ -142,9 +142,9 @@
 static DECLARE_WAIT_QUEUE_HEAD(xd_wait_open);
 static u_char xd_valid[XD_MAXDRIVES] = { 0,0 };
 static u_char xd_drives, xd_irq = 5, xd_dma = 3, xd_maxsectors;
-static u_char xd_override __initdata, xd_type __initdata;
+static u_char xd_override __initdata = 0, xd_type __initdata = 0;
 static u_short xd_iobase = 0x320;
-static int xd_geo[XD_MAXDRIVES*3] __initdata;
+static int xd_geo[XD_MAXDRIVES*3] __initdata = { 0, };
 
 static volatile int xdc_busy;
 static DECLARE_WAIT_QUEUE_HEAD(xdc_wait);
diff -ur linux-2.
Re: Calling current() from interrupt context
Hi,

> The m68k port which has a interrupt stack solves the problem by loading current into a global register variable on all kernel entries.

Not all m68k cpus have an interrupt stack and it can be turned off, so we don't use it.

bye, Roman
Re: [OT] linux article with kernel references
Hi,

On Wed, 11 Oct 2000, Alan Cox wrote:

> > http://www.osopinion.com/Opinions/MontyManley/MontyManley15.html
> >
> > good article, several unfortunate truths within.
>
> Really, must be a wrong URL you posted then 8)
>
> The average Linux kernel hacker right now is late 20's to early 30's with a degree and working professionally on the kernel

The article isn't that wrong; that several people now get paid for hacking doesn't change much. It's more about proper software engineering: some people call it "a matter of taste" and are born with it, but for other people it's a hard learning process.

bye, Roman
Re: test10-pre1 problems on 4-way SuperServer8050
Hi,

> How? If you compile with egcs-2.91.66 without frame pointers on ix86 then __builtin_return_address() yields garbage. Does anybody have a generic solution to this problem, other than "compile with frame pointers"? Or is it fixed in newer versions of gcc?

Are you sure? I just tried it with 2.91.66 and it works. With -fomit-frame-pointer only __builtin_return_address(0) works, but that is true for any version.

bye, Roman
Re: Updated Linux 2.4 Status/TODO List (from the ALS show)
Hi,

On Fri, 13 Oct 2000, Richard Henderson wrote:

> Either that or adjust how we do atomic operations. I can do 64-bit atomic widgetry, but not with the code as written.

It's probably more something for 2.5, but what about adding a lock argument to the atomic operations? Then sparc could use that explicit lock and everyone else simply optimizes it away. That would allow us to use the full 32/64 bits. What we could get is a nice generic atomic exchange command like:

	atomic_exchange(lock, ptr, old, new);

where new can be a (simple) expression which can include old. Especially for RISC systems every atomic operation in atomic.h can be replaced with this. Or if you need more flexibility, the same can be written as:

	atomic_enter(lock);
	__atomic_init(ptr, old);
	do {
		__atomic_reserve(ptr, old);
	} while (!__atomic_update(ptr, old, new));
	atomic_leave(lock);

atomic_enter/atomic_leave are either normal spinlocks or (in most cases) dummies. The other macros use either RMW instructions or special load/store instructions. Using a lock makes it a bit more difficult to use, and especially the last construction must never be required in normal drivers. On the other hand it gets way more flexible, as we are not limited to a single atomic_t anymore.

If anyone is interested in how it could look, I've put an example at http://zeus.fh-brandenburg.de/~zippel/linux/bin/atomic.tar.gz (It also includes a bit more documentation and some (a bit outdated) examples.) Somewhere I also have a patch where I use this to write a spinlock-free printk implementation, which is still interrupt and SMP safe. There are still some open issues (like ordering), but I'd like to know if there is a general interest in this.

bye, Roman
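As a rough illustration of the exchange loop described in this mail, here is a minimal userspace sketch built on C11 <stdatomic.h> instead of the proposed kernel macros. The names sketch_atomic_lock_t, sketch_atomic_update and add_one are hypothetical and do not come from the mail or the linked tarball; the lock parameter is a dummy, as it would be on an architecture with a native compare-and-swap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed lock argument. On CPUs with a
 * native compare-and-swap it is unused and can be optimized away. */
typedef struct { int unused; } sketch_atomic_lock_t;

/* Atomically replace *ptr with f(old) and return the old value.
 * "new" is computed from "old", as in the mail's atomic_exchange(). */
static uintptr_t sketch_atomic_update(sketch_atomic_lock_t *lock,
                                      _Atomic uintptr_t *ptr,
                                      uintptr_t (*f)(uintptr_t))
{
    uintptr_t old;

    (void)lock;                 /* dummy on this kind of architecture */
    assert(f != NULL);
    old = atomic_load(ptr);
    /* On failure the current value is reloaded into old and we retry. */
    while (!atomic_compare_exchange_weak(ptr, &old, f(old)))
        ;
    return old;
}

/* Example update expression: new = old + 1. */
static uintptr_t add_one(uintptr_t v)
{
    return v + 1;
}
```

On an architecture without such an instruction, the same interface could presumably take the lock around a plain load and store instead; that variant is not shown here.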
Re: rotr32 / rotl32 (wordops.h) in 2.4.x ?
Hi,

On Sun, 15 Oct 2000, Andi Kleen wrote:

> You can just use the coded out variant ((x<<n)|(x>>(sizeof(x)*8-n))) gcc is clever enough to turn it into an rotate when the CPU supports it.

Hmm, I just tried it, and there are two things one should take care of here: 1. x must be unsigned of course, and 2. only gcc 2.95 can do this with a nonconstant n.

bye, Roman
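The rotate idiom under discussion can be written out as follows. This is a sketch, assuming a fixed 32-bit operand; the helper names rotl32 and rotr32 are taken from the thread subject, not from any existing header, and whether a given compiler turns the expression into a single rotate instruction is not something this code can guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate x left by n bits, 0 < n < 32. x is unsigned, as the mail
 * requires; with a signed x the right shift could smear the sign bit. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);    /* n == 0 would shift by the full width */
    return (x << n) | (x >> (sizeof(x) * 8 - n));
}

/* Rotate x right by n bits, 0 < n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (sizeof(x) * 8 - n));
}
```

The assertion on n is there because shifting a 32-bit value by 32 is undefined in C, so a rotate count of 0 needs separate handling if it can occur.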
BLKSSZGET change will break fdisk
Hi,

I noticed that the behaviour of BLKSSZGET changed between 2.2 and 2.4. One of its users will be fdisk, as soon as it is compiled with 2.4 kernel headers, but then fdisk will no longer be usable under 2.2! My question now is: wouldn't it be better to use a new ioctl (like BLKHSSZGET) and keep the old behaviour of BLKSSZGET?

bye, Roman
Re: BLKSSZGET change will break fdisk
Hi,

> Concerning fdisk, luckily you are mistaken - its source says
>
> #if defined(BLKSSZGET) && defined(HAVE_blkpg_h)
>
> so that it will not use the broken BLKSSZGET of 2.2.

??? BLKSSZGET has exactly the same ioctl number in 2.2 and 2.4, so if I compile fdisk under 2.4 and try to use it under 2.2, it will break. I saw the above test, but that is a compile-time check, not a run-time check. Am I missing something here?

BTW the problem is a bit bigger: I tried to partition a 4GB MO disk, which breaks horribly with current fdisk under 2.2. It results in an odd partition offset and you can't access any partition. So I need a fixed fdisk. As a quick hack I simply reused BLKSSZGET, but now I have an fdisk that only works with a fixed kernel.

> [now that you make me look at this, there is a flaw in fdisk there; fixed in 2.10p]

BLKSSZGET isn't defined for fdisk.c? :) BTW sfdisk isn't fixed at all for different sector sizes.

bye, Roman
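The difference between the compile-time test and a run-time test can be sketched as below. The helpers release_at_least and running_kernel_at_least are hypothetical, not part of fdisk; the sketch assumes that the release string reported by uname(2) starts with "major.minor", and it only shows how a binary could decide at run time which BLKSSZGET behaviour to expect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: returns 1 if a release string such as
 * "2.4.0-test9" names kernel major.minor or newer, 0 if it is older,
 * and -1 if the string cannot be parsed. */
static int release_at_least(const char *release, int major, int minor)
{
    int maj, min;

    assert(release != NULL);
    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return -1;
    return maj > major || (maj == major && min >= minor);
}

/* Run-time check against the kernel that is actually running, as
 * opposed to the headers the program was compiled with. */
static int running_kernel_at_least(int major, int minor)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return release_at_least(u.release, major, minor);
}
```

A caller would fall back to the old sector size handling whenever running_kernel_at_least(2, 4) does not return 1. Giving the ioctl a new number, as suggested later in the thread, would make such a check unnecessary.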
Re: BLKSSZGET change will break fdisk
Hi,

> - BLKSSZGET added in common.h

Why don't we give BLKSSZGET a new number and make the old one obsolete? I don't think it's used anywhere, as its result is pretty useless in userspace (and even if it's used somewhere, they have to copy the define anyway). This way we don't need that version check.

bye, Roman
Re: BLKSSZGET change will break fdisk
Hi,

On Tue, 17 Oct 2000, Andries Brouwer wrote:

> But you see that one would need a new name as well, otherwise the value associated with BLKSSZGET would depend on the kernel version, and one would need version checks anyway.

We do rename structures too, and this would be similar. I'm more concerned about binary compatibility. If anyone uses BLKSSZGET (for whatever reason) its behaviour should not suddenly change. Why should one need version checks? If someone really needs that ioctl, he has to copy that define anyway, so his copy will still have the old value (and behaviour).

> I think all this is too unimportant. Almost nobody uses this stuff, and 2.4 is correct, and we may fix 2.2 one of these days, perhaps for 2.2.19, and we may fix *fdisk to correctly use it.

But I also want to patch 2.2.17 or earlier kernels, and I don't want an fdisk that breaks on these kernels.

> (By the way, have you checked that replacing get_sectorsize by an empty routine, and specifying a -b option, works well?)

No, not yet.

> (Do you know which disks have unusual sector size? So far I had only seen reports on a Fujitsu 640 MB. Have you seen other sectorsizes than 512, 1024, 2048 on non-IBM disks?)

No, I didn't see anything else.

bye, Roman
Re: 2.4.2 ext2 filesystem corruption ? (was 2.4.2: What happened?(No
Hi,

On Thu, 8 Mar 2001, God wrote:

> Look at some of the confirmation requests in windows, some ask you twice if you whish to perform an action. Even Red Hat (that I know of, others may as well), has an alias for "rm" that by default turns on confirmation. Why? Because not ALL users will know better. Sure there are warnings that you can put in a man page somewhere, but the truth is few users are actually going to READ the page. Is it there fault? Yes. But should it be so easy to lose their data over it rather then writting code to detect if said feature will work or not? ...

This is getting off topic; this has nothing to do with the kernel. You are free to do whatever you want in userspace, if you have the right capabilities. You're also free to write your own userspace tools which protect the user from any danger, but that belongs in userspace, not in the kernel. So please go to the KDE/Gnome/... guys and whine there.

bye, Roman
Re: ioremap_nocache problem?
Hi,

On Tue, 23 Jan 2001, Mark Mokryn wrote:

> ioremap_nocache does the following:
> 	return __ioremap(offset, size, _PAGE_PCD);
>
> However, in drivers/char/mem.c (2.4.0), we see the following:
>
> 	/* On PPro and successors, PCD alone doesn't always mean
> 	   uncached because of interactions with the MTRRs. PCD | PWT
> 	   means definitely uncached. */
> 	if (boot_cpu_data.x86 > 3)
> 		prot |= _PAGE_PCD | _PAGE_PWT;
>
> Does this mean ioremap_nocache() may not do the job?

ioremap creates a new mapping that shouldn't interfere with MTRR, whereas you can map an MTRR-mapped area into userspace. But I'm not sure if it's correct that no flag is set for boot_cpu_data.x86 <= 3...

bye, Roman
Re: ioremap_nocache problem?
Hi,

Timur Tabi wrote:

> I mark the page as reserved when I ioremap() it. However, if I leave it marked reserved, then iounmap() will not unmap it. If I mark it "unreserved" (i.e. reset the reserved bit), then iounmap will unmap it, but it will decrement the page counter to -1 and the whole system will crash soon thereafter.
>
> I've been asking about this problem for months, but no one has bothered to help me out.

The order is important:

	get_free_page();
	set_bit(PG_reserved, &page->flags);
	ioremap();
	...
	iounmap();
	clear_bit(PG_reserved, &page->flags);
	free_page();

Alternatively something like this should also be possible:

	get_free_page();
	ioremap();
	...
	iounmap();

	nopage() {
		...
		atomic_inc(&page->count);
		return page;
	}

But I never tried this version, so I can't guarantee anything. :)

bye, Roman
Re: [ANNOUNCE] Kernel Janitor's TODO list
Hi,

On Sun, 28 Jan 2001, Manfred Spraul wrote:

> And one more point for the Janitor's list:
> Get rid of superflous irqsave()/irqrestore()'s - in 90% of the cases either spin_lock_irq() or spin_lock() is sufficient. That's both faster and better readable.
>
> spin_lock_irq(): you know that the function is called with enabled interrupts.
> spin_lock(): can be used in hardware interrupt handlers when only one hardware interrupt uses that spinlocks (most hardware drivers), or when all hardware interrupt handler set the SA_INTERRUPT flag (e.g. rtc and timer interrupt)

This is not a bug, and it only helps to make drivers nonportable. Please don't do this.

bye, Roman
Re: [ANNOUNCE] Kernel Janitor's TODO list
Hi,

On Mon, 29 Jan 2001, Andi Kleen wrote:

> You can miss wakeups. The standard pattern is:
>
> get locks
>
> add_wait_queue(&waitqueue, &wait);
> for (;;) {
> 	if (condition you're waiting for is true)
> 		break;
> 	unlock any non sleeping locks you need for condition
> 	__set_task_state(current, TASK_UNINTERRUPTIBLE);
> 	schedule();
> 	__set_task_state(current, TASK_RUNNING);
> 	reaquire locks
> }
> remove_wait_queue(&waitqueue, &wait);

You still miss wakeups. :) Always set the task state first, then check the condition. See the wait_event*() macros you mentioned for the right order.

bye, Roman
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> This all proves that the lowest level of layering should be pretty much noting but the vectors. No callbacks, no crap like that. That's already a level of abstraction away, and should not get tacked on. Your lowest level of abstraction should be just the "area". Something like
>
> 	struct buffer {
> 		struct page *page;
> 		u16 offset, length;
> 	};
>
> 	int nr_buffers:
> 	struct buffer *array;
>
> should be the low-level abstraction.

Does it have to be vectors? What about lists? I've been thinking about this for some time now and I think lists are more flexible. At a higher level we can easily generate a list of pages, and at a lower level you can still split them up as needed. It would be basically the same structure, but you could use it everywhere with the same kind of operations.

bye, Roman
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> > Does it has to be vectors? What about lists?
>
> I'd prefer to avoid lists unless there is some overriding concern, like a real implementation issue. But I don't care much one way or the other - what I care about is that the setup and usage time is as low as possible. I suspect arrays are better for that.

I was thinking more about the higher layers. Here it's simpler to set up a list of pages which can be sent to a lower layer. In the page cache we already have per-address-space lists, so it would be very easy to use that. A lower layer can of course generate anything it wants out of this, e.g. it can generate sublists or vectors.

bye, Roman
Re: [PATCH 1/4] create mm/Kconfig for arch-independent memory options
Hi,

Dave Hansen wrote:

> diff -puN mm/Kconfig~A6-mm-Kconfig mm/Kconfig
> --- memhotplug/mm/Kconfig~A6-mm-Kconfig 2005-04-04 09:04:48.0 -0700
> +++ memhotplug-dave/mm/Kconfig 2005-04-04 10:15:23.0 -0700
> @@ -0,0 +1,25 @@
> +choice
> +	prompt "Memory model"
> +	default FLATMEM
> +	default SPARSEMEM if ARCH_SPARSEMEM_DEFAULT
> +	default DISCONTIGMEM if ARCH_DISCONTIGMEM_DEFAULT

Does this really have to be a user-visible option? Can't it be derived from other values? The help text entries are really no help at all.

bye, Roman
Re: [PATCH 1/4] create mm/Kconfig for arch-independent memory options
Hi,

On Wed, 6 Apr 2005, Dave Hansen wrote:

> On Wed, 2005-04-06 at 22:58 +0200, Roman Zippel wrote:
> > Dave Hansen wrote:
> > > --- memhotplug/mm/Kconfig~A6-mm-Kconfig 2005-04-04 09:04:48.0 -0700
> > > +++ memhotplug-dave/mm/Kconfig 2005-04-04 10:15:23.0 -0700
> > > @@ -0,0 +1,25 @@
> > > +choice
> > > +	prompt "Memory model"
> > > +	default FLATMEM
> > > +	default SPARSEMEM if ARCH_SPARSEMEM_DEFAULT
> > > +	default DISCONTIGMEM if ARCH_DISCONTIGMEM_DEFAULT
> >
> > Does this really have to be a user visible option and can't it be derived from other values? The help text entries are really no help at all.
>
> I hope that this selection will replace the current DISCONTIGMEM prompts in the individual architectures. That way, you won't get a net increase in the number of prompts.

Why is this choice needed at all? Why would one choose SPARSEMEM over DISCONTIGMEM? Help texts such as "If unsure, choose " make the complete config option pretty useless.

bye, Roman
Re: [PATCH 1/4] create mm/Kconfig for arch-independent memory options
Hi,

On Wed, 6 Apr 2005, Dave Hansen wrote:

> > Why is this choice needed at all? Why would one choose SPARSEMEM over DISCONTIGMEM?
>
> For now, it's only so people can test either one, and we don't have to try to toss DICONTIGMEM out of the kernel in fell swoop. When the memory hotplug options are enabled, the DISCONTIG option goes away, and SPARSEMEM is selected as the only option.
>
> I hope to, in the future, make the options more like this:
>
> config MEMORY_HOTPLUG...
> config NUMA...
>
> config DISCONTIGMEM
> 	depends on NUMA && !MEMORY_HOTPLUG
>
> config SPARSEMEM
> 	depends on MEMORY_HOTPLUG || OTHER_ARCH_THING
>
> config FLATMEM
> 	depends on !DISCONTIGMEM && !SPARSEMEM

I was hoping for this too. In the meantime, can't you simply make it a suboption of DISCONTIGMEM? Then an extra option is only visible when it's enabled, and most people can ignore it completely by just disabling a single option.

> > Help texts such as "If unsure, choose " make the complete config option pretty useless.
>
> They don't make it useless, they just guide a clueless user to the right place, without them having to think about it at all. Those of us that need to test the various configurations are quite sure of what we're doing, and can ignore the messages. :)
>
> I'm not opposed to creating some better help text for those things, I'm just not sure that we really need it, or that it will help end users get to the right place. I guess more explanation never hurt anyone.

Some basic explanation with a link for more information can't hurt.

bye, Roman
Re: Kernel SCM saga..
Hi, On Thu, 7 Apr 2005, Linus Torvalds wrote: > I really disliked that in BitKeeper too originally. I argued with Larry > about it, but Larry (correctly, I believe) argued that efficient and > reliable distribution really requires the concept of "history is > immutable". It makes replication much easier when you know that the known > subset _never_ shrinks or changes - you only add on top of it. The problem is you pay a price for this. There must be a reason developers were adding another GB of memory just to run BK. Preserving the complete merge history does indeed make repeated merges simpler, but it builds up complex meta data, which has to be managed forever. I doubt that this is really an advantage in the long term. I expect that we were better off serializing changesets in the main repository. For example bk does something like this: A1 -> A2 -> A3 -> BM \-> B1 -> B2 --^ and instead of creating the merge changeset, one could merge them like this: A1 -> A2 -> A3 -> B1 -> B2 This results in a simpler repository, which is more scalable and which is easier for users to work with (e.g. binary bug search). The disadvantage would be it will cause more minor conflicts, when changes are pulled back into the original tree, but which should be easily resolvable most of the time. I'm not saying with this that the bk model is bad, but I think it's a problem if it's the only model applied to everything. > The thing is, cherry-picking very much implies that the people "up" the > foodchain end up editing the work of the people "below" them. The whole > reason you want cherry-picking is that you want to fix up somebody elses > mistakes, ie something you disagree with. > > That sounds like an obviously good thing, right? Yes it does. > > The problem is, it actually results in the wrong dynamics and psychology > in the system. First off, it makes the implicit assumption that there is > an "up" and "down" in the food-chain, and I think that's wrong. 
These dynamics do exists and our tools should be able to represent them. For example when people post patches, they get reviewed and often need more changes and bk doesn't really help them to redo the patches. Bk helped you to offload the cherry-picking process to other people, so that you only had to do cherry-collecting very efficiently. Another prime example of cherry-picking is Andrews mm tree, he picks a number of patches which are ready for merging and forwards them to you. Our current basic development model (at least until a few days ago) looks something like this: linux-mm -> linux-bk -> linux-stable Ideally most changes would get into the tree via linux-mm and depending on depending various conditions (e.g. urgency, review state) it would get into the stable tree. In practice linux-mm is more an aggregation of patches which need testing and since most bk users were developing against linux-bk, it got a lot less testing and a lot of problems are only caught at the next stage. Changes from the stable tree would even flow in the opposite direction. Bk supports certain aspects of the kernel development process very well, but due its closed nature it was practically impossible to really integrate it fully into this process (at least for anyone outside BM). In the short term we probably are in for a tough ride and we take whatever works best for you, but in the long term we need to think about how SCM fits into our kernel development model, which includes development, review, testing and releasing of kernel changes. This is more than just pulling and merging kernel trees. I'm aiming at a tool that can also support Andrews work, so that he can also better offload some of this work (and take a break sometimes :) ). Unfortunately every existing tool I know of is lacking in its own way, so we still have some way to go... 
bye, Roman - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
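To illustrate the "binary bug search" point from the message above: in a serialized history such as A1 -> A2 -> A3 -> B1 -> B2 every changeset has exactly one predecessor, so locating the first bad changeset is an ordinary binary search. The following is only a minimal sketch under stated assumptions. The names `first_bad_changeset`, `is_bad_fn` and `bad_from` are hypothetical and not part of bk or any other tool, the first changeset is assumed good, the last one bad, and once the bug appears it is assumed to stay.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test hook: nonzero if the tree at changeset `idx` shows the bug. */
typedef int (*is_bad_fn)(size_t idx, void *ctx);

/*
 * Find the first bad changeset in a linear history of `n` changesets.
 * Assumes changeset 0 is good, changeset n - 1 is bad, and every
 * changeset after the first bad one is bad as well.
 */
static size_t first_bad_changeset(size_t n, is_bad_fn is_bad, void *ctx)
{
	size_t good = 0;     /* highest index known to be good */
	size_t bad;          /* lowest index known to be bad */

	assert(n >= 2);
	bad = n - 1;
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (is_bad(mid, ctx))
			bad = mid;
		else
			good = mid;
	}
	return bad;
}

/* Example hook for illustration: the bug appears at the index stored in *ctx. */
static int bad_from(size_t idx, void *ctx)
{
	return idx >= *(size_t *)ctx;
}
```

With the five changesets above numbered 0 to 4, a bug introduced by B1 (index 3) is found after two test builds. With a merge changeset in the graph the same search first has to decide which parent to follow, which is the extra complexity the message argues against.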
Re: Kernel SCM saga..
Hi,

On Fri, 8 Apr 2005, Tupshin Harper wrote:

> > A1 -> A2 -> A3 -> B1 -> B2
> >
> > This results in a simpler repository, which is more scalable and which is
> > easier for users to work with (e.g. binary bug search).
> > The disadvantage would be it will cause more minor conflicts, when changes
> > are pulled back into the original tree, but which should be easily
> > resolvable most of the time.
> >
> Both darcs and arch (and arch's siblings) have ways of maintaining the
> complete history but speeding up operations.

Please show me how you would do a binary search with arch.

I don't really like the arch model, it's far too restrictive and it's jumping through hoops to get to an acceptable speed. What I expect from a SCM is that it maintains both a version index of the directory structure and a version index of the individual files. Arch makes it especially painful to extract this data quickly. For the common cases it throws disk space at the problem and does a lot of caching, but there are still enough problems (e.g. annotate), which require scanning of lots of tarballs.

bye, Roman
Re: Kernel SCM saga..
Hi,

On Fri, 8 Apr 2005, Linus Torvalds wrote:

> Also, I suspect that BKCVS actually bothers to get more details out of a
> BK tree than I cared about. People have pestered Larry about it, so BKCVS
> exports a lot of the nitty-gritty (per-file comments etc) that just
> doesn't actually _matter_, but people whine about. Me, I don't care. My
> sparse-conversion just took the important parts.

As soon as you want to synchronize and merge two trees, you will know why this information does matter.

(/me looks closer at the sparse-conversion...)

It seems you exported the complete parent information and this is exactly the "nitty-gritty" I was "whining" about and which is not available via bkcvs or bkweb and it's the most crucial information to make the bk data useful outside of bk. Larry was previously very clear about this that he considers this proprietary bk meta data and anyone attempting to export this information is in violation with the free bk licence, so you indeed just took the important parts and this is/was explicitly verboten for normal bk users.

bye, Roman
Re: Kernel SCM saga..
Hi,

On Fri, 8 Apr 2005, Linus Torvalds wrote:

> Yes. Per-file history is expensive in git, because if the way it is
> indexed. Things are indexed by tree and by changeset, and there are no
> per-file indexes.
>
> You could create per-file _caches_ (*) on top of git if you wanted to make
> it behave more like a real SCM, but yes, it's all definitely optimized for
> the things that _I_ tend to care about, which is the whole-repository
> operations.

Per-file history is also expensive for another reason. The basic reason is that I think that a hash based storage is not the best approach for SCM. It's lacking locality, so the more it grows the more it has to seek to collect all the data.

To reduce the space usage you could replace the parent file with a sha1 reference + delta to the new file. This is basically what monotone does and might cause performance problems if you need to restore old versions (e.g. if you want to annotate a file).

bye, Roman
Re: Kernel SCM saga..
Hi,

On Sat, 9 Apr 2005, Eric D. Mudama wrote:

> > For example bk does something like this:
> >
> >    A1 -> A2 -> A3 -> BM
> >       \-> B1 -> B2 --^
> >
> > and instead of creating the merge changeset, one could merge them like
> > this:
> >
> >    A1 -> A2 -> A3 -> B1 -> B2
> >
> > This results in a simpler repository, which is more scalable and which
> > is easier for users to work with (e.g. binary bug search).
> > The disadvantage would be it will cause more minor conflicts, when changes
> > are pulled back into the original tree, but which should be easily
> > resolvable most of the time.
>
> The kicker comes that B1 was developed based on A1, so any test
> results were based on B1 being a single changeset delta away from A1.
> If the resulting 'BM' fails testing, and you've converted into the
> linear model above where B2 has failed, you lose the ability to
> isolate B1's changes and where they came from, to revalidate the
> developer's results.

What good does it do if you can revalidate the original B1? The important point is that the end result works and if it only fails in the merged version you have a big problem. The serialized version gives you the chance to test whether it fails in B1 or B2.

> I believe that flattening the change graph makes history reproduction
> impossible, or alternately, you are imposing on each developer to test
> the merge results at B1 + A1..3 before submission, but in doing so,
> the test time may require additional test periods etc and with
> sufficient velocity, might never close.

The merge result has to be tested either way, so I'm not exactly sure, what you're trying to say.

bye, Roman
Re: Code snippet to reconstruct ancestry graph from bk repo
Hi,

On Sun, 10 Apr 2005, Paul P Komkoff Jr wrote:

> (borrowed from Tommi Virtanen)
>
> Code snippet to reconstruct ancestry graph from bk repo:
> bk changes -end':I: $if(:PARENT:){:PARENT:$if(:MPARENT:){ :MPARENT:}}
> $unless(:PARENT:){-}' |tac
>
> format is:
> newrev parent1 [parent2]
> parent2 present if merge occurs.

I know that this is possible and Larry's response would have been something like this:
http://www.ussg.iu.edu/hypermail/linux/kernel/0502.1/0248.html

bye, Roman
Re: share/private/slave a subtree - define vs enum
Hi,

On Mon, 11 Jul 2005, Horst von Brand wrote:

> > I don't generally disagree with that, I just think that defines are not
> > part of that list.
>
> Covered in "bad coding style" and "hard to read code", at least.

Somehow I missed the last lkml debate about where simple defines were a problem.

> > Look, it's great that you do reviews, but please keep in mind it's the
> > author who has to work with code and he has to be primarily happy with,
> > so you don't have to point out every minor issue.
>
> Wrong. The author has to work with the code, but there are much more people
> that have to read it now and fix it in the future. It doesn't make sense
> having everybody using their own indentation style, variable naming scheme,
> and ways of defining constants.

I didn't say this, I said "minor issues". Please read more carefully.

bye, Roman
Re: dependency bug in gconfig?
Hi,

On Tue, 12 Jul 2005, randy_dunlap wrote:

> This appears to be a dependency bug in gconfig to me.
>
> If I enable NETCONSOLE to y, NETPOLL becomes y. (OK)
> If I then disable NETCONSOLE to n, NETPOLL remains y.
>
> If I enable NETCONSOLE to m, NETPOLL remains n. Why is that?
>
> config NETPOLL
> 	def_bool NETCONSOLE
>
> Should this cause NETCONSOLE to track NETPOLL?

It should (although it doesn't show it immediately). Did you compare the saved config files?

bye, Roman
Re: [PATCH 18/19] Kconfig I18N: LKC: whitespace removing
Hi,

On Wed, 13 Jul 2005, Egry Gábor wrote:

> diff -puN scripts/kconfig/zconf.l~kconfig-i18n-18-whitespace-fix scripts/kconfig/zconf.l
> --- linux-2.6.13-rc3-i18n-kconfig/scripts/kconfig/zconf.l~kconfig-i18n-18-whitespace-fix 2005-07-13 18:32:20.0 +0200
> +++ linux-2.6.13-rc3-i18n-kconfig-gabaman/scripts/kconfig/zconf.l 2005-07-13 18:32:20.0 +0200
> @@ -57,6 +57,17 @@ void append_string(const char *str, int
>  	*text_ptr = 0;
>  }
>
> +void append_helpstring(const char *str, int size)
> +{
> +	while (size) {
> +		if ((str[size-1] != ' ') && (str[size-1] != '\t'))
> +			break;
> +		size--;
> +	}
> +
> +	append_string (str, size);
> +}
> +
>  void alloc_string(const char *str, int size)
>  {
>  	text = malloc(size + 1);
> @@ -225,7 +236,7 @@ n [A-Za-z0-9_]
>  		append_string("\n", 1);
>  	}
>  [^ \t\n].* {
> -		append_string(yytext, yyleng);
> +		append_helpstring(yytext, yyleng);
>  		if (!first_ts)
>  			first_ts = last_ts;
>  	}

Simply integrate the function into the caller.

bye, Roman
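The trimming done by the quoted append_helpstring() is small enough to sit directly in the lexer action, which is what the review asks for. Below is a minimal sketch of just that logic. It is written as a standalone helper only so that it can be compiled and tested outside of zconf.l, and the name `trimmed_length` is made up for this illustration and does not exist in the kconfig sources.

```c
#include <assert.h>

/*
 * Return the length of str[0..size) once trailing spaces and tabs are
 * dropped. Same loop as in the quoted patch, with the two tests folded
 * into the loop condition.
 */
static int trimmed_length(const char *str, int size)
{
	assert(size >= 0);
	while (size > 0 && (str[size - 1] == ' ' || str[size - 1] == '\t'))
		size--;
	return size;
}
```

Inside the rule action the same loop would simply shrink a local copy of yyleng before the existing append_string(yytext, ...) call, so no second append function is needed.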
Re: [PATCH 4/19] Kconfig I18N: lxdialog: multibyte character support
Hi,

On Wed, 13 Jul 2005, Egry Gábor wrote:

> UTF-8 support for lxdialog with wchar. The installed wide ncurses
> (ncursesw) is optional because some languages (ex. English, Italian)
> and ISO 8859-xx charsets don't require this patch.

This is ugly, this just adds lots of #ifdefs with practically duplicated code. Please use some wrapper functions/macros.

bye, Roman
Re: [PATCH 0/19] Kconfig I18N completion
Hi,

On Wed, 13 Jul 2005, Egry Gábor wrote:

> The following patches complete the "Kconfig I18N support" patch by
> Arnaldo.

First I'd really like to see some documentation on this, which describes the interface how tools/distributions can provide Kconfig I18N support.

> - answering (Y/M/N)

This one is just silly. Provide a nice helptext, which describes what that means, for xconfig I'm also accepting nice descriptive icons.

bye, Roman
Re: Merging relayfs?
Hi,

On Mon, 11 Jul 2005, Andrew Morton wrote:

> > > Hi Andrew, can you please merge relayfs? It provides a low-overhead
> > > logging and buffering capability, which does not currently exist in
> > > the kernel.
> >
> > While the code is pretty nicely in shape it seems rather pointless to
> > merge until an actual user goes with it.
>
> Ordinarily I'd agree. But this is a bit like kprobes - it's a funny thing
> which other kernel features rely upon, but those features are often ad-hoc
> and aren't intended for merging.

I agree with Christoph, I'd like to see a small (and useful) example included, which can be used as reference. relayfs clients still need some code of their own to communicate with user space. If I look at the example code I'm not really sure netlink is a good way to go as control channel. kprobes has a rather simple interface, relayfs is more complex and I think it's a good idea to provide some sane and complete example code to copy from.

Looking through the patch there are still a few areas I'm concerned about:
- the usage of atomic_t looks a little silly, there is only a single writer and it probably needs some cache line optimisations
- I would prefer "unsigned int" over just "unsigned"
- the padding/commit arrays can be easily managed by the client
- overwrite mode can be implemented via the buffer switch callback

In general I'm not against merging, but I have a few ideas for further cleanups/optimisations and it really would help to have some useful example code (e.g. a _simple_ event tracer).

bye, Roman
Re: [RFC][PATCH 0/4] new human-time soft-timer subsystem
Hi,

On Thu, 14 Jul 2005, Nishanth Aravamudan wrote:

> We no longer use jiffies (the variable) as the basis for determining
> what "time" a timer should expire or when it should be added. Instead,
> we use a new function, do_monotonic_clock(), which is simply a wrapper
> for getnstimeofday().

And suddenly a simple 32bit integer becomes a complex 64bit integer, which requires hardware access to read a timer and additional conversion into ns. Why is suddenly everyone so obsessed with molesting something simple and cute as jiffies?

bye, Roman
Re: [PATCH] [0/5+1] menu -> menuconfig part 1
Hi,

On Sun, 17 Jul 2005, Bodo Eggert wrote:

> These patches change some menus into menuconfig options.
>
> Reworked to apply to linux-2.6.13-rc3-git3

I like it, but I would prefer to first give it a bit more exposure in -mm, as it does change the menu structure and the behaviour is a little different, so I'd like to see if there's some feedback first from people using it.

bye, Roman
Re: Merging relayfs?
Hi,

On Thu, 14 Jul 2005, Tom Zanussi wrote:

> The netlink control channel seems to work very well, but I can
> certainly change the examples to use something different. Could you
> suggest something?

It just looks like a complicated way to do an ioctl, a control file that you can read/write would be a lot simpler and faster.

> > Looking through the patch there are still a few areas I'm concerned about:
> > - the usage of atomic_t look a little silly, there is only a single
> > writer and probably needs some cache line optimisations
>
> The only things that are atomic are the counts of produced and
> consumed buffers and these are only ever updated or read in the slow
> buffer-switch path. They're atomic because if they weren't, wouldn't
> it be possible for the client to read an unfinished value if the
> producer was in the middle of updating it?

No.

> > - I would prefer "unsigned int" over just "unsigned"
> > - the padding/commit arrays can be easily managed by the client
>
> Yes, I can move them out and update the examples to reflect that, but
> I thought that if this was something that most clients would need to
> do, it made some sense to keep it in relayfs and avoid duplication in
> the clients.

If a lot of clients need this, there are different ways to do this, e.g. by introducing some helper functions that clients can use. This way you can keep the core simple and allow the client to modify its behaviour.

> > - overwrite mode can be implemented via the buffer switch callback
>
> The buffer switch callback is already where this is handled, unless
> you're thinking of something else - one of the first checks in the
> buffer switch is relay_buf_full(), which always returns 0 if the
> buffer is in overwrite mode.

I mean, relayfs doesn't have to know about this, the client itself can do it (e.g. via helper functions).

bye, Roman
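The suggestion above, that the core need not know about overwrite mode and that clients can pick the behaviour through the buffer switch callback plus optional helpers, can be sketched roughly as follows. This is a hypothetical, user-space illustration and not the relayfs API: `struct chan`, `chan_switch`, `switch_no_overwrite`, `switch_overwrite` and `N_SUBBUFS` are all invented names, and locking and memory ordering between reader and writer are deliberately left out.

```c
#include <assert.h>

#define N_SUBBUFS 4u   /* assumed number of sub-buffers per channel */

struct chan;

/* Client hook: may the writer start on the next sub-buffer? */
typedef int (*switch_fn)(struct chan *c);

struct chan {
	unsigned int produced;   /* sub-buffers completed by the single writer */
	unsigned int consumed;   /* sub-buffers released by the reader */
	switch_fn on_switch;     /* policy supplied by the client */
};

/* All sub-buffers hold unread data. */
static int chan_full(const struct chan *c)
{
	return c->produced - c->consumed >= N_SUBBUFS;
}

/* Optional helper: stop writing when the reader has fallen behind. */
static int switch_no_overwrite(struct chan *c)
{
	return !chan_full(c);
}

/* Optional helper: flight-recorder style, give up the oldest sub-buffer. */
static int switch_overwrite(struct chan *c)
{
	if (chan_full(c))
		c->consumed++;
	return 1;
}

/* Core buffer switch: it has no notion of "modes", it only asks the client. */
static int chan_switch(struct chan *c)
{
	assert(c->on_switch != 0);
	c->produced++;              /* the current sub-buffer is complete */
	return c->on_switch(c);
}
```

A client that wants the overwrite behaviour installs `switch_overwrite`, one that wants to drop new data installs `switch_no_overwrite`, and the core routine stays the same in both cases.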
Re: [RFC - 0/12] NTP cleanup work (v. B4)
Hi,

On Fri, 15 Jul 2005, john stultz wrote:

> In my attempts to rework the timeofday subsystem, it was suggested I
> try to avoid mixing cleanups with functional changes. In response to the
> suggestion I've tried to break out the majority of the NTP cleanups I've
> been working out of my larger patch and try to feed it in piece meal.
>
> The goal of this patch set is to isolate the in kernel NTP state machine
> in the hope of simplifying the current timeofday code.

I don't really like where you've taken it with ntp_advance(). With these patches you put half the ntp state machine in there and execute it at every single tick. From the previous patches I can guess where you want to go with this, but I think it's the wrong direction.

The code is currently as is for a reason, it's optimized for tick based systems and I don't see a reason to change this for tick based systems. If you want to change this for cycle based systems, you have to give more control to the arch/timer source, which simply calls a different set of functions, and the ntp core system basically just acts as a library to the timer source. Tick based timer sources continue to update xtime and cycle based systems will modify the cycle multiplier (e.g. what ppc64 does).

Don't force everything to use the same set of functions, you'll only make it more complex. Larger ntp state updates don't have to be done more than once a second, and leave the details of how the clock is updated to the clock source (just provide some library functions it can use).

bye, Roman
Re: Merging relayfs?
Hi,

On Mon, 18 Jul 2005, Steven Rostedt wrote:

> I'm actually very much against this. Looking at a point of view from the
> logdev device. Having a callback to know to continue at every buffer
> switch would just be slowing down something that is expected to be very
> fast.

What exactly would be slowed down? It would just move around some code and even avoid the overwrite mode check.

> I don't see the problem with having an overwrite mode or not. Why
> can't relayfs know this?

The point is to design a simple and flexible relayfs layer, which means not every possible function has to be done in the relayfs layer, as long it's flexible enough to build additional functionality on top of it (for which it can again provide some library functions).

bye, Roman
Re: Merging relayfs?
Hi,

On Mon, 18 Jul 2005, Steven Rostedt wrote:

> > What exactly would be slowed down?
> > It would just move around some code and even avoid the overwrite mode
> > check.
>
> Yes, you're adding a jump to another function via a function pointer,
> that would kill the cache line of execution, to avoid a simple check, or
> some other way of handling it.

RTFS. (deliver_default_callback)

> Since I don't want to know the internals
> of relayfs,

You have to anyway, currently relayfs clients need some knowledge about how buffers are managed.

> the overwrite mode could be implemented in a more officient way.

I wouldn't call the buffer switch routine efficient, yet.

> > > I don't see the problem with having an overwrite mode or not. Why
> > > can't relayfs know this?
> >
> > The point is to design a simple and flexible relayfs layer, which means
> > not every possible function has to be done in the relayfs layer, as long
> > it's flexible enough to build additional functionality on top of it (for
> > which it can again provide some library functions).
>
> The overwrite mode isn't that complex. You don't want to make something
> so flexible that it becomes more complex. Assembly is more flexible
> than C but I wouldn't want to code a lot with it. A library function
> for me is out of the question, since what I build on top of relayfs is
> mostly in the kernel. The overwrite mode would then have to be
> implemented through another kernel activity. I might as well keep my
> own ring buffers and forget about using relayfs, and all my points in
> which I argue for it being merged is mute.

I must admit I have no clue, what you're talking about here... The keywords above are "_simple_ _and_ _flexible_".

bye, Roman
Re: Merging relayfs?
Hi,

On Mon, 18 Jul 2005, Karim Yaghmour wrote:

> I guess I just don't get the point here. Why cut something away if many
> users will need it. If it's that popular that you're ready to provide a
> library function to do it, then why not just leave it to boot? One of the
> goals of relayfs is to avoid code duplication with regards to buffering
> in general.

The road to bloatness is paved with lots of little features. There aren't that many users anyway (none of the examples use that feature). I'd prefer to concentrate on a simple and correct relayfs layer and we can still think about other features as more users appear. Starting a design by implementing every little feature which _might_ be needed is a really bad idea.

bye, Roman
Re: [RFC - 0/12] NTP cleanup work (v. B4)
Hi,

On Wed, 20 Jul 2005, john stultz wrote:

> I really don't think the NTP changes I've mailed is very complex.
> Please, be specific and point to something you think is an issue and
> I'll do my best to fix it.

Maybe I should explain in what direction I would take it.

Let's first only take tick based updates. One property I don't want to see go away (and which you remove in the last patch) is to basically update xtime at every tick by (tick_nsec+time_adj) (and maybe fold time_adjust into time_adj), no multiply/divide, just adds/shifts. Every second (or maybe even less frequently) we update time_adj, where we even might integrate a better way to add previous errors due to SHIFT_HZ.

To add support for continuous time sources, the generic ntp code would just provide [tick,frequency,offset] values and the time source converts it into its internal values. A tick based source calculates [tick_nsec, time_adj] and a continuous source calculates the [offset,multiplier]. These values should be recalculated as infrequently as possible and not every single tick as you do with ppc_adjtimex. This also means a continuous source updates xtime basically by calling gettimeofday (what ppc64 already almost does) and doesn't use update_wall_time() at all.

Maybe I'm missing something, but I don't see a reason to forcibly merge both ways to update the clock. Keep them separate and let the generic ntp code provide the basic parameters which the time source uses to update the clock. The important thing is to precalculate as much as possible, so that the runtime overhead is as low as possible, and these precalculations differ between time sources, so what your patches basically do is to remove all of these precalculations and I can't convince myself to see this as a good thing.

BTW do you have any user space test code for this? This might be useful to verify that the changes are really correct and a prototype might be a good way to demonstrate the kernel changes.

bye, Roman
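The tick based scheme described above (precalculate once a second, then only add per tick) can be illustrated with a small user-space sketch. It is not the kernel's code: the struct, the function names, the tick rate `SKETCH_HZ` and the 32 bit fractional shift are assumptions chosen for the example, and the frequency adjustment is passed in as plain nanoseconds per second rather than in the kernel's scaled units.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ    100                      /* assumed tick rate */
#define FRAC_SHIFT   32                       /* assumed fractional ns bits */
#define NSEC_PER_SEC INT64_C(1000000000)

struct clock_sketch {
	int64_t sec;         /* whole seconds */
	int64_t nsec_frac;   /* nanoseconds << FRAC_SHIFT within the second */
	int64_t tick_len;    /* precalculated advance per tick, same scale */
};

/* Slow path, run once a second: the only place that divides. */
static void second_update(struct clock_sketch *c, int64_t adj_ns_per_sec)
{
	int64_t second_len;

	assert(adj_ns_per_sec > -NSEC_PER_SEC && adj_ns_per_sec < NSEC_PER_SEC);
	second_len = (NSEC_PER_SEC + adj_ns_per_sec) * ((int64_t)1 << FRAC_SHIFT);
	c->tick_len = second_len / SKETCH_HZ;
}

/* Fast path, run at every tick: one add, one compare. */
static void tick_update(struct clock_sketch *c)
{
	const int64_t one_sec = NSEC_PER_SEC * ((int64_t)1 << FRAC_SHIFT);

	c->nsec_frac += c->tick_len;
	if (c->nsec_frac >= one_sec) {
		c->nsec_frac -= one_sec;
		c->sec++;
	}
}
```

A continuous time source would use the same once-a-second step to derive its own parameters (an offset and a multiplier) instead of `tick_len`, which is the split between generic ntp code and time source that the message argues for.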
Re: [PATCH] Add schedule_timeout_{interruptible,uninterruptible}{,_msecs}() interfaces
Hi,

On Fri, 22 Jul 2005, Arjan van de Ven wrote:

> Also I'd rather not add the non-msec ones... either you're raw and use
> HZ, or you are "cooked" and use the msec variant.. I dont' see the point
> of adding an "in the middle" one. (Yes this means that several users
> need to be transformed to msecs but... I consider that progress ;)

What's wrong with using jiffies? It's simple and the current timeout system is based on it. Calling it something else doesn't suddenly give you more precision.

bye, Roman
Re: [PATCH] Add schedule_timeout_{interruptible,uninterruptible}{,_msecs}() interfaces
Hi,

On Sat, 23 Jul 2005, Arjan van de Ven wrote:

> > What's wrong with using jiffies?
>
> A lot of the (driver) users want a wallclock based timeout. For that,
> miliseconds is a more obvious API with less chance to get the jiffies/HZ
> conversion wrong by the driver writer.

We have helper functions for that. The point about using jiffies is to make it _very_ clear, that the timeout is imprecise and for most users this is sufficient.

bye, Roman
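The precision argument in the two messages above can be made concrete with a small conversion sketch. This is not the kernel's msecs_to_jiffies() implementation. `sketch_msecs_to_jiffies` is a hypothetical helper that takes the tick rate as a parameter and rounds up, so that a timeout is never shorter than requested.

```c
#include <assert.h>

/*
 * Convert a millisecond timeout into timer ticks for a given tick rate,
 * rounding up. The result shows the real granularity of the timeout.
 */
static unsigned long long sketch_msecs_to_jiffies(unsigned int msecs,
						  unsigned int hz)
{
	assert(hz > 0);
	return ((unsigned long long)msecs * hz + 999) / 1000;
}
```

With a tick rate of 100, requests for 1 ms and 10 ms both become one tick, and 15 ms becomes two ticks (20 ms). Only at a tick rate of 1000 does 15 ms map to 15 ticks. A millisecond interface therefore changes the unit in which the caller writes the timeout, not the resolution of the timer behind it.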