from:"Ulrich Windl"

3.0.38: strange boot message: "Time: 165:165:165 Date: 165/165/65" with Xen

2012-09-26 Thread Ulrich Windl

Hi!

I just discovered a strange "<6>[0.123867] Time: 165:165:165  Date: 
165/165/65" boot message in a Xen DomU VM for SLES11 SP2 on AMD Opteron 
(x86_64). The context is:

...
<6>[0.080197] Initializing cgroup subsys net_cls
<6>[0.080199] Initializing cgroup subsys blkio
<6>[0.080204] Initializing cgroup subsys perf_event
<4>[0.080245] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
<4>[0.080245] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(
8)
<6>[0.080293] SMP alternatives: switching to UP code
<6>[0.103716] Brought up 1 CPUs
<6>[0.103788] devtmpfs: initialized
<6>[0.103977] print_constraints: dummy:
<6>[0.123867] Time: 165:165:165  Date: 165/165/65
<6>[0.123908] NET: Registered protocol family 16
<6>[0.124081] SMP alternatives: switching to SMP code
<6>[0.150019] Brought up 4 CPUs
<3>[0.150019] PCI: Fatal: No config space access function found
<6>[0.150019] PCI: setting up Xen PCI frontend stub
...

Maybe it's related to this hypervisor message (xm dmesg) in Xen Dom0 (but the 
RTC is at I/O port 0x70, right?):
(XEN) mm.c:833:d1 Non-privileged (1) attempt to map I/O space 00f0
(XEN) mm.c:833:d3 Non-privileged (3) attempt to map I/O space 00f0
(XEN) mm.c:833:d2 Non-privileged (2) attempt to map I/O space 00f0
(XEN) mm.c:833:d1 Non-privileged (1) attempt to map I/O space 00f0
(XEN) mm.c:833:d2 Non-privileged (2) attempt to map I/O space 00f0

It's not a big issue, but if the RTC cannot be read, it would be better to skip 
the message rather than output a wrong date/time.
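
A minimal sketch of the kind of plausibility check meant here, written in Python for brevity. The function name and the accepted ranges are assumptions for illustration, not the kernel's code:

```python
def rtc_fields_plausible(hour, minute, second, month, day):
    """Return True if the decoded clock fields could be a real time of day."""
    return (0 <= hour <= 23 and 0 <= minute <= 59 and 0 <= second <= 59
            and 1 <= month <= 12 and 1 <= day <= 31)


# The values from the boot message above fail the check.
print(rtc_fields_plausible(165, 165, 165, 165, 165))
# An ordinary timestamp passes.
print(rtc_fields_plausible(10, 15, 0, 9, 26))
```

With such a check the message could be suppressed whenever any field is out of range.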

Regards,
Ulrich


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

read stalls with large RAM: transparent huge pages, dirty buffers, or I/O (block) scheduler?

2013-10-10 Thread Ulrich Windl

Hi!

We are running some x86_64 servers with large RAM (128GB). To put that in 
perspective: at a memory speed of a little more than 9GB/s it takes more than 
10 seconds to read all RAM...

In the past and recently we had problems with read() stalls when the kernel was 
writing back large amounts (like 80GB) of dirty buffers to a somewhat slow 
(40MB/s) device. The problem is old and well-known, it seems, but not really 
solved.
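
The orders of magnitude can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted above; binary units (1GB = 1024MB) are an assumption:

```python
GB = 1024 ** 3
MB = 1024 ** 2


def seconds_to_transfer(size_bytes, rate_bytes_per_s):
    """Time needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s


ram_scan = seconds_to_transfer(128 * GB, 9 * GB)
writeback = seconds_to_transfer(80 * GB, 40 * MB)
print(f"scan 128GB of RAM at 9GB/s: {ram_scan:.1f} s")
print(f"flush 80GB at 40MB/s: {writeback:.0f} s ({writeback / 60:.0f} min)")
```

So a full backlog of dirty buffers keeps the slow device busy for roughly half an hour, which is the window in which reads can stall.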

One recommendation was to limit the amount of dirty buffers, which actually did 
not help to really avoid the problem, specifically if new dirty buffers are 
used as soon as they are available (i.e.: some were flushed). I had success 
with limiting the used memory (including dirty pages) with control groups 
(memory:iothrottle, SLES11 SP2), but the control framework (rccgconfig setting 
up proper rights for /sys/fs/cgroup/mem/iothrottle/tasks) is quite incomplete 
(no group write permission or ACL setup possible), so the end user can hardly 
use that.
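
For reference, the system-wide knobs usually meant by "limit the amount of dirty buffers" are vm.dirty_bytes and vm.dirty_ratio. A rough way to size such a limit is to ask how much backlog the slowest device can flush within an acceptable time. The helper name and the 5 second budget below are assumptions for illustration:

```python
MB = 1024 ** 2


def dirty_bytes_for(write_rate_bytes_per_s, max_flush_seconds):
    """Largest dirty backlog the device can flush within the time budget."""
    return int(write_rate_bytes_per_s * max_flush_seconds)


# A 40MB/s device and a 5 second budget (both illustrative values).
limit = dirty_bytes_for(40 * MB, 5)
print(f"vm.dirty_bytes = {limit}")
```

Because the limit applies to the whole system and not to one device, it also throttles writers to fast disks, which is one reason it is only a work-around.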

I still don't know whether read stalls are caused by the I/O channel or device 
being saturated, or whether the kernel is waiting for unused buffers to receive 
the read data, but I learned that I/O schedulers (and possibly the block layer 
optimizations) can cause extra delays, too.

We had one situation where a single sector could not be read with direct I/O 
for 10 seconds.

Recently we had the problem again, but it was clear that it was _not_ the 
device being overloaded, nor was it the I/O channel. The read problem was 
reported for a device that was almost idle, and the I/O channel (FC) can 
handle much more than the disk system can in both directions. So the problem 
seems to be inside the kernel.

Oracle recommends (in article  1557478.1, without explaining the details) to 
turn off transparent huge pages. Before that I didn't think much about that 
feature. It seems the kernel is not just creating huge pages when they are 
requested explicitly (that's what I had thought), but also implicitly to reduce 
the number of pages to be managed. Collecting smaller pages to combine them for 
huge pages may also involve moving memory around (compaction), it seems. I 
still don't know whether the kernel will also try to compact dirty cache pages 
to huge pages, but we still see read stalls when there are many dirty pages 
(like when copying 400GB of data to a somewhat slow (30MB/s) disk).
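
To see whether transparent huge pages are active, the sysfs control file can be inspected; the active mode is the word in square brackets. The sketch below assumes the mainline path, which a distribution kernel may not use, and the function names are made up for this example:

```python
import re

# Path used by mainline kernels; an assumption for any given distribution kernel.
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text):
    """Return the bracketed (active) word from e.g. 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_ENABLED):
    """Read the current mode, or None if the file is missing or unreadable."""
    try:
        with open(path) as handle:
            return active_thp_mode(handle.read())
    except OSError:
        return None


print(active_thp_mode("[always] madvise never"))
```

Writing one of the listed words to the same file (as root) changes the mode at run time.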

Now I wonder what the real solution to the problem (not the numerous 
work-arounds) would be. Obviously simply stopping (yield) dirty buffer flush to 
give read a chance may not be sufficient when read needs to wait for unused 
pages, especially if the disks being read from are faster than those being 
written to.
To my understanding dirty pages have an "age" that is used to decide whether to 
flush them or not. Also the I/O scheduler seems to prefer read requests over 
write requests. What I do not know is whether a read request is sent to the I/O 
scheduler before buffer pages are assigned to the request, or after the pages 
were assigned. So a read request only has the chance to have an "age" once it 
entered the I/O scheduler, right?

So if both reads and writes had an "age", some EDF (earliest deadline first) 
scheduling could be used to perform I/O (which would be controlling buffer 
usage as a side-effect). For transparent huge pages, requests for a huge page 
should also have an age and a priority that is significantly below that of I/O 
buffers. If there exists an efficient algorithm and data model to perform these 
tasks, the problem may be solved.
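
A toy illustration of the EDF idea, not a model of the kernel's block layer: every request carries an absolute deadline and the one with the earliest deadline is served first. The class, the request kinds and the deadline values are all hypothetical:

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    deadline: float                     # absolute time by which it should be served
    kind: str = field(compare=False)    # "read", "write" or "hugepage"
    name: str = field(compare=False)


def edf_order(requests):
    """Return the requests in earliest-deadline-first order."""
    heap = list(requests)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


# Hypothetical deadlines: reads get tight ones, huge-page requests loose ones.
queue = [
    Request(30.0, "write", "dirty page A"),
    Request(0.5, "read", "sector 42"),
    Request(120.0, "hugepage", "compaction"),
    Request(5.0, "write", "dirty page B"),
]
for req in edf_order(queue):
    print(f"{req.deadline:6.1f}  {req.kind:8}  {req.name}")
```

Giving huge-page requests a much later deadline than I/O buffers expresses the "significantly lower priority" mentioned above without needing a separate priority field.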

Unfortunately if many buffers are dirtied at one moment and reads are requested 
significantly later, there may be an additional need for time-slices when doing 
I/O (note: I'm not talking about quotas of some MB, but quotas of time). The 
I/O throughput may vary a lot, and time seems the only way to manage latency 
correctly. To avoid a situation where reads may cause stalling writes (and thus 
the age of dirty buffers growing without bounds), the priority of writes should 
be _carefully_ increased, taking care not to create a "freight train of dirty 
buffers" to be flushed. So maybe "smuggle in" a few dirty buffers between read 
requests. As a high-level flow control (like for the cgroups mechanism), 
processes with a high amount of dirty buffers should be suspended or scheduled 
with very low priority to give the memory and I/O systems a chance to process 
the dirty buffers.

For reference: The machine in question is at 3.0.74-0.6.10-default with the 
latest SLES11 SP2 kernel being 3.0.93-0.5.

I'd like to know what the gurus think about that. I think with increasing RAM 
this issue will become extremely important soon.

Regards,
Ulrich
P.S: Not subscribed to linux-kernel, so keep me on CC:, please


Re: read stalls with large RAM: transparent huge pages, dirty buffers, or I/O (block) scheduler?

2013-10-10 Thread Ulrich Windl

I forgot to mention: CPU power is not the problem: We have 2 * 6 Cores (2 
Threads each), making 24 logical CPUs...

>>> Ulrich Windl wrote on 10.10.2013 at 10:15 in message
<52566237.478 : 161 : 60728>:

ext3 corruption in 3.0 kernel (SLES11 SP2 x86_64 (AMD Opteron))

2012-12-07 Thread Ulrich Windl

Hi!

I thought I'd let you know of two ext3 corruptions found on an AMD Opteron 
server running SLES11 SP2 (kernel-xen-3.0.42-0.7.3). Corruptions occurred at 
different times in different files on different machines: Too much to be 
ignored.

The older one looked like this:
[75548.267404] EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry 
in directory #205978: rec_len % 4 != 0 - offset=4096, inode=2531699, 
rec_len=41331, name_len=38

And a more recent one looks like this:
kernel: [261958.359401] EXT3-fs error (device dm-0): ext3_add_entry: bad entry 
in directory #85582: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
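
Both messages come from sanity checks on a directory entry's record length. The sketch below reproduces that kind of check in Python. The 8-byte header, the 4-byte alignment and the order of the tests are assumptions based on the ext2/ext3 on-disk format, not copied from the kernel source:

```python
def dir_rec_len(name_len):
    """Smallest record holding an 8-byte header plus the name, 4-byte aligned."""
    return (name_len + 8 + 3) & ~3


def check_dir_entry(rec_len, name_len):
    """Return an error string for an implausible entry, or None if it looks sane."""
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    return None


print(check_dir_entry(41331, 38))   # values from the older message
print(check_dir_entry(0, 0))        # values from the more recent message
```

An all-zero entry, as in the second message, suggests the directory block was never written or was overwritten with zeros, while the first looks like random data in the block.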

As the nodes are running Xen VMM in a cluster, it's possible that a node sees 
resets at any time (fencing), but I thought a journaling filesystem would 
either not allow or fix corruption.

In both cases I found the problem when a file could not be created, as in this 
RPM error message:
Error: RPM failed: error: unpacking of archive failed on file 
/lib/modules/3.0.42-0.7-default/kernel/drivers/media/video/cpia2/cpia2.ko;50c1fafd:
 cpio: open failed - Input/output error

After a reset I had to repair the filesystem manually with these types of errors:
Inode 248552 was part of the orphaned inode list.  FIXED.
Block bitmap differences:
Free blocks count wrong for group

After repair and reboot I still saw:
kernel: [  698.061916] EXT3-fs error (device dm-0): ext3_lookup: deleted inode 
referenced: 68710
kernel: [  698.061916] EXT3-fs error (device dm-0): ext3_lookup: deleted inode 
referenced: 68711

(dm-0 is the root Logical Volume)

CPU-Details (Sun X4100 Server) are:
vendor_id   : AuthenticAMD
cpu family  : 15
model   : 33
model name  : Dual Core AMD Opteron(tm) Processor 285
stepping: 2

(I know this CPU has some bugs with virtualization; is filesystem corruption 
one of them?)

Regards,
Ulrich



Q: using cgroups in real life

2012-11-14 Thread Ulrich Windl

Hi!

I have a question on cgroups (as of Linux 3.0):
The concept is to mount a filesystem, and configure cgroups through it. This 
implies that all the files belong to root (or maybe some other fixed user).

AFAIK, you can chmod() and chown() files, but these bits are only kept in the 
i-node cache, so they may change at any time.

I think this is bad, because if you want to allow users to limit (maybe memory 
usage) by using some predefined cgroup, the user needs at least partial write 
access to that cgroup (to add the PID). Probably this also means the user could 
add any PID (even those processes not owned by him).
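
For illustration, this is all a user-space program has to do to join a (version 1) cgroup: write a PID to the group's "tasks" file. The sketch is in Python and the function names are made up; whether the write succeeds depends entirely on the file permissions discussed above:

```python
import os


def can_attach(cgroup_dir):
    """True if the calling user may write the cgroup's 'tasks' file."""
    return os.access(os.path.join(cgroup_dir, "tasks"), os.W_OK)


def attach_to_cgroup(cgroup_dir, pid=None):
    """Move a process into a cgroup by writing its PID to the 'tasks' file."""
    pid = os.getpid() if pid is None else pid
    with open(os.path.join(cgroup_dir, "tasks"), "w") as handle:
        handle.write(f"{pid}\n")
    return pid
```

A backup script could call attach_to_cgroup() on its own PID before starting the copy, provided the administrator has made the "tasks" file of the predefined group writable for that user.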

The alternative is that a privileged task manages cgroups and PIDs. This is 
difficult, for example, if the process to control does not exist yet (e.g. the 
user logs in and then starts some process). It's getting tricky if the user 
maybe runs some big fat database (which should work at peak performance), and 
later logs in to do a backup of the database (which is not time critical, and 
should not steal all the I/O bandwidth). I wonder what a solution would look 
like to allow the user to limit the bandwidth (maybe use of page cache, too) of 
the backup in a reliable way.

Being paranoid, the user should at most be able to limit his own processes. I 
cannot envision a proper solution with the current interface.

Would anybody share some good ideas with me?

(I'm not subscribed to the kernel list, so please CC:)

Regards,
Ulrich




3.0: blk-cgroup.c: allow '*' for device selection for "throttle.read_bps_device" and alike

2012-11-16 Thread Ulrich Windl

Hi!

I have a wish for Linux 3.x and blkio cgroup subsystem:
Allow to specify any device like: blkio.throttle.read_bps_device = "*:* 
41943040"

Why: With multipathing being effective, you can't predict the device number 
your device will have in advance (I'm talking about "/etc/cgconfig.conf").

Example:
# multipath -ll |grep dm-
CBW_DB_FATA-E2 (3600508b4001085dd00011380) dm-10 HP,HSV200
CBW_CI-E2 (3600508b4001085dd0001140c) dm-12 HP,HSV200
CBW_DB_FATA-E1 (3600508b4001085e3f181) dm-18 HP,HSV200
CBW_CI-E1 (3600508b4001085e3f1f6) dm-19 HP,HSV200
DP_FileLib-E2 (3600508b4001085dd000114be) dm-16 HP,HSV200
CBW_DB_Exe-E2 (3600508b4001085dd0001137b) dm-11 HP,HSV200
CBW_DB_Exe-E1 (3600508b4001085e3f17e) dm-17 HP,HSV200
DP_DB_10k-E2 (3600508b4001085dd00011448) dm-15 HP,HSV200
CBW_DB_BTD-E2 (3600508b4001085dd000115ad) dm-14 HP,HSV200
DP_DB_10k-E1 (3600508b4001085e3f246) dm-20 HP,HSV200
CBW_DB_10k-E2 (3600508b4001085dd000115a5) dm-9 HP,HSV200
CBW_DB_10k-E1 (3600508b4001085e3f35f) dm-13 HP,HSV200

The code is in blkio_policy_parse_and_set() of blk-cgroup.c.
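
Until something like a wildcard exists, a work-around is to generate the entries at run time: map each multipath alias to its current dm-N name, look up the device number (for example from /sys/block/dm-N/dev), and write one entry per device. A sketch of the parsing and formatting steps in Python; the function names and the major number 253 are placeholders:

```python
import re

LINE = re.compile(r"^(?P<alias>\S+) \((?P<wwid>\w+)\) (?P<dm>dm-\d+) ")


def alias_to_dm(multipath_output):
    """Map each multipath alias to its current dm-N kernel name."""
    mapping = {}
    for line in multipath_output.splitlines():
        match = LINE.match(line)
        if match:
            mapping[match["alias"]] = match["dm"]
    return mapping


def throttle_entry(major, minor, bytes_per_second):
    """Format one entry for blkio.throttle.read_bps_device."""
    return f"{major}:{minor} {bytes_per_second}"


sample = """CBW_DB_FATA-E2 (3600508b4001085dd00011380) dm-10 HP,HSV200
CBW_CI-E2 (3600508b4001085dd0001140c) dm-12 HP,HSV200"""
print(alias_to_dm(sample))
print(throttle_entry(253, 10, 40 * 1024 * 1024))  # major 253 is a placeholder
```

This has to be repeated whenever the device numbers change, which is exactly what a static /etc/cgconfig.conf cannot do.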

Regards,
Ulrich



"floppy0: floppy timeout called", https://bugzilla.novell.com/show_bug.cgi?id=799559

2013-06-06 Thread Ulrich Windl

Hi!

maybe someone wants to have a look at kernel messages that look like debug 
dumps from the floppy driver. These messages fill up syslog unnecessarily. You 
can find the kernel messages in 
https://bugzilla.novell.com/show_bug.cgi?id=799559. Last seen in 
kernel-default-3.7.10-1.11.1.i586 of openSUSE 12.3...

Regards,
Ulrich



Fwd: Q: NFS: directory XX/YYY contains a readdir loop.Please contact your server vendor.

2013-08-16 Thread Ulrich Windl

Re-sent due to "5.7.1 Content-Policy reject msg: The capital Triple-X in 
subject is way too often associated with junk email, please rephrase. ":

>>> "Ulrich Windl" wrote on 16.08.2013 at 10:29 in message
<520e15ef.ed38.00a...@rz.uni-regensburg.de>:
> Hi,
> 
> recently I found out that we hit the "NFS: directory in/mdoc contains a 
> readdir loop.Please contact your server vendor." frequently on an NFS-Client 
> running SLES11 SP2 (3.0.80-0.7-default). The NFS server is also SLES11 SP2, 
> and 
> the exported filesystem is ext3 with "dir_index" on.
> 
> SLES support suggested to turn off "dir_index" in ext3, which "should be 
> safe".
> 
> I googled the problem, and I found some (to me) vague description by Ted Ts'o 
> ("If not readdir() then what?") back in 2011 referring to ext3.
> 
> Now I wonder: Is this problem restricted to just ext3, or to any filesystem?
> 
> We have (and I cannot change it) directories with many files, even if just 
> temporary.
> 
> The statistics say: "122431/524288 files (3.4% non-contiguous), 
> 1230006/2097152 blocks"
> 
> The biggest directory has almost 1MB in size, but just about 16513 directory 
> entries.
> 
> I'm wondering whether "directory compaction" (compact slots of removed 
> entries) would help with the problem. In HP-UX VxFS you could do directory 
> compaction online...
> 
> If you can explain the relationship of ext3 and other filesystems with this 
> bug, please reply keeping the CC:
> 
> Thank you,
> Ulrich






chown: s-Bits: to clear or not to clear

2013-07-16 Thread Ulrich Windl

Hi folks,

I discovered (SLES11 SP2 with kernel  3.0.80) that a chown executed by root 
(from non-root to non-root user) clears any s-Bits that were set for the old 
owner.

The man page (man 2 chown) says:
   When  the  owner  or  group of an executable file are changed by a non-
   superuser, the S_ISUID and S_ISGID mode bits are cleared.   POSIX  does
   not specify whether this also should happen when root does the chown();
   the Linux behavior depends on the kernel version.  In case  of  a  non-
   group-executable  file (i.e., one for which the S_IXGRP bit is not set)
   the S_ISGID bit indicates mandatory locking, and is not  cleared  by  a
   chown().

As there are good arguments for and against clearing the s-Bits during chown, 
there are probably only good arguments for having an option for chown(1) to 
preserve the s-Bits. What do you think? (I know this is the wrong list for 
discussing utils).
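
The rule quoted from the man page can be written down as a small function. This is a sketch of the documented behaviour for the case where the bits are cleared, not of what any particular kernel version does:

```python
import stat


def mode_after_chown(mode):
    """Mode bits after a chown that clears s-bits as the man page describes."""
    new_mode = mode & ~stat.S_ISUID
    if mode & stat.S_IXGRP:
        # Group-executable: S_ISGID is a real set-group-ID bit and is cleared.
        new_mode &= ~stat.S_ISGID
    # Otherwise S_ISGID marks mandatory locking and is left alone.
    return new_mode


print(oct(mode_after_chown(0o6755)))
print(oct(mode_after_chown(0o2644)))
```

An option to preserve the bits would amount to chown(1) calling chmod() with the old mode after the ownership change.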

Regards,
Ulrich Windl



Possible mmap() write() problem in SLES11 SP2 kernel

2013-08-01 Thread Ulrich Windl

Hi folks!

I thought I'd let you know (maybe I'm wrong, and the kernel is right):

I wrote a C program that maps a file into a private writable mapping. Then I 
modify the area a bit and use a single write() to write that area back to a file.

This worked fine in SLES11 kernel 3.0.74-0.6.10. However with kernel  
3.0.80-0.7 the write() fails with EFAULT if the output file is the same as the 
input file.

The strace is amazingly short (I removed the unrelated calls):
open("xxx", O_RDONLY)   = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=4416, ...}) = 0
mmap(NULL, 4416, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x7f85ac045000
close(3)= 0
open("xxx", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
write(3, 0x7f85ac045000, 4414)  = -1 EFAULT (Bad address)
close(3)= 0
munmap(0x7f85ac045000, 4414)= 0

I want to have your attention if this should work, and you get my attention if 
this should not work. Note that the input file is closed before it's opened for 
write again. As the output file is typically shorter than the input, I didn't 
want to use a non-private mapping and a truncate, just in case you wonder...

Regards,
Ulrich



Re: Re: Possible mmap() write() problem in SLES11 SP2 kernel

2013-08-04 Thread Ulrich Windl

>>> Hugh Dickins wrote on 04.08.2013 at 00:37 in message:
> On Thu, 1 Aug 2013, Ulrich Windl wrote:
>> Hi folks!
>> 
>> I think I'd let you know (maybe I'm wrong, and the kernel is right):
>> 
>> I write a C-program that maps a file into an private writable map. Then I 
>> modify the area a bit and use one write to write that area back to a file.
>> 
>> This worked fine in SLES11 kernel 3.0.74-0.6.10. However with kernel  
>> 3.0.80-0.7 the write() fails with EFAULT if the output file is the same as 
>> the input file.
> 
> I wonder if you actually did exactly the same on both kernels.

Hi!

Thanks for replying! Actually I did the same a few thousand times (with 
different files and different lengths) on the previous kernel, where it never 
failed, whereas on the newer kernel it always fails (it seems).

> 
>> 
>> The strace is amazingly short (I removed the unrelated calls):
> 
> Providing that was very helpful.
> 
>> open("xxx", O_RDONLY)   = 3
>> fstat(3, {st_mode=S_IFREG|0644, st_size=4416, ...}) = 0
>> mmap(NULL, 4416, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x7f85ac045000
>> close(3)= 0
>> open("xxx", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
> 
> The crucial point is the above O_TRUNC when you now open the file for
> writing: that truncates the file to 0-length, which unmaps any pages
> mapped from it into userspace.  Even the privately modified COW pages:

Well, but the mapping is PRIVATE, so I guessed once mapped, changes to the map 
won't affect the file, just as changes to the file won't affect the map. 
Specifically when re-opening the file for writing with O_TRUNC I did not expect 
the map to become invalid. Also note that the unmap still returns no error.
My manual page vaguely says: "It is unspecified whether changes made to the 
file after the mmap() call are visible in the mapped region."
> that often seems surprising, but it is how mmap versus truncate is
> specified to work.
> 
>> write(3, 0x7f85ac045000, 4414)  = -1 EFAULT (Bad address)
> 
> If your program now touched a part of the mapping, it would get
> SIGBUS, there being no pages of underlying object to page in from.
> But since you're accessing the area from within a system call,
> that simply fails with EFAULT.

OK, if things are like this, the older kernel must have been faulty.

> 
>> close(3)= 0
>> munmap(0x7f85ac045000, 4414)= 0
>> 
>> I want to have your attention if this should work, and you get my attention 
>> if this should not work.
> 
> It should not work.
> 
>> Note that the input file is closed before it's opened for write again. As 
>> the output file is typically shorter than the input, I didn't want to use a 
>> non-private mapping and a truncate, just in case you wonder...
> 
> (I didn't understand your logic there.)

The alternative to write() a part of the PRIVATE area would be to work with a 
non-PRIVATE area that is truncated after flushing the changes. In principle the 
same blocks could be written multiple times (when you move data from later 
parts to earlier parts (i.e.: from the far end closer to the beginning)), so I 
thought a PRIVATE mapping plus one write() would avoid that. I had the choice of 
truncate while opening, or to truncate the extra data after write(). I chose 
the first alternative.

Maybe I'll re-design...
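
One possible redesign, sketched in Python rather than C and assuming the file fits in memory: copy the data into an ordinary buffer, open the file without O_TRUNC, write the result, and only then cut off the unused tail (the second alternative mentioned above). The function name is made up for this example:

```python
import os


def rewrite_in_place(path, transform):
    """Read a file, transform its bytes, write the result back, then shrink."""
    with open(path, "rb") as handle:
        data = bytearray(handle.read())   # ordinary heap copy, not a mapping
    result = transform(data)
    fd = os.open(path, os.O_WRONLY)       # note: no O_TRUNC here
    try:
        view = memoryview(bytes(result))
        while view:
            view = view[os.write(fd, view):]   # handle short writes
        os.ftruncate(fd, len(result))     # drop the now-unused tail
    finally:
        os.close(fd)
    return len(result)
```

Because the buffer is not backed by the file, truncating the file can no longer invalidate the data being written.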

Thanks,
Ulrich

> 
> Hugh




message (3.0.80) "kernel: [440682.559851] blk_rq_check_limits: over max size limit."

2013-08-06 Thread Ulrich Windl

Hi!

I just did some block device tuning according to some expert's advice which 
resulted in multipath failures. I'm not going to discuss this as I'll have to 
investigate further, but I'd like to point out that the messages like 
"[440682.559851] blk_rq_check_limits: over max size limit." lack the affected 
device!

It's quite hard to debug if you have more than 90 disks attached:
Aug  6 14:58:08 h06 multipathd: 65:240: mark as failed
Aug  6 14:58:08 h06 multipathd: SAP_T11_I03-E3: remaining active paths: 4
Aug  6 14:58:08 h06 multipathd: 68:48: mark as failed
Aug  6 14:58:08 h06 multipathd: SAP_T11_I03-E3: remaining active paths: 3
Aug  6 14:58:08 h06 multipathd: 8:208: mark as failed
Aug  6 14:58:08 h06 multipathd: SAP_T11_I03-E4: remaining active paths: 3
Aug  6 14:58:08 h06 multipathd: 67:80: mark as failed
Aug  6 14:58:08 h06 multipathd: SAP_T11_I03-E4: remaining active paths: 2
Aug  6 14:58:08 h06 kernel: [440682.559851] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.559891] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.559916] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.559928] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.559966] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.559996] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560016] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560037] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560058] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560106] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560149] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560176] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560189] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560223] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560257] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560277] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560296] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560319] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.560333] device-mapper: multipath: Failing 
path 65:240.
Aug  6 14:58:08 h06 kernel: [440682.560346] device-mapper: multipath: Failing 
path 68:48.
Aug  6 14:58:08 h06 kernel: [440682.560429] device-mapper: multipath: Failing 
path 67:80.
Aug  6 14:58:08 h06 kernel: [440682.560436] device-mapper: multipath: Failing 
path 8:208.
Aug  6 14:58:08 h06 kernel: [440682.561345] sd 2:0:0:7: alua: port group 01 
state N non-preferred supports tolusNA
Aug  6 14:58:08 h06 kernel: [440682.561500] sd 3:0:2:7: alua: port group 01 
state N non-preferred supports tolusNA
Aug  6 14:58:08 h06 kernel: [440682.562075] sd 2:0:4:7: alua: port group 01 
state N non-preferred supports tolusNA
Aug  6 14:58:08 h06 kernel: [440682.562257] sd 3:0:4:7: alua: port group 01 
state N non-preferred supports tolusNA
Aug  6 14:58:08 h06 kernel: [440682.562393] sd 3:0:8:7: alua: port group 01 
state N non-preferred supports tolusNA
Aug  6 14:58:08 h06 kernel: [440682.676078] sd 3:0:2:7: alua: port group 01 
switched to state A
Aug  6 14:58:08 h06 kernel: [440682.676091] sd 2:0:0:7: alua: port group 01 
switched to state A
Aug  6 14:58:08 h06 kernel: [440682.676108] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.676112] blk_rq_check_limits: over max size 
limit.
Aug  6 14:58:08 h06 kernel: [440682.676115] blk_rq_check_limits: over max size 
limit.

Is this problem fixed in a newer kernel?

For the curious: I tuned queue/max_sectors_kb for the paths in a multipath 
device, but didn't tune the multipath device itself...
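
The mismatch can be detected by comparing queue/max_sectors_kb of the multipath map with the values of its paths (the paths of dm-N are listed under /sys/block/dm-N/slaves/). A sketch of the comparison in Python; the function name, device names and numbers are made up:

```python
def oversized_for_paths(dm_limit_kb, path_limits_kb):
    """Return the paths whose max_sectors_kb is below the multipath device's."""
    return sorted(dev for dev, limit in path_limits_kb.items()
                  if limit < dm_limit_kb)


# Hypothetical values: the map still allows 1024 KB, two paths were lowered.
paths = {"sdp": 512, "sdaf": 512, "sdn": 1024}
print(oversized_for_paths(1024, paths))
```

Any path reported here can receive a request that is larger than it accepts, which is presumably what the "over max size limit" message complains about.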

Regards,
Ulrich
P.S.: Please keep CC: as I'm not subscribed to the list


bad time after boot (read from absent hardware?): "Time: 165:165:165 Date: 165/165/65"

2013-05-28 Thread Ulrich Windl

Hi!

Some time ago I discovered strange output in boot messages, just as if the 
kernel trusts junk from hardware that is not present, like the RTC in a 
paravirtualized Xen guest (the guest has no /dev/rtc*). The message says:
<6>[0.123524] Time: 165:165:165  Date: 165/165/65

Obviously, if there were some validity check, this wouldn't pass, so I guess 
there is none!

In Xen's message buffer (hypervisor) I only see this error (that seems 
unrelated):
(XEN) mm.c:833:d6 Non-privileged (6) attempt to map I/O space 00f0

According to my source code the print originates from read_magic_time() in 
/drivers/base/power/trace.c.
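
One possible explanation for the value 165, offered as an assumption rather than a confirmed diagnosis: if every register read returns 0xFF (as reads from absent hardware commonly do) and the result is decoded as BCD, each field becomes 165, and 165 modulo 100 gives the 65 shown in the year position. The helper below uses the usual BCD-to-binary formula; it is a Python sketch, not the kernel source:

```python
def bcd2bin(value):
    """Decode a packed BCD byte: high nibble is tens, low nibble is units."""
    return (value & 0x0F) + (value >> 4) * 10


print(bcd2bin(0x59))        # a valid BCD byte
print(bcd2bin(0xFF))        # what an absent device typically returns
print(bcd2bin(0xFF) % 100)  # two-digit year
```

A check that rejects nibbles above 9 before decoding would catch this case.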

I'm running kernel 3.0.74-0.6.8-xen (SLES11 SP2) on x86_64.

Regards,
Ulrich



Suggestion for improving kernel messages on ext3-mount for consistency

2013-03-28 Thread Ulrich Windl

Hi!

I have a kind of trivial suggestion for improving the kernel messages for 
ext3-fs mounts to be more consistent and useful:

Most messages for ext3-mounting include the device, like:
kernel: [  823.233892] EXT3-fs (dm-7): using internal journal
kernel: [  823.233899] EXT3-fs (dm-7): mounted filesystem with ordered data mode

However some messages do not include a device, even though they seem device 
specific. For example:
kernel: [  823.210989] EXT3-fs: barriers not enabled
kernel: [  823.233218] kjournald starting.  Commit interval 15 seconds

This was observed in the current SLES11 SP2 kernel (3.0.58-0.6.6).

I haven't queried the sources, but it looks like an easy change...

BTW: The kjournald threads are also anonymous in the process list (while xfs, 
for example, names its threads after the device):
# ps ax |grep journ
  418 ?        S      0:00 [kjournald]
 1070 ?        S      0:00 [kjournald]
 1071 ?        S      0:00 [kjournald]
 1072 ?        S      0:00 [kjournald]
 1073 ?        S      0:00 [kjournald]
 1074 ?        S      0:00 [kjournald]
 1075 ?        S      0:00 [kjournald]
 5461 ?        S      0:00 [kjournald]
 5499 ?        S      0:00 [kjournald]
 5601 ?        S      0:00 [kjournald]
 5642 ?        S      0:00 [kjournald]
 5648 ?        S      0:00 [kjournald]
 5653 ?        S      0:00 [kjournald]
 5661 ?        S      0:00 [kjournald]
 5873 tty1     S+     0:00 grep journ

# ps ax |grep xfs
 5506 ?        S<     0:00 [xfs_mru_cache]
 5507 ?        S<     0:00 [xfslogd]
 5508 ?        S<     0:00 [xfsdatad]
 5509 ?        S<     0:00 [xfsconvertd]
 5510 ?        S      0:00 [xfsbufd/dm-10]
 5511 ?        S      0:00 [xfsaild/dm-10]
 5518 ?        S      0:00 [xfsbufd/dm-11]
 5519 ?        S      0:00 [xfsaild/dm-11]
 5560 ?        S      0:00 [xfsbufd/dm-12]
 5561 ?        S      0:00 [xfsaild/dm-12]
 5593 ?        S      0:00 [xfsbufd/dm-13]
 5594 ?        S      0:00 [xfsaild/dm-13]
 5875 tty1     S+     0:00 grep xfs

Regards,
Ulrich



Q: diskstats for MD-RAID

2012-08-07 Thread Ulrich Windl

Hello!

I have a question based on the SLES11 SP1 kernel (2.6.32.59-0.3-default):
In /proc/diskstats the last four values seem to be zero for md-Devices.

So "%util", "await", and "svctm" from "sar" are always reported as zero.
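
To make the dependency explicit, here is a rough sketch of how such figures can be derived from two /proc/diskstats samples, assuming the 11-field layout described in the kernel's Documentation/iostats.txt. The field names follow common sysstat usage; the function names are made up for illustration:

```python
FIELDS = ("rd_ios", "rd_merges", "rd_sectors", "rd_ticks",
          "wr_ios", "wr_merges", "wr_sectors", "wr_ticks",
          "in_flight", "io_ticks", "time_in_queue")

def parse_diskstats_line(line):
    """Split one /proc/diskstats line into (device_name, {field: value})."""
    parts = line.split()
    name = parts[2]                      # parts[0] and parts[1] are major/minor
    values = [int(v) for v in parts[3:3 + len(FIELDS)]]
    return name, dict(zip(FIELDS, values))

def util_and_await(before, after, interval_ms):
    """Return (%util, await in ms) between two samples taken interval_ms apart."""
    ios = (after["rd_ios"] - before["rd_ios"]) + (after["wr_ios"] - before["wr_ios"])
    ticks = ((after["rd_ticks"] - before["rd_ticks"])
             + (after["wr_ticks"] - before["wr_ticks"]))
    util = 100.0 * (after["io_ticks"] - before["io_ticks"]) / interval_ms
    await_ms = ticks / ios if ios else 0.0
    return util, await_ms
```

If the time counters of a device never advance, both results stay at zero no matter how many requests were completed, which matches what "sar" shows for the md devices.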

Is this a bug or a feature? I'm tracing a fairness problem resulting from an 
I/O bottleneck similar to that described in kernel bugzilla #12309...

(If the kernel holds about 80GB of dirty buffers (yes: 80GB), reads using the 
same I/O channel seem to starve. The scenario is this: an FC-SAN disk system 
with two different types of disks is used to copy from the faster disks to the 
slower disks using "cp". The files are some ten GB in size (Oracle database). 
After several minutes (while the "cp" is still running), unrelated processes 
accessing different disk devices through the same I/O channel suffer from bad 
response times. I guess the kernel does not know that different disk devices 
are connected through one I/O channel: if the kernel tries to keep each device 
busy (specifically, trying to flush dirty buffers from one disk to free up 
buffers), it actually reduces the I/O rate of the other disks. Apart from 
that, some layers combine 8-sector requests into something like 600-sector 
requests, which probably also needs additional buffers and hurts the response 
time. The complete I/O stack is: FC-SAN, multipath (RR), MD-RAID1, LVM, ext3.)

When replying, please keep me in CC: as I'm not subscribed to the list.

Regards,
Ulrich



Q: Seeing the microcode revision in /proc/cpuinfo

2012-08-14 Thread Ulrich Windl

Hi!

After several reboots due to memory errors after excellent power-saving of 
Linux on a HP DL380G7 with Intel Xeon 5650 processors (all in one memory bank), 
I found the errata "BD104" and "BD123". The former should be fixed in 
microcode revision "15H".

Now I wonder what microcode revision my CPUs currently have. /proc/cpuinfo 
doesn't show that, and the microcode update is a bit cryptic:

kernel: [   44.422912] microcode: CPU23 sig=0x206c2, pf=0x1, revision=0x14

Does that mean the revision is 0x14 BEFORE or AFTER the microcode update?

Wouldn't you agree that seeing the microcode revision in /proc/cpuinfo would be 
nice?

For those CPUs lacking the feature one could hard-wire the value "none" (which 
would be also "kind of true")...

Regards,
Ulrich
(not subscribed, so please CC: your replies to me)



Antw: Re: Q: Seeing the microcode revision in /proc/cpuinfo

2012-08-14 Thread Ulrich Windl

Hi Borislav,

my edge is probably not bleeding as much as yours ;-)

I don't see "microcode" in 3.0.34-0.7-default for an AMD Opteron, nor in 
2.6.32.59-0.3-default for the Intel Xeon. Both are SLES11 kernels on 
x86_64. The first one is the latest you can get for SLES11 SP2.

In openSUSE 12.1 (kernel 3.1.10) it's also still missing.

Anyway, it's nice to see that others also thought this feature is useful.
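
For kernels that do have the field, a small sketch of how a script could pick it up, assuming the "microcode : 0x..." line format quoted below. The function name is made up, and processors without the field are simply reported as None:

```python
def microcode_revisions(cpuinfo_text):
    """Map processor number to microcode revision (int), or None if absent."""
    revisions = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue                      # blank separator line between CPUs
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
            revisions[current] = None     # stays None on kernels lacking the field
        elif key == "microcode" and current is not None:
            revisions[current] = int(value, 16)
    return revisions
```

Typical use would be `microcode_revisions(open("/proc/cpuinfo").read())`.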

Thanks & best regards,
Ulrich

>>> Borislav Petkov wrote on 14.08.2012 at 15:12 in message
<20120814131211.ga25...@x1.osrc.amd.com>:
> On Tue, Aug 14, 2012 at 02:35:40PM +0200, Ulrich Windl wrote:
> > Hi!
> > 
> > After several reboots due to memory errors after excellent power-saving of
> > Linux on a HP DL380G7 with Intel Xeon 5650 processors (all in one memory
> > bank), I found the errata "BD104" and "BD123". The former should be fixed
> > in microcode revision "15H".
> > 
> > Now I wonder what microcode revision my CPUs currently have. /proc/cpuinfo
> > doesn't show that, and the microcode update is a bit cryptic:
> > 
> > kernel: [   44.422912] microcode: CPU23 sig=0x206c2, pf=0x1, revision=0x14
> > 
> > Does that mean the revision is 0x14 BEFORE or AFTER the microcode update?
> > 
> > Wouldn't you agree that seeing the microcode revision in /proc/cpuinfo
> > would be nice?
> 
> Well, you must be using an old-ish kernel because the microcode revision
> in fact *is* in /proc/cpuinfo:
> 
> processor   : 1
> ...
> stepping: 0
> microcode   : 0x528
> 
> This is on 3.6-rc1 and that functionality is upstream since 3.2.
> 
> HTH.
> 
> -- 
> Regards/Gruss,
> Boris.
> 

 
 


Antw: Re: /sys and access(2): Correctly implemented?

2012-07-08 Thread Ulrich Windl

>>> Ryan Mallon wrote on 09.07.2012 at 01:24 in message
<4ffa16b6.9050...@gmail.com>:
> On 06/07/12 16:27, Ulrich Windl wrote:
> > Hi!
> > 
> > Recently I found a problem with the command (kernel 3.0.34-0.7-default from
> > SLES 11 SP2, run as root):
> > test -r "$file" && cat "$file"
> > emitting "Permission denied"
> > 
> > Investigating, I found that "test" actually uses "access()" to check for
> > permissions. Unfortunately there are some files in /sys that have
> > "write-only" permission bits set (e.g. /sys/devices/system/cpu/probe).
> > 
> > ~ # ll /sys/devices/system/cpu/probe
> > --w--- 1 root root 4096 Jun 29 12:43 /sys/devices/system/cpu/probe
> > ~ # F=/sys/devices/system/cpu/probe
> > ~ # test "$F" && cat "$F"
> > cat: /sys/devices/system/cpu/probe: Permission denied
> 
> Looks like you have a typo here, I think you wanted "test -r $F", not
> "test $F", the latter will just evaluate "$F" as an expression which
> will be true, and so you get the permission denied error running cat.

Hi!

You are right: It's a typo, but only in the message; the actual test was done 
correctly, and the outcome is quite the same.

> 
> Using "test -r $F" on a write-only sysfs file correctly returns false on
> my machine (Ubuntu 10.04.4 LTS/2.6.32-41-generic).

Not here, unfortunately:
# ll /sys/devices/system/cpu/probe
--w------- 1 root root 4096 Jul  2 11:52 /sys/devices/system/cpu/probe
# F=/sys/devices/system/cpu/probe
# test -r "$F" && cat "$F"
cat: /sys/devices/system/cpu/probe: Permission denied
# uname -a
Linux h07 2.6.32.59-0.3-default #1 SMP 2012-04-27 11:14:44 +0200 x86_64 x86_64 x86_64 GNU/Linux

Regards,
Ulrich



Re: Antw: Re: /sys and access(2): Correctly implemented?

2012-07-09 Thread Ulrich Windl

Hi!

Still, the problem seems to be specific to sysfs:
# cd /tmp
# touch testfile
# chmod u=w,go= testfile
# F=/tmp/testfile
# test -r "$F" && cat "$F"

So it seems access(2) works correctly for root and "normal" filesystems. That's 
why I raised the issue here.
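
One caveat about the test above: "testfile" is empty, so cat prints nothing even when it does run, and the silence alone does not show which branch was taken. A sketch that separates the two questions, "what does access(2) say?" and "does open actually succeed?", could look like this. The function name is made up, and the expectation that sysfs can reject an open even when access() reports the file as readable is an assumption based on the output shown earlier:

```python
import os

def check_read(path):
    """Return (access_says_readable, open_succeeds) for path."""
    access_ok = os.access(path, os.R_OK)   # permission check only, as in test -r
    try:
        with open(path, "rb") as f:
            f.read(1)                      # the real test: open and read
        open_ok = True
    except OSError:
        open_ok = False
    return access_ok, open_ok
```

On a regular filesystem both values should agree, whether run as root or not; a pair like `(True, False)` for a file in /sys would pin the inconsistency on the open path rather than on access(2).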

Regards,
Ulrich

>>> Ryan Mallon wrote on 09.07.2012 at 09:22 in message
<4ffa86c5.7090...@gmail.com>:
> On 09/07/12 16:23, Ulrich Windl wrote:
> >>>> Ryan Mallon wrote on 09.07.2012 at 01:24 in message
> > <4ffa16b6.9050...@gmail.com>:
> >> On 06/07/12 16:27, Ulrich Windl wrote:
> >>> Hi!
> >>>
> >>> Recently I found a problem with the command (kernel 3.0.34-0.7-default
> >>> from SLES 11 SP2, run as root):
> >>> test -r "$file" && cat "$file"
> >>> emitting "Permission denied"
> >>>
> >>> Investigating, I found that "test" actually uses "access()" to check for
> >>> permissions. Unfortunately there are some files in /sys that have
> >>> "write-only" permission bits set (e.g. /sys/devices/system/cpu/probe).
> >>>
> >>> ~ # ll /sys/devices/system/cpu/probe
> >>> --w--- 1 root root 4096 Jun 29 12:43 /sys/devices/system/cpu/probe
> >>> ~ # F=/sys/devices/system/cpu/probe
> >>> ~ # test "$F" && cat "$F"
> >>> cat: /sys/devices/system/cpu/probe: Permission denied
> >>
> >> Looks like you have a typo here, I think you wanted "test -r $F", not
> >> "test $F", the latter will just evaluate "$F" as an expression which
> >> will be true, and so you get the permission denied error running cat.
> > 
> > Hi!
> > 
> > You are right: It's a typo, but only in the message; the actual test was
> > done correctly, and the outcome is quite the same.
> > 
> >>
> >> Using "test -r $F" on a write-only sysfs file correctly returns false on
> >> my machine (Ubuntu 10.04.4 LTS/2.6.32-41-generic).
> > 
> > Not here, unfortunately:
> 
> Oops, I missed the bit about you running as root. I get the same results
> running as root on my machine as you, both for sysfs and regular files.
> 
> It appears that the behaviour of access(2) for the super-user might be
> implementation-defined, see:
> 
>   http://pubs.opengroup.org/onlinepubs/95399/functions/access.html 
>   http://lists.gnu.org/archive/html/bug-bash/2010-07/msg00071.html 
> 
> However, I can't find any concrete information on it for Linux, and the
> manpage doesn't mention anything other than the X_OK bit.
> 
> ~Ryan
> 

 
 


Antw: [PATCH 0/5] kfifo cleanup and log based kfifo API

2013-01-08 Thread Ulrich Windl

>>> Yuanhan Liu wrote on 08.01.2013 at 15:57 in
message <1357657073-27352-1-git-send-email-yuanhan@linux.intel.com>:

[...]
> My proposal is to replace kfifo_init with kfifo_alloc, which
> allocates the buffer and maintains the fifo size inside kfifo. Then we can
> remove the buggy kfifo_init.
[...]

Offhand, I think it would be a good idea to emit a critical message if the 
requested size is not a power of two, and (in that case) to round up to the 
next power of two instead of rounding down ;-)
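
To illustrate the difference between the two rounding directions, a small sketch (the function names are made up and are not the kernel's helpers):

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0

def round_down_pow2(n):
    """Largest power of two that is <= n (the caller gets less than requested)."""
    if n < 1:
        raise ValueError("size must be positive")
    return 1 << (n.bit_length() - 1)

def round_up_pow2(n):
    """Smallest power of two that is >= n (the caller gets at least the request)."""
    if n < 1:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()
```

For a requested size of 100, rounding down yields 64 and rounding up yields 128; only the latter guarantees that 100 elements actually fit.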

Sorry, I'm not deeply into recent kernel development.

Regards,
Ulrich



pthreads & gdb: zombie threads?

2001-04-03 Thread Ulrich Windl


Hello,

I'm having a strange problem debugging a pthreads application in 2.2.18 
(as per SuSE 7.1):

gdb says the program terminated normally after having started two or 
three LWPs. I can exit gdb then, and I find (ps -ax) one zombie thread 
and two or three other threads. Is it more likely a kernel problem, a 
library problem, or a gdb problem?

Naively I thought when exiting the process, all threads would die...

Ulrich
P.S. Not subscribed here


Re: announce: PPSkit patch for Linux 2.4.2 (pre6)

2001-04-09 Thread Ulrich Windl

Hi,  Cycle Counters,

Linux currently tries to synchronize TSCs for consistent time in SMP 
systems. One would not believe what combinations of hardware are tried, 
especially for precision timing. Here's a short answer to my follow-up 
question about a complaint (the kernel is reporting negative time warps).

Like any problem, it can be solved with some overhead, but should it be 
done?

Replies to me too, as I'm not subscribed, please.

Ulrich

On 9 Apr 2001, at 18:39, Andreas Bussjaeger wrote:

> > from the current CPU. All these values seem highly suspect. However a 
> > few more values would be helpful to diagnose the situation.
> 
> I have to tell you that I have one 533 MHz Celeron and one 433 MHz
> Celeron.
> 
> > indicate that the CPUs are 968ms apart (each CPU half from the average).


2.2.19: config help text about "TCO timer"

2001-04-11 Thread Ulrich Windl


Hi,

I know I'm late, but Configure.help in 2.2.19 says:

..."The TCO (Total Cost of Ownership) timer is a watchdog"...

I know that TCO stands for that, but I can't believe it applies to a mainboard 
component. Should the user then throw the PC away, or what? Or is it 
safer to reboot frequently? What does this have to do with costs?

Confused,
Ulrich


Re: No 100 HZ timer!

2001-04-23 Thread Ulrich Windl


IMHO complying with POSIX is doable. Probably not what many 
of the RT freaks expect, but doable. I have been tuning the nanoseconds for a 
while now...

Ulrich

On 17 Apr 2001, at 11:53, george anzinger wrote:

> I was thinking that it might be good to remove the POSIX API for the
> kernel and allow a somewhat simplified interface.  For example, the user


patch-proposal: extended adjtime()

2001-04-24 Thread Ulrich Windl


Hello,

someone found out that in Linux adjtime()'s correction is limited to 
something like 2000s (signed 32bit microseconds for i386). This is not 
a true problem, but for those who desperately need/want it, I have a 
patch proposal (incomplete, but essential) to implement the full range 
(maybe even more). The patch tries to keep binary compatibility, too.
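
The limit follows directly from the data type. A short worked sketch, assuming a signed 32-bit microsecond counter as described above; the helper names and the convention that both parts of a split offset carry the same sign are illustrative assumptions, not taken from the patch:

```python
INT32_MAX = 2**31 - 1
USEC_PER_SEC = 1000000

def max_adjtime_seconds():
    """Largest offset, in seconds, that fits in signed 32-bit microseconds."""
    return INT32_MAX / USEC_PER_SEC        # about 2147.48 s, roughly 35.8 minutes

def fits_int32_usec(total_usec):
    """True if the offset can be expressed in the old single-long interface."""
    return -INT32_MAX - 1 <= total_usec <= INT32_MAX

def split_offset(total_usec):
    """Split a signed microsecond offset into (tv_sec, tv_usec) of equal sign."""
    sign = -1 if total_usec < 0 else 1
    sec, usec = divmod(abs(total_usec), USEC_PER_SEC)
    return sign * sec, sign * usec
```

An offset of one hour (3600000000 us) does not fit the old interface, but splits cleanly into `(3600, 0)`.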

Opinions?

Regards,
Ulrich




--- kernel/243time.c   Mon Apr 16 20:14:27 2001
+++ kernel/xxxtime.c   Mon Apr 16 20:41:15 2001
@@ -100,7 +100,8 @@
write_lock_irq(&xtime_lock);
xtime.tv_sec = value;
xtime.tv_usec = 0;
-   time_adjust = 0;/* stop active adjtime() */
+   time_adjust.tv_sec = time_adjust.tv_usec = 0;
+   /* stop active adjtime() */
time_status |= STA_UNSYNC;
time_maxerror = NTP_PHASE_LIMIT;
time_esterror = NTP_PHASE_LIMIT;
@@ -225,7 +226,8 @@
  */
 int do_adjtimex(struct timex *txc)
 {
-long ltemp, mtemp, save_adjust;
+long ltemp, mtemp;
+   struct timeval save_adjust;
int result;
 
/* In order to modify anything, you gotta be super-user! */
@@ -295,7 +297,31 @@
if (txc->modes & ADJ_OFFSET) {  /* values checked earlier */
if (txc->modes == ADJ_OFFSET_SINGLESHOT) {
/* adjtime() is independent from ntp_adjtime() */
-   time_adjust = txc->offset;
+
+   /* Try to extend the range for plain old adjtime()
+* to multiple seconds without breaking binary
+* compatibility. A perfect solution is not
+* possible, but this one has a high probability
+* for success. The true solution is a syscall of
+* its own.
+* The offset for ADJ_OFFSET_SINGLESHOT is stored in
+* txc->time (struct timeval) now. To avoid using
+* garbage vaues, it's required to copy
+* `txc->time.tv_usec' also into `txc->offset'. Just
+* to be sure, we also require the magic word
+* EXTENDED_ADJTIME_MAGIC to be written to `txc->status'
+* (it's a value not possible before, and it's
+* overwritten after each call).
+*/
+#define EXTENDED_ADJTIME_MAGIC (0x + ('U' << 24) + ('W' << 16))
+   /* old compatible interface */
+   time_adjust.tv_usec = txc->offset;
+
+   if (txc->offset == txc->time.tv_usec &&
+   txc->status == EXTENDED_ADJTIME_MAGIC) {
+   /* extended part */
+   time_adjust.tv_sec = txc->time.tv_sec;
+   }
}
else if ( time_status & (STA_PLL | STA_PPSTIME) ) {
ltemp = (time_status & (STA_PPSTIME | STA_PPSSIGNAL)) ==
@@ -375,9 +401,11 @@
/* p. 24, (d) */
result = TIME_ERROR;

-   if ((txc->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT)
-   txc->offset= save_adjust;
-   else {
+   if ((txc->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT) {
+   txc->offset= save_adjust.tv_usec;
+   if (txc->status == EXTENDED_ADJTIME_MAGIC)
+   txc->time = save_adjust;
+   } else {
if (time_offset < 0)
txc->offset = -(-time_offset >> SHIFT_UPDATE);
else
--- kernel/243timer.c   Mon Apr 16 20:34:29 2001
+++ kernel/xxxtimer.c   Mon Apr 16 21:01:29 2001
@@ -58,8 +58,7 @@
 long time_adj; /* tick adjust (scaled 1 / HZ)  */
 long time_reftime; /* time at last adjustment (s)  */
 
-long time_adjust;
-long time_adjust_step;
+struct timeval time_adjust;/* remaining time adjustment */
 
 unsigned long event;
 
@@ -461,8 +460,26 @@
 /* in the NTP reference this is called "hardclock()" */
 static void update_wall_time_one_tick(void)
 {
-   if ( (time_adjust_step = time_adjust) != 0 ) {
-   /* We are doing an adjtime thing. 
+   long time_adjust_step;
+
+   if ((time_adjust.tv_sec | time_adjust.tv_usec) != 0) {
+   time_adjust_step = time_adjust.tv_usec;
+   if (time_adjust_step > 0) {
+   /* if we run out of microseconds, but have more seconds,
+* borrow another second
+*/
+   if (time_adjust_step < tickadj && time_adjust.tv_sec > 0) {
+   time_adjust_step = time_adjust.tv_usec += 100;
+   --time_adjust.tv_sec;
+   }
+   } else {
+   if (time_adjust_step > -tickadj && time_adjust.tv_sec < 0) {
+   time_adjust_step = time_adjust.tv_usec -= 100;
+   ++time_adjust.tv_sec;
+   }
+   }
+   
+   /* We gave to complete the adjtim

2.2.18: static rtc_lock in nvram.c

2001-02-25 Thread Ulrich Windl


Hi,

browsing the sources for some problem I wondered why nvram.c uses a 
static spinlock named rtc_lock, hiding the global one.

Regards,
Ulrich


2.2.18/ext2: special file corruption?

2001-02-26 Thread Ulrich Windl


Hi,

I had an interesting effect: Due to NVdriver I had a lot of system 
freezes, and I had to reboot. Using e2fsck 1.19a (SuSE 7.1) I got the 
message that one specific "Special (device/socket/fifo) inode .. has 
non-zero size. FIXED."

Interestingly I got the message for every reboot. So either the kernel 
corrupts the very same inode every time, or e2fsck does not really fix 
it, or the error simply doesn't exist. I think the kernel doesn't 
temporarily set the size to non-zero, so this seems strange.

Regards,
Ulrich


Re: 2.2.18: static rtc_lock in nvram.c

2001-02-26 Thread Ulrich Windl

On 26 Feb 2001, at 9:33, Alan Cox wrote:

> > browsing the sources for some problem I wondered why nvram.c uses a 
> > static spinlock named rtc_lock, hiding the global one.
> 
> It only does that for the atari, where the driver isnt used by other things

Hmm.. are there different nvram.c drivers? I noticed that SuSE 7.1 
loads that driver on i386.

Also doesn't look a lot like Atari:
 * This driver allows you to access the contents of the non-volatile memory in
 * the mc146818rtc.h real-time clock. This chip is built into all PCs and into
 * many Atari machines. In the former it's called "CMOS-RAM", in the latter
 * "NVRAM" (NV stands for non-volatile).

Regards,
Ulrich


Re: 2.2.18/ext2: special file corruption?

2001-02-26 Thread Ulrich Windl


On 26 Feb 2001, at 10:48, Andreas Dilger wrote:

> Ulrich Windl writes:
> > I had an interesting effect: Due to NVdriver I had a lot of system 
> > freezes, and I had to reboot. Using e2fsck 1.19a (SuSE 7.1) I got the 
> > message that one specific "Special (device/socket/fifo) inode .. has 
> > non-zero size. FIXED."
> > 
> > Interestingly I got the message for every reboot. So either the kernel 
> > corrupts the very same inode every time, or e2fsck does not really fix 
> > it, or the error simply doesn't exist. I think the kernel doesn't 
> > temporarily set the size to non-zero, so this seems strange.
> 
> It is strange that it thinks ".." is a special inode.  Maybe e2fsck is

Of course NOT: ``..'' is a meta syntax for ellipsis. I couldn't 
remember the inode number.

> fixing the wrong problem (i.e. truncating the directory ".."), and it
> later fixes the zero-length directory...  Could you try two things:
> 
> 1) unmount the filesystem and run e2fsck on the broken filesystem 1 or 2
>times, to see if e2fsck is fixing the problem or not.

I did that, and actually it fixed the very same problem again. On a 
second run it was fixed. So either the "-a -t ext2" prevents the 
changes from being written back if the only problem was that special 
file, or there is some corruption undetected by fsck that in turn 
causes the kernel to corrupt the filesystem again and again, or I don't 
know. Here's the log from my tries:

elf:~ # fsck -f /dev/sda6
Parallelizing fsck version 1.19a (13-Jul-2000)
e2fsck 1.19, 13-Jul-2000 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Special (device/socket/fifo) inode 16600 has non-zero size.  Fix? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sda6: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda6: 35542/86400 files (0.9% non-contiguous), 124965/172690 blocks
elf:~ # fsck -f /dev/sda6
Parallelizing fsck version 1.19a (13-Jul-2000)
e2fsck 1.19, 13-Jul-2000 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda6: 35542/86400 files (0.9% non-contiguous), 124965/172690 blocks


> 
> 2) If it is fixing the problem you need to wait until the next time you have
>a system crash, start in single user mode.  If it is NOT fixing the problem
>you can do this right away.  Run "e2fsck -n" to see which inode number is
>corrupt (the -n option means e2fsck will not fix the filesystem), and then
>run "debugfs /dev/X", type "dump " and "ncheck inode_number"
>at the prompt (note you NEED the <> around the inode number for dump).
>Send the output.

I'll keep your message. Maybe you'll hear from me again.

Ulrich


2.4.0test12: problems timing events

2001-01-07 Thread Ulrich Windl


Hi,

I tried to time events inside the kernel in 2.4.0test12:

Basically the same code works fine in 2.2.18 with about 1us jitter. 
However in 2.4.0test12 the jitter is around 600ms!

What I did is this: I modified the interrupt routine of the serial 
driver to get a precision time-stamp via do_gettimeofday().

So I guess either interrupts are delayed significantly from time to 
time, or the time routine has been changed to be no longer useful 
within interrupt routines.

If anybody can enlighten me on this, I'd be happy.

I'm not subscribed to linux-kernel, so please CC: me.

Regards,
Ulrich


suggest: diff-2.4.0-test12_to_2.4.0

2001-01-08 Thread Ulrich Windl


I thought I'd find a diff from 2.4.0test12 (last test release) to 
the final 2.4.0 release, but did not. Wouldn't it be (have been) a good 
idea?

Regards,
Ulrich


Re: suggest: diff-2.4.0-test12_to_2.4.0

2001-01-08 Thread Ulrich Windl


On 8 Jan 2001, at 14:16, Andreas Jaeger wrote:

> >>>>> Ulrich Windl writes:
> 
>  > I thought I'd find a diff between 2.4.0test12 (last test release) to 
>  > the final 2.4.0 release, but did not. Wouldn't it be (have been) a good 
>  > idea?
> 
> Apply:
> patch-2.4.0-prerelease.bz2 and then prerelease-to-final.bz2 to test12
> and you get 2.4.0 final.
> 
> You'll find both in ftp.*.kernel.org/...kernel/v2.4/test-kernels/

And both fit on a 1.44MB floppy. Great! Thanks a lot.

Ulrich


2.4: header file confusion (interrupts)

2001-01-08 Thread Ulrich Windl


Inspecting some code I found out that in 2.4.0test12

request_irq() is declared in sched.h, and not in interrupt.h,

SA_SHIRQ is declared in asm/signal.h, and not in interrupt.h

Isn't that a bit confusing? Maybe for 2.5 let's re-sort some things to 
clean up dependencies...

Regards,
Ulrich


some issues for 2.4.0

2001-01-09 Thread Ulrich Windl


Hello,

I have some issues on Linux-2.4.0:

During boot the (slightly modified, see later) kernel says:

<4>Linux version 2.4.0-NANO (root@elf) (gcc version 2.95.2 19991024 (release)) #1 Mon Jan 8 22:04:48 MET 2001
[...]
<4>PCI: PCI BIOS revision 2.10 entry at 0xfb280, last bus=1
<4>PCI: Using configuration type 1
<4>PCI: Probing PCI hardware
<4>Unknown bridge resource 0: assuming transparent

??? What does the message above mean?

<4>PCI: Using IRQ router VIA [1106/0596] at 00:07.0
<6>Activating ISA DMA hang workarounds.

The DMI reports some funny values for my low-price board (the vendor
did not ship a DMI utility as Asus did for my old one):

<6>DMI 2.2 present.
<6>39 structures occupying 1055 bytes.
<6>DMI table at 0x000F0800.
<4>BIOS Vendor: Award Software International, Inc.
<4>BIOS Version: 4.51 PG
<4>BIOS Release: 06/19/00
<4>System Vendor: VIA Technologies, Inc..
<4>Product Name: VT82C693BX.
<4>Version  .
<4>Serial Number  .

??? Aren't they (above two lines) funny?

<4>Board Vendor: Shuttle Inc..
<4>Board Name: HOT-AV11 693-596-W977.
<4>Board Version: 2A6LGH2A.

[...]

As reported for 2.4.0-test12 there seems to be a problem timing events
within an interrupt (e.g. serial): The jitter is quite high. I'm
timing pulses generated from a GPS clock every second to estimate the
clock error. I'll show the first few updates. Let me show some facts
first, then state a suspicion. Each pair is seconds:nanoseconds
for the captured timestamps. My pulse is roughly 200ms+800ms:

979070631:649924277
979070632:49920873
979070633:649922851
979070634:49921630
979070635:649923125
979070636:49920800

??? Oops! Time jumped back!

979070633:354954544
979070633:754953483
979070635:354954708
979070635:754954209
979070637:354955615
979070637:754953649
979070639:354955938
979070639:754953328

??? Again!

979070637:59988575
979070637:459985921
979070639:59986981
979070639:459985930
979070641:59986854
979070641:459985908
979070643:59987006
979070643:459987393
979070645:59987262
979070645:136458168

979070642:765020874
979070643:165018428
979070644:765019464
979070645:165018406
979070646:765019339
979070647:165018295
979070648:765019475
979070649:165018274

979070646:470052764
979070646:870050956
979070648:470053050
979070648:870051264
979070650:470052609
979070650:870051691
979070652:470052047
979070652:870050772

979070650:175085546
979070650:575083574
979070652:175084550
979070652:575083463
979070654:175085050
979070654:575084190
979070656:175084787
979070656:575083420
979070658:175084652
979070658:251540985
979070655:880118226
979070656:280115991
979070657:880118654
979070658:280116032
979070659:880118844
979070660:280115978
979070661:880123413
979070662:280115897

979070659:585150248
979070659:985148519
979070661:585149737
979070661:985148498
979070663:585150396
979070663:985148476
979070665:585150361
979070665:985148365

979070663:290189552
979070663:690181048
979070665:290182834
979070665:690181774
979070667:290182445
979070667:690181783
979070669:290182466
979070669:690181672
979070671:290182951

I think that either some overflow happens, or that some spinlock is
really busy. You can find the patch used in
ftp://ftp.kernel.org/pub/linux/daemons/ntp/PPS/PPS-2.4.0-pre3.tar.bz2
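
For anyone who wants to go through the timestamps above mechanically, a small sketch that finds the backward steps. It assumes the seconds:nanoseconds format shown, where the nanoseconds part is not zero-padded; the function names are made up:

```python
NSEC_PER_SEC = 1000000000

def parse_stamp(text):
    """Convert 'seconds:nanoseconds' into a single nanosecond count."""
    sec, nsec = text.strip().split(":")
    return int(sec) * NSEC_PER_SEC + int(nsec)   # int() copes with the missing padding

def backward_steps(stamps):
    """Return (index, step_in_ns) for every stamp earlier than its predecessor."""
    times = [parse_stamp(s) for s in stamps]
    return [(i, times[i] - times[i - 1])
            for i in range(1, len(times)) if times[i] < times[i - 1]]
```

Applied to the first jump above (979070636:49920800 followed by 979070633:354954544), the step comes out as -2694966256 ns, so time went back by about 2.69 seconds.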

My CPU is identified as:
elf:/tmp # cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 6
model name  : Celeron (Mendocino)
stepping: 5
cpu MHz : 501.149
cache size  : 128 KB
fdiv_bug: no
hlt_bug : no
sep_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips: 999.42


Regards,
Ulrich


2.2.18: writing an R/O floppy

2001-01-09 Thread Ulrich Windl


Hi,

I don't know if it's possible to make fd a read-only device if the 
inserted media is write-protected, but I had a strange problem:

I had inserted a write protected floppy and accessed it via autofs as 
vfat in 2.2.18. It worked. Some time later it had expired (and I'm not 
sure whether I had changed floppies in the meantime).

When I tried an "mdel a:*", it terminated without a message, but a 
later "mdir a:" still showed all the files. The kernel had, however, 
unsuccessfully tried to write to the floppy.

It's a bit hard to reproduce that, but I would guess that the disc-change 
or write-protect status was not updated in some case.

Maybe it rings some bell for one of you; if not, never mind.

Regards,
Ulrich


patch: 2.4.0/2.5.0: nanoseconds time resolution

2001-01-21 Thread Ulrich Windl


Hello,

I have spent some time making a patch against the Linux kernel to 
switch to nanosecond time resolution, together with several 
time-related updates. I really need support for architectures other than 
i386, specifically a routine that has a very fine and accurate time 
resolution (just using ns == 1000*us isn't the best choice).

For the 2.4.0 patch the ia64, sh, mips64, and parisc architectures are 
not done at all, and the other architectures are either untested or 
done suboptimally.

Therefore I put together a simple "hacking document" (see attachment) 
to guide you when trying to port the code.  More text can be found in 
Documentation/kernel-time.txt after the patch, or in the distribution 
for Linux 2.2 (PPSkit-1.0.2.tar*). So please spend an hour or two to 
help me out there. I hope I'm not forced to drop the project.

Unless you can convince me not to have a /proc/sys/kernel/time 
directory, I'd also suggest accepting the patch for 
/usr/src/linux/include/sysctl.h for the standard kernel. Currently I 
have allocated "50" for the "time" entry. I'd like to have a stable 
number for the future.

Regards,
Ulrich Windl




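======================================================================
A sketch on what to consider when implementing the new time framework
on new architectures (like ia64, mips64, parisc, sh).
======================================================================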
(See
http://ftp.kernel.org/pub/linux/daemons/ntp/PPS/PPS-2.4.0-pre3.tar.bz2
for an implementation for i386)

* Add new config variables to `config.in' and `defconfig' (CONFIG_NTP,
  CONFIG_NTP_PPS, CONFIG_NTP_PPS_SERIAL)

* use `' instead of `' or `' to access
  kernel time.

* The kernel knows how to convert kernel time to CMOS time, don't mess
  with time zones yourself

* time is kept in nanoseconds. `do_fast_gettimeoffset()' is replaced
  with `do_exact_nanotime()' that returns nanoseconds passed since
  occurrence of the last timer interrupt. `do_slow_gettimeoffset()' is
  replaced with `do_poor_nanotime()' accordingly.

* `do_gettimeofday()' and `do_settimeofday()' are implemented in the
  architecture-independent module, messing with all the status
  updates.  The common code uses the `do_nanotime()' callback to call
  the architectures' code (allowing code selection during runtime or
  boot-up).

* `set_rtc_mmss()' is called `update_rtc()' now, and it sets the
  complete date and time (not just minutes). A new `ktime_to_rtc()'
  converts kernel time to broken down time components suitable to
  write to CMOS RTC.  `mktime()' is also architecture-independent
  now. The new `rtc_to_ktime()' is used after reading the RTC to get
  kernel time.

* a new `timevar_init()' initializes all the time variables.

* `struct timex' has been changed significantly while trying to
  preserve binary compatibility as far as possible.

* time routines are in `kernel/time.c' now, and `xtime', the kernel's
  representation of time, is protected by `rwlock_t xtime_lock'.  A
  new `rtc_runs_localtime' determines if time-zone corrections have to
  be made for RTC time updates. A new data type `l_fp', a 64bit
  quantity, is used for some internal time variables (needed by the
  NTP clock model).

* a new sysctl interface allows controlling of some time variables,
  most notably the time zone and `rtc_runs_localtime'.  While
  adjusting `time_tick' (the former `tick') is deprecated for NTP
  applications, it allows fine compensation of systematic clock
  errors.

* When the kernel time is set, the RTC update procedure is triggered.

* Old routines are implemented using POSIX-alike `do_clock_gettime()'
  and `do_clock_settime()'. There's also a `do_clock_getres()' that
  gives quite realistic (not optimistic) estimates.

* `adjtimex()' has been significantly reworked, just as most of the
  other time-keeping routines.

* Updating the RTC is controlled by new variables: `rtc_update_slave',
  when non-zero, controls after how many seconds the RTC has to be
  updated. Internally `last_rtc_update' keeps the time of the last
  update.  Upon update the `rtc_update_slave' is cleared on success.
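
The last bullet can be read as a small state machine. The following is only
a sketch of that reading, not code from the patch: the struct, the function
names and the writer callback are made up for illustration, and the two
stub writers exist only so the logic can be exercised outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the two variables named in the bullet. */
struct rtc_sync {
    long rtc_update_slave;  /* 0 = nothing pending, else interval in seconds */
    long last_rtc_update;   /* kernel time (seconds) of the last RTC write */
};

/* Return 1 if an RTC write is due at time `now` (seconds). */
int rtc_update_due(const struct rtc_sync *s, long now)
{
    return s->rtc_update_slave != 0 &&
           now - s->last_rtc_update >= s->rtc_update_slave;
}

/*
 * Attempt the update through a caller-supplied writer (a stand-in for
 * update_rtc(); assumed to return 0 on success).  On success the time is
 * recorded and the request is cleared; on failure it stays pending.
 * Returns 1 if the RTC was written, 0 if nothing was due, -1 on failure.
 */
int rtc_sync_poll(struct rtc_sync *s, long now, int (*write_rtc)(long))
{
    assert(s != NULL && write_rtc != NULL);
    if (!rtc_update_due(s, now))
        return 0;
    if (write_rtc(now) != 0)
        return -1;
    s->last_rtc_update = now;
    s->rtc_update_slave = 0;
    return 1;
}

/* Illustrative stub writers. */
int rtc_write_ok(long now)   { (void)now; return 0; }
int rtc_write_fail(long now) { (void)now; return -1; }
```

With this reading a failed write leaves `rtc_update_slave` set, so the next
poll simply tries again.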

Re: patch: 2.4.0/2.5.0: nanoseconds time resolution

2001-01-22 Thread Ulrich Windl

On 22 Jan 2001, at 22:55, Albert D. Cahalan wrote:

> > Therefore I put together a simple "hacking document" (see attachment) 
> > to guide you when trying to port the code.  More text can be found in 
> > Documentation/kernel-time.txt after the patch, or in the distribution 
> > for Linux 2.2 (PPSkit-1.0.2.tar*) So please spend an hour or two to 
> > help me out there. I hope I'm not forced to drop the project.
> 
> URL for the patch? BTW, this is something for the 2.5.xx series.

The URL for the patch is at the top of the hacking document; I figured 
that those who don't read it won't need the URL.

Yes, the patch is intended for 2.5 if you all want it. However, it 
applies to 2.4.0 for those who need it right now. As stated, it requires 
some extra work that I can't do alone.

> 
> > * time is kept in nanoseconds.
> 
> Nice, I'd imagine. Would that be 64-bit nanoseconds since 1970?

Compatibility: Just using timespec instead of timeval at the user-level.
Seconds are still 32bit on 32 bit machines.

> 
> > `do_fast_gettimeoffset()' is replaced
> >   with `do_exact_nanotime()' that returns nanoseconds passed since
> >   occurrence of the last timer interrupt. `do_slow_gettimeoffset()' is
> >   replaced with `do_poor_nanotime()' accordingly.
> 
> Ugh. Those names are awful. Why would anyone use do_poor_nanotime()
> when they could have something better?

That's exactly the point: For an i486 you must use the timer's counter 
register to interpolate between interrupts, but for the Pentium you can 
use the cycle counter of the CPU. When making a kernel for a 
distribution, you can't know whether the system will have a Pentium, so 
the decision is made during boot (just as it was before).

The old naming put stress on the speed of the routines (I guess), while I 
put stress on the accuracy. So "poor" means "poor accuracy".
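
To make the boot-time decision concrete, here is a minimal sketch of the
callback idea. Only the name `do_nanotime` comes from the hacking document;
the stub routines, their return values and `select_nanotime()` are
assumptions for illustration. The real routines would read the CPU cycle
counter or the timer chip's count register.

```c
#include <assert.h>

typedef unsigned long (*nanotime_fn)(void);

/* Stand-ins returning fixed values so the selection can be observed. */
unsigned long exact_nanotime_stub(void) { return 1; }     /* fine-grained */
unsigned long poor_nanotime_stub(void)  { return 1000; }  /* coarse */

/* Callback used by the architecture-independent code. */
nanotime_fn do_nanotime = poor_nanotime_stub;

/* Called once during boot, after CPU detection. */
void select_nanotime(int cpu_has_cycle_counter)
{
    do_nanotime = cpu_has_cycle_counter ? exact_nanotime_stub
                                        : poor_nanotime_stub;
}

/* Common code only ever goes through the pointer. */
unsigned long nanotime_since_tick(void)
{
    assert(do_nanotime != 0);
    return do_nanotime();
}
```

A distribution kernel can thus ship both routines and still use the better
one wherever the hardware allows it.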

> 
> > * Updating the RTC is controlled by new variables: `rtc_update_slave',
> >   when non-zero, controls after how many seconds the RTC has to be
> >   updated. Internally `last_rtc_update' keeps the time of the last
> >   update.  Upon update the `rtc_update_slave' is cleared on success.
> 
> What about leap seconds on network and non-UNIX filesystems?  >:-)

You mean to say that a leap second is an implicit time update? I can 
implement it without any trouble, if you all can agree that the idea is 
acceptable. BTW: The same applies to RTCs using local time when we switch 
from/to DST: The kernel doesn't have the tables, so you (or cron) must 
update /proc/sys/kernel/time/timezone.

I'd be glad if these were the only problems you had. ;-)

Regards,
Ulrich


2.2.16: How to freeze the kernel

2000-11-24 Thread Ulrich Windl


Hello,

this is for your interest, amusement, and for "what not to do":

I managed to freeze the kernel (2.2.16 from SuSE Linux 7.0) in a way 
that I could not even switch virtual consoles. Completely silent 
everything...

It all started when Windows/95 ruined another CD-R while trying to 
write an image to the media. So I decided to try it with Linux, using 
the same CD writer.

I plugged the device into the so far unused SCSI channel and used the 
"add-single-device" method to avoid a reboot, and I succeeded:

kgate kernel: scsi singledevice 0 0 4 0
kgate kernel:   Vendor: WAITEC    Model: WT624    Rev: 7.0F
kgate kernel:   Type:   CD-ROM    ANSI SCSI revision: 0
kgate kernel: Detected scsi CD-ROM sr1 at scsi0, channel 0, id 4, lun 0
kgate kernel: (scsi0:0:4:0) Synchronous at 10.0 Mbyte/sec, offset 15.
kgate kernel: sr1: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray

Then I used "cdrecord-1.8.1" to simulate writing at "speed=8". It 
worked so far, but there was a warning about possible problems with 
"simulated fixation", and indeed nothing happened for several minutes 
while the simulated fixation was expected to take place.

At some point I hit ^C, returning to the prompt. As the device did not 
seem to be ready, I thought "remove the device and reconnect", so I did 
"remove-single-device" (possibly while a command was still "busy"). The 
remove succeeded, but a second later everything had stopped!

Should it be possible to remove a device with busy commands? I guess not...

The last message in the syslog was:

kgate kernel: scsi : aborting command due to timeout : pid 8358,
 scsi0, channel 0, id 4, lun 0 UNKNOWN(0x5b) 00 02 00 00 00 00 00 00 00

At that point I pressed "RESET", and interestingly the builtin BIOS of 
the Adaptec 2740 (EISA) hung while trying to detect the device.

Only after powering down both the CD writer and the machine (a HP 
Netserver LD Pro) did the BIOS detect the device again. So I guess 
something was badly hung...

The driver being used was
Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.1.31/3.2.4

After that, everything worked fine.

Regards,
Ulrich


2.4.0test11: some issues and a possible show stopper

2000-12-03 Thread Ulrich Windl


Having read in the German computer magazine c't that Linux 2.4 
is scheduled for release in December, and that Linus complained that 
people do not want to test the new kernel, I decided to test it.

The hardware was: Spacewalker/Shuttle AV11 (VIA Apollo Pro chipset), 
Intel Celeron-500 ("boxed"), 128MB PC133 SDRAM (Infineon, no crap),
EIDE IBM hard disk (4GB, supporting UDMA 33).

[If you need more detail, I can provide them]

First the less important issues:

During config I noticed that documentation is missing for CONFIG_INPUT 
(which is later required for USB) and CONFIG_NLS_UTF8 (which is probably 
even less clear than 88591 or CP850).

Some source files produced an assembler warning about an Indirect lcall 
without '*'

When booting, the kernel said
"Unknown bridge resource 0; assuming transparent". I don't know what 
this means.

When typing "cat /proc/kmsg" I noticed that the process is not 
interruptible.

Loading the keymap failed, but it seems SuSE Linux 7.0 is not quite 
2.4-ready (util-linux, modutils, e2fsprogs too old).

I also got "EXT2 check option not supported" and "can't locate module 
vfat" (the latter probably because of old modutils, however).

During some heavy disk I/O I got the impression that buffer writes are 
delayed significantly, and that reading can be delayed by several 
seconds when there is "writing back dirty buffers".

Finally I got a "gzip -t" CRC error on the kernel tar archive that was 
without error when tried with 2.2.17. This is the possible show 
stopper. The syslog messages did not report any problem (harddisk 
operating in UDMA 33 mode, using a proper cable).

Documentation/sysctl/kernel.txt still is 2.2.10!

After hacking the kernel I got a conflict between  
and , but it was too late to investigate. (I had spent 
over 4 hours merging rejected diffs, and I was tired from pressing C-d 
C-d C-n in Emacs ;-))

Regards,
Ulrich


poll: nanoseconds in 2.5?

2000-12-06 Thread Ulrich Windl


Hello,

maybe some of you know that I patched an early 2.2 kernel (2.1.131 or 
so) to provide nanoseconds to the customers, i.e. xtime has tv_nsec.

The patch is available throughout 2.2 (including 2.2.17).

I merged the patch into 2.4test11, it compiles and boots so far.

Now I wonder if there's interest in integrating my code into an early 2.5. 
I will have to clean up some obsolete stuff, and put a few things in 
order first.

I will need strong support for the non-i386 architectures however (I 
only have a Pentium for testing).

Interestingly some of my changes are already in 2.4: Moving the time 
stuff out from kernel/sched.c, joining mktime(), etc.

If there is interest, please say so. I could provide an early 
alpha-quality patch by Monday, maybe even this Friday if someone wants to 
test it or implement another architecture.

(The 2.2 stuff is named PPSkit-1.0.1 and can be found in 
/pub/linux/daemons/ntp/PPS on most mirrors of quality ;-)

Regards,
Ulrich


Re: 2.2.16: OOPS 2 & VFS panic

2000-08-29 Thread Ulrich Windl

On 30 Aug 2000, at 8:49, Mike Galbraith wrote:

> On Wed, 30 Aug 2000, Ulrich Windl wrote:
> 
> > The syslog (2.5kB) with surrounding messages is attached.
> 
> No it's not :)

8-(

It happened because of forwarding the bounced message from 
vger.rutgers.edu.

Now it is attached. Sorry.

Ulrich

> 

 Panic

2.2.16: OOPS 2 & VFS panic

2000-08-29 Thread Ulrich Windl


Hello,

I had a kernel panic with 2.2.16 yesterday. Because this is such a rare 
event, I immediately checked my RAM (memtest86), but the RAM is OK. 
There was no thunderstorm, no mobile phone nearby, the CPU and 
RAM are not overclocked, and all chipsets are genuine Intel. I only have 
two memory chips from China, but that should be close enough to Taiwan ;-)

Maybe it helps for reproduction: After boot the system did a periodic 
fsck, then I installed a program, thereby changing CDs twice. 
Immediately after installation the system behaved oddly and the panic 
came along.

And yes, I've been running that kernel on that machine before without 
problems. Only kswapd seemed unstable in 2.2.16. My machine, a P100, 
has 64MB RAM.

The syslog (2.5kB) with surrounding messages is attached.

Regards,
Ulrich
P.S. The library issue is probably due to SuSE-7.0.


Linux & nanoseconds

2000-09-04 Thread Ulrich Windl


Hello,

I revised the code that calculates the nanoseconds in Linux, and I 
thought I'll drop some notes here:

As I found out there is a systematic error of 167ns per tick, almost 
17us per second. This is because of the timer chip that is used to 
generate the interrupts for "100 Hz".
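
The figure can be reproduced with a little arithmetic. The sketch below
assumes the conventional PC timer input clock of 1193180 Hz and a reload
value rounded to the nearest integer; neither number is stated in this
message, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Timer reload value: input clock divided by HZ, rounded to nearest. */
uint32_t pit_latch(uint32_t clock_hz, uint32_t hz)
{
    assert(hz != 0);
    return (clock_hz + hz / 2) / hz;
}

/* Real length of one tick in ns minus the nominal 1e9/HZ (truncated). */
int64_t tick_error_ns(uint32_t clock_hz, uint32_t hz)
{
    assert(clock_hz != 0);
    int64_t actual  = (int64_t)pit_latch(clock_hz, hz) * 1000000000LL / clock_hz;
    int64_t nominal = 1000000000LL / hz;
    return actual - nominal;
}
```

For 1193180 Hz and HZ=100 the reload value is 11932 instead of the exact
11931.8, so each tick is about 167ns too long, or 16.7us per second.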

Linux uses the register of the timer chip to interpolate time. The 
current implementation introduces an error of another 160ns (or 
around). I haven't checked, but maybe they just compensate each other.

Finally the current resolution gotten from the TSC (PCC) is limited to 
16 cycles (or 160ns for a 100MHz CPU). Linux uses the TSC to 
interpolate between (within) ticks only. This is for historical reasons 
and possibly because of APM (ACPI?) varying the CPU speed. The current 
implementation does not recalibrate for variable CPU speed, so the tick 
interpolation is condensed towards the start of a tick if the CPU is 
getting slower. However if the timer chip gets slower at the same rate, 
the interpolation is fine, but the absolute time is wrong.

I always tried to fix the most urgent problem; there's still potential 
to improve. I need your experience as well.

The current modification (based on PPSkit-0.9.3) is "nanofix.diff.gz", 
both located on your favourite Linux mirror in 
pub/linux/daemons/ntp/PPS (or very similar).

Regards,
Ulrich



kernel: eepro100: wait_for_cmd_done timeout!

2000-11-08 Thread Ulrich Windl


Hello,

I'm seeing this message periodically:

Nov  8 09:52:59 kgate last message repeated 5 times
Nov  8 11:26:54 kgate kernel: eepro100: wait_for_cmd_done timeout!
Nov  8 11:56:12 kgate kernel: eepro100: wait_for_cmd_done timeout!
Nov  8 14:38:45 kgate kernel: eepro100: wait_for_cmd_done timeout!
Nov  8 14:38:47 kgate last message repeated 3 times
Nov  8 14:56:11 kgate kernel: eepro100: wait_for_cmd_done timeout!
Nov  8 14:57:01 kgate last message repeated 10 times
Nov  8 21:32:15 kgate kernel: eepro100: wait_for_cmd_done timeout!
Nov  8 22:57:46 kgate kernel: eepro100: wait_for_cmd_done timeout!

The source contains:

/* How to wait for the command unit to accept a command.
   Typically this takes 0 ticks. */
static inline void wait_for_cmd_done(long cmd_ioaddr)
{
    int wait = 1000;
    do   ;
    while(inb(cmd_ioaddr) && --wait >= 0);
#ifndef final_version
    if (wait < 0)
        printk(KERN_ALERT "eepro100: wait_for_cmd_done timeout!\n");
#endif
}

My machine is a HP Netserver LD Pro with a 200MHz Pentium Pro. I guess 
a fast machine will only allow a very short time for the above loop. 
Shouldn't it be fixed?
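
One way to make the timeout independent of CPU speed is to bound the loop
by elapsed time instead of by iteration count. The following is a sketch
only, not the driver's code: the status and clock callbacks, the fake
device and all names are assumptions so that the logic can be tried outside
the kernel.

```c
#include <assert.h>
#include <stdint.h>

typedef int      (*status_fn)(void *ctx);  /* non-zero while the unit is busy */
typedef uint64_t (*clock_fn)(void *ctx);   /* monotonic time in nanoseconds */

/* Poll until the unit is idle or timeout_ns has elapsed.
   Returns 0 when idle, -1 on timeout. */
int wait_until_idle(status_fn busy, clock_fn now, void *ctx,
                    uint64_t timeout_ns)
{
    assert(busy != 0 && now != 0);
    uint64_t start = now(ctx);
    while (busy(ctx)) {
        if (now(ctx) - start >= timeout_ns)
            return -1;
    }
    return 0;
}

/* Fake device: busy for the first `busy_polls` status reads; every clock
   read advances the fake time by 100ns. */
struct fake_dev {
    int      busy_polls;
    uint64_t t_ns;
};

int fake_busy(void *ctx)
{
    struct fake_dev *d = ctx;
    return d->busy_polls-- > 0;
}

uint64_t fake_now(void *ctx)
{
    struct fake_dev *d = ctx;
    d->t_ns += 100;
    return d->t_ns;
}
```

In the kernel the clock would have to come from something like jiffies or a
short fixed delay per iteration; the point is only that the bound no longer
depends on how fast the CPU spins.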

The hardware is this:
01:02.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] 
(rev 02)
Subsystem: Hewlett-Packard Company Ethernet Pro 10/100TX
Flags: bus master, medium devsel, latency 66, IRQ 9
Memory at fe8fe000 (32-bit, prefetchable)
I/O ports at ece0
Memory at fea0 (32-bit, non-prefetchable)


kgate kernel: eepro100.c:v1.09j-t 9/29/99 Donald Becker 
http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
kgate kernel: eepro100.c: $Revision: 1.20.2.10 $ 2000/05/31 Modified by 
Andrey V. Savochkin <[EMAIL PROTECTED]> and others
kgate kernel: eth0: OEM i82557/i82558 10/100 Ethernet, 00:60:B0:6D:F1:AE, 
IRQ 9.
kgate kernel:   Board assembly 673610-001, Physical connectors present: 
RJ45
kgate kernel:   Primary interface chip i82555 PHY #1.
kgate kernel:   General self-test: passed.
kgate kernel:   Serial sub-system self-test: passed.
kgate kernel:   Internal registers self-test: passed.
kgate kernel:   ROM checksum self-test: passed (0x49caa8d6).
kgate kernel:   Receiver lock-up workaround activated.

The software is Linux-2.2.16 (SuSE 7.0).

Regards,
Ulrich


2.4.0test11: "nanoseconds patch" (prerelease) available

2000-12-10 Thread Ulrich Windl


Hi,

related to my question about having nanoseconds in xtime for Linux 2.5, 
two (or three) people were interested, or at least managed to route 
their message to me. As promised I have made an early release patch 
against 2.4.0test11 available at

ftp.kernel.org:/pub/linux/daemons/ntp/PPS/pps-2.4-pre1.tar.bz2 (63kB, 
patch + digital signature)

The modified sources compile, link and boot (for arch/i386), but 
consider this code alpha quality, and don't use it in production. 
It is possible that it works perfectly, but I simply don't have 
the experience.

Fixes for any architecture are appreciated. Eventually I want to get rid 
of gettimeoffset() and a lot of redundant code.

I noticed that the ATM drivers access xtime directly. If jiffies are 
not fine enough, do_gettimeofday() has to be called for now. If that's 
too slow, we have to think about an alternative.

Regards,
Ulrich


i386: gcc & asm(): wrong constraint for "mull"

2000-12-29 Thread Ulrich Windl


Hello,

I noticed (with some inspiration from Andy Kleen) that some asm() 
instructions for the ia32 use the "g" constraint for "mull", where my 
Intel 386 Assembly Language Manual suggests the "MUL" instruction needs 
an r/m operand. So I guess the correct constraint is "rm" in gcc, and 
not "g". That change produces identical assembly output for gcc-2.95.2, 
but some gcc-2.96.x will try a multiplication with an immediate 
(constant) operand for the "g" constraint, and as will choke on that.
(Redhat 7.0 ships such a version of gcc.)
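
A minimal sketch of what I mean, not taken from the kernel sources: the
function name is made up, and the portable fallback is only there so the
example also builds on other CPUs.

```c
#include <assert.h>
#include <stdint.h>

/* Return the high 32 bits of the 64-bit product a * b. */
uint32_t mul_hi32(uint32_t a, uint32_t b)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    /* "rm" limits operand 3 to a register or memory location, which is
       what MUL accepts; "g" would also permit an immediate. */
    __asm__("mull %3"
            : "=a" (lo), "=d" (hi)
            : "0" (a), "rm" (b)
            : "cc");
    (void)lo;
    return hi;
#else
    return (uint32_t)(((uint64_t)a * b) >> 32);
#endif
}
```

With "g" the compiler is free to substitute a constant argument as an
immediate, and there is no encoding of MUL that takes one.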

As I won't be online next week, let me say
regards and a good new year to all!

Ulrich


Re: i386: gcc & asm(): wrong constraint for "mull"

2000-12-29 Thread Ulrich Windl


On 29 Dec 2000, at 5:17, Jakub Jelinek wrote:

> On Fri, Dec 29, 2000 at 10:54:38AM +0100, Ulrich Windl wrote:
> > Hello,
> > 
> > I noticed (with some inspiration from Andy Kleen) that some asm() 
> > instructions for the ia32 use the "g" constraint for "mull", where my 
> > Intel 386 Assembly Language Manual suggests the "MUL" instruction needs 
> > an r/m operand. So I guess the correct constraint is "rm" in gcc, and 
> > not "g". That change produces identical assembly output for gcc-2.95.2, but some 
> > gcc-2.96.x will try a multiplication with an immediate (constant) 
> > operand for the "g" constraint, and the as will choke on that.
> > (Redhat 7.0 ships such a version of gcc).
> 
> gcc 2.95.2 md.texi says:
> @cindex @samp{g} in constraint
> @item @samp{g}
> Any register, memory or immediate integer operand is allowed, except for
> registers that are not general registers.
> 
> (2.95.2 was chosen to make it clear it is not something new in gcc).
> That means gcc is really free to choose which of register, memory or
> immediate it puts in and the fact that some gcc version choose one and
> others choose other is perfectly correct.
> Fix the constraints and be happy (at least during the upcoming millennium) :)

Oh, in case it wasn't clear: that's what I wanted to say. As I don't have a 
patch ready for that, maybe start at arch/i386/kernel/time.c; there are 
at least two of these "mull" instructions.

Ulrich


Compiling 2.2 on Pentium

2000-10-06 Thread Ulrich Windl


Hi,

I noticed that when compiling with gcc-2.95.2 for a Pentium the flag 
"-m486" is still passed to gcc. However gcc-2.95.2 generates different 
code if "-m586" is used (older versions ended at -m486).

Is the makefile intentionally not updated, or was it just forgotten?

Regards,
Ulrich


2.2.17: CPU features bug for AMD?

2000-10-09 Thread Ulrich Windl


Browsing patch-2.2.17.gz I found this:
linux/arch/i386/kernel/setup.c:
Isn't an "else" or "break" missing here? Otherwise
``x86_cap_flags[16] = "pat"'' is always the case, and extended
AMD features are always present.
@@ -1029,17 +1130,22 @@
case X86_VENDOR_AMD:
if (c->x86 == 5 && c->x86_model == 6)
x86_cap_flags[10] = "sep";
-   x86_cap_flags[16] = "fcmov";
+   if (c->x86 < 6)
+   x86_cap_flags[16] = "fcmov";
+   x86_cap_flags[16] = "pat";
+   x86_cap_flags[22] = "mmxext";
+   x86_cap_flags[24] = "fxsr";
+   x86_cap_flags[30] = "3dnowext";
x86_cap_flags[31] = "3dnow";
break;
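
To spell out the suspicion, here is a stand-alone sketch, not kernel code:
both helper functions are invented for illustration. It compares the
control flow as patched with the variant that has the "else" I would have
expected. Whether the "else" is really what was intended is my assumption.

```c
#include <assert.h>
#include <string.h>

/* Name shown for capability bit 16, following the patch as posted:
   the second assignment is unconditional and always wins. */
const char *amd_cap16_as_patched(int family)
{
    const char *name = "";
    if (family < 6)
        name = "fcmov";
    name = "pat";
    return name;
}

/* The same with the presumably intended "else". */
const char *amd_cap16_with_else(int family)
{
    assert(family >= 0);
    if (family < 6)
        return "fcmov";
    else
        return "pat";
}
```

For a family 5 CPU the first variant reports "pat", the second "fcmov".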



2.2.16: Verify mounting of /

2000-10-18 Thread Ulrich Windl


Hello,

I have some trouble with an initrd configuration where it seems the 
wrong partition is mounted as root (/), even though it seems fine in 
/etc/fstab, and mount and df all display that it's fine.

I realized that the kernel messages are not as helpful as possible:

<4>VFS: Mounted root (ext2 filesystem) readonly.
<4>change_root: old root has d_count=1
<5>Trying to unmount old root ... okay
<4>Freeing unused kernel memory: 68k freed
<6>Adding Swap: 132088k swap-space (priority -1)

It would be good if the devices involved were displayed, I mean _which_ 
filesystem was mounted or unmounted? The mount of the new root is not 
logged at all.

Regards,
Ulrich


FYI: [comp.protocols.time.ntp] announce: Linux PPS support for Kernel 2.4.2

2001-03-13 Thread Ulrich Windl


FYI, a copy...

--- Start of forwarded message ---
From: Ulrich Windl <[EMAIL PROTECTED]>
Newsgroups: comp.protocols.time.ntp
Subject: announce: Linux PPS support for Kernel 2.4.2
Date: 13 Mar 2001 08:04:56 +0100
Organization: University of Regensburg, Germany
Message-ID: <[EMAIL PROTECTED]>

I have uploaded the first working version of a patch for Linux-2.4.2 to
ftp://ftp.kernel.org:/pub/linux/daemons/ntp/PPS/PPS-2.4.2-pre5.diff.bz2

The patch is mostly equivalent to PPSkit-1.0.3, except lacking support
for CIOGETEV. Actually it seems that ATOM won't work without CIOGETEV,
but I'll have to re-investigate...

I would appreciate any feedback (see /usr/src/linux/CREDITS for EMail address)

Regards,
Ulrich
--- End of forwarded message ---

2.4.2: kernel patch for , nanosleep

2001-03-18 Thread Ulrich Windl


Hello,

I found that a patch originally intended for my PPSkit might be of 
interest for the "normal" kernel as well:

nanosleep() currently uses "udelay()" from  as there is no 
"ndelay()". I implemented "ndelay()" for i386 and adjusted the other 
macros. While doing that I found that some files have or use their own 
"delay()" routines. The original delay is dangerous, because depending 
on the CPU it requires loop cycles or clock cycles as argument, resulting 
in unreliable code. Affected sources:

drivers/net/hamradio/yam.c: "delay(100)"
drivers/scsi/wd

I also found that the scaling factor used in the existing code should 
be rounded up (increased by one) for a more exact value.

With the new code there are two possible disadvantages: 1) Reduced 
accuracy, and 2) possible overflow. I hope both are not really a 
problem.
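
Both points can be quantified. The sketch below redoes the scaling
(nsecs * 429) / 100 from the patch in plain C with fixed-width types,
assuming a 32-bit `unsigned long` as on i386; the function names are made
up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NS_SCALE_NUM 429u   /* 4.29 approximates 2**32 / 10**9 = 4.294967... */
#define NS_SCALE_DEN 100u

/* The scaling as it behaves with 32-bit arithmetic (wraps on overflow). */
uint32_t scale_ns_32(uint32_t nsecs)
{
    return (nsecs * NS_SCALE_NUM) / NS_SCALE_DEN;
}

/* The same scaling with a 64-bit intermediate, for comparison. */
uint64_t scale_ns_64(uint32_t nsecs)
{
    return ((uint64_t)nsecs * NS_SCALE_NUM) / NS_SCALE_DEN;
}

/* Largest argument for which the 32-bit multiplication cannot wrap. */
uint32_t max_safe_nsecs_32(void)
{
    return UINT32_MAX / NS_SCALE_NUM;
}
```

The multiplication stays within 32 bits only up to 10011578 ns, about
10 ms, so a 20 ms request (taking the "20 ms" in the NDELAY_LIMIT comment
at face value) would wrap. Using 4.29 instead of 4.294967 also makes every
delay about 0.12% too short.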

Regards,
Ulrich




Index: arch/i386/kernel/i386_ksyms.c
===
RCS file: /root/LinuxCVS/Kernel/arch/i386/kernel/i386_ksyms.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 i386_ksyms.c
--- arch/i386/kernel/i386_ksyms.c   2001/03/11 13:51:19 1.1.1.3
+++ arch/i386/kernel/i386_ksyms.c   2001/03/17 18:08:20
@@ -82,9 +82,9 @@
 /* Networking helper routines. */
 EXPORT_SYMBOL(csum_partial_copy_generic);
 /* Delay loops */
-EXPORT_SYMBOL(__udelay);
+EXPORT_SYMBOL(__ndelay);
 EXPORT_SYMBOL(__delay);
-EXPORT_SYMBOL(__const_udelay);
+EXPORT_SYMBOL(__const_sndelay);
 
 EXPORT_SYMBOL_NOVERS(__get_user_1);
 EXPORT_SYMBOL_NOVERS(__get_user_2);
Index: arch/i386/lib/delay.c
===
RCS file: /root/LinuxCVS/Kernel/arch/i386/lib/delay.c,v
retrieving revision 1.1.1.2
diff -u -r1.1.1.2 delay.c
--- arch/i386/lib/delay.c   2001/01/08 20:17:36 1.1.1.2
+++ arch/i386/lib/delay.c   2001/03/17 18:12:36
@@ -64,16 +64,27 @@
__loop_delay(loops);
 }
 
-inline void __const_udelay(unsigned long xloops)
+/* convert scaled nanoseconds to execution loops and delay */
+inline void __const_sndelay(unsigned long scaled_nsecs)
 {
int d0;
__asm__("mull %0"
-   :"=d" (xloops), "=&a" (d0)
-   :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy));
-__delay(xloops * HZ);
+   :"=d" (scaled_nsecs), "=&a" (d0)
+   :"1" (scaled_nsecs),"0" (current_cpu_data.loops_per_jiffy));
+__delay(scaled_nsecs * HZ);
 }
 
-void __udelay(unsigned long usecs)
+void __ndelay(unsigned long nsecs)
 {
-   __const_udelay(usecs * 0x10c6);  /* 2**32 / 1000000 */
+   /* 2**32 / 1000000000 == 4.2949... */
+   if (nsecs > NDELAY_LIMIT) {
+   static  int complaints = 7;
+
+   if (complaints > 0) {
+   --complaints;
+   printk(KERN_ERR "__ndelay(%lu) exceeds limit\n", nsecs);
+   }
+   nsecs = NDELAY_LIMIT;
+   }
+   __const_sndelay((nsecs * 429) / 100);
 }
Index: include/asm-i386/delay.h
===
RCS file: /root/LinuxCVS/Kernel/include/asm-i386/delay.h,v
retrieving revision 1.1.1.2
diff -u -r1.1.1.2 delay.h
--- include/asm-i386/delay.h   2001/01/08 20:22:29 1.1.1.2
+++ include/asm-i386/delay.h   2001/03/17 17:58:33
@@ -7,14 +7,19 @@
  * Delay routines calling functions in arch/i386/lib/delay.c
  */
  
-extern void __bad_udelay(void);
+extern void __bad_ndelay(void);
 
-extern void __udelay(unsigned long usecs);
-extern void __const_udelay(unsigned long usecs);
-extern void __delay(unsigned long loops);
+extern void __ndelay(unsigned long nsecs);
+extern void __const_sndelay(unsigned long scaled_nsecs);
+extern void __delay(unsigned long xloops);
 
-#define udelay(n) (__builtin_constant_p(n) ? \
-   ((n) > 20000 ? __bad_udelay() : __const_udelay((n) * 0x10c6ul)) : \
-   __udelay(n))
+#define NDELAY_LIMIT 20000000 /* 20 ms (2 / HZ)? */
+
+#define ndelay(n) (__builtin_constant_p(n) ? \
+   ((n) > NDELAY_LIMIT ? \
+   __bad_ndelay() : __const_sndelay(((n) * 429ul) / 100)) : \
+   __ndelay(n))
+
+#define udelay(n) ndelay((n) * 1000)
 
 #endif /* defined(_I386_DELAY_H) */
Index: kernel/timer.c
===
RCS file: /root/LinuxCVS/Kernel/kernel/timer.c,v
retrieving revision 1.1.1.2.8.1
diff -u -r1.1.1.2.8.1 timer.c
--- kernel/timer.c  2001/03/11 15:29:17 1.1.1.2.8.1
+++ kernel/timer.c  2001/03/17 17:22:57
@@ -592,10 +592,11 @@
/*
 * Short delay requests up to 2 ms will be handled with
 * high precision by a busy wait for all real-time processes.
+* Anything else will be delayed for at least 1/HZ.
 *
 * Its important on SMP not to do this holding locks.
 */
-   udelay((t.tv_nsec + 999) / 1000);
+   ndelay(t.tv_nsec);

2.2.18: e100.c (SuSE 7.1): udelay() used in a wrong way?

2001-03-22 Thread Ulrich Windl


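From the source code of drivers/net/e100.c:

/***************************************************************************
 * Name:  Phy82562EHDelayMilliseconds
 *
 * Description:   Stalls execution for a specified number of milliseconds.
 *
 * Arguments: Time - milliseconds to delay
 *
 * Returns:   Nothing
 *
 ***************************************************************************/
void
Phy82562EHDelayMilliseconds(int Time)
{
    udelay(Time);
}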


AFAIK, udelay() delays microseconds, not milliseconds.
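
A sketch of what the routine presumably should do, with a stub in place of
the kernel's udelay() so it can be tried in user space. The stub, the
counter and the function name are made up for illustration.

```c
#include <assert.h>

/* Stand-in for the kernel's udelay(): only records the requested time. */
unsigned long simulated_delay_us;

void udelay_stub(unsigned long usecs)
{
    simulated_delay_us += usecs;
}

/* Delay for `ms` milliseconds, one millisecond per udelay call so that
   each individual request stays short. */
void delay_milliseconds(int ms)
{
    assert(ms >= 0);
    while (ms-- > 0)
        udelay_stub(1000);
}
```

In the kernel itself mdelay() from the delay header does this job, so the
driver could simply call that.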

Ulrich


2.4.x: spinlock problem

2001-02-06 Thread Ulrich Windl


Hello,

I had reported this before: In 2.4.0 getting exact system time from 
interrupt handlers seems inaccurate (in 2.2.18 it works fine). I have 
applied the same modifications to the 2.4 code base as to 2.2.

With 2.4.1 the kernel is incredibly slow, so slow that you can watch the 
individual kernel boot lines being printed on the screen.

I checked the spinlocks, but could not find a problem. Before I start 
removing the new spinlocks for timer, PIC and RTC, I'd like to hear 
what the gurus might think.

I also tried to find out how I can profile the kernel or trace the 
spinlocks, but that seems to be hardly documented.

Any hints?

Regards,
Ulrich


having a hard time with 2.4.x

2001-02-06 Thread Ulrich Windl


Hello,

I have some news on the topic of timekeeping in Linux-2.4:

As Alan Cox pointed out, the ACPI changes between 2.4.0 and 2.4.1 caused 
extremely slow console output (if not more). Configuring away ACPI support 
solved that problem.

However there is still a problem that I cannot explain. I wrote a test program 
for my modified kernel (I did not try the original one). I'll include the 
program plus results (if you want to see the patch go to 
ftp.kernel.org/pub/linux/daemons/ntp/PPS and get PPS-2.4.0-pre3.tar.bz2 (patch 
plus signature)):

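#include <stdio.h>
#include <sys/time.h>   /* header name lost in the archive; assumed */
#define NTP_NANO
#include <sys/timex.h>  /* tv_nsec below needs the patched timex header */

int main()
{
    struct timex tx;
    long lastns = 0;

    tx.modes = 0;       /* query only, change nothing */
    while(1)
    {
        adjtimex(&tx);
        printf("%ld %ld %ld\n",
               (long)tx.time.tv_sec, (long)tx.time.tv_nsec,
               (long)tx.time.tv_nsec - lastns);
        lastns = tx.time.tv_nsec;
        fflush(stdout);
    }
}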
/*--
The following anomalies were examined by running the program for a few
seconds, redirecting output into a file:
981488742 428870884 428870884
981488742 429242679 371795
981488742 429258279 15600
981488742 429266001 7722
981488742 429273781 7780
981488742 429281142 7361
...this is just the startup, filling the caches; 7us seems acceptable
981488742 442133766 7235
981488742 442155740 21974
981488742 442164248 8508
981488742 442171719 7471
... some occasional jitter seems acceptable, too
981488742 451557086 7280
981488742 451564393 7307
981488742 461600593 10036200
981488742 461609928 9335
981488742 461617263 7335
...here we lost 10ms, possibly due to scheduling
981488742 991589894 7317
981488742 991597171 7277
981488743 1628395 -989968776
981488743 1636937 8542
...the new second seems to begin a bit early; I'm missing about 8ms
981488743 991650854 7411
981488743 991658147 7293
981488744 1724546 -989933601
981488744 1732344 7798
...this is quite reproducible
981488751 294943079 7327
981488751 294950364 7285
981488751 294957703 7339
981488751 294964994 7291
981488751 294964995 1
981488751 294964996 1
981488751 294964997 1
981488751 294964998 1
981488751 294964999 1
...here something strange happened: time refused to advance, forcing
...the code to generate synthetic time (add 1ns). Here comes the end:
981488751 294967294 1
981488751 294967295 1
981488747 0 -294967295
981488747 37804096 37804096
981488747 37811711 7615
981488747 37819006 7295
...time went back by four seconds! This happened again:
981488752 294967292 1
981488752 294967293 1
981488752 294967294 1
981488752 294967295 1
981488748 0 -294967295
981488748 100304297 100304297
981488748 100311973 7676
981488748 100319231 7258
...but sometimes the second overflows correctly:
981488748 87866 7315
981488748 95152 7286
981488749 2417 -92735
981488749 9995 7578
981488749 17227 7232
...
981488749 91971 30023
981488750 747 -91224
981488750 8405 7658
981488750 15531 7126

Here is a simplified sample with microseconds instead, after having removed
two spinlocks (as they are in 2.2.18):

981487863 665701 665701
981487863 666048 347
981487863 666062 14
981487863 666071 9
981487863 666078 7
...start as usual
981487863 668825 7
981487863 668832 7
981487863 668855 23
981487863 668863 8
...some jitter
981487863 673861 7
981487863 673869 8
981487863 673876 7
981487863 683930 10054
981487863 683938 8
981487863 683946 8
...and scheduling
981487863 993871 8
981487863 993879 8
981487864 3905 -989974
981487864 3913 8
981487864 3920 7
...we still lack 10ms during overflow...
981487869 293860 7
981487869 293867 7
981487869 293875 8
981487869 293875 0
981487869 293875 0
...then time also refuses to advance
981487869 293941 0
981487869 293941 0
981487866 28937 -265004
981487866 28946 9
981487866 28954 8
...eventually losing a few seconds

Possible explanation for negative time: 2^32/500000000 == 8.59,
i.e. the low 32 bits of the TSC will overflow every 8.59 seconds on a
500MHz CPU, probably causing a bad interpolation between ticks.
But: Why doesn't the effect occur with kernel 2.2???

 *--*/

Regards,
Ulrich
P.S.: I'm not subscribed here, so CC: is appreciated.
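
The wrap-around estimate in the comment above can be checked with a small sketch. The function names are hypothetical and do not come from the patch; 500 MHz is the clock rate assumed in the message. The second function shows why a delta taken on the low word alone survives a single wrap, since unsigned subtraction works modulo 2^32.

```c
#include <stdint.h>

/*
 * Seconds until the low 32 bits of a free-running cycle counter wrap,
 * for a counter running at cpu_hz cycles per second.
 */
double tsc_low32_wrap_seconds(double cpu_hz)
{
    return 4294967296.0 / cpu_hz;   /* 2^32 cycles per wrap */
}

/*
 * Cycle delta between two readings of the low word. The result is
 * still correct if the counter wrapped once between the readings,
 * because unsigned subtraction is performed modulo 2^32.
 */
uint32_t tsc_low32_delta(uint32_t now, uint32_t last)
{
    return now - last;
}
```

For 500 MHz the first function gives roughly 8.59 seconds. A delta only goes wrong if more than one full wrap passes between two readings, which would require ticks to be missed for several seconds.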



2.4 kernel & gcc code generation: a bug?

2001-02-06 Thread Ulrich Windl


Trying to find out what got broken in kernel 2.4, I was so clueless as 
to compare assembly output for 2.2.18 with 2.4.1. However the assembly 
is quite different, as 2.4 uses the more advanced optimizations of 
gcc-2.95.2. Anyway:

1) spinlocks look strange in 2.2(!):

.globl rtc_lock
.type rtc_lock,@object
.size rtc_lock,0
rtc_lock:
.globl i8253_lock

while in 2.4.1 they look like this:

.globl rtc_lock
.align 4
.type rtc_lock,@object
.size rtc_lock,4
rtc_lock:
.long 0
.globl i8253_lock


2) gcc seems to fail to save registers that are marked "spilled" in 
inline asm's constraints, like rdtsc():

/* nanoseconds since last timer interrupt (using the CPU cycle-counter) */
static inline unsigned long do_exact_nanotime(void)
{
    register unsigned long eax asm("ax");
    register unsigned long edx asm("dx");
    unsigned long result;

    rdtsc(eax, edx);        /* Read the Time Stamp Counter */

    /* .. relative to previous jiffy (32 bits is enough) */
    eax -= last_tsc_low;    /* tsc_low delta */

    /*
     * Time offset = (tsc_low delta << 4) * exact_nanotime_quotient
     *             = (tsc_low delta << 4) * (nsecs_per_clock)
     *             = (tsc_low delta << 4) * (nsecs_per_jiffy /
     *                                       clocks_per_jiffy)
     *
     * Using a mull instead of a divl saves up to 31 clock cycles
     * in the critical path.
     */
    __asm__("mull %2"
            :"=a" (eax), "=d" (edx)
            :"rm" (exact_nanotime_quotient),
             "0" (eax << 4));

    /* our adjusted time offset in nanoseconds */
    result = nanodelay_at_last_interrupt + edx;
    return result;
}

.text
.align 4
.type do_exact_nanotime,@function
do_exact_nanotime:
#APP
rdtsc
#NO_APP
subl last_tsc_low,%eax
sall $4,%eax
#APP
mull exact_nanotime_quotient
#NO_APP
movl nanodelay_at_last_interrupt,%eax
addl %edx,%eax
ret
.Lfe7:
.size do_exact_nanotime,.Lfe7-do_exact_nanotime
.local  last_rtc_update
.comm   last_rtc_update,4,4
.comm   timer_ack,4,4
.ident  "GCC: (GNU) 2.95.2 19991024 (release)"

#endif


You'll notice that %edx is not pushed at the start of the function. 
Unless the caller saves that, edx will be spilled. Depending on the 
level of optimization this can be bad. Am I wrong?

Regards,
Ulrich
P.S: Not subscribed here, so please CC: if possible.
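
The multiply-instead-of-divide trick in do_exact_nanotime() can be sketched in portable C. This is an illustration under assumptions, not the patch's code: the function names are hypothetical, and the 2^28 scaling of the quotient is inferred from the "<< 4" in the listing combined with taking the upper 32 bits of the product (4 + 28 = 32).

```c
#include <stdint.h>

/*
 * Precompute the scaling factor once per calibration (assumed form):
 * quotient = ns_per_jiffy * 2^28 / clocks_per_jiffy.
 * The result only fits in 32 bits while one clock is shorter than 16 ns.
 */
uint32_t nanotime_quotient(uint32_t ns_per_jiffy, uint32_t clocks_per_jiffy)
{
    return (uint32_t)(((uint64_t)ns_per_jiffy << 28) / clocks_per_jiffy);
}

/*
 * Convert a cycle delta to nanoseconds with one 32x32->64 multiply.
 * The upper 32 bits of the product are what "mull" leaves in %edx.
 */
uint32_t cycles_to_ns(uint32_t tsc_delta, uint32_t quotient)
{
    uint64_t product = (uint64_t)(uint32_t)(tsc_delta << 4) * quotient;
    return (uint32_t)(product >> 32);
}
```

With a hypothetical 500 MHz CPU and 10 ms jiffies (5000000 clocks per jiffy), the quotient is 2^29, and a delta of 2500000 cycles converts to 5000000 ns, i.e. half a jiffy.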


throttling kernel messages: KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1282)

2005-04-12 Thread Ulrich Windl

Hi,

I'm affected by the (in)famous bug:
Apr 12 07:03:02 mailgate kernel: recvmsg bug: copied D640F0D1 seq D640F679
Apr 12 07:03:02 mailgate kernel: KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1282)
Apr 12 07:03:02 mailgate kernel: recvmsg bug: copied D640F0D1 seq D640F679
Apr 12 07:03:02 mailgate kernel: KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1282)

(Kernel of SuSE Linux 9.2, 2.6.8-24.13-default #1 Fri Mar 18 10:19:42 UTC 2005 
i686 i686 i386 GNU/Linux)

The kernel spits out hundreds to thousands of messages per second, making klogd and 
syslogd quite busy, and my messages file stopped growing at 2GB.

I'd suggest enabling throttling for this message, or triggering a panic/reboot, 
or maybe even fixing the bug or the message. ;-)

Regards,
Ulrich
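
What throttling a repeated message could look like is sketched below as plain userspace C. This is only an illustration of the idea; the structure and function names are hypothetical and it is not kernel code. It lets at most `burst` messages through per `interval` seconds and counts the rest.

```c
#include <time.h>

/* Hypothetical limiter state: at most `burst` messages per `interval` seconds. */
struct msg_limiter {
    time_t interval;      /* length of one window in seconds */
    int    burst;         /* messages allowed per window */
    time_t window_start;  /* start of the current window */
    int    printed;       /* messages let through in this window */
    long   suppressed;    /* messages dropped in this window */
};

/* Return 1 if a message may be logged at time `now`, 0 if it should be dropped. */
int msg_allowed(struct msg_limiter *l, time_t now)
{
    if (now - l->window_start >= l->interval) {
        l->window_start = now;   /* open a new window */
        l->printed = 0;
        l->suppressed = 0;
    }
    if (l->printed < l->burst) {
        l->printed++;
        return 1;
    }
    l->suppressed++;
    return 0;
}
```

A caller would wrap the log statement in `if (msg_allowed(&limiter, time(NULL)))` and could report the `suppressed` count when a new window opens.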


(Fwd) about timer in linux kernel.

2001-05-21 Thread Ulrich Windl

Maybe one of the people who wrote the code wants to explain...

Thanks, Ulrich
--- Forwarded message follows ---
From:       "meng-ju" <[EMAIL PROTECTED]>
To:         <[EMAIL PROTECTED]>
Subject:    about timer in linux kernel.
Date sent:  Fri, 18 May 2001 16:58:55 -0700

Hi! Mr. Ulrich Windl,

I want to know how the timer works in the kernel.
When we call add_timer(), it calls add_timer_internal to add the timer to its list.
Now I am confused about how the system checks whether a timer has expired.
In run_timer_list(), why does it use tv1.vec + tv1.index to find the expiration 
point, while add_timer_internal() uses the expiration time minus timer_jiffies?
I don't understand what roles jiffies, timer_jiffies and tv1.index play.
Thanks for your patience and your answer.

static inline void run_timer_list(void)
{
    spin_lock_irq(&timerlist_lock);
    while ((long)(jiffies - timer_jiffies) >= 0) {
        struct list_head *head, *curr;
        if (!tv1.index) {
            int n = 1;
            do {
                cascade_timers(tvecs[n]);
            } while (tvecs[n]->index == 1 && ++n < NOOF_TVECS);
        }
repeat:
        head = tv1.vec + tv1.index;
        curr = head->next;
        if (curr != head) {
            struct timer_list *timer;
            void (*fn)(unsigned long);
            unsigned long data;

            timer = list_entry(curr, struct timer_list, list);
            fn = timer->function;
            data = timer->data;

            detach_timer(timer);
            timer->list.next = timer->list.prev = NULL;
            timer_enter(timer);
            spin_unlock_irq(&timerlist_lock);
            fn(data);
            spin_lock_irq(&timerlist_lock);
            timer_exit();
            goto repeat;
        }
        ++timer_jiffies;
        tv1.index = (tv1.index + 1) & TVR_MASK;
    }
    spin_unlock_irq(&timerlist_lock);
}

Meng-Ju

--- End of forwarded message ---
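
The relation between jiffies, timer_jiffies and tv1.index that the forwarded question asks about can be illustrated with a single-level sketch. It is a simplification under stated assumptions, not the kernel's code: all names are hypothetical, there is no locking, and the cascading into the higher vectors is left out. The slot index is simply `wheel_jiffies & WHEEL_MASK`, which is the role tv1.index plays next to timer_jiffies in the listing above.

```c
#include <stddef.h>

#define WHEEL_BITS 8
#define WHEEL_SIZE (1 << WHEEL_BITS)
#define WHEEL_MASK (WHEEL_SIZE - 1)

struct sketch_timer {
    unsigned long expires;            /* absolute tick at which to fire */
    void (*function)(unsigned long);  /* callback */
    unsigned long data;               /* argument passed to the callback */
    struct sketch_timer *next;
};

static struct sketch_timer *wheel[WHEEL_SIZE]; /* one list per slot */
static unsigned long wheel_jiffies;            /* next tick still to be processed */
unsigned long sketch_last_fired;               /* set by the example callback */

/* Example callback: remember which timer fired. */
void sketch_record(unsigned long data)
{
    sketch_last_fired = data;
}

/* Queue a timer; returns -1 if it lies beyond this single-level wheel. */
int sketch_add_timer(struct sketch_timer *t)
{
    unsigned long idx = t->expires - wheel_jiffies;  /* ticks from "now" */
    unsigned long slot;

    if ((long)idx < 0)
        slot = wheel_jiffies & WHEEL_MASK;  /* already due: run on the next pass */
    else if (idx >= WHEEL_SIZE)
        return -1;                          /* a multi-level wheel would cascade */
    else
        slot = t->expires & WHEEL_MASK;     /* slot reached when the tick arrives */

    t->next = wheel[slot];
    wheel[slot] = t;
    return 0;
}

/* Catch up with the clock: process every tick up to and including `now`. */
void sketch_run_timers(unsigned long now)
{
    while ((long)(now - wheel_jiffies) >= 0) {
        unsigned long slot = wheel_jiffies & WHEEL_MASK;
        struct sketch_timer *t;

        while ((t = wheel[slot]) != NULL) {
            wheel[slot] = t->next;   /* detach before calling */
            t->next = NULL;
            t->function(t->data);
        }
        ++wheel_jiffies;             /* slot index advances in lockstep */
    }
}
```

The difference `expires - wheel_jiffies` is only used to decide whether the timer is near enough to be filed directly. The slot itself comes from the low bits of `expires`, so the run loop meets the timer exactly when its tick counter reaches that value.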

Re: [RFC - 0/9] Generic timekeeping subsystem (v. B5)

2005-08-10 Thread Ulrich Windl

On 10 Aug 2005 at 22:32, Lee Revell wrote:

> On Wed, 2005-08-10 at 19:13 -0700, john stultz wrote:
> > All,
> > Here's the next rev in my rework of the current timekeeping subsystem.
> > No major changes, only some cleanups and further splitting the larger
> > patches into smaller ones.
> 
> Last I heard this made gettimeofday() 20% slower on x86.  Is this still
> the case?

If it's only 20% for an increase in resolution of 10%, it's quite good ;-)

Regards,
Ulrich


> 
> Lee
> 



Re: [RFC - 0/9] Generic timekeeping subsystem (v. B5)

2005-08-16 Thread Ulrich Windl

On 16 Aug 2005 at 11:25, Christoph Lameter wrote:

> You mentioned that the NTP code has some issues with time interpolation 
> at the KS. This is due to the NTP layer not being aware of actual time 
> differences between timer interrupts that the interpolator knows about. If 
> the NTP layer would be aware of the actual intervals measured by the 
> timesource (or interpolator) then presumably time could be adjusted in a 
> more accurate way.

Hi,

whatever the implementation is, at some point there must exist an interface to 
get and set "normal time", free of any jumps and jitter. That "frontend time" will 
be used as a base of correction. Basically that means time should be as monotonic 
and jitter-free as possible for any measurement interval you like.

Otherwise when extrapolating the time-error, it (NTP) will try to overcompensate 
(or undercompensate), making the whole thing unstable.

Here's a sample from some ancient NTP distribution (pre-nanosecond), but you'll 
get the idea what to check:

more util/jitter.c
/*
 * This program can be used to calibrate the clock reading jitter of a
 * particular CPU and operating system. It first tickles every element
 * of an array, in order to force pages into memory, then repeatedly calls
 * gettimeofday() and, finally, writes out the time values for later
 * analysis. From this you can determine the jitter and if the clock ever
 * runs backwards.
 */
#include <stdio.h>
#include <sys/time.h>

#define NBUF 20002

void
main()
{
    struct timeval ts, tr;
    struct timezone tzp;
    long temp, j, i, gtod[NBUF];

    gettimeofday(&ts, &tzp);

    /*
     * Force pages into memory
     */
    for (i = 0; i < NBUF; i ++)
        gtod[i] = 0;

    /*
     * Construct gtod array
     */
    for (i = 0; i < NBUF; i ++) {
        gettimeofday(&tr, &tzp);
        gtod[i] = (tr.tv_sec - ts.tv_sec) * 1000000 + tr.tv_usec;
    }

    /*
     * Write out gtod array for later processing with S
     */
    for (i = 0; i < NBUF - 2; i++) {
        /*
        printf("%lu\n", gtod[i]);
        */
        gtod[i] = gtod[i + 1] - gtod[i];
        printf("%lu\n", gtod[i]);
    }

    /*
     * Sort the gtod array and display deciles
     */
    for (i = 0; i < NBUF - 2; i++) {
        for (j = 0; j <= i; j++) {
            if (gtod[j] > gtod[i]) {
                temp = gtod[j];
                gtod[j] = gtod[i];
                gtod[i] = temp;
            }
        }
    }
    fprintf(stderr, "First rank\n");
    for (i = 0; i < 10; i++)
        fprintf(stderr, "%10ld%10ld\n", i, gtod[i]);
    fprintf(stderr, "Last rank\n");
    for (i = NBUF - 12; i < NBUF - 2; i++)
        fprintf(stderr, "%10ld%10ld\n", i, gtod[i]);
}

Regards,
Ulrich


Re: [RFC - 0/9] Generic timekeeping subsystem (v. B5)

2005-08-17 Thread Ulrich Windl

On 16 Aug 2005 at 18:17, john stultz wrote:

[...]
> Maybe to focus this productively, I'll try to step back and outline the
> goals at a high level and you can address those. 
> 
> My Assumptions:
> 1. adjtimex() sets/gets NTP state values

One of the greatest mistakes in the past which still affects us now was the 
decision to piggy-back ntp_adjtime and ntp_gettime on top of adjtime() and thus 
creating adjtimex(). Only to save a system-call number or two. WE REALLY SHOULD 
GET RID OF THAT going back to Linux 0.something.

> 2. Every tick we adjust those state values

... which require it. 

> 3. Every tick we use those values to make a nanosecond adjustment to
> time.

...or even more frequently. In my code I tried to scale the tick interpolation as 
well, thus effectively making adjustments even within timer ticks (so far the 
theory...). I was assuming however that ticks and interpolation clocks are 
derived from one single source and would "float" the same way relative to each 
other.

> 4. Those state values are otherwise unused.

What is "otherwise"? Outside the "NTP clock model", or "between ticks"?

> 
> Goals:
> 1. Isolate NTP code to clean up the tick based timekeeping, reducing the
> spaghetti-like code interactions.

First you need a new clock model that's compatible with NTP. Then you can 
consider how to implement the NTP stuff. So the clock even without NTP has to be 
strictly monotonic for any interval it is read, be it nanoseconds, microseconds, 
milliseconds, seconds, minutes, hours, days, ... The clock delta (=increase of 
time) over time should be as constant as possible (i.e. time shouldn't go up 
like stairs).

> 2. Add interfaces to allow for continuous, rather then tick based,
> adjustments (much how ppc64 does currently, only shareable).

Adjustments to the clock _model_ are asynchronous by definition, while 
adjustments to the clock itself are, well, periodic. Whatever the period.

Maybe this helps and can be agreed on.

Regards,
Ulrich


Re: [RFC - 0/9] Generic timekeeping subsystem (v. B5)

2005-08-23 Thread Ulrich Windl

On 24 Aug 2005 at 1:54, Roman Zippel wrote:

[...]
> error) >> shift". The difference between system time and reference 
> time is really important. gettimeofday() returns the system time, NTP 
> controls the reference time and these two are synchronized regularly.
[...]

Roman,

I'm having a problem with your wording: NTP _does_ control the "system time" 
(system clock), because it's the only clock it can use. The "reference time" is 
usually remote or elsewhere (multiple sources). Local NTP does not control the 
remote reference time(s).

Regards,
Ulrich


Re: [RFC][PATCH] new timeofday core subsystem (v. A3)

2005-03-17 Thread Ulrich Windl

On 15 Mar 2005 at 10:25, john stultz wrote:

> On Mon, 2005-03-14 at 21:37 -0800, Christoph Lameter wrote:
> > Note that similarities exist between the posix clock and the time sources.
> > Will all time sources be exportable as posix clocks?
> 
> At this point I'm not familiar enough with the posix clocks interface to
> say, although its probably outside the scope of the initial timeofday
> rework.

I'd be happy to see the required POSIX clocks at nanosecond resolution for the 
initial version. Add-Ons may follow later.

> 
> Do you have a link that might explain the posix clocks spec and its
> intent?

There's a book named like "POSIX.4: Programming for the real world" by Bill 
Gallmeister (I think).

Regards,
Ulrich


RFD(time): squeezing and stretching the tick

2001-05-10 Thread Ulrich Windl


For i386 with TSC, the kernel calibrates how many CPU cycles will fit 
between two timer interrupts. That value corresponds to 10000 
microseconds. Ideally.

In practice however the timer interrupts do not happen exactly every 
10000 us (for hardware reasons).  When interpolating time between ticks 
that calibration value is used.

When using NTP (or adjusting tick manually) the value added every tick 
may be different from 10000us.

If that value is larger, the time seems to jump ahead at the beginning 
of each tick; if the value is smaller, the time may seem to get stuck, 
get slow, or jump back at the beginning of a new tick.

Therefore I added experimental code to scale the value used for tick 
interpolation according to these corrections. As it seems to me, the 
clock quality improves, and the performance penalty only appears when 
the correction value changes.

I haven't done the non-TSC case or other architectures. For 
microseconds it may seem negligible, but not for nanoseconds.

If anybody has an interesting opinion on this, please Mail.

Regards,
Ulrich
P.S. Not subscribed here.
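
The effect described above can be put into numbers with a small sketch. The names are hypothetical and the code is not taken from the experimental patch; it only shows what happens at a tick boundary when the interpolation assumes one tick length while a different amount is added per tick.

```c
#include <stdint.h>

/* Time since the last tick, interpolated from the cycle count. */
uint32_t interpolate_ns(uint32_t cycles, uint32_t cycles_per_tick,
                        uint32_t tick_ns)
{
    return (uint32_t)((uint64_t)cycles * tick_ns / cycles_per_tick);
}

/*
 * Step seen by a reader at the tick boundary: the amount really added
 * per tick minus what the interpolation had reached just before it.
 * Positive: time jumps ahead. Negative: time jumps back (or appears
 * stuck if it is clamped). Zero: interpolation is scaled to the tick.
 */
int32_t boundary_step_ns(uint32_t cycles_per_tick, uint32_t interp_tick_ns,
                         uint32_t added_tick_ns)
{
    uint32_t just_before = interpolate_ns(cycles_per_tick, cycles_per_tick,
                                          interp_tick_ns);
    return (int32_t)((int64_t)added_tick_ns - (int64_t)just_before);
}
```

With 5000000 cycles per tick, interpolating against a nominal 10000000 ns while 10000500 ns are added per tick gives a jump of +500 ns at every tick; using 10000500 ns for the interpolation as well makes the step vanish.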


2.2.18: severe performance problem (high load, low mem, idle CPU)

2001-05-14 Thread Ulrich Windl


Hello,

we experienced a severe performance problem on a PentiumPro 200 MHz, 
64MB RAM, 128MB swap:

Due to many processes being started in a short time, the system load 
went up to 53, and the 9GB SCSI disk was working heavily. At that time 
I suspected no severe problem, and I was busy doing something else. 
However after almost three hours the system load was still at about 40 
with the old processes not yet finished. (The processes typically take 
2 to 5 seconds to finish, and need about 4MB memory each).

At that point I became active.

In top I was surprised that the CPU claimed to be more than 90% idle, 
while the swap space was exceeded. But the memory wasn't really tight; 
cached and buffers were about 12MB together. So basically the situation 
should have gone away. Should, but didn't.

The kernel running was that from SuSE Linux 7.1 (Linux version 2.2.18 
([EMAIL PROTECTED]) (gcc version 2.95.2 19991024 (release)) #1 Fri 
Jan 19 22:10:35 GMT 2001). So maybe the defect is an "enhancement" done 
by SuSE. Anyway:

In top I noticed that the processes to finish were all mostly swapped 
out, and they showed a zero in the "PRI" column. Usually runnable 
processes have more "fuel" there. It seems to me swapped out processes 
did not get their fuel reloaded. The processes all had a "D" status 
(blocked on I/O). Also it seemed that processes that share a lot of 
data are not favoured enough when paging in. If a page is shared 10 
times, paging that one in would help 10 processes. Instead the kernel 
seemed to swap in and out a few kB without getting any process done.

I decided to kill a few non-essential processes to improve the 
situation. No help. I added an extra 32MB swapfile, so the buffers and 
shared went up to over 30MB, but still no process finished. The CPU 
still was quite "idle".

So I decided to kill the processes in question. After several seconds, 
no process had terminated however. (Maybe due to the code to handle the 
signal being paged out). Then I did a kill -9 to the processes which 
finally helped.

So to summarize:
1) paged out processes seem not to get enough CPU
2) paged out shared pages seem not to get enough priority to be swapped 
in
3) In low memory situations the scheduling algorithm seems to perform 
poorly

For 3) I could imagine doing round-robin scheduling with an extended 
time-slice (while still being fair, i.e. run them rarely but longer) 
for massively swapped out processes, hoping that one of them will 
finish soon. That way maybe more of the working set will be paged in, 
enabling some progress.

I don't have the top screen saved, but I have a ps -aux. The 40 
processes being paged out were all displayed with a %CPU of "0.0".
The ps command with 7.4% CPU was the highest value. The kernel pager 
also seemed to be non-busy:

USER       PID %CPU %MEM   VSZ  RSS TTY STAT START   TIME COMMAND
root         1  0.0  0.0   400   52 ?   S    Mar22   0:22 init [3]
root         2  0.0  0.0     0    0 ?   SW   Mar22   0:03 [kflushd]
root         3  0.0  0.0     0    0 ?   SW   Mar22   0:01 [kupdate]
root         4  0.0  0.0     0    0 ?   SW   Mar22   6:58 [kswapd]
root         5  0.0  0.0     0    0 ?   SW<  Mar22   0:00 [mdrecoveryd]
...
daemon   32528  0.0  2.0  4984 1352 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32529  0.0  2.0  4984 1352 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32531  0.0  2.7  5008 1760 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32533  0.0  2.5  4984 1640 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32539  0.0  3.1  5008 2044 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32540  0.0  1.9  4984 1276 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32542  0.0  1.4  4984  948 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32547  0.0  2.1  4984 1404 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32548  0.0  2.1  4984 1380 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32549  0.0  1.9  4984 1284 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32550  0.0  1.1  4984  768 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32555  0.0  2.3  4984 1504 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32556  0.0  1.8  4984 1224 ?   D    14:42   0:04 /etc/mail/dirty-h
daemon   32557  0.0  1.9  4984 1244 ?   D    14:42   0:04 /etc/mail/dirty-h
...

These were some of the processes that should have finished.

Regards,
Ulrich


Re: [RFC][PATCH] new timeofday core subsystem (v. A2)

2005-01-24 Thread Ulrich Windl

On 24 Jan 2005 at 15:24, Christoph Lameter wrote:

> On Mon, 24 Jan 2005, john stultz wrote:
> 
> > +/* __monotonic_clock():
> > + * private function, must hold system_time_lock lock when being
> > + * called. Returns the monotonically increasing number of
> > + * nanoseconds since the system booted (adjusted by NTP 
> > scaling)
> > + */
> > +static nsec_t __monotonic_clock(void)
> > +{
> > +   nsec_t ret, ns_offset;
> > +   cycle_t now, delta;
> > +
> > +   /* read timesource */
> > +   now = read_timesource(timesource);
> > +
> > +   /* calculate the delta since the last clock_interrupt */
> > +   delta = (now - offset_base) & timesource->mask;
> > +
> > +   /* convert to nanoseconds */
> > +   ns_offset = cyc2ns(timesource, delta, NULL);
> > +
> > +   /* apply the NTP scaling */
> > +   ns_offset = ntp_scale(ns_offset);
> 
> The monotonic clock is the time base for the gettime and gettimeofday
> functions. This means ntp_scale() is called every time that the kernel or
> an application access time.

It depends on what you want: There is little sense in implementing a nanosecond 
clock model with NTP when the result is wrong by several microseconds IMHO. You 
don't know what the time is used for, so just get the best you can. Those only 
wanting some crude time could be happy with the jiffies (or their equivalent), 
right?

Regards,
Ulrich

> 
> As pointed out before this will dramatically worsen the performance
> compared to the current code base.
> 
> ntp_scale() also will make it difficult to implement optimized arch
> specific version of function for timer access.
> 
> The fastcalls would have to be disabled on ia64 to make this work. Its
> likely extremely difficult to implement a fastcall if it involves
> ntp_scale().
> 



Re: [RFC][PATCH] new timeofday core subsystem (v. A2)

2005-01-25 Thread Ulrich Windl

On 24 Jan 2005 at 17:54, Christoph Lameter wrote:

> On Mon, 24 Jan 2005, john stultz wrote:
> 
> > We talked about this last time. I do intend to re-work ntp_scale() so
> > its not a function call, much as you describe above.
> >
> > hopelessly endeavoring,
> 
> hehe But seriously: The easiest approach may be to modify the time
> sources to allow a fine tuning of the scaling factor. That way ntp_scale
> may be moved into tick processing where it would adjust the scaling of the
> time sources up or downward. Thus no ntp_scale in the monotonic clock
> processing anymore.

It depends on what you want to have between ticks: If your ticks are too wide, the 
clock will do a little jump forward at the start of a new tick; if they are too 
narrow, the clock will jump back a bit at the start of a new tick (assuming tick 
interpolation and tick generation are correlated). (The old kernel code uses a 
constant to scale the timer's register to a tick. However if the time is too fast 
or slow, the interpolation will be as well.) Those being blessed with a GPS or 
better clock will be able to demonstrate the quality of the code as well as the 
tuning possibilities against frequency errors.

> 
> Monotonic clocks could be calculated
> 
> monotime = ns_at_last_tick + (time_source_cycles_since_tick *
> current_scaling_factor) >> shift_factor.
> 
> This would also be easy to implement in asm if necessary.
> 
> tick processing could then increment or decrement the current scaling
> factor to minimize the error between ticks. It could also add
> nanoseconds to ns_at_last_tick to correct the clock forward.

Is that what corresponds to "adjust_nanoscale()" in my PPSkit?

> 
> With the appropiate shift_factor one should be able to fine tune time much
> more accurately than ntp_scale would do. Over time the necessary
> corrections could be minimized to just adding a few ns once in a while.
> 

Regards,
Ulrich
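
The formula quoted above can be written out as a short sketch. The function names and the way the scaling factor is derived from a counter frequency are assumptions for illustration only. Note that the shift is parenthesized explicitly: in C, `+` binds more tightly than `>>`, so the expression as quoted would shift the whole sum.

```c
#include <stdint.h>

/* Scaling factor for a counter at counter_hz: nanoseconds per cycle << shift. */
uint64_t scaling_factor(uint64_t counter_hz, unsigned shift)
{
    return (1000000000ULL << shift) / counter_hz;
}

/* Monotonic time: base taken at the last tick plus the scaled cycle count. */
uint64_t monotime_ns(uint64_t ns_at_last_tick, uint64_t cycles_since_tick,
                     uint64_t factor, unsigned shift)
{
    /* Parenthesized on purpose: '+' has higher precedence than '>>'. */
    return ns_at_last_tick + ((cycles_since_tick * factor) >> shift);
}
```

For a hypothetical 500 MHz counter and a shift of 22, the factor is 2 << 22, and 1000 cycles after a tick at 1000 ns the function returns 3000 ns. Tick processing would then nudge `factor` up or down to follow the frequency correction.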


Q: PCI-X @ 266MHz on HP rx6600 (Qlogic 4Gb FC HBA)

2007-07-26 Thread Ulrich Windl

Hi,

I have a question: The Qlogic ISP2422 chip is said to handle PCI-X 266MHz. So 
does the HP Itanium2 server rx6600. Basically that was the reason to select that 
server. The FC-HBA is in a 266 MHz capable slot. However when booting SLES10 SP1 
for IA64, the logs say:

<6>QLogic Fibre Channel HBA Driver
<6>GSI 49 (level, low) -> CPU 3 (0x0300) vector 51
<6>ACPI: PCI Interrupt :0f:01.0[A] -> GSI 49 (level, low) -> IRQ 51
<6>qla2xxx :0f:01.0: Found an ISP2422, irq 51, iobase 0xc000b004
[...]
<6>qla2xxx :0f:01.0: LOOP UP detected (4 Gbps).
<6>qla2xxx :0f:01.0: Topology - (F_Port), Host Loop address 0x0
<6>scsi0 : qla2xxx
<6>qla2xxx :0f:01.0:
<4> QLogic Fibre Channel HBA Driver: 8.01.07-k3
<4>  QLogic HP AB378-60001 -
<4>  ISP2422: PCI-X Mode 2 (133 MH4.00.26 [IP]  @ :0f:01.0 hdma+, host#=0, 
fw=4.00.26 [IP]
<5>  Vendor: HPModel: HSV200Rev: 6100
<5>  Type:   RAID   ANSI SCSI revision: 02
<5> 0:0:0:0: Attached scsi generic sg0 type 12

Now does Linux support the speed of 266 MHz, and is it just displayed 
incorrectly, or doesn't Linux support the speed of 266MHz yet?

"lspci -v" says:
0f:01.0 Fibre Channel: QLogic Corp. ISP2422-based 4Gb Fibre Channel to PCI-X HBA (rev 02)
        Subsystem: Hewlett-Packard Company Unknown device 12d6
        Flags: bus master, 66MHz, medium devsel, latency 128, IRQ 51
        I/O ports at 6000 [size=256]
        Memory at b004 (64-bit, non-prefetchable) [size=4K]
        Expansion ROM at b000 [disabled] [size=256K]
        Capabilities: [44] Power Management version 2
        Capabilities: [4c] PCI-X non-bridge device
        Capabilities: [64] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
        Capabilities: [74] Vital Product Data

Please CC: any replies to my address as I'm not subscribed to the kernel list.

Regards,
Ulrich Windl


Re: Q: PCI-X @ 266MHz on HP rx6600 (Qlogic 4Gb FC HBA)

2007-07-30 Thread Ulrich Windl

On 27 Jul 2007 at 9:46, Andrew Patterson wrote:

> On Thu, 2007-07-26 at 23:23 -0700, Andrew Vasquez wrote:
> > On Thu, 26 Jul 2007, Andrew Patterson wrote:
> > 
> > > On Thu, 2007-07-26 at 15:36 +0200, Ulrich Windl wrote:
> > > > Hi,
> > > > 
> > > > I have a question: The Qlogic ISP2422 chip is said to handle PCI-X 
> > > > 266MHz. So does 
> > > > the HP Itanium2 server rx6600. Basically that was the reason to select 
> > > > that 
> > > > server. The FC-HBA is in a 266 MHz capable slot. However when booting 
> > > > SLES10 SP1 
> > > > for IA64, the logs say:
> > 
> > There's a mixup here in terminology...  The QLA2460 card which you
> > have does in fact support 'PCI-X 266'...
> > 
> > > > <6>QLogic Fibre Channel HBA Driver
> > > > <6>GSI 49 (level, low) -> CPU 3 (0x0300) vector 51
> > > > <6>ACPI: PCI Interrupt :0f:01.0[A] -> GSI 49 (level, low) -> IRQ 51
> > > > <6>qla2xxx :0f:01.0: Found an ISP2422, irq 51, iobase 
> > > > 0xc000b004
> > > > [...]
> > > > <6>qla2xxx :0f:01.0: LOOP UP detected (4 Gbps).
> > > > <6>qla2xxx :0f:01.0: Topology - (F_Port), Host Loop address 0x0
> > > > <6>scsi0 : qla2xxx
> > > > <6>qla2xxx :0f:01.0:
> > > > <4> QLogic Fibre Channel HBA Driver: 8.01.07-k3
> > > > <4>  QLogic HP AB378-60001 -
> > > > <4>  ISP2422: PCI-X Mode 2 (133 MH4.00.26 [IP]  @ :0f:01.0 hdma+, 
> > > > host#=0, 
> > > > fw=4.00.26 [IP]
> > 
> > The 33/66/100/133 values refer to the bus-clock speed at which the
> > card is operating.  As is seen here (although a bit truncated --
> > separate issue, I'll try to see if I can reproduce this on one of my
> > HPQ rigs), the card is inserted into a PCI-X Mode-2 capable 133MHz
> > (bus clock) slot.  When operating under this mode, each data-phase
> > between two devices is divided into 2 sub-phases, effectively doubling
> > the transfer-data-rate to 266Mhz.
> 
> I guess the proper terminology would be 266 MT/s (Mega
> Transfers/second). Looking through the PSI_SIG PCI-X 2.0 marketing
> blurbs, they use MHz a lot when referring to MT/S. So I would still
> consider this to be a minor bug.  The user wants to know the transfer
> rate, not the actual frequency of the bus.  Maybe just print out the
> mode used instead, e.g., "PCI-X 266"?
[...]

To be concrete: In the Installation Manual for the rx6600 (P/N AB464-9001A), 
figure 1-1 ("I/O Subsystem Block Diagram") the names being used are "PCIx-66", 
"PCIx-133", and "PCIx267". However, the text that follows refers to "66 
MHz PCI/PCI-X slots" ("PCI/PCI-X 133 MHz", "PCI/PCI-X 266 MHz").

The Qlogic data sheet for the "ISP2422" also calls the bus interface "64-bit, 
PCI-X 2.0 266-MHz DDR", and the document's subtitle is "Dual Port 4-Gbps Fibre 
Channel (FC) to PCI-X 2.0 266-MHz Controller".

Confusing?

Regards,
Ulrich
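
The distinction between bus clock and transfer rate discussed in this thread comes down to simple arithmetic, sketched below with hypothetical function names. The 64-bit bus width is the one named in the Qlogic data sheet quoted above; the figures are nominal peak values.

```c
/* Transfer rate in MT/s for a bus clock in MHz and transfers per clock. */
unsigned pcix_mega_transfers(unsigned bus_clock_mhz, unsigned transfers_per_clock)
{
    return bus_clock_mhz * transfers_per_clock;
}

/* Peak bandwidth in MB/s for a transfer rate in MT/s and a bus width in bits. */
unsigned pcix_peak_mbytes(unsigned mega_transfers, unsigned bus_width_bits)
{
    return mega_transfers * (bus_width_bits / 8);
}
```

A 133 MHz bus clock with two transfers per clock gives 266 MT/s, which on a 64-bit bus is a nominal peak of 2128 MB/s. The "133" in the driver message and the "266" in the data sheets therefore describe the same slot.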


Re: Q: PCI-X @ 266MHz on HP rx6600 (Qlogic 4Gb FC HBA)

2007-07-31 Thread Ulrich Windl

On 31 Jul 2007 at 9:50, Andrew Vasquez wrote:

> > On Fri, 27 Jul 2007, Andrew Patterson wrote:
> > 
> > > On Thu, 2007-07-26 at 23:23 -0700, Andrew Vasquez wrote:
> > >
> > > > The 33/66/100/133 values refer to the bus-clock speed at which the
> > > > card is operating.  As is seen here (although a bit truncated --
> > > > separate issue, I'll try to see if I can reproduce this on one of my
> > > > HPQ rigs), the card is inserted into a PCI-X Mode-2 capable 133MHz
> > > > (bus clock) slot.  When operating under this mode, each data-phase
> > > > between two devices is divided into 2 sub-phases, effectively doubling
> > > > the transfer-data-rate to 266Mhz.
> > > 
> > > I guess the proper terminology would be 266 MT/s (Mega
> > > Transfers/second). Looking through the PSI_SIG PCI-X 2.0 marketing
> > > blurbs, they use MHz a lot when referring to MT/S. So I would still
> > > consider this to be a minor bug.  The user wants to know the transfer
> > > rate, not the actual frequency of the bus.  Maybe just print out the
> > > mode used instead, e.g., "PCI-X 266"?
> 
> Given PCI-X Mode-2 can run at different bus-clock speeds, how about
> this as an alternative?
> 
>   PCI-X 266 (133Mhz)

To pick up the idea, why not "133" and "133x2" (DDR, Double Data Rate)?

> 
> it's a bit more descriptive than
> 
>   PCI-X Mode 2 (133Mhz)
> 
> then again, I don't want to beat this thing to death...
> 
> ---
> 
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index c488996..26f7e54 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -283,9 +283,9 @@ qla24xx_pci_info_str(struct scsi_qla_host *ha, char *str)
>   } else {
>   strcat(str, "-X ");
>   if (pci_bus & BIT_2)
> - strcat(str, "Mode 2");
> + strcat(str, "266");
>   else
> - strcat(str, "Mode 1");
> + strcat(str, "133");
>   strcat(str, " (");
>   strcat(str, pci_bus_modes[pci_bus & ~BIT_2]);
>   }
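
To make the "133" / "133x2" naming idea concrete, here is a minimal user-space sketch. It is not the qla2xxx code; the helper name pcix_label() and the exact label layout are assumptions for illustration only. It prints the bus clock and, for Mode 2, the doubled transfer rate in MT/s.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of qla2xxx): builds a label from the bus
 * clock in MHz and a flag telling whether PCI-X Mode 2 is active, i.e.
 * whether each clock carries two data sub-phases.
 * Returns a pointer to a static buffer, so it is not thread-safe.
 */
const char *pcix_label(unsigned bus_mhz, int mode2)
{
	static char buf[64];

	if (mode2)
		snprintf(buf, sizeof(buf), "PCI-X %ux2 (%u MT/s)",
			 bus_mhz, bus_mhz * 2);
	else
		snprintf(buf, sizeof(buf), "PCI-X %u (%u MT/s)",
			 bus_mhz, bus_mhz);
	return buf;
}
```

With this layout a Mode 2 card on a 133 MHz bus would be labelled "PCI-X 133x2 (266 MT/s)", which keeps the clock and the transfer rate apart.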



2.6.19: slight performance optimization for lib/string.c's strstrip()

2006-12-08 Thread Ulrich Windl

Hi,

my apologies for disobeying all the rules for submitting patches, but I'll 
suggest a performance optimization for strstrip() in lib/string.c:

Original routine:
char *strstrip(char *s)
{
	size_t size;
	char *end;

	size = strlen(s);

	if (!size)
		return s;

	end = s + size - 1;
	while (end >= s && isspace(*end))
		end--;
	*(end + 1) = '\0';

	while (*s && isspace(*s))
		s++;

	return s;
}
EXPORT_SYMBOL(strstrip);


Suggested replacement:

char *strstrip(char *s)
{
	size_t size;
	char *end;

	while (*s && isspace(*s))
		s++;
	if (!*s)
		return s;
	size = strlen(s);

	end = s + size - 1;
	while (end > s && isspace(*end))
		end--;
	*(end + 1) = '\0';

	return s;
}
EXPORT_SYMBOL(strstrip);

Comments: There's no need to scan the initial blanks at the start of the string 
twice (using strlen(), and then looking for initial blanks), and we know that 
the first character of the string cannot be a blank when we are removing 
trailing blanks after having removed leading blanks. Also we do not need to 
call strlen() to detect an empty string.
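
To check that both variants return the same result, a small user-space comparison could look like the sketch below. The names strstrip_old(), strstrip_new() and both_yield() are mine, not kernel symbols, and the guard byte in front of each buffer is only there so that the "end >= s" loop of the old variant never steps before the array.

```c
#include <ctype.h>
#include <string.h>

/* Logic of the original routine: trim the tail first, then skip leading blanks. */
static char *strstrip_old(char *s)
{
	size_t size = strlen(s);
	char *end;

	if (!size)
		return s;
	end = s + size - 1;
	while (end >= s && isspace((unsigned char)*end))
		end--;
	*(end + 1) = '\0';
	while (*s && isspace((unsigned char)*s))
		s++;
	return s;
}

/* Logic of the suggested routine: skip leading blanks first, then trim the tail. */
static char *strstrip_new(char *s)
{
	char *end;

	while (*s && isspace((unsigned char)*s))
		s++;
	if (!*s)
		return s;
	end = s + strlen(s) - 1;
	while (end > s && isspace((unsigned char)*end))
		end--;
	*(end + 1) = '\0';
	return s;
}

/*
 * Copies input (at most 62 characters) behind one non-blank guard byte and
 * returns 1 if both variants produce the expected string, 0 otherwise.
 */
int both_yield(const char *input, const char *expected)
{
	char a[64] = "#", b[64] = "#";

	strncpy(a + 1, input, sizeof(a) - 2);
	strncpy(b + 1, input, sizeof(b) - 2);
	return strcmp(strstrip_old(a + 1), expected) == 0 &&
	       strcmp(strstrip_new(b + 1), expected) == 0;
}
```

The interesting inputs are the empty string and a string of blanks only, because these are the cases where the two routines take different paths.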

Regards,
Ulrich


TCP checksum change in RPC replies within XEN, NFS lockup (SLES10)

2006-11-29 Thread Ulrich Windl

Hello,

my apologies for not being sure whom to tell about this problem, but it is very 
strange. Let me tell the story:

I'm using XEN (3.0.2) with SLES10 (x86_64, SunFire X4100). On one machine I 
have three virtual machines ("DomU") that are configured nearly identically 
(SLES10 x86_64 also). There is also a SLES9 (i386) acting as a multi-homed NFS 
server.

I can mount and access a read-only NFS filesystem on the server from Dom0 
(hypervisor), and from two of the three DomUs without any problem, but from the 
third DomU mount hangs and is unkillable (except kill -9). This is how I 
started to debug the problem.

To make things short: I haven't found the solution, but I have found some 
problems: Running tcpdump/ethereal on the client (inside DomU), and on the NFS 
server, I found out that a significant number (almost all) of RPC reply packets 
have an invalid TCP checksum on the NFS server, but not on the NFS client. When 
actually comparing the packets, I found that only the checksum is different. 
Example:

--- nfs-client9.txt 2006-11-29 12:56:59.176133729 +0100
+++ nfs-server8.txt 2006-11-29 12:56:25.812623453 +0100
@@ -1,10 +1,10 @@
 No. Time            Source            Destination       Protocol Info
-  9 15:10:15.488963 132.199.176.153   132.199.177.13    Portmap  V2 DUMP Reply (Call In 7)[Unreassembled Packet]
+  8 15:10:15.497059 132.199.176.153   132.199.177.13    Portmap  V2 DUMP Reply (Call In 6)[Unreassembled Packet [incorrect TCP checksum]]
 
 0000  00 16 3e f3 45 0d 00 c0 9f 27 44 a6 08 00 45 00   ..>.E'D...E.
 0010  01 c4 d0 3f 40 00 40 06 fd be 84 c7 b0 99 84 c7   [EMAIL PROTECTED]@.
 0020  b1 0d 00 6f 94 48 89 33 9b af 3f e4 5a 65 80 18   ...o.H.3..?.Ze..
-0030  16 a0 27 8a 00 00 01 01 08 0a 5a a1 4b 92 01 2f   ..'...Z.K../
+0030  16 a0 6c ec 00 00 01 01 08 0a 5a a1 4b 92 01 2f   ..l...Z.K../
 0040  a9 da 00 00 01 8c 65 e9 c4 df 00 00 00 01 00 00   ..e.
 0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   
 0060  00 01 00 01 86 a0 00 00 00 02 00 00 00 06 00 00   


I'm NOT saying that _all_ TCP checksums are bad, but notably those RPC reply 
packets seem to be affected. OK so far.
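
For anyone who wants to recheck what the sniffer claims, the Internet checksum (RFC 1071) is easy to recompute by hand. The dump above is truncated, so the TCP checksum itself cannot be recomputed from it, but the 20-byte IP header (frame offsets 0x0e to 0x21) is complete and serves as a worked example of the same algorithm. The function names below are my own. As a side note, and only as an assumption that the trace does not prove: a bad TCP checksum that is seen only on the sending host is the typical picture when the checksum is filled in after the capture point, for example by checksum offloading.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* RFC 1071 checksum over len bytes, read as big-endian 16-bit words. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;
	/* fold the carries back into the low 16 bits */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* IP header copied from the hex dump above (0x0e .. 0x21). */
const uint8_t ip_header[20] = {
	0x45, 0x00, 0x01, 0xc4, 0xd0, 0x3f, 0x40, 0x00, 0x40, 0x06,
	0xfd, 0xbe, 0x84, 0xc7, 0xb0, 0x99, 0x84, 0xc7, 0xb1, 0x0d
};

/* Zeroes the checksum field (bytes 10 and 11) and recomputes it. */
uint16_t recompute_ip_checksum(const uint8_t hdr[20])
{
	uint8_t copy[20];

	memcpy(copy, hdr, sizeof(copy));
	copy[10] = 0;
	copy[11] = 0;
	return inet_checksum(copy, sizeof(copy));
}
```

A header with a correct checksum sums to zero, and recomputing the field for the header above gives fd be again, so at least the IP header of that frame is intact. The TCP checksum is computed the same way over a pseudo-header, the TCP header and the payload.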

I don't know why the NFS mount is actually hanging, but the last packet 
exchange seems to be:

Server sends ACK to RPC reply with bad checksum:
Transmission Control Protocol, Src Port: nfs (2049), Dst Port: 1023 (1023), 
Seq: 2306928188, Ack: 1069064470, Len: 0

Client receives:

Transmission Control Protocol, Src Port: nfs (2049), Dst Port: 1023 (1023), 
Seq: 2306928188, Ack: 1069064470, Len: 0

Some time later I see the client sending an NFS GETATTR packet, but that's 
probably when the kill came in (31 seconds later).

The other odd thing is that the multi-homed NFS server has a route to 
132.199.0.0 for both ethernet interfaces, but only one of those, eth0, has IP 
132.199.176.153.
However when an ARP request is sent for 132.199.176.153, there are two answers:

ARP 132.199.176.153 is at 00:c0:9f:27:44:a6
ARP 132.199.176.153 is at 00:02:b3:d9:91:a7

Only the first one is correct. However that problem may be unrelated.

Back to the issue, I doubt that XEN will just overwrite the TCP checksum of 
some specific RPC packets. Personally I could much more easily imagine that 
there is some problem in the RPC processing that might cause this. Sorry for 
the poor problem description.

Just in case, the Novell/SUSE kernel versions are:
Client: 2.6.16.21-0.25-xen
Server: 2.6.5-7.282-default

Upon request I could provide the packet files as well as two PDFs that show the 
packet flow.

Regards,
Ulrich
P.S: I'm subscribed to xen-users, but not to the kernel list, so maybe CC: me 
for kernel-list replies. Thanks!


Antw: Re: MBR partitions slow?

2016-08-31 Thread Ulrich Windl

>>> Mark D Rustad wrote on 31.08.2016 at 17:32 in message:
> Ulrich Windl  wrote:
> 
>> So without partition the throughput is about twice as high! Why?
> 
> My first thought is that by starting at block 0 the accesses were aligned  
> with the flash block size of the device. By starting at a partition, the  
> accesses probably were not so aligned.

Hi!

Thanks for answering. Yes, you are right: Usually I use fdisk to create 
partitions, and the tool does proper aligning for the partitions. In my case 
YaST insisted on having a partition before creating a filesystem, so I created 
one within YaST, and that partition turned out to be badly aligned (I think YaST 
uses cfdisk internally). I'm sorry that I didn't think about that earlier!

Stracing fdisk, I also learned about ioctl(BLKIOOPT) and related...
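
For the archives, here is a rough sketch of how those ioctls can be queried and how a partition start can be checked against the reported values. The struct and function names are mine, the example start sectors (63 and 2048) are common textbook values rather than the ones from my disk, and I only checked the ioctl names against <linux/fs.h>, so treat it as a starting point.

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* I/O topology of a block device as reported by the kernel (all in bytes). */
struct blk_topology {
	int logical;            /* BLKSSZGET: logical sector size */
	unsigned int physical;  /* BLKPBSZGET: physical block size */
	unsigned int io_min;    /* BLKIOMIN: minimum I/O size */
	unsigned int io_opt;    /* BLKIOOPT: optimal I/O size, 0 if unknown */
	int align_off;          /* BLKALIGNOFF: alignment offset */
};

/* Fills *t for the device node dev; returns 0 on success, -1 on error. */
int read_topology(const char *dev, struct blk_topology *t)
{
	int fd = open(dev, O_RDONLY);
	int ok;

	if (fd < 0)
		return -1;
	ok = ioctl(fd, BLKSSZGET, &t->logical) == 0 &&
	     ioctl(fd, BLKPBSZGET, &t->physical) == 0 &&
	     ioctl(fd, BLKIOMIN, &t->io_min) == 0 &&
	     ioctl(fd, BLKIOOPT, &t->io_opt) == 0 &&
	     ioctl(fd, BLKALIGNOFF, &t->align_off) == 0;
	close(fd);
	return ok ? 0 : -1;
}

/*
 * Returns 1 if a partition starting at start_sector (in units of
 * sector_size bytes) begins on a multiple of granularity bytes.
 * A granularity of 0 means "unknown" and is treated as aligned.
 */
int start_is_aligned(unsigned long long start_sector,
		     unsigned int sector_size, unsigned int granularity)
{
	if (granularity == 0)
		return 1;
	return (start_sector * sector_size) % granularity == 0;
}
```

A start at sector 2048 with 512-byte sectors is a multiple of 1 MiB and therefore of any smaller power-of-two block size, while the old-style start at sector 63 is not even 4 KiB aligned.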

Regards,
Ulrich

> 
> --
> Mark Rustad, mrus...@gmail.com

Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

2007-09-11 Thread Ulrich Windl

Hi,

since upgrading from SLES9 SP3 to SLES10 SP1 I see kernel segfaults which seem 
network-related: Most notably slapd does not run any more, and my 
sendmail-milter based virus scanner terminates now and then with kernel 
segfault.

Current kernel from SLES10 SP1 is: 

# cat /proc/version
Linux version 2.6.16.53-0.8-smp ([EMAIL PROTECTED]) (gcc version 4.1.2 20070115 
(prerelease) (SUSE Linux)) #1 SMP Fri Aug 31 13:07:27 UTC 2007

The effects in syslog are:
Aug 31 15:04:40 kgate1 kernel: powersaved[10102]: segfault at 0008 rip 0042c17a rsp 7fffea55de00 error 4
Aug 31 15:14:57 kgate1 kernel: slapd[5296]: segfault at 55981000 rip 2ad35ffee46b rsp 7fff4bf58c28 error 4
Aug 31 15:17:13 kgate1 kernel: powersaved[4747]: segfault at 0008 rip 0042c17a rsp 7434e260 error 4
Aug 31 17:50:48 kgate1 kernel: slapd[5561]: segfault at 55986000 rip 2b8fa3cf3483 rsp 7fff08252808 error 4
Sep  3 09:02:04 kgate1 kernel: slapd[22654]: segfault at 55992000 rip 2afd6f7b4483 rsp 7fff3c790458 error 4
Sep  3 13:14:45 kgate1 kernel: slapd[28324]: segfault at 55962000 rip 2b5c0ae00483 rsp 7fffa1144e58 error 4
Sep  7 07:48:26 kgate1 kernel: hscan[1142]: segfault at 0003 rip 2afac0581650 rsp 41000928 error 4
Sep  7 09:12:24 kgate1 kernel: slapd[6022]: segfault at 559b3000 rip 2b1c15539483 rsp 7fff96a0c978 error 4
Sep 10 17:02:35 kgate1 kernel: hscan[6795]: segfault at 0004 rip 2b59c0300650 rsp 42002928 error 4
Sep 11 08:43:43 kgate1 kernel: hscan[3456]: segfault at 0004 rip 2adcd625d650 rsp 43004928 error 4
Sep 11 10:45:38 kgate1 kernel: hscan[28343]: segfault at 0003 rip 2b17020de650 rsp 42803928 error 4

I know that this kind of report is not very helpful to you guys, but Novell 
does not allow me to report a kernel bug directly. (I've told the person who 
may do so, but I'm unsure whether something is in progress already).

Also note that the i586 (32-bit, non-SMP) kernel does not have that problem.
Linux version 2.6.16.53-0.8-default ([EMAIL PROTECTED]) (gcc version 4.1.2 
20070115 (prerelease) (SUSE Linux)) #1 Fri Aug 31 13:07:27 UTC 2007

Regards,
Ulrich


Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

2007-09-11 Thread Ulrich Windl

On 11 Sep 2007 at 15:01, Eric Dumazet wrote:

[...]
> > Also note that the i586 (32-bit, non-SMP) kernel does not have that problem.
> > Linux version 2.6.16.53-0.8-default ([EMAIL PROTECTED]) (gcc version 4.1.2 
> > 20070115 
> > (prerelease) (SUSE Linux)) #1 Fri Aug 31 13:07:27 UTC 2007
> 
> Are you sure ?

Not any more ;-)

> 
> segfaulting are sysloged only on 64bits kernel.
> 
> Maybe your slapd/hscan processes are doing bad things, that make them 
> core dump without notice on a 32bits kernel.

I'm using the sendmail milter library that does the socket communication. So 
any bad things should be searched for there.

I tend to think that the same program, when compiled as a 32-bit executable, 
does not cause these segfaults on a 64-bit kernel.

I also tried to use ksymoops to get a disassembly of the corresponding kernel 
code, but the result did not look good to me.

Is there a deeper reason why the kernel does not provide more info (like a call 
trace) on segfaults?

Will an strace of the program (multi-threaded, unfortunately, just as slapd, 
most likely) be helpful?

When I tried it for slapd, the (rest of the) strace was:
9931  socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
9931  connect(3, {sa_family=AF_INET, sin_port=htons(427), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
9931  setsockopt(3, SOL_SOCKET, SO_RCVLOWAT, [18], 4) = 0
9931  setsockopt(3, SOL_SOCKET, SO_SNDLOWAT, [18], 4) = -1 ENOPROTOOPT (Protocol not available)
9931  mmap(NULL, 1434435584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ae32000
9931  --- SIGSEGV (Segmentation fault) @ 0 (0) ---

Regards,
Ulrich


Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

2007-09-11 Thread Ulrich Windl

On 11 Sep 2007 at 15:01, Eric Dumazet wrote:

> On Tue, 11 Sep 2007 11:30:38 +0200
> "Ulrich Windl" <[EMAIL PROTECTED]> wrote:
> 
> > Hi,
> > 
> > since upgrading from SLES9 SP3 to SLES10 SP1 I see kernel segfaults which 
> > seem 
> > network-related: Most notably slapd does not run any more, and my 
> > sendmail-milter 
> > based virus scanner terminates now and then with kernel segfault.
> > 
> > Current kernel form SLES10 SP1 is: 
> > 
> > # cat /proc/version
> > Linux version 2.6.16.53-0.8-smp ([EMAIL PROTECTED]) (gcc version 4.1.2 
> > 20070115 
> > (prerelease) (SUSE Linux)) #1 SMP Fri Aug 31 13:07:27 UTC 2007
> > 
> > The effects in syslog are:
> > Aug 31 15:04:40 kgate1 kernel: powersaved[10102]: segfault at 
> > 0008 rip 
> > 0042c17a rsp 7fffea55de00 error 4
[...]
> segfaulting are sysloged only on 64bits kernel.
> 
> Maybe your slapd/hscan processes are doing bad things, that make them 
> core dump without notice on a 32bits kernel.

A very wild guess: AFAIK SUSE distributions have been XENified recently, that 
is, they have libraries that treat thread local storage differently from the 
default. If these programs (powersaved, slapd, hscan) are all multithreaded, 
could it be that the cause of the problem is in that area?

If not, any clues on debugging/tracing? There's a 
/usr/src/linux/Documentation/oops-tracing.txt, but no "segfault-tracing".

I also learned that the error code is only documented for i386 arch (thanks to 
Emacs ediff):
 * error_code:
 *  bit 0 == 0 means no page found, 1 means protection fault
 *  bit 1 == 0 means read, 1 means write
 *  bit 2 == 0 means kernel, 1 means user-mode

So the problem (error 4) looks a bit like a read on a NULL-pointer dereference, 
right? And the "rip" is user space, correct?
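
Spelled out as code, the three bits quoted above decode like this. It is only a sketch of the i386 comment; the macro and function names are made up for the example, and I'm assuming the x86_64 layout matches for these three bits.

```c
#include <stdio.h>

/* Bit meanings taken from the i386 comment quoted above; names are mine. */
#define ERR_PROT  0x1	/* 0: no page found, 1: protection fault */
#define ERR_WRITE 0x2	/* 0: read, 1: write */
#define ERR_USER  0x4	/* 0: kernel, 1: user-mode */

/*
 * Returns a human-readable description of the low three error-code bits.
 * Uses a static buffer, so it is not thread-safe.
 */
const char *describe_fault(unsigned long error_code)
{
	static char buf[64];

	snprintf(buf, sizeof(buf), "%s, %s, %s",
		 (error_code & ERR_PROT) ? "protection fault" : "no page found",
		 (error_code & ERR_WRITE) ? "write" : "read",
		 (error_code & ERR_USER) ? "user-mode" : "kernel");
	return buf;
}
```

For "error 4" this yields "no page found, read, user-mode", which fits a user-space read from an unmapped address.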

Regards,
Ulrich


Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

2007-09-11 Thread Ulrich Windl

On 11 Sep 2007 at 17:04, Al Viro wrote:

> On Tue, Sep 11, 2007 at 05:54:38PM +0200, Ulrich Windl wrote:
> 
> > If not, any clues on debugging/tracing? There's a 
> > /usr/src/linux/Documentation/oops-tracing.txt, but no "segfault-tracing".
> 
> That would be because it has fsck-all to do with the kernel.  Get the
> coredump, then use gdb to deal with it.

Ok, but why is the message there at all? I think in Windows/XP the offending 
code and the registers are shown on such occasions. I'd say either drop the 
message, or improve it. It's also difficult to find the code after the program 
is gone due to mapping of shared libraries. I managed to get a core dump of the 
application however, and I did modify some code. I'll report once I have 
results.

Maybe it's "mea culpa" for my program, but powersaved and slapd are still to be 
examined.

Regards,
Ulrich


Formatting of /proc/meminfo

2015-06-23 Thread Ulrich Windl

Hi!

My eyes just discovered this mis-alignment for x86_64 machines:
--- 
# cat /proc/meminfo
MemTotal:       81366016 kB
MemFree:        36484504 kB
Buffers:         1018764 kB
Cached:         38230264 kB
[...]
VmallocTotal:   34359738367 kB
VmallocUsed:       92792 kB
VmallocChunk:   34359544432 kB
HardwareCorrupted:     0 kB
DirectMap4k:    132623356 kB
DirectMap2M:           0 kB
---

It seems the very big numbers for "Vmalloc" are OK, so I suggest updating the 
formatting. The current code looks like this (/usr/src/linux/fs/proc/meminfo.c):
---
seq_printf(m,
	"MemTotal:       %8lu kB\n"
	"MemFree:        %8lu kB\n"
	"Buffers:        %8lu kB\n"
[...]
---

So the field should be widened by three digits at least (%11lu kB). Maybe one 
could make the field width variable, depending on 32/64 bit (it would look like 
a waste on 32 bit platforms).

Maybe the code would be friendlier to changes if there was one seq_printf() per 
value. Then one could use something like
seq_printf(m, "%-16s%8lu kB\n", "MemTotal:", K(i.totalram))
instead, I guess... You could put the format (%-16s%8lu kB\n) in a constant 
also, allowing a change at one point to affect every item...
Probably gcc will optimize the code anyway, so there won't be much difference 
regarding performance.
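
A user-space sketch of that idea, with snprintf() standing in for seq_printf(): the macro MEMINFO_FMT and both function names are hypothetical, and I use "%*llu" so that the field width becomes a parameter instead of being hard-coded in every line.

```c
#include <stdio.h>

/* One shared format; the '*' takes the numeric field width as an argument. */
#define MEMINFO_FMT "%-16s%*llu kB\n"

/* Formats one line into buf; returns the number of characters needed. */
int format_meminfo_line(char *buf, size_t len, const char *label,
			unsigned long long kb, int width)
{
	return snprintf(buf, len, MEMINFO_FMT, label, width, kb);
}

/* Length of the formatted line, handy for checking column alignment. */
int meminfo_line_len(const char *label, unsigned long long kb, int width)
{
	char buf[64];

	return format_meminfo_line(buf, sizeof(buf), label, kb, width);
}
```

With a width of 8, the MemTotal line is 28 characters long but the VmallocTotal line is 31, which is exactly the misalignment shown above; with a width of 11 both come out at 31. Labels longer than 15 characters (like "HardwareCorrupted:") would still need separate treatment.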

Regards,
Ulrich Windl



heads up: clock_gettime(CLOCK_MONOTONIC, ...) is not monotonic with Xen PVM

2020-08-11 Thread Ulrich Windl

[...]", __func__,
   deltas[u].tv_nsec);
printf("%s: largest delta is 0.%09ld\n", __func__,
  deltas[CLOCK_SAMPLES - 2].tv_nsec);
if ( deltas[u].tv_nsec > tsp->tv_nsec )
tsp->tv_nsec = deltas[0].tv_nsec;
leave:  return(result);
}

/* main */
int main(int argc, char *argv[])
{
int result  = 0;
ts_t	res;

get_res(&res);
printf("resolution is 0.%09ld\n", res.tv_nsec);
return(result);
}
- it is intentional that the program aborts on error
Output from a newer machine:
get_res: resolution is 0.1
get_res: smallest delta is 0.00030
get_res: largest delta is 0.00050
resolution is 0.00030

Regards,
Ulrich Windl
(Keep me on CC if I should read your replies)

Q on ioctl BLKGETSIZE

2014-03-18 Thread Ulrich Windl

Hi!

I'm wondering (on a x86_64 SLES11 system):

"man 4 sd" says:
---
   BLKGETSIZE
  Returns the device size  in  sectors.   The  ioctl(2)  parameter
  should be a pointer to a long.
---

/usr/src/linux/block/ioctl.c (3.0.101-0.15) reads:
---
	case BLKGETSIZE:
		size = i_size_read(bdev->bd_inode);
		if ((size >> 9) > ~0UL)
			return -EFBIG;
		return put_ulong(arg, size >> 9);
---

Three questions:
1) Shouldn't the manual page say that the sector size is always 512 bytes, 
even on new disks with larger sectors?
2) Should the real sector size be used for new disks?
3) When using a 512-byte sector size, isn't the capacity limited to 2TB (2^31 
kB)?
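
To put numbers on question 3, the check from ioctl.c can be replayed in user space for both widths of unsigned long. This is my own sketch with a made-up function name; it suggests that the 2TB limit applies only where unsigned long has 32 bits, and as far as I can see <linux/fs.h> also defines BLKGETSIZE64, which returns the size in bytes as a 64-bit value.

```c
#include <stdint.h>

/*
 * Mirrors the quoted check "(size >> 9) > ~0UL" for an unsigned long of
 * ulong_bits bits (32 or 64). Returns 1 if the size in 512-byte units can
 * be reported by BLKGETSIZE, 0 if the kernel would return -EFBIG.
 */
int blkgetsize_fits(uint64_t size_bytes, unsigned ulong_bits)
{
	uint64_t sectors = size_bytes >> 9;
	uint64_t max = (ulong_bits >= 64) ? UINT64_MAX
					  : ((UINT64_C(1) << ulong_bits) - 1);

	return sectors <= max;
}
```

A device of exactly 2^41 bytes has 2^32 sectors, which is one more than a 32-bit unsigned long can hold, so it fails there but fits easily on x86_64.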

I'm not subscribed to LKML, so please keep me CC'd when answering.

Regards,
Ulrich



Q: setting the process name for ps

2014-04-03 Thread Ulrich Windl

Hi!

Currently one has to fiddle with argv[] in-place when trying to change the 
process name ("cmd") in Linux. However if you want to change the thread name 
("comm"), there is a syscall (prctl(PR_SET_NAME, ...)) for it.
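
For reference, the thread-name variant looks roughly like this. The two function names are mine; PR_SET_NAME and PR_GET_NAME are the documented prctl(2) options, and the kernel silently truncates the name to 15 characters plus the terminating NUL.

```c
#include <string.h>
#include <sys/prctl.h>

/*
 * Sets the calling thread's name ("comm") and reads it back into out,
 * which must provide at least 16 bytes. Returns 0 on success, -1 on error.
 */
int set_and_get_comm(const char *name, char out[16])
{
	if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
		return -1;
	memset(out, 0, 16);
	return prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0);
}

/* Returns 1 if setting name results in the thread name expected. */
int comm_roundtrip_matches(const char *name, const char *expected)
{
	char out[16];

	if (set_and_get_comm(name, out) != 0)
		return 0;
	return strcmp(out, expected) == 0;
}
```

The new name shows up in /proc/<pid>/comm and in "ps -o comm", but not in the argument list that plain "ps" prints, which is why the argv[] trick is still needed for that.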

For comparison, in HP-UX there is also a syscall to change the process name for 
ps:
---
#include <sys/pstat.h>

union pstun psu;

psu.pst_command = "foobar";
pstat(PSTAT_SETCMD, psu, strlen("foobar") - 1, 0, 0);
---

To be fair, HP-UX also has syscalls to get processes, threads and arguments:
pstat_getlwp()
pstat_getproc()
pstat_getcommandline()

As Linux is different, I wonder whether there are any plans to provide a 
syscall to change the process name.

For those who aren't afraid of ugly code, here's a quick-and-dirty example of 
how to change the process name in Linux (apologies, you guys know, but those 
who Google may not):
---
/* headers needed by the calls below */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>

static int delay(void)
{
	struct timespec ts;

	ts.tv_sec = 10;
	ts.tv_nsec = 0;
	return nanosleep(&ts, NULL);
}

int main(int argc, char *argv[])
{
	int l = strlen(argv[0]);

	if ( argc > 1 )
		l += 1 + strlen(argv[1]);
	if ( l < 20 ) {
		printf("provide a long argument\n");
		return 1;
	}
	printf("look: unchanged\n"); delay();
	/* overwrite argv[0] in place; the length check above keeps this in bounds */
	sprintf(argv[0], "proc %d", getpid());
	printf("look: process title\n"); delay();
	return 0;
}
---

As I'm not subscribed to LKML, please keep me CC'd on your replies!

Thanks & regards,
Ulrich Windl



Q: uniqueness of pthread_t

2014-03-24 Thread Ulrich Windl

Hi!

I'm programming a little bit with pthreads in Linux. As I understand pthread_t 
is an opaque type (a pointer address?) that cannot be mapped to the kernel's 
TID easily. Anyway: Is it expected that when one thread terminates and another 
thread is created (in fact the same thread again), the pthread_t may be 
reused?

It seems it just happened when debugging my program (or I made a programming 
mistake; the code compiles cleanly with -Wall):
---
[...]
cleanup_thread: 1 used connection entries
 5888 [^D] (null) -> 192.168.255.18/80
cleanup_thread: leaves with -1: No child processes
8407: [8413](172.20.64.67/54560) handles socket 4
cleanup_thread 139852035340032 terminated
created cleanup_thread 139852035340032
accepting connection (1 handlers)
[...]
---
libgthread-2_0-0-2.22.5-0.8.12.1 (SLES11 SP3 on x86_64)

The code handling the threads looks like this (so the results should be 
reliable, I guess):

	if ( tid != 0 && pthread_tryjoin_np(tid, &ret) == 0 )
	{
		DEBUG(1) printf("cleanup_thread %ld terminated\n",
				(long) tid);
		tid = 0;
	}
[...]
	if ( tid != 0 && pthread_tryjoin_np(tid, &ret) == 0 )
	{
		DEBUG(1) printf("cleanup_thread %ld terminated\n",
				(long) tid);
		tid = 0;
	}

(The code above also runs strictly sequentially)
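
A stripped-down way to look at this is to run two short-lived threads one after the other and record both the pthread_t value and the kernel TID of each. The sketch below uses names of my own and relies on glibc's pthread_t being an integer type (just like my "(long) tid" above), so it is not portable. As far as I understand POSIX, a thread ID may be reused once the thread has been joined, so an equal value would not be a bug.

```c
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

struct reuse_result {
	int handle_reused;	/* 1 if both threads had the same pthread_t value */
	long tid_first;		/* kernel TID of the first thread */
	long tid_second;	/* kernel TID of the second thread */
};

/* Thread body: stores its own kernel TID in the slot passed via arg. */
static void *record_tid(void *arg)
{
	*(long *)arg = (long)syscall(SYS_gettid);
	return NULL;
}

/* Runs two threads strictly sequentially; returns 0 on success, -1 on error. */
int run_twice(struct reuse_result *r)
{
	pthread_t a, b;
	unsigned long va, vb;

	if (pthread_create(&a, NULL, record_tid, &r->tid_first) != 0)
		return -1;
	va = (unsigned long)a;	/* glibc-specific: pthread_t is an integer */
	if (pthread_join(a, NULL) != 0)
		return -1;
	if (pthread_create(&b, NULL, record_tid, &r->tid_second) != 0)
		return -1;
	vb = (unsigned long)b;
	if (pthread_join(b, NULL) != 0)
		return -1;
	r->handle_reused = (va == vb);
	return 0;
}
```

Whether handle_reused comes out as 1 depends on the library's stack caching and is not guaranteed either way; the kernel TIDs are the more useful values for telling the two threads apart in debug output.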

And while I'm at it: There's a bug in the man page:
---
PTHREAD_TRYJOIN_NP(3)  Linux Programmer's Manual PTHREAD_TRYJOIN_NP(3)



NAME
   pthread_tryjoin_np,  pthread_timedjoin_np  -  try to join with a termi-
   nated thread

SYNOPSIS
   #define _GNU_SOURCE
   #include <pthread.h>

   int pthread_tryjoin_np(pthread_t thread, void **retval);

   int pthread_timedjoin_np(pthread_t thread, void **retval,
const struct timespec *abstime);

   Compile and link with -pthread.
---

You must include <pthread.h> after defining _GNU_SOURCE, and you must do this 
early in your includes...

I'm not subscribed to LKML, so please CC your answers to me.

Regards,
Ulrich



Re: Antw: Re: Some problems with HP DL380 G8 BIOS and SLES11 SP3

2014-08-18 Thread Ulrich Windl

>>> Don Zickus wrote on 18.08.2014 at 14:44 in message
<20140818124404.gl49...@redhat.com>:
> On Mon, Aug 18, 2014 at 08:12:44AM +0200, Ulrich Windl wrote:
>> >>> Don Zickus wrote on 14.08.2014 at 19:46 in message
>> <20140814174658.gv49...@redhat.com>:
>> > On Wed, Aug 13, 2014 at 05:22:17PM +0200, Ulrich Windl wrote:
>> >> Hello!
>> >> 
>> >> Running the current SLES11 SP3 kernel on a HP DL380 G8 server, there are 
>> > some kernel messages that indicate a bug either in the kernel or in the HP 
>> > BIOS. Maybe someone can explain, so I can try to get it fixed whatever 
> party 
>> > broke it...
>> >> 
>> >> Linux kernel is "3.0.101-0.35-default (geeko@buildhost) (gcc version 
>> >> 4.3.4 
>> > [gcc-4_3-branch revision 152973]" (latest).
>> >> HP server is "HP ProLiant DL380p Gen8, BIOS P70 02/10/2014" (latest)
>> > 
>> > Yes, it is because you are letting the firmware dynamically control your
>> > cpu frequency.  In order to accomplish they need to use a perf counter or
>> > two, hence the conflict.  Set the firmware setting to OS control and the
>> > problem goes away.  Contact HP for those instructions, they are very aware
>> > of this problem and recommend OS control to all high end servers.
>> 
>> Hi!
>> 
>> Thanks for answering, but the BIOS has set power management to "OS control" 
> (see attachment). So I guess it must be something different.
> 
> Hmm, sounds like it.  Regardless, the error message indicates the counters
> are in use most likely by the BIOS.  So you can ask HP what is going on.
> 
> I assume this is a normal bootup and not a kdump crash kernel, correct?

Yes, it's a normal boot. I'm afraid the standard hardware support at HP does 
not care much about such issues. (I remember those Xeon bugs that caused memory 
errors during longer idle phases in the G7 server, which are fixed by recent 
microcode updates: HP changed memory modules, and they changed the board, but 
it took very long until they updated the BIOS.)

Is there any more information I can provide to narrow down the problem?

Regards,
Ulrich

> 
> Cheers,
> Don
> 
>> 
>> Regards,
>> Ulrich
>> 
>> > 
>> > Cheers,
>> > Don
>> > 
>> >> 
>> >> During ACPI init I see:
>> >> [...]
>> >> Reserving 128MB of memory at 752MB for crashkernel (System RAM: 132095MB)
>> >> ACPI: RSDP 000f4f00 00024 (v02 HP)
>> >> ACPI: XSDT bddaed00 000D4 (v01 HP ProLiant 0002   322? 
>> > 162E)
>> >> ACPI: FACP bddaee40 000F4 (v03 HP ProLiant 0002   322? 
>> > 162E)
>> >> ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 
>> > (2011041
>> >> 3/tbfadt-611)
>> >> ACPI Warning: Invalid length for Pm2ControlBlock: 32, using default 8 
>> > (20110413/
>> >> tbfadt-611)
>> >> ACPI: DSDT bddaef40 026DC (v01 HP DSDT 0001 INTL 
>> > 20030228)
>> >> ACPI: FACS bddac140 00040
>> >> ACPI: SPCR bddac180 00050 (v01 HP SPCRRBSU 0001   322? 
>> > 162E)
>> >> ACPI: MCFG bddac200 0003C (v01 HP ProLiant 0001  
>> > )
>> >> [...]
>> >> 
>> >> HPET id 0 under DRHD base 0xf4ffe000
>> >> BIOS requests to not use x2apic
>> >> Use 'intremap=no_x2apic_optout' to override BIOS request
>> >> Enabled IRQ remapping in xapic mode
>> >> x2apic not enabled, IRQ remapping is in xapic mode
>> >> Switched APIC routing to physical flat.
>> >> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
>> >> CPU0: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz stepping 04
>> >> Performance Events: PEBS fmt1+, 16-deep LBR, IvyBridge events, Broken 
>> >> BIOS 
>> > detec
>> >> ted, complain to your hardware vendor.
>> >> [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
>> >> Intel PMU driver.
>> >> ... version:3
>> >> ... bit width:  48
>> >> ... generic registers:  4
>> >> ... value mask: 
>> >> ... max period: 7fff
>> >> ... fixed-purpose events:   3
>> >> ... event mask: 0007000f
>> >> N

Some problems with HP DL380 G8 BIOS and SLES11 SP3

2014-08-13 Thread Ulrich Windl

Hello!

Running the current SLES11 SP3 kernel on a HP DL380 G8 server, there are some 
kernel messages that indicate a bug either in the kernel or in the HP BIOS. 
Maybe someone can explain, so I can try to get it fixed by whichever party 
broke it...

Linux kernel is "3.0.101-0.35-default (geeko@buildhost) (gcc version 4.3.4 
[gcc-4_3-branch revision 152973]" (latest).
HP server is "HP ProLiant DL380p Gen8, BIOS P70 02/10/2014" (latest)

During ACPI init I see:
[...]
Reserving 128MB of memory at 752MB for crashkernel (System RAM: 132095MB)
ACPI: RSDP 000f4f00 00024 (v02 HP)
ACPI: XSDT bddaed00 000D4 (v01 HP ProLiant 0002   322? 162E)
ACPI: FACP bddaee40 000F4 (v03 HP ProLiant 0002   322? 162E)
ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 (20110413/tbfadt-611)
ACPI Warning: Invalid length for Pm2ControlBlock: 32, using default 8 (20110413/tbfadt-611)
ACPI: DSDT bddaef40 026DC (v01 HP DSDT 0001 INTL 20030228)
ACPI: FACS bddac140 00040
ACPI: SPCR bddac180 00050 (v01 HP SPCRRBSU 0001   322? 162E)
ACPI: MCFG bddac200 0003C (v01 HP ProLiant 0001  )
[...]

HPET id 0 under DRHD base 0xf4ffe000
BIOS requests to not use x2apic
Use 'intremap=no_x2apic_optout' to override BIOS request
Enabled IRQ remapping in xapic mode
x2apic not enabled, IRQ remapping is in xapic mode
Switched APIC routing to physical flat.
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz stepping 04
Performance Events: PEBS fmt1+, 16-deep LBR, IvyBridge events, Broken BIOS detected, complain to your hardware vendor.
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
Intel PMU driver.
... version:3
... bit width:  48
... generic registers:  4
... value mask: 
... max period: 7fff
... fixed-purpose events:   3
... event mask: 0007000f
NMI watchdog enabled, takes one hw-pmu counter.
Booting Node   0, Processors  #1
[...]

 pci:00: Requesting ACPI _OSC control (0x1d)
 pci:00: ACPI _OSC request failed (AE_SUPPORT), returned control mask: 0x00
ACPI _OSC control for PCIe not granted, disabling ASPM
[...]

 pci:20: Requesting ACPI _OSC control (0x1d)
 pci:20: ACPI _OSC request failed (AE_SUPPORT), returned control mask: 0x00
ACPI _OSC control for PCIe not granted, disabling ASPM
[...]
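
For what it's worth, here is my attempt to decode the "MSR 38d is 330" part. As far as I can tell from the Intel manuals, MSR 0x38d is IA32_FIXED_CTR_CTRL, with one 4-bit field per fixed-function counter. The struct and function names below are mine, and the field layout is my reading of the manual, so please correct me if it is wrong.

```c
#include <stdint.h>

/* One 4-bit control field of IA32_FIXED_CTR_CTRL (layout as I read it). */
struct fixed_ctr_field {
	int en_os;	/* bit 0: count while in ring 0 */
	int en_usr;	/* bit 1: count while in rings above 0 */
	int any_thread;	/* bit 2: count for all threads of the core */
	int pmi;	/* bit 3: raise an interrupt on overflow */
};

/* Extracts the control field of fixed counter idx (0, 1 or 2). */
struct fixed_ctr_field decode_fixed_ctr(uint64_t msr, unsigned idx)
{
	unsigned nibble = (unsigned)(msr >> (4 * idx)) & 0xf;
	struct fixed_ctr_field f = {
		nibble & 1, (nibble >> 1) & 1, (nibble >> 2) & 1, (nibble >> 3) & 1
	};

	return f;
}

/* Returns 1 if fixed counter idx is enabled for any privilege level. */
int fixed_ctr_enabled(uint64_t msr, unsigned idx)
{
	struct fixed_ctr_field f = decode_fixed_ctr(msr, idx);

	return f.en_os || f.en_usr;
}
```

With the value 0x330, fixed counter 0 is off while counters 1 and 2 are enabled for both privilege levels, so something seems to have claimed two of the three fixed-purpose counters before the kernel's PMU driver initialized.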

Regards,
Ulrich
P.S. Please CC: me, as I'm not on LKML...



Antw: Re: Some problems with HP DL380 G8 BIOS and SLES11 SP3

2014-08-17 Thread Ulrich Windl

>>> Don Zickus wrote on 14.08.2014 at 19:46 in message
<20140814174658.gv49...@redhat.com>:
> On Wed, Aug 13, 2014 at 05:22:17PM +0200, Ulrich Windl wrote:
>> Hello!
>> 
>> Running the current SLES11 SP3 kernel on a HP DL380 G8 server, there are 
> some kernel messages that indicate a bug either in the kernel or in the HP 
> BIOS. Maybe someone can explain, so I can try to get it fixed whatever party 
> broke it...
>> 
>> Linux kernel is "3.0.101-0.35-default (geeko@buildhost) (gcc version 4.3.4 
> [gcc-4_3-branch revision 152973]" (latest).
>> HP server is "HP ProLiant DL380p Gen8, BIOS P70 02/10/2014" (latest)
> 
> Yes, it is because you are letting the firmware dynamically control your
> cpu frequency.  In order to accomplish they need to use a perf counter or
> two, hence the conflict.  Set the firmware setting to OS control and the
> problem goes away.  Contact HP for those instructions, they are very aware
> of this problem and recommend OS control to all high end servers.

Hi!

Thanks for answering, but the BIOS has set power management to "OS control" 
(see attachment). So I guess it must be something different.

Regards,
Ulrich

> 
> Cheers,
> Don
> 
>> 
>> During ACPI init I see:
>> [...]
>> Reserving 128MB of memory at 752MB for crashkernel (System RAM: 132095MB)
>> ACPI: RSDP 000f4f00 00024 (v02 HP)
>> ACPI: XSDT bddaed00 000D4 (v01 HP ProLiant 0002   322? 162E)
>> ACPI: FACP bddaee40 000F4 (v03 HP ProLiant 0002   322? 162E)
>> ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 (20110413/tbfadt-611)
>> ACPI Warning: Invalid length for Pm2ControlBlock: 32, using default 8 (20110413/tbfadt-611)
>> ACPI: DSDT bddaef40 026DC (v01 HP DSDT 0001 INTL 20030228)
>> ACPI: FACS bddac140 00040
>> ACPI: SPCR bddac180 00050 (v01 HP SPCRRBSU 0001   322? 162E)
>> ACPI: MCFG bddac200 0003C (v01 HP ProLiant 0001  )
>> [...]
>> 
>> HPET id 0 under DRHD base 0xf4ffe000
>> BIOS requests to not use x2apic
>> Use 'intremap=no_x2apic_optout' to override BIOS request
>> Enabled IRQ remapping in xapic mode
>> x2apic not enabled, IRQ remapping is in xapic mode
>> Switched APIC routing to physical flat.
>> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
>> CPU0: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz stepping 04
>> Performance Events: PEBS fmt1+, 16-deep LBR, IvyBridge events, Broken BIOS detected, complain to your hardware vendor.
>> [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
>> Intel PMU driver.
>> ... version:3
>> ... bit width:  48
>> ... generic registers:  4
>> ... value mask: 
>> ... max period: 7fff
>> ... fixed-purpose events:   3
>> ... event mask: 0007000f
>> NMI watchdog enabled, takes one hw-pmu counter.
>> Booting Node   0, Processors  #1
>> [...]
>> 
>>  pci:00: Requesting ACPI _OSC control (0x1d)
>>  pci:00: ACPI _OSC request failed (AE_SUPPORT), returned control mask: 0x00
>> ACPI _OSC control for PCIe not granted, disabling ASPM
>> [...]
>> 
>>  pci:20: Requesting ACPI _OSC control (0x1d)
>>  pci:20: ACPI _OSC request failed (AE_SUPPORT), returned control mask: 0x00
>> ACPI _OSC control for PCIe not granted, disabling ASPM
>> [...]
>> 
>> Regards,
>> Ulrich
>> P.S. Please CC: me, as I'm not on LKML...
>> 
>> 

FYI: unclean Intel RAID reported as "clean"

2014-10-20 Thread Ulrich Windl

Hi!

I detected a problem with an Intel (imsm, ICH) RAID1 reported as "clean" by 
Linux, while the BIOS and Windows claimed the RAID is in state "rebuild". This 
was for an older kernel, and the bug had been reported to openSUSE bugzilla as 
bug #902000. Anyone interested can find the details there. I thought the 
problem is important enough to let you know...

Regards,
Ulrich



3.16.3: misdetected nsc-ircc version 01 in VMware?

2014-10-15 Thread Ulrich Windl

Hello,

a short note: Using the release candidate of openSUSE 13.2 (GNOME live medium), 
I see this when booting the kernel in VMware:
Oct 15 12:07:00 linux kernel: nsc-ircc, Found chip at base=0x02e
Oct 15 12:07:00 linux kernel: nsc-ircc, Wrong chip version 01


I doubt that VMware emulates an infrared chip, so I guess the chip detection 
is wrong.

VMware BIOS is detected as:
Vendor: "Phoenix Technologies LTD"
Version: "6.00"
Date: "08/16/2013"

The PCI-hardware looks like this:
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge 
(rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge 
(rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:07.7 System peripheral: VMware Virtual Machine Communication Interface (rev 
10)
00:0f.0 VGA compatible controller: VMware SVGA II Adapter
00:10.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X 
Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 PCI bridge: VMware PCI bridge (rev 02)
00:15.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.7 PCI bridge: VMware PCI Express Root Port (rev 01)
03:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)

The Linux kernel string is:
Linux linux.site 3.16.3-1.gd2bbe7f-desktop #1 SMP PREEMPT Thu Sep 18 06:32:16 
UTC 2014 (d2bbe7f) x86_64 x86_64 x86_64 GNU/Linux

Regards,
Ulrich



Q: EDAC/kprintf/Xen issue (long logs inline)

2014-10-07 Thread Ulrich Windl

Hi!

I have a somewhat strange issue on a Xen host running SLES11 SP3 on a HP DL380 
G7 server (two 5-core Xeon 5650 CPUs): At some time the system had RAM 
problems, and in one case the messages seemed to overwrite each other as seen 
in syslog. I wonder whether the locking of kprintf() is broken. See for yourself:

Mar 14 10:06:40 h05 kernel: [679593.489003] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489010] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489014] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489019] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489023] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489027] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489031] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
[...and so on...]
Mar 14 10:06:41 h05 kernel: [679593.501561] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501568] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501575] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501583] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501590] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501597] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501604] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501611] EDAC MC1: CE row 6, channel 0, 
label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501618] EDAC MC1: CE rohannel 0, label "": 
Corrected error (Socket=1hannel 0, label "": Corrected error (Socket=1 chhannel 
0, label "": Corrected error (Socket=1hanne
l 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501647] EDAhannel 0, label "": Corrected 
error (Socket=1 channel 0, label "": Corrected error (Socket=hannel 0, label 
"": Corrected error (Socket=1 chhannel 0, label
"": Corrected error (Socket=1 chhannel 0, label "": Corrected error (Socket=1 
hannel 0, label "": Corrected error (Socket=1 hannel 0, label "": Corrected 
error (Socket=1 chahannel 0, label "": Corrected
 error (Socket=1 hannel 0, label "": Corrected error (Socket=1hannel 0, label 
"": Corrected error (Socket=1 hannel 0, label "": Corrected error (Socket=1 
chhannel 0, label "": Corrected error (Socket=1
hannel 0, label "": Corrected error (Socket=1 hannel 0, label "": Corrected 
error (Socket=1 channel 0, label "": Corrected error (Socket=1 chhannel 0, 
label "": Corrected error (Socket=1 hannel 0, label
 "": Corrected error (Socket=1 hannel 0, label "": Corrected error (Socket=1 
chahannel 0, label "": Corrected error (Socket=1hannel 0, label "": Corrected 
error (Socket=1 hannel 0, label "": Corrected e
rror (Socket=1
Mar 14 10:06:41 h05 kernel:  hannel 0, label "": Corrected error (Socket=1 
chhannel 0, label "": Corrected error (Socket=1 hannel 0, label "": Corrected 
error (Socket=1 hannel 0, label "": Corrected err
or (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501830] EDAC MC1: CE row 6, channehannel 0, 
label "": Corrected error (Socket=1 chanhannel 0, label "": Corrected error 
(Socket=1 channel 0, label "": Corrected error
 (Socket=1 hannel 0, label "": Corrected error (Socket=1 chahannel 0, label "": 
Corrected error (Socket=1 chahannel 0, label "": Corrected error (Socket=1 
hannel 0, label "": Corrected error (Socket=1 h
annel 0, label "": Corrected error (Socket=1 chahannel 0, label "": Corrected 
error (Socket=1 hannel 0, label "": Corrected error (Socket=1 hannel 0, label 
"": Corrected error (Socket=1 hannel 0, label
"": Corrected error (Socket=1 chahannel 0, label "": Corrected error (Socket=1 
hannel 0, label "": Corrected error (Socket=hannel 0, label "": Corrected error 
(Socket=1 channel 0, label "": Corrected er
ror (Socket=1 channel 0, label "": Corrected error (Socket=1 hannel 0, label 
"": Corrected error (Socket=1 hannel 0, label "": Corrected error (Socket=1 
chahannel 0, label "": Corrected error (Socket=1
hannel 0, labe
Mar 14 10:06:41 h05 kernel: l "": Corrected error (Socket=1 hannel 0, label "": 
Corrected error (Socket=1 hannel 0, label "": Corrected erro

RFE: kernel message "rport-2:0-10: blocked FC remote port time out: removing rport"

2014-10-24 Thread Ulrich Windl

Hi!

I'd like to point out that the following Fibre Channel error message is of 
little practical use, because it lacks essential information:

kernel: [   68.406963]  rport-2:0-10: blocked FC remote port time out: removing 
rport
(Seen in the current SLES11 SP3 kernel 3.0.101-0.40-default)

As the logical port is removed (as the message says), you cannot find out what 
device (i.e. WWN) the problem message refers to.

For existing ports in /sys/class/fc_remote_ports/ you can query port_id, 
port_name, port_state, etc. But once the port is removed, you cannot get any 
information about it.
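
Until the message carries that information, one workaround would be to record the attributes periodically while the ports still exist. The following is only a minimal sketch under that assumption: the function name `snapshot_rports` is made up for illustration, and only the directory and the attribute names mentioned above (port_id, port_name, port_state) are taken from this report.

```python
import os

def snapshot_rports(root="/sys/class/fc_remote_ports"):
    """Map each rport name to its port_id, port_name and port_state."""
    result = {}
    if not os.path.isdir(root):
        return result  # no FC transport loaded, or not a Linux host
    for rport in sorted(os.listdir(root)):
        attrs = {}
        for attr in ("port_id", "port_name", "port_state"):
            try:
                with open(os.path.join(root, rport, attr)) as f:
                    attrs[attr] = f.read().strip()
            except OSError:
                attrs[attr] = None  # attribute missing or unreadable
        result[rport] = attrs
    return result

if __name__ == "__main__":
    for name, attrs in snapshot_rports().items():
        print(name, attrs)
```

Logging such a snapshot at regular intervals would allow "rport-2:0-10" to be matched to a WWN after the port has been removed.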

In a real environment, you cannot easily guess where the problem might be, 
especially as there is no other message about "rport-2:0-10" before it's being 
removed.

I suggest including the port ID (you can get the HBA WWN from that) and the port 
WWN (the target device port) in the error message if possible.

Regards,
Ulrich Windl



Xen. How an error message should not look like

2015-02-22 Thread Ulrich Windl

Hi!

This is a somewhat generic subject, so please forgive me. We are having a 
very strange Xen problem in SLES11 SP3 (kernel 3.0.101-0.46-xen).
Eventually I found out that the message
kernel: [615432.648108] vbd vbd-7-51888: 2 creating vbd structure
is not a "progress" message (some vbd structure was created), but an error 
message (the vbd structure was NOT created because of error "2").
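
Assuming the leading number is a standard Linux errno value (an assumption; the message itself does not say so), it can be decoded as in this minimal sketch. The helper name `describe_errno` is hypothetical.

```python
import errno
import os

def describe_errno(code):
    """Return 'NAME (text)' for an errno value; the sign is ignored."""
    code = abs(code)
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%s (%s)" % (name, os.strerror(code))

# On Linux, errno 2 is ENOENT.
print(describe_errno(2))
```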

So first the user has to recognize that the text actually is an error, then the 
user has to find out what "2" means. Interestingly, the most important 
information is missing (the block major and minor numbers of the device for 
vbd_create()). When I did a little digging in the code, I found this pearl 
(/usr/src/linux/drivers/xen/blkback/xenbus.c):

    switch (err) {
    case -ENOMEDIUM:
        if (!(be->blkif->vbd.type & (VDISK_CDROM | VDISK_REMOVABLE))) {
    default:
            xenbus_dev_fatal(dev, err, "creating vbd structure");
            break;
        }
        /* fall through */
    case 0:
        err = xenvbd_sysfs_addif(dev);
        if (err) {
            vbd_free(&be->blkif->vbd);
            xenbus_dev_fatal(dev, err, "creating sysfs entries");
        }
        break;
    }

vbd_create() itself does not log a message, and neither does 
blkdev_get_by_dev(), where the error comes from.

Regards,
Ulrich
P.S. Not subscribed to LKML, so please keep me on CC: to address me



Lower bound 0.05 on 15-Minute load?

2015-07-02 Thread Ulrich Windl

Hi!

I'm not subscribed, so please CC: me for your replies.

When graphing the CPU load, I noticed that the 15-minute average never drops 
below 0.05, while the 5-minute load and the 1-minute load do
(Kernel 3.0.101-0.47.52-xen of SLES11 on x86_64).

Is that a known bug? An interactive call of "uptime" seems to confirm my suspicion:
windl> uptime
 10:41am  up 23 days 18:49,  1 user,  load average: 0.08, 0.05, 0.05
windl> uptime
 10:48am  up 23 days 18:56,  1 user,  load average: 0.00, 0.04, 0.05
windl> cat /proc/loadavg
0.00 0.04 0.05 1/108 9704
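
One possible explanation can be explored with a small simulation. The sketch below assumes that the kernel updates each average in 11-bit fixed point with round-to-nearest, using the decay constants 1884, 2014 and 2037 for the 1-, 5- and 15-minute averages, and that /proc/loadavg adds 0.005 before truncating to two decimals. These are assumptions about kernels of that era, not something verified against the SLES source; the function names are made up for illustration.

```python
FSHIFT = 11               # bits of fractional precision
FIXED_1 = 1 << FSHIFT     # 1.0 in fixed point
EXP = {"1min": 1884, "5min": 2014, "15min": 2037}

def calc_load(load, exp, active):
    """One update step with round-to-nearest (assumed kernel behavior)."""
    load = load * exp + active * (FIXED_1 - exp)
    load += 1 << (FSHIFT - 1)
    return load >> FSHIFT

def displayed(load):
    """Format like /proc/loadavg: add FIXED_1/200, then truncate."""
    load += FIXED_1 // 200
    frac = ((load & (FIXED_1 - 1)) * 100) >> FSHIFT
    return "%d.%02d" % (load >> FSHIFT, frac)

def idle_floor(exp, start=FIXED_1):
    """Iterate with zero active tasks until the value stops changing."""
    load = start
    while True:
        nxt = calc_load(load, exp, 0)
        if nxt == load:
            return load
        load = nxt

for name, exp in EXP.items():
    print(name, displayed(idle_floor(exp)))
```

Under these assumptions the rounding step stops the decay of an idle system at 93/2048 for the 15-minute average, which is displayed as 0.05, while the 1-minute average can still reach a displayed 0.00.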

I'll attach a sample graph.

Regards,
Ulrich

Antw: Re: Lower bound 0.05 on 15-Minute load?

2015-07-02 Thread Ulrich Windl

>>> Martin Steigerwald wrote on 02.07.2015 at 11:26 in
message <1479160.a5Vb4cJSSF@merkaba>:
> On Thursday 02 July 2015 10:50:13 Ulrich Windl wrote:
>> Hi!
> 
> Hi Ulrich,
>  
>> I'm not subscribed, so please CC: me for your replies.
>> 
>> When graphing the CPU load, I noticed that the 15-minute average never
>> drops below 0.05, while the 5-minute load and the 1-minute load does
>> (Kernel 3.0.101-0.47.52-xen of SLES11 on x86_64).
> 
> Load average is *NOT* the CPU load although this is a very common 
> misconception.

I think the correlation of 1-min, 5-min and 15-min values is independent of the 
actual meaning of the value.

> 
> Load average indicates the number of processes that are waiting to be 
> scheduled / running (which is CPU saturation) *and* those that are waiting 
> uninterruptibly. You can have a high load average without much CPU 
> utilization, for example by running 20 find processes on a /home on NFS.
> 
> A high load can be CPU-bound but it doesn't need to be.

I knew.

> 
> So a high load only can indicate that things are running more slowly, but 
> not why, or well the why can be at least two things and does not need to be 
> CPU.

How is that related to my complaint/question?

> 
> Also the load is normalized to CPU cores.

Actually I don't think so, but that's also not related to the issue I reported. 
I know that the HP-UX load was the average load of every CPU, while for Linux the 
load seemed to be the sum of all CPU loads, meaning a load of 4 is low for a 
12-CPU machine. But that's all unrelated...

> 
>> Is that a known bug? Interactive call of "uptime" seems to confirm my
>> suspicion: windl> uptime
>>  10:41am  up 23 days 18:49,  1 user,  load average: 0.08, 0.05, 0.05
>> windl> uptime
>>  10:48am  up 23 days 18:56,  1 user,  load average: 0.00, 0.04, 0.05
>> windl> cat /proc/loadavg
>> 0.00 0.04 0.05 1/108 9704
>> 
>> I'll attach a sample graph.
> 
> Why should it be? As you can see in the graph you have higher spikes with the
> 1-minute average. As it's just an average over one minute it more easily drops 
> below 0,05. But the 5 minute and 15 minute averages need more time to drop 
> lower, so for them to become lower, you need longer times without spikes in 
> load average.
> 
> So it's natural you get "flatter" curves with higher average. Averages easily 
> hide things like spikes.

Actually it seems my "mathematical eye" is better than yours: I have another 
graph that shows the problem even more clearly (same kernel and hardware, just 
another machine).

Regards,
Ulrich

Issue with reading sysfs files

2019-10-17 Thread Ulrich Windl

Hi!

I wrote a simple tool to browse sysfs.

However, I noticed that some files have "r" (read) permission, but when I 
actually try to read from them, I get an I/O error.
So I wonder whether someone forgot to implement the read operation, or whether 
the read permission should be removed.
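
The check itself can be sketched as follows. This is not the tool mentioned above, just a minimal illustration with a made-up function name, `find_read_failures`. Note that reading arbitrary sysfs attributes may block or have side effects on some drivers, so it should be tried on a subtree first.

```python
import errno
import os
import stat

def find_read_failures(root):
    """Yield (path, errno name) for regular files that have a read
    permission bit set but fail when actually read."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished while walking
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and special files
            if not st.st_mode & (stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH):
                continue  # write-only attributes are fine as they are
            try:
                with open(path, "rb") as f:
                    f.read(4096)
            except OSError as e:
                yield path, errno.errorcode.get(e.errno, str(e.errno))

if __name__ == "__main__":
    for path, err in find_read_failures("/sys/devices/virtual/net/lo"):
        print(err, path)
```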

It seems to be implemented correctly in uevent, like
# ll /sys/module/drm/uevent
--w--- 1 root root 4096 Sep 24 12:24 /sys/module/drm/uevent

but it is not (e.g.) for
# ll 
/sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1/power/autosuspend_delay_ms
-rw-r--r-- 1 root root 4096 Sep 26 14:03 
/sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1/power/autosuspend_delay_ms
# cat 
/sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1/power/autosuspend_delay_ms
cat: 
'/sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1/power/autosuspend_delay_ms':
 Input/output error

Here's a summary of such candidates:
.../power/autosuspend_delay_ms
# ll 
/sys/devices/pci:00/:00:03.1/:01:00.0/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sdb/queue/wbt_lat_usec
-rw-r--r-- 1 root root 4096 Sep 26 14:03 
/sys/devices/pci:00/:00:03.1/:01:00.0/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sdb/queue/wbt_lat_usec
# cat 
/sys/devices/pci:00/:00:03.1/:01:00.0/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sdb/queue/wbt_lat_usec
cat: 
'/sys/devices/pci:00/:00:03.1/:01:00.0/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sdb/queue/wbt_lat_usec':
 Invalid argument

# ll /sys/devices/pci:00/:00:03.1/:01:00.0/resource0
-rw--- 1 root root 4096 Sep 26 14:03 
/sys/devices/pci:00/:00:03.1/:01:00.0/resource0
# cat /sys/devices/pci:00/:00:03.1/:01:00.0/resource0
cat: '/sys/devices/pci:00/:00:03.1/:01:00.0/resource0': 
Input/output error

/sys/devices/pci:00/:00:03.1/:01:00.0/resource2_wc

# ll /sys/devices/pci:00/:00:03.1/:01:00.0/rom
-rw--- 1 root root 262144 Sep 26 14:03 
/sys/devices/pci:00/:00:03.1/:01:00.0/rom
# cat /sys/devices/pci:00/:00:03.1/:01:00.0/rom
cat: '/sys/devices/pci:00/:00:03.1/:01:00.0/rom': Invalid argument

# ll /sys/devices/pci:80/:80:01.1/:81:00.0/net/em1/phys_port_id
-r--r--r-- 1 root root 4096 Sep 26 14:03 
/sys/devices/pci:80/:80:01.1/:81:00.0/net/em1/phys_port_id
# cat /sys/devices/pci:80/:80:01.1/:81:00.0/net/em1/phys_port_id
cat: '/sys/devices/pci:80/:80:01.1/:81:00.0/net/em1/phys_port_id': 
Operation not supported

.../net/em1/phys_port_name
.../net/em1/phys_switch_id

# ll 
/sys/devices/pci:80/:80:01.2/:82:00.0/:83:00.0/graphics/fb0/bl_curve
-rw-r--r-- 1 root root 4096 Sep 26 14:03 
/sys/devices/pci:80/:80:01.2/:82:00.0/:83:00.0/graphics/fb0/bl_curve
# cat 
/sys/devices/pci:80/:80:01.2/:82:00.0/:83:00.0/graphics/fb0/bl_curve
cat: 
'/sys/devices/pci:80/:80:01.2/:82:00.0/:83:00.0/graphics/fb0/bl_curve':
 No such device

# ll 
/sys/devices/pci:80/:80:08.1/:86:00.2/ata1/host1/scsi_host/host1/em_buffer
-rw-r--r-- 1 root root 4096 Oct 17 15:25 
/sys/devices/pci:80/:80:08.1/:86:00.2/ata1/host1/scsi_host/host1/em_buffer
# cat 
/sys/devices/pci:80/:80:08.1/:86:00.2/ata1/host1/scsi_host/host1/em_buffer
cat: 
'/sys/devices/pci:80/:80:08.1/:86:00.2/ata1/host1/scsi_host/host1/em_buffer':
 Invalid argument

.../em_message

# ll 
/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/scsi_host/host0/fw_crash_buffer
-rw-r--r-- 1 root root 4096 Sep 26 14:03 
/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/scsi_host/host0/fw_crash_buffer
# cat 
/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/scsi_host/host0/fw_crash_buffer
cat: 
'/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/scsi_host/host0/fw_crash_buffer':
 Invalid argument

# ll 
/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda1/trace/act_mask
-rw-r--r-- 1 root root 4096 Sep 26 14:03 
/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda1/trace/act_mask
# cat 
/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda1/trace/act_mask
cat: 
'/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda1/trace/act_mask':
 No such device or address

.../block/sda/sda1/trace/enable
.../block/sda/sda1/trace/end_lba
.../block/sda/sda1/trace/pid
.../block/sda/sda1/trace/start_lba

# ll /sys/devices/virtual/net/lo/duplex
-r--r--r-- 1 root root 4096 Sep 24 12:30 /sys/devices/virtual/net/lo/duplex
# cat /sys/devices/virtual/net/lo/duplex
cat: /sys/devices/virtual/net/lo/duplex: Invalid argument

# ll /sys/devices/virtual/net/lo/name_assign_type
-r--r--r-- 1 root root 4096 Sep 24 12:24 
/sys/devices/virtual/net/lo/name_assign_type
# cat /sys/devices/virtual/net/lo/name_assign_type
cat: /sys/devices/virtual/

leap seconds and 61st second in minute

2019-02-13 Thread Ulrich Windl

Hi!

I was recently following some discussion on the topic of leap seconds, and due 
to the basic role of time in the kernel, I'd like to send a "heads up" ("food 
for thought") with a proposal (not to start some useless discussion):

The UNIX timescale running in UTC had (I suppose) the idea that no timezone 
switches will affect it. Unfortunately leap-seconds have not been considered; 
maybe also because at that time everybody would be happy if the system's time 
was correct to a few seconds. I don't know.
However leap seconds are a kind of "time zone switch" conceptually.

Unfortunately the UNIX time scale does not consider them (as noted in the 
time(2) manual page). That's one part of POSIX. The other part of POSIX claims 
that during an inserted leap second there should be a 61st second in the 
minute. Unfortunately (AFAIK) there's no interface from the kernel's leap second 
handling to glibc allowing it to actually return 60 as the number of seconds. 
(localtime(3) only gets a pointer to time_t.)
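
The missing representation can be illustrated from user space. This minimal sketch uses Python's calendar.timegm(), which applies the plain days-times-86400 arithmetic of the UNIX timescale, and takes the leap second inserted at the end of 2016 as the example.

```python
import calendar
import time

# 2016-12-31 23:59:60 UTC was an inserted leap second.
leap = calendar.timegm((2016, 12, 31, 23, 59, 60, 0, 0, 0))
after = calendar.timegm((2017, 1, 1, 0, 0, 0, 0, 0, 0))

# Both map to the same time_t value, so the leap second has no
# representation of its own on the UNIX timescale.
print(leap, after, leap == after)

# Converting back can therefore never produce tm_sec == 60.
print(time.gmtime(leap).tm_sec)
```

Any conversion that receives only the time_t value has no way to tell the two instants apart, which is the interface gap described above.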

OTOH in the NTPv4 clock model there is a TAI offset included (which can be 
updated by NTP). AFAIK the kernel also has the timezone offset for some time to 
handle RTCs that run local time. Obviously if the kernel knew the number of 
leap seconds (the correction to the time() timescale) conversion from UNIX 
timescale to TAI would be easy.

So roughly, if the kernel exported the time and type of the next leap second 
scheduled, some future localtime could actually return the 61st second in the 
minute. As it is now, applications will all see some magic duplication of the 
60th second. (Maybe that's why Google does "leap smear".) If the kernel API 
would also export the TAI offset, one could even offer a TAI-based time, or, 
maybe even better: The kernel could run on TAI internally, and convert to UNIX 
time scale as needed.

I'll leave exact specification and implementation to the really clever guys.

Regards,
Ulrich
P.S. Not subscribed to the kernel-list so if you want to talk to me keep me on 
CC: preferably

Antw: 3.0.101: "blk_rq_check_limits: over max size limit."

2016-12-07 Thread Ulrich Windl

Hi again!

Maybe someone can confirm this:
If you have a device (e.g. multipath map) that limits max_sectors_kb to maybe 
64, and then define an LVM LV using that multipath map as PV, the LV still has 
a larger max_sectors_kb. If you send a big request (read in my case), the 
kernel will complain:

kernel: [173116.098798] blk_rq_check_limits: over max size limit.

Note that this message does not give any clue to the device being involved, nor 
the size of the IO attempted, nor the limit of the IO size.
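
Since the message names no device, one way to look for candidates is to compare queue/max_sectors_kb of every stacked device with the devices listed in its slaves directory. The following is a minimal sketch under the assumption that the usual /sys/block/<dev>/slaves/ and queue/max_sectors_kb layout is present; the function names are made up for illustration.

```python
import os

def read_int(path):
    """Read an integer sysfs attribute, or return None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def find_limit_mismatches(block_root="/sys/block"):
    """Return (device, its limit, slave, slave limit) tuples where the
    stacked device accepts larger requests than a device underneath it."""
    mismatches = []
    if not os.path.isdir(block_root):
        return mismatches
    for dev in sorted(os.listdir(block_root)):
        top = read_int(os.path.join(block_root, dev, "queue", "max_sectors_kb"))
        slaves_dir = os.path.join(block_root, dev, "slaves")
        if top is None or not os.path.isdir(slaves_dir):
            continue
        for slave in sorted(os.listdir(slaves_dir)):
            low = read_int(os.path.join(slaves_dir, slave,
                                        "queue", "max_sectors_kb"))
            if low is not None and top > low:
                mismatches.append((dev, top, slave, low))
    return mismatches

if __name__ == "__main__":
    for dev, top, slave, low in find_limit_mismatches():
        print("%s (%d KB) is stacked on %s (%d KB)" % (dev, top, slave, low))
```

Slaves that are partitions have no queue directory of their own and are skipped by this sketch.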

My expectation would be that the higher layer reports back an I/O error, and 
the user process receives an I/O error, or, alternatively, the big request is 
split into acceptable chunks before passing it to the lower layers.

However none of the above happens; instead the request seems to block the 
request queue, because later TUR-checks also fail:
kernel: [173116.105701] device-mapper: multipath: Failing path 66:384.
kernel: [173116.105714] device-mapper: multipath: Failing path 66:352.
multipathd: 66:384: mark as failed
multipathd: NAP_S11: remaining active paths: 1
multipathd: 66:352: mark as failed
multipathd: NAP_S11: Entering recovery mode: max_retries=6
multipathd: NAP_S11: remaining active paths: 0

(somewhat later)
 multipathd: NAP_S11: sdkh - tur checker reports path is up
multipathd: 66:336: reinstated
multipathd: NAP_S11: Recovered to normal mode
kernel: [173117.286712] device-mapper: multipath: Could not failover device 
66:368: Handler scsi_dh_alua error 8.
(I don't know the implications of this)

Of course this error does not appear as long as all devices use the same 
maximum request size, but tests have shown that different SAN disk systems 
prefer different request sizes (as they split large requests internally to 
handle them in chunks anyway).

Last seen with this kernel (SLES11 SP4 on x86_64): Linux version 
3.0.101-88-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch 
revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 (b45f205)

Regards,
Ulrich

>>> Ulrich Windl wrote on 23.08.2016 at 17:03 in message
<57BC65CD.D1A : 161 : 60728>:
> Hello!
> 
> While performance-testing a 3PARdata StorServ 8400 with SLES11SP4, I noticed 
> that I/Os dropped, until everything stood still more or less. Looking into 
> the syslog I found that multipath's TUR-checker considered the paths (FC, 
> BTW) as dead. Amazingly I did not have this problem when I did read-only 
> tests.
> 
> The start looks like this:
> Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
> Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
> Aug 23 14:44:58 h10 kernel: [  880.159425] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.159611] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.159615] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.159623] device-mapper: multipath: Failing 
> path 8:32.
> Aug 23 14:44:58 h10 kernel: [  880.186609] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.186626] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.186628] blk_rq_check_limits: over max 
> size limit.
> Aug 23 14:44:58 h10 kernel: [  880.186631] device-mapper: multipath: Failing 
> path 129:112.
> [...]
> It seems the TUR-checker does some ping-pong-like game: Paths go up and down
> 
> Now for the Linux part: I found the relevant message in blk-core.c 
> (blk_rq_check_limits()).
> First s/agaist/against/ in " *Such request stacking drivers should check 
> those requests agaist", then there's the problem that the message neither 
> outputs the blk_rq_sectors(), nor the blk_queue_get_max_sectors(), nor the 
> underlying device. That makes debugging somewhat difficult if you customize 
> the block queue settings per device as I did:
> 
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/rotational for FirstTest-31 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/add_random for FirstTest-31 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/scheduler for FirstTest-31 (noop)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/max_sectors_kb for FirstTest-31 (128)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/rotational for FirstTest-32 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/add_random for FirstTest-32 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of 
> queue/scheduler for FirstTest-32 (noop)
> Aug 23 14:32:34 h10 blocktune: (notice) start: activated tuning of 
> queue/max_sectors_kb for FirstTes

Antw: 3.0.101: "blk_rq_check_limits: over max size limit."

2016-12-07 Thread Ulrich Windl

Hi again!

An addition: Processes doing such I/O seem to be unkillable, and I also cannot 
change the queue parameters while this problem occurs (the process trying to 
write (e.g. to queue/scheduler) is also blocked). The process status of the 
process doing I/O looks like this:
# cat /proc/2762/status
Name:   randio
State:  D (disk sleep)
Tgid:   2762
Pid:2762
PPid:   53250
TracerPid:  0
Uid:0   0   0   0
Gid:0   0   0   0
FDSize: 0
Groups: 0 105
Threads:1
SigQ:   5/38340
SigPnd: 
ShdPnd: 4102
SigBlk: 
SigIgn: 
SigCgt: 00018000
CapInh: 
CapPrm: 
CapEff: 
CapBnd: 
Cpus_allowed:   ,
Cpus_allowed_list:  0-63
Mems_allowed:   
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0001
Mems_allowed_list:  0
voluntary_ctxt_switches:5
nonvoluntary_ctxt_switches: 1
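
To find all processes stuck in the same state, the State line of every /proc/<pid>/status file can be scanned. This is a minimal, Linux-only sketch; the function name `blocked_processes` is made up for illustration.

```python
import os

def blocked_processes(proc_root="/proc"):
    """Return (pid, name) for processes whose State line starts with 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fields = {}
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    key, sep, value = line.partition(":")
                    if sep:
                        fields[key] = value.strip()
        except OSError:
            continue  # process exited while we were scanning
        if fields.get("State", "").startswith("D"):
            found.append((int(entry), fields.get("Name", "?")))
    return sorted(found)

if __name__ == "__main__":
    for pid, name in blocked_processes():
        print(pid, name)
```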

Best regards,
Ulrich

>>> Ulrich Windl wrote on 07.12.2016 at 13:19 in message
<5847FE66.7E4 : 161 : 60728>:
> Hi again!
> 
> Maybe someone can confirm this:
> If you have a device (e.g. multipath map) that limits max_sectors_kb to 
> maybe 64, and then define an LVM LV using that multipath map as PV, the LV 
> still has a larger max_sectors_kb. If you send a big request (read in my 
> case), the kernel will complain:
> 
> kernel: [173116.098798] blk_rq_check_limits: over max size limit.
> 
> Note that this message does not give any clue to the device being involved, 
> nor the size of the IO attempted, nor the limit of the IO size.
> 
> My expectation would be that the higher layer reports back an I/O error, and 
> the user process receives an I/O error, or, alternatively, the big request is 
> split into acceptable chunks before passing it to the lower layers.
> 
> However none of the above happens; instead the request seems to block the 
> request queue, because later TUR-checks also fail:
> kernel: [173116.105701] device-mapper: multipath: Failing path 66:384.
> kernel: [173116.105714] device-mapper: multipath: Failing path 66:352.
> multipathd: 66:384: mark as failed
> multipathd: NAP_S11: remaining active paths: 1
> multipathd: 66:352: mark as failed
> multipathd: NAP_S11: Entering recovery mode: max_retries=6
> multipathd: NAP_S11: remaining active paths: 0
> 
> (somewhat later)
>  multipathd: NAP_S11: sdkh - tur checker reports path is up
> multipathd: 66:336: reinstated
> multipathd: NAP_S11: Recovered to normal mode
> kernel: [173117.286712] device-mapper: multipath: Could not failover device 
> 66:368: Handler scsi_dh_alua error 8.
> (I don't know the implications of this)
> 
> Of course this error does not appear as long as all devices use the same 
> maximum request size, but tests have shown that different SAN disk systems 
> prefer different request sizes (as they split large requests internally to 
> handle them in chunks anyway).
> 
> Last seen with this kernel (SLES11 SP4 on x86_64): Linux version 
> 3.0.101-88-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch 
> revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 (b45f205)
> 
> Regards,
> Ulrich
> 
> >>> Ulrich Windl wrote on 23.08.2016 at 17:03 in message <57BC65CD.D1A : 161 : 60728>:
> > Hello!
> > 
> > While performance-testing a 3PARdata StorServ 8400 with SLES11SP4, I noticed 
> > that I/Os dropped, until everything stood still more or less. Looking into 
> > the syslog I found that multipath's TUR-checker considered the paths (FC, 
> > BTW) as dead. Amazingly I did not have this problem when I did read-only 
> > tests.
> > 
> > The start looks like this:
> > Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
> > Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
> > Aug 23 14:44:58 h10 kernel: [  880.159425] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.159611] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.159615] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.159623] device-mapper: multipath: Failing path 8:32.
> > Aug 23 14:44:58 h10 kernel: [  880.186609] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.186626] blk_rq_check_limits: over max 
> > s

Antw: 3.0.101: "blk_rq_check_limits: over max size limit."

2016-12-07 Thread Ulrich Windl

Hi once more!

I managed to get the call traces of involved processes:

1) The process doing read():
Dec  7 13:51:16 h10 kernel: [183809.594434] SysRq : Show Blocked State
Dec  7 13:51:16 h10 kernel: [183809.594447]   task                        PC stack   pid father
Dec  7 13:51:16 h10 kernel: [183809.594750] randio  D 8801703a9d68     0  2762  53250 0x0004
Dec  7 13:51:16 h10 kernel: [183809.594758]  880100887ad8 0046 880100886010 00010900
Dec  7 13:51:16 h10 kernel: [183809.594765]  00010900 00010900 00010900 880100887fd8
Dec  7 13:51:16 h10 kernel: [183809.594772]  880100887fd8 00010900 88016bb6a280 88017670c300
Dec  7 13:51:16 h10 kernel: [183809.594778] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.594806]  [] io_schedule+0x9c/0xf0
Dec  7 13:51:16 h10 kernel: [183809.594820]  [] __lock_page+0x93/0xc0
Dec  7 13:51:16 h10 kernel: [183809.594834]  [] truncate_inode_pages_range+0x294/0x460
Dec  7 13:51:16 h10 kernel: [183809.594845]  [] __blkdev_put+0x1d7/0x210
Dec  7 13:51:16 h10 kernel: [183809.594856]  [] __fput+0xb3/0x200
Dec  7 13:51:16 h10 kernel: [183809.594868]  [] filp_close+0x5c/0x90
Dec  7 13:51:16 h10 kernel: [183809.594880]  [] put_files_struct+0x7a/0xd0
Dec  7 13:51:16 h10 kernel: [183809.594889]  [] do_exit+0x1d0/0x470
Dec  7 13:51:16 h10 kernel: [183809.594897]  [] do_group_exit+0x3d/0xb0
Dec  7 13:51:16 h10 kernel: [183809.594907]  [] get_signal_to_deliver+0x247/0x480
Dec  7 13:51:16 h10 kernel: [183809.594919]  [] do_signal+0x71/0x1b0
Dec  7 13:51:16 h10 kernel: [183809.594928]  [] do_notify_resume+0x98/0xb0
Dec  7 13:51:16 h10 kernel: [183809.594940]  [] int_signal+0x12/0x17
Dec  7 13:51:16 h10 kernel: [183809.594988]  [<7f64f28cbba0>] 0x7f64f28cbb9f

2) The process trying to modify the queue scheduler:
Dec  7 13:51:16 h10 kernel: [183809.594995] blocktune   D 88014e048000     0 58867  1 0x
Dec  7 13:51:16 h10 kernel: [183809.595000]  880128defd18 0086 880128dee010 00010900
Dec  7 13:51:16 h10 kernel: [183809.595007]  00010900 00010900 00010900 880128deffd8
Dec  7 13:51:16 h10 kernel: [183809.595013]  880128deffd8 00010900 88012889a3c0 8801767bc1c0
Dec  7 13:51:16 h10 kernel: [183809.595019] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.595026]  [] schedule_timeout+0x1b0/0x2a0
Dec  7 13:51:16 h10 kernel: [183809.595040]  [] msleep+0x1d/0x30
Dec  7 13:51:16 h10 kernel: [183809.595052]  [] __blk_drain_queue+0xbc/0x140
Dec  7 13:51:16 h10 kernel: [183809.595063]  [] elv_quiesce_start+0x51/0x90
Dec  7 13:51:16 h10 kernel: [183809.595071]  [] elevator_switch+0x4a/0x150
Dec  7 13:51:16 h10 kernel: [183809.595079]  [] elevator_change+0x6d/0xb0
Dec  7 13:51:16 h10 kernel: [183809.595086]  [] elv_iosched_store+0x27/0x60
Dec  7 13:51:16 h10 kernel: [183809.595096]  [] queue_attr_store+0x67/0xc0
Dec  7 13:51:16 h10 kernel: [183809.595106]  [] sysfs_write_file+0xcb/0x160
Dec  7 13:51:16 h10 kernel: [183809.595115]  [] vfs_write+0xce/0x140
Dec  7 13:51:16 h10 kernel: [183809.595123]  [] sys_write+0x53/0xa0
Dec  7 13:51:16 h10 kernel: [183809.595131]  [] system_call_fastpath+0x16/0x1b
Dec  7 13:51:16 h10 kernel: [183809.595140]  [<7f12b7f70c00>] 0x7f12b7f70bff

3) The process trying to read the queue scheduler:
Dec  7 13:51:16 h10 kernel: [183809.595149] cat D 880147873718     0 45053   5957 0x0004
Dec  7 13:51:16 h10 kernel: [183809.595155]  880122f7be00 0082 880122f7a010 00010900
Dec  7 13:51:16 h10 kernel: [183809.595161]  00010900 00010900 00010900 880122f7bfd8
Dec  7 13:51:16 h10 kernel: [183809.595167]  880122f7bfd8 00010900 8801154ea1c0 81a11020
Dec  7 13:51:16 h10 kernel: [183809.595174] Call Trace:
Dec  7 13:51:16 h10 kernel: [183809.595181]  [] __mutex_lock_slowpath+0x160/0x1f0
Dec  7 13:51:16 h10 kernel: [183809.595189]  [] mutex_lock+0x1a/0x40
Dec  7 13:51:16 h10 kernel: [183809.595196]  [] queue_attr_show+0x49/0xb0
Dec  7 13:51:16 h10 kernel: [183809.595203]  [] sysfs_read_file+0xfe/0x1c0
Dec  7 13:51:16 h10 kernel: [183809.595212]  [] vfs_read+0xc7/0x130
Dec  7 13:51:16 h10 kernel: [183809.595219]  [] sys_read+0x53/0xa0
Dec  7 13:51:16 h10 kernel: [183809.595226]  [] system_call_fastpath+0x16/0x1b
Dec  7 13:51:16 h10 kernel: [183809.595235]  [<7fb04560dba0>] 0x7fb04560db9f



>>> Ulrich Windl wrote on 07.12.2016 at 13:23 in message <5847FF5E.7E4 : 161 : 60728>:
> Hi again!
> 
> An addition: Processes doing such I/O seem to be unkillable, and I also 
> cannot change the queue parameters while this problem occurs (the process 
> trying to write, e.g. to queue/scheduler, is also blocked). The process 
> status of the process doing I/O looks like this:
> # cat /proc/2762/

MBR partitions slow?

2016-08-30 Thread Ulrich Windl

Hello!

(I'm not subscribed to this list, but I'm hoping to get a reply anyway)
While testing a SAN storage system, I needed a utility to erase disks 
quickly. I wrote my own, which mmap()s the block device, memset()s the area, 
then msync()s the changes, and finally close()s the file descriptor.
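
The sequence described above can be sketched as follows. This is a minimal Python illustration, not the flashzap tool itself (its source is not part of this thread); the function name zap() and the 1 MiB chunk size are assumptions chosen for the example, and the demo runs against a temporary regular file rather than a block device.

```python
import mmap
import os
import tempfile


def zap(path):
    """Overwrite the file or device at `path` with zero bytes through a shared mapping."""
    fd = os.open(path, os.O_RDWR)
    try:
        # Seeking to the end yields the size for block devices as well,
        # where st_size from fstat() would be 0.
        size = os.lseek(fd, 0, os.SEEK_END)
        with mmap.mmap(fd, size) as m:       # mmap(): shared, read/write by default
            chunk = bytes(1 << 20)           # 1 MiB of zeros (assumed chunk size)
            for off in range(0, size, len(chunk)):
                n = min(len(chunk), size - off)
                m[off:off + n] = chunk[:n]   # the memset() step
            m.flush()                        # the msync() step
        # leaving the with-block unmaps the region
    finally:
        os.close(fd)                         # the close() step
    return size


if __name__ == "__main__":
    # Demo on a scratch file; pointing this at a real device destroys its contents.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\xff" * 4096)
    print("zapped", zap(f.name), "bytes")
    os.unlink(f.name)
```

Dirty pages are written back by the kernel in whatever order and size it chooses, so the request pattern seen by the device is not under the program's control with this approach.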

On one disk I had a primary MBR partition spanning the whole disk, like this 
(output from some of my obscure tools):
disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512 
(10737418240 bytes)
partition 1 (1-20971520)
Total Sectors =   20971519

When wiping, I started (for no good reason) with partition 1, then I wiped 
the whole disk. The disk is 4-way multipathed to an 8Gb FC-SAN, and the disk 
system is all-SSD (32x2TB). The kernel is 3.0.101-80-default of SLES11 SP4.
For the test I had reduced the amount of RAM via "mem=4G". The machine's RAM 
bandwidth is about 9GB/s.

To my surprise I found out that the partition eats significant performance (not 
quite 50%, but a lot):

### Partition
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
time for fstat(): 0.17s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7Gib) at 
0x7fbc86739000: 0.39s
time to zap 10.7Gib: 52.474054s (204.62 MiB/s)
time to sync 10.7Gib: 4.148350s (2588.36 MiB/s)
time to unmap 10.7Gib at 0x7fbc86739000: 0.052170s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s

### Whole disk
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
time for fstat(): 0.61s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 
0x7fa2434cc000: 0.37s
time to zap 10.7Gib: 24.580162s (436.83 MiB/s)
time to sync 10.7Gib: 1.097502s (9783.51 MiB/s)
time to unmap 10.7Gib at 0x7fa2434cc000: 0.052385s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s

Reproducible:
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
time for fstat(): 0.65s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 
0x7f1cc17ab000: 0.37s
time to zap 10.7Gib: 24.624000s (436.06 MiB/s)
time to sync 10.7Gib: 1.199741s (8949.79 MiB/s)
time to unmap 10.7Gib at 0x7f1cc17ab000: 0.069956s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s

So without a partition the throughput is about twice as high! Why?
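
For reference, the rates can be recomputed from the sizes and times printed above. The sketch below is only arithmetic on those figures; the helper name rate() is made up for the example. It suggests that the values labelled "MiB/s" in the output correspond to decimal megabytes (10^6 bytes) per second, which does not change the ratio between the two runs.

```python
def rate(nbytes, seconds, unit):
    """Return throughput in units of `unit` bytes per second."""
    return nbytes / unit / seconds


DISK_BYTES = 10737418240      # size reported for the disk
MB, MIB = 10**6, 2**20

part_mb = rate(DISK_BYTES, 52.474054, MB)    # "time to zap" on the partition
whole_mb = rate(DISK_BYTES, 24.580162, MB)   # "time to zap" on the whole disk
part_mib = rate(DISK_BYTES, 52.474054, MIB)  # same run in binary units

print(f"partition:  {part_mb:.2f} MB/s ({part_mib:.2f} MiB/s)")
print(f"whole disk: {whole_mb:.2f} MB/s")
print(f"ratio:      {whole_mb / part_mb:.2f}")
```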

Regards
Ulrich

Antw: MBR partitions slow?

2016-08-30 Thread Ulrich Windl

Update:

I found out the bad performance was caused by partition alignment, and not by 
the partition per se (YaST created the partition next to the MBR). I compared 
two partitions, number 1 badly aligned, and number 2 properly aligned. Then I 
got these results:

Disk /dev/disk/by-id/dm-name-FirstTest-32: 10.7 GB, 10737418240 bytes
64 heads, 32 sectors/track, 10240 cylinders, total 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 16384 bytes / 16777216 bytes
Disk identifier: 0x00016340

Device Boot                                          Start         End      Blocks   Id  System
/dev/disk/by-id/dm-name-FirstTest-32-part1               1     5242879     2621439+  83  Linux
Partition 1 does not start on physical sector boundary.
/dev/disk/by-id/dm-name-FirstTest-32-part2         5242880    10485759     2621440   83  Linux
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.21s
time for fstat(): 0.60s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 2684.4MiB) at 
0x7f826a8a1000: 0.38s
time to zap 2684.4MiB: 11.734121s (228.76 MiB/s)
time to sync 2684.4MiB: 3.515991s (763.47 MiB/s)
time to unmap 2684.4MiB at 0x7f826a8a1000: 0.038104s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.673100s
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part2
time to open /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.20s
time for fstat(): 0.69s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part2 (size 2684.4MiB) at 
0x7fe18823e000: 0.44s
time to zap 2684.4MiB: 4.861062s (552.22 MiB/s)
time to sync 2684.4MiB: 0.811360s (3308.47 MiB/s)
time to unmap 2684.4MiB at 0x7fe18823e000: 0.038380s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.265687s

So the correctly aligned partition is two to three times faster than the badly 
aligned partition (write-only case), and it's about the performance of an 
unpartitioned disk.
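
The alignment of the two partitions can be checked with simple arithmetic on the values from the fdisk output above (512-byte sectors, minimum I/O size 16384 bytes, optimal I/O size 16777216 bytes). A small sketch; the function name misalignment() is made up for the example, and whether the storage system really works in chunks of exactly these sizes is an assumption.

```python
SECTOR = 512             # logical sector size from fdisk
MIN_IO = 16384           # "I/O size (minimum)" from fdisk
OPT_IO = 16777216        # "I/O size (optimal)" from fdisk


def misalignment(start_sector, io_size, sector_size=SECTOR):
    """Return how many bytes the partition start lies past an io_size boundary."""
    return (start_sector * sector_size) % io_size


for name, start in (("part1", 1), ("part2", 5242880)):
    print(name,
          "offset vs. minimum I/O:", misalignment(start, MIN_IO),
          "offset vs. optimal I/O:", misalignment(start, OPT_IO))
```

A result of 0 means the partition starts on a boundary; partition 1 starts 512 bytes past one, so writes that are aligned within the partition straddle the device's I/O units.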

Regards,
Ulrich

>>> Ulrich Windl wrote on 30.08.2016 at 11:32 in message <57C552B6.33D : 161 : 60728>:
> Hello!
> 
> (I'm not subscribed to this list, but I'm hoping to get a reply anyway)
> While testing some SAN storage system, I needed a utility to erase disks 
> quickly. I wrote my own one that mmap()s the block device, memset()s the 
> area, then msync()s the changes, and finally close()s the file descriptor.
> 
> On one disk I had a primary MBR partition spanning the whole disk, like this 
> (output from some of my obscure tools):
> disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512 
> (10737418240 bytes)
> partition 1 (1-20971520)
> Total Sectors =   20971519
> 
> When wiping, I started (for no good reason) to wipe partition 1, then I 
> wiped the whole disk. The disk is 4-way multipathed to a 8Gb FC-SAN, and the 
> disk system is all-SSD (32x2TB). Using kernel 3.0.101-80-default of SLES11 
> SP4.
> For the test I had reduced the amount of RAM via "mem=4G". The machine's RAM 
> bandwidth is about 9GB/s.
> 
> To my surprise I found out that the partition eats significant performance 
> (not quite 50%, but a lot):
> 
> ### Partition
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
> time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
> time for fstat(): 0.17s
> time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7Gib) at 
> 0x7fbc86739000: 0.39s
> time to zap 10.7Gib: 52.474054s (204.62 MiB/s)
> time to sync 10.7Gib: 4.148350s (2588.36 MiB/s)
> time to unmap 10.7Gib at 0x7fbc86739000: 0.052170s
> time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s
> 
> ### Whole disk
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
> time for fstat(): 0.61s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 
> 0x7fa2434cc000: 0.37s
> time to zap 10.7Gib: 24.580162s (436.83 MiB/s)
> time to sync 10.7Gib: 1.097502s (9783.51 MiB/s)
> time to unmap 10.7Gib at 0x7fa2434cc000: 0.052385s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s
> 
> Reproducible:
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
> time for fstat(): 0.65s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 
> 0x7f1cc17ab000: 0.37s
> time to zap 10.7Gib: 24.624000s (436.06 MiB/s)
> time to sync 10.7Gib: 1.199741s (8949.79 MiB/s)
> time to unmap 10.7Gib at 0x7f1cc17ab000: 0.069956s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s
> 
> So without partition the throughput is about twice as high! Why?
> 
> Regards
> Ulrich
> 
>

ioprio_set() & IOPRIO_WHO_PROCESS: Rename?

2015-07-13 Thread Ulrich Windl

Hi!

I noticed that older manual pages for ioprio_set(2) say that IOPRIO_WHO_PROCESS 
modifies the process, while I think it should be per thread. Newer manual pages 
say it's per thread, but shouldn't IOPRIO_WHO_PROCESS then be declared obsolete 
and replaced with a new IOPRIO_WHO_THREAD? (i.e. #define IOPRIO_WHO_PROCESS 
IOPRIO_WHO_THREAD /* and #define IOPRIO_WHO_THREAD */)
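
To illustrate the proposal, here is a small sketch of the constants involved, written in Python for brevity. IOPRIO_WHO_PROCESS = 1, the class numbers and the 13-bit class shift are the values from the kernel's ioprio header; IOPRIO_WHO_THREAD is hypothetical and does not exist in the kernel or in glibc, and the helper ioprio_value() merely mirrors the IOPRIO_PRIO_VALUE() macro.

```python
IOPRIO_CLASS_SHIFT = 13

# "which" argument of ioprio_set(2)/ioprio_get(2)
IOPRIO_WHO_PROCESS = 1
IOPRIO_WHO_PGRP = 2
IOPRIO_WHO_USER = 3

# Hypothetical alias as suggested above: same value, clearer name.
IOPRIO_WHO_THREAD = IOPRIO_WHO_PROCESS

# Scheduling classes
IOPRIO_CLASS_NONE = 0
IOPRIO_CLASS_RT = 1
IOPRIO_CLASS_BE = 2
IOPRIO_CLASS_IDLE = 3


def ioprio_value(io_class, data):
    """Combine class and priority level like the IOPRIO_PRIO_VALUE() macro."""
    return (io_class << IOPRIO_CLASS_SHIFT) | data


# Best-effort class, level 4, as it would be passed in the third syscall argument.
print(ioprio_value(IOPRIO_CLASS_BE, 4))
```

Because the alias has the same numeric value, existing binaries would keep working; only the name seen in source code would change.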

Regards,
Ulrich


