3.0.38: strange boot message: "Time: 165:165:165 Date: 165/165/65" with Xen
Hi!

I just discovered a strange "<6>[0.123867] Time: 165:165:165 Date: 165/165/65" boot message in a Xen DomU VM for SLES11 SP2 on AMD Opteron (x86_64). The context is:

...
<6>[0.080197] Initializing cgroup subsys net_cls
<6>[0.080199] Initializing cgroup subsys blkio
<6>[0.080204] Initializing cgroup subsys perf_event
<4>[0.080245] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
<4>[0.080245] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
<6>[0.080293] SMP alternatives: switching to UP code
<6>[0.103716] Brought up 1 CPUs
<6>[0.103788] devtmpfs: initialized
<6>[0.103977] print_constraints: dummy:
<6>[0.123867] Time: 165:165:165 Date: 165/165/65
<6>[0.123908] NET: Registered protocol family 16
<6>[0.124081] SMP alternatives: switching to SMP code
<6>[0.150019] Brought up 4 CPUs
<3>[0.150019] PCI: Fatal: No config space access function found
<6>[0.150019] PCI: setting up Xen PCI frontend stub
...

Maybe it's related to this hypervisor message (xm dmesg) in Xen Dom0 (but the RTC is at I/O port 0x70, right?):

(XEN) mm.c:833:d1 Non-privileged (1) attempt to map I/O space 00f0
(XEN) mm.c:833:d3 Non-privileged (3) attempt to map I/O space 00f0
(XEN) mm.c:833:d2 Non-privileged (2) attempt to map I/O space 00f0
(XEN) mm.c:833:d1 Non-privileged (1) attempt to map I/O space 00f0
(XEN) mm.c:833:d2 Non-privileged (2) attempt to map I/O space 00f0

It's not a big issue, but if the RTC cannot be read, it may be better to skip the message rather than output a wrong date/time.

Regards,
Ulrich

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
read stalls with large RAM: transparent huge pages, dirty buffers, or I/O (block) scheduler?
Hi!

We are running some x86_64 servers with large RAM (128GB). To put that in perspective: at a memory speed of a little more than 9GB/s it takes more than 10 seconds to read all RAM...

In the past and recently we had problems with read() stalls when the kernel was writing back large amounts (like 80GB) of dirty buffers to a somewhat slow (40MB/s) device. The problem is old and well-known, it seems, but not really solved.

One recommendation was to limit the amount of dirty buffers, which did not actually help to avoid the problem, specifically if new dirty buffers are used as soon as they are available (i.e. as soon as some were flushed). I had success with limiting the used memory (including dirty pages) with control groups (memory:iothrottle, SLES11 SP2), but the control framework (rccgconfig setting up proper rights for /sys/fs/cgroup/mem/iothrottle/tasks) is quite incomplete (no group write permission or ACL setup possible), so the end user can hardly use that.

I still don't know whether read stalls are caused by the I/O channel or device being saturated, or whether the kernel is waiting for unused buffers to receive the read data, but I learned that I/O schedulers (and possibly the block layer optimizations) can cause extra delays, too.

We had one situation where a single sector could not be read with direct I/O for 10 seconds.

Recently we had the problem again, but it was clear that it was _not_ the device being overloaded, nor was it the I/O channel. The read problem was reported for a device that was almost idle, and the I/O channel (FC) can handle much more than the disk system can in both directions. So the problem seems to be inside the kernel.

Oracle recommends (in article 1557478.1, without explaining the details) to turn off transparent huge pages. Before that I didn't think much about that feature.
It seems the kernel is not just creating huge pages when they are requested explicitly (that's what I had thought), but also implicitly to reduce the number of pages to be managed. Collecting smaller pages to combine them into huge pages may also involve moving memory around (compaction), it seems. I still don't know whether the kernel will also try to compact dirty cache pages into huge pages, but we still see read stalls when there are many dirty pages (like when copying 400GB of data to a somewhat slow (30MB/s) disk).

Now I wonder what the real solution to the problem (not the numerous work-arounds) would be. Obviously, simply pausing (yielding) the dirty buffer flush to give reads a chance may not be sufficient when a read needs to wait for unused pages, especially if the disks being read from are faster than those being written to.

To my understanding dirty pages have an "age" that is used to decide whether to flush them or not. Also the I/O scheduler seems to prefer read requests over write requests. What I do not know is whether a read request is sent to the I/O scheduler before buffer pages are assigned to the request, or after the pages were assigned. So a read request only has the chance to have an "age" once it has entered the I/O scheduler, right?

So if reads and writes both had an "age", some EDF (earliest deadline first) scheduling could be used to perform I/O (which would control buffer usage as a side-effect). For transparent huge pages, requests for a huge page should also have an age and a priority that is significantly below that of I/O buffers. If there exists an efficient algorithm and data model to perform these tasks, the problem may be solved.

Unfortunately, if many buffers are dirtied at one moment and reads are requested significantly later, there may be an additional need for time-slices when doing I/O (note: I'm not talking about quotas of some MB, but quotas of time).
The I/O throughput may vary a lot, and time seems the only way to manage latency correctly. To avoid a situation where reads cause writes to stall (and thus the age of dirty buffers growing without bounds), the priority of writes should be _carefully_ increased, taking care not to create a "freight train of dirty buffers" to be flushed. So maybe "smuggle in" a few dirty buffers between read requests. As a high-level flow control (like for the cgroups mechanism), processes with a large amount of dirty buffers should be suspended or scheduled with very low priority to give the memory and I/O systems a chance to process the dirty buffers.

For reference: The machine in question is at 3.0.74-0.6.10-default with the latest SLES11 SP2 kernel being 3.0.93-0.5.

I'd like to know what the gurus think about that. I think with increasing RAM this issue will become extremely important soon.

Regards,
Ulrich

P.S.: Not subscribed to linux-kernel, so keep me on CC:, please
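To put the sizes and rates quoted in this message side by side, here is a back-of-the-envelope sketch. It only does arithmetic on the figures from the text and assumes binary units (1 GB = 1024 MB), which the message does not state; the helper name `seconds_to_transfer` is made up for illustration.

```python
# Rough arithmetic on the figures quoted in the message.
# Assumption: binary units (GB = 1024**3 bytes, MB = 1024**2 bytes).
GB = 1024 ** 3
MB = 1024 ** 2


def seconds_to_transfer(num_bytes, bytes_per_second):
    """Time needed to move num_bytes at a constant rate."""
    return num_bytes / bytes_per_second


# Reading all 128GB of RAM at 9GB/s.
ram_scan_s = seconds_to_transfer(128 * GB, 9 * GB)

# Flushing 80GB of dirty buffers to a 40MB/s device.
writeback_s = seconds_to_transfer(80 * GB, 40 * MB)

# Copying 400GB to a 30MB/s disk.
copy_s = seconds_to_transfer(400 * GB, 30 * MB)

print(f"RAM scan:  {ram_scan_s:.1f} s")
print(f"Writeback: {writeback_s / 60:.1f} min")
print(f"Copy:      {copy_s / 3600:.2f} h")
```

Under these assumptions a full writeback of 80GB keeps the slow device busy for more than half an hour, which is the window in which competing reads can suffer.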
Re: read stalls with large RAM: transparent huge pages, dirty buffers, or I/O (block) scheduler?
I forgot to mention: CPU power is not the problem: We have 2 * 6 Cores (2 Threads each), making 24 logical CPUs...
ext3 corruption in 3.0 kernel (SLES11 SP2 x86_64 (AMD Opteron))
Hi!

I thought I'd let you know of two ext3 corruptions found on AMD Opteron servers running SLES11 SP2 (kernel-xen-3.0.42-0.7.3). The corruptions occurred at different times in different files on different machines: too much to be ignored.

The older one looked like this:

[75548.267404] EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #205978: rec_len % 4 != 0 - offset=4096, inode=2531699, rec_len=41331, name_len=38

And a more recent one looks like this:

kernel: [261958.359401] EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #85582: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0

As the nodes are running the Xen VMM in a cluster, it's possible that a node is reset at any time (fencing), but I thought a journaling filesystem would either not allow corruption or fix it.

In both cases I found the problem when a file could not be created, as in this RPM error message:

Error: RPM failed: error: unpacking of archive failed on file /lib/modules/3.0.42-0.7-default/kernel/drivers/media/video/cpia2/cpia2.ko;50c1fafd: cpio: open failed - Input/output error

After a reset I had to repair the filesystem manually, with these types of errors:

Inode 248552 was part of the orphaned inode list. FIXED.
Block bitmap differences:
Free blocks count wrong for group

After repair and reboot I still saw:

kernel: [ 698.061916] EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 68710
kernel: [ 698.061916] EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 68711

(dm-0 is the root Logical Volume)

CPU details (Sun X4100 Server) are:

vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : Dual Core AMD Opteron(tm) Processor 285
stepping : 2

(I know this CPU has some bugs with virtualization; is filesystem corruption one of them?)
Regards,
Ulrich
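The two error strings above can be reproduced with a small model of directory-entry sanity checks. This is only a sketch modelled on the wording of the messages, not code taken from the kernel: the 8-byte entry header and the rounding to a multiple of 4 are assumptions about the ext3 on-disk layout, and the names `min_rec_len` and `check_dir_entry` are made up.

```python
def min_rec_len(name_len):
    """Smallest record length for a name of name_len bytes.

    Assumption: an 8-byte fixed header followed by the name,
    rounded up to a multiple of 4.
    """
    return (name_len + 8 + 3) & ~3


def check_dir_entry(rec_len, name_len):
    """Return an error string for an implausible entry, else None."""
    if rec_len < min_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < min_rec_len(name_len):
        return "rec_len is too small for name_len"
    return None


# Values taken from the two log messages in the report.
print(check_dir_entry(rec_len=41331, name_len=38))
print(check_dir_entry(rec_len=0, name_len=0))
```

With the values from the first message the alignment check fires (41331 is not a multiple of 4); with the all-zero entry from the second message the minimum-length check fires first.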
Q: using cgroups in real life
Hi!

I have a question on cgroups (as of Linux 3.0): The concept is to mount a filesystem, and configure cgroups through it. This implies that all the files belong to root (or maybe some other fixed user). AFAIK, you can chmod() and chown() files, but these bits are only kept in the i-node cache, so they may change at any time.

I think this is bad, because if you want to allow users to limit resources (maybe memory usage) by using some predefined cgroup, the user needs at least partial write access to that cgroup (to add the PID). Probably this also means the user could add any PID (even those of processes not owned by him).

The alternative is that a privileged task manages cgroups and PIDs. This is difficult, for example, if the process to control does not exist yet (e.g. the user logs in and then starts some process).

It gets tricky if the user runs some big fat database (which should work at peak performance), and later logs in to do a backup of the database (which is not time critical, and should not steal all the I/O bandwidth).

I wonder what a solution would look like that allows the user to limit the bandwidth (maybe use of page cache, too) of the backup in a reliable way. Being paranoid, the user should at most be able to limit his own processes.

I cannot envision a proper solution with the current interface. Would anybody share some good ideas with me?

(I'm not subscribed to the kernel list, so please CC:)

Regards,
Ulrich
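One way to picture the "privileged task manages cgroups and PIDs" alternative is a small helper that refuses to move processes the requesting user does not own. The sketch below is hypothetical: the function name, the idea of checking the owner of the `/proc/<pid>` directory, and the `proc_root` parameter (present only so the function can be exercised against a scratch directory) are all assumptions, not an existing interface.

```python
import os


def add_own_task(tasks_file, pid, requester_uid, proc_root="/proc"):
    """Write pid into a cgroup 'tasks' file, but only if the process
    belongs to requester_uid.

    Assumption: the owner of <proc_root>/<pid> reflects the user the
    process runs as.
    """
    owner = os.stat(os.path.join(proc_root, str(pid))).st_uid
    if owner != requester_uid:
        raise PermissionError(
            f"PID {pid} belongs to uid {owner}, not uid {requester_uid}")
    # A cgroup tasks file accepts one PID per write.
    with open(tasks_file, "w") as tasks:
        tasks.write(f"{pid}\n")
```

Such a helper would still need a trustworthy way to learn who is asking (for example a setuid wrapper or a local socket with credential passing), and it does not solve the case where the process does not exist yet.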
3.0: blk-cgroup.c: allow '*' for device selection for "throttle.read_bps_device" and alike
Hi!

I have a wish for Linux 3.x and the blkio cgroup subsystem: allow specifying any device, like:

blkio.throttle.read_bps_device = "*:* 41943040"

Why: With multipathing in effect, you can't predict in advance the device number your device will have (I'm talking about "/etc/cgconfig.conf"). Example:

# multipath -ll |grep dm-
CBW_DB_FATA-E2 (3600508b4001085dd00011380) dm-10 HP,HSV200
CBW_CI-E2 (3600508b4001085dd0001140c) dm-12 HP,HSV200
CBW_DB_FATA-E1 (3600508b4001085e3f181) dm-18 HP,HSV200
CBW_CI-E1 (3600508b4001085e3f1f6) dm-19 HP,HSV200
DP_FileLib-E2 (3600508b4001085dd000114be) dm-16 HP,HSV200
CBW_DB_Exe-E2 (3600508b4001085dd0001137b) dm-11 HP,HSV200
CBW_DB_Exe-E1 (3600508b4001085e3f17e) dm-17 HP,HSV200
DP_DB_10k-E2 (3600508b4001085dd00011448) dm-15 HP,HSV200
CBW_DB_BTD-E2 (3600508b4001085dd000115ad) dm-14 HP,HSV200
DP_DB_10k-E1 (3600508b4001085e3f246) dm-20 HP,HSV200
CBW_DB_10k-E2 (3600508b4001085dd000115a5) dm-9 HP,HSV200
CBW_DB_10k-E1 (3600508b4001085e3f35f) dm-13 HP,HSV200

The code is in blkio_policy_parse_and_set() of blk-cgroup.c.

Regards,
Ulrich
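Until a wildcard exists, a possible userspace workaround is to resolve the device numbers at run time and generate the "major:minor rate" strings from stable names. The sketch below assumes the multipath aliases appear as nodes under `/dev/mapper/`; the helper names are made up, and the major number 253 in the example is only a placeholder.

```python
import os


def device_numbers(path):
    """Return (major, minor) of the block device node at path."""
    st = os.stat(path)  # follows symlinks such as /dev/mapper aliases
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def throttle_rule(major, minor, bytes_per_second):
    """Format one line for blkio.throttle.read_bps_device."""
    return f"{major}:{minor} {bytes_per_second}"


# Example with placeholder numbers: 40MB/s expressed in bytes per second.
print(throttle_rule(253, 13, 40 * 1024 * 1024))
```

On a real system one would call `device_numbers("/dev/mapper/CBW_DB_10k-E1")` (or whichever alias applies) after multipathd has set up the maps, and write the resulting rule into the cgroup's `blkio.throttle.read_bps_device` file. That has to be repeated whenever the maps are recreated, which is exactly the inconvenience the wildcard would remove.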
"floppy0: floppy timeout called", https://bugzilla.novell.com/show_bug.cgi?id=799559
Hi!

Maybe someone wants to have a look at kernel messages that look like debug dumps from the floppy driver. These messages fill up syslog unnecessarily. You can find the kernel messages in https://bugzilla.novell.com/show_bug.cgi?id=799559.

Last seen in kernel-default-3.7.10-1.11.1.i586 of openSUSE 12.3...

Regards,
Ulrich
Fwd: Q: NFS: directory XX/YYY contains a readdir loop.Please contact your server vendor.
Re-sent due to "5.7.1 Content-Policy reject msg: The capital Triple-X in subject is way too often associated with junk email, please rephrase. ":

>>> "Ulrich Windl" wrote on 16.08.2013 at 10:29 in message <520e15ef.ed38.00a...@rz.uni-regensburg.de>:
> Hi,
>
> recently I found out that we hit the "NFS: directory in/mdoc contains a readdir loop.Please contact your server vendor." message frequently on an NFS client running SLES11 SP2 (3.0.80-0.7-default). The NFS server is also SLES11 SP2, and the exported filesystem is ext3 with "dir_index" on.
>
> SLES support suggested turning off "dir_index" in ext3, which "should be safe".
>
> I googled the problem, and I found some (to me) vague description by Ted Ts'o ("If not readdir() then what?") back in 2011 referring to ext3.
>
> Now I wonder: Is this problem restricted to just ext3, or does it affect any filesystem?
>
> We have (and I cannot change it) directories with many files, even if just temporary.
>
> The statistics say: "122431/524288 files (3.4% non-contiguous), 1230006/2097152 blocks"
>
> The biggest directory is almost 1MB in size, but has just about 16513 directory entries.
>
> I'm wondering whether "directory compaction" (compacting the slots of removed entries) would help with the problem. In HP-UX VxFS you could do directory compaction online...
>
> If you can explain the relationship of ext3 and other filesystems with this bug, please reply keeping the CC:
>
> Thank you,
> Ulrich
chown: s-Bits: to clear or not to clear
Hi folks,

I discovered (SLES11 SP2 with kernel 3.0.80) that a chown executed by root (from non-root to non-root user) clears any s-bits that were set for the old owner. The man page (man 2 chown) says:

When the owner or group of an executable file are changed by a non-superuser, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().

As there are good arguments for and against clearing the s-bits during chown, there are probably only good arguments for having an option for chown(1) to preserve the s-bits. What do you think? (I know this is the wrong list for discussing utils.)

Regards,
Ulrich Windl
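Without such an option, a caller can approximate "preserve the s-bits" by recording the mode before the chown and restoring it afterwards. The sketch below shows that idea; the function name is made up, and the approach is an assumption about what a `chown(1)` option might do, not existing behaviour. Note that it is not atomic: between the two calls the file briefly lacks the bits, and deliberately re-setting setuid for a new owner is security-sensitive.

```python
import os
import stat


def chown_keep_sbits(path, uid, gid):
    """Change ownership, then restore the permission bits (including
    S_ISUID/S_ISGID) if the kernel cleared any of them."""
    before = stat.S_IMODE(os.stat(path).st_mode)
    os.chown(path, uid, gid)
    after = stat.S_IMODE(os.stat(path).st_mode)
    if after != before:
        os.chmod(path, before)
```

Passing -1 for `uid` or `gid` leaves that ID unchanged, as with the underlying system call.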
Possible mmap() write() problem in SLES11 SP2 kernel
Hi folks!

I thought I'd let you know (maybe I'm wrong, and the kernel is right):

I wrote a C program that maps a file into a private writable mapping. Then I modify the area a bit and use one write() to write that area back to a file.

This worked fine in SLES11 kernel 3.0.74-0.6.10. However with kernel 3.0.80-0.7 the write() fails with EFAULT if the output file is the same as the input file.

The strace is amazingly short (I removed the unrelated calls):

open("xxx", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=4416, ...}) = 0
mmap(NULL, 4416, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x7f85ac045000
close(3) = 0
open("xxx", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
write(3, 0x7f85ac045000, 4414) = -1 EFAULT (Bad address)
close(3) = 0
munmap(0x7f85ac045000, 4414) = 0

I want to have your attention if this should work, and you get my attention if this should not work.

Note that the input file is closed before it's opened for write again. As the output file is typically shorter than the input, I didn't want to use a non-private mapping and a truncate, just in case you wonder...

Regards,
Ulrich
Re: Possible mmap() write() problem in SLES11 SP2 kernel
>>> Hugh Dickins wrote on 04.08.2013 at 00:37:
> On Thu, 1 Aug 2013, Ulrich Windl wrote:
>> Hi folks!
>>
>> I think I'd let you know (maybe I'm wrong, and the kernel is right):
>>
>> I write a C-program that maps a file into an private writable map. Then I modify the area a bit and use one write to write that area back to a file.
>>
>> This worked fine in SLES11 kernel 3.0.74-0.6.10. However with kernel 3.0.80-0.7 the write() fails with EFAULT if the output file is the same as the input file.
>
> I wonder if you actually did exactly the same on both kernels.

Hi! Thanks for replying! Actually I did the same a few thousand times (with different files and different lengths) on the previous kernel, where it never failed, just as on the newer kernel, where it always fails (it seems).

>> The strace is amazingly short (I removed the unrelated calls):
>
> Providing that was very helpful.
>
>> open("xxx", O_RDONLY) = 3
>> fstat(3, {st_mode=S_IFREG|0644, st_size=4416, ...}) = 0
>> mmap(NULL, 4416, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x7f85ac045000
>> close(3) = 0
>> open("xxx", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
>
> The crucial point is the above O_TRUNC when you now open the file for writing: that truncates the file to 0-length, which unmaps any pages mapped from it into userspace. Even the privately modified COW pages:

Well, but the mapping is PRIVATE, so I guessed that once mapped, changes to the map won't affect the file, just as changes to the file won't affect the map. Specifically, when re-opening the file for writing with O_TRUNC I did not expect the map to become invalid. Also note that the munmap() still returns no error. My manual page vaguely says: "It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region."

> that often seems surprising, but it is how mmap versus truncate is specified to work.
>> write(3, 0x7f85ac045000, 4414) = -1 EFAULT (Bad address)
>
> If your program now touched a part of the mapping, it would get SIGBUS, there being no pages of underlying object to page in from. But since you're accessing the area from within a system call, that simply fails with EFAULT.

OK, if things are like this, the older kernel must have been faulty.

>> close(3) = 0
>> munmap(0x7f85ac045000, 4414) = 0
>>
>> I want to have your attention if this should work, and you get my attention if this should not work.
>
> It should not work.
>
>> Note that the input file is closed before it's opened for write again. As the output file is typically shorter than the input, I didn't want to use a non-private mapping and a truncate, just in case you wonder...
>
> (I didn't understand your logic there.)

The alternative to write()-ing a part of the PRIVATE area would be to work with a non-PRIVATE area that is truncated after flushing the changes. In principle the same blocks could be written multiple times (when you move data from later parts to earlier parts, i.e. from the far end closer to the beginning), so I thought a PRIVATE mapping plus one write() would avoid that. I had the choice of truncating while opening, or truncating the extra data after write(). I chose the first alternative. Maybe I'll re-design...

Thanks,
Ulrich

> Hugh
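Given the explanation in this thread, one way to keep the "private mapping plus one write" design is to copy the wanted bytes out of the mapping into ordinary memory before the file is truncated. The sketch below illustrates that ordering; the function name and the `edit` callback are made up for the example, and it is an illustration of the idea rather than a rewrite of the original C program.

```python
import mmap
import os


def rewrite_via_private_map(path, new_length, edit):
    """Map path privately, let edit() modify the mapping, then write the
    first new_length bytes back to the same file.

    The data is copied out of the mapping BEFORE the file is reopened
    with truncation, because truncating the file invalidates the mapped
    pages. An empty file cannot be mapped, so path must be non-empty.
    """
    with open(path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        mapping = mmap.mmap(src.fileno(), size,
                            flags=mmap.MAP_PRIVATE,
                            prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        edit(mapping)                # changes stay private to this process
        data = mapping[:new_length]  # slicing copies into a bytes object
    finally:
        mapping.close()
    with open(path, "wb") as dst:    # truncation happens here, after the copy
        dst.write(data)
```

The cost is one extra in-memory copy of the output. Writing to a temporary file and renaming it over the original would avoid touching the mapped file at all and is also safer against crashes.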
message (3.0.80) "kernel: [440682.559851] blk_rq_check_limits: over max size limit."
Hi!

I just did some block device tuning according to some expert's advice, which resulted in multipath failures. I'm not going to discuss this as I'll have to investigate further, but I'd like to point out that messages like "[440682.559851] blk_rq_check_limits: over max size limit." lack the affected device! It's quite hard to debug if you have more than 90 disks attached:

Aug 6 14:58:08 h06 multipathd: 65:240: mark as failed
Aug 6 14:58:08 h06 multipathd: SAP_T11_I03-E3: remaining active paths: 4
Aug 6 14:58:08 h06 multipathd: 68:48: mark as failed
Aug 6 14:58:08 h06 multipathd: SAP_T11_I03-E3: remaining active paths: 3
Aug 6 14:58:08 h06 multipathd: 8:208: mark as failed
Aug 6 14:58:08 h06 multipathd: SAP_T11_I03-E4: remaining active paths: 3
Aug 6 14:58:08 h06 multipathd: 67:80: mark as failed
Aug 6 14:58:08 h06 multipathd: SAP_T11_I03-E4: remaining active paths: 2
Aug 6 14:58:08 h06 kernel: [440682.559851] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.559891] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.559916] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.559928] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.559966] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.559996] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560016] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560037] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560058] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560106] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560149] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560176] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560189] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560223] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560257] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560277] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560296] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560319] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.560333] device-mapper: multipath: Failing path 65:240.
Aug 6 14:58:08 h06 kernel: [440682.560346] device-mapper: multipath: Failing path 68:48.
Aug 6 14:58:08 h06 kernel: [440682.560429] device-mapper: multipath: Failing path 67:80.
Aug 6 14:58:08 h06 kernel: [440682.560436] device-mapper: multipath: Failing path 8:208.
Aug 6 14:58:08 h06 kernel: [440682.561345] sd 2:0:0:7: alua: port group 01 state N non-preferred supports tolusNA
Aug 6 14:58:08 h06 kernel: [440682.561500] sd 3:0:2:7: alua: port group 01 state N non-preferred supports tolusNA
Aug 6 14:58:08 h06 kernel: [440682.562075] sd 2:0:4:7: alua: port group 01 state N non-preferred supports tolusNA
Aug 6 14:58:08 h06 kernel: [440682.562257] sd 3:0:4:7: alua: port group 01 state N non-preferred supports tolusNA
Aug 6 14:58:08 h06 kernel: [440682.562393] sd 3:0:8:7: alua: port group 01 state N non-preferred supports tolusNA
Aug 6 14:58:08 h06 kernel: [440682.676078] sd 3:0:2:7: alua: port group 01 switched to state A
Aug 6 14:58:08 h06 kernel: [440682.676091] sd 2:0:0:7: alua: port group 01 switched to state A
Aug 6 14:58:08 h06 kernel: [440682.676108] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.676112] blk_rq_check_limits: over max size limit.
Aug 6 14:58:08 h06 kernel: [440682.676115] blk_rq_check_limits: over max size limit.

Is this problem fixed in a newer kernel?

For the curious: I tuned queue/max_sectors_kb for the paths in a multipath device, but didn't tune the multipath device itself...
Regards,
Ulrich

P.S.: Please keep CC: as I'm not subscribed to the list
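Since the kernel message does not name the device, a userspace cross-check of the tuning described above can help narrow things down. The sketch below walks sysfs and reports stacked devices whose `max_sectors_kb` is larger than that of one of their underlying devices. It assumes the usual sysfs layout (`/sys/block/<dev>/queue/max_sectors_kb` and `/sys/block/<dev>/slaves/`), and that such a mismatch is what triggered the messages here, which the report suggests but does not prove. The function names are made up.

```python
import os


def read_int(path):
    """Read a single integer from a sysfs-style attribute file."""
    with open(path) as attr:
        return int(attr.read().strip())


def mismatched_limits(sys_block="/sys/block"):
    """List (device, limit, slave, slave_limit) tuples where a stacked
    device allows larger requests than one of its slaves."""
    problems = []
    for dev in sorted(os.listdir(sys_block)):
        slaves_dir = os.path.join(sys_block, dev, "slaves")
        if not os.path.isdir(slaves_dir):
            continue
        try:
            top = read_int(os.path.join(sys_block, dev, "queue",
                                        "max_sectors_kb"))
        except OSError:
            continue  # device without a request queue attribute
        for slave in sorted(os.listdir(slaves_dir)):
            try:
                low = read_int(os.path.join(slaves_dir, slave, "queue",
                                            "max_sectors_kb"))
            except OSError:
                continue  # e.g. a partition, which has no queue directory
            if top > low:
                problems.append((dev, top, slave, low))
    return problems
```

Calling `mismatched_limits()` with no arguments inspects the running system; the `sys_block` parameter exists only so the function can be pointed at a test directory.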
bad time after boot (read from absent hardware?): "Time: 165:165:165 Date: 165/165/65"
Hi!

Some time ago I discovered strange output in boot messages, just as if the kernel trusts junk from hardware that is not present, like the RTC in a paravirtualized Xen guest (the guest has no /dev/rtc*). The message says:

<6>[0.123524] Time: 165:165:165 Date: 165/165/65

Obviously, if there were some validity check, this wouldn't pass, so I guess there is none!

In Xen's message buffer (hypervisor) I only see this error (that seems unrelated):

(XEN) mm.c:833:d6 Non-privileged (6) attempt to map I/O space 00f0

According to my source code the print originates from read_magic_time() in drivers/base/power/trace.c.

I'm running kernel 3.0.74-0.6.8-xen (SLES11 SP2) on x86_64.

Regards,
Ulrich
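The number 165 is what a BCD decode produces when every register read returns 0xFF, which is a typical result of reading an I/O port with no device behind it. The sketch below shows the decode and the kind of range check the message asks for. That the guest's RTC reads return 0xFF and that the year is printed modulo 100 are assumptions that fit the output, not something confirmed in the report; the function names are illustrative.

```python
def bcd2bin(value):
    """Decode one packed-BCD byte (low nibble = ones, high nibble = tens)."""
    return (value & 0x0F) + (value >> 4) * 10


def plausible_rtc_time(hour, minute, second, day, month):
    """Range check that would reject junk read from absent hardware."""
    return (0 <= hour <= 23 and 0 <= minute <= 59 and 0 <= second <= 59
            and 1 <= day <= 31 and 1 <= month <= 12)


raw = 0xFF                  # assumed value returned for every register
field = bcd2bin(raw)        # 15 + 15 * 10
print(field, field % 100)   # value of each field, and a two-digit year
print(plausible_rtc_time(field, field, field, field, field))
```

A check of this kind before printing would be enough to suppress the bogus line.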
Suggestion for improving kernel messages on ext3-mount for consistency
Hi!

I have a rather trivial suggestion for making the kernel messages for ext3 mounts more consistent and useful:

Most messages for ext3 mounting include the device, like:

kernel: [ 823.233892] EXT3-fs (dm-7): using internal journal
kernel: [ 823.233899] EXT3-fs (dm-7): mounted filesystem with ordered data mode

However, some messages do not include a device, even though they seem device specific. For example:

kernel: [ 823.210989] EXT3-fs: barriers not enabled
kernel: [ 823.233218] kjournald starting. Commit interval 15 seconds

This was observed in the current SLES11 SP2 kernel (3.0.58-0.6.6). I haven't looked at the sources, but it looks like an easy change...

BTW: The kjournald threads are also anonymous in the process list (while xfs, for example, names them):

# ps ax |grep journ
418 ? S 0:00 [kjournald]
1070 ? S 0:00 [kjournald]
1071 ? S 0:00 [kjournald]
1072 ? S 0:00 [kjournald]
1073 ? S 0:00 [kjournald]
1074 ? S 0:00 [kjournald]
1075 ? S 0:00 [kjournald]
5461 ? S 0:00 [kjournald]
5499 ? S 0:00 [kjournald]
5601 ? S 0:00 [kjournald]
5642 ? S 0:00 [kjournald]
5648 ? S 0:00 [kjournald]
5653 ? S 0:00 [kjournald]
5661 ? S 0:00 [kjournald]
5873 tty1 S+ 0:00 grep journ
# ps ax |grep xfs
5506 ? S< 0:00 [xfs_mru_cache]
5507 ? S< 0:00 [xfslogd]
5508 ? S< 0:00 [xfsdatad]
5509 ? S< 0:00 [xfsconvertd]
5510 ? S 0:00 [xfsbufd/dm-10]
5511 ? S 0:00 [xfsaild/dm-10]
5518 ? S 0:00 [xfsbufd/dm-11]
5519 ? S 0:00 [xfsaild/dm-11]
5560 ? S 0:00 [xfsbufd/dm-12]
5561 ? S 0:00 [xfsaild/dm-12]
5593 ? S 0:00 [xfsbufd/dm-13]
5594 ? S 0:00 [xfsaild/dm-13]
5875 tty1 S+ 0:00 grep xfs

Regards,
Ulrich
Q: diskstats for MD-RAID
Hello!

I have a question based on the SLES11 SP1 kernel (2.6.32.59-0.3-default): In /proc/diskstats the last four values seem to be zero for md devices. So "%util", "await", and "svctm" from "sar" are always reported as zero. Is this a bug or a feature?

I'm tracing a fairness problem resulting from an I/O bottleneck similar to the one described in kernel bugzilla #12309. If the kernel has about 80GB of dirty buffers (yes: 80GB), reads using the same I/O channel seem to starve.

The scenario is like this: An FC-SAN disk system with two different types of disks is used to copy from the faster disks to the slower disks using "cp". The files are some ten GB in size (Oracle database). After several minutes (while the "cp" is still running), unrelated processes accessing different disk devices through the same I/O channel suffer from bad response times.

I guess the kernel does not know about the relationship between different disk devices connected through one I/O channel: If the kernel tries to keep each device busy (specifically by flushing one disk's dirty buffers to free up buffers), it really reduces the I/O rate of the other disks. Besides that, some layers combine 8-sector requests into something like 600-sector requests, which probably also needs additional buffers and hurts the response time.

The complete I/O stack is: FC-SAN, multipath (RR), MD-RAID1, LVM, ext3.

When replying, please keep me in CC: as I'm not subscribed to the list.

Regards,
Ulrich
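To make "the last four values" concrete, here is a small sketch that parses one /proc/diskstats line, assuming the layout of three identifying columns followed by eleven counters as described in the kernel's Documentation/iostats.txt. In that layout the last four counters are: milliseconds spent writing, I/Os currently in progress, milliseconds spent doing I/O, and the weighted milliseconds spent doing I/O. As far as I know, %util is derived from the third of these, so a constant zero there yields 0% utilisation. The names struct diskstat, parse_diskstats_line() and util_percent() are made up for this example.

```c
#include <stdio.h>

struct diskstat {
    unsigned major, minor;
    char name[32];
    unsigned long long f[11];   /* f[0] = field 1 ... f[10] = field 11 */
};

/* Parse one line of /proc/diskstats; returns 0 on success, -1 otherwise. */
int parse_diskstats_line(const char *line, struct diskstat *d)
{
    int n = sscanf(line,
                   "%u %u %31s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &d->major, &d->minor, d->name,
                   &d->f[0], &d->f[1], &d->f[2], &d->f[3], &d->f[4], &d->f[5],
                   &d->f[6], &d->f[7], &d->f[8], &d->f[9], &d->f[10]);
    return n == 14 ? 0 : -1;
}

/* Utilisation in percent from two samples of field 10 (ms spent doing I/O). */
double util_percent(unsigned long long busy_ms_before,
                    unsigned long long busy_ms_after,
                    unsigned long long interval_ms)
{
    if (interval_ms == 0)
        return 0.0;
    return 100.0 * (double)(busy_ms_after - busy_ms_before) / (double)interval_ms;
}
```

Sampling the md device and one of its component devices with this would show whether only the md layer leaves the counters at zero.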
Q: Seeing the microcode revision in /proc/cpuinfo
Hi!

After several reboots due to memory errors, following the excellent power saving of Linux on an HP DL380G7 with Intel Xeon 5650 processors (all in one memory bank), I found out about the errata "BD104" and "BD123". The former should be fixed in microcode revision "15H".

Now I wonder what microcode revision my CPUs currently have. /proc/cpuinfo doesn't show that, and the microcode update message is a bit cryptic:

kernel: [ 44.422912] microcode: CPU23 sig=0x206c2, pf=0x1, revision=0x14

Does that mean the revision is 0x14 BEFORE or AFTER the microcode update?

Wouldn't you agree that seeing the microcode revision in /proc/cpuinfo would be nice? For those CPUs lacking the feature one could hard-wire the value "none" (which would also be "kind of true")...

Regards,
Ulrich
(not subscribed, so please CC: your replies to me)
Antw: Re: Q: Seeing the microcode revision in /proc/cpuinfo
Hi Borislav,

my edge is probably not as bleeding as yours ;-) I don't see "microcode" in 3.0.34-0.7-default for an AMD Opteron, nor in 2.6.32.59-0.3-default for the Intel Xeon. Both are SLES11 kernels on x86_64. The first one is the latest you can get for SLES11 SP2. In openSUSE 12.1 (kernel 3.1.10) it's also still missing.

Anyway, it's nice to see that others also thought this feature would be useful.

Thanks & best regards,
Ulrich

>>> Borislav Petkov wrote on 14.08.2012 at 15:12 in message <20120814131211.ga25...@x1.osrc.amd.com>:
> On Tue, Aug 14, 2012 at 02:35:40PM +0200, Ulrich Windl wrote:
> > Hi!
> >
> > After several reboots due to memory errors after excellent power-saving of
> > Linux on a HP DL380G7 with Intel Xeon 5650 processors (all in on memory
> > bank), I found out the errate "BD104" and "BD123". The former should be fixed
> > in a microcode revision "15H".
> >
> > Now I wonder what microcode revision my CPUs currently have. /proc/cpuinfo
> > doesn't show that, and the microcode update is a bit cryptic:
> >
> > kernel: [ 44.422912] microcode: CPU23 sig=0x206c2, pf=0x1, revision=0x14
> >
> > Does that mean the revision is 0x14 BEFORE or AFTER the microcode update?
> >
> > Wouldn't you agree that seeing the microcode revision in /proc/cpuinfo
> > would be nice?
>
> Well, you must be using an old-ish kernel because the microcode revision
> infact *is* in /proc/cpuinfo:
>
> processor : 1
> ...
> stepping: 0
> microcode : 0x528
>
> This is on 3.6-rc1 and that functionality is upstream since 3.2.
>
> HTH.
>
> --
> Regards/Gruss,
> Boris.
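On kernels that do expose the field (3.2 and later, according to the reply quoted above), the revision can be picked out of /proc/cpuinfo with a few lines of C. This is only a sketch: it assumes the line looks like "microcode : 0x528" as in the quoted output, and parse_microcode_line() is a made-up helper name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Inspect one line of /proc/cpuinfo. Returns 1 and stores the revision in
 * *rev if the line is a "microcode" line, 0 otherwise.
 */
int parse_microcode_line(const char *line, unsigned long *rev)
{
    const char *colon;

    if (strncmp(line, "microcode", 9) != 0)
        return 0;
    colon = strchr(line, ':');
    if (colon == NULL)
        return 0;
    /* %lx accepts an optional "0x" prefix. */
    return sscanf(colon + 1, " %lx", rev) == 1;
}
```

A caller would read /proc/cpuinfo line by line with fgets() and print the first match per processor; on older kernels no line matches, which corresponds to the "none" case suggested in the original question.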
Antw: Re: /sys and access(2): Correctly implemented?
>>> Ryan Mallon wrote on 09.07.2012 at 01:24 in message <4ffa16b6.9050...@gmail.com>:
> On 06/07/12 16:27, Ulrich Windl wrote:
> > Hi!
> >
> > Recently I found a problem with the command (kernel 3.0.34-0.7-default from
> > SLES 11 SP2, run as root):
> > test -r "$file" && cat "$file"
> > emitting "Permission denied"
> >
> > Investigating, I found that "test" actually uses "access()" to check for
> > permissions. Unfortunately there are some files in /sys that have "write-only"
> > permission bits set (e.g. /sys/devices/system/cpu/probe).
> >
> > ~ # ll /sys/devices/system/cpu/probe
> > --w--- 1 root root 4096 Jun 29 12:43 /sys/devices/system/cpu/probe
> > ~ # F=/sys/devices/system/cpu/probe
> > ~ # test "$F" && cat "$F"
> > cat: /sys/devices/system/cpu/probe: Permission denied
>
> Looks like you have a typo here, I think you wanted "test -r $F", not
> "test $F", the latter will just evaluate "$F" as an expression which
> will be true, and so you get the permission denied error running cat.

Hi!

You are right: It's a typo, but only in the message; the actual test was done correctly, and the outcome is quite the same.

> Using "test -r $F" on a write-only sysfs file correctly returns false on
> my machine (Ubuntu 10.04.4 LTS/2.6.32-41-generic).

Not here, unfortunately:

# ll /sys/devices/system/cpu/probe
--w--- 1 root root 4096 Jul 2 11:52 /sys/devices/system/cpu/probe
# F=/sys/devices/system/cpu/probe
# test -r "$F" && cat "$F"
cat: /sys/devices/system/cpu/probe: Permission denied
# uname -a
Linux h07 2.6.32.59-0.3-default #1 SMP 2012-04-27 11:14:44 +0200 x86_64 x86_64 x86_64 GNU/Linux

Regards,
Ulrich
Re: Antw: Re: /sys and access(2): Correctly implemented?
Hi!

Still the problem seems to be related to sysfs:

# cd /tmp
# touch testfile
# chmod u=w,go= testfile
# F=/tmp/testfile
# test -r "$F" && cat "$F"

So it seems access(2) works correctly for root and "normal" filesystems. That's why I came up with the issue here.

Regards,
Ulrich

>>> Ryan Mallon wrote on 09.07.2012 at 09:22 in message <4ffa86c5.7090...@gmail.com>:
> On 09/07/12 16:23, Ulrich Windl wrote:
> >>>> Ryan Mallon wrote on 09.07.2012 at 01:24 in message
> >>>> <4ffa16b6.9050...@gmail.com>:
> >> On 06/07/12 16:27, Ulrich Windl wrote:
> >>> Hi!
> >>>
> >>> Recently I found a problem with the command (kernel 3.0.34-0.7-default
> >>> from SLES 11 SP2, run as root):
> >>> test -r "$file" && cat "$file"
> >>> emitting "Permission denied"
> >>>
> >>> Investigating, I found that "test" actually uses "access()" to check for
> >>> permissions. Unfortunately there are some files in /sys that have
> >>> "write-only" permission bits set (e.g. /sys/devices/system/cpu/probe).
> >>>
> >>> ~ # ll /sys/devices/system/cpu/probe
> >>> --w--- 1 root root 4096 Jun 29 12:43 /sys/devices/system/cpu/probe
> >>> ~ # F=/sys/devices/system/cpu/probe
> >>> ~ # test "$F" && cat "$F"
> >>> cat: /sys/devices/system/cpu/probe: Permission denied
> >>
> >> Looks like you have a typo here, I think you wanted "test -r $F", not
> >> "test $F", the latter will just evaluate "$F" as an expression which
> >> will be true, and so you get the permission denied error running cat.
> >
> > Hi!
> >
> > You are right: It's a typo, but only in the message; the actual test was
> > done correctly, and the outcome is quite the same.
> >
> >> Using "test -r $F" on a write-only sysfs file correctly returns false on
> >> my machine (Ubuntu 10.04.4 LTS/2.6.32-41-generic).
> >
> > Not here, unfortunately:
>
> Oops, I missed the bit about you running as root. I get the same results
> running as root on my machine as you, both for sysfs and regular files.
>
> It appears that access(2) as the super-user is might be implementation
> defined, see:
>
> http://pubs.opengroup.org/onlinepubs/95399/functions/access.html
> http://lists.gnu.org/archive/html/bug-bash/2010-07/msg00071.html
>
> However, I can't find any concrete information on it for Linux, and the
> manpage doesn't mention anything other the the X_OK bit.
>
> ~Ryan
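Whatever access(2) answers for root, the result that matters in the end is whether open(2) succeeds. A sketch of checking exactly that, instead of asking first with "test -r" and reading afterwards, follows. It assumes (without having verified it for the kernels mentioned in this thread) that sysfs applies its own check of the mode bits at open time, which would explain why cat fails although access() reports the file as readable. try_open_for_read() is an illustrative name.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 0 if the file can really be opened for reading, otherwise the
 * errno value reported by open(2), e.g. EACCES or ENOENT.
 */
int try_open_for_read(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

In a script the equivalent is to attempt the read and handle the error, rather than relying on the answer of "test -r".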
Antw: [PATCH 0/5] kfifo cleanup and log based kfifo API
>>> Yuanhan Liu wrote on 08.01.2013 at 15:57 in message <1357657073-27352-1-git-send-email-yuanhan@linux.intel.com>:

[...]
> My proposal is to replace kfifo_init with kfifo_alloc, where it
> allocate buffer and maintain fifo size inside kfifo. Then we can
> remove buggy kfifo_init.
[...]

Spontaneously I feel that emitting a critical message if the requested size is not a power of two would be a good idea. In that case, rounding up to the next power of two instead of rounding down also seems not too stupid ;-)

Sorry, I'm not deeply into recent kernel development.

Regards,
Ulrich
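To illustrate the difference between the two rounding directions, here is a plain C sketch. The function names are made up for this example and are not the kfifo API; the kernel has its own helpers for this purpose in <linux/log2.h>.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if n is a power of two (zero is not). */
bool is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Largest power of two not above n; returns 0 for n == 0. */
size_t round_down_pow2(size_t n)
{
    size_t p = 1;

    if (n == 0)
        return 0;
    while (p <= n / 2)
        p *= 2;
    return p;
}

/* Smallest power of two not below n (n >= 1); returns 0 on overflow. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n) {
        if (p > SIZE_MAX / 2)
            return 0;
        p *= 2;
    }
    return p;
}
```

For a requested size of 5000 bytes, rounding down gives 4096 (less than the caller asked for), while rounding up gives 8192, so the caller never ends up with a smaller buffer than requested.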
pthreads & gdb: zombie threads?
Hello,

I'm having a strange problem debugging a pthreads application in 2.2.18 (as per SuSE 7.1): gdb says the program terminated normally after having started two or three LWPs. I can exit gdb then, and I find (ps -ax) one zombie thread and two or three other threads.

Is it more likely a kernel problem, a library problem, or a gdb problem? Naively I thought that when the process exits, all threads would die...

Ulrich
P.S. Not subscribed here
Re: announce: PPSkit patch for Linux 2.4.2 (pre6)
Hi, Cycle Counters,

Linux currently tries to synchronize TSCs for consistent time in SMP systems. One would not believe what combinations of hardware are tried, especially for precision timing. Here's a short reply to my follow-up question about a complaint (the kernel is reporting negative time warps).

Like any problem, it can be solved with some overhead, but should it be done? Replies to me too, as I'm not subscribed, please.

Ulrich

On 9 Apr 2001, at 18:39, Andreas Bussjaeger wrote:
> > from the current CPU. All these values seem highly suspect. However a
> > few more values would be helpful to diagnose the situation.
>
> I have to tell you that I have one 533 MHz Celeron and one 433 MHz
> Celeron.
>
> > indicate that the CPUs are 968ms apart (each CPU half from the average).
2.2.19: config help text about "TCO timer"
Hi,

I know I'm late, but Configure.help in 2.2.19 says:

..."The TCO (Total Cost of Ownership) timer is a watchdog"...

I know that meaning of TCO, but I can't believe it applies to a mainboard component. Should the user then throw the PC away, or what? Or is it safer to reboot frequently? What has this to do with costs?

Confused,
Ulrich
Re: No 100 HZ timer!
IMHO complying with POSIX is doable. Probably not in the way many of the RT freaks expect, but doable. I've been tuning the nanoseconds for a while now...

Ulrich

On 17 Apr 2001, at 11:53, george anzinger wrote:
> I was thinking that it might be good to remove the POSIX API for the
> kernel and allow a somewhat simplified interface. For example, the user
patch-proposal: extended adjtime()
Hello, someone found out that in Linux adjtime()'s correction is limited to something like 2000s (signed 32bit microseconds for i386). This is not a true problem, but for those who desperately need/want it, I have a patch proposal (incomplete, but essential) to implement the full range (maybe even more). The patch tries to keep binary compatibility, too. Opinions? Regards, Ulrich --- kernel/243time.cMon Apr 16 20:14:27 2001 +++ kernel/xxxtime.cMon Apr 16 20:41:15 2001 @@ -100,7 +100,8 @@ write_lock_irq(&xtime_lock); xtime.tv_sec = value; xtime.tv_usec = 0; - time_adjust = 0;/* stop active adjtime() */ + time_adjust.tv_sec = time_adjust.tv_usec = 0; + /* stop active adjtime() */ time_status |= STA_UNSYNC; time_maxerror = NTP_PHASE_LIMIT; time_esterror = NTP_PHASE_LIMIT; @@ -225,7 +226,8 @@ */ int do_adjtimex(struct timex *txc) { -long ltemp, mtemp, save_adjust; +long ltemp, mtemp; + struct timeval save_adjust; int result; /* In order to modify anything, you gotta be super-user! */ @@ -295,7 +297,31 @@ if (txc->modes & ADJ_OFFSET) { /* values checked earlier */ if (txc->modes == ADJ_OFFSET_SINGLESHOT) { /* adjtime() is independent from ntp_adjtime() */ - time_adjust = txc->offset; + + /* Try to extend the range for plain old adjtime() +* to multiple seconds without breaking binary +* compatibility. A perfect solution is not +* possible, but this one has a high probability +* for success. The true solution is a syscall of +* its own. +* The offset for ADJ_OFFSET_SINGLESHOT is stored in +* txc->time (struct timeval) now. To avoid using +* garbage vaues, it's required to copy +* `txc->time.tv_usec' also into `txc->offset'. Just +* to be sure, we also require the magic word +* EXTENDED_ADJTIME_MAGIC to be written to `txc->status' +* (it's a value not possible before, and it's +* overwritten after each call). 
+*/ +#define EXTENDED_ADJTIME_MAGIC (0x + ('U' << 24) + ('W' << 16)) + /* old compatible interface */ + time_adjust.tv_usec = txc->offset; + + if (txc->offset == txc->time.tv_usec && + txc->status == EXTENDED_ADJTIME_MAGIC) { + /* extended part */ + time_adjust.tv_sec = txc->time.tv_sec; + } } else if ( time_status & (STA_PLL | STA_PPSTIME) ) { ltemp = (time_status & (STA_PPSTIME | STA_PPSSIGNAL)) == @@ -375,9 +401,11 @@ /* p. 24, (d) */ result = TIME_ERROR; - if ((txc->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT) - txc->offset= save_adjust; - else { + if ((txc->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT) { + txc->offset= save_adjust.tv_usec; + if (txc->status == EXTENDED_ADJTIME_MAGIC) + txc->time = save_adjust; + } else { if (time_offset < 0) txc->offset = -(-time_offset >> SHIFT_UPDATE); else --- kernel/243timer.c Mon Apr 16 20:34:29 2001 +++ kernel/xxxtimer.c Mon Apr 16 21:01:29 2001 @@ -58,8 +58,7 @@ long time_adj; /* tick adjust (scaled 1 / HZ) */ long time_reftime; /* time at last adjustment (s) */ -long time_adjust; -long time_adjust_step; +struct timeval time_adjust;/* remaining time adjustment */ unsigned long event; @@ -461,8 +460,26 @@ /* in the NTP reference this is called "hardclock()" */ static void update_wall_time_one_tick(void) { - if ( (time_adjust_step = time_adjust) != 0 ) { - /* We are doing an adjtime thing. + long time_adjust_step; + + if ((time_adjust.tv_sec | time_adjust.tv_usec) != 0) { + time_adjust_step = time_adjust.tv_usec; + if (time_adjust_step > 0) { + /* if we run out of microseconds, but have more seconds, +* borrow another second +*/ + if (time_adjust_step < tickadj && time_adjust.tv_sec > 0) { + time_adjust_step = time_adjust.tv_usec += 100; + --time_adjust.tv_sec; + } + } else { + if (time_adjust_step > -tickadj && time_adjust.tv_sec < 0) { + time_adjust_step = time_adjust.tv_usec -= 100; + ++time_adjust.tv_sec; + } + } + + /* We gave to complete the adjtim
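As a side note on the limit of "something like 2000s" mentioned at the top of this proposal: a signed 32-bit count of microseconds can hold at most 2147483647 us, i.e. a little more than 2147 seconds. The sketch below shows that bound and one possible way of splitting a larger offset into seconds and microseconds of equal sign, in the spirit of the struct timeval used by the patch. The helper names and the sign convention are assumptions made for illustration, not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000LL

/* True if the offset fits into a signed 32-bit microsecond counter. */
bool fits_in_32bit_usec(int64_t offset_usec)
{
    return offset_usec >= INT32_MIN && offset_usec <= INT32_MAX;
}

/*
 * Split an offset into whole seconds and remaining microseconds. C integer
 * division truncates toward zero, so both parts carry the same sign.
 */
void split_offset(int64_t offset_usec, int64_t *sec, int64_t *usec)
{
    *sec = offset_usec / USEC_PER_SEC;
    *usec = offset_usec % USEC_PER_SEC;
}
```

An offset of one hour (3600000000 us) does not fit into 32 bits, but splits cleanly into 3600 seconds and 0 microseconds.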
2.2.18: static rtc_lock in nvram.c
Hi,

browsing the sources for some problem I wondered why nvram.c uses a static spinlock named rtc_lock, hiding the global one.

Regards,
Ulrich
2.2.18/ext2: special file corruption?
Hi,

I had an interesting effect: Due to NVdriver I had a lot of system freezes, and I had to reboot. Using e2fsck 1.19a (SuSE 7.1) I got the message that one specific "Special (device/socket/fifo) inode .. has non-zero size. FIXED."

Interestingly I got the message for every reboot. So either the kernel corrupts the very same inode every time, or e2fsck does not really fix it, or the error simply doesn't exist. I think the kernel doesn't temporarily set the size to non-zero, so this seems strange.

Regards,
Ulrich
Re: 2.2.18: static rtc_lock in nvram.c
On 26 Feb 2001, at 9:33, Alan Cox wrote:
> > browsing the sources for some problem I wondered why nvram.c uses a
> > static spinlock named rtc_lock, hiding the global one.
>
> It only does that for the atari, where the driver isnt used by other things

Hmm... are there different nvram.c drivers? I noticed that SuSE 7.1 loads that driver on i386. Also, the header comment doesn't look a lot like Atari only:

 * This driver allows you to access the contents of the non-volatile memory in
 * the mc146818rtc.h real-time clock. This chip is built into all PCs and into
 * many Atari machines. In the former it's called "CMOS-RAM", in the latter
 * "NVRAM" (NV stands for non-volatile).

Regards,
Ulrich
Re: 2.2.18/ext2: special file corruption?
On 26 Feb 2001, at 10:48, Andreas Dilger wrote:
> Ulrich Windl writes:
> > I had an interesting effect: Due to NVdriver I had a lot of system
> > freezes, and I had to reboot. Using e2fsck 1.19a (SuSE 7.1) I got the
> > message that one specific "Special (device/socket/fifo) inode .. has
> > non-zero size. FIXED."
> >
> > Interestingly I got the message for every reboot. So either the kernel
> > corrupts the very same inode every time, or e2fsck does not really fix
> > it, or the error simply doesn't exist. I think the kernel doesn't
> > temporarily set the size to non-zero, so this seems strange.
>
> It is strange that it thinks ".." is a special inode. Maybe e2fsck is

Of course NOT: ``..'' is meta syntax for an ellipsis. I couldn't remember the inode number.

> fixing the wrong problem (i.e. truncating the directory ".."), and it
> later fixes the zero-length directory... Could you try two things:
>
> 1) unmount the filesystem and run e2fsck on the broken filesystem 1 or 2
>    times, to see if e2fsck is fixing the problem or not.

I did that, and it actually fixed the very same problem again. On a second run it was fixed. So either the "-a -t ext2" prevents the changes from being written back if the only problem was that special file, or there is some corruption undetected by fsck that in turn causes the kernel to corrupt the filesystem again and again, or I don't know. Here's the log from my tries:

elf:~ # fsck -f /dev/sda6
Parallelizing fsck version 1.19a (13-Jul-2000)
e2fsck 1.19, 13-Jul-2000 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Special (device/socket/fifo) inode 16600 has non-zero size. Fix? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sda6: * FILE SYSTEM WAS MODIFIED *
/dev/sda6: 35542/86400 files (0.9% non-contiguous), 124965/172690 blocks
elf:~ # fsck -f /dev/sda6
Parallelizing fsck version 1.19a (13-Jul-2000)
e2fsck 1.19, 13-Jul-2000 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda6: 35542/86400 files (0.9% non-contiguous), 124965/172690 blocks

> 2) If it is fixing the problem you need to wait until the next time you have
>    a system crash, start in single user mode. If it is NOT fixing the problem
>    you can do this right away. Run "e2fsck -n" to see which inode number is
>    corrupt (the -n option means e2fsck will not fix the filesystem), and then
>    run "debugfs /dev/X", type "dump " and "ncheck inode_number"
>    at the prompt (note you NEED the <> around the inode number for dump).
>    Send the output.

I'll keep your message. Maybe you'll hear from me again.

Ulrich
2.4.0test12: problems timing events
Hi,

I tried to time events inside the kernel in 2.4.0test12: Basically the same code works fine in 2.2.18 with about 1us jitter. However, in 2.4.0test12 the jitter is around 600ms!

What I did is this: I modified the interrupt routine of the serial driver to get a precision time-stamp via do_gettimeofday(). So I guess either interrupts are delayed significantly from time to time, or the time routine has been changed so that it is no longer useful within interrupt routines.

If anybody can enlighten me on this, I'd be happy. I'm not subscribed to linux-kernel, so please CC: me.

Regards,
Ulrich
suggest: diff-2.4.0-test12_to_2.4.0
I thought I'd find a diff between 2.4.0test12 (last test release) and the final 2.4.0 release, but did not. Wouldn't it be (have been) a good idea?

Regards,
Ulrich
Re: suggest: diff-2.4.0-test12_to_2.4.0
On 8 Jan 2001, at 14:16, Andreas Jaeger wrote:
> >>>>> Ulrich Windl writes:
>
> > I thought I'd find a diff between 2.4.0test12 (last test release) to
> > the final 2.4.0 release, but did not. Wouldn't it be (have been) a good
> > idea?
>
> Apply:
> patch-2.4.0-prerelease.bz2 and then prerelease-to-final.bz2 to test12
> and you get 2.4.0 final.
>
> You'll find both in ftp.*.kernel.org/...kernel/v2.4/test-kernels/

And both fit on a 1.44MB floppy. Great! Thanks a lot.

Ulrich
2.4: header file confusion (interrupts)
Inspecting some code I found out that in 2.4.0test12

- request_irq() is declared in sched.h, and not in interrupt.h,
- SA_SHIRQ is declared in asm/signal.h, and not in interrupt.h.

Isn't that a bit confusing? Maybe for 2.5 let's re-sort some things to clean up dependencies...

Regards,
Ulrich
some issues for 2.4.0
Hello,

I have some issues with Linux-2.4.0. During boot the (slightly modified, see later) kernel says:

<4>Linux version 2.4.0-NANO (root@elf) (gcc version 2.95.2 19991024 (release)) #1 Mon Jan 8 22:04:48 MET 2001
[...]
<4>PCI: PCI BIOS revision 2.10 entry at 0xfb280, last bus=1
<4>PCI: Using configuration type 1
<4>PCI: Probing PCI hardware
<4>Unknown bridge resource 0: assuming transparent
??? What does the message above mean?
<4>PCI: Using IRQ router VIA [1106/0596] at 00:07.0
<6>Activating ISA DMA hang workarounds.

The DMI reports some funny values for my low-price board (the vendor did not ship a DMI utility as Asus did for my old one):

<6>DMI 2.2 present.
<6>39 structures occupying 1055 bytes.
<6>DMI table at 0x000F0800.
<4>BIOS Vendor: Award Software International, Inc.
<4>BIOS Version: 4.51 PG
<4>BIOS Release: 06/19/00
<4>System Vendor: VIA Technologies, Inc..
<4>Product Name: VT82C693BX.
<4>Version .
<4>Serial Number .
??? Aren't they (above two lines) funny?
<4>Board Vendor: Shuttle Inc..
<4>Board Name: HOT-AV11 693-596-W977.
<4>Board Version: 2A6LGH2A.
[...]

As reported for 2.4.0-test12 there seems to be a problem timing events within an interrupt (e.g. serial): The jitter is quite high. I'm timing pulses generated by a GPS clock every second to estimate the clock error. I'll show the first few updates. Let me show some facts first, and then state a suspicion. Each pair is seconds:nanoseconds of a captured timestamp. My pulse is roughly 200ms+800ms:

979070631:649924277 979070632:49920873 979070633:649922851 979070634:49921630 979070635:649923125 979070636:49920800
??? Oops! Time jumped back!
979070633:354954544 979070633:754953483 979070635:354954708 979070635:754954209 979070637:354955615 979070637:754953649 979070639:354955938 979070639:754953328
??? Again!
979070637:59988575 979070637:459985921 979070639:59986981 979070639:459985930 979070641:59986854 979070641:459985908 979070643:59987006 979070643:459987393 979070645:59987262 979070645:136458168 979070642:765020874 979070643:165018428 979070644:765019464 979070645:165018406 979070646:765019339 979070647:165018295 979070648:765019475 979070649:165018274 979070646:470052764 979070646:870050956 979070648:470053050 979070648:870051264 979070650:470052609 979070650:870051691 979070652:470052047 979070652:870050772 979070650:175085546 979070650:575083574 979070652:175084550 979070652:575083463 979070654:175085050 979070654:575084190 979070656:175084787 979070656:575083420 979070658:175084652 979070658:251540985 979070655:880118226 979070656:280115991 979070657:880118654 979070658:280116032 979070659:880118844 979070660:280115978 979070661:880123413 979070662:280115897 979070659:585150248 979070659:985148519 979070661:585149737 979070661:985148498 979070663:585150396 979070663:985148476 979070665:585150361 979070665:985148365 979070663:290189552 979070663:690181048 979070665:290182834 979070665:690181774 979070667:290182445 979070667:690181783 979070669:290182466 979070669:690181672 979070671:290182951 I either think that some overflow happens, or that some spinlock is really busy. 
You can find the patch used in ftp://ftp.kernel.org/pub/linux/daemons/ntp/PPS/PPS-2.4.0-pre3.tar.bz2

My CPU is identified as:

elf:/tmp # cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 6
model name : Celeron (Mendocino)
stepping: 5
cpu MHz : 501.149
cache size : 128 KB
fdiv_bug: no
hlt_bug : no
sep_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips: 999.42

Regards,
Ulrich
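The backward jumps in the timestamp list of this report can also be found mechanically. The following sketch compares consecutive seconds:nanoseconds pairs and counts every place where time goes backwards; struct stamp and the function names are made up for this example.

```c
#include <stddef.h>

struct stamp {
    long sec;
    long nsec;
};

/* Returns a negative, zero, or positive value if a is before, equal to, or after b. */
int stamp_cmp(const struct stamp *a, const struct stamp *b)
{
    if (a->sec != b->sec)
        return a->sec < b->sec ? -1 : 1;
    if (a->nsec != b->nsec)
        return a->nsec < b->nsec ? -1 : 1;
    return 0;
}

/* Count the positions where a timestamp is earlier than its predecessor. */
size_t count_backward_jumps(const struct stamp *s, size_t n)
{
    size_t i, jumps = 0;

    for (i = 1; i < n; i++)
        if (stamp_cmp(&s[i], &s[i - 1]) < 0)
            jumps++;
    return jumps;
}
```

Fed with the samples around the first "Time jumped back!" remark, it reports exactly one jump, from 979070636:49920800 back to 979070633:354954544.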
2.2.18: writing an R/O floppy
Hi,

I don't know if it's possible to make fd a read-only device if the inserted medium is write-protected, but I had a strange problem:

I had inserted a write-protected floppy and accessed it via autofs as vfat in 2.2.18. It worked. Some time later the mount had expired (and I'm not sure whether I had changed floppies in the meantime). When I tried an "mdel a:*", it terminated without any message, but a later "mdir a:" showed all the files still there. The kernel had unsuccessfully tried to write to the floppy, however.

It's a bit hard to reproduce that, but I would guess that the disc-change or write-protect status was not updated in some case. Maybe it rings a bell for one of you; if not, never mind.

Regards,
Ulrich
patch: 2.4.0/2.5.0: nanoseconds time resolution
Hello,

I have spent some time making a patch against the Linux kernel to switch to nanoseconds time resolution, together with several time-related updates. I really need support for architectures other than i386, specifically a routine that has a very fine and accurate time resolution (just using ns == 1000*us isn't the best choice).

For the 2.4.0 patch the ia64, sh, mips64, and parisc architectures are not done at all, and the other architectures are either untested or done sub-optimally.

Therefore I put together a simple "hacking document" (see attachment) to guide you when trying to port the code. More text can be found in Documentation/kernel-time.txt after applying the patch, or in the distribution for Linux 2.2 (PPSkit-1.0.2.tar*). So please spend an hour or two to help me out there. I hope I'm not forced to drop the project.

Unless you can convince me not to have a /proc/sys/kernel/time directory, I'd also suggest accepting the patch for /usr/src/linux/include/sysctl.h for the standard kernel. Currently I have allocated "50" for the "time" entry. I'd like to have a stable number for the future.

Regards,
Ulrich Windl

=====
A sketch on what to consider when implementing the new time framework on new architectures (like ia64, mips64, parisc, sh).
=====
(See http://ftp.kernel.org/pub/linux/daemons/ntp/PPS/PPS-2.4.0-pre3.tar.bz2 for an implementation for i386)

* Add new config variables to `config.in' and `defconfig' (CONFIG_NTP, CONFIG_NTP_PPS, CONFIG_NTP_PPS_SERIAL)

* use `' instead of `' or `' to access kernel time.

* The kernel knows how to convert kernel time to CMOS time, don't mess with time zones yourself

* time is kept in nanoseconds. `do_fast_gettimeoffset()' is replaced with `do_exact_nanotime()' that returns nanoseconds passed since occurrence of the last timer interrupt. `do_slow_gettimeoffset()' is replaced with `do_poor_nanotime()' accordingly.

* `do_gettimeofday()' and `do_settimeofday()' are implemented in the architecture-independent module, messing with all the status updates. The common code uses the `do_nanotime()' callback to call the architectures' code (allowing code selection during runtime or boot-up).

* `set_rtc_mmss()' is called `update_rtc()' now, and it sets the complete date and time (not just minutes). A new `ktime_to_rtc()' converts kernel time to broken down time components suitable to write to CMOS RTC. `mktime()' is also architecture-independent now. The new `rtc_to_ktime()' is used after reading the RTC to get kernel time.

* a new `timevar_init()' initializes all the time variables.

* `struct timex' has been changed significantly while trying to preserve binary compatibility as far as possible.

* time routines are in `kernel/time.c' now, and `xtime', the kernel's representation of time, is protected by `rwlock_t xtime_lock'. A new `rtc_runs_localtime' determines if time-zone corrections have to be made for RTC time updates. A new data type `l_fp', a 64bit quantity, is used for some internal time variables (needed by the NTP clock model).

* a new sysctl interface allows controlling of some time variables, most notably the time zone and `rtc_runs_localtime'. While adjusting `time_tick' (the former `tick') is deprecated for NTP applications, it allows fine compensation of systematic clock errors.

* When the kernel time is set, the RTC update procedure is triggered.

* Old routines are implemented using POSIX-alike `do_clock_gettime()' and `do_clock_settime()'. There's also a `do_clock_getres()' that gives quite realistic (not optimistic) estimates.

* `adjtimex()' has been significantly reworked, just as most of the other time-keeping routines.

* Updating the RTC is controlled by new variables: `rtc_update_slave', when non-zero, controls after how many seconds the RTC has to be updated. Internally `last_rtc_update' keeps the time of the last update. Upon update the `rtc_update_slave' is cleared on success.
Re: patch: 2.4.0/2.5.0: nanoseconds time resolution
On 22 Jan 2001, at 22:55, Albert D. Cahalan wrote: > > Therefore I put together a simple "hacking document" (see attachment) > > to guide you when trying to port the code. More text can be found in > > Documentation/kernel-time.txt after the patch, or in the distribution > > for Linux 2.2 (PPSkit-1.0.2.tar*) So please spend an hour or two to > > help me out there. I hope I'm not forced to drop the project. > > URL for the patch? BTW, this is something for the 2.5.xx series. The URL for the patch is on top of the hacking document, thinking that those who don't read it won't need the URL. Yes the patch is intended for 2.5 if you all want it. However it applies to 2.4.0 for those who need it right now. As stated it requires some extra work that can't be done by myself alone. > > > * time is kept in nanoseconds. > > Nice, I'd imagine. Would that be 64-bit nanoseconds since 1970? Compatibility: Just using timespec instead of timeval at the user-level. Seconds are still 32bit on 32 bit machines. > > > `do_fast_gettimeoffset()' is replaced > > with `do_exact_nanotime()' that returns nanoseconds passed since > > occurrence of the last timer interrupt. `do_slow_gettimeoffset()' is > > replaced with `do_poor_nanotime()' accordingly. > > Ugh. Those names are awful. Why would anyone use do_poor_nanotime() > when they could have something better? That's exactly the point: For a i486 you must use the timer's counter register to interpolate between interrupts, but for the Pentium you can use the cycle counter of the CPU. When making a kernel for a distribution, you can't know whether the system will have a Pentium, so the decision is made during boot. (Just as it was before) The old naming put stress on speed of the routines (I guess), while I put stress on the accuracy. So "poor" means "poor accuracy". > > > * Updating the RTC is controlled by new variables: `rtc_update_slave', > > when non-zero, controls after how many seconds the RTC has to be > > updated. 
Internally `last_rtc_update' keeps the time of the last > > update. Upon update the `rtc_update_slave' is cleared on success. > > What about leap seconds on network and non-UNIX filesystems? >:-) You mean to say that a leap second is an implicit time update? I can Implement it without any trouble, if you all can agree that the idea is acceptable. BTW: Same applies for RTCs using local time, and we switch from/to DST: The kernel doesn't have the tables, so you (or cron) must update the /proc/sys/kernel/time/timezone. I'd be glad if these were the only problems you had. ;-) Regards, Ulrich - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
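The boot-time choice between the two interpolation routines discussed in this thread can be pictured with a function pointer. This is only a user-space sketch: `do_nanotime', `do_exact_nanotime' and `do_poor_nanotime' are names from the hacking document, while `select_nanotime()', its `cpu_has_cycle_counter' argument and the stub return values are made up for illustration and are not taken from the patch.

```c
#include <stdbool.h>

/* Stub: stands in for cycle-counter interpolation (good accuracy). */
static unsigned long do_exact_nanotime(void) { return 1234UL; }

/* Stub: stands in for timer-register interpolation (poor accuracy). */
static unsigned long do_poor_nanotime(void) { return 1000UL; }

/* Callback used by the architecture-independent code. */
static unsigned long (*do_nanotime)(void) = do_poor_nanotime;

/* Hypothetical boot-time hook: pick the best routine the CPU supports. */
static void select_nanotime(bool cpu_has_cycle_counter)
{
    do_nanotime = cpu_has_cycle_counter ? do_exact_nanotime
                                        : do_poor_nanotime;
}
```

A distribution kernel would call something like `select_nanotime()' once during boot; afterwards the common code only ever calls `do_nanotime()'.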
2.2.16: How to freeze the kernel
Hello, this is for your interest, amusement, and for "what not to do": I managed to freeze the kernel (2.2.16 from SuSE Linux 7.0) in a way that I could not even switch virtual consoles. Completely silent everything...

It all started when Windows/95 ruined another CD-R while trying to write an image to the media. So I decided to try it with Linux, using the same CD writer. I plugged the device into the so far unused SCSI channel and used the "add-single-device" method to avoid a reboot, and I succeeded:

kgate kernel: scsi singledevice 0 0 4 0
kgate kernel: Vendor: WAITEC Model: WT624 Rev: 7.0F
kgate kernel: Type: CD-ROM ANSI SCSI revision: 0
kgate kernel: Detected scsi CD-ROM sr1 at scsi0, channel 0, id 4, lun 0
kgate kernel: (scsi0:0:4:0) Synchronous at 10.0 Mbyte/sec, offset 15.
kgate kernel: sr1: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray

Then I used "cdrecord-1.8.1" to simulate writing at "speed=8". It worked so far, but there was a warning about possible problems with "simulated fixation", and actually for several minutes nothing happened while the simulated fixation was expected to take place. At some point I hit ^C, returning to the prompt. As the device did not seem to be ready, I thought "remove the device and reconnect", so I did "remove-single-device" (possibly while a command was still "busy"). The remove succeeded, but a second later everything had stopped! Should it be possible to remove a device with busy commands? I guess not...

The last message in the syslog was:

kgate kernel: scsi : aborting command due to timeout : pid 8358, scsi0, channel 0, id 4, lun 0 UNKNOWN(0x5b) 00 02 00 00 00 00 00 00 00

At that point I pressed "RESET", and interestingly the builtin BIOS of the Adaptec 2740 (EISA) hung while trying to detect the device. Only after powering down both the CD writer and the machine (a HP Netserver LD Pro) did the BIOS detect the device again. So I guess something badly hung...

The driver being used was Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.1.31/3.2.4. After that, everything worked fine.

Regards, Ulrich
2.4.0test11: some issues and a possible show stopper
Reading the article in the German computer magazine c't that Linux 2.4 is scheduled for release in December, and that Linus complained that people do not want to test the new kernel, I decided to test it. The hardware was: Spacewalker/Shuttle AV11 (VIA Apollo Pro chipset), Intel Celeron-500 ("boxed"), 128MB PC133 SDRAM (Infineon, no crap), EIDE IBM hard disk (4GB, supporting UDMA 33). [If you need more details, I can provide them]

First the less important issues:

During config I noticed that documentation is missing for CONFIG_INPUT (which is later required for USB) and CONFIG_NLS_UTF8 (which is probably even less clear than 8859-1 or CP850).

Some source files produced an assembler warning about an "Indirect lcall without '*'".

When booting, the kernel said "Unknow bridge resource 0; assuming transparent". I don't know what this means.

When typing "cat /proc/kmsg" I noticed that the process is not interruptible.

Loading the keymap failed, but it seems SuSE Linux 7.0 is not quite 2.4-ready (util-linux, modutils, e2fsprogs too old). I also got "EXT2 check option not supported" and "can't locate module vfat" (the latter probably because of old modutils, however).

During some heavy disk I/O I got the impression that buffer writes are delayed significantly, and that reading can be delayed by several seconds while the kernel is writing back dirty buffers.

Finally I got a "gzip -t" CRC error on the kernel tar archive, which tested without error under 2.2.17. This is the possible show stopper. The syslog messages did not report any problem (hard disk operating in UDMA 33 mode, using a proper cable).

Documentation/sysctl/kernel.txt still is 2.2.10!

After hacking the kernel I got a conflict between and , but it was too late to investigate. (I had spent over 4 hours merging rejected diffs, and I was tired from pressing C-d C-d C-n in Emacs ;-))

Regards, Ulrich
poll: nanoseconds in 2.5?
Hello, maybe some of you know that I patched an early 2.2 kernel (2.1.131 or so) to provide nanoseconds to the customers, i.e. xtime has tv_nsec. The patch is available throughout 2.2 (including 2.2.17). I merged the patch into 2.4test11; it compiles and boots so far. Now I wonder if there's interest in integrating my code into an early 2.5. I will have to clean up some obsolete stuff and put a few things in order first. I will need strong support for the non-i386 architectures however (I only have a Pentium for testing). Interestingly some of my changes are already in 2.4: Moving the time stuff out from kernel/sched.c, joining mktime(), etc. If there is interest, please say so. I could provide an early alpha-quality patch by Monday, maybe even this Friday if someone wants to test it or implement another architecture. (The 2.2 stuff is named PPSkit-1.0.1 and can be found in /pub/linux/daemons/ntp/PPS on most mirrors of quality ;-) Regards, Ulrich
Re: 2.2.16: OOPS 2 & VFS panic
On 30 Aug 2000, at 8:49, Mike Galbraith wrote: > On Wed, 30 Aug 2000, Ulrich Windl wrote: > > > The syslog (2.5kB) with surrounding messages is attached. > > No it's not :) 8-( It happened because of forwarding the bounced message from vger.rutgers.edu. Now it is attached. Sorry. Ulrich > Panic
2.2.16: OOPS 2 & VFS panic
Hello, I had a kernel panic with 2.2.16 yesterday. Because of this rare occasion, I immediately checked my RAM (memtest86), but the RAM is OK, there was no thunderstorm, no handy (mobile phone) nearby, the CPU and RAM are not overclocked, and all chipsets are Genuine Intel. I only have two memory chips from China, but that should be close enough to Taiwan ;-) Maybe it helps for reproduction: After boot the system did a periodic fsck, then I installed a program, thereby changing CDs twice. Immediately after installation the system behaved oddly and the panic came along. And yes, I've been running that kernel on that machine before without problems. Only kswapd seemed unstable in 2.2.16. My machine, a P100, has 64MB RAM. The syslog (2.5kB) with surrounding messages is attached. Regards, Ulrich P.S. The library issue is probably due to SuSE-7.0.
Linux & nanoseconds
Hello, I revised the code that calculates the nanoseconds in Linux, and I thought I'd drop some notes here:

As I found out, there is a systematic error of 167ns per tick, almost 17us per second. This is because of the timer chip that is used to generate the interrupts for "100 Hz". Linux uses the register of the timer chip to interpolate time. The current implementation introduces an error of another 160ns (or thereabouts). I haven't checked, but maybe they just compensate each other.

Finally the current resolution obtained from the TSC (PCC) is limited to 16 cycles (or 160ns for a 100MHz CPU). Linux uses the TSC to interpolate between (within) ticks only. This is for historical reasons and possibly because of APM (ACPI?) varying the CPU speed. The current implementation does not recalibrate for variable CPU speed, so the tick interpolation is condensed towards the start of a tick if the CPU is getting slower. However if the timer chip gets slower at the same rate, the interpolation is fine, but the absolute time is wrong.

I always tried to fix the most urgent problem; there's still potential to improve. I need your experience as well. The current modification (based on PPSkit-0.9.3) is "nanofix.diff.gz", both located on your favourite Linux mirror in pub/linux/daemons/ntp/PPS (or very similar).

Regards, Ulrich
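The 167ns figure can be reproduced with a little arithmetic. The sketch below assumes the usual PC timer input clock of 1193180 Hz and a divisor rounded to the nearest integer; the message states neither, so treat both as assumptions. The function name `tick_error_ns' is made up for this note.

```c
/* Error of one timer tick in nanoseconds, caused by rounding the
 * timer divisor to an integer.  clock_hz is the timer input clock,
 * hz the desired interrupt rate. */
static double tick_error_ns(unsigned long clock_hz, unsigned long hz)
{
    unsigned long latch = (clock_hz + hz / 2) / hz;   /* rounded divisor */
    double actual_ns  = (double)latch * 1e9 / (double)clock_hz;
    double nominal_ns = 1e9 / (double)hz;

    return actual_ns - nominal_ns;
}
```

With 1193180 Hz and 100 Hz the divisor comes out as 11932 instead of the ideal 11931.8, so each tick is about 167.6ns too long, which adds up to a little under 17us per second.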
kernel: eepro100: wait_for_cmd_done timeout!
Hello, I'm seeing the message periodically: Nov 8 09:52:59 kgate last message repeated 5 times Nov 8 11:26:54 kgate kernel: eepro100: wait_for_cmd_done timeout! Nov 8 11:56:12 kgate kernel: eepro100: wait_for_cmd_done timeout! Nov 8 14:38:45 kgate kernel: eepro100: wait_for_cmd_done timeout! Nov 8 14:38:47 kgate last message repeated 3 times Nov 8 14:56:11 kgate kernel: eepro100: wait_for_cmd_done timeout! Nov 8 14:57:01 kgate last message repeated 10 times Nov 8 21:32:15 kgate kernel: eepro100: wait_for_cmd_done timeout! Nov 8 22:57:46 kgate kernel: eepro100: wait_for_cmd_done timeout! The source contains: /* How to wait for the command unit to accept a command. Typically this takes 0 ticks. */ static inline void wait_for_cmd_done(long cmd_ioaddr) { int wait = 1000; do ; while(inb(cmd_ioaddr) && --wait >= 0); #ifndef final_version if (wait < 0) printk(KERN_ALERT "eepro100: wait_for_cmd_done timeout!\n"); #endif } My machine is a HP Netserver LD Pro with a 200MHz Pentium Pro. I guess a fast machine will only allow a very short time for the above loop. Shouldn't it be fixed? The hardware is this: 01:02.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 02) Subsystem: Hewlett-Packard Company Ethernet Pro 10/100TX Flags: bus master, medium devsel, latency 66, IRQ 9 Memory at fe8fe000 (32-bit, prefetchable) I/O ports at ece0 Memory at fea0 (32-bit, non-prefetchable) kgate kernel: eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html kgate kernel: eepro100.c: $Revision: 1.20.2.10 $ 2000/05/31 Modified by Andrey V. Savochkin <[EMAIL PROTECTED]> and others kgate kernel: eth0: OEM i82557/i82558 10/100 Ethernet, 00:60:B0:6D:F1:AE, IRQ 9. kgate kernel: Board assembly 673610-001, Physical connectors present: RJ45 kgate kernel: Primary interface chip i82555 PHY #1. kgate kernel: General self-test: passed. kgate kernel: Serial sub-system self-test: passed. kgate kernel: Internal registers self-test: passed. 
kgate kernel: ROM checksum self-test: passed (0x49caa8d6). kgate kernel: Receiver lock-up workaround activated. The software is Linux-2.2.16 (SuSE 7.0). Regards, Ulrich
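One way to make such a wait independent of CPU speed is to bound it by elapsed time rather than by an iteration count. The following is a user-space sketch only, not a proposed driver fix: `wait_for_idle()', `status_fn' and the two stub status readers are hypothetical names, and a kernel driver would use its own timing facilities instead of `clock_gettime()'.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical status reader; returns 0 once the command unit is idle. */
typedef int (*status_fn)(void);

/* Example stubs for demonstration. */
static int always_idle(void) { return 0; }
static int always_busy(void) { return 1; }

/* Poll until status() returns 0 or timeout_us microseconds have passed.
 * Returns 0 on success, -1 on timeout. */
static int wait_for_idle(status_fn status, long timeout_us)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        long elapsed_us;

        if (status() == 0)
            return 0;
        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed_us = (now.tv_sec - start.tv_sec) * 1000000L
                   + (now.tv_nsec - start.tv_nsec) / 1000L;
        if (elapsed_us >= timeout_us)
            return -1;
    }
}
```

With this shape the timeout means the same thing on a 200MHz Pentium Pro and on a much faster machine.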
2.4.0test11: "nanoseconds patch" (prerelease) available
Hi, related to my question about having nanoseconds in xtime for Linux 2.5, two (or three) people were interested, or at least managed to route their message to me. As promised I have made an early release patch against 2.4.0test11 available at ftp.kernel.org:/pub/linux/daemons/ntp/PPS/pps-2.4-pre1.tar.bz2 (63kB, patch + digital signature). The modified sources compile, link and boot (for arch/i386), but consider this code alpha quality, and don't use it in production. It is possible that it works perfectly, but I simply don't have the experience. Fixes for any architectures are appreciated. Finally I want to get rid of gettimeoffset() and a lot of redundant code. I noticed that the ATM drivers access xtime directly. If jiffies are not fine enough, do_gettimeofday() has to be called for now. If that's too slow, we have to think about an alternative. Regards, Ulrich
i386: gcc & asm(): wrong constraint for "mull"
Hello, I noticed (with some inspiration from Andy Kleen) that some asm() instructions for the ia32 use the "g" constraint for "mull", where my Intel 386 Assembly Language Manual suggests the "MUL" instruction needs an r/m operand. So I guess the correct constraint is "rm" in gcc, and not "g". That change produces identical assembly output for gcc-2.95.2, but some gcc-2.96.x will try a multiplication with an immediate (constant) operand for the "g" constraint, and the assembler will choke on that. (Redhat 7.0 ships such a version of gcc). As I won't be online next week, let me say regards and a good new year to all! Ulrich
Re: i386: gcc & asm(): wrong constraint for "mull"
On 29 Dec 2000, at 5:17, Jakub Jelinek wrote:

> On Fri, Dec 29, 2000 at 10:54:38AM +0100, Ulrich Windl wrote:
> > Hello,
> >
> > I noticed (with some inspiration from Andy Kleen) that some asm()
> > instructions for the ia32 use the "g" constraint for "mull", where my
> > Intel 386 Assembly Language Manual suggests the "MUL" instruction needs
> > an r/m operand. So I guess the correct constraint is "rm" in gcc, and
> > not "g". That change produces identical assembly output for gcc-2.95.2,
> > but some gcc-2.96.x will try a multiplication with an immediate
> > (constant) operand for the "g" constraint, and the assembler will choke
> > on that. (Redhat 7.0 ships such a version of gcc).
>
> gcc 2.95.2 md.texi says:
> @cindex @samp{g} in constraint
> @item @samp{g}
> Any register, memory or immediate integer operand is allowed, except for
> registers that are not general registers.
>
> (2.95.2 was chosen to make it clear it is not something new in gcc).
> That means gcc is really free to choose which of register, memory or
> immediate it puts in and the fact that some gcc version choose one and
> others choose other is perfectly correct.
> Fix the constraints and be happy (at least during the upcoming millennium) :)

Oh, if it wasn't clear: It's what I wanted to say. As I don't have a patch ready for that, maybe start at arch/i386/kernel/time.c; there are at least two of these "mull" instructions.

Ulrich
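For readers who want to see the constraint in context, here is a small stand-alone sketch. It is not code from arch/i386/kernel/time.c; the helper name `mul_hi32()' is made up, and the plain C fallback is only there so the example also builds on other architectures.

```c
#include <stdint.h>

/* High 32 bits of a 32x32 -> 64 bit unsigned multiply. */
static uint32_t mul_hi32(uint32_t a, uint32_t b)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;

    /* "rm": register or memory only, because MUL has no immediate
     * form.  With "g" the compiler would also be allowed to
     * substitute a constant here. */
    __asm__("mull %3"
            : "=a" (lo), "=d" (hi)
            : "0" (a), "rm" (b)
            : "cc");
    (void)lo;
    return hi;
#else
    return (uint32_t)(((uint64_t)a * b) >> 32);
#endif
}
```

Even when `b' is a compile-time constant, the "rm" constraint forces the compiler to place it in a register or in memory before the instruction, which is what the assembler needs.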
Compiling 2.2 on Pentium
Hi, I noticed that when compiling with gcc-2.95.2 for a Pentium the flag "-m486" is still passed to gcc. However gcc-2.95.2 generates different code if "-m586" is used (older versions ended at -m486). Is the makefile intentionally not updated, or was it just forgotten? Regards, Ulrich
2.2.17: CPU features bug for AMD?
Browsing patch-2.2.17.gz I found this in linux/arch/i386/kernel/setup.c: Isn't there an "else" or "break" missing? Otherwise ``x86_cap_flags[16] = "pat"'' is always the case, and extended AMD features are always present.

@@ -1029,17 +1130,22 @@
 case X86_VENDOR_AMD:
 	if (c->x86 == 5 && c->x86_model == 6)
 		x86_cap_flags[10] = "sep";
-	x86_cap_flags[16] = "fcmov";
+	if (c->x86 < 6)
+		x86_cap_flags[16] = "fcmov";
+	x86_cap_flags[16] = "pat";
+	x86_cap_flags[22] = "mmxext";
+	x86_cap_flags[24] = "fxsr";
+	x86_cap_flags[30] = "3dnowext";
 	x86_cap_flags[31] = "3dnow";
 	break;
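To make the question concrete, this is what the hunk would do for bit 16 with the suspected "else" added. It is a stand-alone sketch, not the actual fix: the function `amd_cap_flag16()' and its parameters are made up, and whether the other new flags should also be conditional is left open.

```c
/* Set the AMD-specific name for capability bit 16.
 * flags must have at least 32 entries. */
static void amd_cap_flag16(int family, const char *flags[32])
{
    if (family < 6)
        flags[16] = "fcmov";
    else                    /* the branch the posted hunk seems to lack */
        flags[16] = "pat";
}
```

Without the "else", the second assignment always overwrites the first, so family 5 parts would report "pat" as well.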
2.2.16: Verify mounting of /
Hello, I have some trouble with an initrd configuration where it seems the wrong partition is mounted as root (/), even though it seems fine in /etc/fstab, and mount and df all display that it's fine. I realized that the kernel messages are not as helpful as possible:

<4>VFS: Mounted root (ext2 filesystem) readonly.
<4>change_root: old root has d_count=1
<5>Trying to unmount old root ... okay
<4>Freeing unused kernel memory: 68k freed
<6>Adding Swap: 132088k swap-space (priority -1)

It would be good if the devices involved were displayed, I mean _which_ filesystem was mounted or unmounted? The mount of the new root is not logged at all.

Regards, Ulrich
FYI: [comp.protocols.time.ntp] announce: Linux PPS support for Kernel 2.4.2
FYI, a copy...

--- Start of forwarded message ---
From: Ulrich Windl <[EMAIL PROTECTED]>
Newsgroups: comp.protocols.time.ntp
Subject: announce: Linux PPS support for Kernel 2.4.2
Date: 13 Mar 2001 08:04:56 +0100
Organization: University of Regensburg, Germany
Message-ID: <[EMAIL PROTECTED]>

I have uploaded the first working version of a patch for Linux-2.4.2 to ftp://ftp.kernel.org:/pub/linux/daemons/ntp/PPS/PPS-2.4.2-pre5.diff.bz2

The patch is mostly equivalent to PPSkit-1.0.3, except that it lacks support for CIOGETEV. Actually it seems that ATOM won't work without CIOGETEV, but I'll have to re-investigate... I would appreciate any feedback (see /usr/src/linux/CREDITS for EMail address)

Regards, Ulrich
--- End of forwarded message ---
2.4.2: kernel patch for , nanosleep
Hello, originally intended for my PPSkit patch I found out that the "normal" kernel might like this patch as well: nanosleep() currently uses "udelay()" from as there is no "ndelay()". I implemented "ndelay()" for i386 and adjusted the other macros. During that I found that some files have or use their own "delay()" routines. The original delay is dangerous, because depending on the CPU it requires loop cycles or clock cycles as argument, giving non-reliable code. Affected sources: drivers/net/hamradio/yam.c: "delay(100)" drivers/scsi/wd I also found that the scaling factor used in the existing code should be rounded up (increased by one) for a more exact value. With the new code there are two possible disadvantages: 1) Reduced accuracy, and 2) possible overflow. I hope both are not really a problem. Regards, Ulrich Index: arch/i386/kernel/i386_ksyms.c === RCS file: /root/LinuxCVS/Kernel/arch/i386/kernel/i386_ksyms.c,v retrieving revision 1.1.1.3 diff -u -r1.1.1.3 i386_ksyms.c --- arch/i386/kernel/i386_ksyms.c 2001/03/11 13:51:19 1.1.1.3 +++ arch/i386/kernel/i386_ksyms.c 2001/03/17 18:08:20 @@ -82,9 +82,9 @@ /* Networking helper routines. 
*/ EXPORT_SYMBOL(csum_partial_copy_generic); /* Delay loops */ -EXPORT_SYMBOL(__udelay); +EXPORT_SYMBOL(__ndelay); EXPORT_SYMBOL(__delay); -EXPORT_SYMBOL(__const_udelay); +EXPORT_SYMBOL(__const_sndelay); EXPORT_SYMBOL_NOVERS(__get_user_1); EXPORT_SYMBOL_NOVERS(__get_user_2); Index: arch/i386/lib/delay.c === RCS file: /root/LinuxCVS/Kernel/arch/i386/lib/delay.c,v retrieving revision 1.1.1.2 diff -u -r1.1.1.2 delay.c --- arch/i386/lib/delay.c 2001/01/08 20:17:36 1.1.1.2 +++ arch/i386/lib/delay.c 2001/03/17 18:12:36 @@ -64,16 +64,27 @@ __loop_delay(loops); } -inline void __const_udelay(unsigned long xloops) +/* convert scaled nanoseconds to execution loops and delay */ +inline void __const_sndelay(unsigned long scaled_nsecs) { int d0; __asm__("mull %0" - :"=d" (xloops), "=&a" (d0) - :"1" (xloops),"0" (current_cpu_data.loops_per_jiffy)); -__delay(xloops * HZ); + :"=d" (scaled_nsecs), "=&a" (d0) + :"1" (scaled_nsecs),"0" (current_cpu_data.loops_per_jiffy)); +__delay(scaled_nsecs * HZ); } -void __udelay(unsigned long usecs) +void __ndelay(unsigned long nsecs) { - __const_udelay(usecs * 0x10c6); /* 2**32 / 100 */ + /* 2**32 / 10 == 4.2946... 
*/ + if (nsecs > NDELAY_LIMIT) { + static int complaints = 7; + + if (complaints > 0) { + --complaints; + printk(KERN_ERR "__ndelay(%lu) exceeds limit\n", nsecs); + } + nsecs = NDELAY_LIMIT; + } + __const_sndelay((nsecs * 429) / 100); } Index: include/asm-i386/delay.h === RCS file: /root/LinuxCVS/Kernel/include/asm-i386/delay.h,v retrieving revision 1.1.1.2 diff -u -r1.1.1.2 delay.h --- include/asm-i386/delay.h2001/01/08 20:22:29 1.1.1.2 +++ include/asm-i386/delay.h2001/03/17 17:58:33 @@ -7,14 +7,19 @@ * Delay routines calling functions in arch/i386/lib/delay.c */ -extern void __bad_udelay(void); +extern void __bad_ndelay(void); -extern void __udelay(unsigned long usecs); -extern void __const_udelay(unsigned long usecs); -extern void __delay(unsigned long loops); +extern void __ndelay(unsigned long nsecs); +extern void __const_sndelay(unsigned long scaled_nsecs); +extern void __delay(unsigned long xloops); -#define udelay(n) (__builtin_constant_p(n) ? \ - ((n) > 2 ? __bad_udelay() : __const_udelay((n) * 0x10c6ul)) : \ - __udelay(n)) +#defineNDELAY_LIMIT2000/* 20 ms (2 / HZ)? */ + +#define ndelay(n) (__builtin_constant_p(n) ? \ + ((n) > NDELAY_LIMIT ? \ + __bad_ndelay() : __const_sndelay(((n) * 429ul) / 100)) : \ + __ndelay(n)) + +#define udelay(n) ndelay(n * 1000) #endif /* defined(_I386_DELAY_H) */ Index: kernel/timer.c === RCS file: /root/LinuxCVS/Kernel/kernel/timer.c,v retrieving revision 1.1.1.2.8.1 diff -u -r1.1.1.2.8.1 timer.c --- kernel/timer.c 2001/03/11 15:29:17 1.1.1.2.8.1 +++ kernel/timer.c 2001/03/17 17:22:57 @@ -592,10 +592,11 @@ /* * Short delay requests up to 2 ms will be handled with * high precision by a busy wait for all real-time processes. +* Anything else will be delayed for at least 1/HZ. * * Its important on SMP not to do this holding locks. */ - udelay((t.tv_nsec + 999) / 1000); + ndelay(t.tv_nsec);
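The two disadvantages named in the message (reduced accuracy and possible overflow) can be checked with a few lines of arithmetic. This sketch assumes a 32-bit unsigned long, as on i386; the helper names are made up for this note. It makes no claim about the patch's NDELAY_LIMIT, whose value is not legible in this archive copy.

```c
#include <stdint.h>

/* 2^32 / 10^6, truncated: the per-microsecond factor of the old code. */
static uint32_t usec_scale(void)
{
    return (uint32_t)((1ULL << 32) / 1000000ULL);
}

/* Scaled nanoseconds as computed in the patch, done in 32 bits. */
static uint32_t scale_nsecs32(uint32_t nsecs)
{
    return (nsecs * 429u) / 100u;
}

/* Largest nsecs for which nsecs * 429 still fits in 32 bits. */
static uint32_t max_safe_nsecs(void)
{
    return UINT32_MAX / 429u;
}
```

`usec_scale()' gives 4294, i.e. the 0x10c6 of the old `__udelay()'. `scale_nsecs32(1000)' gives 4290 rather than 4294, which is the accuracy loss of roughly 0.1%. `max_safe_nsecs()' is 10011578, so in 32-bit arithmetic the intermediate product wraps for requests a little above 10 ms.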
2.2.18: e100.c (SuSE 7.1): udelay() used in a wrong way?
From the source code of drivers/net/e100.c:

/*
 * Name: Phy82562EHDelayMilliseconds
 *
 * Description: Stalls execution for a specified number of milliseconds.
 *
 * Arguments: Time - milliseconds to delay
 *
 * Returns: Nothing
 *
 ***/
void Phy82562EHDelayMilliseconds(int Time)
{
	udelay(Time);
}

AFAIK, udelay() delays microseconds, not milliseconds.

Ulrich
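A routine that really delays milliseconds could look like the sketch below. It is written for user space: `udelay()' here is a stub that only records the requested time, and `delay_milliseconds()' is a made-up name, not the driver's function. In the kernel, the `mdelay()' helper serves the same purpose where it is available.

```c
/* Stub standing in for the kernel's udelay(); records the total
 * requested delay instead of busy-waiting. */
static unsigned long total_usecs;

static void udelay(unsigned long usecs)
{
    total_usecs += usecs;
}

/* Delay for the given number of milliseconds, one millisecond per
 * call so that each udelay() argument stays small. */
static void delay_milliseconds(int ms)
{
    while (ms-- > 0)
        udelay(1000);
}
```

Calling `delay_milliseconds(5)' requests 5000 microseconds in total, where the quoted function would have requested only 5.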
2.4.x: spinlock problem
Hello, I had reported this before: In 2.4.0 getting exact system time from interrupt handlers seems inaccurate (in 2.2.18 it works fine). I have applied the same modifications to the 2.4 code base as to 2.2. With 2.4.1 the kernel is incredibly slow, so slow that you can watch the individual lines of the kernel boot being printed on the screen. I checked the spinlocks, but could not find a problem. Before I start removing the new spinlocks for timer, PIC and RTC, I'd like to hear what the gurus think. I also tried to find out how I can profile the kernel or trace the spinlocks, but that seems to be hardly documented. Any hints? Regards, Ulrich
having a hard time with 2.4.x
Hello, I have some news on the topic of timekeeping in Linux-2.4: As Alan Cox pointed out the ACPI changes between 2.4.0 and 2.4.1 created a extremely slow console output (if not more). Configuring away ACPI support solved that problem. However there is still a problem that I cannot explain. I wrote a test program for my modified kernel (I did not try the original one). I'll include the program plus results (if you want to see the patch go to ftp.kernel.org/pub/linux/daemons/ntp/PPS and get PPS-2.4.0-pre3.tar.bz2 (patch plus signature)): #include #include #define NTP_NANO #include int main() { struct timextx; longlastns = 0; tx.modes = 0; while(1) { adjtimex(&tx); printf("%d %d %d\n", tx.time.tv_sec, tx.time.tv_nsec, tx.time.tv_nsec - lastns); lastns = tx.time.tv_nsec; fflush(stdout); } } /*-- The following anomalies were examined by running the program for a few seconds, redirecting output into a file: 981488742 428870884 428870884 981488742 429242679 371795 981488742 429258279 15600 981488742 429266001 7722 981488742 429273781 7780 981488742 429281142 7361 ...this is just the startup, filling the caches; 7us seems acceptable 981488742 442133766 7235 981488742 442155740 21974 981488742 442164248 8508 981488742 442171719 7471 ... 
some occasional jitter seems acceptable, too 981488742 451557086 7280 981488742 451564393 7307 981488742 461600593 10036200 981488742 461609928 9335 981488742 461617263 7335 ...here we lost 10ms, possibly due to scheduling 981488742 991589894 7317 981488742 991597171 7277 981488743 1628395 -989968776 981488743 1636937 8542 ...the new second seems to begin a bit early; I'm missing about 8ms 981488743 991650854 7411 981488743 991658147 7293 981488744 1724546 -989933601 981488744 1732344 7798 ...this is quite reproducible 981488751 294943079 7327 981488751 294950364 7285 981488751 294957703 7339 981488751 294964994 7291 981488751 294964995 1 981488751 294964996 1 981488751 294964997 1 981488751 294964998 1 981488751 294964999 1 ...here something strange happened: time refused to advance, forcing ...the code to generate synthetic time (add 1ns). Here comes the end: 981488751 294967294 1 981488751 294967295 1 981488747 0 -294967295 981488747 37804096 37804096 981488747 37811711 7615 981488747 37819006 7295 ...time went back by four seconds! This happened again: 981488752 294967292 1 981488752 294967293 1 981488752 294967294 1 981488752 294967295 1 981488748 0 -294967295 981488748 100304297 100304297 981488748 100311973 7676 981488748 100319231 7258 ...but sometimes the second overflows correctly: 981488748 87866 7315 981488748 95152 7286 981488749 2417 -92735 981488749 9995 7578 981488749 17227 7232 ... 
981488749 91971 30023 981488750 747 -91224 981488750 8405 7658 981488750 15531 7126 Here is a simplified sample with microseconds instead, after having removed two spinlocks (as they are in 2.2.18): 981487863 665701 665701 981487863 666048 347 981487863 666062 14 981487863 666071 9 981487863 666078 7 ...start as usual 981487863 668825 7 981487863 668832 7 981487863 668855 23 981487863 668863 8 ...some jitter 981487863 673861 7 981487863 673869 8 981487863 673876 7 981487863 683930 10054 981487863 683938 8 981487863 683946 8 ...and scheduling 981487863 993871 8 981487863 993879 8 981487864 3905 -989974 981487864 3913 8 981487864 3920 7 ...we still lack 10ms during overflow... 981487869 293860 7 981487869 293867 7 981487869 293875 8 981487869 293875 0 981487869 293875 0 ...then time also refuses to advance 981487869 293941 0 981487869 293941 0 981487866 28937 -265004 981487866 28946 9 981487866 28954 8 ...eventually loosing a few seconds Possible explanation for negative time: 2^32/5 == 8.5, i.e. the low 32bit of the TSC will overflow every 8.5 seconds on a 500MHz CPU, probably causing a bad interpolation between ticks. But: Why doesn't the effect occur with kernel 2.2??? *--*/ Regards, Ulrich P.S.: I'm not subscribed here, so CC: is appreciated. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
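Two pieces of arithmetic may help when reading the traces above. First, the nanosecond traces stop at 294967295 four seconds past the value they fall back to, which is what a 32-bit nanosecond offset does when it reaches 2^32 - 1 and wraps; this is only an observation about the numbers, not a diagnosis. Second, a 32-bit cycle counter at 500MHz wraps roughly every 8.59 seconds. The struct and function names below are made up for this note.

```c
#include <stdint.h>

struct split_time {
    uint32_t sec;
    uint32_t nsec;
};

/* Split a 32-bit nanosecond offset into whole seconds and remainder. */
static struct split_time split_offset(uint32_t offset_ns)
{
    struct split_time t;

    t.sec  = offset_ns / 1000000000u;
    t.nsec = offset_ns % 1000000000u;
    return t;
}

/* Seconds until a 32-bit counter running at hz wraps around. */
static double wrap_period_s(double hz)
{
    return 4294967296.0 / hz;
}
```

`split_offset(UINT32_MAX)' yields 4 seconds and 294967295 nanoseconds, and one more nanosecond brings the offset back to zero, matching the jump from 981488751.294967295 to 981488747.0 in the trace.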
2.4 kernel & gcc code generation: a bug?
Trying to find out what got broken in kernel 2.4, I was so clueless as to compare assembly output for 2.2.18 with 2.4.1. However the assembler is quite different, as 2.4 uses the more advanced optimizations of gcc- 2.95.2. Anyway: 1) spinlocks look strange in 2.2(!): .globl rtc_lock .typertc_lock,@object .sizertc_lock,0 rtc_lock: .globl i8253_lock while in 2.4.1 they look like this: .globl rtc_lock .align 4 .typertc_lock,@object .sizertc_lock,4 rtc_lock: .long 0 .globl i8253_lock 2) gcc seems to fail to save registers that are marked "spilled" in inline asm's constraints, like rdtsc(): /* nanoseconds since last timer interrupt (using the CPU cycle-counter) */ static inline unsigned long do_exact_nanotime(void) { register unsigned long eax asm("ax"); register unsigned long edx asm("dx"); unsigned long result; rdtsc(eax, edx);/* Read the Time Stamp Counter */ /* .. relative to previous jiffy (32 bits is enough) */ eax -= last_tsc_low;/* tsc_low delta */ /* * Time offset = (tsc_low delta << 4) * exact_nanotime_quotient * = (tsc_low delta << 4) * (nsecs_per_clock) * = (tsc_low delta << 4) * (nsecs_per_jiffy / * clocks_per_jiffy) * * Using a mull instead of a divl saves up to 31 clock cycles * in the critical path. */ __asm__("mull %2" :"=a" (eax), "=d" (edx) :"rm" (exact_nanotime_quotient), "0" (eax << 4)); /* our adjusted time offset in nanoseconds */ result = nanodelay_at_last_interrupt + edx; return result; } .text .align 4 .typedo_exact_nanotime,@function do_exact_nanotime: #APP rdtsc #NO_APP subl last_tsc_low,%eax sall $4,%eax #APP mull exact_nanotime_quotient #NO_APP movl nanodelay_at_last_interrupt,%eax addl %edx,%eax ret .Lfe7: .sizedo_exact_nanotime,.Lfe7-do_exact_nanotime .local last_rtc_update .comm last_rtc_update,4,4 .comm timer_ack,4,4 .ident "GCC: (GNU) 2.95.2 19991024 (release)" #endif You'll notice that %edx is not pushed at the start of the function. Unless the caller saves that, edx will be spilled. 
Depending on the level of optimization this can be bad. Am I wrong?

Regards,
Ulrich
P.S: Not subscribed here, so please CC: if possible.
throttling kernel messages: KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1282)
Hi,

I'm affected by the (in)famous bug:

Apr 12 07:03:02 mailgate kernel: recvmsg bug: copied D640F0D1 seq D640F679
Apr 12 07:03:02 mailgate kernel: KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1282)
Apr 12 07:03:02 mailgate kernel: recvmsg bug: copied D640F0D1 seq D640F679
Apr 12 07:03:02 mailgate kernel: KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1282)

(Kernel of SuSE Linux 9.2, 2.6.8-24.13-default #1 Fri Mar 18 10:19:42
UTC 2005 i686 i686 i386 GNU/Linux)

The kernel spits out hundreds to thousands of messages per second,
keeping klogd and syslogd quite busy, and my messages file stopped
growing at 2GB.

I'd suggest enabling throttling for this message, or triggering a
panic/reboot, or maybe even fixing the bug or the message. ;-)

Regards,
Ulrich
(Fwd) about timer in linux kernel.
Maybe one of the people who wrote the code wants to explain...

Thanks,
Ulrich

--- Forwarded message follows ---
From:      "meng-ju" <[EMAIL PROTECTED]>
To:        <[EMAIL PROTECTED]>
Subject:   about timer in linux kernel.
Date sent: Fri, 18 May 2001 16:58:55 -0700

Hi! Mr. Ulrich Windl,

I want to know how the timer works in the kernel. When we call
add_timer(), it will call add_timer_internal to add it to its list.
Now I am confused about how the system checks whether it has expired or
not. In run_timer_list(), why does it use tv1.vec + tv1.index to find
the expiration point, while add_timer_internal() uses the expiration
time minus timer_jiffies? I don't understand what roles jiffies,
timer_jiffies and tv1.index play.

Thanks for your patience and for answering.

static inline void run_timer_list(void)
{
        spin_lock_irq(&timerlist_lock);
        while ((long)(jiffies - timer_jiffies) >= 0) {
                struct list_head *head, *curr;
                if (!tv1.index) {
                        int n = 1;
                        do {
                                cascade_timers(tvecs[n]);
                        } while (tvecs[n]->index == 1 && ++n < NOOF_TVECS);
                }
repeat:
                head = tv1.vec + tv1.index;
                curr = head->next;
                if (curr != head) {
                        struct timer_list *timer;
                        void (*fn)(unsigned long);
                        unsigned long data;

                        timer = list_entry(curr, struct timer_list, list);
                        fn = timer->function;
                        data = timer->data;

                        detach_timer(timer);
                        timer->list.next = timer->list.prev = NULL;
                        timer_enter(timer);
                        spin_unlock_irq(&timerlist_lock);
                        fn(data);
                        spin_lock_irq(&timerlist_lock);
                        timer_exit();
                        goto repeat;
                }
                ++timer_jiffies;
                tv1.index = (tv1.index + 1) & TVR_MASK;
        }

Meng-Ju
--- End of forwarded message ---
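To illustrate the question in the forwarded message, here is a toy single-level timer wheel in userspace C. It is a sketch only, not the kernel's code: all names (toy_wheel, toy_add_timer, toy_run_timers, toy_record) are invented, there is no locking and no cascading from outer wheels, and timers that are already expired or too far ahead are simply rejected. It shows the relationship the question asks about: timer_jiffies is the next tick the wheel has yet to process, the slot index always equals timer_jiffies & MASK, insertion checks "expires - timer_jiffies" to see whether the timer fits into the wheel, and the run loop catches up to the current jiffies one slot at a time.

```c
#include <assert.h>
#include <stddef.h>

#define WHEEL_BITS 8
#define WHEEL_SIZE (1u << WHEEL_BITS)
#define WHEEL_MASK (WHEEL_SIZE - 1)

struct toy_timer {
    unsigned long expires;          /* absolute tick at which to fire */
    void (*fn)(unsigned long);
    unsigned long data;
    struct toy_timer *next;
};

struct toy_wheel {
    unsigned long timer_jiffies;    /* next tick still to be processed */
    unsigned int index;             /* always timer_jiffies & WHEEL_MASK */
    struct toy_timer *slot[WHEEL_SIZE];
};

/* Sample callback that records the data value of the last fired timer. */
static unsigned long toy_fired;
static void toy_record(unsigned long data) { toy_fired = data; }

/* Queue a timer; returns -1 if it does not fit into this one wheel. */
static int toy_add_timer(struct toy_wheel *w, struct toy_timer *t)
{
    unsigned long idx = t->expires - w->timer_jiffies; /* ticks from now */
    unsigned int s;

    assert(t->fn != NULL);
    if (idx >= WHEEL_SIZE)          /* too far ahead, or already expired */
        return -1;
    s = t->expires & WHEEL_MASK;    /* slot depends on absolute expiry */
    t->next = w->slot[s];
    w->slot[s] = t;
    return 0;
}

/* Process every tick up to and including 'jiffies'. */
static void toy_run_timers(struct toy_wheel *w, unsigned long jiffies)
{
    while ((long)(jiffies - w->timer_jiffies) >= 0) {
        struct toy_timer *t;

        /* Everything in the current slot expires at timer_jiffies. */
        while ((t = w->slot[w->index]) != NULL) {
            w->slot[w->index] = t->next;
            t->next = NULL;
            t->fn(t->data);
        }
        ++w->timer_jiffies;
        w->index = (w->index + 1) & WHEEL_MASK;
    }
}
```

With a zero-initialized wheel, a timer with expires = 5 lands in slot 5; calling toy_run_timers() with jiffies = 4 leaves it pending, and calling it with jiffies = 5 fires it.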
Re: [RFC - 0/9] Generic timekeeping subsystem (v. B5)
On 10 Aug 2005 at 22:32, Lee Revell wrote:

> On Wed, 2005-08-10 at 19:13 -0700, john stultz wrote:
> > All,
> > Here's the next rev in my rework of the current timekeeping subsystem.
> > No major changes, only some cleanups and further splitting the larger
> > patches into smaller ones.
>
> Last I heard this made gettimeofday() 20% slower on x86. Is this still
> the case?

If it's only 20% for an increase in resolution of 10%, it's quite good ;-)

Regards,
Ulrich

>
> Lee
>
Re: [RFC - 0/9] Generic timekeeping subsystem (v. B5)
On 16 Aug 2005 at 11:25, Christoph Lameter wrote:

> You mentioned that the NTP code has some issues with time interpolation
> at the KS. This is due to the NTP layer not being aware of actual time
> differences between timer interrupts that the interpolator knows about. If
> the NTP layer would be aware of the actual intervals measured by the
> timesource (or interpolator) then presumably time could be adjusted in a
> more accurate way.

Hi,

whatever the implementation is, at some point there must exist an
interface to get and set "normal time", free of any jumps and jitter.
That "frontend time" will be used as a base of correction. Basically
that means time should be as monotonic and jitter free as possible for
any measurement interval you like. Otherwise when extrapolating the
time-error, it (NTP) will try to overcompensate (or undercompensate),
making the whole thing unstable.

Here's a sample from some ancient NTP distribution (pre-nanosecond), but
you'll get the idea what to check:

more util/jitter.c
/*
 * This program can be used to calibrate the clock reading jitter of a
 * particular CPU and operating system. It first tickles every element
 * of an array, in order to force pages into memory, then repeatedly calls
 * gettimeofday() and, finally, writes out the time values for later
 * analysis. From this you can determine the jitter and if the clock ever
 * runs backwards.
 */
#include <stdio.h>
#include <sys/time.h>

#define NBUF 20002

void
main()
{
        struct timeval ts, tr;
        struct timezone tzp;
        long temp, j, i, gtod[NBUF];

        gettimeofday(&ts, &tzp);

        /*
         * Force pages into memory
         */
        for (i = 0; i < NBUF; i ++)
                gtod[i] = 0;

        /*
         * Construct gtod array
         */
        for (i = 0; i < NBUF; i ++) {
                gettimeofday(&tr, &tzp);
                gtod[i] = (tr.tv_sec - ts.tv_sec) * 1000000 + tr.tv_usec;
        }

        /*
         * Write out gtod array for later processing with S
         */
        for (i = 0; i < NBUF - 2; i++) {
                /*
                printf("%lu\n", gtod[i]);
                */
                gtod[i] = gtod[i + 1] - gtod[i];
                printf("%lu\n", gtod[i]);
        }

        /*
         * Sort the gtod array and display deciles
         */
        for (i = 0; i < NBUF - 2; i++) {
                for (j = 0; j <= i; j++) {
                        if (gtod[j] > gtod[i]) {
                                temp = gtod[j];
                                gtod[j] = gtod[i];
                                gtod[i] = temp;
                        }
                }
        }
        fprintf(stderr, "First rank\n");
        for (i = 0; i < 10; i++)
                fprintf(stderr, "%10ld%10ld\n", i, gtod[i]);
        fprintf(stderr, "Last rank\n");
        for (i = NBUF - 12; i < NBUF - 2; i++)
                fprintf(stderr, "%10ld%10ld\n", i, gtod[i]);
}

Regards,
Ulrich
Re: [RFC - 0/9] Generic timekeeping subsystem (v. B5)
On 16 Aug 2005 at 18:17, john stultz wrote: [...] > Maybe to focus this productively, I'll try to step back and outline the > goals at a high level and you can address those. > > My Assumptions: > 1. adjtimex() sets/gets NTP state values One of the greatest mistakes in the past which still affects us now was the decision to piggy-back ntp_adjtime and ntp_gettime on top of adjtime() and thus creating adjtimex(). Only to save a system-call number or two. WE REALLY SHOULD GET RID OF THAT going back to Linux 0.something. > 2. Every tick we adjust those state values ... which require it. > 3. Every tick we use those values to make a nanosecond adjustment to > time. ...or even more frequent. In my code I tried to scale the tick interpolation as well, thus effectively making adjustments even within timer ticks (so far the theory...). I was assuming however that ticks and interpolation clocks are derived from one single source and would "float" the same way relative to each other. > 4. Those state values are otherwise unused. What is "otherwise"? Outside the "NTP clock model", or "between ticks"? > > Goals: > 1. Isolate NTP code to clean up the tick based timekeeping, reducing the > spaghetti-like code interactions. First you need a new clock model that's compatible with NTP. Then you can consider how to implement the NTP stuff. So the clock even without NTP has to be strictly monotonic for any interval it is read, be it nanoseconds, microseconds, milliseconds, seconds, minutes, hours, days, ... The clock delta (=increase of time) over time should be as constant as possible (i.e. time shouldn't go up like stairs). > 2. Add interfaces to allow for continuous, rather then tick based, > adjustments (much how ppc64 does currently, only shareable). Adjustments to the clock _model_ are asynchronous by definition, while adjustments to the clock itself are, well, periodic. Whatever the period. Maybe this helps and can be agreed on. 
Regards,
Ulrich
Re: [RFC - 0/9] Generic timekeeping subsystem (v. B5)
On 24 Aug 2005 at 1:54, Roman Zippel wrote:

[...]
> error) >> shift". The difference between system time and reference
> time is really important. gettimeofday() returns the system time, NTP
> controls the reference time and these two are synchronized regularly.
[...]

Roman,

I'm having a problem with your wording: NTP _does_ control the "system
time" (system clock), because it's the only clock it can use. The
"reference time" is usually remote or elsewhere (multiple sources).
Local NTP does not control the remote reference time(s).

Regards,
Ulrich
Re: [RFC][PATCH] new timeofday core subsystem (v. A3)
On 15 Mar 2005 at 10:25, john stultz wrote:

> On Mon, 2005-03-14 at 21:37 -0800, Christoph Lameter wrote:
> > Note that similarities exist between the posix clock and the time sources.
> > Will all time sources be exportable as posix clocks?
>
> At this point I'm not familiar enough with the posix clocks interface to
> say, although its probably outside the scope of the initial timeofday
> rework.

I'd be happy to see the required POSIX clocks at nanosecond resolution
for the initial version. Add-ons may follow later.

>
> Do you have a link that might explain the posix clocks spec and its
> intent?

There's a book titled something like "POSIX.4: Programming for the real
world" by Bill Gallmeister (I think).

Regards,
Ulrich
RFD(time): squeezing and stretching the tick
For i386 with TSC, the kernel calibrates how many CPU cycles fit
between two timer interrupts. That value corresponds to 10000
microseconds. Ideally. In practice however the timer interrupts do not
happen exactly every 10000 us (for hardware reasons).

When interpolating time between ticks, that calibration value is used.
When using NTP (or adjusting tick manually) the value added every tick
may be different from 10000 us. If that value is larger, the time seems
to jump ahead at the beginning of each tick; if the value is smaller,
the time may seem to get stuck, get slow, or jump back at the beginning
of a new tick.

Therefore I added experimental code to scale the value used for tick
interpolation according to these corrections. As it seems to me, the
clock quality improves, and the performance penalty only appears when
the correction value changes. I haven't done the non-TSC case or other
architectures.

For microseconds it may seem negligible, but not for nanoseconds.

If anybody has an interesting opinion on this, please mail me.

Regards,
Ulrich
P.S. Not subscribed here.
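The scaling idea described above can be sketched in a few lines of userspace C. This is an illustration under stated assumptions, not the experimental kernel code: the function name is invented, HZ=100 (a nominal tick of 10000 us) and a 500 MHz TSC are assumed for the numbers, and 64-bit arithmetic is used for simplicity where real kernel code would avoid the division.

```c
#include <assert.h>

/*
 * Interpolated microseconds since the last timer tick.
 *
 *   cycles          TSC cycles elapsed since the last tick
 *   cycles_per_tick calibrated TSC cycles between two ticks
 *   usec_per_tick   what is actually added to the clock per tick
 *                   (10000 nominal at HZ=100, more or less when NTP
 *                   or a manual tick adjustment is in effect)
 *
 * Scaling by usec_per_tick instead of the constant 10000 makes the
 * interpolated value reach exactly the per-tick increment at the end
 * of the tick, so the clock neither jumps ahead nor steps back there.
 */
static unsigned long interp_usec(unsigned long long cycles,
                                 unsigned long long cycles_per_tick,
                                 unsigned long usec_per_tick)
{
    assert(cycles_per_tick > 0);
    return (unsigned long)(cycles * usec_per_tick / cycles_per_tick);
}
```

With 5000000 cycles per tick, a full tick yields 10010 us when the per-tick increment is 10010 us, whereas an unscaled interpolation would reach only 10000 us and the clock would jump by 10 us at the next tick.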
2.2.18: severe performance problem (high load, low mem, idle CPU)
Hello,

we experienced a severe performance problem on a PentiumPro 200 MHz,
64MB RAM, 128MB swap: Due to many processes being started in a short
time, the system load went up to 53, and the 9GB SCSI disk was working
heavily. At that time I suspected no severe problem, and I was busy
doing something else.

However after almost three hours the system load was still at about 40
with the old processes not yet finished. (The processes typically take
2 to 5 seconds to finish, and need about 4MB memory each). At that point
I became active.

In top I was surprised that the CPU claimed to be more than 90% idle,
while the swap space was exceeded. But the memory wasn't really tight;
cached and buffers were about 12MB together. So basically the situation
should have gone away. Should, but didn't.

The kernel running was that from SuSE Linux 7.1 (Linux version 2.2.18
([EMAIL PROTECTED]) (gcc version 2.95.2 19991024 (release)) #1 Fri Jan
19 22:10:35 GMT 2001). So maybe the defect is an "enhancement" done by
SuSE. Anyway:

In top I noticed that the processes to finish were all mostly swapped
out, and they showed a zero in the "PRI" column. Usually runnable
processes have more "fuel" there. It seems to me swapped out processes
did not get their fuel reloaded. The processes all had a "D" status
(blocked on I/O).

Also it seemed that processes that share a lot of data are not favoured
enough when paging in. If a page is shared 10 times, paging that one in
would help 10 processes. Instead the kernel seemed to swap in and out a
few kB without getting any process done.

I decided to kill a few non-essential processes to improve the
situation. No help. I added an extra 32MB swapfile, so the buffers and
shared went up to over 30MB, but still no process finished. The CPU
still was quite "idle". So I decided to kill the processes in question.
After several seconds, no process had terminated however. (Maybe due to
the code to handle the signal being paged out).

Then I did a kill -9 to the processes which finally helped.

So to summarize:

1) paged out processes seem not to get enough CPU
2) paged out shared pages seem not to get enough priority to be swapped
   in
3) In low memory situations the scheduling algorithm seems to perform
   poorly

For 3) I could imagine doing round-robin scheduling with an extended
time-slice (while still being fair, i.e. run them rarely but longer)
for massively swapped out processes, hoping that one of them will
finish soon. That way maybe more of the working set will be paged in,
enabling some progress.

I don't have the top screen saved, but I have a ps -aux. The 40
processes being paged out were all displayed with a %CPU of "0.0". The
ps command with 7.4% CPU was the highest value. The kernel pager also
seemed to be non-busy:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   400   52 ?        S    Mar22   0:22 init [3]
root         2  0.0  0.0     0    0 ?        SW   Mar22   0:03 [kflushd]
root         3  0.0  0.0     0    0 ?        SW   Mar22   0:01 [kupdate]
root         4  0.0  0.0     0    0 ?        SW   Mar22   6:58 [kswapd]
root         5  0.0  0.0     0    0 ?        SW<  Mar22   0:00 [mdrecoveryd]
...
daemon   32528  0.0  2.0  4984 1352 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32529  0.0  2.0  4984 1352 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32531  0.0  2.7  5008 1760 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32533  0.0  2.5  4984 1640 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32539  0.0  3.1  5008 2044 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32540  0.0  1.9  4984 1276 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32542  0.0  1.4  4984  948 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32547  0.0  2.1  4984 1404 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32548  0.0  2.1  4984 1380 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32549  0.0  1.9  4984 1284 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32550  0.0  1.1  4984  768 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32555  0.0  2.3  4984 1504 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32556  0.0  1.8  4984 1224 ?        D    14:42   0:04 /etc/mail/dirty-h
daemon   32557  0.0  1.9  4984 1244 ?        D    14:42   0:04 /etc/mail/dirty-h
...

These were some of the processes that should have finished.

Regards,
Ulrich
Re: [RFC][PATCH] new timeofday core subsystem (v. A2)
On 24 Jan 2005 at 15:24, Christoph Lameter wrote:

> On Mon, 24 Jan 2005, john stultz wrote:
>
> > +/* __monotonic_clock():
> > + * private function, must hold system_time_lock lock when being
> > + * called. Returns the monotonically increasing number of
> > + * nanoseconds since the system booted (adjusted by NTP scaling)
> > + */
> > +static nsec_t __monotonic_clock(void)
> > +{
> > +       nsec_t ret, ns_offset;
> > +       cycle_t now, delta;
> > +
> > +       /* read timesource */
> > +       now = read_timesource(timesource);
> > +
> > +       /* calculate the delta since the last clock_interrupt */
> > +       delta = (now - offset_base) & timesource->mask;
> > +
> > +       /* convert to nanoseconds */
> > +       ns_offset = cyc2ns(timesource, delta, NULL);
> > +
> > +       /* apply the NTP scaling */
> > +       ns_offset = ntp_scale(ns_offset);
>
> The monotonic clock is the time base for the gettime and gettimeofday
> functions. This means ntp_scale() is called every time that the kernel or
> an application access time.

It depends on what you want: There is little sense in implementing a
nanosecond clock model with NTP when the result is wrong by several
microseconds IMHO. You don't know what the time is used for, so just get
the best you can. Those only wanting some crude time could be happy with
the jiffies (or their equivalent), right?

Regards,
Ulrich

>
> As pointed out before this will dramatically worsen the performance
> compared to the current code base.
>
> ntp_scale() also will make it difficult to implement optimized arch
> specific version of function for timer access.
>
> The fastcalls would have to be disabled on ia64 to make this work. Its
> likely extremely difficult to implement a fastcall if it involves
> ntp_scale().
>
Re: [RFC][PATCH] new timeofday core subsystem (v. A2)
On 24 Jan 2005 at 17:54, Christoph Lameter wrote:

> On Mon, 24 Jan 2005, john stultz wrote:
>
> > We talked about this last time. I do intend to re-work ntp_scale() so
> > its not a function call, much as you describe above.
> >
> > hopelessly endeavoring,
>
> hehe But seriously: The easiest approach may be to modify the time
> sources to allow a fine tuning of the scaling factor. That way ntp_scale
> may be moved into tick processing where it would adjust the scaling of the
> time sources up or downward. Thus no ntp_scale in the monotonic clock
> processing anymore.

It depends what you want to have between ticks: If your ticks are too
wide, the clock will do a little jump forward at the start of a new
tick; if they are too narrow, the clock will jump back a bit at the
start of a new tick (assuming tick interpolation and tick generation are
correlated). (The old kernel code uses a constant to scale the timer's
register to a tick. However if the time is too fast or slow, the
interpolation will be as well.)

Those being blessed with a GPS or better clock will be able to
demonstrate the quality of the code as well as the tuning possibilities
against frequency errors.

>
> Monotonic clocks could be calculated
>
> monotime = ns_at_last_tick + (time_source_cycles_since_tick *
> current_scaling_factor) >> shift_factor.
>
> This would also be easy to implement in asm if necessary.
>
> tick processing could then increment or decrement the current scaling
> factor to minimize the error between ticks. It could also add
> nanoseconds to ns_at_last_tick to correct the clock forward.

Is that what corresponds to "adjust_nanoscale()" in my PPSkit?

>
> With the appropiate shift_factor one should be able to fine tune time much
> more accurately than ntp_scale would do. Over time the necessary
> corrections could be minimized to just adding a few ns once in a while.
>

Regards,
Ulrich
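The monotime formula quoted in this message can be written down as a small fixed-point sketch in C. The struct and function names are invented for illustration, and the numbers (500 MHz source, shift of 22) are assumptions. One detail worth noting: if the formula is typed into C exactly as quoted, "+" binds tighter than ">>", so the explicit parentheses around the shifted product below are needed to get the intended result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-clock state, updated by tick processing. */
struct toy_clock {
    uint64_t ns_at_last_tick;   /* nanoseconds accumulated up to last tick */
    uint32_t mult;              /* scaling factor: ns per cycle << shift   */
    uint32_t shift;             /* fixed-point shift, must be below 64     */
};

/* monotime = ns_at_last_tick + ((cycles_since_tick * mult) >> shift) */
static uint64_t toy_monotime(const struct toy_clock *c,
                             uint64_t cycles_since_tick)
{
    assert(c->shift < 64);
    return c->ns_at_last_tick + ((cycles_since_tick * c->mult) >> c->shift);
}
```

For a 500 MHz source (2 ns per cycle) and shift = 22, mult is 2 << 22 = 8388608. Raising or lowering mult by one changes the clock rate by about 1 part in 8.4 million, which is the kind of fine tuning the quoted text has in mind.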
Q: PCI-X @ 266MHz on HP rx6600 (Qlogic 4Gb FC HBA)
Hi, I have a question: The Qlogic ISP2422 chip is said to handle PCI-X 266MHz. So does the HP Itanium2 server rx6600. Basically that was the reason to select that server. The FC-HBA is in a 266 MHz capable slot. However when booting SLES10 SP1 for IA64, the logs say: <6>QLogic Fibre Channel HBA Driver <6>GSI 49 (level, low) -> CPU 3 (0x0300) vector 51 <6>ACPI: PCI Interrupt :0f:01.0[A] -> GSI 49 (level, low) -> IRQ 51 <6>qla2xxx :0f:01.0: Found an ISP2422, irq 51, iobase 0xc000b004 [...] <6>qla2xxx :0f:01.0: LOOP UP detected (4 Gbps). <6>qla2xxx :0f:01.0: Topology - (F_Port), Host Loop address 0x0 <6>scsi0 : qla2xxx <6>qla2xxx :0f:01.0: <4> QLogic Fibre Channel HBA Driver: 8.01.07-k3 <4> QLogic HP AB378-60001 - <4> ISP2422: PCI-X Mode 2 (133 MH4.00.26 [IP] @ :0f:01.0 hdma+, host#=0, fw=4.00.26 [IP] <5> Vendor: HPModel: HSV200Rev: 6100 <5> Type: RAID ANSI SCSI revision: 02 <5> 0:0:0:0: Attached scsi generic sg0 type 12 Now does Linux support the speed of 266 MHz, and is it just displayed incorrectly, or doesn't Linux support the speed of 266MHz yet? "lspci -v" says: 0f:01.0 Fibre Channel: QLogic Corp. ISP2422-based 4Gb Fibre Channel to PCI-X HBA (rev 02) Subsystem: Hewlett-Packard Company Unknown device 12d6 Flags: bus master, 66MHz, medium devsel, latency 128, IRQ 51 I/O ports at 6000 [size=256] Memory at b004 (64-bit, non-prefetchable) [size=4K] Expansion ROM at b000 [disabled] [size=256K] Capabilities: [44] Power Management version 2 Capabilities: [4c] PCI-X non-bridge device Capabilities: [64] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable- Capabilities: [74] Vital Product Data Please CC: any replies to my address as I'm not subscribed to the kernel list. Regards, Ulrich Windl - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Q: PCI-X @ 266MHz on HP rx6600 (Qlogic 4Gb FC HBA)
On 27 Jul 2007 at 9:46, Andrew Patterson wrote: > On Thu, 2007-07-26 at 23:23 -0700, Andrew Vasquez wrote: > > On Thu, 26 Jul 2007, Andrew Patterson wrote: > > > > > On Thu, 2007-07-26 at 15:36 +0200, Ulrich Windl wrote: > > > > Hi, > > > > > > > > I have a question: The Qlogic ISP2422 chip is said to handle PCI-X > > > > 266MHz. So does > > > > the HP Itanium2 server rx6600. Basically that was the reason to select > > > > that > > > > server. The FC-HBA is in a 266 MHz capable slot. However when booting > > > > SLES10 SP1 > > > > for IA64, the logs say: > > > > There's a mixup here in terminology... The QLA2460 card which you > > have does in fact support 'PCI-X 266'... > > > > > > <6>QLogic Fibre Channel HBA Driver > > > > <6>GSI 49 (level, low) -> CPU 3 (0x0300) vector 51 > > > > <6>ACPI: PCI Interrupt :0f:01.0[A] -> GSI 49 (level, low) -> IRQ 51 > > > > <6>qla2xxx :0f:01.0: Found an ISP2422, irq 51, iobase > > > > 0xc000b004 > > > > [...] > > > > <6>qla2xxx :0f:01.0: LOOP UP detected (4 Gbps). > > > > <6>qla2xxx :0f:01.0: Topology - (F_Port), Host Loop address 0x0 > > > > <6>scsi0 : qla2xxx > > > > <6>qla2xxx :0f:01.0: > > > > <4> QLogic Fibre Channel HBA Driver: 8.01.07-k3 > > > > <4> QLogic HP AB378-60001 - > > > > <4> ISP2422: PCI-X Mode 2 (133 MH4.00.26 [IP] @ :0f:01.0 hdma+, > > > > host#=0, > > > > fw=4.00.26 [IP] > > > > The 33/66/100/133 values refer to the bus-clock speed at which the > > card is operating. As is seen here (although a bit truncated -- > > separate issue, I'll try to see if I can reproduce this on one of my > > HPQ rigs), the card is inserted into a PCI-X Mode-2 capable 133MHz > > (bus clock) slot. When operating under this mode, each data-phase > > between two devices is divided into 2 sub-phases, effectively doubling > > the transfer-data-rate to 266Mhz. > > I guess the proper terminology would be 266 MT/s (Mega > Transfers/second). Looking through the PSI_SIG PCI-X 2.0 marketing > blurbs, they use MHz a lot when referring to MT/S. 
So I would still
> consider this to be a minor bug. The user wants to know the transfer
> rate, not the actual frequency of the bus. Maybe just print out the
> mode used instead, e.g., "PCI-X 266"?
[...]

To be concrete: In the Installation Manual for the rx6600 (P/N
AB464-9001A), figure 1-1 ("I/O Subsystem Block Diagram"), the names
being used are "PCIx-66", "PCIx-133", and "PCIx267". However, the text
that follows refers to "66 MHz PCI/PCI-X slots" ("PCI/PCI-X 133 MHz",
"PCI/PCI-X 266 MHz").

The Qlogic data sheet for the "ISP2422" also calls the bus interface
"64-bit, PCI-X 2.0 266-MHz DDR", and the document's subtitle is "Dual
Port 4-Gbps Fibre Channel (FC) to PCI-X 2.0 266-MHz Controller".

Confusing?

Regards,
Ulrich
Re: Q: PCI-X @ 266MHz on HP rx6600 (Qlogic 4Gb FC HBA)
On 31 Jul 2007 at 9:50, Andrew Vasquez wrote: > > On Fri, 27 Jul 2007, Andrew Patterson wrote: > > > > > On Thu, 2007-07-26 at 23:23 -0700, Andrew Vasquez wrote: > > > > > > > The 33/66/100/133 values refer to the bus-clock speed at which the > > > > card is operating. As is seen here (although a bit truncated -- > > > > separate issue, I'll try to see if I can reproduce this on one of my > > > > HPQ rigs), the card is inserted into a PCI-X Mode-2 capable 133MHz > > > > (bus clock) slot. When operating under this mode, each data-phase > > > > between two devices is divided into 2 sub-phases, effectively doubling > > > > the transfer-data-rate to 266Mhz. > > > > > > I guess the proper terminology would be 266 MT/s (Mega > > > Transfers/second). Looking through the PSI_SIG PCI-X 2.0 marketing > > > blurbs, they use MHz a lot when referring to MT/S. So I would still > > > consider this to be a minor bug. The user wants to know the transfer > > > rate, not the actual frequency of the bus. Maybe just print out the > > > mode used instead, e.g., "PCI-X 266"? > > Given PCI-X Mode-2 can run at different bus-clock speeds, how about > this as an alternative? > > PCI-X 266 (133Mhz) To pick up the idea, why not "133" and "133x2" (DDR, Dual Data Rate)? > > it's a bit more descriptive than > > PCI-X Mode 2 (133Mhz) > > then again, I don't want to beat this thing to death... 
>
> ---
>
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index c488996..26f7e54 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -283,9 +283,9 @@ qla24xx_pci_info_str(struct scsi_qla_host *ha, char *str)
>         } else {
>                 strcat(str, "-X ");
>                 if (pci_bus & BIT_2)
> -                       strcat(str, "Mode 2");
> +                       strcat(str, "266");
>                 else
> -                       strcat(str, "Mode 1");
> +                       strcat(str, "133");
>                 strcat(str, " (");
>                 strcat(str, pci_bus_modes[pci_bus & ~BIT_2]);
>         }
2.6.19: slight performance optimization for lib/string.c's strstrip()
Hi,

my apologies for disobeying all the rules for submitting patches, but
I'll suggest a performance optimization for strstrip() in lib/string.c:

Original routine:

char *strstrip(char *s)
{
        size_t size;
        char *end;

        size = strlen(s);

        if (!size)
                return s;

        end = s + size - 1;
        while (end >= s && isspace(*end))
                end--;
        *(end + 1) = '\0';

        while (*s && isspace(*s))
                s++;

        return s;
}
EXPORT_SYMBOL(strstrip);

Suggested replacement:

char *strstrip(char *s)
{
        size_t size;
        char *end;

        while (*s && isspace(*s))
                s++;
        if (!*s)
                return s;

        size = strlen(s);
        end = s + size - 1;
        while (end > s && isspace(*end))
                end--;
        *(end + 1) = '\0';

        return s;
}
EXPORT_SYMBOL(strstrip);

Comments: There's no need to scan the initial blanks at the start of the
string twice (using strlen(), and then looking for initial blanks), and
we know that the first character of the string cannot be a blank when we
are removing trailing blanks after having removed leading blanks. Also
we do not need to call strlen() to detect an empty string.

Regards,
Ulrich
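The two routines can be compared side by side in userspace. The following is a sketch for experimenting, not a kernel patch: the function names strstrip_orig and strstrip_new are made up here, EXPORT_SYMBOL is dropped, and the argument to isspace() is cast to unsigned char as userspace C requires.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Userspace copy of the original routine quoted in the message. */
static char *strstrip_orig(char *s)
{
    size_t size;
    char *end;

    assert(s != NULL);
    size = strlen(s);
    if (!size)
        return s;

    end = s + size - 1;
    while (end >= s && isspace((unsigned char)*end))
        end--;
    *(end + 1) = '\0';

    while (*s && isspace((unsigned char)*s))
        s++;
    return s;
}

/* Userspace copy of the suggested replacement. */
static char *strstrip_new(char *s)
{
    size_t size;
    char *end;

    assert(s != NULL);
    while (*s && isspace((unsigned char)*s))
        s++;
    if (!*s)
        return s;

    size = strlen(s);
    end = s + size - 1;
    while (end > s && isspace((unsigned char)*end))
        end--;
    *(end + 1) = '\0';
    return s;
}
```

Both return the same stripped string for ordinary input. One observable difference shows up for input consisting only of whitespace: the original writes a terminator at the start of the buffer and returns the start, while the replacement returns a pointer to the existing terminator at the end and leaves the buffer unmodified. The result is an empty string either way, but a caller that keeps using the original pointer would see different contents.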
TCP checksum change in RPC replies within XEN, NFS lockup (SLES10)
Hello,

my apologies for not being sure whom to report this problem to, but it
is very strange. Let me tell the story:

I'm using XEN (3.0.2) with SLES10 (x86_64, SunFire X4100). On one
machine I have three virtual machines ("DomU") that are configured
almost identically (SLES10 x86_64 also). There is also a SLES9 (i386)
acting as a multi-homed NFS server.

I can mount and access a read-only NFS filesystem on the server from
Dom0 (hypervisor), and from two of the three DomUs without any problem,
but from the third DomU mount hangs and is unkillable (except kill -9).
This is how I started to debug the problem.

To make things short: I haven't found the solution, but I found some
problems:

Running tcpdump/Ethereal on the client (inside DomU), and on the NFS
server, I found out that a significant number (almost all) of RPC reply
packets have an invalid TCP checksum on the NFS server, but not on the
NFS client.

When actually comparing the packets, I found that only the checksum is
different. Example:

--- nfs-client9.txt     2006-11-29 12:56:59.176133729 +0100
+++ nfs-server8.txt     2006-11-29 12:56:25.812623453 +0100
@@ -1,10 +1,10 @@
 No.     Time            Source                Destination           Protocol Info
-      9 15:10:15.488963 132.199.176.153       132.199.177.13        Portmap  V2 DUMP Reply (Call In 7)[Unreassembled Packet]
+      8 15:10:15.497059 132.199.176.153       132.199.177.13        Portmap  V2 DUMP Reply (Call In 6)[Unreassembled Packet [incorrect TCP checksum]]
 
 0000  00 16 3e f3 45 0d 00 c0 9f 27 44 a6 08 00 45 00   ..>.E'D...E.
 0010  01 c4 d0 3f 40 00 40 06 fd be 84 c7 b0 99 84 c7   [EMAIL PROTECTED]@.
 0020  b1 0d 00 6f 94 48 89 33 9b af 3f e4 5a 65 80 18   ...o.H.3..?.Ze..
-0030  16 a0 27 8a 00 00 01 01 08 0a 5a a1 4b 92 01 2f   ..'...Z.K../
+0030  16 a0 6c ec 00 00 01 01 08 0a 5a a1 4b 92 01 2f   ..l...Z.K../
 0040  a9 da 00 00 01 8c 65 e9 c4 df 00 00 00 01 00 00   ..e.
 0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0060  00 01 00 01 86 a0 00 00 00 02 00 00 00 06 00 00

I'm NOT saying that _all_ TCP checksums are bad, but it is mostly
those RPC reply packets that seem to be affected.

OK so far. I don't know why the NFS mount is actually hanging, but the
last packet exchange seems to be:

Server sends ACK to RPC reply with bad checksum:
Transmission Control Protocol, Src Port: nfs (2049), Dst Port: 1023
(1023), Seq: 2306928188, Ack: 1069064470, Len: 0

Client receives:
Transmission Control Protocol, Src Port: nfs (2049), Dst Port: 1023
(1023), Seq: 2306928188, Ack: 1069064470, Len: 0

Some time later I see the client sending an NFS GETATTR packet, but
that's probably when the kill came in (31 seconds later).

The other odd thing is that the multi-homed NFS server has a route to
132.199.0.0 for both ethernet interfaces, but only one of those, eth0,
has IP 132.199.176.153. However when an ARP request is sent for
132.199.176.153, there are two answers:

ARP 132.199.176.153 is at 00:c0:9f:27:44:a6
ARP 132.199.176.153 is at 00:02:b3:d9:91:a7

Only the first one is correct. However that problem may be unrelated.

Back to the issue: I doubt that XEN will just overwrite the TCP checksum
of some specific RPC packets. Personally I find it much more plausible
that there is some problem in the RPC processing that causes this.

Sorry for the poor problem description. Just in case, the Novell/SUSE
kernel versions are:
Client: 2.6.16.21-0.25-xen
Server: 2.6.5-7.282-default

Upon request I could provide the packet files as well as two PDFs that
show the packet flow.

Regards,
Ulrich
P.S: I'm subscribed to xen-users, but not to the kernel list, so maybe
CC: me for kernel-list replies. Thanks!
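For anyone who wants to re-check checksums from such a capture by hand, the Internet checksum used by IP and TCP (one's-complement sum of 16-bit big-endian words, RFC 1071) is short enough to sketch in C. The function name and the "initial" parameter are choices made for this example. The sketch does not by itself verify the TCP checksum of the packet above, because that also covers a pseudo-header (addresses, protocol, TCP length) and the full segment, which is only partly shown; the partial sum of the pseudo-header would be passed in via "initial".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Internet checksum over len bytes, starting from a partial sum
 * 'initial' (0 for a plain header check).  Returns the one's
 * complement of the folded sum; a block that already contains a
 * correct checksum field therefore yields 0.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len,
                              uint32_t initial)
{
    uint32_t sum = initial;

    assert(data != NULL || len == 0);
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                        /* odd trailing byte, zero padded */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)               /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Applied to the 20 IP header bytes from the dump above (45 00 01 c4 ... b1 0d), the function returns 0, i.e. the IP header checksum fd be is consistent in both captures; only the TCP checksum field (27 8a versus 6c ec) differs.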
Antw: Re: MBR partitions slow?
>>> Mark D Rustad wrote on 31.08.2016 at 17:32 in message :
> Ulrich Windl wrote:
>
>> So without partition the throughput is about twice as high! Why?
>
> My first thought is that by starting at block 0 the accesses were aligned
> with the flash block size of the device. By starting at a partition, the
> accesses probably were not so aligned.

Hi!

Thanks for answering. Yes, you are right: Usually I use fdisk to create partitions, and that tool aligns the partitions properly. In my case YaST insisted on having a partition before creating a filesystem, so I created one within YaST, and that partition turned out to be badly aligned (I think YaST uses cfdisk internally). I'm sorry that I didn't think about that earlier! Stracing fdisk, I also learned about ioctl(BLKIOOPT) and related ioctls...

Regards,
Ulrich

>
> --
> Mark Rustad, mrus...@gmail.com
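The ioctl mentioned above can be queried directly. The following is a minimal sketch, assuming a Linux system with <linux/fs.h>; the struct and function names are hypothetical, and the alignment rule in is_aligned() is a simplification (start offset divisible by the reported granule), not a description of what fdisk does internally.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* I/O topology hints as reported by the block layer (all in bytes). */
struct io_hints {
    unsigned int physical_block;   /* BLKPBSZGET */
    unsigned int io_min;           /* BLKIOMIN */
    unsigned int io_opt;           /* BLKIOOPT, 0 if the device reports none */
    int alignment_offset;          /* BLKALIGNOFF */
};

/* Fill 'out' for the block device at 'path'; returns 0 on success, -1 on error. */
static int query_io_hints(const char *path, struct io_hints *out)
{
    int fd, rc = 0;

    assert(path != NULL && out != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (ioctl(fd, BLKPBSZGET, &out->physical_block) < 0 ||
        ioctl(fd, BLKIOMIN, &out->io_min) < 0 ||
        ioctl(fd, BLKIOOPT, &out->io_opt) < 0 ||
        ioctl(fd, BLKALIGNOFF, &out->alignment_offset) < 0)
        rc = -1;
    close(fd);
    return rc;
}

/* Simplified check: does a partition start (in bytes) fall on a granule boundary? */
static int is_aligned(unsigned long long start_bytes, unsigned int granule)
{
    return granule != 0 && start_bytes % granule == 0;
}
```

As an illustration with made-up values: a partition starting at sector 63 (63 * 512 bytes) is not aligned to a 4096-byte granule, while one starting at 1 MiB is.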
Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?
Hi,

since upgrading from SLES9 SP3 to SLES10 SP1 I see segfaults reported by the kernel which seem network-related: Most notably slapd does not run any more, and my sendmail-milter based virus scanner terminates now and then with a segfault logged by the kernel.

The current kernel from SLES10 SP1 is:

# cat /proc/version
Linux version 2.6.16.53-0.8-smp ([EMAIL PROTECTED]) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #1 SMP Fri Aug 31 13:07:27 UTC 2007

The effects in syslog are:
Aug 31 15:04:40 kgate1 kernel: powersaved[10102]: segfault at 0008 rip 0042c17a rsp 7fffea55de00 error 4
Aug 31 15:14:57 kgate1 kernel: slapd[5296]: segfault at 55981000 rip 2ad35ffee46b rsp 7fff4bf58c28 error 4
Aug 31 15:17:13 kgate1 kernel: powersaved[4747]: segfault at 0008 rip 0042c17a rsp 7434e260 error 4
Aug 31 17:50:48 kgate1 kernel: slapd[5561]: segfault at 55986000 rip 2b8fa3cf3483 rsp 7fff08252808 error 4
Sep 3 09:02:04 kgate1 kernel: slapd[22654]: segfault at 55992000 rip 2afd6f7b4483 rsp 7fff3c790458 error 4
Sep 3 13:14:45 kgate1 kernel: slapd[28324]: segfault at 55962000 rip 2b5c0ae00483 rsp 7fffa1144e58 error 4
Sep 7 07:48:26 kgate1 kernel: hscan[1142]: segfault at 0003 rip 2afac0581650 rsp 41000928 error 4
Sep 7 09:12:24 kgate1 kernel: slapd[6022]: segfault at 559b3000 rip 2b1c15539483 rsp 7fff96a0c978 error 4
Sep 10 17:02:35 kgate1 kernel: hscan[6795]: segfault at 0004 rip 2b59c0300650 rsp 42002928 error 4
Sep 11 08:43:43 kgate1 kernel: hscan[3456]: segfault at 0004 rip 2adcd625d650 rsp 43004928 error 4
Sep 11 10:45:38 kgate1 kernel: hscan[28343]: segfault at 0003 rip 2b17020de650 rsp 42803928 error 4

I know that this kind of report is not very helpful to you guys, but Novell does not allow me to report a kernel bug directly. (I've told the person who may do so, but I'm unsure whether something is in progress already.)

Also note that the i586 (32-bit, non-SMP) kernel does not have that problem.
Linux version 2.6.16.53-0.8-default ([EMAIL PROTECTED]) (gcc version 4.1.2 20070115 (prerelease) (SUSE Linux)) #1 Fri Aug 31 13:07:27 UTC 2007

Regards,
Ulrich
Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?
On 11 Sep 2007 at 15:01, Eric Dumazet wrote:

[...]
> > Also note that the i586 (32-bit, non-SMP) kernel does not have that problem.
> > Linux version 2.6.16.53-0.8-default ([EMAIL PROTECTED]) (gcc version 4.1.2 20070115
> > (prerelease) (SUSE Linux)) #1 Fri Aug 31 13:07:27 UTC 2007
>
> Are you sure ?

Not any more ;-)

>
> segfaulting are sysloged only on 64bits kernel.
>
> Maybe your slapd/hscan processes are doing bad things, that make them
> core dump without notice on a 32bits kernel.

I'm using the sendmail milter library, which does the socket communication, so any bad things should be searched for there. I tend to think that the same program, when compiled as a 32-bit executable, does not cause these segfaults on a 64-bit kernel.

I also tried to use ksymoops to get a disassembly of the corresponding kernel code, but the result did not look good to me. Is there a deeper reason why the kernel does not provide more information (like a call trace) on segfaults?

Would an strace of the program (multi-threaded, unfortunately, and most likely slapd is as well) be helpful? When I tried it for slapd, the (rest of the) strace was:

9931 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
9931 connect(3, {sa_family=AF_INET, sin_port=htons(427), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
9931 setsockopt(3, SOL_SOCKET, SO_RCVLOWAT, [18], 4) = 0
9931 setsockopt(3, SOL_SOCKET, SO_SNDLOWAT, [18], 4) = -1 ENOPROTOOPT (Protocol not available)
9931 mmap(NULL, 1434435584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ae32000
9931 --- SIGSEGV (Segmentation fault) @ 0 (0) ---

Regards,
Ulrich
Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?
On 11 Sep 2007 at 15:01, Eric Dumazet wrote:

> On Tue, 11 Sep 2007 11:30:38 +0200
> "Ulrich Windl" <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > since upgrading from SLES9 SP3 to SLES10 SP1 I see kernel segfaults which seem
> > network-related: Most notably slapd does not run any more, and my sendmail-milter
> > based virus scanner terminates now and then with kernel segfault.
> >
> > Current kernel form SLES10 SP1 is:
> >
> > # cat /proc/version
> > Linux version 2.6.16.53-0.8-smp ([EMAIL PROTECTED]) (gcc version 4.1.2 20070115
> > (prerelease) (SUSE Linux)) #1 SMP Fri Aug 31 13:07:27 UTC 2007
> >
> > The effects in syslog are:
> > Aug 31 15:04:40 kgate1 kernel: powersaved[10102]: segfault at 0008 rip
> > 0042c17a rsp 7fffea55de00 error 4
[...]
> segfaulting are sysloged only on 64bits kernel.
>
> Maybe your slapd/hscan processes are doing bad things, that make them
> core dump without notice on a 32bits kernel.

A very wild guess: AFAIK SUSE distributions have recently been XENified, that is, they ship libraries that treat thread-local storage differently from the default. If these programs (powersaved, slapd, hscan) are all multithreaded, could the cause of the problem be in that area?

If not, any clues on debugging/tracing? There's a /usr/src/linux/Documentation/oops-tracing.txt, but no "segfault-tracing".

I also learned that the error code is only documented for the i386 arch (thanks to Emacs ediff):

 * error_code:
 * bit 0 == 0 means no page found, 1 means protection fault
 * bit 1 == 0 means read, 1 means write
 * bit 2 == 0 means kernel, 1 means user-mode

So the problem (error 4) looks a bit like a read through a NULL pointer, right? And the "rip" is in user space, correct?

Regards,
Ulrich
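The bit layout quoted above can be turned into a small decoder. This is a minimal sketch with hypothetical names; it covers only the three bits listed in the mail. Those three low bits are defined by the x86 page-fault error code itself, so the same reading applies to the 64-bit log lines.

```c
#include <assert.h>
#include <stdio.h>

/* Decoded view of the three low bits of an x86 page-fault error code. */
struct fault_info {
    int protection;  /* bit 0: 0 = no page found, 1 = protection fault */
    int write;       /* bit 1: 0 = read, 1 = write */
    int user;        /* bit 2: 0 = kernel mode, 1 = user mode */
};

static struct fault_info decode_fault(unsigned long error_code)
{
    struct fault_info f;

    f.protection = (error_code & 1UL) != 0;
    f.write = (error_code & 2UL) != 0;
    f.user = (error_code & 4UL) != 0;
    return f;
}

/* Write a human-readable summary into buf; returns snprintf()'s result. */
static int describe_fault(unsigned long error_code, char *buf, size_t len)
{
    struct fault_info f = decode_fault(error_code);

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s-mode %s, %s",
                    f.user ? "user" : "kernel",
                    f.write ? "write" : "read",
                    f.protection ? "protection fault" : "no page found");
}
```

For "error 4" this yields a user-mode read of an unmapped page. That is consistent with a near-NULL read for the fault addresses 0003, 0004 and 0008 in the log, while the slapd faults happen at much higher addresses.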
Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?
On 11 Sep 2007 at 17:04, Al Viro wrote:

> On Tue, Sep 11, 2007 at 05:54:38PM +0200, Ulrich Windl wrote:
>
> > If not, any clues on debugging/tracing? There's a
> > /usr/src/linux/Documentation/oops-tracing.txt, but no "segfault-tracing".
>
> That would be because it has fsck-all to do with the kernel. Get the
> coredump, then use gdb to deal with it.

OK, but why is the message there at all? I think in Windows XP the offending code and the registers are shown on such occasions. I'd say either drop the message or improve it. It's also difficult to find the code after the program is gone, because shared libraries are mapped at addresses that are no longer known.

However, I managed to get a core dump of the application, and I did modify some code. I'll report once I have results. Maybe it's "mea culpa" for my program, but powersaved and slapd are still to be examined.

Regards,
Ulrich
Formatting of /proc/meminfo
Hi!

My eyes just discovered this misalignment on x86_64 machines:
---
# cat /proc/meminfo
MemTotal:       81366016 kB
MemFree:        36484504 kB
Buffers:         1018764 kB
Cached:         38230264 kB
[...]
VmallocTotal:   34359738367 kB
VmallocUsed:       92792 kB
VmallocChunk:   34359544432 kB
HardwareCorrupted:     0 kB
DirectMap4k:    132623356 kB
DirectMap2M:           0 kB
---

It seems the very big numbers for "Vmalloc" are OK, so I suggest updating the formatting. The current code looks like this (/usr/src/linux/fs/proc/meminfo.c):
---
	seq_printf(m,
		"MemTotal:       %8lu kB\n"
		"MemFree:        %8lu kB\n"
		"Buffers:        %8lu kB\n"
[...]
---

So the field should be widened by at least three digits (%11lu kB). Maybe one could make the field width variable, depending on 32/64 bit (it would look like a waste on 32-bit platforms).

Maybe the code would be friendlier to changes if there were one seq_printf() per value. Then one could use something like seq_printf(m, "%-16s%8lu kB\n", "MemTotal:", K(i.totalram)) instead, I guess... You could also put the format (%-16s%8lu kB\n) in a constant, so that a change in one place affects every item...

gcc will probably optimize the code anyway, so there won't be much difference regarding performance.

Regards,
Ulrich Windl
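The "one format constant" idea can be tried in user space. This is only a sketch: it uses snprintf() instead of seq_printf(), the names MEMINFO_FMT and format_meminfo_line are hypothetical, and unsigned long long stands in for the kernel's unsigned long so that the large values also fit on 32-bit builds.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Single place to change the layout: 16-column label, 11-digit value. */
#define MEMINFO_FMT "%-16s%11llu kB\n"

/* Format one line into buf; returns the number of characters snprintf()
 * would have written (excluding the terminating NUL). */
static int format_meminfo_line(char *buf, size_t len,
                               const char *label, unsigned long long kb)
{
    assert(buf != NULL && label != NULL && len > 0);
    return snprintf(buf, len, MEMINFO_FMT, label, kb);
}
```

With this format every value of up to 11 digits ends in the same column. Labels longer than 15 characters, such as "HardwareCorrupted:", would still push their value to the right, so the label width would need the same treatment.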
heads up: clock_gettime(CLOCK_MONOTONIC, ...) is not monotonic with Xen PVM
[...]", __func__, deltas[u].tv_nsec);
	printf("%s: largest delta is 0.%09ld\n", __func__,
	       deltas[CLOCK_SAMPLES - 2].tv_nsec);
	if ( deltas[u].tv_nsec > tsp->tv_nsec )
		tsp->tv_nsec = deltas[0].tv_nsec;
leave:
	return(result);
}

/* main */
int main(int argc, char *argv[])
{
	int result = 0;
	ts_t res;

	get_res(&res);
	printf("resolution is 0.%09ld\n", res.tv_nsec);
	return(result);
}

- it is intentional that the program aborts on error

Output from a newer machine:
get_res: resolution is 0.1
get_res: smallest delta is 0.00030
get_res: largest delta is 0.00050
resolution is 0.00030

Regards,
Ulrich Windl
(Keep me on CC if I should read your replies)
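Since the program in this post survives only as a fragment, here is a separate minimal sketch of how the claim in the subject line could be checked: sample CLOCK_MONOTONIC repeatedly and count how often a reading is earlier than the previous one. The function names are hypothetical and this is not the author's test program.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Signed difference a - b in nanoseconds. */
static int64_t ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (int64_t)(a->tv_sec - b->tv_sec) * 1000000000LL
           + (int64_t)(a->tv_nsec - b->tv_nsec);
}

/* Take 'samples' consecutive readings and count how many went backwards.
 * Returns the count, or -1 if the clock could not be read. */
static long count_backward_steps(long samples)
{
    struct timespec prev, cur;
    long i, backwards = 0;

    assert(samples > 0);
    if (clock_gettime(CLOCK_MONOTONIC, &prev) != 0)
        return -1;
    for (i = 0; i < samples; i++) {
        if (clock_gettime(CLOCK_MONOTONIC, &cur) != 0)
            return -1;
        if (ts_diff_ns(&cur, &prev) < 0)
            backwards++;
        prev = cur;
    }
    return backwards;
}
```

On a correctly behaving clock the count stays at zero; any positive result would be the kind of non-monotonic behaviour the subject line refers to. Older glibc versions may need -lrt for clock_gettime().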
Q on ioctl BLKGETSIZE
Hi!

I'm wondering (on an x86_64 SLES11 system): "man 4 sd" says:
---
BLKGETSIZE
	Returns the device size in sectors. The ioctl(2) parameter should be a pointer to a long.
---

/usr/src/linux/block/ioctl.c (3.0.101-0.15) reads:
---
	case BLKGETSIZE:
		size = i_size_read(bdev->bd_inode);
		if ((size >> 9) > ~0UL)
			return -EFBIG;
		return put_ulong(arg, size >> 9);
---

Three questions:
1) Shouldn't the manual page say that the sector size is always 512 bytes, even on new disks with larger sectors?
2) Should the real sector size be used for new disks?
3) When using a 512-byte sector size, isn't the capacity limited to 2TB (2^31 kB)?

I'm not subscribed to LKML, so please keep me CC'd when answering.

Regards,
Ulrich
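The quoted kernel code shifts the byte size right by 9, so BLKGETSIZE counts 512-byte units regardless of the device's real sector size, and the result must fit into an unsigned long. The sketch below, with hypothetical function names, uses two other ioctls from <linux/fs.h> that avoid both issues: BLKGETSIZE64 (size in bytes as a 64-bit value) and BLKSSZGET (logical sector size).

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Convert a BLKGETSIZE-style count of 512-byte units into bytes. */
static uint64_t sectors512_to_bytes(uint64_t sectors)
{
    return sectors << 9;
}

/* Query size in bytes and logical sector size of a block device.
 * Returns 0 on success, -1 on error. */
static int get_device_size(const char *path, uint64_t *bytes, int *logical_sector)
{
    int fd, rc = 0;

    assert(path != NULL && bytes != NULL && logical_sector != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (ioctl(fd, BLKGETSIZE64, bytes) < 0 ||
        ioctl(fd, BLKSSZGET, logical_sector) < 0)
        rc = -1;
    close(fd);
    return rc;
}
```

With a 32-bit unsigned long the largest value BLKGETSIZE can return is 2^32 - 1 units, which is just under 2 TiB; on x86_64 the unsigned long is 64 bits wide, so the EFBIG branch is not reached there.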
Q: setting the process name for ps
Hi!

Currently one has to fiddle with argv[] in-place when trying to change the process name ("cmd") in Linux. However, if you want to change the thread name ("comm"), there is a syscall (prctl(PR_SET_NAME, ...)) for it.

For comparison, in HP-UX there is also a syscall to change the process name for ps:
---
#include <sys/pstat.h>

union pstun psu;

psu.pst_command = "foobar";
pstat(PSTAT_SETCMD, psu, strlen("foobar") - 1, 0, 0);
---

To be fair, HP-UX also has syscalls to get processes, threads and arguments:
pstat_getlwp()
pstat_getproc()
pstat_getcommandline()

As Linux is different, I wonder whether there are any plans to provide a syscall to change the process name.

For those who aren't afraid of ugly code, here's a quick-and-dirty example of how to change the process name in Linux (apologies, you guys know, but those who Google may not):
---
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>

/* sleep long enough to inspect the process with ps */
static int delay(void)
{
	struct timespec ts;

	ts.tv_sec = 10;
	ts.tv_nsec = 0;
	return nanosleep(&ts, NULL);
}

int main(int argc, char *argv[])
{
	/* room available for the new title: argv[0] plus argv[1] */
	int l = strlen(argv[0]);

	if ( argc > 1 )
		l += 1 + strlen(argv[1]);
	if (l < 20 ) {
		printf("provide a long argument\n");
		return 1;
	}
	printf("look: unchanged\n");
	delay();
	/* overwrite the argument area in place */
	sprintf(argv[0], "proc %d", (int) getpid());
	printf("look: process title\n");
	delay();
	return 0;
}
---

As I'm not subscribed to LKML, please keep me CC'd on your replies!

Thanks & regards,
Ulrich Windl
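For the thread name ("comm") side mentioned at the top of this post, a minimal sketch of prctl(PR_SET_NAME) and prctl(PR_GET_NAME) follows. The wrapper names are hypothetical. The kernel stores at most 15 characters plus the terminating NUL and silently truncates longer names.

```c
#include <assert.h>
#include <string.h>
#include <sys/prctl.h>

#define COMM_LEN 16  /* 15 characters plus NUL, as stored by the kernel */

/* Set the calling thread's name; returns 0 on success, -1 on error. */
static int set_thread_name(const char *name)
{
    assert(name != NULL);
    return prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0);
}

/* Read the calling thread's name into a buffer of at least COMM_LEN bytes. */
static int get_thread_name(char *buf)
{
    assert(buf != NULL);
    return prctl(PR_GET_NAME, (unsigned long)buf, 0, 0, 0);
}

/* Tell the caller in advance whether a name will be cut short. */
static int name_will_be_truncated(const char *name)
{
    return strlen(name) >= COMM_LEN;
}
```

The name set this way shows up in the comm field (for example `ps -o comm`), not in the argument list that the argv[] trick above rewrites.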
Q: uniqueness of pthread_t
Hi!

I'm programming a little bit with pthreads in Linux. As I understand it, pthread_t is an opaque type (a pointer address?) that cannot easily be mapped to the kernel's TID. Anyway: Is it expected that when one thread terminates and another thread is created (in fact the same thread again), the pthread_t may be reused? It seems this just happened when debugging my program (or I made a programming mistake; the code compiles clean with -Wall):
---
[...]
cleanup_thread: 1 used connection entries
5888 [^D] (null) -> 192.168.255.18/80
cleanup_thread: leaves with -1: No child processes
8407: [8413](172.20.64.67/54560) handles socket 4
cleanup_thread 139852035340032 terminated
created cleanup_thread 139852035340032
accepting connection (1 handlers)
[...]
---
libgthread-2_0-0-2.22.5-0.8.12.1 (SLES11 SP3 on x86_64)

The code handling the threads looks like this (so the results should be reliable, I guess):

	if ( tid != 0 && pthread_tryjoin_np(tid, &ret) == 0 )
	{
		DEBUG(1) printf("cleanup_thread %ld terminated\n", (long) tid);
		tid = 0;
	}
[...]
	if ( tid != 0 && pthread_tryjoin_np(tid, &ret) == 0 )
	{
		DEBUG(1) printf("cleanup_thread %ld terminated\n", (long) tid);
		tid = 0;
	}

(The code above also runs strictly sequentially.)

And while I'm at it: There's a bug in the man page:
---
PTHREAD_TRYJOIN_NP(3) Linux Programmer's Manual PTHREAD_TRYJOIN_NP(3)

NAME
	pthread_tryjoin_np, pthread_timedjoin_np - try to join with a terminated thread

SYNOPSIS
	#define _GNU_SOURCE
	#include <pthread.h>

	int pthread_tryjoin_np(pthread_t thread, void **retval);
	int pthread_timedjoin_np(pthread_t thread, void **retval, const struct timespec *abstime);

	Compile and link with -pthread.
---
You must include <pthread.h> after defining _GNU_SOURCE, and you must do this early in your includes...

I'm not subscribed to LKML, so please CC your answers to me.

Regards,
Ulrich
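The observation can be reproduced with a small experiment: run two threads one after the other, each fully joined before the next starts, and compare both the pthread_t handles and the kernel TIDs. This is a sketch with hypothetical names. POSIX allows a thread ID to be reused once the thread's lifetime has ended, and comparing a stale handle as done here is only meaningful as an experiment on glibc, where pthread_t is an integer type.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* must come before the first #include */
#endif
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thread body: store the kernel thread ID where the creator can read it. */
static void *report_tid(void *arg)
{
    *(long *)arg = syscall(SYS_gettid);
    return NULL;
}

/* Create one thread, wait for it, and hand back its handle and kernel TID. */
static int run_once(pthread_t *handle, long *kernel_tid)
{
    assert(handle != NULL && kernel_tid != NULL);
    if (pthread_create(handle, NULL, report_tid, kernel_tid) != 0)
        return -1;
    return pthread_join(*handle, NULL) == 0 ? 0 : -1;
}

/* Returns 1 if the second thread got the same pthread_t value as the
 * first (already joined) one, 0 if not, -1 on error. */
static int handle_reused(long *tid1, long *tid2)
{
    pthread_t a, b;

    if (run_once(&a, tid1) != 0 || run_once(&b, tid2) != 0)
        return -1;
    return pthread_equal(a, b) ? 1 : 0;
}
```

Whether the handle is actually reused depends on the threading library's stack caching, so either result is possible; the kernel TIDs are the more reliable way to tell two thread incarnations apart in log output. Compile and link with -pthread.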
Re: Antw: Re: Some problems with HP DL380 G8 BIOS and SLES11 SP3
>>> Don Zickus schrieb am 18.08.2014 um 14:44 in Nachricht <20140818124404.gl49...@redhat.com>: > On Mon, Aug 18, 2014 at 08:12:44AM +0200, Ulrich Windl wrote: >> >>> Don Zickus schrieb am 14.08.2014 um 19:46 in >> >>> Nachricht >> <20140814174658.gv49...@redhat.com>: >> > On Wed, Aug 13, 2014 at 05:22:17PM +0200, Ulrich Windl wrote: >> >> Hello! >> >> >> >> Running the current SLES11 SP3 kernel on a HP DL380 G8 server, there are >> > some kernel messages that indicate a bug either in the kernel or in the HP >> > BIOS. Maybe someone can explain, so I can try to get it fixed whatever > party >> > broke it... >> >> >> >> Linux kernel is "3.0.101-0.35-default (geeko@buildhost) (gcc version >> >> 4.3.4 >> > [gcc-4_3-branch revision 152973]" (latest). >> >> HP server is "HP ProLiant DL380p Gen8, BIOS P70 02/10/2014" (latest) >> > >> > Yes, it is because you are letting the firmware dynamically control your >> > cpu frequency. In order to accomplish they need to use a perf counter or >> > two, hence the conflict. Set the firmware setting to OS control and the >> > problem goes away. Contact HP for those instructions, they are very aware >> > of this problem and recommend OS control to all high end servers. >> >> Hi! >> >> Thanks for answering, but the BIOS has set power management to "OS control" > (see attachment). So I guess it must be something different. > > Hmm, sounds like it. Regardless, the error message indicates the counters > are in use most likely by the BIOS. So you can ask HP what is going on. > > I assume this is a normal bootup and not a kdump crash kernel, correct? Yes, it's a normal boot. I'm afraid the standard hardware support at HP does not care much about such issues (I remember those Xeon bugs that caused memory errors during longer idle phases (in the G7 server) that are fixed be recent microcode updates: HP changed memory modules, and they changed the board, but it took very long until they updated the BIOS). 
Is there any more information I can provide to narrow down the problem? Regards, Ulrich > > Cheers, > Don > >> >> Regards, >> Ulrich >> >> > >> > Cheers, >> > Don >> > >> >> >> >> During ACPI init I see: >> >> [...] >> >> Reserving 128MB of memory at 752MB for crashkernel (System RAM: 132095MB) >> >> ACPI: RSDP 000f4f00 00024 (v02 HP) >> >> ACPI: XSDT bddaed00 000D4 (v01 HP ProLiant 0002 322? >> > 162E) >> >> ACPI: FACP bddaee40 000F4 (v03 HP ProLiant 0002 322? >> > 162E) >> >> ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 >> > (2011041 >> >> 3/tbfadt-611) >> >> ACPI Warning: Invalid length for Pm2ControlBlock: 32, using default 8 >> > (20110413/ >> >> tbfadt-611) >> >> ACPI: DSDT bddaef40 026DC (v01 HP DSDT 0001 INTL >> > 20030228) >> >> ACPI: FACS bddac140 00040 >> >> ACPI: SPCR bddac180 00050 (v01 HP SPCRRBSU 0001 322? >> > 162E) >> >> ACPI: MCFG bddac200 0003C (v01 HP ProLiant 0001 >> > ) >> >> [...] >> >> >> >> HPET id 0 under DRHD base 0xf4ffe000 >> >> BIOS requests to not use x2apic >> >> Use 'intremap=no_x2apic_optout' to override BIOS request >> >> Enabled IRQ remapping in xapic mode >> >> x2apic not enabled, IRQ remapping is in xapic mode >> >> Switched APIC routing to physical flat. >> >> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 >> >> CPU0: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz stepping 04 >> >> Performance Events: PEBS fmt1+, 16-deep LBR, IvyBridge events, Broken >> >> BIOS >> > detec >> >> ted, complain to your hardware vendor. >> >> [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330) >> >> Intel PMU driver. >> >> ... version:3 >> >> ... bit width: 48 >> >> ... generic registers: 4 >> >> ... value mask: >> >> ... max period: 7fff >> >> ... fixed-purpose events: 3 >> >> ... event mask: 0007000f >> >> N
Some problems with HP DL380 G8 BIOS and SLES11 SP3
Hello! Running the current SLES11 SP3 kernel on a HP DL380 G8 server, there are some kernel messages that indicate a bug either in the kernel or in the HP BIOS. Maybe someone can explain, so I can try to get it fixed whatever party broke it... Linux kernel is "3.0.101-0.35-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973]" (latest). HP server is "HP ProLiant DL380p Gen8, BIOS P70 02/10/2014" (latest) During ACPI init I see: [...] Reserving 128MB of memory at 752MB for crashkernel (System RAM: 132095MB) ACPI: RSDP 000f4f00 00024 (v02 HP) ACPI: XSDT bddaed00 000D4 (v01 HP ProLiant 0002 322? 162E) ACPI: FACP bddaee40 000F4 (v03 HP ProLiant 0002 322? 162E) ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 (2011041 3/tbfadt-611) ACPI Warning: Invalid length for Pm2ControlBlock: 32, using default 8 (20110413/ tbfadt-611) ACPI: DSDT bddaef40 026DC (v01 HP DSDT 0001 INTL 20030228) ACPI: FACS bddac140 00040 ACPI: SPCR bddac180 00050 (v01 HP SPCRRBSU 0001 322? 162E) ACPI: MCFG bddac200 0003C (v01 HP ProLiant 0001 ) [...] HPET id 0 under DRHD base 0xf4ffe000 BIOS requests to not use x2apic Use 'intremap=no_x2apic_optout' to override BIOS request Enabled IRQ remapping in xapic mode x2apic not enabled, IRQ remapping is in xapic mode Switched APIC routing to physical flat. ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 CPU0: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz stepping 04 Performance Events: PEBS fmt1+, 16-deep LBR, IvyBridge events, Broken BIOS detec ted, complain to your hardware vendor. [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330) Intel PMU driver. ... version:3 ... bit width: 48 ... generic registers: 4 ... value mask: ... max period: 7fff ... fixed-purpose events: 3 ... event mask: 0007000f NMI watchdog enabled, takes one hw-pmu counter. Booting Node 0, Processors #1 [...] 
pci:00: Requesting ACPI _OSC control (0x1d) pci:00: ACPI _OSC request failed (AE_SUPPORT), returned control mask: 0x00 ACPI _OSC control for PCIe not granted, disabling ASPM [...] pci:20: Requesting ACPI _OSC control (0x1d) pci:20: ACPI _OSC request failed (AE_SUPPORT), returned control mask: 0x00 ACPI _OSC control for PCIe not granted, disabling ASPM [...] Regards, Ulrich P.S. Please CC: me, as I'm not on LKML... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Antw: Re: Some problems with HP DL380 G8 BIOS and SLES11 SP3
>>> Don Zickus schrieb am 14.08.2014 um 19:46 in Nachricht <20140814174658.gv49...@redhat.com>: > On Wed, Aug 13, 2014 at 05:22:17PM +0200, Ulrich Windl wrote: >> Hello! >> >> Running the current SLES11 SP3 kernel on a HP DL380 G8 server, there are > some kernel messages that indicate a bug either in the kernel or in the HP > BIOS. Maybe someone can explain, so I can try to get it fixed whatever party > broke it... >> >> Linux kernel is "3.0.101-0.35-default (geeko@buildhost) (gcc version 4.3.4 > [gcc-4_3-branch revision 152973]" (latest). >> HP server is "HP ProLiant DL380p Gen8, BIOS P70 02/10/2014" (latest) > > Yes, it is because you are letting the firmware dynamically control your > cpu frequency. In order to accomplish they need to use a perf counter or > two, hence the conflict. Set the firmware setting to OS control and the > problem goes away. Contact HP for those instructions, they are very aware > of this problem and recommend OS control to all high end servers. Hi! Thanks for answering, but the BIOS has set power management to "OS control" (see attachment). So I guess it must be something different. Regards, Ulrich > > Cheers, > Don > >> >> During ACPI init I see: >> [...] >> Reserving 128MB of memory at 752MB for crashkernel (System RAM: 132095MB) >> ACPI: RSDP 000f4f00 00024 (v02 HP) >> ACPI: XSDT bddaed00 000D4 (v01 HP ProLiant 0002 322? > 162E) >> ACPI: FACP bddaee40 000F4 (v03 HP ProLiant 0002 322? > 162E) >> ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 > (2011041 >> 3/tbfadt-611) >> ACPI Warning: Invalid length for Pm2ControlBlock: 32, using default 8 > (20110413/ >> tbfadt-611) >> ACPI: DSDT bddaef40 026DC (v01 HP DSDT 0001 INTL > 20030228) >> ACPI: FACS bddac140 00040 >> ACPI: SPCR bddac180 00050 (v01 HP SPCRRBSU 0001 322? > 162E) >> ACPI: MCFG bddac200 0003C (v01 HP ProLiant 0001 > ) >> [...] 
>> >> HPET id 0 under DRHD base 0xf4ffe000 >> BIOS requests to not use x2apic >> Use 'intremap=no_x2apic_optout' to override BIOS request >> Enabled IRQ remapping in xapic mode >> x2apic not enabled, IRQ remapping is in xapic mode >> Switched APIC routing to physical flat. >> ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 >> CPU0: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz stepping 04 >> Performance Events: PEBS fmt1+, 16-deep LBR, IvyBridge events, Broken BIOS > detec >> ted, complain to your hardware vendor. >> [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330) >> Intel PMU driver. >> ... version:3 >> ... bit width: 48 >> ... generic registers: 4 >> ... value mask: >> ... max period: 7fff >> ... fixed-purpose events: 3 >> ... event mask: 0007000f >> NMI watchdog enabled, takes one hw-pmu counter. >> Booting Node 0, Processors #1 >> [...] >> >> pci:00: Requesting ACPI _OSC control (0x1d) >> pci:00: ACPI _OSC request failed (AE_SUPPORT), returned control mask: > 0x00 >> ACPI _OSC control for PCIe not granted, disabling ASPM >> [...] >> >> pci:20: Requesting ACPI _OSC control (0x1d) >> pci:20: ACPI _OSC request failed (AE_SUPPORT), returned control mask: > 0x00 >> ACPI _OSC control for PCIe not granted, disabling ASPM >> [...] >> >> Regards, >> Ulrich >> P.S. Please CC: me, as I'm not on LKML... >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/
FYI: unclean Intel RAID reported as "clean"
Hi!

I detected a problem with an Intel (imsm, ICH) RAID1 that is reported as "clean" by Linux, while the BIOS and Windows claimed the RAID was in state "rebuild". This was with an older kernel, and the bug has been reported to the openSUSE Bugzilla as bug #902000. Anyone interested can find the details there. I thought the problem was important enough to let you know...

Regards,
Ulrich
3.16.3: misdetected nsc-ircc version 01 in VMware?
Hello, a short note: Using the release candidate of openSUSE 13.2 (GNOME live medium), I see this when booting the kernel in VMware: Oct 15 12:07:00 linux kernel: nsc-ircc, Found chip at base=0x02e Oct 15 12:07:00 linux kernel: nsc-ircc, Wrong chip version 01 I doubt the VMware has an infrared chip emulated, so I guess the chip detection is wrong. VMware BIOS is detected as: Vendor: "Phoenix Technologies LTD" Version: "6.00" Date: "08/16/2013" The PCI-hardware looks like this: 00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01) 00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01) 00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08) 00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01) 00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 00:07.7 System peripheral: VMware Virtual Machine Communication Interface (rev 10) 00:0f.0 VGA compatible controller: VMware SVGA II Adapter 00:10.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01) 00:11.0 PCI bridge: VMware PCI bridge (rev 02) 00:15.0 PCI bridge: VMware PCI Express Root Port (rev 01) 00:15.1 PCI bridge: VMware PCI Express Root Port (rev 01) 00:15.2 PCI bridge: VMware PCI Express Root Port (rev 01) 00:15.3 PCI bridge: VMware PCI Express Root Port (rev 01) 00:15.4 PCI bridge: VMware PCI Express Root Port (rev 01) 00:15.5 PCI bridge: VMware PCI Express Root Port (rev 01) 00:15.6 PCI bridge: VMware PCI Express Root Port (rev 01) 00:15.7 PCI bridge: VMware PCI Express Root Port (rev 01) 00:16.0 PCI bridge: VMware PCI Express Root Port (rev 01) 00:16.1 PCI bridge: VMware PCI Express Root Port (rev 01) 00:16.2 PCI bridge: VMware PCI Express Root Port (rev 01) 00:16.3 PCI bridge: VMware PCI Express Root Port (rev 01) 00:16.4 PCI bridge: VMware PCI Express Root Port (rev 01) 00:16.5 PCI bridge: VMware PCI Express Root Port (rev 01) 00:16.6 PCI 
bridge: VMware PCI Express Root Port (rev 01) 00:16.7 PCI bridge: VMware PCI Express Root Port (rev 01) 00:17.0 PCI bridge: VMware PCI Express Root Port (rev 01) 00:17.1 PCI bridge: VMware PCI Express Root Port (rev 01) 00:17.2 PCI bridge: VMware PCI Express Root Port (rev 01) 00:17.3 PCI bridge: VMware PCI Express Root Port (rev 01) 00:17.4 PCI bridge: VMware PCI Express Root Port (rev 01) 00:17.5 PCI bridge: VMware PCI Express Root Port (rev 01) 00:17.6 PCI bridge: VMware PCI Express Root Port (rev 01) 00:17.7 PCI bridge: VMware PCI Express Root Port (rev 01) 00:18.0 PCI bridge: VMware PCI Express Root Port (rev 01) 00:18.1 PCI bridge: VMware PCI Express Root Port (rev 01) 00:18.2 PCI bridge: VMware PCI Express Root Port (rev 01) 00:18.3 PCI bridge: VMware PCI Express Root Port (rev 01) 00:18.4 PCI bridge: VMware PCI Express Root Port (rev 01) 00:18.5 PCI bridge: VMware PCI Express Root Port (rev 01) 00:18.6 PCI bridge: VMware PCI Express Root Port (rev 01) 00:18.7 PCI bridge: VMware PCI Express Root Port (rev 01) 03:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01) The Linxu kernel string is: Linux linux.site 3.16.3-1.gd2bbe7f-desktop #1 SMP PREEMPT Thu Sep 18 06:32:16 UTC 2014 (d2bbe7f) x86_64 x86_64 x86_64 GNU/Linux Regards, Ulrich -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Q: EDAC/kprintf/Xen issue (long logs inline)
Hi!

I have a somewhat strange issue on a Xen host running SLES11 SP3 on a HP DL380 G7 server (two 5-core Xeon 5650 CPUs): At some point the system had RAM problems, and in one case the messages seemed to overwrite each other, as seen in syslog. I wonder whether the locking of kprintf() is broken. See for yourself:

Mar 14 10:06:40 h05 kernel: [679593.489003] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489010] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489014] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489019] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489023] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489027] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:40 h05 kernel: [679593.489031] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
[...and so on...]
Mar 14 10:06:41 h05 kernel: [679593.501561] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501568] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501575] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501583] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501590] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501597] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501604] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
Mar 14 10:06:41 h05 kernel: [679593.501611] EDAC MC1: CE row 6, channel 0, label "": Corrected error (Socket=1 channel=2 dimm=0)
[...]
RFE: kernel message "rport-2:0-10: blocked FC remote port time out: removing rport"
Hi!

I'd like to point out that the following Fibre Channel error message is of little practical use, because it lacks essential information:

kernel: [ 68.406963] rport-2:0-10: blocked FC remote port time out: removing rport

(Seen in the current SLES11 SP3 kernel 3.0.101-0.40-default)

As the logical port is removed (as the message says), you cannot find out which device (i.e. WWN) the message refers to. For existing ports in /sys/class/fc_remote_ports/ you can query port_id, port_name, port_state, etc. But once the port is removed, you cannot get any information about it. In a real environment, you cannot easily guess where the problem might be, especially as there is no other message about "rport-2:0-10" before it is removed.

I suggest including the port ID (you can get the HBA WWN from that) and the port WWN (the target device port) in the error message if possible.

Regards,
Ulrich Windl
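Until the message carries that information, one workaround is to record the attributes of all remote ports while they still exist, so that a later "removing rport" message can be matched against the record. The following is only a minimal sketch: the function name `snapshot_rports` is hypothetical, and it assumes nothing beyond the sysfs directory and the three attribute names mentioned above.

```python
import os

def snapshot_rports(root="/sys/class/fc_remote_ports",
                    attrs=("port_id", "port_name", "port_state")):
    """Return {rport_name: {attribute: value}} for all remote ports present now."""
    result = {}
    if not os.path.isdir(root):
        return result          # no FC transport class on this machine
    for rport in sorted(os.listdir(root)):
        info = {}
        for attr in attrs:
            try:
                with open(os.path.join(root, rport, attr)) as f:
                    info[attr] = f.read().strip()
            except OSError:
                info[attr] = None   # attribute missing or unreadable
        result[rport] = info
    return result

if __name__ == "__main__":
    for name, info in snapshot_rports().items():
        print(name, info)
```

Run periodically (for example from cron) and logged, the output would let an administrator look up the WWN that belonged to "rport-2:0-10" after the port is gone.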
Xen. How an error message should not look like
Hi!

This is a somewhat generic subject, so please forgive me. We are having some very strange Xen problem in SLES11 SP3 (kernel 3.0.101-0.46-xen). Eventually I found out that the message

kernel: [615432.648108] vbd vbd-7-51888: 2 creating vbd structure

is not a "progress" message (some vbd structure was created), but an error message (the vbd structure was NOT created because of error "2"). So first the user has to recognize that the text actually is an error, then the user has to find out what "2" means. Interestingly, the most important information is missing (the block major and minor number of the device for vbd_create()).

When I did a little digging in the code I found this pearl of code (/usr/src/linux/drivers/xen/blkback/xenbus.c):

    switch (err) {
    case -ENOMEDIUM:
        if (!(be->blkif->vbd.type & (VDISK_CDROM | VDISK_REMOVABLE))) {
    default:
            xenbus_dev_fatal(dev, err, "creating vbd structure");
            break;
        }
        /* fall through */
    case 0:
        err = xenvbd_sysfs_addif(dev);
        if (err) {
            vbd_free(&be->blkif->vbd);
            xenbus_dev_fatal(dev, err, "creating sysfs entries");
        }
        break;
    }

vbd_create() itself does not log a message, and neither does blkdev_get_by_dev(), where the error comes from.

Regards,
Ulrich

P.S. Not subscribed to LKML, so please keep me on CC: to address me
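For readers who hit the same message: assuming the "2" is an ordinary errno value (the message itself does not say so), it can be decoded with the Python standard library. The helper name `describe_errno` is made up for this sketch.

```python
import errno
import os

def describe_errno(err):
    """Turn a (possibly negative, kernel-style) errno value into readable text."""
    code = abs(err)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"

if __name__ == "__main__":
    print(describe_errno(2))
```

On Linux, errno 2 is ENOENT, which would fit a block device that could not be found by blkdev_get_by_dev().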
Lower bound 0.05 on 15-Minute load?
Hi!

I'm not subscribed, so please CC: me on your replies.

When graphing the CPU load, I noticed that the 15-minute average never drops below 0.05, while the 5-minute and 1-minute averages do (kernel 3.0.101-0.47.52-xen of SLES11 on x86_64). Is that a known bug? Interactive calls of "uptime" seem to confirm my suspicion:

windl> uptime
10:41am up 23 days 18:49, 1 user, load average: 0.08, 0.05, 0.05
windl> uptime
10:48am up 23 days 18:56, 1 user, load average: 0.00, 0.04, 0.05
windl> cat /proc/loadavg
0.00 0.04 0.05 1/108 9704

I'll attach a sample graph.

Regards,
Ulrich
Antw: Re: Lower bound 0.05 on 15-Minute load?
>>> Martin Steigerwald wrote on 02.07.2015 at 11:26 in message <1479160.a5Vb4cJSSF@merkaba>:
> On Thursday 02 July 2015 10:50:13 Ulrich Windl wrote:
>> Hi!
>
> Hi Ulrich,
>
>> I'm not subscribed, so please CC: me for your replies.
>>
>> When graphing the CPU load, I noticed that the 15-minute average never
>> drops below 0.05, while the 5-minute load and the 1-minute load does
>> (Kernel 3.0.101-0.47.52-xen of SLES11 on x86_64).
>
> Load average is *NOT* the CPU load although this is a very common
> misconception.

I think the correlation of 1-min, 5-min and 15-min values is independent of the actual meaning of the value.

>
> Load average indicates the amount of processes that are waiting to be
> scheduled / running (which is CPU saturation) *and* those that are waiting
> uninterruptibly. You can have a high load average without much CPU
> utilization, for example by running 20 find processes on a /home on NFS.
>
> A high load can be CPU-bound but it doesn't need to be.

I knew.

>
> So a high load only can indicate that things are running more slowly, but
> not why, or well the why can be at least two things and does not need to be
> CPU.

How is that related to my complaint/question?

>
> Also the load is normalized to CPU cores.

Actually I don't think so, but that's also not related to the issue I reported. I know that HP-UX load was the average load of every CPU, while for Linux the load seemed to be the sum of all CPU loads, meaning a load of 4 is low for a 12-CPU machine. But that's all unrelated...

>
>> Is that a known bug? Interactive call of "uptime" seems to confirm my
>> suspect: windl> uptime
>> 10:41am up 23 days 18:49, 1 user, load average: 0.08, 0.05, 0.05
>> windl> uptime
>> 10:48am up 23 days 18:56, 1 user, load average: 0.00, 0.04, 0.05
>> windl> cat /proc/loadavg
>> 0.00 0.04 0.05 1/108 9704
>>
>> I'll attach a sample graph.
>
> Why should it be? As you can see in the graph you have higher spikes with 1-
> minute average. As it's just an average over one minute it more easily drops
> below 0,05. But the 5 minute and 15 minute average need more time to drop
> lower, so for it to become lower, you need longer times without spikes in
> load average.
>
> So it's natural you get "flatter" curves with higher average. Average easily
> hide things like spikes.

Actually it seems my "mathematical eye" is better than yours: I have another graph that shows the problem even more clearly (same kernel and hardware, just another machine).

Regards,
Ulrich
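One way to test the suspicion of a hard lower bound is to simulate the averaging in fixed-point arithmetic. The sketch below is not taken from the thread. It assumes the update rule and constants that kernels of that era are believed to have used (11 fractional bits, decay factors 1884, 2014 and 2037, a rounding term of half a unit, and a display offset of 1/200); all of these are assumptions to be checked against the actual source of the kernel in question.

```python
FSHIFT = 11                      # assumed number of fractional bits
FIXED_1 = 1 << FSHIFT            # fixed-point representation of 1.0
EXP = {"1min": 1884, "5min": 2014, "15min": 2037}   # assumed decay factors

def calc_load(load, exp, active=0):
    """One update step of the exponentially decaying average (assumed form)."""
    load = load * exp + active * (FIXED_1 - exp)
    load += 1 << (FSHIFT - 1)    # assumed rounding term: this creates the floor
    return load >> FSHIFT

def idle_floor(exp, start=FIXED_1):
    """Decay from `start` on a fully idle system until the value stops changing."""
    load = start
    while True:
        new = calc_load(load, exp)
        if new == load:
            return load
        load = new

def as_proc_loadavg(load):
    """Format a fixed-point value with two decimals, adding the assumed 1/200 offset."""
    x = load + FIXED_1 // 200
    return f"{x >> FSHIFT}.{((x & (FIXED_1 - 1)) * 100) >> FSHIFT:02d}"

if __name__ == "__main__":
    for name, exp in EXP.items():
        print(name, as_proc_loadavg(idle_floor(exp)))
```

Under these assumptions the three averages of an idle system settle at 0.00, 0.01 and 0.05, which would match the observed 15-minute floor: with rounding, the value can no longer shrink once the per-step decay falls below half a fixed-point unit.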
Issue with reading sysfs files
Hi! I wrote a simple tool to browse sysfs. However I noticed that there are some files having "r" (read) permission, but when you actually try to read from those, I get an I/O error. So I wonder whether the actual read was forgotten to implement, or the read permission should be gone actually. It seems to be implemented correctly in uevent, like # ll /sys/module/drm/uevent --w--- 1 root root 4096 Sep 24 12:24 /sys/module/drm/uevent but it is not (e.g.) for # ll /sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1/power/autosuspend_delay_ms -rw-r--r-- 1 root root 4096 Sep 26 14:03 /sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1/power/autosuspend_delay_ms # cat /sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1/power/autosuspend_delay_ms cat: '/sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1/power/autosuspend_delay_ms': Input/output error Here's a summary of such candidates: .../power/autosuspend_delay_ms # ll /sys/devices/pci:00/:00:03.1/:01:00.0/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sdb/queue/wbt_lat_usec -rw-r--r-- 1 root root 4096 Sep 26 14:03 /sys/devices/pci:00/:00:03.1/:01:00.0/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sdb/queue/wbt_lat_usec # cat /sys/devices/pci:00/:00:03.1/:01:00.0/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sdb/queue/wbt_lat_usec cat: '/sys/devices/pci:00/:00:03.1/:01:00.0/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sdb/queue/wbt_lat_usec': Invalid argument # ll /sys/devices/pci:00/:00:03.1/:01:00.0/resource0 -rw--- 1 root root 4096 Sep 26 14:03 /sys/devices/pci:00/:00:03.1/:01:00.0/resource0 # cat /sys/devices/pci:00/:00:03.1/:01:00.0/resource0 cat: '/sys/devices/pci:00/:00:03.1/:01:00.0/resource0': Input/output error /sys/devices/pci:00/:00:03.1/:01:00.0/resource2_wc # ll /sys/devices/pci:00/:00:03.1/:01:00.0/rom -rw--- 1 root root 262144 Sep 26 14:03 /sys/devices/pci:00/:00:03.1/:01:00.0/rom # cat /sys/devices/pci:00/:00:03.1/:01:00.0/rom cat: '/sys/devices/pci:00/:00:03.1/:01:00.0/rom': Invalid 
argument # ll /sys/devices/pci:80/:80:01.1/:81:00.0/net/em1/phys_port_id -r--r--r-- 1 root root 4096 Sep 26 14:03 /sys/devices/pci:80/:80:01.1/:81:00.0/net/em1/phys_port_id # cat /sys/devices/pci:80/:80:01.1/:81:00.0/net/em1/phys_port_id cat: '/sys/devices/pci:80/:80:01.1/:81:00.0/net/em1/phys_port_id': Operation not supported .../net/em1/phys_port_name .../net/em1/phys_switch_id # ll /sys/devices/pci:80/:80:01.2/:82:00.0/:83:00.0/graphics/fb0/bl_curve -rw-r--r-- 1 root root 4096 Sep 26 14:03 /sys/devices/pci:80/:80:01.2/:82:00.0/:83:00.0/graphics/fb0/bl_curve # cat /sys/devices/pci:80/:80:01.2/:82:00.0/:83:00.0/graphics/fb0/bl_curve cat: '/sys/devices/pci:80/:80:01.2/:82:00.0/:83:00.0/graphics/fb0/bl_curve': No such device # ll /sys/devices/pci:80/:80:08.1/:86:00.2/ata1/host1/scsi_host/host1/em_buffer -rw-r--r-- 1 root root 4096 Oct 17 15:25 /sys/devices/pci:80/:80:08.1/:86:00.2/ata1/host1/scsi_host/host1/em_buffer # cat /sys/devices/pci:80/:80:08.1/:86:00.2/ata1/host1/scsi_host/host1/em_buffer cat: '/sys/devices/pci:80/:80:08.1/:86:00.2/ata1/host1/scsi_host/host1/em_buffer': Invalid argument .../em_message # ll /sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/scsi_host/host0/fw_crash_buffer -rw-r--r-- 1 root root 4096 Sep 26 14:03 /sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/scsi_host/host0/fw_crash_buffer # cat /sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/scsi_host/host0/fw_crash_buffer cat: '/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/scsi_host/host0/fw_crash_buffer': Invalid argument # ll /sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda1/trace/act_mask -rw-r--r-- 1 root root 4096 Sep 26 14:03 /sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda1/trace/act_mask # cat /sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda1/trace/act_mask cat: '/sys/devices/pci:c0/:c0:01.1/:c1:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda1/trace/act_mask': No such device or address 
.../block/sda/sda1/trace/enable .../block/sda/sda1/trace/end_lba .../block/sda/sda1/trace/pid .../block/sda/sda1/trace/start_lba # ll /sys/devices/virtual/net/lo/duplex -r--r--r-- 1 root root 4096 Sep 24 12:30 /sys/devices/virtual/net/lo/duplex # cat /sys/devices/virtual/net/lo/duplex cat: /sys/devices/virtual/net/lo/duplex: Invalid argument # ll /sys/devices/virtual/net/lo/name_assign_type -r--r--r-- 1 root root 4096 Sep 24 12:24 /sys/devices/virtual/net/lo/name_assign_type # cat /sys/devices/virtual/net/lo/name_assign_type cat: /sys/devices/virtual/
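A check like the one described in this message can be approximated in a few lines. This is a minimal sketch, not the tool used for the listing above: the function name `readable_mode_but_read_fails` is hypothetical, and it simply walks a tree, tries to read every regular file whose mode has any read bit set, and reports the error name on failure. Note that reading arbitrary sysfs attributes can block or have side effects, so it should be pointed at a limited subtree first.

```python
import errno
import os
import stat

def readable_mode_but_read_fails(root):
    """Yield (path, errno_name) for files with a read bit whose read() fails."""
    any_read = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    for dirpath, _dirnames, filenames in os.walk(root):   # symlinks are not followed
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue                      # entry vanished meanwhile
            if not stat.S_ISREG(st.st_mode) or not st.st_mode & any_read:
                continue                      # skip links and write-only files
            try:
                with open(path, "rb") as f:
                    f.read(4096)
            except OSError as e:
                yield path, errno.errorcode.get(e.errno, str(e.errno))

if __name__ == "__main__":
    for path, err in readable_mode_but_read_fails("/sys/module"):
        print(f"{err}\t{path}")
```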
leap seconds and 61st second in minute
Hi!

I was recently following some discussion on the topic of leap seconds, and due to the basic role of time in the kernel, I'd like to send a "heads up" ("food for thought") with some proposal (not to start some useless discussion):

The UNIX time scale runs in UTC, presumably so that no time zone switches would affect it. Unfortunately leap seconds were not considered; maybe also because at that time everybody would have been happy if the system's time was correct to a few seconds. I don't know. However, leap seconds are conceptually a kind of "time zone switch". Unfortunately the UNIX time scale does not consider them (as noted in the time(2) manual page). That's one part of POSIX.

The other part of POSIX claims that during an inserted leap second there should be a 61st second in the minute. Unfortunately (AFAIK) there's no interface from the kernel's leap second handling to glibc allowing it to actually return 60 as the number of seconds. (localtime(3) only gets a pointer to time_t)

OTOH the NTPv4 clock model includes a TAI offset (which can be updated by NTP). AFAIK the kernel has also had the timezone offset for some time, to handle RTCs that run on local time. Obviously, if the kernel knew the number of leap seconds (the correction to the time() time scale), conversion from the UNIX time scale to TAI would be easy.

So roughly: if the kernel exported the time and type of the next scheduled leap second, some future localtime could actually return the 61st second in the minute. As it is now, applications will all see some magic duplication of the 60th second. (Maybe that's why Google does "leap smear".) If the kernel API also exported the TAI offset, one could even offer a TAI-based time, or, maybe even better: the kernel could run on TAI internally, and convert to the UNIX time scale as needed.

I'll leave exact specification and implementation to the really clever guys.

Regards,
Ulrich

P.S. Not subscribed to the kernel-list so if you want to talk to me keep me on CC: preferably
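To illustrate the idea of running on a TAI-like count and converting on demand, here is a minimal sketch. It is not a proposal for the actual kernel or glibc interface: the function name `utc_from_tai` is hypothetical, the "TAI count" is simply UNIX time plus the TAI-UTC offset, and only a single leap second is handled. As example data it uses the leap second inserted at the end of 2016-12-31, when TAI-UTC went from 36 to 37 seconds.

```python
import calendar
import time

# UNIX time of the first second after the example leap second.
LEAP_UNIX = calendar.timegm((2017, 1, 1, 0, 0, 0))
OFFSET_BEFORE = 36               # TAI-UTC in seconds before that leap second

def utc_from_tai(tai, leap_unix=LEAP_UNIX, offset_before=OFFSET_BEFORE):
    """Convert a TAI-like second count to a UTC string, showing second 60."""
    leap_tai = leap_unix + offset_before       # count value of the leap second itself
    if tai < leap_tai:
        t = time.gmtime(tai - offset_before)
        sec = t.tm_sec
    elif tai == leap_tai:
        t = time.gmtime(leap_unix - 1)         # same minute as 23:59:59
        sec = 60                               # the 61st second of that minute
    else:
        t = time.gmtime(tai - offset_before - 1)
        sec = t.tm_sec
    return "%04d-%02d-%02d %02d:%02d:%02d" % (
        t.tm_year, t.tm_mon, t.tm_mday, t.tm_hour, t.tm_min, sec)

if __name__ == "__main__":
    for delta in (35, 36, 37):
        print(utc_from_tai(LEAP_UNIX + delta))
```

The three calls print 23:59:59, 23:59:60 and 00:00:00 in sequence, with no duplicated second, which is what an application could see if both the offset and the leap second time were available to the conversion routine.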
Antw: 3.0.101: "blk_rq_check_limits: over max size limit."
Hi again! Maybe someone can confirm this: If you have a device (e.g. multipath map) that limits max_sectors_kb to maybe 64, and then define an LVM LV using that multipath map as PV, the LV still has a larger max_sectors_kb. If you send a big request (read in my case), the kernel will complain: kernel: [173116.098798] blk_rq_check_limits: over max size limit. Note that this message does not give any clue to the device being involved, nor the size of the IO attempted, nor the limit of the IO size. My expectation would be that the higher layer reports back an I/O error, and the user process receives an I/O error, or, alternatively, the big request is split into acceptable chunks before passing it to the lower layers. However none of the above happens; instead the request seems to block the request queue, because later TUR-checks also fail: kernel: [173116.105701] device-mapper: multipath: Failing path 66:384. kernel: [173116.105714] device-mapper: multipath: Failing path 66:352. multipathd: 66:384: mark as failed multipathd: NAP_S11: remaining active paths: 1 multipathd: 66:352: mark as failed multipathd: NAP_S11: Entering recovery mode: max_retries=6 multipathd: NAP_S11: remaining active paths: 0 (somewhat later) multipathd: NAP_S11: sdkh - tur checker reports path is up multipathd: 66:336: reinstated multipathd: NAP_S11: Recovered to normal mode kernel: [173117.286712] device-mapper: multipath: Could not failover device 66:368: Handler scsi_dh_alua error 8. (I don't know the implications of this) Of course this error does not appear as long as all devices use the same maximum request size, but tests have shown that different SAN disk systems prefer different request sizes (as they split large requests internally to handle them in chunks anyway). 
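The mismatch described above can be looked for before it causes trouble. The following is a rough sketch only: the function names are hypothetical, and it assumes that each stacked device lists its lower devices in /sys/block/<dev>/slaves and that those are whole devices with their own queue/max_sectors_kb (as with an LV on top of a multipath map). Partitions as lower devices are simply skipped.

```python
import os

def max_sectors_kb(dev, sys_block="/sys/block"):
    """Read queue/max_sectors_kb of a whole block device."""
    with open(os.path.join(sys_block, dev, "queue", "max_sectors_kb")) as f:
        return int(f.read())

def find_limit_mismatches(dev, sys_block="/sys/block"):
    """Return (upper, upper_kb, lower, lower_kb) where upper allows larger requests."""
    mismatches = []
    slaves_dir = os.path.join(sys_block, dev, "slaves")
    slaves = sorted(os.listdir(slaves_dir)) if os.path.isdir(slaves_dir) else []
    try:
        upper_kb = max_sectors_kb(dev, sys_block)
    except (OSError, ValueError):
        return mismatches                    # no queue information for this device
    for slave in slaves:
        try:
            lower_kb = max_sectors_kb(slave, sys_block)
        except (OSError, ValueError):
            continue                         # e.g. a partition: no queue directory
        if upper_kb > lower_kb:
            mismatches.append((dev, upper_kb, slave, lower_kb))
        mismatches.extend(find_limit_mismatches(slave, sys_block))
    return mismatches

if __name__ == "__main__":
    for dev in sorted(os.listdir("/sys/block")):
        for upper, ukb, lower, lkb in find_limit_mismatches(dev):
            print(f"{upper} ({ukb} KB) sits on {lower} ({lkb} KB)")
```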
Last seen with this kernel (SLES11 SP4 on x86_64): Linux version 3.0.101-88-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 (b45f205) Regards, Ulrich >>> Ulrich Windl schrieb am 23.08.2016 um 17:03 in Nachricht <57BC65CD.D1A : >>> 161 : 60728>: > Hello! > > While performance-testing a 3PARdata StorServ 8400 with SLES11SP4, I noticed > that I/Os dropped, until everything stood still more or less. Looking into > the syslog I found that multipath's TUR-checker considered the paths (FC, > BTW) as dead. Amazingly I did not have this problem when I did read-only > tests. > > The start looks like this: > Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed > Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3 > Aug 23 14:44:58 h10 kernel: [ 880.159425] blk_rq_check_limits: over max > size limit. > Aug 23 14:44:58 h10 kernel: [ 880.159611] blk_rq_check_limits: over max > size limit. > Aug 23 14:44:58 h10 kernel: [ 880.159615] blk_rq_check_limits: over max > size limit. > Aug 23 14:44:58 h10 kernel: [ 880.159623] device-mapper: multipath: Failing > path 8:32. > Aug 23 14:44:58 h10 kernel: [ 880.186609] blk_rq_check_limits: over max > size limit. > Aug 23 14:44:58 h10 kernel: [ 880.186626] blk_rq_check_limits: over max > size limit. > Aug 23 14:44:58 h10 kernel: [ 880.186628] blk_rq_check_limits: over max > size limit. > Aug 23 14:44:58 h10 kernel: [ 880.186631] device-mapper: multipath: Failing > path 129:112. > [...] > It seems the TUR-checker does some ping-pong-like game: Paths go up and down > > Now for the Linux part: I found the relevant message in blk-core.c > (blk_rq_check_limits()). > First s/agaist/against/ in " *Such request stacking drivers should check > those requests agaist", the there's the problem that the message neither > outputs the blk_rq_sectors(), nor the blk_queue_get_max_sectors(), nor the > underlying device. 
That makes debugging somewhat difficult if you customize > the block queue settings per device as I did: > > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of > queue/rotational for FirstTest-31 (0) > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of > queue/add_random for FirstTest-31 (0) > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of > queue/scheduler for FirstTest-31 (noop) > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of > queue/max_sectors_kb for FirstTest-31 (128) > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of > queue/rotational for FirstTest-32 (0) > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of > queue/add_random for FirstTest-32 (0) > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of > queue/scheduler for FirstTest-32 (noop) > Aug 23 14:32:34 h10 blocktune: (notice) start: activated tuning of > queue/max_sectors_kb for FirstTes
Antw: 3.0.101: "blk_rq_check_limits: over max size limit."
Hi again!

An addition: Processes doing such I/O seem to be unkillable, and I also cannot change the queue parameters while this problem occurs (the process trying to write, e.g. to queue/scheduler, is also blocked). The process status of the process doing I/O looks like this:

# cat /proc/2762/status
Name: randio
State: D (disk sleep)
Tgid: 2762
Pid: 2762
PPid: 53250
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups: 0 105
Threads: 1
SigQ: 5/38340
SigPnd:
ShdPnd: 4102
SigBlk:
SigIgn:
SigCgt: 00018000
CapInh:
CapPrm:
CapEff:
CapBnd:
Cpus_allowed: ,
Cpus_allowed_list: 0-63
Mems_allowed: ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0001
Mems_allowed_list: 0
voluntary_ctxt_switches: 5
nonvoluntary_ctxt_switches: 1

Best regards,
Ulrich

>>> Ulrich Windl wrote on 07.12.2016 at 13:19 in message <5847FE66.7E4 : 161 : 60728>:
> Hi again!
>
> Maybe someone can confirm this:
> [...]
Antw: 3.0.101: "blk_rq_check_limits: over max size limit."
Hi once more!

I managed to get the call traces of involved processes:

1) The process doing read():
Dec 7 13:51:16 h10 kernel: [183809.594434] SysRq : Show Blocked State
Dec 7 13:51:16 h10 kernel: [183809.594447] task PC stack pid father
Dec 7 13:51:16 h10 kernel: [183809.594750] randio D 8801703a9d68 0 2762 53250 0x0004
Dec 7 13:51:16 h10 kernel: [183809.594758] 880100887ad8 0046 880100886010 00010900
Dec 7 13:51:16 h10 kernel: [183809.594765] 00010900 00010900 00010900 880100887fd8
Dec 7 13:51:16 h10 kernel: [183809.594772] 880100887fd8 00010900 88016bb6a280 88017670c300
Dec 7 13:51:16 h10 kernel: [183809.594778] Call Trace:
Dec 7 13:51:16 h10 kernel: [183809.594806] [] io_schedule+0x9c/0xf0
Dec 7 13:51:16 h10 kernel: [183809.594820] [] __lock_page+0x93/0xc0
Dec 7 13:51:16 h10 kernel: [183809.594834] [] truncate_inode_pages_range+0x294/0x460
Dec 7 13:51:16 h10 kernel: [183809.594845] [] __blkdev_put+0x1d7/0x210
Dec 7 13:51:16 h10 kernel: [183809.594856] [] __fput+0xb3/0x200
Dec 7 13:51:16 h10 kernel: [183809.594868] [] filp_close+0x5c/0x90
Dec 7 13:51:16 h10 kernel: [183809.594880] [] put_files_struct+0x7a/0xd0
Dec 7 13:51:16 h10 kernel: [183809.594889] [] do_exit+0x1d0/0x470
Dec 7 13:51:16 h10 kernel: [183809.594897] [] do_group_exit+0x3d/0xb0
Dec 7 13:51:16 h10 kernel: [183809.594907] [] get_signal_to_deliver+0x247/0x480
Dec 7 13:51:16 h10 kernel: [183809.594919] [] do_signal+0x71/0x1b0
Dec 7 13:51:16 h10 kernel: [183809.594928] [] do_notify_resume+0x98/0xb0
Dec 7 13:51:16 h10 kernel: [183809.594940] [] int_signal+0x12/0x17
Dec 7 13:51:16 h10 kernel: [183809.594988] [<7f64f28cbba0>] 0x7f64f28cbb9f

2) The process trying to modify the queue scheduler:
Dec 7 13:51:16 h10 kernel: [183809.594995] blocktune D 88014e048000 0 58867 1 0x
Dec 7 13:51:16 h10 kernel: [183809.595000] 880128defd18 0086 880128dee010 00010900
Dec 7 13:51:16 h10 kernel: [183809.595007] 00010900 00010900 00010900 880128deffd8
Dec 7 13:51:16 h10 kernel: [183809.595013] 880128deffd8 00010900 88012889a3c0 8801767bc1c0
Dec 7 13:51:16 h10 kernel: [183809.595019] Call Trace:
Dec 7 13:51:16 h10 kernel: [183809.595026] [] schedule_timeout+0x1b0/0x2a0
Dec 7 13:51:16 h10 kernel: [183809.595040] [] msleep+0x1d/0x30
Dec 7 13:51:16 h10 kernel: [183809.595052] [] __blk_drain_queue+0xbc/0x140
Dec 7 13:51:16 h10 kernel: [183809.595063] [] elv_quiesce_start+0x51/0x90
Dec 7 13:51:16 h10 kernel: [183809.595071] [] elevator_switch+0x4a/0x150
Dec 7 13:51:16 h10 kernel: [183809.595079] [] elevator_change+0x6d/0xb0
Dec 7 13:51:16 h10 kernel: [183809.595086] [] elv_iosched_store+0x27/0x60
Dec 7 13:51:16 h10 kernel: [183809.595096] [] queue_attr_store+0x67/0xc0
Dec 7 13:51:16 h10 kernel: [183809.595106] [] sysfs_write_file+0xcb/0x160
Dec 7 13:51:16 h10 kernel: [183809.595115] [] vfs_write+0xce/0x140
Dec 7 13:51:16 h10 kernel: [183809.595123] [] sys_write+0x53/0xa0
Dec 7 13:51:16 h10 kernel: [183809.595131] [] system_call_fastpath+0x16/0x1b
Dec 7 13:51:16 h10 kernel: [183809.595140] [<7f12b7f70c00>] 0x7f12b7f70bff

3) The process trying to read the queue scheduler:
Dec 7 13:51:16 h10 kernel: [183809.595149] cat D 880147873718 0 45053 5957 0x0004
Dec 7 13:51:16 h10 kernel: [183809.595155] 880122f7be00 0082 880122f7a010 00010900
Dec 7 13:51:16 h10 kernel: [183809.595161] 00010900 00010900 00010900 880122f7bfd8
Dec 7 13:51:16 h10 kernel: [183809.595167] 880122f7bfd8 00010900 8801154ea1c0 81a11020
Dec 7 13:51:16 h10 kernel: [183809.595174] Call Trace:
Dec 7 13:51:16 h10 kernel: [183809.595181] [] __mutex_lock_slowpath+0x160/0x1f0
Dec 7 13:51:16 h10 kernel: [183809.595189] [] mutex_lock+0x1a/0x40
Dec 7 13:51:16 h10 kernel: [183809.595196] [] queue_attr_show+0x49/0xb0
Dec 7 13:51:16 h10 kernel: [183809.595203] [] sysfs_read_file+0xfe/0x1c0
Dec 7 13:51:16 h10 kernel: [183809.595212] [] vfs_read+0xc7/0x130
Dec 7 13:51:16 h10 kernel: [183809.595219] [] sys_read+0x53/0xa0
Dec 7 13:51:16 h10 kernel: [183809.595226] [] system_call_fastpath+0x16/0x1b
Dec 7 13:51:16 h10 kernel: [183809.595235] [<7fb04560dba0>] 0x7fb04560db9f

>>> Ulrich Windl wrote on 07.12.2016 at 13:23 in message <5847FF5E.7E4 : 161 : 60728>:
> Hi again!
>
> An addition: Processes doing such I/O seem to be unkillable, [...]
MBR partitions slow?
Hello!

(I'm not subscribed to this list, but I'm hoping to get a reply anyway)

While testing some SAN storage system, I needed a utility to erase disks quickly. I wrote my own one that mmap()s the block device, memset()s the area, then msync()s the changes, and finally close()s the file descriptor.

On one disk I had a primary MBR partition spanning the whole disk, like this (output from some of my obscure tools):

disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512 (10737418240 bytes)
partition 1 (1-20971520)
Total Sectors = 20971519

When wiping, I started (for no good reason) to wipe partition 1, then I wiped the whole disk. The disk is 4-way multipathed to an 8Gb FC-SAN, and the disk system is all-SSD (32x2TB). Using kernel 3.0.101-80-default of SLES11 SP4. For the test I had reduced the amount of RAM via "mem=4G". The machine's RAM bandwidth is about 9GB/s.

To my surprise I found out that the partition eats significant performance (not quite 50%, but a lot):

### Partition
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
time for fstat(): 0.17s
time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7Gib) at 0x7fbc86739000: 0.39s
time to zap 10.7Gib: 52.474054s (204.62 MiB/s)
time to sync 10.7Gib: 4.148350s (2588.36 MiB/s)
time to unmap 10.7Gib at 0x7fbc86739000: 0.052170s
time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s

### Whole disk
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
time for fstat(): 0.61s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 0x7fa2434cc000: 0.37s
time to zap 10.7Gib: 24.580162s (436.83 MiB/s)
time to sync 10.7Gib: 1.097502s (9783.51 MiB/s)
time to unmap 10.7Gib at 0x7fa2434cc000: 0.052385s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s

Reproducible:
h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
time for fstat(): 0.65s
time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at 0x7f1cc17ab000: 0.37s
time to zap 10.7Gib: 24.624000s (436.06 MiB/s)
time to sync 10.7Gib: 1.199741s (8949.79 MiB/s)
time to unmap 10.7Gib at 0x7f1cc17ab000: 0.069956s
time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s

So without partition the throughput is about twice as high! Why?

Regards
Ulrich
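The flashzap source is not part of this thread. The sequence it is described as using (map, fill with zeros, sync, close) can be sketched as follows; the function name `zap` and the 1 MiB chunk size are choices made for this illustration only. The sketch destroys the contents of whatever path it is given, so it should only be tried on a scratch file or a disk that may be erased.

```python
import mmap
import os
import time

def zap(path):
    """Overwrite `path` with zero bytes through a shared mapping.

    Returns (seconds spent filling, seconds spent syncing).
    """
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)      # size of a file or block device
        with mmap.mmap(fd, size) as m:           # shared, read/write mapping
            zeros = bytes(1 << 20)               # 1 MiB of zero bytes
            t0 = time.monotonic()
            for off in range(0, size, len(zeros)):
                n = min(len(zeros), size - off)
                m[off:off + n] = zeros[:n]       # the memset() step
            t1 = time.monotonic()
            m.flush()                            # the msync() step
            t2 = time.monotonic()
    finally:
        os.close(fd)
    return t1 - t0, t2 - t1
```

Timing the fill and the sync separately, as the output above does, helps to tell page-cache effects from the actual write-out to the device.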
Antw: MBR partitions slow?
Update: I found out the bad performance was caused by partition alignment, and not by the pertition per se (YaST created the partition next to the MBR). I compared two partitions, number one badly aligned, and number 2 properly aligned. Then I got these results: Disk /dev/disk/by-id/dm-name-FirstTest-32: 10.7 GB, 10737418240 bytes 64 heads, 32 sectors/track, 10240 cylinders, total 20971520 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 16384 bytes / 16777216 bytes Disk identifier: 0x00016340 Device Boot Start End Blocks Id System /dev/disk/by-id/dm-name-FirstTest-32-part1 1 5242879 2621439+ 83 Linux Partition 1 does not start on physical sector boundary. /dev/disk/by-id/dm-name-FirstTest-32-part2 524288010485759 2621440 83 Linux h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1 time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.21s time for fstat(): 0.60s time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 2684.4MiB) at 0x7f826a8a1000: 0.38s time to zap 2684.4MiB: 11.734121s (228.76 MiB/s) time to sync 2684.4MiB: 3.515991s (763.47 MiB/s) time to unmap 2684.4MiB at 0x7f826a8a1000: 0.038104s time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.673100s h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part2 time to open /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.20s time for fstat(): 0.69s time to map /dev/disk/by-id/dm-name-FirstTest-32_part2 (size 2684.4MiB) at 0x7fe18823e000: 0.44s time to zap 2684.4MiB: 4.861062s (552.22 MiB/s) time to sync 2684.4MiB: 0.811360s (3308.47 MiB/s) time to unmap 2684.4MiB at 0x7fe18823e000: 0.038380s time to close /dev/disk/by-id/dm-name-FirstTest-32_part2: 0.265687s So the correctly aligned partition is two to three times faster than the badly aligned partition (write-only case), and it's about the performance of an unpartitioned disk. 
Regards,
Ulrich

>>> Ulrich Windl wrote on 30.08.2016 at 11:32 in message <57C552B6.33D : 161 : 60728>:
> Hello!
>
> (I'm not subscribed to this list, but I'm hoping to get a reply anyway)
> While testing some SAN storage system, I needed a utility to erase disks
> quickly. I wrote my own one that mmap()s the block device, memset()s the
> area, then msync()s the changes, and finally close()s the file descriptor.
>
> On one disk I had a primary MBR partition spanning the whole disk, like this
> (output from some of my obscure tools):
> disk /dev/disk/by-id/dm-name-FirstTest-32 has 20971520 blocks of size 512
> (10737418240 bytes)
> partition 1 (1-20971520)
> Total Sectors = 20971519
>
> When wiping, I started (for no good reason) to wipe partition 1, then I
> wiped the whole disk. The disk is 4-way multipathed to a 8Gb FC-SAN, and the
> disk system is all-SSD (32x2TB). Using kernel 3.0.101-80-default of SLES11
> SP4.
> For the test I had reduced the amount of RAM via "mem=4G". The machine's RAM
> bandwidth is about 9GB/s.
>
> To my surprise I found out that the partition eats significant performance
> (not quite 50%, but a lot):
>
> ### Partition
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32_part1
> time to open /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.42s
> time for fstat(): 0.17s
> time to map /dev/disk/by-id/dm-name-FirstTest-32_part1 (size 10.7Gib) at
> 0x7fbc86739000: 0.39s
> time to zap 10.7Gib: 52.474054s (204.62 MiB/s)
> time to sync 10.7Gib: 4.148350s (2588.36 MiB/s)
> time to unmap 10.7Gib at 0x7fbc86739000: 0.052170s
> time to close /dev/disk/by-id/dm-name-FirstTest-32_part1: 0.770630s
>
> ### Whole disk
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.22s
> time for fstat(): 0.61s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at
> 0x7fa2434cc000: 0.37s
> time to zap 10.7Gib: 24.580162s (436.83 MiB/s)
> time to sync 10.7Gib: 1.097502s (9783.51 MiB/s)
> time to unmap 10.7Gib at 0x7fa2434cc000: 0.052385s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.290470s
>
> Reproducible:
> h10:~ # ./flashzap -f -s /dev/disk/by-id/dm-name-FirstTest-32
> time to open /dev/disk/by-id/dm-name-FirstTest-32: 0.39s
> time for fstat(): 0.65s
> time to map /dev/disk/by-id/dm-name-FirstTest-32 (size 10.7Gib) at
> 0x7f1cc17ab000: 0.37s
> time to zap 10.7Gib: 24.624000s (436.06 MiB/s)
> time to sync 10.7Gib: 1.199741s (8949.79 MiB/s)
> time to unmap 10.7Gib at 0x7f1cc17ab000: 0.069956s
> time to close /dev/disk/by-id/dm-name-FirstTest-32: 0.327232s
>
> So without partition the throughput is about twice as high! Why?
>
> Regards
> Ulrich
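The mmap(), memset(), msync(), close() sequence described in the quoted message could look roughly like the sketch below. This is not the flashzap source, which was not posted. The function name zap_path, the fill-byte parameter, and the use of lseek() to determine the size are assumptions made for illustration, and the timing output of the real tool is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch: overwrite a whole block device (or regular file)
 * with one byte value through a shared mapping.
 * Returns 0 on success, -1 on any error.
 */
int zap_path(const char *path, int fill)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Assumption: take the size from lseek(SEEK_END), which on Linux
     * works for block devices as well as regular files. */
    off_t size = lseek(fd, 0, SEEK_END);
    if (size <= 0) {
        close(fd);
        return -1;
    }

    /* MAP_SHARED so that stores reach the device, not a private copy. */
    void *map = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(map, fill, (size_t)size);              /* the "zap" step  */
    int rc = msync(map, (size_t)size, MS_SYNC);   /* the "sync" step */

    if (munmap(map, (size_t)size) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

In this form the memset() step mostly dirties page-cache pages, and the actual writes to the device are issued by kernel writeback and by msync().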
ioprio_set() & IOPRIO_WHO_PROCESS: Rename?
Hi!

I noticed that older manual pages for ioprio_set(2) say that IOPRIO_WHO_PROCESS modifies the process, while I think it should be per thread. Newer manual pages say it is per thread, but shouldn't IOPRIO_WHO_PROCESS then be declared obsolete and replaced with a new IOPRIO_WHO_THREAD? (i.e. #define IOPRIO_WHO_PROCESS IOPRIO_WHO_THREAD /* and #define IOPRIO_WHO_THREAD */)

Regards,
Ulrich
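To illustrate the naming question, here is a minimal sketch of how a program could use such an alias today. IOPRIO_WHO_THREAD is hypothetical and does not exist in any kernel or libc header. Because only the old name exists, the sketch defines the new name in terms of the old one, the reverse of the direction proposed above. The constants are defined locally with the values documented in ioprio_set(2), and the call goes through syscall() because glibc provides no wrapper. The helper name set_calling_thread_ioprio is also made up.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Values as documented in ioprio_set(2), defined locally. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))

enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };

/* Hypothetical alias: the name suggested in the message, not a real API. */
#define IOPRIO_WHO_THREAD IOPRIO_WHO_PROCESS

/*
 * Hypothetical helper: set the I/O priority of the calling thread.
 * A "who" argument of 0 selects the caller.
 * Returns 0 on success, -1 on error (errno set by the system call).
 */
int set_calling_thread_ioprio(int cls, int level)
{
    return (int)syscall(SYS_ioprio_set, IOPRIO_WHO_THREAD, 0,
                        IOPRIO_PRIO_VALUE(cls, level));
}
```

A caller would use it as set_calling_thread_ioprio(IOPRIO_CLASS_BE, 4). With the alias, the call site states what the newer manual pages describe: the setting applies to one thread, not to the whole process.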