Alpha "process table hang"
I've been experiencing a particular kind of hang for many versions (since the 2.3.99 days, recently seen with 2.4.1, 2.4.2, and 2.4.2-ac4) on the alpha architecture. The symptom is that any program that tries to access the process table (ps, w, top) will hang. The hang goes away by itself after anywhere from ~10 minutes to 1 hour or so. When it hangs I run ps and see that it gets halfway through the process list and hangs. The process that comes next in the list (after the hang goes away) almost always has nonsensical memory numbers, like a multi-gigabyte SIZE.

Linux draal.physics.wisc.edu 2.3.99-pre5 #8 Sun Apr 23 16:21:48 CDT 2000 alpha unknown

Gnu C                  2.96
Gnu make               3.78.1
binutils               2.10.0.18
util-linux             2.11a
modutils               2.4.5
e2fsprogs              1.18
PPP                    2.3.11
Linux C Library        2.2.1
Dynamic linker (ldd)   2.2.1
Procps                 2.0.7
Net-tools              1.54
Kbd                    0.94
Sh-utils               2.0
Modules Loaded         nfsd lockd sunrpc af_packet msdos fat pas2 sound soundcore

Has anyone else seen this? Is there a fix?

-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
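Since ps stops partway through the list, one way to find the entry it is stuck on is to read each process's stat file with a timeout. The following is a minimal sketch in Python and is not something from this thread: the names `read_with_timeout` and `find_stuck_pids`, the use of `cat` as the reader, and the 2-second timeout are all assumptions made for illustration. Each read happens in a child process so that the scanner itself cannot hang, and because a child blocked inside the kernel may not respond to a kill, the script never waits for it.

```python
import os
import subprocess
import time


def read_with_timeout(path, timeout=2.0):
    """Read `path` via a child `cat`; return its bytes, or None on timeout."""
    child = subprocess.Popen(
        ["cat", path], stdout=subprocess.PIPE, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout
    while child.poll() is None:
        if time.monotonic() > deadline:
            child.kill()          # best effort; a child stuck in the kernel may linger
            child.stdout.close()
            return None
        time.sleep(0.01)
    data = child.stdout.read()    # stat files are tiny, so the pipe cannot fill up
    child.stdout.close()
    return data


def find_stuck_pids(proc_root="/proc", timeout=2.0):
    """Return the PIDs whose stat file could not be read within `timeout`."""
    pids = sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())
    stuck = []
    for pid in pids:
        stat_path = os.path.join(proc_root, str(pid), "stat")
        if read_with_timeout(stat_path, timeout) is None:
            stuck.append(pid)
    return stuck


if __name__ == "__main__":
    for pid in find_stuck_pids():
        print("timed out reading /proc/%d/stat" % pid)
```

A process that exits between the directory listing and the read simply makes `cat` fail, which returns empty bytes rather than None, so it is not reported as stuck.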
Re: Alpha "process table hang"
Peter Rival [[EMAIL PROTECTED]] wrote:
> You wouldn't happen to have khttpd loaded as a module, would you?  I've seen
> this type of problem caused by that before...

Nope...

>  - Pete
>
> Bob McElrath wrote:
>
> > I've been experiencing a particular kind of hang for many versions
> > (since 2.3.99 days, recently seen with 2.4.1, 2.4.2, and 2.4.2-ac4) on
> > the alpha architecture.  The symptom is that any program that tries to
> > access the process table will hang. (ps, w, top) The hang will go away
> > by itself after ~10minutes - 1 hour or so.  When it hangs I run ps and
> > see that it gets halfway through the process list and hangs.  The
> > process that comes next in the list (after hang goes away) almost always
> > has nonsensical memory numbers, like multi-gigabyte SIZE.
> >
> > Linux draal.physics.wisc.edu 2.3.99-pre5 #8 Sun Apr 23 16:21:48 CDT 2000
> > alpha unknown
> >
> > Gnu C                  2.96
> > Gnu make               3.78.1
> > binutils               2.10.0.18
> > util-linux             2.11a
> > modutils               2.4.5
> > e2fsprogs              1.18
> > PPP                    2.3.11
> > Linux C Library        2.2.1
> > Dynamic linker (ldd)   2.2.1
> > Procps                 2.0.7
> > Net-tools              1.54
> > Kbd                    0.94
> > Sh-utils               2.0
> > Modules Loaded         nfsd lockd sunrpc af_packet msdos fat pas2 sound
> > soundcore
> >
> > Has anyone else seen this?  Is there a fix?
> >
> > -- Bob
> >
> > Bob McElrath ([EMAIL PROTECTED])
> > Univ. of Wisconsin at Madison, Department of Physics

-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: Alpha "process table hang"
Peter Rival [[EMAIL PROTECTED]] wrote:
> Hmpf.  Haven't seen this at all on any of the Alphas that I'm running.  What
> exact system are you seeing this on, and what are you running when it happens?

This is an LX164 system, 533 MHz. I have a hunch it's related to the X server, because I've seen it many, many times while sitting at the console (in X), but never when I'm logged on remotely. I've seen it with XFree86 3.3.6, 4.0.2, and 4.0.3, with a Matrox Millennium II video card (8MB).

I'm also experiencing regular X crashes, but the process-table-hang doesn't occur at the same time as an X crash (or vice versa). I sent a patch to [EMAIL PROTECTED] a few days ago that seemed to fix (one of) the X crashes (in the mga driver, ask if you want details).

(But since the X server shouldn't have the ability to corrupt the kernel's process list, there has to be a problem in the kernel somewhere.)

Note that this system was completely stable with 2.2 kernels.

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: Alpha "process table hang"
Well, here's the list of modules I have loaded:

nfsd                  102496   8 (autoclean)
lockd                  72976   1 (autoclean) [nfsd]
sunrpc                 87984   1 (autoclean) [nfsd lockd]
nls_iso8859-1           4160   1 (autoclean)
nls_cp437               5664   1 (autoclean)
msdos                   7728   1 (autoclean)
fat                    42784   0 (autoclean) [msdos]
pas2                   17488   1
sound                  83184   1 [pas2]
soundcore               5568   5 [sound]

Are there any known problems with these?

I have at times also used matroxfb and usb-uhci (along with visor, usb-storage), but I've seen the process-table-hang with matroxfb and usb-uhci *not* installed, so I don't think that's it. I have the above modules installed consistently at each bootup.

Der Herr Hofrat [[EMAIL PROTECTED]] wrote:
> > I've been experiencing a particular kind of hang for many versions
> > (since 2.3.99 days, recently seen with 2.4.1, 2.4.2, and 2.4.2-ac4) on
> > the alpha architecture.  The symptom is that any program that tries to
> > access the process table will hang. (ps, w, top) The hang will go away
> > by itself after ~10minutes - 1 hour or so.  When it hangs I run ps and
> > see that it gets halfway through the process list and hangs.  The
> > process that comes next in the list (after hang goes away) almost always
> > has nonsensical memory numbers, like multi-gigabyte SIZE.
> >
> I know this effect, independent of the platform, when you have a proc entry
> that is not correctly unregistered.
>
> (the code only compiles for 2.2.X, for 2.4.X you need to change
> the proc struct.)
>
> ---snip---
> #include
> #include
> #include
>
> #define BUF_LEN 1024
> struct proc_dir_entry prockill_proc_file={
>         0,
>         0,
>         "prockill",
>         S_IFREG|S_IRUGO,
>         1,
>         0,
>         0,
>         BUF_LEN,
>         NULL,
>         NULL,
>         NULL,
> };
>
> int init_module(void) {
>         printk("prockill.o registering proc entry\n");
>         return proc_register(&proc_root,&prockill_proc_file);
> }
>
> void cleanup_module(void) {
>         printk("prockill.o forgets to unregister proc entry\n");
> }
> ---snip---
> compile this as a kernel module
>
> insmod proc_kill.o
> rmmod proc_kill
>
> and the system will run without error until you do something like
>
> ls /proc/
> or
> ls -R /proc
>
> after this the system will drop dead for minutes to hours or even for good
>
> any chance you have a faulty module ??
>
> hofrat

-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: k 2.4.2; usb; handspring-visor
Erik DeBill [[EMAIL PROTECTED]] wrote:
> On Wed, Apr 11, 2001 at 08:52:30AM -0500, John Madden wrote:
> > > Apr  8 23:33:09 horus kernel: hub.c: USB new device connect on bus1/1,
> > > assigned device number 5
> > > Apr  8 23:33:12 horus kernel: usb_control/bulk_msg: timeout
> > > Apr  8 23:33:12 horus kernel: usb.c: USB device not accepting new
> > > address=5 (error=-110)
> >
> > Funny, I've been getting the same messages (on 2.4.0 and now 2.4.3) for a
> > while now, and I thought the problem was with my Visor.  (...I haven't
> > been able to sync for months...)
>
> Have you tried using the normal UHCI driver, instead of the UHCI
> Alternate Driver (JE)?  I know the "alternate" one is default from
> Linus, but it's incompatible with the usb-visor driver.  The
> maintainer said he'd patch the docs to clear up the confusion, but it
> hasn't shown up in the mainstream kernels yet.

I've also been seeing these messages (for a very, very long time), with both the uhci and usb-uhci drivers, and with many different devices (not just the visor). Usually the only way to fix it is to have the usb stuff compiled as modules, remove it all, and then re-insmod it all. Then it works again for a little while...

There really needs to be only one driver. It's just confusing.

> In my case, trying to use the visor would actually lock up the
> machine, requiring a cold boot.  Switched to the other UHCI driver
> and it works fine.

Never had to reboot though...on either x86 or alpha with UHCI...

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: Alpha "process table hang"
Alan Cox [[EMAIL PROTECTED]] wrote:
> > (But since the X server shouldn't have the ability to corrupt the
> > kernel's process list, there has to be a problem in the kernel
> > somewhere)
>
> The X server has enough privilege to corrupt anything. It's unlikely to and
> I do agree the two are likely to be unrelated.

Well, nix that idea. I just fell back to 2.2.19, and I see neither the X crash nor the process-table-hang (which rules out hardware problems, thankfully).

The X crash is also kernel related, it seems. I'm using XFree86 4.0.3 with the mga driver. It hangs in mga_storm.c on a line that looks like:

    while (MGAISBUSY()) {}

where:

    #define MGAISBUSY() (INREG8(MGAREG_Status + 2) & 0x01)

Killing and restarting X causes it to immediately hang in the same place. (I have to reboot to recover the console.)

This would seem to be PCI related. Have any significant PCI code changes been made to the alpha architecture, especially the pyxis or cabriolet code? I see that arch/alpha/kernel has been totally rearranged, but since this doesn't crash in kernel code, I have no idea how to debug it.

Thanks,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
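The loop quoted in this message spins forever if bit 0 of the status register never clears, which matches a server that hangs "in the same place" every time. As an illustration only, here is the same poll with a deadline, written in Python rather than the driver's C. The function `wait_while_busy` and its `read_status` argument are hypothetical stand-ins for the `INREG8(MGAREG_Status + 2)` read; this is not how the mga driver is written, and a timeout only reports the stuck condition, it does not explain it.

```python
import time


def wait_while_busy(read_status, timeout=1.0, poll_interval=0.001):
    """Poll until bit 0 of read_status() clears; return False on timeout."""
    deadline = time.monotonic() + timeout
    while read_status() & 0x01:
        if time.monotonic() > deadline:
            return False          # still busy: report it instead of spinning forever
        time.sleep(poll_interval)
    return True


if __name__ == "__main__":
    # Simulated register that reports busy for the first three reads.
    reads = iter([1, 1, 1, 0])
    print(wait_while_busy(lambda: next(reads)))        # True: the bit cleared in time
    print(wait_while_busy(lambda: 1, timeout=0.05))    # False: the bit never clears
```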
Re: generic rwsem [Re: Alpha "process table hang"]
Andrea Arcangeli [[EMAIL PROTECTED]] wrote:
>
> So please try to reproduce the hang with 2.4.4pre3 with those two
> patches applied:
>
>	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.4pre3aa3/00_alpha-numa-3
>	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.4pre3aa3/00_rwsem-generic-1
>
> All alpha users should run with at least the above two patches applied
> to compile their tree and to make sure to have rock solid rwsemaphores.

Excellent! I'll give it a try.

Note that I recently saw the X hang with the 2.2.19 kernel, but I still haven't seen the process-table-hang with 2.2.19 (about 4 days running with 2.2.19). It is *far* easier to get the X hang in 2.4 than 2.2. (minutes for 2.4, days for 2.2) Also note that this is not an SMP machine (single processor 21164a, LX164 mobo).

But I'll apply your patch tonight and let you know the results.

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: generic rwsem [Re: Alpha "process table hang"]
Bob McElrath [[EMAIL PROTECTED]] wrote:
> Andrea Arcangeli [[EMAIL PROTECTED]] wrote:
> >
> > So please try to reproduce the hang with 2.4.4pre3 with those two
> > patches applied:
> >
> >	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.4pre3aa3/00_alpha-numa-3
> >	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.4pre3aa3/00_rwsem-generic-1
> >
> > All alpha users should run with at least the above two patches applied
> > to compile their tree and to make sure to have rock solid rwsemaphores.
>
> Excellent!  I'll give it a try.
>
> Note that I recently saw the X hang with the 2.2.19 kernel, but I still
> haven't seen the process-table-hang with 2.2.19 (about 4 days running
> with 2.2.19).  It is *far* easier to get the X hang in 2.4 than 2.2.
> (minutes for 2.4, days for 2.2)  Also note that this is not an SMP
> machine (single processor 21164a, LX164 mobo).
>
> But I'll apply your patch tonight and let you know the results.

Status report:

I'm at 2 days uptime now, and have not seen the process-table-hang. Looks like this fixed it. Previously I would get a hang in the first day or so. I'm using your alpha-numa-3 and rwsem-generic-4 against 2.4.4pre3.

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: [parisc-linux] Re: OK, let's try cleaning up another nit. Is anyone paying attention?
Jeff Garzik [[EMAIL PROTECTED]] wrote:
> Tom Rini wrote:
> > Which does boil down to having to work with trees other than Linus or
> > Alans.  Remember, the official tree is not always the up-to-date tree,
> > or in the case of other arches, the most relevant tree.
>
> Yep.  You could even look at Linus as simply the x86 port maintainer :)
>
> Except for alpha and x86, AFAIK, most people wind up going through
> arch-specific channels to get their kernels...

This may be a dumb question, but is there some place where the arch maintainers are listed? Where the arch-specific trees are kept? Where would I go to get the latest set of relevant patches for alpha? Grepping the Documentation/ directory for "alpha" I see nothing relevant.

IMHO this should all be listed in one place. Maybe Documentation/Arch-Maintainers.txt.

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: generic rwsem [Re: Alpha "process table hang"]
Andrea Arcangeli [[EMAIL PROTECTED]] wrote:
> On Thu, Apr 19, 2001 at 11:21:17AM -0500, Bob McElrath wrote:
> > I'm at 2 days uptime now, and have not seen the process-table-hang.
> > Looks like this fixed it.  Previously I would get a hang in the first
> > day or so.  I'm using your alpha-numa-3 and rwsem-generic-4 against
> > 2.4.4pre3.
>
> good, thanks for the report.
>
> BTW, if you upgrade to 2.4.4pre4 you can apply those two patches:
>
>	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.4pre4aa1/00_alpha-numa-4
>	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.4pre4aa1/00_rwsem-generic-6
>
> really the first is not necessary anymore unless you're using a wildfire. The
> second also resurrects the optimized rwsemaphores for all archs but alpha and
> ia32.

Well, take that back, I just got it to hang. Again, this is 2.4.4pre3 with alpha-numa-3 and rwsem-generic-4. I saw it upon starting mozilla. I also saw some scary filesystem errors that may or may not be related:

Apr 23 18:09:40 draal kernel: EXT2-fs error (device sd(8,2)): ext2_new_block: Free blocks count corrupted for block group 252

There has been a lot of discussion on the topic of rwsems (that, admittedly, I haven't followed very closely). It looks like rwsem-generic-6 is the latest from Andrea. I'll build a new 2.4.4pre4 kernel with these patches and let you know the results.

Have you made changes between rwsem-generic-4 and rwsem-generic-6 that would fix/prevent a deadlock?

Let me know if there are any useful tests I could perform. Would it be useful for me to run the rwsem benchmarks you've been using? Could these detect a deadlock situation?

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: Mawanella
Mayank Vasa [[EMAIL PROTECTED]] wrote:
>
> Mawanella is one of the Sri Lanka's Muslim Village

Looks like a vbs virus to me. Thankfully, we all run linux here...sucker.

-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: es1371 and recent kernels
Pierfrancesco Caci [[EMAIL PROTECTED]] wrote:
>
> [please be kind and Cc when replying]
>
> Has someone been able to get es1371 to actually produce anything
> audible with latest kernels? The last version I could use was 2.4.0.
> Then I had some trouble but I attributed them to devfs. Now I've
> removed devfs and still I'm not able to play anything.

Works for me, but it produces all kinds of crackly noise garbage. I'm not sure if this is because the driver has a bug, or the sound card is a piece of flaming shit. But I'm inclined to believe the latter.

Anybody have a suggestion for a card that isn't a flaming piece of shit, (and not made by Creative) less than $100 US, PCI, supported by linux, and available?

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: es1371 and recent kernels
Jeff Golds [[EMAIL PROTECTED]] wrote:
> Wakko Warner wrote:
> >
> > My ES1370 has done me good.  You might want to try that card.  Yes it's a
> > creative card.  It only has a crackle running 22k 8-bit
> >
>
> It's probably better because that is the AudioPCI chip from Ensoniq
> before Creative bought them.  I thought that was a good chip, too.

Argh, I had one of those, gave it away because it would hang my alpha hard (I'm told the card is pretty nonconformant to the PCI spec). *sigh*

Nobody out there uses a non-Creative PCI card? (oh and not Aureal either)

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
aa's rwsem-generic-6 bug? Process stuck in 'R' state.
Running 2.4.4pre4 with Andrea's rwsem-generic-6 patch, I have just gotten a process stuck in the 'R' state. According to the ps man page this is: "runnable (on run queue)". The 'ps aux' output is:

USER       PID %CPU %MEM   VSZ   RSS TTY      STAT START   TIME COMMAND
root      7921  0.8 26.9 91720 68608 ?        R<   00:33  11:20 /usr/X11R6/bin/X

X is niced at -10 and doesn't respond to kill or kill -9.

alpha 21164 (ev56) architecture. kernel compiled with:
gcc version 2.96 20000731 (Red Hat Linux 7.0)

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
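A process that shows 'R' but never actually runs can be distinguished from a busy one by sampling its accumulated CPU time twice. The sketch below is an assumption-laden illustration in Python, not a tool used in this thread: the function names are made up, and the field positions (state as field 3, utime and stime as fields 14 and 15) follow the proc(5) layout of current kernels, which may not match every kernel discussed here. The check is only a heuristic, since a mostly idle process could be caught in 'R' at both samples by chance.

```python
import sys
import time


def parse_stat(line):
    """Return (pid, comm, state, utime, stime) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    rest = line[close_paren + 1:].split()
    state = rest[0]          # field 3
    utime = int(rest[11])    # field 14, in clock ticks
    stime = int(rest[12])    # field 15, in clock ticks
    return pid, comm, state, utime, stime


def cpu_ticks(pid):
    """Return (state, utime + stime) for a live process."""
    with open("/proc/%d/stat" % pid) as f:
        _, _, state, utime, stime = parse_stat(f.read())
    return state, utime + stime


def looks_starved(pid, interval=5.0):
    """True if `pid` is in state R at both samples but its CPU time did not move."""
    state1, ticks1 = cpu_ticks(pid)
    time.sleep(interval)
    state2, ticks2 = cpu_ticks(pid)
    return state1 == "R" and state2 == "R" and ticks1 == ticks2


if __name__ == "__main__":
    print(looks_starved(int(sys.argv[1])))
```

Run it with the PID from the ps listing as the only argument.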
Re: it isn't aa's rwsem-generic-6 bug but something else [Re: aa's rwsem-generic-6 bug? Process stuck in 'R' state.]
Andrea Arcangeli [[EMAIL PROTECTED]] wrote:
> On Wed, Apr 25, 2001 at 10:39:39PM -0500, Bob McElrath wrote:
> > Running 2.4.4pre4 with Andrea's rwsem-generic-6 patch, I have just
> > gotten a process stuck in the 'R' state.  According to the ps man page
> > this is: "runnable (on run queue)".  The 'ps aux' output is:
> > USER       PID %CPU %MEM   VSZ   RSS TTY      STAT START   TIME COMMAND
> > root      7921  0.8 26.9 91720 68608 ?        R<   00:33  11:20 /usr/X11R6/bin/X
> >
> > X is niced at -10 and doesn't respond to kill or kill -9.
> >
> > alpha 21164 (ev56) architecture.  kernel compiled with:
> > gcc version 2.96 20000731 (Red Hat Linux 7.0)
>
> The fact X is also part of the equation makes things even less obvious
> (now we're not even sure it's a kernel bug).

Tell me about it. But the fact remains that I never see these hangs with a 2.2 kernel. I've also futzed with X quite a bit to try and track this down, to no avail. I have tracked down some separate X bugs though. In the next iteration I'll use the mga driver from XFree86 CVS (which had some alpha-specific changes, I hear).

During this last hang I tried to get gdb to attach to the X process. gdb hung after issuing 'attach 7921', and had to be killed. My naive interpretation is that this indicates a kernel problem, and nothing to do with X.

Egad I wish this were more reproducible. Having a hang once every 3 days sucks for debugging.

> generic-rwsem-6 is a very trivial implementation and I'm pretty sure it
> is the _last_ thing that could go wrong in your equation. I mean if it
> goes wrong then it's more likely to be a bug in the spinlocks or
> whatever in the architectural part of the kernel than in the common code
> (rwsem-generic-6 was all common code btw).
>
> Furthermore the X server shouldn't really be such a heavy user of the
> rwsemaphores, as first it's not even threaded.

When I posted this bug originally, you came right out and said it was probably the rwsemaphores. I really have no idea how the rwsemaphores work, and don't know myself that they are even the problem. My process-table-hang seems consistent with something having a lock on the process table and not letting go of it. (Note in this last "hang", the process table did not hang...that is, ps dumped the entire process list without a burp)

> You can also press SYSRQ+P and get some EIP so we see a bit more what's
> going on with the X server (assuming such cpu still receives interrupt).

The CPU still receives interrupts, and other than this one X process, acts normally. (even in the process-table-hang case...as long as I don't run ps, everything is fine) I had to reboot to get rid of the hung X process though. (shutdown proceeded normally)

I'm running a debug X build at this point, and have identified at least two separate bugs in the X server that were causing hangs. I've reported these to the X people. I didn't get debug info out of the X server after the process-table-hang because X continued to behave normally during the process-table-hang.

> BTW, could you also try to compile with egcs 1.1.2 just in case? I
> learnt the hard way that for the alpha gcc 2.95.* isn't going to work
> well (I didn't try official 95.3 exactly yet, but certainly an older .3
> from the 2_95-branch of gcc cvs definitely miscompiled all my 2.4
> kernels, 2.96 with some hundred of patches [literally] is certainly
> better than 2.95.* on the alpha but egcs is definitely still worth a
> try) (personally I'm using egcs 1.1.2 for the 2.[24] alpha kernels and
> 2.95.4 (2_95-branch of cvs) for the 2.[24] x86 kernels [and gcc 3.1 for
> x86-64 ;])

I have been using egcs 1.1.2 (rh7 kgcc). Only this last hang was with a 2.96-compiled kernel (I forgot to change the makefile to use kgcc instead of gcc...then figured what the hell). The rest were with egcs 1.1.2. I'll use egcs 1.1.2 in the future.

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
Re: it isn't aa's rwsem-generic-6 bug but something else [Re: aa's rwsem-generic-6 bug? Process stuck in 'R' state.]
Andrea Arcangeli [[EMAIL PROTECTED]] wrote:
> On Thu, Apr 26, 2001 at 12:38:02AM -0500, Bob McElrath wrote:
> > When I posted this bug originally, you came right out and said it was
> > probably the rwsemaphores.  I really have no idea how the rwsemaphores
>
> You were talking about the ps table hang when I told you about the rwsem
> races. I had the same trouble on my alpha and I reproduced the races
> trivially by launching:
>
>	make MAKE='make -j2' -j2 &
>
>	while :; do ps xa ; sleep 1 ; done
>
> After a few seconds ps deadlocked. Try that on the old asm semaphores.

This does not cause a hang on my machine with your new rwsemaphores.

> It was 100% reproducible, and after I rewrote the rwsemaphores the
> deadlock went away completely.
>
> Your X hanging in R state is completely unrelated to the rwsem ps table
> hang problem as far as I can tell.

Ok, so what are the other alternatives? In the R state, the scheduler should give it some CPU at the first available jiffy, correct? After several minutes it was still stuck in the R state, and had received 0 CPU time. Could this be a scheduler bug?

Another thing I just noticed: watching the ps list, gcc is getting called with -mcpu=ev56, which in turn is calling as with -mev6. Since this is an ev56 processor, not the newer ev6, this could conceivably be generating illegal instructions, though I haven't ever seen any kernel illegal instruction faults.

*Sigh*

-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics
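With the shell loop quoted above, a deadlocked ps simply stops printing, so an unattended test gives no record of when that happened. A minimal sketch of a watchdog around the same loop follows, in Python; `run_until_hang`, the 10-second timeout, and the one-second interval are assumptions chosen for illustration and are not taken from the thread. The build load (the `make -j2` command) is expected to be started separately.

```python
import subprocess
import time


def run_until_hang(cmd, timeout=10.0, interval=1.0, max_runs=None):
    """Run `cmd` repeatedly; return the run number that timed out, or None."""
    run = 0
    while max_runs is None or run < max_runs:
        run += 1
        child = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        deadline = time.monotonic() + timeout
        while child.poll() is None:
            if time.monotonic() > deadline:
                child.kill()      # best effort; do not wait on a stuck child
                return run
            time.sleep(0.05)
        time.sleep(interval)
    return None


if __name__ == "__main__":
    # With max_runs left as None this only returns once ps stops responding.
    hung_at = run_until_hang(["ps", "xa"])
    print("ps stopped returning on run %d" % hung_at)
```

The function deliberately does not wait for the killed child, since a ps blocked inside the kernel may never exit.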
Re: Broken gcc ?
Amarendra GODBOLE [[EMAIL PROTECTED]] wrote:
> Hello World !
>
> If I recall correctly, RHL 7 shipped with a broken gcc. Has it been
> fixed ? Basically, is it safe to switch to RHL 7 for development
> purposes ? Presently I use RHL 6.2 with 2.2.14 kernel.

It "works"...sorta. It will compile the kernel. But in my development using lots of STL and C++, I've become an expert at generating "Internal compiler error" with it. YMMV.

> Apologies if this is not the proper list for this question, and yes,
> thanks in advance.

Not really...

Cheers,
-- Bob

Bob McElrath ([EMAIL PROTECTED])
Univ. of Wisconsin at Madison, Department of Physics