Re: 3ware problems
** Mike Tancsa <[EMAIL PROTECTED]> on Thu, 15 Mar 2001 14:33:29 -0500 ** in [Re: 3ware problems ] writes:

Mike> I tried yesterday to stress the machine with 25 simultaneous
Mike> bonnie -s 500 &
Mike> Although the machine was sluggish, it still worked. Similarly,
Mike> make -j12 buildworld worked. In the past when i saw a similar
Mike> bug, I could reproduce it 100% of the time this way.

Earlier today I ran 30 concurrent "bonnie -s 500" and while things were slow, no problems showed up. Right now I'm on my 7th "make -j16 buildworld" and it's working fine. After this buildworld finishes, I think I'll start up a shell script to keep 20 concurrent bonnies running overnight.

(The buildworlds are taking about 70 minutes to complete. The system is a dual PIII 400MHz with 384MB of RAM on a SuperMicro P6DBU. Not bad times.)

So far the only way I can get the problem to show up is banging on MySQL for 3-12 hours.

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-stable" in the body of the message
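The message mentions a shell script for keeping a fixed number of bonnie runs going. As a rough illustration of the same idea, here is a minimal sketch in C rather than shell (a swapped-in technique, not the script Doug used). The function name keep_running and its parameters are hypothetical; it only relies on the standard fork, execvp, and wait calls.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `total` copies of the command in argv,
 * keeping at most `concurrency` of them alive at any one time.
 * Returns the number of runs that exited with status 0, or -1 if
 * nothing could be started or wait() failed unexpectedly.
 */
int keep_running(char *const argv[], int concurrency, int total)
{
    int started = 0, running = 0, succeeded = 0;

    while (started < total || running > 0) {
        /* Top up to the requested level of concurrency. */
        while (running < concurrency && started < total) {
            pid_t pid = fork();
            if (pid < 0)
                break;              /* fork failed; reap a child, then retry */
            if (pid == 0) {
                execvp(argv[0], argv);
                _exit(127);         /* only reached if exec failed */
            }
            started++;
            running++;
        }
        if (running == 0)
            return -1;              /* could not start anything at all */

        /* Block until one child finishes, then loop to replace it. */
        int status;
        pid_t done = wait(&status);
        if (done < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        running--;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            succeeded++;
    }
    return succeeded;
}
```

Under these assumptions, a call along the lines of keep_running(cmd, 20, n) with cmd set to {"bonnie", "-s", "500", NULL} would keep 20 runs going until n have completed; the value of n is up to the caller.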
Re: 3ware problems
Doug> It takes a surprisingly long time to initialize the array.

Mike> The delay is normal. When you setup anything other than a RAID0
Mike> array, the card is actually doing work to your drives in the
Mike> background. Grab the array manager from
Mike> http://people.freebsd.org/~msmith/RAID/3ware/3dm-bsd.1.09.00.002.tar.gz
Mike> and it will notify you when it's done. You can also speed up the
Mike> initialization part a bit by setting it to a faster rebuild
Mike> time.

We finally did figure that out. The problem in this particular circumstance with the 3dm utility is that the only controller in the box is the 3ware 6400. So in order to run 3dm I need to have FreeBSD installed, and installing FreeBSD at the same time the controller is initializing the array is really slow. :)

The first time I did this I thought something was broken when I watched newfs print those duplicate super block locations. It was about 10 seconds between each block! After a search of the FreeBSD lists I found a reference to initializing the array, and just waited.

On ttyv1 the kernel issues a message when the array initialization is done, so I usually just wait for that.
Re: 3ware problems
Doug> Mike, thanks for all your help and the time you've invested in
Doug> this! What can we do to assist?

Mike> Do you have any coding experience? I can't reproduce this here,
Mike> but what I want to do is see what the command that's stuck on
Mike> the busy queue looks like.

Sure, sounds like fun.

Mike> If you can add another function like twe_printstate that invokes
Mike> twe_print_request on each of the requests on the busy queue and
Mike> let me know what they look like, that might give me some clues.

I'll do that today.

Mike> (I'd send you diffs, but I'm snowed at work and quite ill just
Mike> now 8(...)

Hope you feel better. I've never seen that smilie before, is that for projectile vomiting?
Re: 3ware problems
Mike> If you can add another function like twe_printstate that invokes
Mike> twe_print_request on each of the requests on the busy queue and
Mike> let me know what they look like, that might give me some clues.

Doug> OK, I haven't written the twe_printstate function yet, but I
Doug> think I have the request. I got the filesystem wedged first, and
Doug> then browsing the datastructures with DDB, I think I've found
Doug> the busy queue. Here's the request:

Mike> Cool, this works just as well. 8)

Doug> db> call twe_print_request(0xc1529800)
Doug> twe0: CMD: request_id 89 opcode size 7 unit 0 host_id 0
Doug> twe0: status 0 flags 0x0 count 16 sgl_offset 3
Doug> twe0: lba 264703
Doug> twe0: 0: 0xce4f000/4096
Doug> twe0: 1: 0x2ab/4096
Doug> twe0: tr_command 0xc1529800/0x1749d800 tr_data 0xcb928000/0xce4f000,8192
Doug> twe0: tr_status 2 tr_flags 0x1 tr_complete 0xc011f170 tr_private 0

Mike> Er. This is bad; tr_status == 2 means that the command has been
Mike> completed; it shouldn't still be on the busy queue. Can you
Mike> check to make sure you have the right queue here?

I am not at all positive I've got the right queue. I *think* I do. I'm trying to break it again now, and I'll use the code below to verify the queue. I'm also going to hit the kernel core with gdb to see if I can verify that.

Doug> I'm rebuilding the kernel now with the function twe_printstate,
Doug> after I figured it out with the debugger. (This reminds me of a
Doug> saying that has to do with horses and carriages, hmm.)

Mike> Hrm. It *should* be pretty easy; I'm sorry I confused you with
Mike> the 'printstate' reference; you should be able to fix up
Mike> twe_report to just dump the busy queue:
Mike>
Mike> struct twe_request *tr;
Mike> ...
Mike> TAILQ_FOREACH(tr, TAILQ_FIRST(sc->twe_busy), tr_link)
Mike>     twe_print_request(tr);

This doesn't compile for me. Every time I try to use 'sc->twe_busy' I get a syntax error:

    invalid type argument of `->'

Here is what I'm using right now:

    s = splbio();
    for (i = 0; (sc = devclass_get_softc(twe_devclass, i)) != NULL; i++) {
        twe_print_controller(sc);
        printf("ready queue: %d entries\n", sc->twe_qstat[TWEQ_READY].q_length);
        TAILQ_FOREACH(tr, &sc->twe_ready, tr_link)
            twe_print_request(tr);
        printf("busy queue: %d entries\n", sc->twe_qstat[TWEQ_BUSY].q_length);
        TAILQ_FOREACH(tr, &sc->twe_busy, tr_link)
            twe_print_request(tr);
        printf("complete queue: %d entries\n", sc->twe_qstat[TWEQ_COMPLETE].q_length);
        TAILQ_FOREACH(tr, &sc->twe_complete, tr_link)
            twe_print_request(tr);
    }
    splx(s);

This compiles, and when I run it, it doesn't crash! :) In fact, it says all the queues are empty.

Doug> Oh, btw, it took over 3 million rows to get it stuck this
Doug> time. Gotta love a test cycle of 6 hours or so. Sigh.

Mike> This is obviously a really weird case; possibly either an
Mike> extremely narrow race, or some very borderline PCI issue. One
Mike> question I should have asked, but don't recall whether you
Mike> answered; are you using an AMD K7 system by any chance? We've
Mike> seen some *very* weird behaviour with these controllers in some
Mike> K7 systems.

Yes, it *is* really weird. I can only get it to break with MySQL. Following a suggestion from Mike Tancsa, I tried lots of concurrent bonnies, and also running a buildworld with a high -j value. I let both run for about 12 hours each, with no failure. The only thing that'll kill it is MySQL. I'm confused. :(

Nope, it's not a K7. It's a SuperMicro P6DBU, with dual 400MHz CPUs.

Mike> Thanks again for your help here.

My pleasure. (Calling what we are doing 'help' is complimentary. If this is help, what you are doing for us must be close to divine intervention! :))
Re: 3ware problems
Mike> Sorry, the above code is totally bogus; I'm kinda delirious
Mike> (feverish) right now.
Mike> Try
Mike> TAILQ_FOREACH(tr, &sc->twe_busy, tr_link)

Yup, that is what I half figured out, half guessed at. (Helps to have other code to look through!)

Mike> [...] there's a pattern of some sort involved, we just don't
Mike> know what it is yet...

I do!

    (outside air temp / inside air temp) * day of the month % line voltage + wc -l /etc/motd - df -k /var/db/mysql

Oh, no. That's my IQ. :)
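For readers wondering why the & makes the difference: TAILQ_FOREACH expects a pointer to the queue head, and in this style of code the head is a struct embedded in the softc rather than a pointer. The following is a minimal stand-alone sketch of that pattern using only the <sys/queue.h> macros. The names demo_request, demo_softc, demo_init, demo_enqueue, and demo_count_busy are hypothetical and are not taken from the twe driver.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a driver request. */
struct demo_request {
    int request_id;
    TAILQ_ENTRY(demo_request) link;     /* linkage used by the TAILQ macros */
};

/* Declares struct demo_queue, a tail-queue head of demo_request. */
TAILQ_HEAD(demo_queue, demo_request);

/* Hypothetical stand-in for a softc: the head is embedded, not a pointer. */
struct demo_softc {
    struct demo_queue busy;
};

void demo_init(struct demo_softc *sc)
{
    TAILQ_INIT(&sc->busy);
}

void demo_enqueue(struct demo_softc *sc, struct demo_request *tr)
{
    TAILQ_INSERT_TAIL(&sc->busy, tr, link);
}

/* Walk the queue; note the & so the macro receives a pointer to the head. */
int demo_count_busy(struct demo_softc *sc)
{
    struct demo_request *tr;
    int n = 0;

    TAILQ_FOREACH(tr, &sc->busy, link)
        n++;
    return n;
}
```

Passing sc->busy without the & hands the macro a struct where it applies `->`, which would give the same kind of "invalid type argument of `->'" error quoted earlier in the thread.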
Re: 3ware problems
Mike> Er. This is bad; tr_status == 2 means that the command has been
Mike> completed; it shouldn't still be on the busy queue. Can you
Mike> check to make sure you have the right queue here?

Well, it looks like I had the wrong queue before. Blush. At least this time tr_status is 1. Not sure if that is good or bad though! :)

Here is the debug output:

    db> call twe_printqueues
    twe0: status 57007310
    twe0: current max
    twe0: free 0099 0100
    twe0: ready
    twe0: busy 0001 0100
    twe0: complete 0009
    twe0: bioq 0021
    twe0: AEN queue head 1 tail 0
    ready queue: 0 entries
    busy queue: 1 entries
    twe0: CMD: request_id 54 opcode size 11 unit 0 host_id 0
    twe0: status 0 flags 0x0 count 32 sgl_offset 3
    twe0: lba 10466
    twe0: 0: 0xffc4000/4096
    twe0: 1: 0x11f85000/4096
    twe0: 2: 0x12d66000/4096
    twe0: 3: 0x10e87000/4096
    twe0: tr_command 0xc1520400/0x174f4400 tr_data 0xce0f4000/0xffc4000,16384
    twe0: tr_status 1 tr_flags 0x2 tr_complete 0xc011f1b0 tr_private 0xc9260400
    complete queue: 0 entries

This was generated with the code:

    void
    twe_printqueues(void)
    {
        struct twe_softc *sc;
        struct twe_request *tr = NULL;
        int i, s;

        s = splbio();
        for (i = 0; (sc = devclass_get_softc(twe_devclass, i)) != NULL; i++) {
            twe_print_controller(sc);
            printf("ready queue: %d entries\n", sc->twe_qstat[TWEQ_READY].q_length);
            TAILQ_FOREACH(tr, &sc->twe_ready, tr_link)
                twe_print_request(tr);
            printf("busy queue: %d entries\n", sc->twe_qstat[TWEQ_BUSY].q_length);
            TAILQ_FOREACH(tr, &sc->twe_busy, tr_link)
                twe_print_request(tr);
            printf("complete queue: %d entries\n", sc->twe_qstat[TWEQ_COMPLETE].q_length);
            TAILQ_FOREACH(tr, &sc->twe_complete, tr_link)
                twe_print_request(tr);
        }
        splx(s);
    }
PR kern/109152 -- RocketPort panics
We have 3 of the 32 port Comtrol PCI cards (the original 5v only cards, not the newer universal PCI cards), and we are trying to upgrade the system with all these serial ports from FreeBSD 4.8 (ya, kinda old) to FreeBSD 6.2. We are running into what seems like a common problem with these cards: panics about non-busy devices:

    panic: device_unbusy: called for non-busy device rp0

PR kern/109152 addresses this issue, and the patch from Craig Leres added to the PR on Tue, 13 Mar 2007 19:27:01 -0700 solves the problem for me. (I was experiencing the problem with HylaFAX, but it seems easy to reproduce.)

Can this patch be applied to RELENG_6?

I've also tried the patch from John Baldwin posted to freebsd-hackers in
http://groups.google.com/group/mailing.freebsd.hackers/browse_frm/thread/883da63a8c62854d
that generates a rp_open_count for each device, without any change in the behavior: I could still panic the system by sending a FAX.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
iir + Tyan S2460 + SMP problems
We're having problems with FreeBSD 5.4, 6.0, and 6.1 and an ICP Vortex GDT8546RZ 4 port SATA RAID card in a Tyan S2460 system with dual AMD Athlon MP 1600+ CPUs. We do not have any problems with this configuration under FreeBSD 4.11, and the same ICP cards in Tyan based Opteron systems (S2882 and S4882) run without problems under FreeBSD 5.4 and 6.1.

We can reproduce the problem on two different S2460 based systems, and have tried 2 separate ICP GDT8546RZ cards, so we don't believe it is a hardware problem. (Our success with FreeBSD 4.11 also provides some evidence that our hardware is OK.)

The problem is that the system seems to stop doing any disk IO through the ICP card. Processes that don't need to page in work fine. (You can hit return in a shell, get another login: prompt on other consoles, and the like.) The system continues to respond to pings, but anything that attempts to do a disk IO simply stops. Sometimes the kernel emits messages like this:

    swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096

The test we are using to produce this "hang" is a fairly trivial expansion of a tar ball being fed via nc from another system. We run on the source system:

    tar cf - radar | nc -w 3 10.10.10.229 12345

And on the system being tested we run:

    nc -l 12345 | tar xvf -

One iteration of this test is the extraction of a 1.2 GB directory of 2,274 files.

The problem only exists with SMP kernels. With SMP, our tests almost always failed in the first iteration or two; the longest time to failure was 5 iterations. Without SMP the test ran without problems for 570 iterations over 18 hours.

We've tried a number of different tests. These tests are with a stock 6.1-RC1 kernel from the RC CDs. Unless otherwise specified, all tests are on a UFS2 filesystem with softupdates enabled and an SMP enabled GENERIC kernel.

* !SMP: Ran 570 iterations in 18 hours without a problem, test terminated by hand.
* Large (190 GB) UFS2 filesystem with soft updates enabled and SMP kernel: Fails during the first iteration.
* Medium (12 GB) UFS2 filesystem with soft updates enabled and SMP kernel: Fails during the first iteration.
* !softupdates: fails during first iteration.
* !ACPI: fails during the first iteration.
* UFS1: fails during the first iteration.
* UFS1 + !ADAPTIVE_GIANT: failed during the first iteration.
* !ADAPTIVE_GIANT: failed during the first iteration.
* Cleared motherboard CMOS: failed at the end of the second iteration.
* FULL_PREEMPTION: failed during the first iteration.
* !PREEMPTION: failed during the first iteration.
* WITNESS + WITNESS_KDB: failed during the second iteration with no witness related kernel messages and without entering the kernel debugger.
* WITNESS + INVARIANTS: failed during the fifth iteration, again w/o kernel messages.
* Motherboard BIOS "Use PCI Interrupt Entries in MP Table" set to OFF: failed during first iteration.
* Motherboard BIOS "Multiprocessor Specification" set from 1.4 to 1.1: failed during first iteration.
* MUTEX_WAKE_ALL: failed during first iteration.

I have a serial console and a kernel debugger enabled, so if anybody has suggestions for probes to do once the system is hung, let us know.

Any advice is welcome. Well, except for "dump the Tyan S2460 motherboards" maybe.

Oh, and we're at current BIOS and firmware revs for both the ICP card and the motherboard.
Re: iir + Tyan S2460 + SMP problems
Doug> We're having problems with FreeBSD 5.4, 6.0, and 6.1 and an ICP
Doug> Vortex GDT8546RZ 4 port SATA RAID

John-Mark> We've had very similar experiences on 4.7-R. The box would
John-Mark> hang on one partition waiting for IO to come back, but
John-Mark> direct access to the disk, or accessing other partitions
John-Mark> would pass IO fine.

Interesting, the opposite of our problems. It works perfectly for us on 4.x systems, but not in 5.4 and 6.x.

Doug> Any advice is welcome. Well, except for "dump the Tyan S2460
Doug> motherboards" maybe.

John-Mark> How about drop iir?

While not as expensive as switching motherboards, still a pain. We've been very happy with our ICP cards. But your suggestion does have merit, especially as recent PCI-X and PCI-E ICP cards have no FreeBSD driver in sight.
Re: iir + Tyan S2460 + SMP problems
John-Mark> How about drop iir?

Doug> But your suggestion does have merit, especially as recent PCI-X
Doug> and PCI-E ICP cards have no FreeBSD driver in sight.

Scott> My understanding is that the ICP division is switching over to
Scott> the architecture supported by the 'aac' driver. Adaptec
Scott> provided updates to this driver last year that include a number
Scott> of ICP id numbers. If you have access to one of these cards
Scott> that you mention, would you mind trying the aac driver and
Scott> reporting on whether or not it worked?

I didn't know that the newer ICP cards would work, so we don't have any of the newer ones. Sorry.
Re: What is considered "the best supported" RAID controller for 5.x?
JM> Basically a process will hang in either ffsfsn (fsync induces this
JM> write) or getblk (a read), and as far as I can find out, the io
JM> just never returns even though the underlying block device
JM> continues to work fine..

I know this is probably a stupid answer, but ... Have you upgraded the firmware on the card? We have a number of ICP cards, and with the 2.39 firmware we can easily lock up the box, with the behavior you describe, using a simple buildworld. Upgrading the firmware to 2.44 solves the problem completely for us.
5.4 hangs on disk IO
I've got 2 FreeBSD 5.4 systems that seem to get stuck doing disk IO.

When the system gets hung, it seems to refuse to do any disk IO. It will continue to respond to pings, and the tty driver on the serial console continues to work and echo characters. But all the processes seem to get stuck in the state "ufs". On the serial console I tried sync, which then hangs, and ^T produces:

    athearn# sync
    load: 0.66 cmd: csh 4138 [ufs] 0.00u 0.00s 0% 1748k

One of the systems emits these error messages on the console:

    swap_pager: indefinite wait buffer: device: da0s1a, blkno: 2, size: 4096
    swap_pager: indefinite wait buffer: device: da0s1h, blkno: 6, size: 4096
    swap_pager: indefinite wait buffer: device: da0s1h, blkno: 7, size: 4096
    swap_pager: indefinite wait buffer: device: da0s1h, blkno: 15, size: 4096
    swap_pager: indefinite wait buffer: device: da0s1a, blkno: 26, size: 4096

Other than that difference, both systems hang the same way.

The interesting part is that it only seems to happen when I run amd. With amd running, one of our users can hang the system in about 5-10 minutes of heavy disk traffic to a LOCAL disk. The local disk with the heaviest traffic is "behind" an amd managed symlink. If I don't run amd and do all the NFS mounts by hand and build the symlink by hand, the system runs fine.

I've tried the stock amd that comes with 5.4, the fairly outdated 6.0.10-20040513, and the brand new 6.1.2.1 from ports.

From the kernel debugger I can't even panic or call boot(0) without errors.

I can easily reproduce the hang, so if there are any suggestions for things to poke at with the kernel debugger on the serial console, let me know.
Here's the result of ps from a hung system, along with the errors from panic and boot(0):

    db> ps
    pid proc uid ppid pgrp flag stat wmesg wchan cmd
    4138 c34051c4 0 530 4138 412 [SLPQ ufs 0xc346b2bc][SLP] csh
    4137 c3e8be20 1002 740 735 400 [SLPQ sbwait 0xc366fe84][SLP] perl5.8.7
    4136 c3e8bc5c 1002 741 735 400 [SLPQ sbwait 0xc3a95d40][SLP] perl5.8.7
    4135 c3e071c4 1002 743 735 0004000 [SLPQ nfs 0xc35a7e14][SLP] mkdir
    4134 c3d8e1c4 1002 749 735 0004000 [SLPQ ufs 0xc346b2bc][SLP] perl5.8.7
    4132 c3e6154c 1002 739 735 0004000 [SLPQ nfs 0xc35a7e14][SLP] remap
    2355 c3714a98 1000 682 2355 0004002 [SLPQ select 0xc06787c4][SLP] top
    749 c3ba3c5c 1002 1 735 0004002 [SLPQ wait 0xc3ba3c5c][SLP] tclsh8.4
    748 c348f54c 1002 1 735 0004002 [SLPQ accept 0xc36a6a5a][SLP] perl5.8.7
    747 c3710a98 1002 1 735 0004002 [SLPQ accept 0xc3a6154a][SLP] perl5.8.7
    746 c370fa98 1002 1 735 0004002 [SLPQ accept 0xc374e17e][SLP] perl5.8.7
    745 c3ba3e20 1002 1 735 0004002 [SLPQ accept 0xc374e68e][SLP] perl5.8.7
    744 c3ba6000 1002 1 735 0004002 [SLPQ accept 0xc35ac7d2][SLP] perl5.8.7
    743 c3ba61c4 1002 1 735 0004002 [SLPQ wait 0xc3ba61c4][SLP] perl5.8.7
    742 c3ba6388 1002 1 735 0004002 [SLPQ accept 0xc366f406][SLP] perl5.8.7
    741 c34b6e20 1002 1 735 0004002 [SLPQ wait 0xc34b6e20][SLP] perl5.8.7
    740 c34b6388 1002 1 735 0004002 [SLPQ wait 0xc34b6388][SLP] perl5.8.7
    739 c34b6c5c 1002 1 735 0004002 [SLPQ wait 0xc34b6c5c][SLP] perl5.8.7
    682 c34ef000 1000 681 682 0004002 [SLPQ pause 0xc34ef038][SLP] zsh
    681 c340954c 1000 678 678 100 [SLPQ select 0xc06787c4][SLP] sshd
    678 c348ca98 0 442 678 100 [SLPQ sbwait 0xc3a616ec][SLP] sshd
    530 c3710e20 0 529 530 0004002 [SLPQ ppwait 0xc3710e20][SLP] csh
    529 c370f388 0 1 529 0004102 [SLPQ wait 0xc370f388][SLP] login
    528 c3710388 0 1 528 0004002 [SLPQ ttyin 0xc31d0a10][SLP] getty
    527 c348c1c4 0 1 527 0004002 [SLPQ ttyin 0xc31b8410][SLP] getty
    526 c348f388 0 1 526 0004002 [SLPQ ttyin 0xc31b8010][SLP] getty
    525 c34ef8d4 0 1 525 0004002 [SLPQ ttyin 0xc3166c10][SLP] getty
    524 c34b6a98 0 1 524 0004002 [SLPQ ttyin 0xc3166210][SLP] getty
    523 c348f8d4 0 1 523 0004002 [SLPQ ttyin 0xc3166410][SLP] getty
    522 c348ce20 0 1 522 0004002 [SLPQ ttyin 0xc3166610][SLP] getty
    521 c34f3e20 0 1 521 0004002 [SLPQ ttyin 0xc3166810][SLP] getty
    491 c3405388 0 1 490 000 [SLPQ ufs 0xc346b2bc][SLP] snmpd
    464 c3710710 0 1 464 000 [SLPQ ufs 0xc346b2bc][SLP] cron
    452 c34b6710 25 1 452 100 [SLPQ pause 0xc34b6748][SLP] sendmail
    448 c3714000 0 1 448 100 [SLPQ pause 0xc3714038][SLP] sendmail
    442 c34f38d4 0 1 442 100 [SLPQ select 0xc06787c4][SLP] sshd
    427 c3409710 0 1 427 000 [SLPQ select 0xc06787c4][SLP] ntpd
    382 c3167c5c 0 378 378 000 [SLPQ - 0xc3383c00][SLP] nfsd
    381 c34ef388 0 378 378 000 [SLPQ - 0xc33d7200][SLP] nfsd
    380 c34f354c 0 378 378 000 [SLPQ - 0xc33f6400][SLP] nfsd
    379 c34b654c 0 378 378 000 [SLPQ - 0xc338f000][SLP] nfsd
    378 c3405710 0 1 378 000 [SLPQ accept 0xc365403a][SLP]
Re: indefinite wait buffer: Does this indicate hardware issue?
** On Sat, 17 Dec 2005 00:49:48 +0800, Xin LI <[EMAIL PROTECTED]> said:

Xin> I have a box indicating the following sometimes:
Xin> "swap_pager: indefinite wait buffer: bufobj: 0, blkno: 262169, size: 4096"

We are having a very similar problem that we've been trying to diagnose off and on for a while. About 20% of the time the system will emit very similar messages. In each case processes trying to write to disks get stuck in the "ufs" state.

This is a dual CPU AMD Athlon(tm) MP 1600+ system on a Tyan 2460 mobo with 512 MB of RAM and an ICP Vortex GDT8546RZ SATA RAID controller.

Here is the result of a C-t on an interactive shell via a serial console:

    load: 0.00 cmd: csh 547 [ufs] 0.02u 0.01s 0% 2516k

With the kernel debugger a few processes will be stuck in the same state:

    db> ps
    pid proc uid ppid pgrp flag stat wmesg wchan cmd
    647 c1a08000 0 645 647 110 [SLPQ ufs 0xc1856c08][SLP] cron
    646 c1883418 2 644 646 110 [SLPQ biord 0xcbcff650][SLP] cron
    645 c1a0a830 0 493 493 000 [SLPQ ppwait 0xc1a0a830][SLP] cron
    644 c1887c48 0 493 493 000 [SLPQ ppwait 0xc1887c48][SLP] cron
    643 c1a0a418 0 632 642 0004002 [SLPQ wdrain 0xc06a26c4][SLP] bsdtar
    632 c1a08a3c 0 631 632 0004002 [SLPQ pause 0xc1a08a70][SLP] tcsh
    631 c1834624 1001 616 631 0004102 [SLPQ wait 0xc1834624][SLP] su
    616 c188720c 1001 615 616 0004002 [SLPQ pause 0xc1887240][SLP] tcsh
    615 c1a0a624 1001 613 613 100 [SLPQ select 0xc06a2144][SLP] sshd
    613 c1883624 0 471 613 0004100 [SLPQ sbwait 0xc1b3dbc8][SLP] sshd
    547 c1671a3c 0 537 547 0004002 [SLPQ ufs 0xc1856c08][SLP] csh
    537 c1883a3c 0 1 537 0004102 [SLPQ wait 0xc1883a3c][SLP] login
    536 c1887624 0 1 536 0004002 [SLPQ ttyin 0xc16e8810][SLP] getty
    535 c1671624 0 1 535 0004002 [SLPQ ttyin 0xc16e8410][SLP] getty
    534 c1a0aa3c 0 1 534 0004002 [SLPQ ttyin 0xc16dac10][SLP] getty
    533 c1671c48 0 1 533 0004002 [SLPQ ttyin 0xc16e7010][SLP] getty
    532 c1a0ac48 0 1 532 0004002 [SLPQ ttyin 0xc16e7410][SLP] getty
    531 c1887a3c 0 1 531 0004002 [SLPQ ttyin 0xc16e7810][SLP] getty
    530 c1887000 0 1 530 0004002 [SLPQ ttyin 0xc16e7c10][SLP] getty
    529 c183420c 0 1 529 0004002 [SLPQ ttyin 0xc16e8010][SLP] getty
    513 c1a0a20c 0 1 513 000 [SLPQ select 0xc06a2144][SLP] inetd
    493 c1a08830 0 1 493 000 [SLPQ ufs 0xc1856c08][SLP] cron
    481 c1883c48 25 1 481 100 [SLPQ pause 0xc1883c7c][SLP] sendmail
    477 c167120c 0 1 477 100 [SLPQ pause 0xc1671240][SLP] sendmail
    471 c1834000 0 1 471 100 [SLPQ select 0xc06a2144][SLP] sshd
    349 c1887418 0 1 349 000 [SLPQ ufs 0xc1856c08][SLP] ypbind
    336 c1834418 0 1 336 000 [SLPQ select 0xc06a2144][SLP] rpcbind
    317 c1887830 0 1 317 000 [SLPQ select 0xc06a2144][SLP] syslogd
    283 c1671830 0 1 283 000 [SLPQ select 0xc06a2144][SLP] devd
    228 c1883000 65 1 228 100 [SLPQ select 0xc06a2144][SLP] dhclient
    208 c188320c 0 153 002 [SLPQ select 0xc06a2144][SLP] dhclient
    52 c1834a3c 0 0 0 204 [SLPQ - 0xd5526d08][SLP] schedcpu
    51 c1834c48 0 0 0 204 [SLPQ - 0xc06aa4cc][SLP] nfsiod 3
    50 c161b624 0 0 0 204 [SLPQ - 0xc06aa4c8][SLP] nfsiod 2
    49 c161b830 0 0 0 204 [SLPQ - 0xc06aa4c4][SLP] nfsiod 1
    48 c161ba3c 0 0 0 204 [SLPQ - 0xc06aa4c0][SLP] nfsiod 0
    47 c161bc48 0 0 0 204 [SLPQ vlruwt 0xc161bc48][SLP] vnlru
    46 c1670 0 0 204 [SLPQ getblk 0xcbc36578][SLP] syncer
    45 c167020c 0 0 0 204 [SLPQ psleep 0xc06a268c][SLP] bufdaemon
    44 c1670418 0 0 0 20c [SLPQ pgzero 0xc06b09c4][SLP] pagezero
    9 c1670624 0 0 0 204 [SLPQ psleep 0xc06b0514][SLP] vmdaemon
    8 c1670830 0 0 0 204 [SLPQ psleep 0xc06b04d0][SLP] pagedaemon
    43 c1670a3c 0 0 0 204 [IWAIT] swi0: sio
    7 c1670c48 0 0 0 204 [SLPQ - 0xc16dd43c][SLP] fdc0
    6 c1671000 0 0 0 204 [SLPQ - 0xc166f280][SLP] kqueue taskq
    42 c160fc48 0 0 0 204 [IWAIT] swi5:+
    5 c161a000 0 0 0 204 [SLPQ - 0xc15c1280][SLP] thread taskq
    41 c161a20c 0 0 0 204 [IWAIT] swi6:+
    40 c161a418 0 0 0 204 [IWAIT] swi6: task queue
    39 c161a624 0 0 0 204 [IWAIT] swi2: cambio
    38 c161a830 0 0 0 204 [SLPQ - 0xc0698140][SLP] yarrow
    4 c161aa3c 0 0 0 204 [SLPQ - 0xc0698b08][SLP] g_down
    3 c161ac48 0 0 0 204 [SLPQ - 0xc0698b04][SLP] g_up
    2 c161b000 0 0 0 204 [SLPQ - 0xc0698afc][SLP] g_event
    37 c161b20c 0 0 0 204 [IWAIT] swi3: vm
    36 c161b418 0 0 0 20c [IWAIT] swi4: clock sio
    35 c1600624 0 0 0
Re: indefinite wait buffer: Does this indicate hardware issue?
Xin> Hi

Hi.

On 19 Dec 2005 14:32:31 -0600, Douglas K. Rand <[EMAIL PROTECTED]> wrote:

Doug> Tracing command swapper pid 0 tid 0 td 0xc0698e20
Doug> sched_switch(c0698e20,0,1) at sched_switch+0x14b
Doug> mi_switch(1,0) at mi_switch+0x1ba
Doug> scheduler(0,81ec00,81e000,0,c042f5d5) at scheduler+0x262
Doug> mi_startup() at mi_startup+0x96
Doug> begin() at begin+0x2c

Xin> Which scheduler are you using?

SCHED_4BSD.
pam_ssh problems
Sometime between 4.1.1-STABLE and 4.2-BETA we started having difficulties with using pam_ssh and wdm. Here is a piece of our /etc/pam.conf:

    wdm auth sufficient pam_ssh.so
    wdm auth required pam_unix.so
    wdm account required pam_unix.so try_first_pass
    wdm session required pam_ssh.so
    wdm password required pam_deny.so

This used to work just fine: it would authenticate against the user's ~/.ssh/identity, and when wdm started the session, it would automatically start up ssh-agent and add the user's SSH key.

After a cvsup, wdm started dropping core. I've cvsup'ed a few times since then also, hoping for a fix, but no luck yet. My latest cvsup was yesterday.

So, I rebuilt wdm with debug symbols, and rebuilt world with -g too, and here is the backtrace from gdb:

    #0 0x283ed553 in ?? ()
    #1 0x283ed72b in ?? ()
    #2 0x283ea744 in ?? ()
    #3 0x28321a10 in _pam_dispatch_aux (pamh=0x8069300, flags=0, h=0x8069900, resumed=PAM_FALSE) at /usr/src/lib/libpam/libpam/../../../contrib/libpam/libpam/pam_dispatch.c:79
    #4 0x28321e10 in _pam_dispatch (pamh=0x8069300, flags=0, choice=4) at /usr/src/lib/libpam/libpam/../../../contrib/libpam/libpam/pam_dispatch.c:270
    #5 0x283200d6 in pam_open_session (pamh=0x8069300, flags=0) at /usr/src/lib/libpam/libpam/../../../contrib/libpam/libpam/pam_session.c:26
    #6 0x8054b9d in StartClient (verify=0x805fbfc, d=0x8069000, pidp=0x805fbe0, name=0x805f4e8 "user", passwd=0x805f500 "password") at session.c:682
    #7 0x8054009 in ManageSession (d=0x8069000) at session.c:308
    #8 0x8050454 in StartDisplay (d=0x8069000) at dm.c:635
    #9 0x805023b in CheckDisplayStatus (d=0x8069000) at dm.c:562
    #10 0x8050a40 in ForEachDisplay (f=0x80501d4 ) at dpylist.c:55
    #11 0x8050257 in StartDisplays () at dm.c:571
    #12 0x804f638 in main (argc=2, argv=0xbfbff708) at dm.c:185
    #13 0x804a5c8 in _start (arguments=0xbfbff818 "-:0 ") at /usr/src/lib/csu/i386-elf/crt1.c:96

The code seems to be launching the module, but I can't figure out which module it is having trouble with, although I expect it is pam_ssh.so. Here are a few more details from gdb:

    (gdb) print *h
    $8 = {must_fail = 0, func = 0x283ea1a0, actions = {-1, -3 , -1, -3 , 0, -3, -3, -3, -3, -3, -3}, argc = 0, argv = 0x0, next = 0x0}

    (gdb) print *pamh
    $9 = {authtok = 0x0, pam_conversation = 0x8065de0, oldauthtok = 0x0,
      prompt = 0x0, service_name = 0x8065dc0 "wdm", user = 0x8065dd0 "user",
      rhost = 0x0, ruser = 0x0, tty = 0x8065e00 ":0", pam_default_log = {
      ident = 0x0, option = 0, facility = 0}, data = 0x8065f50, env = 0x8065df0,
      fail_delay = {set = PAM_FALSE, delay = 0, begin = 974839469,
      delay_fn_ptr = 0x0}, handlers = {module = 0x806a6c0,
      modules_allocated = 4, modules_used = 3, handlers_loaded = 1, conf = {
      authenticate = 0x8069400, setcred = 0x8069500, acct_mgmt = 0x8069800,
      open_session = 0x8069900, close_session = 0x8069a00,
      chauthtok = 0x8069b00}, other = {authenticate = 0x8069c00,
      setcred = 0x8069d00, acct_mgmt = 0x8069e00, open_session = 0x0,
      close_session = 0x0, chauthtok = 0x0}}, former = {choice = 0, depth = 0,
      impression = 0, status = 0, want_user = 0, prompt = 0x0,
      update = PAM_FALSE}}

I don't know if anybody else is having this problem, or knows how to fix it, but any assistance would be useful.
Re: 3ware problems
> Drat. There it is; you've got a command that looks like it's stuck in
> the adapter.

I'll go grab the can of WD-40. :)

> I didn't see you respond to Mike T - are you using 64k or 128k stripes?

I didn't get his query until I had already started the mysqld run to try to break things. And now I'm at home, and while serial consoles are *really* great for most things, I can't get at the 3ware BIOS from here. I didn't want to respond until I had checked the BIOS. I'm /fairly/ sure that I took the default 64K stripes, but one time in rebinding the array I did change the stripe size.

> If the latter, try changing the value of TWE_Q_LENGTH in
> /sys/dev/twe/twereg.h to 100 and see if you can reproduce it.

Rebuilding the kernel as I type.

> I am worrying about firmware here at the moment

We are running the latest firmware as of about 10 days ago.

> Thanks for your patience.

Are you kidding? Thanks for all the help. We really appreciate it. Anything we can do to help, let us know.
Re: 3ware problems
> Drat. There it is; you've got a command that looks like it's stuck in
> the adapter.
>
> try changing the value of TWE_Q_LENGTH in /sys/dev/twe/twereg.h to 100 and
> see if you can reproduce it.

Well, I just woke up and mysqld was stuck again in getblk, this time with a TWE_Q_LENGTH of 100:

    db> call twe_report
    twe0: status 57007390
    twe0: current max
    twe0: free 0099 0100
    twe0: ready
    twe0: busy 0001 0100
    twe0: complete 0011
    twe0: bioq 0027
    twe0: AEN queue head 1 tail 0
    twed: total bio count in 1646323 out 1646322
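One thing the last line of that report makes easy to check is how many bios the twed layer has handed down but not yet seen returned: 1646323 in, 1646322 out, a difference of one, which appears consistent with the single entry on the busy queue. A minimal sketch of that arithmetic, assuming the line format shown above; the function name outstanding_bios is hypothetical and not part of the driver.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse a "twed: total bio count in N out M" line
 * and return N - M, the number of bios still outstanding.
 * Returns -1 if the line does not match or the counters are inconsistent.
 */
long outstanding_bios(const char *line)
{
    unsigned long in, out;

    if (sscanf(line, "twed: total bio count in %lu out %lu", &in, &out) != 2)
        return -1;
    if (out > in)
        return -1;
    return (long)(in - out);
}
```

Fed the line from the report above, this yields 1.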
3Ware, Western Digital disks, and stray interrupts
We have two pretty much identical systems: both are Tyan Tiger MP S2469 boards with a 3ware 7450 controller and Western Digital WD1000 100GB disks. One system has 4 disks in a RAID 10 configuration, and the other has 2 disks in a RAID 1 configuration. One system only has a single Athlon MP CPU, while the other has 2 Athlon MP CPUs.

We have gone through 5 of the WD1000 disks so far, with a 6th that just failed the other day. The first 3 failures we tested with Western Digital's drive fitness test, which reported all three drives to be OK. The first disk that failed we tried to put back in and have the 3ware controller rebuild, but the rebuild failed after 2 hours. We've stopped testing the disks, and just send them back to Western Digital.

All the failures have been drive timeouts:

    Dec 29 16:55:31 doppler[kern.crit] /kernel: twe0: AEN:
    Jan 1 23:36:04 doppler[kern.crit] /kernel: twe0: AEN:
    Feb 22 18:19:21 doppler[kern.crit] /kernel: twe0: AEN:
    Mar 7 20:18:44 vault[kern.crit] /kernel: twe0: AEN:
    Mar 16 21:42:02 vault[kern.crit] /kernel: twe0: AEN:

The last two messages were somewhat massaged by me; that comes later...

So, the first question: has anybody else seen such a horrible failure rate with the WD1000 disks?

The other problem we are having, which /may/ be related, is that the second system (vault, the single CPU box) has had 2 failures that coincide with a spate of "stray irq 7" messages.
We are using swatch to watch for the twe messages, but in the two failures on vault the kernel log line was interleaved with the stray irq 7 messages:

Mar 7 20:18:44 vault[kern.crit] /kernel: t
Mar 7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar 7 20:18:44 vault[kern.crit] /kernel: we0
Mar 7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar 7 20:18:44 vault[kern.crit] /kernel: too many stray irq 7's; not logging any more
Mar 7 20:18:44 vault[kern.crit] /kernel: : AEN:

Mar 16 21:42:02 vault[kern.crit] /kernel: tw
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: e0:
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: A
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: EN:

In both cases, there aren't any kernel logs for 2 hours on either side of this message.

We have the parallel port disabled in the BIOS, and after the last failure we took irq 7 away from the PCI and PnP devices. (None of the previous dmesg output for the system reports any devices using irq 7.) I've put the current dmesg at the end.

So, is the 3ware controller causing the stray irq 7 messages when the disk fails, or are the stray irq 7 messages causing the 3ware controller to time out the disk?

Any help would be appreciated. Pretty soon Western Digital is gonna stop taking our phone calls. Either that, or we'll lose 2 disks before we get the first one fixed. ;^)

Copyright (c) 1992-2002 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 4.5-RELEASE #1: Wed Feb 13 17:10:19 CST 2002
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/VAULT
Timecounter "i8254" frequency 1193182 Hz
Timecounter "TSC" frequency 1400054127 Hz
CPU: AMD Athlon(tm) MP Processor 1600+ (1400.05-MHz 686-class CPU)
Origin = "AuthenticAMD" Id = 0x662 Stepping = 2
Features=0x383fbff
AMD Features=0xc048<,AMIE,DSP,3DNow!>
real memory = 268435456 (262144K bytes)
avail memory = 258568192 (252508K bytes)
Preloaded elf kernel "kernel" at 0xc02a8000.
Pentium Pro MTRR support enabled
Using $PIR table, 268435454 entries at 0xc00fdf10
npx0: on motherboard
npx0: INT 16 interface
pcib0: on motherboard
pci0: on pcib0
pcib1: at device 1.0 on pci0
pci1: on pcib1
pci1: at 5.0 irq 10
isab0: at device 7.0 on pci0
isa0: on isab0
atapci0: port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
chip1: at device 7.3 on pci0
twe0: <3ware Storage Controller> port 0x1430-0x143f mem 0xf400-0xf47f,0xf4901000-0xf490100f irq 5 at device 8.0 on pci0
twe0: 4 ports, Firmware FE7X 1.03.09.027, BIOS BE7X 1.07.02.002
fxp0: port 0x1400-0x141f mem 0xf480-0xf48f,0xf4903000-0xf4903fff irq 5 at device 12.0 on pci0
fxp0: Ethernet address 00:90:27:18:d7:45
inphy0: on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
ahc0: port 0x1000-0x10ff mem 0xf490-0xf4900fff irq 11 at device 13.0 on pci0
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/255 SCBs
orm0: at iomem 0xc-0xc7fff,0xc8000-0xc8fff,0xc9800-0xc9fff on isa0
fdc0: at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: at port 0x60,0x64 on isa0
vga0: at port 0x3c0-0x3df iomem 0xa-0xb on isa0
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x1
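Since the twe0 line on vault arrives chopped into fragments between the stray irq 7 messages, a watcher that matches on the whole line will miss it. One possible workaround is to stitch the fragments back together before matching. The following is a rough Python sketch under stated assumptions, not a swatch feature: `reassemble` and `AEN_RE` are hypothetical names, the line format is taken from the log excerpts earlier in this message, and it assumes every kernel message logged in the same second on the same host belongs together once the "stray irq" lines are dropped. That last assumption could wrongly merge unrelated messages on a busier log.

```python
import re

# Matches lines like: "Mar 16 21:42:02 vault[kern.crit] /kernel: tw"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+?)\[([\w.]+)\] /kernel: ?(.*)$"
)

# Whitespace between fragments may be lost, so allow "twe0:AEN:" as well.
AEN_RE = re.compile(r"twe\d+:\s*AEN:")


def reassemble(lines):
    """Join kernel message fragments sharing a timestamp and host.

    Lines mentioning "stray irq" are treated as noise and skipped.
    Returns a dict keyed by (timestamp, host) in first-seen order.
    """
    merged = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # not a kernel log line in the expected format
        stamp, host, _priority, text = m.groups()
        if "stray irq" in text:
            continue
        key = (stamp, host)
        merged[key] = merged.get(key, "") + text
    return merged


SAMPLE = [
    "Mar 7 20:18:44 vault[kern.crit] /kernel: t",
    "Mar 7 20:18:44 vault[kern.err] /kernel: stray irq 7",
    "Mar 7 20:18:44 vault[kern.crit] /kernel: we0",
    "Mar 7 20:18:44 vault[kern.err] /kernel: stray irq 7",
    "Mar 7 20:18:44 vault[kern.crit] /kernel: too many stray irq 7's; not logging any more",
    "Mar 7 20:18:44 vault[kern.crit] /kernel: : AEN:",
    "Mar 16 21:42:02 vault[kern.crit] /kernel: tw",
    "Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7",
    "Mar 16 21:42:02 vault[kern.crit] /kernel: e0:",
    "Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7",
    "Mar 16 21:42:02 vault[kern.crit] /kernel: A",
    "Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7",
    "Mar 16 21:42:02 vault[kern.crit] /kernel: EN:",
]

if __name__ == "__main__":
    for (stamp, host), text in reassemble(SAMPLE).items():
        flag = "AEN" if AEN_RE.search(text) else "-"
        print(stamp, host, flag, repr(text))
```

Both incidents come out as a single twe0 AEN line per timestamp, which is roughly what was done by hand to produce the "massaged" entries in the list of drive timeouts.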
Re: 3Ware, Western Digital disks, and stray interrupts
Albert> We're running a 6410 with four WD1000 disks and have had only
Albert> one failure so far. Anyway, according to the author of the twe
Albert> driver, Michael Smith, a firmware upgrade for the 7xxx series
Albert> controllers should be available by now.

Thanks, I had missed that post. We might have to switch from RELENG_4_5 to RELENG_4 for that box and try it. Not the end of the world, we've just gotten to like tracking those branches.