Re: 3ware problems

2001-03-15 Thread rand

** Mike Tancsa <[EMAIL PROTECTED]> on Thu, 15 Mar 2001 14:33:29 -0500
** in [Re: 3ware problems ] writes:

Mike> I tried yesterday to stress the machine with 25 simultaneous
Mike> bonnie -s 500 &

Mike> Although the machine was sluggish, it still worked.  Similarly,
Mike> make -j12 buildworld worked. In the past when i saw a similar
Mike> bug, I could reproduce it 100% of the time this way.

Earlier today I ran 30 concurrent "bonnie -s 500" and while things
were slow, no problems showed up. Right now I'm on my 7th "make -j16
buildworld" and it's working fine. After this buildworld finishes, I
think I'll start up a shell script to keep 20 concurrent bonnies
running overnight.
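
The overnight run described above amounts to keeping a fixed number of
copies of one command alive. A minimal sketch of that loop follows,
written in C rather than as the shell script mentioned. The function
name keep_running and its parameters are illustrative assumptions, not
anything from this thread.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Keep up to `slots` copies of argv running until `total` runs have been
 * started, then wait for the stragglers.  Returns the number of runs that
 * exited with status 0, or -1 if fork() fails (children already started
 * are left running in that case; a real tool would clean them up).
 */
int
keep_running(char *const argv[], int slots, int total)
{
	int started = 0, running = 0, ok = 0, status;
	pid_t pid;

	assert(slots > 0 && total >= 0);

	while (started < total || running > 0) {
		/* Top up to the requested concurrency. */
		while (running < slots && started < total) {
			pid = fork();
			if (pid < 0)
				return (-1);
			if (pid == 0) {
				execvp(argv[0], argv);
				_exit(127);	/* exec failed */
			}
			started++;
			running++;
		}
		/* Block until one child finishes, then loop to replace it. */
		if (wait(&status) < 0)
			break;
		running--;
		if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
			ok++;
	}
	return (ok);
}
```

Called with argv = { "bonnie", "-s", "500", NULL }, slots = 20 and a
large total, this would approximate the 20 concurrent bonnies.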

(The buildworlds are taking about 70 minutes to complete. The system
 is a dual PIII 400MHz with 384MB of RAM on a SuperMicro P6DBU. Not bad
 times.)

So far the only way I can get the problem to show up is banging on
MySQL for 3-12 hours.

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-stable" in the body of the message



Re: 3ware problems

2001-03-18 Thread rand

Doug> It takes a surprisingly long time to initialize the array.

Mike> The delay is normal.  When you set up anything other than a RAID0
Mike> array, the card is actually doing work to your drives in the
Mike> background. Grab the array manager from
Mike> http://people.freebsd.org/~msmith/RAID/3ware/3dm-bsd.1.09.00.002.tar.gz
Mike> and it will notify you when it's done. You can also speed up the
Mike> initialization part a bit by setting it to a faster rebuild
Mike> time.

We finally did figure that out. The problem in this particular
circumstance with the 3dm utility is that the only controller in the box
is the 3ware 6400. So in order to run 3dm I need to have FreeBSD
installed, and installing FreeBSD at the same time the controller is
initializing the array is really slow.  :)

The first time I did this I thought something was broken when I
watched the newfs output those duplicate super block locations. It was
about 10 seconds between each block! After a search of the FreeBSD
lists I found a reference to initializing the array, and just waited.

On ttyv1 the kernel issues a message when the array initialization is
done, so I usually just wait for that.




Re: 3ware problems

2001-03-21 Thread rand

Doug> Mike, thanks for all your help and the time you've invested in
Doug> this! What can we do to assist?

Mike> Do you have any coding experience?  I can't reproduce this here,
Mike> but what I want to do is see what the command that's stuck on
Mike> the busy queue looks like.

Sure, sounds like fun.

Mike> If you can add another function like twe_printstate that invokes
Mike> twe_print_request on each of the requests on the busy queue and
Mike> let me know what they look like, that might give me some clues.

I'll do that today.

Mike> (I'd send you diffs, but I'm snowed at work and quite ill just
Mike> now 8(...)

Hope you feel better. 

I've never seen that smilie before, is that for projectile vomiting? 




Re: 3ware problems

2001-03-21 Thread rand

Mike> If you can add another function like twe_printstate that invokes
Mike> twe_print_request on each of the requests on the busy queue and
Mike> let me know what they look like, that might give me some clues.
 
Doug> OK, I haven't written the twe_printstate function yet, but I
Doug> think I have the request. I got the filesystem wedged first, and
Doug> then browsing the datastructures with DDB, I think I've found
Doug> the busy queue. Here's the request:

Mike> Cool, this works just as well. 8)

Doug> db> call twe_print_request(0xc1529800)
Doug> twe0: CMD: request_id 89  opcode   size 7  unit 0  host_id 0
Doug> twe0:  status 0  flags 0x0  count 16  sgl_offset 3
Doug> twe0:  lba 264703
Doug> twe0:   0: 0xce4f000/4096
Doug> twe0:   1: 0x2ab/4096
Doug> twe0:  tr_command 0xc1529800/0x1749d800  tr_data 0xcb928000/0xce4f000,8192
Doug> twe0:  tr_status 2  tr_flags 0x1  tr_complete 0xc011f170  tr_private 0

Mike> Er.  This is bad; tr_status == 2 means that the command has been
Mike> completed; it shouldn't still be on the busy queue.  Can you
Mike> check to make sure you have the right queue here?

I am not at all positive I've got the right queue. I *think* I do. I'm
trying to break it again now, and I'll use the code below to verify
the queue. I'm also going to hit the kernel core with gdb to see if I
can verify that. 

Doug> I'm rebuilding the kernel now with the function twe_printstate,
Doug> after I figured it out with the debugger. (This reminds me of a
Doug> saying that has to do with horses and carriages, hmm.)

Mike> Hrm.  It *should* be pretty easy; I'm sorry I confused you with
Mike> the 'printstate' reference; you should be able to fix up
Mike> twe_report to just dump the busy queue:

Mike>   struct twe_request  *tr;
Mike> ...

Mike>   TAILQ_FOREACH(tr, TAILQ_FIRST(sc->twe_busy), tr_link)
Mike>   twe_print_request(tr);

This doesn't compile for me. Every time I try to use 'sc->twe_busy' I
get a syntax error: invalid type argument of `->'

Here is what I'm using right now:

    s = splbio();   /* block disk interrupts while walking the queues */
    for (i = 0; (sc = devclass_get_softc(twe_devclass, i)) != NULL; i++) {
        twe_print_controller(sc);
        printf("ready queue: %d entries\n", sc->twe_qstat[TWEQ_READY].q_length);
        TAILQ_FOREACH(tr, &sc->twe_ready, tr_link)
            twe_print_request(tr);
        printf("busy queue: %d entries\n", sc->twe_qstat[TWEQ_BUSY].q_length);
        TAILQ_FOREACH(tr, &sc->twe_busy, tr_link)
            twe_print_request(tr);
        printf("complete queue: %d entries\n", sc->twe_qstat[TWEQ_COMPLETE].q_length);
        TAILQ_FOREACH(tr, &sc->twe_complete, tr_link)
            twe_print_request(tr);
    }
    splx(s);

This compiles, and when I run it it doesn't crash!  :) In fact, it
says all the queues are empty.

Doug> Oh, btw, it took over 3 million rows to get it stuck this
Doug> time. Gotta love a test cycle of 6 hours or so.  Sigh.

Mike> This is obviously a really weird case; possibly either an
Mike> extremely narrow race, or some very borderline PCI issue.  One
Mike> question I should have asked, but don't recall whether you
Mike> answered; are you using an AMD K7 system by any chance?  We've
Mike> seen some *very* weird behaviour with these controllers in some
Mike> K7 systems.

Yes, it *is* really weird. I can only get it to break with MySQL. From
a suggestion of Mike Tancsa, I tried lots of concurrent bonnies, and
also running a buildworld with a high -j value. I let both run for
about 12 hours each, with no failure. The only thing that'll kill it
is MySQL.  I'm confused.  :(

Nope. It's a SuperMicro P6DBU, with dual 400MHz CPUs.

Mike> Thanks again for your help here.

My pleasure. (Calling what we are doing 'help' is complimentary. If
this is help, what you are doing for us must be close to divine
intervention!  :))




Re: 3ware problems

2001-03-21 Thread rand

Mike> Sorry, the above code is totally bogus; I'm kinda delirious
Mike> (feverish) right now.

Mike> Try

Mike>   TAILQ_FOREACH(tr, &sc->twe_busy, tr_link)

Yup, that is what I half figured out half guessed at. (Helps to have
other code to look through!)
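
The point of the correction is that TAILQ_FOREACH takes a pointer to the
queue head, and the head is embedded in the softc as a structure, so
its address has to be passed. A minimal userland sketch of the idiom
follows. The demo_* names are hypothetical stand-ins, not the real twe
structures.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a driver request. */
struct demo_request {
	int				id;
	TAILQ_ENTRY(demo_request)	link;	/* plays the role of tr_link */
};

TAILQ_HEAD(demo_queue, demo_request);

/* Hypothetical stand-in for the per-controller softc. */
struct demo_softc {
	struct demo_queue	busy;		/* plays the role of twe_busy */
};

/*
 * Walk the busy queue, returning the number of requests on it.  If
 * last_id is non-NULL it receives the id of the last request visited
 * and is left untouched when the queue is empty.
 */
int
demo_count_busy(struct demo_softc *sc, int *last_id)
{
	struct demo_request *tr;
	int n = 0;

	assert(sc != NULL);

	/* The head lives inside the softc, so pass its address. */
	TAILQ_FOREACH(tr, &sc->busy, link) {
		n++;
		if (last_id != NULL)
			*last_id = tr->id;
	}
	return (n);
}
```

Writing sc->busy without the & hands the macro a structure where it
expects a pointer, which is the sort of `->' type error quoted earlier
in the thread.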


Mike> [...] there's a pattern of some sort involved, we just don't
Mike> know what it is yet...

I do!

   (outside air temp / inside air temp) * day of the month % line voltage
     + wc -l /etc/motd - df -k /var/db/mysql

Oh, no. That's my IQ.   :)





Re: 3ware problems

2001-03-21 Thread rand

Mike> Er.  This is bad; tr_status == 2 means that the command has been
Mike> completed; it shouldn't still be on the busy queue.  Can you
Mike> check to make sure you have the right queue here?

Well, it looks like I had the wrong queue before.  Blush.
At least this time tr_status is 1. Not sure if that is good or bad
though! :)

Here is the debug output:

db> call twe_printqueues 
twe0: status   57007310
twe0:   current  max
twe0: free  0099 0100
twe0: ready  
twe0: busy  0001 0100
twe0: complete   0009
twe0: bioq   0021
twe0: AEN queue head 1  tail 0
ready queue: 0 entries
busy queue: 1 entries
twe0: CMD: request_id 54  opcode   size 11  unit 0  host_id 0
twe0:  status 0  flags 0x0  count 32  sgl_offset 3
twe0:  lba 10466
twe0:   0: 0xffc4000/4096
twe0:   1: 0x11f85000/4096
twe0:   2: 0x12d66000/4096
twe0:   3: 0x10e87000/4096
twe0:  tr_command 0xc1520400/0x174f4400  tr_data 0xce0f4000/0xffc4000,16384
twe0:  tr_status 1  tr_flags 0x2  tr_complete 0xc011f1b0  tr_private 0xc9260400
complete queue: 0 entries

This was generated with the code:

void
twe_printqueues(void)
{
    struct twe_softc    *sc;
    struct twe_request  *tr = NULL;
    int                 i, s;

    s = splbio();   /* block disk interrupts while walking the queues */
    for (i = 0; (sc = devclass_get_softc(twe_devclass, i)) != NULL; i++) {
        twe_print_controller(sc);
        printf("ready queue: %d entries\n", sc->twe_qstat[TWEQ_READY].q_length);
        TAILQ_FOREACH(tr, &sc->twe_ready, tr_link)
            twe_print_request(tr);
        printf("busy queue: %d entries\n", sc->twe_qstat[TWEQ_BUSY].q_length);
        TAILQ_FOREACH(tr, &sc->twe_busy, tr_link)
            twe_print_request(tr);
        printf("complete queue: %d entries\n", sc->twe_qstat[TWEQ_COMPLETE].q_length);
        TAILQ_FOREACH(tr, &sc->twe_complete, tr_link)
            twe_print_request(tr);
    }
    splx(s);
}





PR kern/109152 -- RocketPort panics

2007-04-19 Thread Douglas K. Rand
We have 3 of the 32 port Comtrol PCI cards, the original 5v only
cards, not the newer universal PCI cards, and we are trying to upgrade
the system with all these serial ports from FreeBSD 4.8 (ya, kinda
old) to FreeBSD 6.2. And we are running into what seems like a common
problem with these cards: panics about non-busy devices:

  panic: device_unbusy: called for non-busy device rp0

The PR kern/109152 addresses this issue, and the patch from Craig
Leres added to the PR on Tue, 13 Mar 2007 19:27:01 -0700 solves the
problems for me. (I was experiencing the problems with HylaFAX, but it
seems easy to reproduce.) 

Can this patch be applied to RELENG_6?

I've also tried the patch from John Baldwin posted to freebsd-hackers
in 
http://groups.google.com/group/mailing.freebsd.hackers/browse_frm/thread/883da63a8c62854d
that generates an rp_open_count for each device, without any change in
the behavior: I could still panic the system by sending a FAX.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


iir + Tyan S2460 + SMP problems

2006-04-21 Thread Douglas K. Rand
We're having problems with FreeBSD 5.4, 6.0, and 6.1 and an ICP Vortex
GDT8546RZ 4 port SATA RAID card in a Tyan S2460 system with dual AMD
Athlon MP 1600+ CPUs. We do not have any problems with this
configuration under FreeBSD 4.11, and we have the same ICP cards in
Tyan-based Opteron systems (S2882 and S4882) that run without problems
under FreeBSD 5.4 and 6.1.

We can reproduce the problem on two different S2460 based systems, and
have tried 2 separate ICP GDT8546RZ cards, so we don't believe it is a
hardware problem. (Our success with FreeBSD 4.11 also provides some
evidence that our hardware is OK.)

The problem is that the system seems to stop doing any disk IO through
the ICP card. Processes that don't need to page in work fine. (You can
hit return in a shell, get another login: prompt on other consoles,
and the like.) The system continues to respond to pings, but anything
that attempts to do a disk IO simply stops. Sometimes the kernel emits
messages like this:

  swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2, size: 4096

The test we are using to produce this "hang" is a fairly trivial
expansion of a tar ball being fed via nc from another system. We run
on the source system:

   tar cf - radar | nc -w 3 10.10.10.229 12345

And on the system being tested we run: 

   nc -l 12345 | tar xvf -

One iteration of this test is the extraction of a 1.2 GB directory of
2,274 files.

The problem only exists with SMP kernels. While our other tests almost
always failed in the first iteration or two, the longest time to
failure was 5 iterations. Without SMP the test ran without problems
for 570 iterations over 18 hours.

We've tried a number of different tests.  These tests are with a stock
6.1-RC1 kernel from the RC CDs. Unless otherwise specified, all tests
are on a UFS2 filesystem with softupdates enabled and an SMP-enabled
GENERIC kernel.

  * !SMP: Ran 570 iterations in 18 hours without a problem, test
    terminated by hand.

  * Large (190 GB) UFS2 filesystem with soft updates enabled and SMP
    kernel: failed during the first iteration.

  * Medium (12 GB) UFS2 filesystem with soft updates enabled and SMP
    kernel: failed during the first iteration.

  * !softupdates: failed during the first iteration.

  * !ACPI: failed during the first iteration.

  * UFS1: failed during the first iteration.

  * UFS1 + !ADAPTIVE_GIANT: failed during the first iteration.

  * !ADAPTIVE_GIANT: failed during the first iteration.

  * Cleared motherboard CMOS: failed at the end of the second
    iteration.

  * FULL_PREEMPTION: failed during the first iteration.

  * !PREEMPTION: failed during the first iteration.

  * WITNESS + WITNESS_KDB: failed during the second iteration with no
    witness-related kernel messages and without entering the kernel
    debugger.

  * WITNESS + INVARIANTS: failed during the fifth iteration, again
    without kernel messages.

  * Motherboard BIOS "Use PCI Interrupt Entries in MP Table" set to
    OFF: failed during the first iteration.

  * Motherboard BIOS "Multiprocessor Specification" set from 1.4 to
    1.1: failed during the first iteration.

  * MUTEX_WAKE_ALL: failed during the first iteration.

I have a serial console and a kernel debugger enabled, so if anybody
has suggestions for probes to do once the system is hung let us know. 

Any advice is welcome. Well, except for "dump the Tyan S2460
motherboards" maybe.

Oh, and we're at current BIOS and firmware revs for both the ICP card
and the motherboard.


Re: iir + Tyan S2460 + SMP problems

2006-04-21 Thread Douglas K. Rand
Doug> We're having problems with FreeBSD 5.4, 6.0, and 6.1 and an ICP
Doug> Vortex GDT8546RZ 4 port SATA RAID

John-Mark> We've had very similar experiences on 4.7-R.  The box would
John-Mark> hang on one partition waiting for IO to come back, but
John-Mark> direct access to the disk, or accessing other partitions
John-Mark> would pass IO fine.

Interesting, the opposite of our problems. It works perfectly for us
on 4.x systems, but not in 5.4 and 6.x.

Doug> Any advice is welcome. Well, except for "dump the Tyan S2460
Doug> motherboards" maybe.

John-Mark> How about drop iir?

While not as expensive as switching motherboards, still a pain. We've
been very happy with our ICP cards. But your suggestion does have
merit, especially as recent PCI-X and PCI-E ICP cards have no FreeBSD
driver in sight.



Re: iir + Tyan S2460 + SMP problems

2006-04-22 Thread Douglas K. Rand
John-Mark> How about drop iir?

Doug> But your suggestion does have merit, especially as recent PCI-X
Doug> and PCI-E ICP cards have no FreeBSD driver in sight.

Scott> My understanding is that the ICP division is switching over to
Scott> the architecture supported by the 'aac' driver.  Adaptec
Scott> provided updates to this driver last year that include a number
Scott> of ICP id numbers.  If you have access to one of these cards
Scott> that you mention, would you mind trying the aac driver and
Scott> reporting on whether or not it worked?

I didn't know that the newer ICP cards would work, so we don't have
any of the newer ones. Sorry.



Re: What is considered "the best supported" RAID controller for 5.x?

2005-06-30 Thread Douglas K. Rand
JM> Basically a process will hang in either ffsfsn (fsync induces this
JM> write) or getblk (a read), and as far as I can find out, the io
JM> just never returns even though the underlying block device
JM> continues to work fine..

I know this is probably a stupid answer, but ...  Have you upgraded
the firmware on the card? We have a number of ICP cards and with the
2.39 firmware we can easily lock up the box with the behavior you
explain with a simple buildworld.  Upgrading the firmware to 2.44
solves the problem completely for us.



5.4 hangs on disk IO

2005-10-07 Thread Douglas K. Rand
I've got 2 FreeBSD 5.4 systems that seem to get stuck doing disk
IO. When the system gets hung, it seems to refuse to do any disk
io. It will continue to respond to pings, and the tty driver on the
serial console continues to work and echo characters. But all the
processes seem to get stuck in the state "ufs". On the serial console
I tried sync, which then hangs, and ^T produces:

athearn# sync
load: 0.66  cmd: csh 4138 [ufs] 0.00u 0.00s 0% 1748k

One of the systems emits these error messages on the console:

swap_pager: indefinite wait buffer: device: da0s1a, blkno: 2, size: 4096
swap_pager: indefinite wait buffer: device: da0s1h, blkno: 6, size: 4096
swap_pager: indefinite wait buffer: device: da0s1h, blkno: 7, size: 4096
swap_pager: indefinite wait buffer: device: da0s1h, blkno: 15, size: 4096
swap_pager: indefinite wait buffer: device: da0s1a, blkno: 26, size: 4096

Other than that difference, both systems hang the same way.

The interesting part is that it only seems to happen when I run
amd. With amd running, one of our users can hang the system in about
5-10 minutes of heavy disk traffic to a LOCAL disk. The local disk
with the heaviest traffic is "behind" an amd-managed symlink. If I
don't run amd and do all the NFS mounts by hand and build the symlink
by hand, the system runs fine. I've tried both the stock amd that
comes with 5.4 (the fairly outdated 6.0.10-20040513) and the brand new
6.1.2.1 from ports.

From the kernel debugger I can't even panic or call boot(0) without
errors.

I can easily reproduce the hang, so if there are any suggestions for
things to poke at with the kernel debugger on the serial console, let
me know.

Here's the result of ps from a hung system, along with the errors from
panic and boot(0):

db> ps
  pid   proc     uid  ppid  pgrp  flag   stat  wmesg    wchan  cmd
 4138 c34051c40   530  4138 412 [SLPQ ufs 0xc346b2bc][SLP] csh
 4137 c3e8be20 1002   740   735 400 [SLPQ sbwait 0xc366fe84][SLP] perl5.8.7
 4136 c3e8bc5c 1002   741   735 400 [SLPQ sbwait 0xc3a95d40][SLP] perl5.8.7
 4135 c3e071c4 1002   743   735 0004000 [SLPQ nfs 0xc35a7e14][SLP] mkdir
 4134 c3d8e1c4 1002   749   735 0004000 [SLPQ ufs 0xc346b2bc][SLP] perl5.8.7
 4132 c3e6154c 1002   739   735 0004000 [SLPQ nfs 0xc35a7e14][SLP] remap
 2355 c3714a98 1000   682  2355 0004002 [SLPQ select 0xc06787c4][SLP] top
  749 c3ba3c5c 1002 1   735 0004002 [SLPQ wait 0xc3ba3c5c][SLP] tclsh8.4
  748 c348f54c 1002 1   735 0004002 [SLPQ accept 0xc36a6a5a][SLP] perl5.8.7
  747 c3710a98 1002 1   735 0004002 [SLPQ accept 0xc3a6154a][SLP] perl5.8.7
  746 c370fa98 1002 1   735 0004002 [SLPQ accept 0xc374e17e][SLP] perl5.8.7
  745 c3ba3e20 1002 1   735 0004002 [SLPQ accept 0xc374e68e][SLP] perl5.8.7
  744 c3ba6000 1002 1   735 0004002 [SLPQ accept 0xc35ac7d2][SLP] perl5.8.7
  743 c3ba61c4 1002 1   735 0004002 [SLPQ wait 0xc3ba61c4][SLP] perl5.8.7
  742 c3ba6388 1002 1   735 0004002 [SLPQ accept 0xc366f406][SLP] perl5.8.7
  741 c34b6e20 1002 1   735 0004002 [SLPQ wait 0xc34b6e20][SLP] perl5.8.7
  740 c34b6388 1002 1   735 0004002 [SLPQ wait 0xc34b6388][SLP] perl5.8.7
  739 c34b6c5c 1002 1   735 0004002 [SLPQ wait 0xc34b6c5c][SLP] perl5.8.7
  682 c34ef000 1000   681   682 0004002 [SLPQ pause 0xc34ef038][SLP] zsh
  681 c340954c 1000   678   678 100 [SLPQ select 0xc06787c4][SLP] sshd
  678 c348ca980   442   678 100 [SLPQ sbwait 0xc3a616ec][SLP] sshd
  530 c3710e200   529   530 0004002 [SLPQ ppwait 0xc3710e20][SLP] csh
  529 c370f3880 1   529 0004102 [SLPQ wait 0xc370f388][SLP] login
  528 c37103880 1   528 0004002 [SLPQ ttyin 0xc31d0a10][SLP] getty
  527 c348c1c40 1   527 0004002 [SLPQ ttyin 0xc31b8410][SLP] getty
  526 c348f3880 1   526 0004002 [SLPQ ttyin 0xc31b8010][SLP] getty
  525 c34ef8d40 1   525 0004002 [SLPQ ttyin 0xc3166c10][SLP] getty
  524 c34b6a980 1   524 0004002 [SLPQ ttyin 0xc3166210][SLP] getty
  523 c348f8d40 1   523 0004002 [SLPQ ttyin 0xc3166410][SLP] getty
  522 c348ce200 1   522 0004002 [SLPQ ttyin 0xc3166610][SLP] getty
  521 c34f3e200 1   521 0004002 [SLPQ ttyin 0xc3166810][SLP] getty
  491 c34053880 1   490 000 [SLPQ ufs 0xc346b2bc][SLP] snmpd
  464 c37107100 1   464 000 [SLPQ ufs 0xc346b2bc][SLP] cron
  452 c34b6710   25 1   452 100 [SLPQ pause 0xc34b6748][SLP] sendmail
  448 c37140000 1   448 100 [SLPQ pause 0xc3714038][SLP] sendmail
  442 c34f38d40 1   442 100 [SLPQ select 0xc06787c4][SLP] sshd
  427 c34097100 1   427 000 [SLPQ select 0xc06787c4][SLP] ntpd
  382 c3167c5c0   378   378 000 [SLPQ - 0xc3383c00][SLP] nfsd
  381 c34ef3880   378   378 000 [SLPQ - 0xc33d7200][SLP] nfsd
  380 c34f354c0   378   378 000 [SLPQ - 0xc33f6400][SLP] nfsd
  379 c34b654c0   378   378 000 [SLPQ - 0xc338f000][SLP] nfsd
  378 c34057100 1   378 000 [SLPQ accept 0xc365403a][SLP] 

Re: indefinite wait buffer: Does this indicate hardware issue?

2005-12-19 Thread Douglas K. Rand
** On Sat, 17 Dec 2005 00:49:48 +0800, Xin LI <[EMAIL PROTECTED]> said:

Xin> I have a box indicating the following sometimes:
Xin> "swap_pager: indefinite wait buffer: bufobj: 0, blkno: 262169, size: 4096"

We are having a very similar problem that we've been trying to
diagnose off and on for a while. About 20% of the time the system will
emit very similar messages. In each case processes trying to write to
disks get stuck in the "ufs" state.

This is a dual CPU AMD Athlon(tm) MP 1600+ system on a Tyan 2460 mobo
with 512 MB of RAM and an ICP Vortex GDT8546RZ SATA RAID controller.

Here is the result of a C-t on an interactive shell via a serial
console:

  load: 0.00  cmd: csh 547 [ufs] 0.02u 0.01s 0% 2516k

With the kernel debugger a few processes will be stuck in the same state:

db> ps
  pid   proc     uid  ppid  pgrp  flag   stat  wmesg    wchan  cmd
  647 c1a080000   645   647 110 [SLPQ ufs 0xc1856c08][SLP] cron
  646 c18834182   644   646 110 [SLPQ biord 0xcbcff650][SLP] cron
  645 c1a0a8300   493   493 000 [SLPQ ppwait 0xc1a0a830][SLP] cron
  644 c1887c480   493   493 000 [SLPQ ppwait 0xc1887c48][SLP] cron
  643 c1a0a4180   632   642 0004002 [SLPQ wdrain 0xc06a26c4][SLP] bsdtar
  632 c1a08a3c0   631   632 0004002 [SLPQ pause 0xc1a08a70][SLP] tcsh
  631 c1834624 1001   616   631 0004102 [SLPQ wait 0xc1834624][SLP] su
  616 c188720c 1001   615   616 0004002 [SLPQ pause 0xc1887240][SLP] tcsh
  615 c1a0a624 1001   613   613 100 [SLPQ select 0xc06a2144][SLP] sshd
  613 c18836240   471   613 0004100 [SLPQ sbwait 0xc1b3dbc8][SLP] sshd
  547 c1671a3c0   537   547 0004002 [SLPQ ufs 0xc1856c08][SLP] csh
  537 c1883a3c0 1   537 0004102 [SLPQ wait 0xc1883a3c][SLP] login
  536 c18876240 1   536 0004002 [SLPQ ttyin 0xc16e8810][SLP] getty
  535 c16716240 1   535 0004002 [SLPQ ttyin 0xc16e8410][SLP] getty
  534 c1a0aa3c0 1   534 0004002 [SLPQ ttyin 0xc16dac10][SLP] getty
  533 c1671c480 1   533 0004002 [SLPQ ttyin 0xc16e7010][SLP] getty
  532 c1a0ac480 1   532 0004002 [SLPQ ttyin 0xc16e7410][SLP] getty
  531 c1887a3c0 1   531 0004002 [SLPQ ttyin 0xc16e7810][SLP] getty
  530 c18870000 1   530 0004002 [SLPQ ttyin 0xc16e7c10][SLP] getty
  529 c183420c0 1   529 0004002 [SLPQ ttyin 0xc16e8010][SLP] getty
  513 c1a0a20c0 1   513 000 [SLPQ select 0xc06a2144][SLP] inetd
  493 c1a088300 1   493 000 [SLPQ ufs 0xc1856c08][SLP] cron
  481 c1883c48   25 1   481 100 [SLPQ pause 0xc1883c7c][SLP] sendmail
  477 c167120c0 1   477 100 [SLPQ pause 0xc1671240][SLP] sendmail
  471 c18340000 1   471 100 [SLPQ select 0xc06a2144][SLP] sshd
  349 c18874180 1   349 000 [SLPQ ufs 0xc1856c08][SLP] ypbind
  336 c18344180 1   336 000 [SLPQ select 0xc06a2144][SLP] rpcbind
  317 c18878300 1   317 000 [SLPQ select 0xc06a2144][SLP] syslogd
  283 c16718300 1   283 000 [SLPQ select 0xc06a2144][SLP] devd
  228 c1883000   65 1   228 100 [SLPQ select 0xc06a2144][SLP] dhclient
  208 c188320c0 153 002 [SLPQ select 0xc06a2144][SLP] dhclient
   52 c1834a3c0 0 0 204 [SLPQ - 0xd5526d08][SLP] schedcpu
   51 c1834c480 0 0 204 [SLPQ - 0xc06aa4cc][SLP] nfsiod 3
   50 c161b6240 0 0 204 [SLPQ - 0xc06aa4c8][SLP] nfsiod 2
   49 c161b8300 0 0 204 [SLPQ - 0xc06aa4c4][SLP] nfsiod 1
   48 c161ba3c0 0 0 204 [SLPQ - 0xc06aa4c0][SLP] nfsiod 0
   47 c161bc480 0 0 204 [SLPQ vlruwt 0xc161bc48][SLP] vnlru
   46 c1670 0 0 204 [SLPQ getblk 0xcbc36578][SLP] syncer
   45 c167020c0 0 0 204 [SLPQ psleep 0xc06a268c][SLP] bufdaemon
   44 c16704180 0 0 20c [SLPQ pgzero 0xc06b09c4][SLP] pagezero
9 c16706240 0 0 204 [SLPQ psleep 0xc06b0514][SLP] vmdaemon
8 c16708300 0 0 204 [SLPQ psleep 0xc06b04d0][SLP] pagedaemon
   43 c1670a3c0 0 0 204 [IWAIT] swi0: sio
7 c1670c480 0 0 204 [SLPQ - 0xc16dd43c][SLP] fdc0
6 c16710000 0 0 204 [SLPQ - 0xc166f280][SLP] kqueue taskq
   42 c160fc480 0 0 204 [IWAIT] swi5:+
5 c161a0000 0 0 204 [SLPQ - 0xc15c1280][SLP] thread taskq
   41 c161a20c0 0 0 204 [IWAIT] swi6:+
   40 c161a4180 0 0 204 [IWAIT] swi6: task queue
   39 c161a6240 0 0 204 [IWAIT] swi2: cambio
   38 c161a8300 0 0 204 [SLPQ - 0xc0698140][SLP] yarrow
4 c161aa3c0 0 0 204 [SLPQ - 0xc0698b08][SLP] g_down
3 c161ac480 0 0 204 [SLPQ - 0xc0698b04][SLP] g_up
2 c161b0000 0 0 204 [SLPQ - 0xc0698afc][SLP] g_event
   37 c161b20c0 0 0 204 [IWAIT] swi3: vm
   36 c161b4180 0 0 20c [IWAIT] swi4: clock sio
   35 c16006240 0 0 

Re: indefinite wait buffer: Does this indicate hardware issue?

2005-12-23 Thread Douglas K. Rand
Xin> Hi

Hi.

On 19 Dec 2005 14:32:31 -0600, Douglas K. Rand <[EMAIL PROTECTED]> wrote:
Doug> Tracing command swapper pid 0 tid 0 td 0xc0698e20
Doug> sched_switch(c0698e20,0,1) at sched_switch+0x14b
Doug> mi_switch(1,0) at mi_switch+0x1ba
Doug> scheduler(0,81ec00,81e000,0,c042f5d5) at scheduler+0x262
Doug> mi_startup() at mi_startup+0x96
Doug> begin() at begin+0x2c

Xin> Which scheduler are you using?

SCHED_4BSD.


pam_ssh problems

2000-11-21 Thread Douglas K. Rand

Sometime between 4.1.1-STABLE and 4.2-BETA we started having
difficulties with using pam_ssh and wdm. Here is a piece of our
/etc/pam.conf:

   wdm  auth      sufficient  pam_ssh.so
   wdm  auth      required    pam_unix.so
   wdm  account   required    pam_unix.so  try_first_pass
   wdm  session   required    pam_ssh.so
   wdm  password  required    pam_deny.so

This used to work just fine: It would authenticate against the user's
~/.ssh/identity, and when wdm started the session, it would
automatically start up ssh-agent and add the user's SSH key.

After a cvsup, wdm started dropping core. I've cvsup'ed a few times
since then also, hoping for a fix, but no luck yet. My latest cvsup
was yesterday. 

So, I rebuilt wdm with debug symbols, and rebuilt world with -g too,
and here is the backtrace from gdb:

#0  0x283ed553 in ?? ()
#1  0x283ed72b in ?? ()
#2  0x283ea744 in ?? ()
#3  0x28321a10 in _pam_dispatch_aux (pamh=0x8069300, flags=0, h=0x8069900, 
resumed=PAM_FALSE)
at /usr/src/lib/libpam/libpam/../../../contrib/libpam/libpam/pam_dispatch.c:79
#4  0x28321e10 in _pam_dispatch (pamh=0x8069300, flags=0, choice=4)
at /usr/src/lib/libpam/libpam/../../../contrib/libpam/libpam/pam_dispatch.c:270
#5  0x283200d6 in pam_open_session (pamh=0x8069300, flags=0)
at /usr/src/lib/libpam/libpam/../../../contrib/libpam/libpam/pam_session.c:26
#6  0x8054b9d in StartClient (verify=0x805fbfc, d=0x8069000, pidp=0x805fbe0, 
name=0x805f4e8 "user", passwd=0x805f500 "password")
at session.c:682
#7  0x8054009 in ManageSession (d=0x8069000) at session.c:308
#8  0x8050454 in StartDisplay (d=0x8069000) at dm.c:635
#9  0x805023b in CheckDisplayStatus (d=0x8069000) at dm.c:562
#10 0x8050a40 in ForEachDisplay (f=0x80501d4 )
at dpylist.c:55
#11 0x8050257 in StartDisplays () at dm.c:571
#12 0x804f638 in main (argc=2, argv=0xbfbff708) at dm.c:185
#13 0x804a5c8 in _start (arguments=0xbfbff818 "-:0  ")
at /usr/src/lib/csu/i386-elf/crt1.c:96

The code seems to be launching the module, but I can't figure out
which module it is having trouble with, although I expect it is
pam_ssh.so. 

Here are a few more details from gdb:

(gdb) print *h
$8 = {must_fail = 0, func = 0x283ea1a0, actions = {-1, -3 , 
-1, -3 , 0, -3, -3, -3, -3, -3, -3}, argc = 0, 
  argv = 0x0, next = 0x0}

(gdb) print *pamh
$9 = {authtok = 0x0, pam_conversation = 0x8065de0, oldauthtok = 0x0, 
  prompt = 0x0, service_name = 0x8065dc0 "wdm", user = 0x8065dd0 "user", 
  rhost = 0x0, ruser = 0x0, tty = 0x8065e00 ":0", pam_default_log = {
ident = 0x0, option = 0, facility = 0}, data = 0x8065f50, env = 0x8065df0, 
  fail_delay = {set = PAM_FALSE, delay = 0, begin = 974839469, 
delay_fn_ptr = 0x0}, handlers = {module = 0x806a6c0, 
modules_allocated = 4, modules_used = 3, handlers_loaded = 1, conf = {
  authenticate = 0x8069400, setcred = 0x8069500, acct_mgmt = 0x8069800, 
  open_session = 0x8069900, close_session = 0x8069a00, 
  chauthtok = 0x8069b00}, other = {authenticate = 0x8069c00, 
  setcred = 0x8069d00, acct_mgmt = 0x8069e00, open_session = 0x0, 
  close_session = 0x0, chauthtok = 0x0}}, former = {choice = 0, depth = 0, 
impression = 0, status = 0, want_user = 0, prompt = 0x0, 
update = PAM_FALSE}}

I don't know if anybody else is having this problem, or knows how to
fix it, but any assistance would be useful.





Re: 3ware problems

2001-03-14 Thread Douglas K. Rand

> Drat.  There it is; you've got a command that looks like it's stuck in
> the adapter.

I'll go grab the can of WD-40.   :)

> I didn't see you respond to Mike T - are you using 64k or 128k stripes?

I didn't get his query until I had already started the mysqld trying to break
things. And now I'm at home, and while serial consoles are *really* great for
most things, I can't get at the 3ware BIOS from here. I didn't want to respond
until I had checked the BIOS.

I'm /fairly/ sure that I took the default 64K stripes, but one time in
rebinding the array I did change the stripe size.

> If the latter, try changing the value of TWE_Q_LENGTH in
> /sys/dev/twe/twereg.h to 100 and see if you can reproduce it.

Rebuilding the kernel as I type.

> I am worrying about firmware here at the moment

We are running the latest firmware as of about 10 days ago.

> Thanks for your patience.

Are you kidding?  Thanks for all the help. We really appreciate it.
Anything we can do to help, let us know.






Re: 3ware problems

2001-03-15 Thread Douglas K. Rand

> Drat.  There it is; you've got a command that looks like it's stuck in
> the adapter.
>
> try changing the value of TWE_Q_LENGTH in /sys/dev/twe/twereg.h to 100 and
> see if you can reproduce it.

Well, I just woke up and mysqld was stuck again in getblk, this time with a
TWE_Q_LENGTH of 100:

db> call twe_report
twe0: status   57007390
twe0:   current  max
twe0: free  0099 0100
twe0: ready  
twe0: busy  0001 0100
twe0: complete   0011
twe0: bioq   0027
twe0: AEN queue head 1  tail 0
twed: total bio count in 1646323  out 1646322
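
Reading the last line of the report: the "in" and "out" counters differ
by one, which lines up with the single command shown as busy. A minimal
Python sketch for pulling that difference out of a captured report
follows. The function name is made up for illustration, and the regular
expression assumes the line looks exactly like the one above.

```python
import re

# Hypothetical helper, not part of the twe driver or any 3ware tool.
# It assumes the "total bio count in N  out M" wording shown in the report.
def outstanding_bios(report: str) -> int:
    """Return how many bios went into the driver but have not come back out."""
    m = re.search(r"total bio count in (\d+)\s+out (\d+)", report)
    if m is None:
        raise ValueError("no 'total bio count' line found")
    return int(m.group(1)) - int(m.group(2))

report = "twed: total bio count in 1646323  out 1646322"
outstanding = outstanding_bios(report)
print(outstanding)  # 1646323 - 1646322
```

A non-zero result on an otherwise idle system would be one hint that a
request never completed.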







3Ware, Western Digital disks, and stray interrupts

2002-03-17 Thread Douglas K. Rand

We have two pretty much identical systems: Both are Tyan Tiger MP
S2469 boards with a 3ware 7450 controller and Western Digital WD1000
100GB disks. One system has 4 disks in a RAID 10 configuration, and
the other has 2 disks in a RAID 1 configuration. One system only has a
single Athlon MP CPU, while the other has 2 Athlon MP CPUs.

We have gone through 5 of the WD1000 disks so far, with a 6th that
just failed the other day. The first 3 failures we tested with Western
Digital's drive fitness test, which reported all three drives to be
OK. The first disk that failed we tried to put back in and have the
3ware controller rebuild, but the rebuild failed after 2 hours. We've
stopped testing the disks, and just send them back to Western
Digital. 

All the failures have been drive timeouts:

Dec 29 16:55:31 doppler[kern.crit] /kernel: twe0: AEN: 
Jan  1 23:36:04 doppler[kern.crit] /kernel: twe0: AEN: 
Feb 22 18:19:21 doppler[kern.crit] /kernel: twe0: AEN: 
Mar  7 20:18:44 vault[kern.crit] /kernel: twe0: AEN: 
Mar 16 21:42:02 vault[kern.crit] /kernel: twe0: AEN: 

The last two messages were somewhat massaged by me; more on that later...

So, the first question: Has anybody else seen such a horrible failure
rate with the WD1000 disks?

The other problem we are having, which /may/ be related, is that the
second system (vault, the single CPU box) has had 2 failures that
coincide with a spate of "stray irq 7" messages. We are using swatch
to watch for the twe messages, but the two failures on vault have had
the kernel log mixed with the stray irq 7 messages:

Mar  7 20:18:44 vault[kern.crit] /kernel: t
Mar  7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar  7 20:18:44 vault[kern.crit] /kernel: we0
Mar  7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar  7 20:18:44 vault[kern.crit] /kernel: too many stray irq 7's; not logging any more
Mar  7 20:18:44 vault[kern.crit] /kernel: : AEN: 

Mar 16 21:42:02 vault[kern.crit] /kernel: tw
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: e0:
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: A
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: EN: 

In both cases, there aren't any kernel logs for 2 hours on either side
of this message. We have the parallel port disabled in the BIOS, and
after the last failure we took irq 7 away from the PCI and PnP
devices. (None of the previous dmesg for the system report any devices
using irq 7.) I've put the current dmesg at the end.
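
The fragmented twe0 lines above can be put back together mechanically,
which is roughly the massaging done by hand on the last two AEN
messages. A minimal Python sketch, assuming the syslog line layout shown
above: it drops the stray irq lines and joins the remaining kern.crit
fragments that share a host and timestamp. The function name and regular
expression are made up for illustration; whitespace at fragment
boundaries may already be lost in a saved log, so the joined text can be
missing spaces.

```python
import re

# Assumed layout: "Mon DD HH:MM:SS host[facility.level] /kernel: text"
LINE = re.compile(r"^(\w{3}\s+\d+ [\d:]+) (\S+?)\[(kern\.\w+)\] /kernel: (.*)$")

def reassemble(lines):
    """Join kern.crit fragments per (timestamp, host), skipping stray irq noise."""
    merged = {}
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # not a kernel log line in the assumed layout
        stamp, host, facility, text = m.groups()
        if facility != "kern.crit" or "stray irq" in text:
            continue  # interrupt noise, including the "too many" notice
        key = (stamp, host)
        merged[key] = merged.get(key, "") + text
    return merged

sample = [
    "Mar  7 20:18:44 vault[kern.crit] /kernel: t",
    "Mar  7 20:18:44 vault[kern.err] /kernel: stray irq 7",
    "Mar  7 20:18:44 vault[kern.crit] /kernel: we0",
    "Mar  7 20:18:44 vault[kern.err] /kernel: stray irq 7",
    "Mar  7 20:18:44 vault[kern.crit] /kernel: too many stray irq 7's; not logging any more",
    "Mar  7 20:18:44 vault[kern.crit] /kernel: : AEN: ",
]
result = reassemble(sample)
for (stamp, host), text in result.items():
    print(stamp, host, text)
```

Keying on the timestamp only works because each failure was logged
within a single second; fragments spread across seconds would need a
looser grouping.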

So, is the 3ware controller causing the stray irq 7 messages when the
disk fails, or are the stray irq 7 messages causing the 3ware
controller to timeout the disk?

Any help would be appreciated. Pretty soon Western Digital is gonna
stop taking our phone calls. Either that, or we'll lose 2 disks
before we get the first one fixed. ;^)


Copyright (c) 1992-2002 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 4.5-RELEASE #1: Wed Feb 13 17:10:19 CST 2002
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/VAULT
Timecounter "i8254"  frequency 1193182 Hz
Timecounter "TSC"  frequency 1400054127 Hz
CPU: AMD Athlon(tm) MP Processor 1600+ (1400.05-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x662  Stepping = 2
  
Features=0x383fbff
  AMD Features=0xc048<,AMIE,DSP,3DNow!>
real memory  = 268435456 (262144K bytes)
avail memory = 258568192 (252508K bytes)
Preloaded elf kernel "kernel" at 0xc02a8000.
Pentium Pro MTRR support enabled
Using $PIR table, 268435454 entries at 0xc00fdf10
npx0:  on motherboard
npx0: INT 16 interface
pcib0:  on motherboard
pci0:  on pcib0
pcib1:  at device 1.0 on pci0
pci1:  on pcib1
pci1:  at 5.0 irq 10
isab0:  at device 7.0 on pci0
isa0:  on isab0
atapci0:  port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
chip1:  at device 7.3 on pci0
twe0: <3ware Storage Controller> port 0x1430-0x143f mem 
0xf400-0xf47f,0xf4901000-0xf490100f irq 5 at device 8.0 on pci0
twe0: 4 ports, Firmware FE7X 1.03.09.027, BIOS BE7X 1.07.02.002
fxp0:  port 0x1400-0x141f mem 
0xf480-0xf48f,0xf4903000-0xf4903fff irq 5 at device 12.0 on pci0
fxp0: Ethernet address 00:90:27:18:d7:45
inphy0:  on miibus0
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
ahc0:  port 0x1000-0x10ff mem 0xf490-0xf4900fff 
irq 11 at device 13.0 on pci0
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/255 SCBs
orm0:  at iomem 0xc-0xc7fff,0xc8000-0xc8fff,0xc9800-0xc9fff on isa0
fdc0:  at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0:  at port 0x60,0x64 on isa0
vga0:  at port 0x3c0-0x3df iomem 0xa-0xb on isa0
sc0:  at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x1

Re: 3Ware, Western Digital disks, and stray interrupts

2002-03-19 Thread Douglas K. Rand

Albert> We're running a 6410 with four WD1000 disks and have had only
Albert> one failure so far. Anyway, according to the author of the twe
Albert> driver, Michael Smith, a firmware upgrade for the 7xxx series
Albert> controllers should be available by now.

Thanks, I had missed that post. We might have to switch from
RELENG_4_5 to RELENG_4 for that box and try it. Not the end of the
world, we've just gotten to like tracking those branches. 

