Bezüglich John Baldwin's Nachricht vom 02.03.2016 18:32 (localtime): > On Monday, February 29, 2016 07:29:03 PM Harry Schmalzbauer wrote: >> Bezüglich Harry Schmalzbauer's Nachricht vom 28.02.2016 20:55 (localtime): >>> Hello, >>> >>> I have a remote machine with a probably defective ODD, but until r294989 >>> (from Jan 28th) I could boot with just these warnings: >>> (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00 >>> (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error >>> (cd1:ahcich1:0:0:0): SCSI status: Check Condition >>> (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read >>> error) >>> (cd1:ahcich1:0:0:0): Error 5, Unretryable error >>> (cd1:ahcich1:0:0:0): cddone: got error 0x5 back >>> … >>> >>> beta3 doesn't boot anymore, it's hanging with ahci-timeouts: >>> ahcich2: Timeout on slot 11 port 0 >>> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 derr >>> 00000000 cmd 0004cb17 >>> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d 01 00 >>> 00 00 00 >>> ... >>> (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 00 00 >>> 00 00 >>> (aprobe0:ahcich2:0:0:0) CAM status: Command timeout >>> (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked >>> ada1 detached >>> ... >>> The numbers (first ACB) and also the channel varies from time to time >> >> I could narrow it down to r295480 >> (https://svnweb.freebsd.org/base?view=revision&revision=295480) >> >> Reverting that lets the machine boot again. >> >> I captured verbose boot messages, finding out that problem relaxes with >> verbose-booting, since ahci seems to recover: >> … >> TSC timecounter discards lower 1 bit(s) >> Timecounter "TSC-low" frequency 1746033500 Hz quality -100 >> ahcich2: Timeout on slot 12 port 0 >> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr >> 00000000 cmd 0004cc17 >> ahcich2: AHCI reset... >> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01 00 >> 00 00 00 >> (ada1:ahcich2:0:0:0): CAM status: Command timeout >> (ada1:ahcich2:0:0:0): Retrying command >> ahcich2: SATA connect time=100us status=00000123 >> ahcich2: AHCI reset: device found >> ahcich2: AHCI reset: device ready after 100ms >> ahcich1: SNTF 0x0001 >> ahcich1: SNTF 0x0001 >> … >> >> I have checked twice that r295480 introduces boot failure here. >> >> I have absolutely no idea where/how/why/what race happens... >> >> Thanks for any hints, > > That is most bizarre. Does HEAD boot fine on this machine? The change > in question probably alters the timing of startup a bit since the random > kthread is placed on the run queue later which might affect the relative > order of kthreads as they start executing, but that would just mean it is > exposting a race in some other part of the system.
Bizarre it is... HEAD (r295683, 02/17/2016) boots fine. BETA3 fails. This time I checked with vendor-ISO images, while before it was a custom setup rollout with local patches and special hw adaptions. Now I'm sure it's not site-related. With BETA3-iso, where booting fails, "random: unblocking device." happens after timecounter initialization and before attaching ses0/cdX/adaX. With HEAD-iso, where booting succeeds, "random: unblocking device." happens way after ses0/adaX/cdX attached, right before rc. On HEAD, ahci-devices attach in the same order as with -stable pre-r295480. Since r295480, cdX attaches before adaX on -stable and while searching for the cluprit, I had observed that attaching-order was a clear indicator whether machine boots or not. Sorry, I can't provide more useful info at this time, I just can describe simple symptoms :-( While playing with that machine I remember having had such a bizarre problem before, arising with r284665. Here's what I found in my kernel config: # Don't build kernel with RACCT by default, which was enabled with r284665, # since ahci(4) fails when using MSI with ahcichX timeout! nooptions RACCT # Resource accounting framework nooptions RACCT_DEFAULT_TO_DISABLED # Set kern.racct.enable=0 by default nooptions RCTL # Resource limits Perhpas it's related?! https://lists.freebsd.org/pipermail/freebsd-stable/2015-July/082706.html Thanks, -Harry _______________________________________________ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"