Re: [gentoo-user] Long boot time after kernel update

Rich Freeman Mon, 27 Dec 2021 06:15:28 -0800

On Mon, Dec 27, 2021 at 8:46 AM Wols Lists <antli...@youngman.org.uk> wrote:
>
> On 27/12/2021 13:40, Michael wrote:
> > On Monday, 27 December 2021 11:32:39 GMT Wols Lists wrote:
> >> On 27/12/2021 11:07, Jacques Montier wrote:
> >>> Well, i don't know if my partitions are aligned or mis-aligned... How
> >>> could i get it ?
> >>
> >> fdisk would have spewed a bunch of warnings. So you're okay.
> >>
> >> I'm not sure of the details, but it's the classic "off by one" problem -
> >> if there's a mismatch between the kernel block size and the disk block
> >> size any writes required doing a read-update-write cycle which of course
> >> knackered performance. I had that hit a while back.
> >>
> >> But seeing as fdisk isn't moaning, that isn't the problem ...
> >>
> >> Cheers,
> >> Wol
> >
> > I also thought of misaligned boundaries when I first saw the error, but the
> > mention of Seagate by the OP pointed me to another edge case which crept up
> > with zstd compression on ZFS.  I'm mentioning it here in case it is 
> > relevant:
> >
> > https://livelace.ru/posts/2021/Jul/19/unaligned-write-command/
> >
> that might be of interest to me ... I'm getting system lockups but it's
> not an SSD. I've got two IronWolves and a Barracuda.
>
> But I notice the OP has a Barra*C*uda. Note the different spelling.
> That's a shingled drive I believe, which shouldn't make a lot of
> difference in light usage, but you don't want to hammer it!


I've run into this issue and I've seen rare reports of it online, but
no sign of resolution.  I'm pretty sure it is some sort of bug in the
kernel.  I've tended to see it under load, and mostly when using zfs.
I do not use zstd compression and do not have any zvols on the pools
that had this issue.  So, either there are multiple problems, or that
linked post did not correctly identify the root cause (which seems
likely).  I'm guessing it is triggered under load and perhaps using
zstd compression helps create that load.

I haven't seen it much lately, probably because I've shifted a lot of
my load to lizardfs.  I'm also using USB3 hard drives for the bulk of
my storage, and since these seem to be ATA errors, removing the SATA
host and its associated drivers may bypass the problem.

I doubt this has anything to do with physical/logical sector size and
partition alignment.  The disks should still work correctly if
partitions aren't aligned to physical sectors - they should just have
degraded performance.  In any case, all my drives are aligned on physical
sector boundaries.  I'm not familiar enough with ATA to understand
what the actual errors are referring to.
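
For anyone wanting to check alignment by hand rather than relying on
fdisk warnings, here is a minimal sketch.  It assumes the usual Linux
sysfs layout, where a partition's "start" file is given in 512-byte
units and the disk's queue/physical_block_size is in bytes.  The
function names and the "sdc"/"sdc1" device names are illustrative
only.

```python
from pathlib import Path

SYSFS_SECTOR = 512  # sysfs reports a partition's 'start' in 512-byte units


def is_aligned(start_sector: int, physical_block_size: int) -> bool:
    """True if the partition's byte offset is a multiple of the physical block size."""
    return (start_sector * SYSFS_SECTOR) % physical_block_size == 0


def check_partition(disk: str, part: str) -> bool:
    """Read both values from sysfs, e.g. check_partition("sdc", "sdc1")."""
    base = Path("/sys/block") / disk
    start = int((base / part / "start").read_text())
    phys = int((base / "queue" / "physical_block_size").read_text())
    return is_aligned(start, phys)


# A partition starting at sector 2048 (1 MiB) is aligned for 4096-byte
# physical sectors; the old DOS default of sector 63 is not.
print(is_aligned(2048, 4096))
print(is_aligned(63, 4096))
```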

Here is an example of one of the errors I've had in the past from one
of these situations.  A zpool scrub usually clears up any damage and
then the drive works normally until the issue happens again (which
hasn't happened in quite a while for me now).  I have a dump of the
SMART logs and the kernel ring buffer:

ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 12838 hours (534 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 e0 88 cc c3 06  Error: ICRC, ABRT at LBA = 0x06c3cc88 = 113495176

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 c0 68 cb c3 40 08   2d+00:45:18.962  WRITE FPDMA QUEUED
  60 00 b8 98 67 00 40 08   2d+00:45:18.917  READ FPDMA QUEUED
  60 00 b0 98 65 00 40 08   2d+00:45:18.916  READ FPDMA QUEUED
  60 00 a8 98 66 00 40 08   2d+00:45:18.916  READ FPDMA QUEUED
  61 00 a0 68 ca c3 40 08   2d+00:45:18.879  WRITE FPDMA QUEUED
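
As a rough aid to reading the register dump above, the sketch below
rebuilds the 28-bit LBA from the SN/CL/CH/DH registers and names the
bits set in the error register.  The helper names are made up for
illustration, and only a few error-register bits are listed; it is
not a full ATA decoder.

```python
# A few ATA error-register bits, highest bit first (subset only).
ER_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT")]


def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine the taskfile registers into a 28-bit LBA.

    Only the low nibble of the Device/Head register carries LBA bits.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def decode_error_register(er: int) -> list:
    """Return the names of the known bits set in the error register."""
    return [name for mask, name in ER_BITS if er & mask]


# Values from the "registers were" line: ER=84 SN=88 CL=cc CH=c3 DH=06
print(lba28_from_registers(0x88, 0xCC, 0xC3, 0x06))  # 113495176
print(decode_error_register(0x84))                   # ['ICRC', 'ABRT']
```

Note that this is only the 28-bit view recorded in the SMART log; the
kernel log further down reports the full sector number for the failed
command.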

[354064.268896] ata6.00: exception Emask 0x11 SAct 0x1000000 SErr 0x480000 action 0x6 frozen
[354064.268907] ata6.00: irq_stat 0x48000008, interface fatal error
[354064.268910] ata6: SError: { 10B8B Handshk }
[354064.268915] ata6.00: failed command: WRITE FPDMA QUEUED
[354064.268919] ata6.00: cmd 61/00:c0:68:cb:c3/07:00:06:01:00/40 tag 24 ncq dma 917504 out
                         res 50/00:00:68:cb:c3/00:07:06:01:00/40 Emask 0x10 (ATA bus error)
[354064.268922] ata6.00: status: { DRDY }
[354064.268926] ata6: hard resetting link
[354064.731093] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[354064.734739] ata6.00: configured for UDMA/133
[354064.734759] sd 5:0:0:0: [sdc] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[354064.734764] sd 5:0:0:0: [sdc] tag#24 Sense Key : Illegal Request [current]
[354064.734767] sd 5:0:0:0: [sdc] tag#24 Add. Sense: Unaligned write command
[354064.734771] sd 5:0:0:0: [sdc] tag#24 CDB: Write(16) 8a 00 00 00 00 01 06 c3 cb 68 00 00 07 00 00 00
[354064.734774] print_req_error: I/O error, dev sdc, sector 4408462184
[354064.734791] ata6: EH complete
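
One way to sanity-check the "Unaligned write command" sense data is to
decode the failed CDB and see whether the write really is unaligned.
The sketch below does that for the Write(16) bytes in the log above.
It assumes 512-byte logical sectors (consistent with 917504 bytes /
0x700 blocks = 512) and, as a pure assumption about this drive,
4096-byte physical sectors.  The function name is illustrative.

```python
LOGICAL = 512    # bytes per logical sector (matches the "dma 917504 out" figure)
PHYSICAL = 4096  # assumed physical sector size, not stated in the log


def decode_write16(cdb_hex: str) -> tuple:
    """Return (lba, block_count) from a WRITE(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: 64-bit LBA
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


lba, blocks = decode_write16("8a 00 00 00 00 01 06 c3 cb 68 00 00 07 00 00 00")
per_phys = PHYSICAL // LOGICAL

print(lba, blocks, blocks * LOGICAL)  # 4408462184 1792 917504
print(lba % per_phys == 0, blocks % per_phys == 0)
```

Under those assumptions both the starting sector and the length are
multiples of 8 logical sectors, which fits the view that this is not
an alignment problem, though it does not prove it.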


-- 
Rich
