Hello everybody,
I have also been facing this problem for several months and have been
trying to solve it on my own.
In my case the situation is even more complicated, as I'm using a quite
complex setup, attaching the VMs' partitions from another server via AoE.
According to the console logs, the problem seems to be related to
networking, as the first "hang-up" messages are usually something like
igb 0000:01:00.0 eth2: Reset adapter
igb 0000:01:00.0 eth2: Reset adapter
...
followed by a sequence of:
ata3.00: exception Emask 0x0 SAct 0x300000 SErr 0x0 action 0x6 frozen
ata3.00: failed command: WRITE FPDMA QUEUED
ata3.00: cmd 61/01:a0:08:f0:08/00:00:00:00:00/40 tag 20 ncq dma 512 out
res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
ata3.00: failed command: READ FPDMA QUEUED
ata3.00: cmd 60/08:a8:20:f8:55/00:00:11:00:00/40 tag 21 ncq dma 4096 in
res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
...
and then other I/O errors, like
blk_update_request: I/O error, dev sdb, sector 254266864
and others.
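In case others want to check whether their logs show the same ordering
(NIC resets first, then ATA timeouts, then block I/O errors), here is a
minimal sketch in Python. The patterns are taken only from the messages
quoted above; the function names (classify, summarize) and the SAMPLE list
are my own illustration, and other kernel versions may word these messages
differently.

```python
import re

# Patterns modelled on the console messages quoted in this post.
# Unanchored so they still match when the kernel prepends a timestamp.
NIC_RESET = re.compile(r"igb \S+ (\S+): Reset adapter")
ATA_ERROR = re.compile(r"(ata\d+\.\d+): exception Emask")
BLK_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def classify(line):
    """Return (kind, device) for a recognised message, or None otherwise."""
    for kind, pattern in (
        ("nic_reset", NIC_RESET),
        ("ata_error", ATA_ERROR),
        ("blk_error", BLK_ERROR),
    ):
        match = pattern.search(line)
        if match:
            return kind, match.group(1)
    return None


def summarize(lines):
    """Count each message kind and record the order of first appearance."""
    counts = {"nic_reset": 0, "ata_error": 0, "blk_error": 0}
    order = []
    for line in lines:
        hit = classify(line)
        if hit is None:
            continue
        kind, _device = hit
        counts[kind] += 1
        if kind not in order:
            order.append(kind)
    return counts, order


# Sample input copied from the log excerpts above.
SAMPLE = [
    "igb 0000:01:00.0 eth2: Reset adapter",
    "igb 0000:01:00.0 eth2: Reset adapter",
    "ata3.00: exception Emask 0x0 SAct 0x300000 SErr 0x0 action 0x6 frozen",
    "ata3.00: failed command: WRITE FPDMA QUEUED",
    "ata3.00: status: { DRDY }",
    "blk_update_request: I/O error, dev sdb, sector 254266864",
]

if __name__ == "__main__":
    counts, order = summarize(SAMPLE)
    print(counts)
    print("first seen in this order:", order)
```

If "nic_reset" comes first in the reported order on your machine too, that
would at least be consistent with what I see here; it does not prove that
the networking is the cause.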
IMHO something (the networking?) breaks I/O in the kernel. Nevertheless,
the system somehow keeps running, unable to read from or write to the
disks, but no kernel panic is reported!
I've tried all the suggested actions: trying newer kernels, building a
custom kernel with IPv6, QoS and other potentially problematic items
disabled, disabling all suspect features in the BIOS, tuning network
interface parameters, even disabling the interfaces (and using others
from another vendor)...
It made no difference.
However, I've discovered by lucky chance that the problem is somehow
related to the CPU family: with the Intel E55xx (tested at least with
E5520 and E5504), the problem occurs quite frequently (about once per
day), while with the E56xx (tested at least with E5606), the problem
becomes very rare or disappears altogether (I've been running it for more
than 1 month without a problem now).
I hope this information will help to solve this problem, or at least
give a viable option to other afflicted (and desperate) users.
Regards,
Jan.