So I changed all 45 nodes in the cluster back to the 2.6.32 kernel and
restarted a test job. After 15-20 minutes, some of the nodes had dropped
out: they still responded to pings, but it was impossible to ssh to them.
Output from one node (after connecting to the console) follows.
Ben Hutchings wrote:
> We'll want to see the kernel log (output from dmesg) after this happens,
> even if you can't spot anything in it.
The problem first started at 08:56. As you can see, the last messages in
/var/log/kern.log are from 08:31 (just after booting); the output from
dmesg is identical.
Mar 9 08:31:11 node20 kernel: [ 9.102710] sdc1
Mar 9 08:31:11 node20 kernel: [ 9.102890] sd 2:0:0:0: [sdc] Attached SCSI disk
Mar 9 08:31:11 node20 kernel: [ 9.104582] sda1 sda2 < sda5 sda6 >
Mar 9 08:31:11 node20 kernel: [ 9.131101] sd 0:0:0:0: [sda] Attached SCSI disk
Mar 9 08:31:11 node20 kernel: [ 9.133794] sd 0:0:0:0: Attached scsi generic sg0 type 0
Mar 9 08:31:11 node20 kernel: [ 9.133824] sd 1:0:0:0: Attached scsi generic sg1 type 0
Mar 9 08:31:11 node20 kernel: [ 9.133849] sd 2:0:0:0: Attached scsi generic sg2 type 0
Mar 9 08:31:11 node20 kernel: [ 9.133875] sd 3:0:0:0: Attached scsi generic sg3 type 0
Mar 9 08:31:11 node20 kernel: [ 9.133970] sr 4:0:0:0: Attached scsi generic sg4 type 5
Mar 9 08:31:11 node20 kernel: [ 9.324705] PM: Starting manual resume from disk
Mar 9 08:31:11 node20 kernel: [ 9.339131] EXT4-fs (sda1): INFO: recovery required on readonly filesystem
Mar 9 08:31:11 node20 kernel: [ 9.339135] EXT4-fs (sda1): write access will be enabled during recovery
Mar 9 08:31:11 node20 kernel: [ 10.429924] EXT4-fs (sda1): recovery complete
Mar 9 08:31:11 node20 kernel: [ 10.430834] EXT4-fs (sda1): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 11.227223] udev: starting version 151
Mar 9 08:31:11 node20 kernel: [ 11.447716] processor LNXCPU:00: registered as cooling_device0
Mar 9 08:31:11 node20 kernel: [ 11.448024] processor LNXCPU:01: registered as cooling_device1
Mar 9 08:31:11 node20 kernel: [ 11.448329] processor LNXCPU:02: registered as cooling_device2
Mar 9 08:31:11 node20 kernel: [ 11.448646] processor LNXCPU:03: registered as cooling_device3
Mar 9 08:31:11 node20 kernel: [ 11.448950] processor LNXCPU:04: registered as cooling_device4
Mar 9 08:31:11 node20 kernel: [ 11.449253] processor LNXCPU:05: registered as cooling_device5
Mar 9 08:31:11 node20 kernel: [ 11.449557] processor LNXCPU:06: registered as cooling_device6
Mar 9 08:31:11 node20 kernel: [ 11.449858] processor LNXCPU:07: registered as cooling_device7
Mar 9 08:31:11 node20 kernel: [ 11.584225] i2c i2c-0: nForce2 SMBus adapter at 0x2d00
Mar 9 08:31:11 node20 kernel: [ 11.584244] i2c i2c-1: nForce2 SMBus adapter at 0x2e00
Mar 9 08:31:11 node20 kernel: [ 11.614078] input: PC Speaker as /devices/platform/pcspkr/input/input5
Mar 9 08:31:11 node20 kernel: [ 11.699803] EDAC MC: Ver: 2.1.0 Jan 10 2010
Mar 9 08:31:11 node20 kernel: [ 11.826247] EDAC amd64_edac: Ver: 3.2.0 Jan 10 2010
Mar 9 08:31:11 node20 kernel: [ 11.826765] Error: Driver 'pcspkr' is already registered, aborting...
Mar 9 08:31:11 node20 kernel: [ 11.826812] EDAC amd64: ECC is enabled by BIOS.
Mar 9 08:31:11 node20 kernel: [ 11.826992] EDAC amd64: ECC is enabled by BIOS.
Mar 9 08:31:11 node20 kernel: [ 11.827029] EDAC MC: F10h CPU detected
Mar 9 08:31:11 node20 kernel: [ 11.827095] EDAC MC0: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:18.2
Mar 9 08:31:11 node20 kernel: [ 11.827098] EDAC MC: F10h CPU detected
Mar 9 08:31:11 node20 kernel: [ 11.827157] EDAC MC1: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:19.2
Mar 9 08:31:11 node20 kernel: [ 11.827174] EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
Mar 9 08:31:11 node20 kernel: [ 12.124910] Adding 32170120k swap on /dev/sda6. Priority:-1 extents:1 across:32170120k
Mar 9 08:31:11 node20 kernel: [ 12.338321] loop: module loaded
Mar 9 08:31:11 node20 kernel: [ 14.407171] EXT4-fs (sda5): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 15.002385] EXT4-fs (sdb1): recovery complete
Mar 9 08:31:11 node20 kernel: [ 15.002786] EXT4-fs (sdb1): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 15.568677] EXT4-fs (sdc1): recovery complete
Mar 9 08:31:11 node20 kernel: [ 15.570225] EXT4-fs (sdc1): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 16.180705] EXT4-fs (sdd1): recovery complete
Mar 9 08:31:11 node20 kernel: [ 16.180909] EXT4-fs (sdd1): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 16.834881] alloc irq_desc for 30 on node 0
Mar 9 08:31:11 node20 kernel: [ 16.834885] alloc kstat_irqs on node 0
Mar 9 08:31:11 node20 kernel: [ 16.834895] forcedeth 0000:00:08.0: irq 30 for MSI/MSI-X
Mar 9 08:31:21 node20 kernel: [ 27.456017] eth0: no IPv6 routers present
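
For checking the same thing on the other nodes, here is a minimal Python
sketch that finds the last kernel message in a kern.log excerpt and
reports how long the log had been silent before the failure. The
function names are hypothetical, the year is an assumption (syslog
stamps carry none), and the month names assume an English/C locale.

```python
from datetime import datetime


def syslog_time(line, year):
    """Parse the leading 'Mon D HH:MM:SS' stamp of a syslog line.

    The year is not part of the stamp, so the caller has to supply it.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}",
                             "%Y %b %d %H:%M:%S")


def last_kernel_message(lines, year):
    """Return the timestamp of the newest kernel line, or None."""
    # Wrapped continuation lines carry no ' kernel: ' tag and are skipped.
    stamps = [syslog_time(l, year) for l in lines if " kernel: " in l]
    return max(stamps) if stamps else None


def silence_before(lines, failure_time, year):
    """Return how long the kernel log was quiet before failure_time."""
    last = last_kernel_message(lines, year)
    return None if last is None else failure_time - last


if __name__ == "__main__":
    excerpt = [
        "Mar 9 08:31:11 node20 kernel: [ 16.834885] alloc kstat_irqs on node 0",
        "Mar 9 08:31:21 node20 kernel: [ 27.456017] eth0: no IPv6 routers present",
    ]
    # 2010 is assumed here; 08:56 is when the problem was first seen.
    print(silence_before(excerpt, datetime(2010, 3, 9, 8, 56), 2010))
```

A long silent gap like this only says the kernel logged nothing before
the node became unreachable; it does not by itself rule the driver in
or out.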
> The device statistics (output from ethtool -S eth0) might also be
> informative.
NIC statistics:
tx_bytes: 63756006188
tx_zero_rexmt: 56365619
tx_one_rexmt: 0
tx_many_rexmt: 0
tx_late_collision: 0
tx_fifo_errors: 0
tx_carrier_errors: 0
tx_excess_deferral: 0
tx_retry_error: 0
rx_frame_error: 0
rx_extra_byte: 0
rx_late_collision: 0
rx_runt: 0
rx_frame_too_long: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_align_error: 0
rx_length_error: 0
rx_unicast: 58975439
rx_multicast: 933
rx_broadcast: 1618
rx_packets: 58977990
rx_errors_total: 0
tx_errors_total: 0
tx_deferral: 0
tx_packets: 56365619
rx_bytes: 69269122814
tx_pause: 0
rx_pause: 46798
rx_drop_frame: 46798
tx_unicast: 2284
tx_multicast: 3008
tx_broadcast: 16510200339
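
To pick the interesting counters out of output like the above across
many nodes, something along these lines might help. It is only a sketch:
the function names and the keyword list ("error", "drop", "pause") are
my own assumptions about which counters are worth a look, not anything
defined by ethtool or the driver.

```python
def parse_ethtool_stats(text):
    """Turn 'name: value' lines from `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # The 'NIC statistics:' header has no number after the colon.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def suspicious_counters(stats, keywords=("error", "drop", "pause")):
    """Return the non-zero counters whose name contains a keyword.

    The keyword list is an assumption for illustration only.
    """
    return {
        name: value
        for name, value in stats.items()
        if value != 0 and any(k in name for k in keywords)
    }


if __name__ == "__main__":
    sample = """NIC statistics:
    rx_errors_total: 0
    tx_errors_total: 0
    tx_packets: 56365619
    tx_pause: 0
    rx_pause: 46798
    rx_drop_frame: 46798
    """
    print(suspicious_counters(parse_ethtool_stats(sample)))
```

On the statistics above this would leave just rx_pause and
rx_drop_frame, both at 46798.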
If I ifdown eth0 and then ifup eth0, I can again connect to the system
without problems.
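
Purely as a stopgap while this is being investigated, the down/up cycle
could be scripted from the console. A minimal sketch, assuming ifdown
and ifup are on the PATH and the script runs as root; the function name
and the injectable `run` parameter are hypothetical and only there so
the logic can be exercised without touching a real interface.

```python
import subprocess


def bounce_interface(iface="eth0", run=subprocess.run):
    """Take the interface down, then bring it back up, in that order."""
    for cmd in (["ifdown", iface], ["ifup", iface]):
        # check=True makes a failing command raise instead of being ignored.
        run(cmd, check=True)


if __name__ == "__main__":
    bounce_interface("eth0")
```

This only automates the workaround; it does nothing about whatever is
stalling the interface in the first place.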
Thanks,
-stephen
--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com