-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Package: kernel-image-2.6.8-1-686-smp
Version: 2.6.8-3
Severity: Important

Hardware layout
- ---------------------
Dual Xeon + Latest BIOS
1 GB ram
2 x 3ware SATA raid controllers + Latest Firmware

All disks live on the 3ware 9xxx controllers.
The controllers provide 3 x 1.5TB raid-5 stripes, one of which holds /, swap and /var.
The rest of the free space I've built as a 4.5TB raid-0 stripe for the backup volume.

This is then carved into.....
- ----------------------------------------------------------------------------------------
backup-srv:~# df -k
Filesystem           1K-blocks       Used  Available Use% Mounted on
/dev/sda1              3937220    2285880    1451336  62% /
/dev/sda3              3937252    2497220    1240024  67% /var
/dev/md0            4677217408 3090281796 1586935612  67% /backups
- ----------------------------------------------------------------------------------------

The /dev/md0 device is ....
- ----------------------------------------------------------------------------------------
backup-srv:~# cat /proc/mdstat
md0 : active raid0 sdc1[2] sdb1[1] sda4[0]
      4677348544 blocks 64k chunks

unused devices: <none>
- ----------------------------------------------------------------------------------------

I had to use XFS as this was the only FS that would build that large.
ext3 seems to barf at anything over 2TB.

Problem Description
- -------------------------
This machine is the production backup server for all the *nix machines on the network.
cron runs rsync via ssh to grab the files from each client target.
The bulk of the systems are backed up weekly and a few daily.

The system seems to survive anywhere between a couple of days and no more than 2 weeks
under this sort of heavy IO & network loading before giving up the ghost.

dmesg dumps follow......

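As an illustration only, the cron job described above has roughly the
following shape. The client names, the source path and the rsync options are
assumptions made for the sake of the example, not the real configuration on
this server, and the script only prints the commands it would run.

```shell
#!/bin/sh
# Illustrative sketch only: client names, source path and rsync options
# are placeholders, not the real configuration of this server.
BACKUP_ROOT="/backups"          # the XFS volume on /dev/md0
CLIENTS="client-a client-b"     # hypothetical client names

# Print the rsync command that would pull one client's files over ssh.
backup_cmd() {
    printf 'rsync -a --delete -e ssh %s:/etc/ %s/%s/etc/\n' \
        "$1" "$BACKUP_ROOT" "$1"
}

# Dry run: show the commands rather than executing them.
for host in $CLIENTS; do
    backup_cmd "$host"
done
```

With many clients pulled in one after another, a job like this keeps the
disks and the network busy for long stretches, which is the load under which
the failure shows up.
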
This problem was also exhibited by 2.6.7 and 2.6.6.
I'm dropping back to 2.4.27 now & will let you know if pain persists.

- From Dmesg
- ----------------------------------------------------------------------------------------
Unable to handle kernel paging request at virtual address 20fda90c
 printing eip:
f8b26144
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: af_packet ipv6 piix hw_random uhci_hcd usbcore shpchp pciehp pci_hotplug floppy parport_pc parport pcspkr evdev e1000 xfs raid0 md dm_mod ide_cd ide_core cdrom rtc ext3 jbd mbcache sd_mod unix 3w_9xxx scsi_mod
CPU:    3
EIP:    0060:[<f8b26144>]    Not tainted
EFLAGS: 00010213   (2.6.8.20040927)
EIP is at xfs_ail_insert+0x24/0xd0 [xfs]
eax: 000003e7   ebx: 00000000   ecx: 000003e7   edx: 00000000
esi: 20fda904   edi: f7198c18   ebp: c2005168   esp: f7703dd4
ds: 007b   es: 007b   ss: 0068
Process xfslogd/3 (pid: 604, threadinfo=f7702000 task=f7cb87d0)
Stack: 0002050a 0000052a 549b2041 ed9cd202 c2005168 f7198c18 f7198c00 c0f1d30c
       f8b25e5d f7198c18 c2005168 00000000 c2005168 0002050a 0000052a 00000000
       c2005168 0002050a 0000052a f8b258bc f7198c00 c2005168 0002050a 0000052a
Call Trace:
 [<f8b25e5d>] xfs_trans_update_ail+0x5d/0xf0 [xfs]
 [<f8b258bc>] xfs_trans_chunk_committed+0x17c/0x240 [xfs]
 [<f8b2566a>] xfs_trans_committed+0x4a/0x120 [xfs]
 [<f8b17743>] xlog_state_do_callback+0x2c3/0x3d0 [xfs]
 [<f8b178d0>] xlog_state_done_syncing+0x80/0xc0 [xfs]
 [<f8b15fe5>] xlog_iodone+0x55/0xf0 [xfs]
 [<f8b359bd>] pagebuf_iodone_work+0x4d/0x50 [xfs]
 [<c0131a26>] worker_thread+0x1f6/0x2e0
 [<f8b35970>] pagebuf_iodone_work+0x0/0x50 [xfs]
 [<c011c4f0>] default_wake_function+0x0/0x20
 [<c011c4f0>] default_wake_function+0x0/0x20
 [<c0131830>] worker_thread+0x0/0x2e0
 [<c0135f8a>] kthread+0xba/0xc0
 [<c0135ed0>] kthread+0x0/0xc0
 [<c01042c5>] kernel_thread_helper+0x5/0x10
Code: 8b 46 08 8b 56 0c 89 44 24 08 89 54 24 0c 8b 55 0c 8b 45 08
 <6>note: xfslogd/3[604] exited with preempt_count 1
- ----------------------------------------------------------------------------------------

Machine locks up a little while after this & after a kick in the guts gives on next startup....
- ----------------------------------------------------------------------------------------
backup-srv:~# mount /backups/
Oct 4 12:47:03 ouprci05 kernel: Filesystem "md0": XFS internal error xlog_clear_stale_blocks(2) at line 1253 of file fs/xfs/xfs_log_recover.c. Caller 0xf8b28876
Oct 4 12:47:03 ouprci01 kernel: Filesystem "md0": XFS internal error xlog_clear_stale_blocks(2) at line 1253 of file fs/xfs/xfs_log_recover.c. Caller 0xf8b28876
mount: Unknown error 990
- ----------------------------------------------------------------------------------------

So I try.....
- ----------------------------------------------------------------------------------------
backup-srv:~# xfs_repair /dev/md0
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
- ----------------------------------------------------------------------------------------

So I ....
- ----------------------------------------------------------------------------------------
backup-srv:~# xfs_repair -L /dev/md0
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
LEAFN node level is 1 inode 2820138 bno = 8388608
entry contains offset out of order in shortform dir 19126020
corrected entry offsets in directory 19126020
        - agno = 1
        - agno = 2
LEAFN node level is 1 inode 2147942164 bno = 8388608
LEAFN node level is 1 inode 2148480815 bno = 8388608
....
And so on for a few hours, for the rest of the 4.5TB file system check to complete :(
....

________________________________

It is by caffeine alone I set my mind in motion,
It is by the beans of Java that thoughts acquire speed,
The hands acquire shaking, the shaking becomes a warning,
It is by caffeine alone I set my mind in motion.

(author unknown) with thanks and apologies to Frank Herbert
________________________________

Jan Eringa
Unix Admin
Orbian Management Ltd
________________________________

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFBYl6XX4LWCZ7JjaMRAtH0AJwPIxdCA6xO88hHtJa27qo7UBlG/QCgigGI
dhtLCXAxPd1W46KbnFMdMcY=
=nuOo
-----END PGP SIGNATURE-----