The plot is starting to thicken. I've noticed that all the systems that have done this (so far) have this flash card in them:

STEC M2+ CF 9.0.2 K1186-2

From talking to Checkpoint, this is a newer flash card they have started using.

I just had a 4th machine do the same thing yesterday. Basic install, about 70% disk space free, very new install (1-2 months old), and the uptime on the machine in question was only 16 days. After rebooting I ran dd if=/dev/zero of=~/file bs=1m count=350 a few times and didn't get any errors. The latest machine has the 1 gig version of the flash card listed above, so this ate almost all the free disk space.

Checkpoint is asking that we RMA one of the flash cards so they can play with it.

________________________________
From: "jfleming...@yahoo.com" <jfleming...@yahoo.com>
To: Jeremy Chadwick <free...@jdc.parodius.com>
Cc: "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Sent: Tuesday, February 14, 2012 7:57 PM
Subject: Re: 6.2-Release ..ish.. CF + ata == freeze?

2 of the 3 CF cards are very new, less than 6 months old. I think around 65-70 percent is in use. This number doesn't change unless the user dumps data in a home dir, which isn't the case so far.

You are correct that only writes are failing. Msgbuf has more than what I pasted, but I'm pretty sure it's just more of the same errors. I'll double-check.

The other slices are very small. One is 35 meg, the other is 100-some-odd meg. H is 1.2 gig.

I don't know if I'll be able to try the dd test, for a few reasons, but I'll check it out. Let me ask you this: say zeroing out the drive works without error. Does that tell me anything?

I also don't have access to SMART tools, as this is basically a closed system and the vendor would never give us access to a compiler. Granted, I haven't tried just throwing on gcc from 6.2. I could play with that, or maybe, since said vendor's dev team is keeping track of this thread, they could provide said binary :).

I really don't like the idea of replacing hardware, as I'm looking at around 200 boxes. I really hope it doesn't come to that.

Thanks for the reply!
Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Jeremy Chadwick <free...@jdc.parodius.com>
Date: Mon, 13 Feb 2012 21:18:28
To: john fleming<jfleming...@yahoo.com>
Cc: freebsd-stable@freebsd.org<freebsd-stable@freebsd.org>
Subject: Re: 6.2-Release ..ish.. CF + ata == freeze?

On Mon, Feb 13, 2012 at 08:43:08PM -0800, john fleming wrote:
> Just thought i would post over here as i'm not getting a warm fuzzy from
> checkpoint about being able to find the root cause of an issue. I have a
> large install base of IPSO checkpoint firewalls, which are based on FreeBSD
> 6.2. I've had 3 firewalls hang basically the same way, with something that
> looks like a filesystem issue or an issue with a CF card.

FreeBSD 6.2 was EOL'd in early-to-mid-2008. The ATA driver has changed significantly since then (present-day uses CAM).

> Does anyone happen to know of any bugs (i've been looking around) that could
> cause something like that? Granted, it could be a batch of bad CF cards, but
> its odd that i'm seeing the same thing on 3 different boxes and once rebooted
> they seem ok.
>
> Also is it possible to get useful info form the atacontroller when things go
> south like this from the ddb prompt?

Not particularly. What's shown below indicates that the driver had issued some form of ATA write command (there are multiple kinds per ATA specification), and either the underlying media (CF/disk) or controller stalled/locked up/took too long. I forget what the timeout value is in 6.2; I can't be bothered to remember such from 6 years ago. :-)

> This is what shows in show msgbuf
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5
> g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5
> g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
> g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5
> g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5

error 5 = EIO = Input/output error. But this isn't too big of a surprise given the timeouts you see prior.

Are these CF cards brand new -- meaning, are they completely unused (having never had any writes done to them), or have they been in use a while? I'm betting they've been in use a while, and have probably been doing many writes over the years.

Two things to note here:

1) The errors you've shown are only happening on writes, not reads. Of course if you omitted information then this isn't an accurate statement.

2) Timeouts are seen when issuing writes to some LBA regions.

How full is the CF card, disk-space-wise? Not just ad0s4h, I'm talking about the entire card. How much space is roughly available? They're very small CF cards (1.8GByte roughly), and the less space available, the less effectiveness of wear levelling (and in some cases the slower the writes are).

Reason I ask: given that these are CF cards, this smells of cards which are simply "worn down". CF cards have limited numbers of writes, and the card may be "freaking out" internally when attempting to write to some LBAs which map to CF sectors that are, in effect, "bad". The CF cards' ECC implementation may be buggy, or may simply be "spinning hard" for too long.
You can read about this sort of behaviour on Wikipedia's CompactFlash article.

You wouldn't be able to verify this with dd if=/dev/ad0, because those are read operations. You could zero the media (dd if=/dev/zero of=/dev/ad0) as a form of verification if you wanted.

Do you happen to know if these CF cards support SMART? If so, installing smartmontools (version 5.42 or newer please) and providing output from "smartctl -a /dev/ad0" may be helpful to me, but I make no guarantees anything of use will be shown there.

Overall my advice would be to replace the CF cards, especially if they have been in use for a long while. It really doesn't matter to me that it's happening on 3 machines (honest), especially if these are 6.2 machines with CF cards that have been in use for years. We're lucky to get 2 years out of our CF cards on our Juniper M120/320s before they start spitting I/O errors. Pick larger CF cards as well; more space = more room for effective wear levelling.

>
> ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4
> atapci0: <Intel 6300ESB UDMA100 controller> port
> 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff
> at device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata1: <ATA channel 1> on atapci0
> atapci1: <Intel 6300ESB SATA150 controller> port
> 0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15
> at device 31.2 on pci0
> ata2: <ATA channel 0> on atapci1
> ata3: <ATA channel 1> on atapci1
> ad0s4h is basically a r/w ufs partition on
> the box where almost anything that needs to be written goes.
> trace
> Tracing pid 1101 tid 100043 td 0x656d8460
> kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b
> siointr1(64ba1400) at siointr1+0xf0
> siointr(64ba1400) at siointr+0x38
> intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at
> intr_execute_handler+0x61
> intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at
> intr_execute_handlers+0x40
> atpic_handle_intr(4) at atpic_handle_intr+0x96
> Xatpic_intr4() at Xatpic_intr4+0x20
> --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c ---
> lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f
> getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2
> flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30
> flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf
> softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61
> ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2
> ffs_fsync(f0a4ac74) at ffs_fsync+0x12
> VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38
> fsync(656d8460,f0a4acb4) at fsync+0x170
> syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee
> Xint0x80_syscall() at Xint0x80_syscall+0x1f

-- 
| Jeremy Chadwick                                 j...@parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
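
A note on the g_vfs_done() lines quoted in this thread: since the question of "which LBA regions are failing" comes up, the offsets can be checked for contiguity with a few lines of script. What follows is only a minimal sketch in Python, meant to be run on some other machine with the msgbuf output pasted in, not on the IPSO box itself. The regular expression, the helper names parse_failures and merge_ranges, and the 512-byte sector size are assumptions made for illustration; the offsets are assumed to be relative to the start of ad0s4h rather than to the whole card.

```python
import re

# Lines copied from the "show msgbuf" output quoted in this thread.
MSGBUF = """\
g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5
g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5
g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5
g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5
"""

SECTOR_SIZE = 512  # assumption: 512-byte sectors

# Hypothetical pattern written to match the lines above; not taken from any tool.
PATTERN = re.compile(
    r"g_vfs_done\(\):(?P<dev>\w+)\[(?P<op>READ|WRITE)"
    r"\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]\s*error = (?P<errno>\d+)"
)


def parse_failures(text):
    """Return (device, op, offset, length, errno) for each g_vfs_done line."""
    return [
        (m["dev"], m["op"], int(m["offset"]), int(m["length"]), int(m["errno"]))
        for m in PATTERN.finditer(text)
    ]


def merge_ranges(failures):
    """Merge touching or overlapping byte ranges into [start, end) per device."""
    ranges = {}
    for dev, _op, offset, length, _errno in sorted(failures, key=lambda f: (f[0], f[2])):
        spans = ranges.setdefault(dev, [])
        if spans and offset <= spans[-1][1]:
            # This failure starts where the previous span ends (or inside it).
            spans[-1][1] = max(spans[-1][1], offset + length)
        else:
            spans.append([offset, offset + length])
    return ranges


if __name__ == "__main__":
    for dev, spans in merge_ranges(parse_failures(MSGBUF)).items():
        for start, end in spans:
            first = start // SECTOR_SIZE
            last = (end - 1) // SECTOR_SIZE
            print(f"{dev}: bytes {start}-{end - 1}, "
                  f"sectors {first}-{last} within {dev}, {end - start} bytes total")
```

On the five lines pasted here, the failures merge into a single 640 KiB range starting roughly 32 MiB into ad0s4h, i.e. one run of back-to-back 128 KiB writes. That is consistent with a single stalled burst of sequential writes, but on its own it does not distinguish a worn region of the card from a controller that stopped accepting commands; a longer msgbuf capture would be needed to see whether the same region fails repeatedly.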