Re: Constant rebooting after power loss
2011/4/2 Matthew Dillon :
> The core of the issue here comes down to two things:
>
> First, a power loss to the drive will cause the drive's dirty write cache
> to be lost; that data will not make it to disk. Nor do you really want
> to turn off write caching on the physical drive. Well, you CAN turn it
> off, but if you do, performance becomes so bad that there's no point.
> So turning off write caching is really a non-starter.
>
> The solution to this first item is for the OS/filesystem to issue a
> disk flush command to the drive at appropriate times. If I recall, the
> ZFS implementation in FreeBSD *DOES* do this for transaction groups,
> which guarantees that a prior transaction group is fully synced before
> a new one starts running (HAMMER in DragonFly also does this).
> (Just getting an 'ack' for the write transaction over the SATA bus only
> means the data made it to the drive's cache, not that it made it to
> the platter.)

Amen!

> I'm not sure about UFS vis-à-vis the recent UFS logging features...
> it might be an option but I don't know if it is a default. Perhaps
> someone can comment on that.
>
> One last note here. Many modern drives have very large RAM caches.
> OCZ's SSDs have something like 256MB write caches, and many modern HDs
> now come with 32MB and 64MB caches. Aged drives with lots of relocated
> sectors and bit errors can also take a very long time to perform writes
> on certain sectors. So these large caches take time to drain, and one
> can't really assume that an acknowledged write to disk will actually
> make it to the disk under adverse circumstances any more. All sorts
> of bad things can happen.
>
> Finally, the drives don't order their writes to the platter (you can
> set a bit to tell them to, but like many similar bits in the past, there
> is no real guarantee that the drives will honor it). So if two
> transactions do not have a disk flush command in between them, it is
> possible for data from the second transaction to commit to the platter
> before all the data from the first transaction commits to the platter.
> Or worse, for the non-transactional data to update out of order relative
> to the transactional data which was supposed to commit first.
>
> Hence, IMHO, the OS/filesystem must use the disk flush command in such
> situations for good reliability.
>
> --
>
> The second problem is that a physical loss of power to the drive can
> cause the drive to physically lose one or more sectors, and can even
> effectively destroy the drive (even with the fancy auto-park)... if the
> drive happens to be in the middle of a track write-back when power is
> lost, it is possible to lose far more than a single sector, including
> sectors unrelated to recent filesystem operations.
>
> The only solution to #2 is to make sure your machines (or at least the
> drives, if they happen to be in external enclosures) are connected to
> a UPS, and that the machines are communicating with the UPS via
> something like the "apcupsd" port. AND also that you test to make
> sure the machines properly shut themselves down when AC is lost, before
> the UPS itself runs out of battery time. After all, a UPS won't help
> if the machines don't at least idle their drives before power is lost!!!
>
> I learned this lesson the hard way about 3 years ago. I had something
> like a dozen drives in two RAID arrays doing heavy write activity, lost
> physical power, and several of the drives were totally destroyed, with
> thousands of sector errors. Not just one or two... thousands.
>
> (It is unclear how SSDs react to physical loss of power during heavy
> write activity. Theoretically, while they will certainly lose their
> write cache, they shouldn't wind up with any read errors.)
>
> -Matt

-- 
Olivier Smedts                                                 _
                                         ASCII ribbon campaign ( )
e-mail: oliv...@gid0.org             - against HTML email & vCards  X
www: http://www.gid0.org     - against proprietary attachments / \
"There are only 10 kinds of people in the world: those who understand
binary, and those who don't."

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: Any success stories for HAST + ZFS?
On Thu, Mar 24, 2011 at 01:36:32PM -0700, Freddie Cash wrote:
> [Not sure which list is most appropriate since it's using HAST + ZFS
> on -RELEASE, -STABLE, and -CURRENT. Feel free to trim the CC: on
> replies.]
>
> I'm having a hell of a time making this work on real hardware, and am
> not ruling out hardware issues as yet, but wanted to get some
> reassurance that someone out there is using this combination (FreeBSD
> + HAST + ZFS) successfully, without kernel panics, without core dumps,
> without deadlocks, without issues, etc. I need to know I'm not
> chasing a dead rabbit.

I just committed a fix for a problem that might look like a deadlock.
With trociny@'s patch and my last fix (to GEOM GATE and hastd), do you
still have any issues?

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://yomoli.com
[releng_7 tinderbox] failure on sparc64/sparc64
TB --- 2011-04-02 08:01:56 - tinderbox 2.6 running on freebsd-legacy.sentex.ca
TB --- 2011-04-02 08:01:56 - starting RELENG_7 tinderbox run for sparc64/sparc64
TB --- 2011-04-02 08:01:56 - cleaning the object tree
TB --- 2011-04-02 08:02:09 - cvsupping the source tree
TB --- 2011-04-02 08:02:09 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca -s /usr/home/tinderbox/RELENG_7/sparc64/sparc64/supfile
TB --- 2011-04-02 08:02:15 - building world
TB --- 2011-04-02 08:02:15 - MAKEOBJDIRPREFIX=/obj
TB --- 2011-04-02 08:02:15 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2011-04-02 08:02:15 - TARGET=sparc64
TB --- 2011-04-02 08:02:15 - TARGET_ARCH=sparc64
TB --- 2011-04-02 08:02:15 - TZ=UTC
TB --- 2011-04-02 08:02:15 - __MAKE_CONF=/dev/null
TB --- 2011-04-02 08:02:15 - cd /src
TB --- 2011-04-02 08:02:15 - /usr/bin/make -B buildworld
>>> World build started on Sat Apr  2 08:02:16 UTC 2011
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> World build completed on Sat Apr  2 09:14:14 UTC 2011
TB --- 2011-04-02 09:14:14 - generating LINT kernel config
TB --- 2011-04-02 09:14:14 - cd /src/sys/sparc64/conf
TB --- 2011-04-02 09:14:14 - /usr/bin/make -B LINT
TB --- 2011-04-02 09:14:14 - building LINT kernel
TB --- 2011-04-02 09:14:14 - MAKEOBJDIRPREFIX=/obj
TB --- 2011-04-02 09:14:14 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2011-04-02 09:14:14 - TARGET=sparc64
TB --- 2011-04-02 09:14:14 - TARGET_ARCH=sparc64
TB --- 2011-04-02 09:14:14 - TZ=UTC
TB --- 2011-04-02 09:14:14 - __MAKE_CONF=/dev/null
TB --- 2011-04-02 09:14:14 - cd /src
TB --- 2011-04-02 09:14:14 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sat Apr  2 09:14:14 UTC 2011
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
>>> stage 3.2: building everything
[...]
cc -c -O2 -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common -finline-limit=15000 --param inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin -mcmodel=medany -msoft-float -ffreestanding -Werror /src/sys/kern/kern_intr.c
cc -c -O2 -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common -finline-limit=15000 --param inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin -mcmodel=medany -msoft-float -ffreestanding -Werror /src/sys/kern/kern_jail.c
cc -c -O2 -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common -finline-limit=15000 --param inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin -mcmodel=medany -msoft-float -ffreestanding -Werror /src/sys/kern/kern_kse.c
cc -c -O2 -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common -finline-limit=15000 --param inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin -mcmodel=medany -msoft-float -ffreestanding -Werror /src/sys/kern/kern_kthread.c
cc -c -O2 -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common -finline-limit=15000 --param inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin -mcmodel=medany -msoft-float -ffreestanding -Werror /src/sys/kern/kern_ktr.c
cc -c -O2 -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototype
ahci.ko in RELENG_8_2, what about atacontrol cap?
Hi, all,

On my system using the "old" ATA driver, I can use a command like this
to get useful information about my disk drives:

nas-pmh# atacontrol cap ad4

Protocol              SATA revision 2.x
device model          ST32000542AS
serial number         5XW251QF
firmware revision     CC34
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       3907029168 sectors
dma supported
overlap not supported

Feature                        Support  Enable  Value        Vendor
write cache                    yes      yes
read ahead                     yes      yes
Native Command Queuing (NCQ)   yes       -      31/0x1F
Tagged Command Queuing (TCQ)   no       no      31/0x1F
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      yes     49344/0xC0C0
automatic acoustic management  yes      yes     254/0xFE     254/0xFE

When I switch to the new AHCI driver, the drives are connected to
the CAM subsystem:

nas-pmh# camcontrol devlist
at scbus0 target 0 lun 0 (ada0,pass0)
at scbus1 target 0 lun 0 (ada1,pass1)
at scbus2 target 0 lun 0 (ada2,pass2)
at scbus3 target 0 lun 0 (ada3,pass3)
at scbus4 target 0 lun 0 (da0,pass4)

But:

nas-pmh# camcontrol inquiry ada0
nas-pmh# camcontrol readcap ada0
(pass0:ahcich0:0:0:0): READ CAPACITY(10). CDB: 25 0 0 0 0 0 0 0 0 0
(pass0:ahcich0:0:0:0): CAM status: CCB request was invalid

Obvious question: is there a way to get the same information (NCQ
support, write cache status, ...) with the new driver?

Thanks,
Patrick
-- 
punkt.de GmbH * Kaiserallee 13a * 76133 Karlsruhe
Tel. 0721 9109 0 * Fax 0721 9109 100
i...@punkt.de    http://www.punkt.de
Gf: Jürgen Egeling      AG Mannheim 108285
Re: ahci.ko in RELENG_8_2, what about atacontrol cap?
On Sat, Apr 02, 2011 at 11:34:32AM +0200, Patrick M. Hausen wrote:
> Hi, all,
>
> On my system using the "old" ATA driver, I can use a command like this
> to get useful information about my disk drives:
>
> nas-pmh# atacontrol cap ad4
>
> Protocol              SATA revision 2.x
> device model          ST32000542AS
> serial number         5XW251QF
> firmware revision     CC34
> cylinders             16383
> heads                 16
> sectors/track         63
> lba supported         268435455 sectors
> lba48 supported       3907029168 sectors
> dma supported
> overlap not supported
>
> Feature                        Support  Enable  Value        Vendor
> write cache                    yes      yes
> read ahead                     yes      yes
> Native Command Queuing (NCQ)   yes       -      31/0x1F
> Tagged Command Queuing (TCQ)   no       no      31/0x1F
> SMART                          yes      yes
> microcode download             yes      yes
> security                       yes      no
> power management               yes      yes
> advanced power management      yes      yes     49344/0xC0C0
> automatic acoustic management  yes      yes     254/0xFE     254/0xFE
>
> When I switch to the new AHCI driver, the drives are connected to
> the CAM subsystem:
>
> nas-pmh# camcontrol devlist
> at scbus0 target 0 lun 0 (ada0,pass0)
> at scbus1 target 0 lun 0 (ada1,pass1)
> at scbus2 target 0 lun 0 (ada2,pass2)
> at scbus3 target 0 lun 0 (ada3,pass3)
> at scbus4 target 0 lun 0 (da0,pass4)
>
> But:
>
> nas-pmh# camcontrol inquiry ada0
> nas-pmh# camcontrol readcap ada0
> (pass0:ahcich0:0:0:0): READ CAPACITY(10). CDB: 25 0 0 0 0 0 0 0 0 0
> (pass0:ahcich0:0:0:0): CAM status: CCB request was invalid
>
> Obvious question: is there a way to get the same information (NCQ
> support, write cache status, ...) with the new driver?

You want "camcontrol identify adaX". DO NOT confuse this with
"camcontrol inquiry adaX" (this won't work).

identify = for ATA
inquiry  = for SCSI

See the camcontrol(8) man page for specifics.

-- 
| Jeremy Chadwick                                   j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
Re: Backup tool for ZFS with all "classic dump(8)" features -- what should I use? (or is there any way to make dump -L work well on large FFS2+SU?)
Looked at rsync or tarsnap?

On Mon, 28 Mar 2011 09:20:07 +0200, Lev Serebryakov wrote:
> Hello, Freebsd-stable.
>
> Now I'm backing up my HOME filesystem with dump(8). It works perfectly
> for an 80GiB FS, with many features: a snapshot for consistency, levels,
> the "nodump" flag (my users use it a lot!), the ability to extract a
> single removed file from a backup without restoring the full FS, a
> simple script wrapper for the level schedule, etc.
>
> On the new server I have a huge HOME (500GiB). And even though it is
> filled with only 25GiB of data, creating a snapshot takes about 10
> minutes, freezes all I/O, and sometimes FAILS (!!!).
>
> I'm thinking of transferring the HOME filesystem to ZFS. But I cannot
> find appropriate tools for backing it up. Here are some requirements:
>
> (1) One-file (one-stream) backup, not a directory mirror. I need to
>     store it on an FTP server and upload it with a single command.
>
> (2) Levels & incremental backups. Now I have a "Monthly (0) - Weekly
>     (1,2,3) - Daily (4,5,6,7,8,9)" scheme. I could accept other
>     schemes, as long as they don't store a full backup every day and
>     don't need a full backup more often than weekly.
>
> (3) Minimum of local metadata. Storing previous backups locally to
>     calculate the next one is not an appropriate solution. "zfs send"
>     needs previous snapshots for an incremental backup, for example.
>
> (4) Works from a snapshot (I think this is trivial in the case of ZFS).
>
> (5) Backup exclusions should be controlled by the users themselves
>     (not the super-user), like the "nodump" flag in the case of
>     FFS/dump(8). "zfs send" cannot provide this. I have very
>     responsible users, so a full backup now takes only up to 10GiB
>     when the whole HOME FS is about 25GiB, which is a big help when
>     the backup is sent over the Internet to another host.
>
> (6) Stores ALL FS-specific information -- ACLs, etc.
>
> (7) Free :)
>
> Is there something like this for ZFS? "zfs send" looks promising,
> EXCEPT for item (5) and maybe (3) :( GNU tar looks like everything
> but (6) :(
Re: geli(4) memory leak
On Sat, Apr 02, 2011 at 12:04:09AM +0300, Mikolaj Golub wrote:
> For me your patch looks correct. But the same issue exists for read :-).
> Also, to avoid the leak I think we can just do g_destroy_bio() before
> the "all sectors" check. See the attached patch (had some testing).

The patch looks good. Please commit.

> Index: sys/geom/eli/g_eli.c
> ===
> --- sys/geom/eli/g_eli.c	(revision 220168)
> +++ sys/geom/eli/g_eli.c	(working copy)
> @@ -160,13 +160,13 @@ g_eli_read_done(struct bio *bp)
>  	pbp = bp->bio_parent;
>  	if (pbp->bio_error == 0)
>  		pbp->bio_error = bp->bio_error;
> +	g_destroy_bio(bp);
>  	/*
>  	 * Do we have all sectors already?
>  	 */
>  	pbp->bio_inbed++;
>  	if (pbp->bio_inbed < pbp->bio_children)
>  		return;
> -	g_destroy_bio(bp);
>  	sc = pbp->bio_to->geom->softc;
>  	if (pbp->bio_error != 0) {
>  		G_ELI_LOGREQ(0, pbp, "%s() failed", __func__);
> @@ -202,6 +202,7 @@ g_eli_write_done(struct bio *bp)
>  		if (bp->bio_error != 0)
>  			pbp->bio_error = bp->bio_error;
>  	}
> +	g_destroy_bio(bp);
>  	/*
>  	 * Do we have all sectors already?
>  	 */
> @@ -215,7 +216,6 @@
>  		    pbp->bio_error);
>  		pbp->bio_completed = 0;
>  	}
> -	g_destroy_bio(bp);
>  	/*
>  	 * Write is finished, send it up.
>  	 */

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://yomoli.com
Kernel memory leak in 8.2-PRERELEASE?
Ahoy. This morning, I awoke to the following on one of my servers:

pid 59630 (httpd), uid 80, was killed: out of swap space
pid 59341 (find), uid 0, was killed: out of swap space
pid 23134 (irssi), uid 1001, was killed: out of swap space
pid 49332 (sshd), uid 1001, was killed: out of swap space
pid 69074 (httpd), uid 0, was killed: out of swap space
pid 11879 (eggdrop-1.6.19), uid 1001, was killed: out of swap space
...

And so on.

The machine is:

FreeBSD exodus.poly.edu 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #2: Thu Dec  2 11:39:21 EST 2010 sp...@exodus.poly.edu:/usr/obj/usr/src/sys/EXODUS amd64

10:13AM up 120 days, 20:06, 2 users, load averages: 0.00, 0.01, 0.00

The memory line from top intrigued me:

Mem: 16M Active, 48M Inact, 6996M Wired, 229M Cache, 828M Buf, 605M Free

The machine has 8 gigs of memory, and I don't know what all that wired
memory is being used for. There is a large-ish (6 x 1.5-TB) ZFS RAID-Z2
on it which has had a disk in the UNAVAIL state for a few months:

# zpool status
  pool: home
 state: DEGRADED
status: One or more devices could not be used because the label is
        missing or invalid. Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        home        DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    UNAVAIL      0  8511     experienced I/O failures

errors: No known data errors

"vmstat -m" and "vmstat -z" output:

http://acm.poly.edu/~spawk/vmstat-m.txt
http://acm.poly.edu/~spawk/vmstat-z.txt

Anyone have a clue? I know it's just going to happen again if I reboot
the machine. It is still up in case there are diagnostics for me to run.

-Boris
Re: Kernel memory leak in 8.2-PRERELEASE?
On Sat, Apr 02, 2011 at 10:17:27AM -0400, Boris Kochergin wrote:
> Ahoy. This morning, I awoke to the following on one of my servers:
>
> pid 59630 (httpd), uid 80, was killed: out of swap space
> [...]
>
> The memory line from top intrigued me:
>
> Mem: 16M Active, 48M Inact, 6996M Wired, 229M Cache, 828M Buf, 605M Free
>
> The machine has 8 gigs of memory, and I don't know what all that
> wired memory is being used for. There is a large-ish (6 x 1.5-TB)
> ZFS RAID-Z2 on it which has had a disk in the UNAVAIL state for a
> few months:

The ZFS ARC is what's responsible for your large wired count.

How much swap space do you have? You excluded that line from top.
"swapinfo" would also be helpful, but would indicate the same thing.
If you lack swap (which is a bad idea for a lot of reasons), then the
machine running out of available memory for userspace (a process which
grew too large, thus impacting others which were trying to malloc() at
the time) would make sense.

Can you please provide /boot/loader.conf and /etc/sysctl.conf?

> # zpool status
> [...]
>             ada5    UNAVAIL      0  8511     experienced I/O failures
>
> errors: No known data errors

I would also recommend fixing ada5; I'm not sure why any SA would let a
bad disk sit in a machine for "a few months". Though, hopefully, this
doesn't cause extra memory usage or something odd behind the scenes (in
the kernel). I'm going to assume the two things are completely
unrelated.

> "vmstat -m" and "vmstat -z" output:
>
> http://acm.poly.edu/~spawk/vmstat-m.txt
> http://acm.poly.edu/~spawk/vmstat-z.txt
>
> Anyone have a clue? I know it's just going to happen again if I
> reboot the machine. It is still up in case there are diagnostics for
> me to run.

The above vmstat data won't be too helpful, since you need to see what's
going on "over time" and not just what the values are right now. There
may be one value in them that indicates available userspace vs.
available kmem. Basically what you need is the equivalent of Solaris
sar(1), so that you can see memory usage of processes/etc. over time and
find out if something went crazy and started going malloc-crazy. If the
kernel itself had run out, you'd be seeing a panic.

Sorry if these ideas/comments seem like a ramble; I've been up all night
trying to decode a circa-1992 font routine in 65816 assembly, heh. :-)

-- 
| Jeremy Chadwick                                   j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
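For readers hitting the same symptom: on 8.x the ARC grows until memory pressure pushes back, and the usual way to bound the wired memory it contributes is the `vfs.zfs.arc_max` loader tunable. A sketch of the relevant /boot/loader.conf line — the value shown is purely illustrative and should be sized for the machine's workload:

```
# /boot/loader.conf -- cap the ZFS ARC (example value, tune to taste)
vfs.zfs.arc_max="4G"
```

This caps only the ARC; it is not a fix for an actual kernel leak, which is the distinction being probed in this thread.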
Re: Kernel memory leak in 8.2-PRERELEASE?
On Sat, Apr 02, 2011 at 10:17:27AM -0400, Boris Kochergin wrote:
> Ahoy. This morning, I awoke to the following on one of my servers:
>
> pid 59630 (httpd), uid 80, was killed: out of swap space
> [...]
>
> Anyone have a clue? I know it's just going to happen again if I reboot
> the machine. It is still up in case there are diagnostics for me to run.

Try r218795. Most likely, your issue is not a leak.
Re: Kernel memory leak in 8.2-PRERELEASE?
On 04/02/11 11:33, Kostik Belousov wrote:
> On Sat, Apr 02, 2011 at 10:17:27AM -0400, Boris Kochergin wrote:
>> Ahoy. This morning, I awoke to the following on one of my servers:
>>
>> pid 59630 (httpd), uid 80, was killed: out of swap space
>> [...]
>>
>> Anyone have a clue? I know it's just going to happen again if I reboot
>> the machine. It is still up in case there are diagnostics for me to run.
>
> Try r218795. Most likely, your issue is not a leak.

Thanks. Will update to today's 8-STABLE and report back.

-Boris
Re: Constant rebooting after power loss
On Apr 1, 2011, at 23:35, Matthew Dillon wrote:

> The solution to this first item is for the OS/filesystem to issue a
> disk flush command to the drive at appropriate times. If I recall, the
> ZFS implementation in FreeBSD *DOES* do this for transaction groups,
> which guarantees that a prior transaction group is fully synced before
> a new one starts running (HAMMER in DragonFly also does this).
> (Just getting an 'ack' for the write transaction over the SATA bus only
> means the data made it to the drive's cache, not that it made it to
> the platter.)

It should also be noted that some drives ignore or lie about these flush
commands: i.e., they say they flushed the buffers but did not in fact do
so. This is sometimes done on cheap SATA drives, but also on expensive
SANs. In the former's case, it's often to help with benchmark numbers.
In the latter's case, it's usually okay because the buffers are actually
NVRAM, and so are safe across power cycles. There are also some
USB-to-SATA chipsets that don't handle flush commands and simply ACK
them without passing them to the drive, so yanking a drive can cause
problems.

There has been quite a bit of discussion on the zfs-discuss list on this
topic over the years, especially when it comes to (consumer) SSDs.
Re: Constant rebooting after power loss
On Sat, Apr 02, 2011 at 12:55:15PM -0400, David Magda wrote:
> On Apr 1, 2011, at 23:35, Matthew Dillon wrote:
>
>> The solution to this first item is for the OS/filesystem to issue a
>> disk flush command to the drive at appropriate times. [...]
>
> It should also be noted that some drives ignore or lie about these
> flush commands: i.e., they say they flushed the buffers but did not in
> fact do so. This is sometimes done on cheap SATA drives, but also on
> expensive SANs. [...] There are also some USB-to-SATA chipsets that
> don't handle flush commands and simply ACK them without passing them
> to the drive, so yanking a drive can cause problems.

SANs are *theoretically* safer because of their battery-backed caches;
however, it's not guaranteed - I've seen an array controller crash and
royally screw the data sets as a result, even when the cache was
allegedly mirrored to the redundant controller in the array.

NVRAM/battery-backed cache protects against certain failures but
introduces other failures in their place. You have to do your own
risk/benefit analysis before deciding which is the best solution for
your usage scenario.

As long as the data is "in transit" to permanent storage, it's at risk.
All the disk redundancy and battery-backed caches in the world are no
replacement for a comprehensive *and regularly tested* backup strategy.

Regards,
Gary
Re: Constant rebooting after power loss
:It should also be noted that some drives ignore or lie about these flush commands: i.e., they say they flushed the buffers but did not in fact do so. This is sometimes done on cheap SATA drives, but also on expensive SANS. If the former's case it's often to help with benchmark numbers. In the latter's case, it's usually okay because the buffers are actually NVRAM, and so are safe across power cycles. There are also some USB-to-SATA chipsets that don't handle flush commands and simply ACK them without passing them to the drive, so yanking a drive can cause problems. : :There has been quite a bit of discussion on the zfs-discuss list on this topic of the years, especially when it comes to (consumer) SSDs. Remember also that numerous ZFS studies have been debunked in recent years, though I agree with the idea that going that extra mile requires not trusting anything. In many respects ZFS's biggest enemy now is bugs in ZFS itself (or the OS it runs under), and not so much glitches in the underlying storage framework. I am unaware of *ANY* mainstream hard drive or SSD made in the last 10 years which ignores the disk flush command. In previous decades HD vendors played games with caching all the time but there are fewer HD vendors now and they all compete heavily with each other... they don't play those games any more for fear of losing their reputation. There is very little vendor loyalty in the hard drive business. When it comes to SSDs there are all sorts of fringe vendors, and I certainly would not trust any of those, but if you stick to well known vendors like Intel or OCZ it will work. Look for who's chipsets are under the hood more than for whos name is slapped onto the SSD and get as close to the source as you can. Most current-day disk flush command issues are at a higher level. For example, numerous VMs ignore the command (don't even bother to fsync() the underlying block devices or files!). 
There isn't anything you can do about a VM other than complain about it to the vendor. I've been hit by this precise issue running HAMMER inside a VM on a Windows box. If the VM blue-screens the Windows box (which happens quite often), the data on-disk can wind up corrupted beyond all measure. People who use VMs with direct-attached filesystems basically rely on the host computer never crashing and should really have no expectation of storage reliability short of running the VM inside an IBM mainframe. That is the unfortunate truth.

With USB the primary culprits are virtually *all* USB/Firewire/SATA bridges, as you noted, because I think there are only like 2 or 3 actual manufacturers and they are all broken. The USB standard itself shares the blame for this. It is a really horrible standard. USB sticks are the ones that typically either lock up or return success but don't actually flush their (fortunately limited) caches. Nobody in their right mind uses USB to attach a disk when reliability is important. It's fine to have it... I have lots of USB sticks and a few USB-attached HDs lying around, but I have *ZERO* expectation of reliability from them and neither should anyone else. SD cards are in the same category as USB: useful but untrustworthy.

Other fringe consumer crap, like fake-RAID (BIOS-based RAID), is equally unreliable when it comes to dealing with outright crashes. Always fun to have drives which can't be moved to other machines if a mobo dies! Not!

With network-attached drives the standard itself is broken. It tries to define command completion as the data being on-media, which is stupid when no other direct-attached standard requires that. Stupidity in standards is a primary factor in vendors ignoring portions of standards.
In the case of network-attached drives implemented with direct-attached drives on machines with software drivers to bridge to the network, it comes down to whether the software deals with the flush command properly, because it sure as hell isn't going to sync each write command all the way to the media!

But frankly, none of these issues should stop anyone from using the command, or push them into rationalizing it away. Not that I am blaming anyone for trying to rationalize it away; I am simply pointing out that in a market as large as the generic 'storage' market, there are always going to be tons of broken stuff out there to avoid. It's buyer beware.

What we care about here, in this discussion, is direct-attached SATA/eSATA/SAS, port multipliers and other external enclosure bridges, high-end SCSI phys and, NVRAM aside (which is arguable), real RAID hardware. And well-known vendors (fringe SSDs do not count). That covers 90% of the market and 99% of the cases where protocol reliability is required.
Re: Constant rebooting after power loss
> I am unaware of *ANY* mainstream hard drive or SSD made in the
> last 10 years which ignores the disk flush command. In previous
> decades HD vendors played games with caching all the time but there
> are fewer HD vendors now and they all compete heavily with each
> other... they don't play those games any more for fear of losing
> their reputation. There is very little vendor loyalty in the hard
> drive business.
>
> When it comes to SSDs there are all sorts of fringe vendors, and I
> certainly would not trust any of those, but if you stick to
> well known vendors like Intel or OCZ it will work. Look for whose
> chipset is under the hood more than for whose name is slapped
> onto the SSD and get as close to the source as you can.

As far as my knowledge goes, all mainstream non-enterprise SSDs do not obey the cache flush command at all. The question of a recommended SSD for ZIL use regularly comes up on the zfs-discuss mailing list, and the general consensus seems to be that no non-enterprise SSD can really be recommended because of this issue. The only publicly available SSDs which do not exhibit it are those with a Sandforce enterprise controller (SF-1500, if my memory serves me correctly) and a capacitor (OCZ sells such models), and the Intel X25-E with its write cache turned off (also resulting in horrible write performance all around).

I've had the chance to verify this with a Corsair Force SSD with the SF-1200 controller. It consistently "lost" about 1.2 to 1.5 MB of data which it claimed to have committed to disk.

Regards,
Florian
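A measurement like Florian's can be reproduced with a diskchecker-style harness: a writer appends checksummed, sequence-numbered records and fsyncs each one, power is pulled mid-run, and a verifier afterwards reports the highest intact record. Any record the drive acknowledged (fsync returned) but that is missing afterwards was lost from its cache. Below is a hedged sketch of both halves; the record format and file name are my own invention, and a real test needs a second machine logging which sequence numbers were acknowledged before the power pull:

```python
import os
import struct
import zlib

# Each record: 8-byte sequence number + CRC32 of that number (12 bytes).
REC = struct.Struct("<QI")

def write_records(path, count):
    """Append `count` records, fsyncing after each one.

    After a power pull, every sequence number for which fsync()
    returned should still be recoverable from the file.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for seq in range(count):
            crc = zlib.crc32(struct.pack("<Q", seq))
            os.write(fd, REC.pack(seq, crc))
            os.fsync(fd)  # drive has acknowledged durability of seq
    finally:
        os.close(fd)

def verify(path):
    """Return the highest contiguous sequence number recovered.

    If this is lower than the last sequence acknowledged before
    power loss, the drive dropped data it claimed was on media.
    """
    last = -1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(REC.size)
            if len(chunk) < REC.size:
                break  # EOF or torn record
            seq, crc = REC.unpack(chunk)
            if crc != zlib.crc32(struct.pack("<Q", seq)) or seq != last + 1:
                break  # corruption or a gap: stop at last good record
            last = seq
    return last
```

Run the writer against the device under test, cut power mid-run, then run the verifier and compare against the acknowledgements logged elsewhere; a 1.2-1.5 MB shortfall corresponds to roughly 100,000 of these small records.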
Re: Backup tool for ZFS with all "classic dump(8)" features -- what should I use? (or is there any way to make dump -L work well on large FFS2+SU?)
On Mon, 28 Mar 2011 03:20:07 -0400, Lev Serebryakov wrote:

  I'm thinking of transferring my home filesystem to ZFS, but I can not find appropriate tools for backing it up. Here are some requirements:

Have you considered a full-up backup solution, like bacula? It's a client/server-model backup system: there's a server process that coordinates all actions (the 'director'), various server processes that run on machines with the devices/mounts/disks for storing the backups ('storage daemons'), and then each client runs a little process to give the backup servers access ('file daemons').

It allows you to specify a large amount of behavior. You can store backups to disk/file and to tape. If using disks/files, you can back up to the same file always, back up to files with a 1 GB max, or back up to a new file each time, iirc. It has support for arbitrary schedules, with each schedule able to specify the dump level (full, incremental, differential). It uses a database in the director for metadata. And, iirc, it honors the nodump flag, stores ACLs, etc.

Most importantly, it has support for pre- and post-backup hooks, so you can tell it to snapshot beforehand and then (probably, see below) use the post-hook to push the data where you want.

Reading about your requirement #1, I'm guessing that the backup data is being collected locally and then sent over ftp for permanent storage. Do you have control over this remote machine? Could you replace ftp with bacula's networked client/server model? This might be the one spot that would be hard to make bacula work for you; I'm not sure, since I haven't played with bacula in this configuration and I'm not exactly sure what your restrictions are. Even then, you could probably mount the FTP server as a 'file system' a la sshfs and have the storage daemon write directly to the mounted file system.

And yeah, it's free. http://www.bacula.org

If you want to give it a shot, you can set it up on a little test machine and have it back up to itself.
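The pre-/post-backup hooks mentioned above (bacula's ClientRunBeforeJob and ClientRunAfterJob job directives) can invoke a small script that wraps the job in a ZFS snapshot, so the file daemon reads a frozen view instead of a live filesystem. A rough sketch of such a script, assuming a dataset name and a snapshot-naming scheme I've made up for illustration:

```python
import datetime
import subprocess

def snapshot_name(dataset, when=None):
    """Build a timestamped snapshot name, e.g. tank/home@bacula-20110328-0320."""
    when = when or datetime.datetime.now()
    return f"{dataset}@bacula-{when:%Y%m%d-%H%M}"

def pre_backup(dataset):
    """ClientRunBeforeJob hook: snapshot the dataset before the job reads it."""
    name = snapshot_name(dataset)
    subprocess.run(["zfs", "snapshot", name], check=True)
    return name

def post_backup(name):
    """ClientRunAfterJob hook: drop the snapshot once the job is done."""
    subprocess.run(["zfs", "destroy", name], check=True)
```

The backup job would then be pointed at the snapshot's .zfs/snapshot/ directory (or a clone of it) rather than at the live mount; exactly how to plumb the snapshot name from the pre-hook into the job's fileset is left open here.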
I might recommend doing this anyway, since you'll want to be able to experiment with configuration and controls before trying it on your production machine.

--Kevin