Re: Constant rebooting after power loss

2011-04-02 Thread Olivier Smedts
2011/4/2 Matthew Dillon :
>    The core of the issue here comes down to two things:
>
>    First, a power loss to the drive will cause the drive's dirty write cache
>    to be lost; that data will not make it to disk.  Nor do you really want
>    to turn off write caching on the physical drive.  Well, you CAN turn it
>    off, but if you do performance will become so bad that there's no point.
>    So turning off the write caching is really a non-starter.
>
>    The solution to this first item is for the OS/filesystem to issue a
>    disk flush command to the drive at appropriate times.  If I recall the
>    ZFS implementation in FreeBSD *DOES* do this for transaction groups,
>    which guarantees that a prior transaction group is fully synced before
>    a new one starts running (HAMMER in DragonFly also does this).
>    (Just getting an 'ack' from the write transaction over the SATA bus only
>    means the data made it to the drive's cache, not that it made it to
>    the platter).

Amen!

>    I'm not sure about UFS vis-à-vis the recent UFS logging features...
>    it might be an option but I don't know if it is a default.  Perhaps
>    someone can comment on that.
>
>    One last note here.  Many modern drives have very large ram caches.
>    OCZ's SSDs have something like 256MB write caches and many modern HDs
>    now come with 32MB and 64MB caches.  Aged drives with lots of relocated
>    sectors and bit errors can also take a very long time to perform writes
>    on certain sectors.  So these large caches take time to drain and one
>    can't really assume that an acknowledged write to disk will actually
>    make it to the disk under adverse circumstances any more.  All sorts
>    of bad things can happen.
>
>    Finally, the drives don't order their writes to the platter (you can
>    set a bit to tell them to, but like many similar bits in the past there
>    is no real guarantee that the drives will honor it).  So if two
>    transactions do not have a disk flush command in between them it is
>    possible for data from the second transaction to commit to the platter
>    before all the data from the first transaction commits to the platter.
>    Or worse, for the non-transactional data to update out of order relative
>    to the transactional data which was supposed to commit first.
>
>    Hence IMHO the OS/filesystem must use the disk flush command in such
>    situations for good reliability.
>
>    --
>
>    The second problem is that a physical loss of power to the drive can
>    cause the drive to physically lose one or more sectors, and can even
>    effectively destroy the drive (even with the fancy auto-park)... if the
>    drive happens to be in the middle of a track write-back when power is
>    lost it is possible to lose far more than a single sector, including
>    sectors unrelated to recent filesystem operations.
>
>    The only solution to #2 is to make sure your machines (or at least the
>    drives if they happen to be in external enclosures) are connected to
>    a UPS and that the machines are communicating with the UPS via
>    something like the "apcupsd" port.  AND also that you test to make
>    sure the machines properly shut themselves down when AC is lost before
>    the UPS itself runs out of battery time.  After all, a UPS won't help
>    if the machines don't at least idle their drives before power is lost!!!
>
>    I learned this lesson the hard way about 3 years ago.  I had something
>    like a dozen drives in two raid arrays doing heavy write activity and
>    lost physical power and several of the drives were totally destroyed,
>    with thousands of sector errors.  Not just one or two... thousands.
>
>    (It is unclear how SSDs react to physical loss of power during heavy
>    writing activity.  Theoretically, while they will certainly lose their
>    write cache, they shouldn't wind up with any read errors).
>
>                                                -Matt
>
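On the apcupsd point, a minimal /usr/local/etc/apcupsd/apcupsd.conf for a
USB-attached unit might look like this (a sketch; the thresholds are
illustrative, and UPSCABLE/UPSTYPE depend on your model):

UPSCABLE usb
UPSTYPE usb
DEVICE
BATTERYLEVEL 20   # begin shutdown when 20% battery remains
MINUTES 5         # ...or when 5 minutes of runtime remain
TIMEOUT 0         # rely on the two thresholds above

And, as Matt says, actually pull the AC cord on a non-critical load once
to make sure the shutdown really happens before the battery runs out.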



-- 
Olivier Smedts                                                 _
                                        ASCII ribbon campaign ( )
e-mail: oliv...@gid0.org        - against HTML email & vCards  X
www: http://www.gid0.org    - against proprietary attachments / \

  "Il y a seulement 10 sortes de gens dans le monde :
  ceux qui comprennent le binaire,
  et ceux qui ne le comprennent pas."


Re: Any success stories for HAST + ZFS?

2011-04-02 Thread Pawel Jakub Dawidek
On Thu, Mar 24, 2011 at 01:36:32PM -0700, Freddie Cash wrote:
> [Not sure which list is most appropriate since it's using HAST + ZFS
> on -RELEASE, -STABLE, and -CURRENT.  Feel free to trim the CC: on
> replies.]
> 
> I'm having a hell of a time making this work on real hardware, and am
> not ruling out hardware issues as yet, but wanted to get some
> reassurance that someone out there is using this combination (FreeBSD
> + HAST + ZFS) successfully, without kernel panics, without core dumps,
> without deadlocks, without issues, etc.  I need to know I'm not
> chasing a dead rabbit.

I just committed a fix for a problem that might look like a deadlock.
With trociny@'s patch and my last fix (to GEOM GATE and hastd), do you
still have any issues?

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




[releng_7 tinderbox] failure on sparc64/sparc64

2011-04-02 Thread FreeBSD Tinderbox
TB --- 2011-04-02 08:01:56 - tinderbox 2.6 running on freebsd-legacy.sentex.ca
TB --- 2011-04-02 08:01:56 - starting RELENG_7 tinderbox run for sparc64/sparc64
TB --- 2011-04-02 08:01:56 - cleaning the object tree
TB --- 2011-04-02 08:02:09 - cvsupping the source tree
TB --- 2011-04-02 08:02:09 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca 
-s /usr/home/tinderbox/RELENG_7/sparc64/sparc64/supfile
TB --- 2011-04-02 08:02:15 - building world
TB --- 2011-04-02 08:02:15 - MAKEOBJDIRPREFIX=/obj
TB --- 2011-04-02 08:02:15 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2011-04-02 08:02:15 - TARGET=sparc64
TB --- 2011-04-02 08:02:15 - TARGET_ARCH=sparc64
TB --- 2011-04-02 08:02:15 - TZ=UTC
TB --- 2011-04-02 08:02:15 - __MAKE_CONF=/dev/null
TB --- 2011-04-02 08:02:15 - cd /src
TB --- 2011-04-02 08:02:15 - /usr/bin/make -B buildworld
>>> World build started on Sat Apr  2 08:02:16 UTC 2011
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> World build completed on Sat Apr  2 09:14:14 UTC 2011
TB --- 2011-04-02 09:14:14 - generating LINT kernel config
TB --- 2011-04-02 09:14:14 - cd /src/sys/sparc64/conf
TB --- 2011-04-02 09:14:14 - /usr/bin/make -B LINT
TB --- 2011-04-02 09:14:14 - building LINT kernel
TB --- 2011-04-02 09:14:14 - MAKEOBJDIRPREFIX=/obj
TB --- 2011-04-02 09:14:14 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2011-04-02 09:14:14 - TARGET=sparc64
TB --- 2011-04-02 09:14:14 - TARGET_ARCH=sparc64
TB --- 2011-04-02 09:14:14 - TZ=UTC
TB --- 2011-04-02 09:14:14 - __MAKE_CONF=/dev/null
TB --- 2011-04-02 09:14:14 - cd /src
TB --- 2011-04-02 09:14:14 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sat Apr  2 09:14:14 UTC 2011
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
>>> stage 3.2: building everything
[...]
cc -c -O2 -pipe -fno-strict-aliasing  -std=c99  -Wall -Wredundant-decls 
-Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith 
-Winline -Wcast-qual  -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc  
-I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS 
-include opt_global.h -fno-common -finline-limit=15000 --param 
inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin 
-mcmodel=medany -msoft-float -ffreestanding -Werror  /src/sys/kern/kern_intr.c
cc -c -O2 -pipe -fno-strict-aliasing  -std=c99  -Wall -Wredundant-decls 
-Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith 
-Winline -Wcast-qual  -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc  
-I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS 
-include opt_global.h -fno-common -finline-limit=15000 --param 
inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin 
-mcmodel=medany -msoft-float -ffreestanding -Werror  /src/sys/kern/kern_jail.c
cc -c -O2 -pipe -fno-strict-aliasing  -std=c99  -Wall -Wredundant-decls 
-Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith 
-Winline -Wcast-qual  -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc  
-I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS 
-include opt_global.h -fno-common -finline-limit=15000 --param 
inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin 
-mcmodel=medany -msoft-float -ffreestanding -Werror  /src/sys/kern/kern_kse.c
cc -c -O2 -pipe -fno-strict-aliasing  -std=c99  -Wall -Wredundant-decls 
-Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith 
-Winline -Wcast-qual  -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc  
-I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS 
-include opt_global.h -fno-common -finline-limit=15000 --param 
inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin 
-mcmodel=medany -msoft-float -ffreestanding -Werror  
/src/sys/kern/kern_kthread.c
cc -c -O2 -pipe -fno-strict-aliasing  -std=c99  -Wall -Wredundant-decls 
-Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith 
-Winline -Wcast-qual  -Wundef -Wno-pointer-sign -fformat-extensions -nostdinc  
-I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS 
-include opt_global.h -fno-common -finline-limit=15000 --param 
inline-unit-growth=100 --param large-function-growth=1000 -fno-builtin 
-mcmodel=medany -msoft-float -ffreestanding -Werror  /src/sys/kern/kern_ktr.c
cc -c -O2 -pipe -fno-strict-aliasing  -std=c99  -Wall -Wredundant-decls 
-Wnested-externs -Wstrict-prototypes  -Wmissing-prototype

ahci.ko in RELENG_8_2, what about atacontrol cap?

2011-04-02 Thread Patrick M. Hausen
Hi, all,

On my system using the "old" ATA driver, I can use a command like this
to get useful information about my disk drives:


nas-pmh# atacontrol cap ad4

Protocol  SATA revision 2.x
device model  ST32000542AS
serial number 5XW251QF
firmware revision CC34
cylinders 16383
heads 16
sectors/track 63
lba supported 268435455 sectors
lba48 supported   3907029168 sectors
dma supported
overlap not supported

Feature                        Support  Enable   Value         Vendor
write cache                    yes      yes
read ahead                     yes      yes
Native Command Queuing (NCQ)   yes      -        31/0x1F
Tagged Command Queuing (TCQ)   no       no       31/0x1F
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      yes      49344/0xC0C0
automatic acoustic management  yes      yes      254/0xFE      254/0xFE


When I switch to the new AHCI driver the drives are connected to
the CAM subsystem:


nas-pmh# camcontrol devlist
at scbus0 target 0 lun 0 (ada0,pass0)
at scbus1 target 0 lun 0 (ada1,pass1)
at scbus2 target 0 lun 0 (ada2,pass2)
at scbus3 target 0 lun 0 (ada3,pass3)
  at scbus4 target 0 lun 0 (da0,pass4)


But:


nas-pmh# camcontrol inquiry ada0
nas-pmh# camcontrol readcap ada0
(pass0:ahcich0:0:0:0): READ CAPACITY(10). CDB: 25 0 0 0 0 0 0 0 0 0 
(pass0:ahcich0:0:0:0): CAM status: CCB request was invalid



Obvious question: is there a way to get the same information (NCQ support,
write cache status, ...) with the new driver?

Thanks,
Patrick
-- 
punkt.de GmbH * Kaiserallee 13a * 76133 Karlsruhe
Tel. 0721 9109 0 * Fax 0721 9109 100
i...@punkt.de   http://www.punkt.de
Gf: Jürgen Egeling  AG Mannheim 108285



Re: ahci.ko in RELENG_8_2, what about atacontrol cap?

2011-04-02 Thread Jeremy Chadwick
On Sat, Apr 02, 2011 at 11:34:32AM +0200, Patrick M. Hausen wrote:
> Hi, all,
> 
> On my system using the "old" ATA driver, I can use a command like this
> to get useful information about my disk drives:
> 
> 
> nas-pmh# atacontrol cap ad4
> 
> Protocol  SATA revision 2.x
> device model  ST32000542AS
> serial number 5XW251QF
> firmware revision CC34
> cylinders 16383
> heads 16
> sectors/track 63
> lba supported 268435455 sectors
> lba48 supported   3907029168 sectors
> dma supported
> overlap not supported
> 
> Feature                        Support  Enable   Value         Vendor
> write cache                    yes      yes
> read ahead                     yes      yes
> Native Command Queuing (NCQ)   yes      -        31/0x1F
> Tagged Command Queuing (TCQ)   no       no       31/0x1F
> SMART                          yes      yes
> microcode download             yes      yes
> security                       yes      no
> power management               yes      yes
> advanced power management      yes      yes      49344/0xC0C0
> automatic acoustic management  yes      yes      254/0xFE      254/0xFE
> 
> 
> When I switch to the new AHCI driver the drives are connected to
> the CAM subsystem:
> 
> 
> nas-pmh# camcontrol devlist
> at scbus0 target 0 lun 0 (ada0,pass0)
> at scbus1 target 0 lun 0 (ada1,pass1)
> at scbus2 target 0 lun 0 (ada2,pass2)
> at scbus3 target 0 lun 0 (ada3,pass3)
>   at scbus4 target 0 lun 0 (da0,pass4)
> 
> 
> But:
> 
> 
> nas-pmh# camcontrol inquiry ada0
> nas-pmh# camcontrol readcap ada0
> (pass0:ahcich0:0:0:0): READ CAPACITY(10). CDB: 25 0 0 0 0 0 0 0 0 0 
> (pass0:ahcich0:0:0:0): CAM status: CCB request was invalid
> 
> 
> 
> Obvious question: is there a way to get the same information (NCQ support,
> write cache status, ...) with the new driver?

You want "camcontrol identify adaX".  DO NOT confuse this with
"camcontrol inquiry adaX" (this won't work).

identify = for ATA
inquiry  = for SCSI

See camcontrol(8) man page for specifics.
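For example, abridged output (a sketch based on the drive from your
"atacontrol cap" listing; exact fields vary by drive and driver version):

nas-pmh# camcontrol identify ada0
protocol              ATA/ATAPI-8 SATA 2.x
device model          ST32000542AS
firmware revision     CC34
serial number         5XW251QF
[...]
Feature                        Support  Enabled  Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
Native Command Queuing (NCQ)   yes               32 tags
SMART                          yes      yes
[...]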

-- 
| Jeremy Chadwick   j...@parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.   PGP 4BD6C0CB |



Re: Backup tool for ZFS with all "classic dump(8)" features -- what should I use? (or is there any way to make dump -L work well on large FFS2+SU?)

2011-04-02 Thread Ronald Klop

Have you looked at rsync or tarsnap?

On Mon, 28 Mar 2011 09:20:07 +0200, Lev Serebryakov   
wrote:



Hello, Freebsd-stable.

  Now I'm backing up my HOME filesystem with dump(8). It works
perfectly for an 80GiB FS with many features: snapshots for consistency,
levels, the "nodump" flag (my users use it a lot!), the ability to extract
a single deleted file from a backup without restoring the full FS, a
simple script wrapper for the level schedule, etc.

  On the new server I have a huge HOME (500GiB). Even though it holds
only 25GiB of data, creating a snapshot takes about 10 minutes,
freezes all I/O, and sometimes FAILS (!!!).

  I'm thinking of moving the HOME filesystem to ZFS, but I cannot find
appropriate tools for backing it up. Here are my requirements:

 (1) One-file (one-stream) backups, not a directory mirror. I need to
 store the backup on an FTP server and upload it with a single command.

 (2) Levels & incremental backups. Now I have a "monthly (0) - weekly
(1,2,3) - daily (4,5,6,7,8,9)" scheme. I could accept other schemes, as
long as they don't store a full backup every day and don't need a full
backup more often than weekly.

 (3) Minimum of local metadata. Storing previous backups locally to
 calculate the next one is not an appropriate solution. ("zfs send", for
 example, needs previous snapshots for incremental backups.)

 (4) Works from a snapshot (I think this is trivial in the case of ZFS).

 (5) Backup exclusions should be controlled by the users themselves (not
 the super-user), like the "nodump" flag in the case of FFS/dump(8).
 "zfs send" cannot provide this. I have very responsible users, so a full
 backup now takes only up to 10GiB when the whole HOME FS is about 25GiB;
 that is a big help when the backup is sent over the Internet to another
 host.

 (6) Storing of ALL FS-specific information -- ACLs, etc.

 (7) Free :)

  Is there something like this for ZFS? "zfs send" looks promising,
 EXCEPT for item (5) and, maybe, (3) :(

   GNU tar looks like it covers everything but (6) :(
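
  For reference, the "zfs send" cycle I'm evaluating looks roughly like
this (a sketch; snapshot and host names are made up), and it is exactly
where (3) and (5) hurt -- the -i form needs yesterday's snapshot kept
locally, and there is no per-user exclusion:

# zfs snapshot home@2011-04-02
# zfs send -i home@2011-04-01 home@2011-04-02 | \
    ssh backuphost "cat > /backups/home.2011-04-02.zfs"
# zfs destroy home@2011-04-01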



Re: geli(4) memory leak

2011-04-02 Thread Pawel Jakub Dawidek
On Sat, Apr 02, 2011 at 12:04:09AM +0300, Mikolaj Golub wrote:
> Your patch looks correct to me, but the same issue exists for reads :-). Also,
> to avoid the leak I think we can just call g_destroy_bio() before the "all
> sectors" check. See the attached patch (it has had some testing).

The patch looks good. Please commit.

> Index: sys/geom/eli/g_eli.c
> ===
> --- sys/geom/eli/g_eli.c  (revision 220168)
> +++ sys/geom/eli/g_eli.c  (working copy)
> @@ -160,13 +160,13 @@ g_eli_read_done(struct bio *bp)
>   pbp = bp->bio_parent;
>   if (pbp->bio_error == 0)
>   pbp->bio_error = bp->bio_error;
> + g_destroy_bio(bp);
>   /*
>* Do we have all sectors already?
>*/
>   pbp->bio_inbed++;
>   if (pbp->bio_inbed < pbp->bio_children)
>   return;
> - g_destroy_bio(bp);
>   sc = pbp->bio_to->geom->softc;
>   if (pbp->bio_error != 0) {
>   G_ELI_LOGREQ(0, pbp, "%s() failed", __func__);
> @@ -202,6 +202,7 @@ g_eli_write_done(struct bio *bp)
>   if (bp->bio_error != 0)
>   pbp->bio_error = bp->bio_error;
>   }
> + g_destroy_bio(bp);
>   /*
>* Do we have all sectors already?
>*/
> @@ -215,7 +216,6 @@ g_eli_write_done(struct bio *bp)
>   pbp->bio_error);
>   pbp->bio_completed = 0;
>   }
> - g_destroy_bio(bp);
>   /*
>* Write is finished, send it up.
>*/

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Kernel memory leak in 8.2-PRERELEASE?

2011-04-02 Thread Boris Kochergin

Ahoy. This morning, I awoke to the following on one of my servers:

pid 59630 (httpd), uid 80, was killed: out of swap space
pid 59341 (find), uid 0, was killed: out of swap space
pid 23134 (irssi), uid 1001, was killed: out of swap space
pid 49332 (sshd), uid 1001, was killed: out of swap space
pid 69074 (httpd), uid 0, was killed: out of swap space
pid 11879 (eggdrop-1.6.19), uid 1001, was killed: out of swap space
...

And so on.

The machine is:

FreeBSD exodus.poly.edu 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #2: Thu 
Dec  2 11:39:21 EST 2010 
sp...@exodus.poly.edu:/usr/obj/usr/src/sys/EXODUS  amd64


10:13AM  up 120 days, 20:06, 2 users, load averages: 0.00, 0.01, 0.00

The memory line from top intrigued me:

Mem: 16M Active, 48M Inact, 6996M Wired, 229M Cache, 828M Buf, 605M Free

The machine has 8 gigs of memory, and I don't know what all that wired 
memory is being used for. There is a large-ish (6 x 1.5-TB) ZFS RAID-Z2 
on it which has had a disk in the UNAVAIL state for a few months:


# zpool status
  pool: home
 state: DEGRADED
status: One or more devices could not be used because the label is 
missing or

invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
homeDEGRADED 0 0 0
  raidz2DEGRADED 0 0 0
ada0ONLINE   0 0 0
ada1ONLINE   0 0 0
ada2ONLINE   0 0 0
ada3ONLINE   0 0 0
ada4ONLINE   0 0 0
ada5UNAVAIL  08511  experienced I/O failures

errors: No known data errors

"vmstat -m" and "vmstat -z" output:

http://acm.poly.edu/~spawk/vmstat-m.txt
http://acm.poly.edu/~spawk/vmstat-z.txt

Anyone have a clue? I know it's just going to happen again if I reboot 
the machine. It is still up in case there are diagnostics for me to run.


-Boris


Re: Kernel memory leak in 8.2-PRERELEASE?

2011-04-02 Thread Jeremy Chadwick
On Sat, Apr 02, 2011 at 10:17:27AM -0400, Boris Kochergin wrote:
> Ahoy. This morning, I awoke to the following on one of my servers:
> 
> pid 59630 (httpd), uid 80, was killed: out of swap space
> pid 59341 (find), uid 0, was killed: out of swap space
> pid 23134 (irssi), uid 1001, was killed: out of swap space
> pid 49332 (sshd), uid 1001, was killed: out of swap space
> pid 69074 (httpd), uid 0, was killed: out of swap space
> pid 11879 (eggdrop-1.6.19), uid 1001, was killed: out of swap space
> ...
> 
> And so on.
>
> The machine is:
> 
> FreeBSD exodus.poly.edu 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #2:
> Thu Dec  2 11:39:21 EST 2010
> sp...@exodus.poly.edu:/usr/obj/usr/src/sys/EXODUS  amd64
> 
> 10:13AM  up 120 days, 20:06, 2 users, load averages: 0.00, 0.01, 0.00
> 
> The memory line from top intrigued me:
> 
> Mem: 16M Active, 48M Inact, 6996M Wired, 229M Cache, 828M Buf, 605M Free
> 
> The machine has 8 gigs of memory, and I don't know what all that
> wired memory is being used for. There is a large-ish (6 x 1.5-TB)
> ZFS RAID-Z2 on it which has had a disk in the UNAVAIL state for a
> few months:

The ZFS ARC is what's responsible for your large wired count.

How much swap space do you have?  You excluded that line from top.
"swapinfo" would also be helpful, but would indicate the same thing.

If you lack swap (which is a bad idea for a lot of reasons), then it
would make sense for the machine to run out of available memory for
userspace: a process grew too large and impacted others that were trying
to malloc() at the time.

Can you please provide /boot/loader.conf and /etc/sysctl.conf ?

> # zpool status
>   pool: home
>  state: DEGRADED
> status: One or more devices could not be used because the label is
> missing or
> invalid.  Sufficient replicas exist for the pool to continue
> functioning in a degraded state.
> action: Replace the device using 'zpool replace'.
>see: http://www.sun.com/msg/ZFS-8000-4J
>  scrub: none requested
> config:
> 
> NAMESTATE READ WRITE CKSUM
> homeDEGRADED 0 0 0
>   raidz2DEGRADED 0 0 0
> ada0ONLINE   0 0 0
> ada1ONLINE   0 0 0
> ada2ONLINE   0 0 0
> ada3ONLINE   0 0 0
> ada4ONLINE   0 0 0
> ada5UNAVAIL  08511  experienced I/O failures
> 
> errors: No known data errors

I would also recommend fixing ada5; I'm not sure why any SA would let a
bad disk sit in a machine for "a few months".  Though, hopefully, this
doesn't cause extra memory usage or something odd behind the scenes (in
the kernel).  I'm going to assume the two things are completely
unrelated.

> "vmstat -m" and "vmstat -z" output:
> 
> http://acm.poly.edu/~spawk/vmstat-m.txt
> http://acm.poly.edu/~spawk/vmstat-z.txt
> 
> Anyone have a clue? I know it's just going to happen again if I
> reboot the machine. It is still up in case there are diagnostics for
> me to run.

The above vmstat data won't be too helpful since you need to see what's
going on "over time" and not what the values are right now.  There may
be one of them that indicates available userspace vs. available kmem.

Basically what you need is the equivalent of Solaris sar(1), so that you
can see memory usage of processes/etc. over time and find out if
something went off the rails and started allocating like crazy.
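
In the absence of that, even a crude periodic dump gives you something
to diff later (a sketch; the log path is made up):

# root's crontab: snapshot kernel memory counters every 10 minutes
*/10 * * * * (date; vmstat -m; vmstat -z) >> /var/log/memhist.log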

If the kernel itself ran out, you'd be seeing a panic.

Sorry if these ideas/comments seem like a ramble, I've been up all night
trying to decode a circa-1992 font routine in 65816 assembly, heh.  :-)

-- 
| Jeremy Chadwick   j...@parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.   PGP 4BD6C0CB |



Re: Kernel memory leak in 8.2-PRERELEASE?

2011-04-02 Thread Kostik Belousov
On Sat, Apr 02, 2011 at 10:17:27AM -0400, Boris Kochergin wrote:
> Ahoy. This morning, I awoke to the following on one of my servers:
> 
> pid 59630 (httpd), uid 80, was killed: out of swap space
> pid 59341 (find), uid 0, was killed: out of swap space
> pid 23134 (irssi), uid 1001, was killed: out of swap space
> pid 49332 (sshd), uid 1001, was killed: out of swap space
> pid 69074 (httpd), uid 0, was killed: out of swap space
> pid 11879 (eggdrop-1.6.19), uid 1001, was killed: out of swap space
> ...
> 
> And so on.
> 
> The machine is:
> 
> FreeBSD exodus.poly.edu 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #2: Thu 
> Dec  2 11:39:21 EST 2010 
> sp...@exodus.poly.edu:/usr/obj/usr/src/sys/EXODUS  amd64
> 
> 10:13AM  up 120 days, 20:06, 2 users, load averages: 0.00, 0.01, 0.00
> 
> The memory line from top intrigued me:
> 
> Mem: 16M Active, 48M Inact, 6996M Wired, 229M Cache, 828M Buf, 605M Free
> 
> The machine has 8 gigs of memory, and I don't know what all that wired 
> memory is being used for. There is a large-ish (6 x 1.5-TB) ZFS RAID-Z2 
> on it which has had a disk in the UNAVAIL state for a few months:
> 
> # zpool status
>   pool: home
>  state: DEGRADED
> status: One or more devices could not be used because the label is 
> missing or
> invalid.  Sufficient replicas exist for the pool to continue
> functioning in a degraded state.
> action: Replace the device using 'zpool replace'.
>see: http://www.sun.com/msg/ZFS-8000-4J
>  scrub: none requested
> config:
> 
> NAMESTATE READ WRITE CKSUM
> homeDEGRADED 0 0 0
>   raidz2DEGRADED 0 0 0
> ada0ONLINE   0 0 0
> ada1ONLINE   0 0 0
> ada2ONLINE   0 0 0
> ada3ONLINE   0 0 0
> ada4ONLINE   0 0 0
> ada5UNAVAIL  08511  experienced I/O failures
> 
> errors: No known data errors
> 
> "vmstat -m" and "vmstat -z" output:
> 
> http://acm.poly.edu/~spawk/vmstat-m.txt
> http://acm.poly.edu/~spawk/vmstat-z.txt
> 
> Anyone have a clue? I know it's just going to happen again if I reboot 
> the machine. It is still up in case there are diagnostics for me to run.

Try r218795. Most likely, your issue is not a leak.




Re: Kernel memory leak in 8.2-PRERELEASE?

2011-04-02 Thread Boris Kochergin

On 04/02/11 11:33, Kostik Belousov wrote:

On Sat, Apr 02, 2011 at 10:17:27AM -0400, Boris Kochergin wrote:

Ahoy. This morning, I awoke to the following on one of my servers:

pid 59630 (httpd), uid 80, was killed: out of swap space
pid 59341 (find), uid 0, was killed: out of swap space
pid 23134 (irssi), uid 1001, was killed: out of swap space
pid 49332 (sshd), uid 1001, was killed: out of swap space
pid 69074 (httpd), uid 0, was killed: out of swap space
pid 11879 (eggdrop-1.6.19), uid 1001, was killed: out of swap space
...

And so on.

The machine is:

FreeBSD exodus.poly.edu 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #2: Thu
Dec  2 11:39:21 EST 2010
sp...@exodus.poly.edu:/usr/obj/usr/src/sys/EXODUS  amd64

10:13AM  up 120 days, 20:06, 2 users, load averages: 0.00, 0.01, 0.00

The memory line from top intrigued me:

Mem: 16M Active, 48M Inact, 6996M Wired, 229M Cache, 828M Buf, 605M Free

The machine has 8 gigs of memory, and I don't know what all that wired
memory is being used for. There is a large-ish (6 x 1.5-TB) ZFS RAID-Z2
on it which has had a disk in the UNAVAIL state for a few months:

# zpool status
   pool: home
  state: DEGRADED
status: One or more devices could not be used because the label is
missing or
 invalid.  Sufficient replicas exist for the pool to continue
 functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-4J
  scrub: none requested
config:

 NAMESTATE READ WRITE CKSUM
 homeDEGRADED 0 0 0
   raidz2DEGRADED 0 0 0
 ada0ONLINE   0 0 0
 ada1ONLINE   0 0 0
 ada2ONLINE   0 0 0
 ada3ONLINE   0 0 0
 ada4ONLINE   0 0 0
 ada5UNAVAIL  08511  experienced I/O failures

errors: No known data errors

"vmstat -m" and "vmstat -z" output:

http://acm.poly.edu/~spawk/vmstat-m.txt
http://acm.poly.edu/~spawk/vmstat-z.txt

Anyone have a clue? I know it's just going to happen again if I reboot
the machine. It is still up in case there are diagnostics for me to run.

Try r218795. Most likely, your issue is not a leak.


Thanks. Will update to today's 8-STABLE and report back.

-Boris


Re: Constant rebooting after power loss

2011-04-02 Thread David Magda
On Apr 1, 2011, at 23:35, Matthew Dillon wrote:

>The solution to this first item is for the OS/filesystem to issue a
>disk flush command to the drive at appropriate times.  If I recall the
>ZFS implementation in FreeBSD *DOES* do this for transaction groups,
>which guarantees that a prior transaction group is fully synced before
>a new one starts running (HAMMER in DragonFly also does this).
>(Just getting an 'ack' from the write transaction over the SATA bus only
>means the data made it to the drive's cache, not that it made it to
>the platter).

It should also be noted that some drives ignore or lie about these flush
commands: i.e., they say they flushed the buffers but did not in fact do so.
This is sometimes done on cheap SATA drives, but also on expensive SANs. In the
former's case it's often to help with benchmark numbers. In the latter's case,
it's usually okay because the buffers are actually NVRAM, and so are safe
across power cycles. There are also some USB-to-SATA chipsets that don't handle
flush commands and simply ACK them without passing them to the drive, so
yanking a drive can cause problems.

There has been quite a bit of discussion on the zfs-discuss list on this topic
over the years, especially when it comes to (consumer) SSDs.



Re: Constant rebooting after power loss

2011-04-02 Thread Gary Palmer
On Sat, Apr 02, 2011 at 12:55:15PM -0400, David Magda wrote:
> On Apr 1, 2011, at 23:35, Matthew Dillon wrote:
> 
> >The solution to this first item is for the OS/filesystem to issue a
> >disk flush command to the drive at appropriate times.  If I recall the
> >ZFS implementation in FreeBSD *DOES* do this for transaction groups,
> >which guarantees that a prior transaction group is fully synced before
> >a new one starts running (HAMMER in DragonFly also does this).
> >(Just getting an 'ack' from the write transaction over the SATA bus only
> >means the data made it to the drive's cache, not that it made it to
> >the platter).
> 
> It should also be noted that some drives ignore or lie about these flush
> commands: i.e., they say they flushed the buffers but did not in fact do so.
> This is sometimes done on cheap SATA drives, but also on expensive SANs. In
> the former's case it's often to help with benchmark numbers. In the latter's
> case, it's usually okay because the buffers are actually NVRAM, and so are
> safe across power cycles. There are also some USB-to-SATA chipsets that don't
> handle flush commands and simply ACK them without passing them to the drive,
> so yanking a drive can cause problems.

SANs are *theoretically* safer because of their battery-backed caches; however,
it's not guaranteed - I've seen an array controller crash and royally screw
the data sets as a result, even when the cache was allegedly mirrored to
the redundant controller in the array.

NVRAM/battery-backed caches protect against certain failures but introduce
other failures in their place.  You have to do your own risk/benefit
analysis to see which is the best solution for your usage scenario.
As long as data is "in transit" to permanent storage, it's at risk.  All the
disk redundancy and battery-backed caches in the world are no replacement for
a comprehensive *and regularly tested* backup strategy.

Regards,

Gary


Re: Constant rebooting after power loss

2011-04-02 Thread Matthew Dillon
:It should also be noted that some drives ignore or lie about these flush
commands: i.e., they say they flushed the buffers but did not in fact do so.
This is sometimes done on cheap SATA drives, but also on expensive SANs. In the
former's case it's often to help with benchmark numbers. In the latter's case,
it's usually okay because the buffers are actually NVRAM, and so are safe
across power cycles. There are also some USB-to-SATA chipsets that don't handle
flush commands and simply ACK them without passing them to the drive, so
yanking a drive can cause problems.
:
:There has been quite a bit of discussion on the zfs-discuss list on this topic
over the years, especially when it comes to (consumer) SSDs.

Remember also that numerous ZFS studies have been debunked in recent
years, though I agree with the idea that going that extra mile requires
not trusting anything.  In many respects ZFS's biggest enemy now is
bugs in ZFS itself (or the OS it runs under), and not so much glitches
in the underlying storage framework.

I am unaware of *ANY* mainstream hard drive or SSD made in the last
10 years which ignores the disk flush command.  In previous decades HD
vendors played games with caching all the time but there are fewer
HD vendors now and they all compete heavily with each other... they
don't play those games any more for fear of losing their reputation.
There is very little vendor loyalty in the hard drive business.

When it comes to SSDs there are all sorts of fringe vendors, and I
certainly would not trust any of those, but if you stick to
well-known vendors like Intel or OCZ it will work.  Look for whose
chipsets are under the hood more than for whose name is slapped onto
the SSD, and get as close to the source as you can.

Most current-day disk flush command issues are at a higher level.  For
example, numerous VMs ignore the command (they don't even bother to fsync()
the underlying block devices or files!).  There isn't anything you can
do about a VM other than complain about it to the vendor.  I've been hit
by precisely this issue running HAMMER inside a VM on a Windows box.
If the VM blue-screens the Windows box (which happens quite often)
the data on-disk can wind up corrupted beyond all measure.

People who use VMs with direct-attached filesystems basically rely on
the host computer never crashing and should really have no expectation
of storage reliability short of running the VM inside an IBM mainframe.
That is the unfortunate truth.

With USB the primary culprits are virtually *all* USB/FireWire/SATA
bridges, as you noted; I think there are only two or three actual
manufacturers, and they are all broken.  The USB standard itself
shares the blame for this.  It is a really horrible standard.

USB-sticks are the ones that typically either lock up or return
success but don't actually flush their (fortunately limited) caches.
Nobody in their right mind uses USB to attach a disk when reliability
is important.  It's fine to have it... I have lots of USB sticks and
a few USB-attached HDs lying around, but I have *ZERO* expectation of
reliability from them and neither should anyone else.

SD cards are in the same category as USB.  Useful but untrustworthy.

Other fringe consumer crap, like fake-raid (BIOS-based RAID), is equally
unreliable when it comes to dealing with outright crashes.  Always fun
to have drives which can't be moved to other machines if a mobo dies!
Not!

With network-attached drives the standard itself is broken.  It tries to
define command completion as the data being on-media, which is stupid
when no other direct-attached standard requires that.  Stupidity in
standards is a primary factor in vendors ignoring portions of standards.

In the case of network-attached drives implemented with direct-attached
drives on machines with software drivers to bridge to the network,
it comes down to whether the software deals with the flush command
properly, because it sure as hell isn't going to sync each write
command all the way to the media!

But frankly, none of these issues is a reason to skip the command
or rationalize it away.  Not that I am blaming anyone for
trying to rationalize it away; I am simply pointing out that in a
market as large as the generic 'storage' market, there are always
going to be tons of broken stuff out there to avoid.  It's buyer beware.

What we care about here, in this discussion, is direct-attached
SATA/eSATA/SAS, port multipliers and other external enclosure bridges,
high-end SCSI phys, and, NVRAM aside (which is arguable), real RAID
hardware.  And well-known vendors (fringe SSDs do not count).  That
covers 90% of the market and 99% of the cases where protocol reliability
is required.


Re: Constant rebooting after power loss

2011-04-02 Thread Florian Wagner
> I am unaware of *ANY* mainstream hard drive or SSD made in the
> last 10 years which ignores the disk flush command.  In previous
> decades HD vendors played games with caching all the time but there
> are fewer HD vendors now and they all compete heavily with each
> other... they don't play those games any more for fear of losing
> their reputation. There is very little vendor loyalty in the hard
> drive business.
> 
> When it comes to SSDs there are all sorts of fringe vendors, and I
> certainly would not trust any of those, but if you stick to
> well-known vendors like Intel or OCZ it will work.  Look for whose
> chipsets are under the hood more than for whose name is slapped
> onto the SSD, and get as close to the source as you can.

As far as my knowledge goes, no mainstream non-enterprise SSD obeys
the cache flush command at all. The question of a recommended SSD
for ZIL use regularly comes up on the zfs-discuss mailing list, and the
general consensus seems to be that no non-enterprise SSD can really be
recommended because of this issue.

The only publicly available SSDs which do not exhibit this are those
with a SandForce enterprise controller (SF-1500, if my memory serves me
correctly) and a capacitor (OCZ sells such models), and the Intel X25-E
with its write cache turned off (which also results in horrible write
performance all around).

I've had the chance to verify this with a Corsair Force SSD with the
SF-1200 controller. It consistently "lost" about 1.2 to 1.5 MB of data
which it claimed to have committed to disk.
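
For anyone who wants to run the same kind of test, the idea is simple
(a sketch, not the exact tool I used; the file path is made up): keep
writing sequence-numbered records, force them out after each write, log
the last acknowledged number on another machine, then cut power and see
what actually survived on the media:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write 512-byte records carrying a sequence number, asking the OS to
 * push each one to stable storage.  On a stack that honors flushes,
 * every number printed as "acked" must still be readable after a
 * power cut; anything older that vanished was lied about somewhere.
 */
int
main(void)
{
        int fd = open("/mnt/ssd/seq.dat", O_WRONLY | O_CREAT, 0644);
        char buf[512];
        uint64_t seq;

        if (fd < 0) {
                perror("open");
                return (1);
        }
        for (seq = 0;; seq++) {
                memset(buf, 0, sizeof(buf));
                snprintf(buf, sizeof(buf), "%ju", (uintmax_t)seq);
                if (pwrite(fd, buf, sizeof(buf),
                    (off_t)(seq % 2048) * 512) != (ssize_t)sizeof(buf)) {
                        perror("pwrite");
                        return (1);
                }
                if (fsync(fd) != 0) {
                        perror("fsync");
                        return (1);
                }
                printf("acked %ju\n", (uintmax_t)seq); /* log this remotely */
        }
}

(Whether fsync() really ends in a cache flush depends on the filesystem,
which is part of what the test exercises.)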


Regards
Florian




Re: Backup tool for ZFS with all "classic dump(8)" features -- what should I use? (or is there any way to make dump -L work well on large FFS2+SU?)

2011-04-02 Thread Kevin Thompson
On Mon, 28 Mar 2011 03:20:07 -0400, Lev Serebryakov   
wrote:




  I'm thinking of moving the HOME filesystem to ZFS, but I cannot find
appropriate tools for backing it up. Here are my requirements:


Have you considered a full-up backup solution, like Bacula? It's a
client/server/server-model backup system: there's a server process that
coordinates all actions (the 'director'), various server processes that run
on the machines with the devices/mounts/disks for storing the backups
('storage daemons'), and then each client runs a little process to give
the backup servers access to its files ('file daemons').


It allows you to specify a large amount of behavior. You can store backups
to disk/file and to tape. If using disks/files, you can back up to the same
file always, back up to files capped at 1GB each, or back up to a new file
each time, iirc. It has support for arbitrary schedules, with each schedule
being able to specify the dump level (full, incremental, differential). It
uses a database in the director for metadata. And, iirc, it honors the
nodump flag, stores ACLs, etc.


Most importantly, it has support for pre- and post-backup hooks, so you  
can tell it to snapshot beforehand and then (probably, see below) use the  
post-hook to push the data where you want.
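
The pieces relevant to your list might look roughly like this (a sketch
from memory, untested; resource names, paths, and the snapshot script are
made up):

FileSet {
  Name = "HomeSet"
  Include {
    Options {
      signature = MD5
      honor nodump flag = yes   # users keep their "nodump" control (5)
      aclsupport = yes          # requirement (6)
    }
    File = /home
  }
}

Job {
  Name = "BackupHome"
  Type = Backup
  Client = homeserver-fd
  FileSet = "HomeSet"
  Storage = File
  Pool = Default
  Messages = Standard
  ClientRunBeforeJob = "/usr/local/sbin/snap-home start"  # pre-backup hook
  ClientRunAfterJob  = "/usr/local/sbin/snap-home stop"   # post-backup hook
}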


Reading about your requirement #1, I'm guessing that the backup data is
being collected locally and then sent over FTP for permanent storage. Do
you have control over this remote machine? Could you replace FTP with
Bacula's networked client/server model? This might be the one spot where
it would be hard to make Bacula work for you; I'm not sure, since I
haven't played with Bacula in this configuration and I'm not exactly sure
what your restrictions are.


Even then, you could probably mount the FTP server as a 'file system' à la
sshfs and have the storage daemon write directly to the mounted file
system.


And yeah, it's free. http://www.bacula.org

If you want to give it a shot, you can set it up on a little test machine  
and have it backup to itself. I might recommend doing this anyway since  
you'll want to be able to experiment with configuration and controls  
before trying it on your production machine.


--Kevin