Re: [zfs-discuss] 6410 expansion shelf
I should be able to reply to you next Tuesday -- my 6140 SATA expansion tray is due to arrive. Meanwhile, what kind of problem do you have with the 3511?

--
Just me,
Wire ...

On 3/23/07, Frank Cusack <[EMAIL PROTECTED]> wrote:
Does anyone have a 6140 expansion shelf that they can hook directly to a host? Just wondering if this configuration works. Previously I thought the expansion connector was proprietary, but now I see it's just fibre channel. I tried this before with a 3511 and it "kind of" worked, but ultimately had various problems and I had to give up on it. Hoping to avoid the cost of the RAID controller.

-frank
Re: [zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Thu, Mar 22, 2007 at 08:39:55AM -0700, Eric Schrock wrote:
> Again, thanks to devids, the autoreplace code would not kick in here at
> all. You would end up with an identical pool.

Eric, maybe I'm missing something, but why ZFS depend on devids at all? As I understand it, devid is something that never change for a block device, eg. disk serial number, but on the other hand it is optional, so we can rely on the fact it's always there (I mean for all block devices we use).

Why we simply not forget about devids and just focus on on-disk metadata to detect pool components? The only reason I see is performance. This is probably why /etc/zfs/zpool.cache is used as well.

In FreeBSD we have the GEOM infrastructure for storage. Each storage device (disk, partition, mirror, etc.) is simply a GEOM provider. If a GEOM provider appears (eg. disk is inserted, partition is configured) all interested parties are informed about this and can 'taste' the provider by reading metadata specific for them. The same when a provider goes away - all interested parties are informed and can react accordingly.

We don't see any performance problems related to the fact that each disk that appears is read by many "GEOM classes".

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Re: [zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Fri, Mar 23, 2007 at 11:31:03AM +0100, Pawel Jakub Dawidek wrote:
> On Thu, Mar 22, 2007 at 08:39:55AM -0700, Eric Schrock wrote:
> > Again, thanks to devids, the autoreplace code would not kick in here at
> > all. You would end up with an identical pool.
>
> Eric, maybe I'm missing something, but why ZFS depend on devids at all?
> As I understand it, devid is something that never change for a block
> device, eg. disk serial number, but on the other hand it is optional, so
> we can rely on the fact it's always there (I mean for all block devices

s/can/can't/

> we use).
>
> Why we simply not forget about devids and just focus on on-disk metadata
> to detect pool components?
>
> The only reason I see is performance. This is probably why
> /etc/zfs/zpool.cache is used as well.
>
> In FreeBSD we have the GEOM infrastructure for storage. Each storage
> device (disk, partition, mirror, etc.) is simply a GEOM provider. If a
> GEOM provider appears (eg. disk is inserted, partition is configured)
> all interested parties are informed about this and can 'taste' the
> provider by reading metadata specific for them. The same when a provider
> goes away - all interested parties are informed and can react
> accordingly.
>
> We don't see any performance problems related to the fact that each disk
> that appears is read by many "GEOM classes".

--
Pawel Jakub Dawidek                       http://www.wheel.pl
[EMAIL PROTECTED]                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
[zfs-discuss] ZFS ontop of SVM - CKSUM errors
Hi.

bash-3.00# uname -a
SunOS nfs-14-2.srv 5.10 Generic_125101-03 i86pc i386 i86pc

I created a first zpool (a stripe of 85 disks) and did some simple stress testing - everything seemed almost alright (~700MB/s sequential reads, ~430MB/s sequential writes). Then I destroyed the pool and put an SVM stripe on top of the same disks, utilizing the fact that ZFS had already put an EFI label on them, so s0 represents almost the entire disk. Then on top of the SVM volume I put ZFS, dd'ed some files, ran zpool scrub and got:

bash-3.00# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 66 errors on Fri Mar 23 12:52:36 2007
config:

        NAME                STATE     READ WRITE CKSUM
        test                ONLINE       0     0   134
          /dev/md/dsk/d100  ONLINE       0     0   134

errors: 66 data errors, use '-v' for a list
bash-3.00#

The disks are from a Clariion CX3-40 with FC 15K drives using MPxIO (2x 4Gb links). I was changing watermarks for the cache on the array, and now I wonder - is it the array or SVM+ZFS? I'm a little bit suspicious about SVM, as I can get only ~80MB/s on average with short bursts up to ~380MB/s (no matter if it's ZFS, UFS or the raw device directly), which is much, much less than ZFS alone (and on an x4500 I can get ~2GB/s reads with SVM). No errors in the logs, metastat is clean. Of course fmdump -e reports errors from ZFS, but that's expected.

So I destroyed the zpool, created it again, dd'ed from /dev/zero to the pool, and then read a file back - and right away I got CKSUM errors, so it seems repeatable (no watermark fiddling this time). Later I destroyed the pool and the SVM device, created a new pool on the same disks, ran the same dd, and this time there were no CKSUM errors and much better performance.
bash-3.00# metastat -p d100 d100 1 85 /dev/dsk/c6t6006016062231B003CBA35791CD9DB11d0s0 /dev/dsk/c6t6006016062231B0004D599691CD9DB11d0s0 /dev/dsk/c6t6006016062231B00BC373C571CD9DB11d0s0 /dev/dsk/c6t6006016062231B0032CCFE481CD9DB11d0s0 /dev/dsk/c6t6006016062231B0096CB093A1CD9DB11d0s0 /dev/dsk/c6t6006016062231B00D40FEB261CD9DB11d0s0 /dev/dsk/c6t6006016062231B00DC759B171CD9DB11d0s0 /dev/dsk/c6t6006016062231B00D68713071CD9DB11d0s0 /dev/dsk/c6t6006016062231B00CE8F64F71BD9DB11d0s0 /dev/dsk/c6t6006016062231B009005C0E61BD9DB11d0s0 /dev/dsk/c6t6006016062231B00CABCE6D81BD9DB11d0s0 /dev/dsk/c6t6006016062231B00F2B124C91BD9DB11d0s0 /dev/dsk/c6t6006016062231B0004FE5CBA1BD9DB11d0s0 /dev/dsk/c6t6006016062231B0034CFFBAB1BD9DB11d0s0 /dev/dsk/c6t6006016062231B00DCB4349F1BD9DB11d0s0 /dev/dsk/c6t6006016062231B0024C093921BD9DB11d0s0 /dev/dsk/c6t6006016062231B0090F561871BD9DB11d0s0 /dev/dsk/c6t6006016062231B000EB2C0751BD9DB11d0s0 /dev/dsk/c6t6006016062231B008CF5B2671BD9DB11d0s0 /dev/dsk/c6t6006016062231B002A6ED0561BD9DB11d0s0 /dev/dsk/c6t6006016062231B00441DFD4C1BD9DB11d0s0 /dev/dsk/c6t6006016062231B001CF022401BD9DB11d0s0 /dev/dsk/c6t6006016062231B00449925351BD9DB11d0s0 /dev/dsk/c6t6006016062231B00A01632271BD9DB11d0s0 /dev/dsk/c6t6006016062231B00F2344A1C1BD9DB11d0s0 /dev/dsk/c6t6006016062231B0048C112121BD9DB11d0s0 /dev/dsk/c6t6006016062231B004CE643031BD9DB11d0s0 /dev/dsk/c6t6006016062231B004E2E7FF61AD9DB11d0s0 /dev/dsk/c6t6006016062231B008CADB8EB1AD9DB11d0s0 /dev/dsk/c6t6006016062231B00C8C868DF1AD9DB11d0s0 /dev/dsk/c6t6006016062231B009CD37BCF1AD9DB11d0s0 /dev/dsk/c6t6006016062231B00E84C8BC31AD9DB11d0s0 /dev/dsk/c6t6006016062231B0086796DB71AD9DB11d0s0 /dev/dsk/c6t6006016062231B00B2098DA91AD9DB11d0s0 /dev/dsk/c6t6006016062231B00124185971AD9DB11d0s0 /dev/dsk/c6t6006016062231B003E7742871AD9DB11d0s0 /dev/dsk/c6t6006016062231B003C7EFE7A1AD9DB11d0s0 /dev/dsk/c6t6006016062231B00D48C6B711AD9DB11d0s0 /dev/dsk/c6t6006016062231B001C98CA641AD9DB11d0s0 /dev/dsk/c6t6006016062231B0054BE36541AD9DB11d0s0 /dev/dsk/c6t6006016062231B009A650C461AD9DB11d0s0 /dev/dsk/c6t6006016062231B005CBC5D3B1AD9DB11d0s0 /dev/dsk/c6t6006016062231B00201DD62F1AD9DB11d0s0 /dev/dsk/c6t6006016062231B00703483111AD9DB11d0s0 /dev/dsk/c6t6006016062231B00941573031AD9DB11d0s0 /dev/dsk/c6t6006016062231B00862C80F719D9DB11d0s0 /dev/dsk/c6t6006016062231B007E15C7ED19D9DB11d0s0 /dev/dsk/c6t6006016062231B00A07323E419D9DB11d0s0 /dev/dsk/c6t6006016062231B0096F8E0D819D9DB11d0s0 /dev/dsk/c6t6006016062231B00AAD5D3CC19D9DB11d0s0 /dev/dsk/c6t6006016062231B8FCDC319D9DB11d0s0 /dev/dsk/c6t6006016062231BCDE1B719D9DB11d0s0 /dev/dsk/c6t6006016062231B00BC24C8A919D9DB11d0s0 /dev/dsk/c6t6006016062231B008834709E19D9DB11d0s0 /dev/dsk/c6t6006016062231B00BC73BF9019D9DB11d0s0 /dev/dsk/c6t6006016062231B0026B0497919D9DB11d0s0 /dev/dsk/c6t6006016062231B0012E7F56319D9DB11d0s0 /dev/dsk/c6t6006016062231B00BA53C25A19D9DB11d0s0 /dev/dsk/c6t6006016062231B0052622F5119D9DB11d0s0 /dev/dsk/c6t6006016062231B008832394619D9DB11d0s0 /dev/dsk/c6t6006
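For anyone wanting to reproduce this comparison, a rough sketch of the test sequence described above (the device names are placeholders and the dd sizes are illustrative, not the exact ones used):

   # ZFS directly on the LUNs
   zpool create test c6t...d0 c6t...d0 [85 disks in total]
   dd if=/dev/zero of=/test/bigfile bs=1024k count=10240

   # ZFS on top of an SVM stripe over the same s0 slices
   zpool destroy test
   metainit d100 1 85 c6t...d0s0 c6t...d0s0 [85 slices in total]
   zpool create test /dev/md/dsk/d100
   dd if=/dev/zero of=/test/bigfile bs=1024k count=10240
   zpool scrub test
   zpool status -v test     # the CKSUM column shows any checksum errors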
[zfs-discuss] crash during snapshot operations
When I'm trying to do, in kernel, in a zfs ioctl:

1. snapshot destroy PREVIOUS
2. snapshot rename LATEST->PREVIOUS
3. snapshot create LATEST

the code is:

   /* delete previous snapshot */
   zfs_unmount_snap(snap_previous, NULL);
   dmu_objset_destroy(snap_previous);

   /* rename snapshot */
   zfs_unmount_snap(snap_latest, NULL);
   dmu_objset_rename(snap_latest, snap_previous);

   /* create snapshot */
   dmu_objset_snapshot(zc->zc_name, REPLICATE_SNAPSHOT_LATEST, 0);

I get a kernel panic.

MDB:
> ::status
debugging crash dump vmcore.3 (32-bit) from zfs.dev
operating system: 5.11 snv_56 (i86pc)
panic message: BAD TRAP: type=8 (#df Double fault) rp=fec244f8 addr=d5904ffc
dump content: kernel pages only

This happens only when the ZFS filesystem is loaded with I/O operations. (I copy a studio11 folder onto this filesystem.)

MDB ::stack shows nothing, but walking the threads I found:

stack pointer for thread d8ff9e00: d421b028
  d421b04c zio_pop_transform+0x45(d9aba380, d421b090, d421b070, d421b078)
  d421b094 zio_clear_transform_stack+0x23(d9aba380)
  d421b200 zio_done+0x12b(d9aba380)
  d421b21c zio_next_stage+0x66(d9aba380)
  d421b230 zio_checksum_verify+0x17(d9aba380)
  d421b24c zio_next_stage+0x66(d9aba380)
  d421b26c zio_wait_for_children+0x46(d9aba380, 11, d9aba570)
  d421b280 zio_wait_children_done+0x18(d9aba380)
  d421b298 zio_next_stage+0x66(d9aba380)
  d421b2d0 zio_vdev_io_assess+0x11a(d9aba380)
  d421b2e8 zio_next_stage+0x66(d9aba380)
  d421b368 vdev_cache_read+0x157(d9aba380)
  d421b394 vdev_disk_io_start+0x35(d9aba380)
  d421b3a4 vdev_io_start+0x18(d9aba380)
  d421b3d0 zio_vdev_io_start+0x142(d9aba380)
  d421b3e4 zio_next_stage_async+0xac(d9aba380)
  d421b3f4 zio_nowait+0xe(d9aba380)
  d421b424 vdev_mirror_io_start+0x151(deab5cc0)
  d421b450 zio_vdev_io_start+0x14f(deab5cc0)
  d421b460 zio_next_stage+0x66(deab5cc0)
  d421b470 zio_ready+0x124(deab5cc0)
  d421b48c zio_next_stage+0x66(deab5cc0)
  d421b4ac zio_wait_for_children+0x46(deab5cc0, 1, deab5ea8)
  d421b4c0 zio_wait_children_ready+0x18(deab5cc0)
  d421b4d4 zio_next_stage_async+0xac(deab5cc0)
  d421b4e4 zio_nowait+0xe(deab5cc0)
  d421b520 arc_read+0x3cc(d8a2cd00, da9f6ac0, d418e840, f9e55e5c, f9e249b0, d515c010)
  d421b590 dbuf_read_impl+0x11b(d515c010, d8a2cd00, d421b5cc)
  d421b5bc dbuf_read+0xa5(d515c010, d8a2cd00, 2)
  d421b5fc dmu_buf_hold+0x7c(d47cb854, 4, 0, 0, 0, 0)
  d421b654 zap_lockdir+0x38(d47cb854, 4, 0, 0, 1, 1)
  d421b690 zap_lookup+0x23(d47cb854, 4, 0, d421b6e0, 8, 0)
  d421b804 dsl_dir_open_spa+0x10a(da9f6ac0, d8fde000, f9e7378f, d421b85c, d421b860)
  d421b864 dsl_dataset_open_spa+0x2c(0, d8fde000, 1, debe83c0, d421b938)
  d421b88c dsl_dataset_open+0x19(d8fde000, 1, debe83c0, d421b938)
  d421b940 dmu_objset_open+0x2e(d8fde000, 5, 1, d421b970)
  d421b974 dmu_objset_snapshot_one+0x2c(d8fde000, d421b998)
  d421bdb0 dmu_objset_snapshot+0xaf(d8fde000, d4c6a3e8, 0)
  d421c9e8 zfs_ioc_replicate_send+0x1ab(d8fde000)
  d421ce18 zfs_ioc_sendbackup+0x126()
  d421ce40 zfsdev_ioctl+0x100(2d8, 5a1e, 8046cac, 13, d5938650, d421cf78)
  d421ce6c cdev_ioctl+0x2e(2d8, 5a1e, 8046cac, 13, d5938650, d421cf78)
  d421ce94 spec_ioctl+0x65(d6591780, 5a1e, 8046cac, 13, d5938650, d421cf78)
  d421ced4 fop_ioctl+0x27(d6591780, 5a1e, 8046cac, 13, d5938650, d421cf78)
  d421cf84 ioctl+0x151()
  d421cfac sys_sysenter+0x101()

> $r
%cs  = 0x0158        %eax = 0x
%ds  = 0x0160        %ebx = 0xe58abac0
%ss  = 0x0160        %ecx = 0x
%es  = 0x0160        %edx = 0x0018
%fs  = 0x            %esi = 0x
%gs  = 0x01b0        %edi = 0x
%eip = 0xfe8ebd71 kmem_free+0x111
%ebp = 0x
%esp = 0xfec24530
%eflags = 0x00010246
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=
%uesp = 0xd5905000
%trapno = 0x8
%err = 0x0

I was trying to cause the error from the command line:
[EMAIL PROTECTED] ~]# zfs destroy solaris/[EMAIL PROTECTED] ; zfs rename solaris/[EMAIL PROTECTED] solaris/[EMAIL PROTECTED] ; zfs snapshot solaris/[EMAIL PROTECTED]

but without success. Any idea?
[zfs-discuss] ZFS over iSCSI question
Dear all.
I've setup the following scenario:

Galaxy 4200 running OpenSolaris build 59 as iSCSI target; remaining diskspace of the two internal drives with a total of 90GB is used as zpool for the two 32GB volumes "exported" via iSCSI.

The initiator is an up to date Solaris 10 11/06 x86 box using the above mentioned volumes as disks for a local zpool.

I've now started rsync to copy about 1GB of data in several thousand files. During the operation I took the network interface on the iSCSI target down, which resulted in no more disk IO on that server. On the other hand, the client happily dumps data into the ZFS cache, actually completely finishing all of the copy operation.

Now the big question: we plan to use that kind of setup for email or other important services, so what happens if the client crashes while the network is down? Does it mean that all the data in the cache is gone forever?

If so, is this a transport independent problem which can also happen if ZFS used Fibre Channel attached drives instead of iSCSI devices?

Thanks for your help
Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA A5 4D E0 50 35 75 9E ED
[zfs-discuss] ZFS machine to be reinstalled
Hello,

Our Solaris 10 machine needs to be reinstalled. Inside we have 2 HDDs in a striped ZFS pool with 4 filesystems. After Solaris is installed, how can I "mount" or recover the 4 filesystems without losing the existing data?

Thank you very much!
[zfs-discuss] Re: Re: Is there any performance problem with hard
>See fsattr(5)

It was helpful :). Thanks!
Re: [zfs-discuss] ZFS machine to be reinstalled
On 3/23/07, Ionescu Mircea <[EMAIL PROTECTED]> wrote:
> Our Solaris 10 machine needs to be reinstalled. Inside we have 2 HDDs in a
> striped ZFS pool with 4 filesystems. After Solaris is installed, how can I
> "mount" or recover the 4 filesystems without losing the existing data?

Check "zpool import".

--
Regards,
Cyril
Re: [zfs-discuss] ZFS ontop of SVM - CKSUM errors
Hello Robert,

Forget it, silly me. The pool was mounted on one host while the SVM metadevice was created on another host on the same disks at the same time, and both hosts were issuing IOs. Once I corrected that I no longer see CKSUM errors with ZFS on top of SVM, and performance is similar. :)))

I'm still wondering, however, why I'm getting only about 400MB/s of sequential writes but ~700MB/s of sequential reads to a stripe made of 85 FC disks. I would expect to get about 700MB/s in both cases.

--
Best regards,
Robert                          mailto:[EMAIL PROTECTED]
                                http://milek.blogspot.com
Re: [zfs-discuss] ZFS over iSCSI question
Thomas Nau writes:
> Dear all.
> I've setup the following scenario:
>
> Galaxy 4200 running OpenSolaris build 59 as iSCSI target; remaining
> diskspace of the two internal drives with a total of 90GB is used as zpool
> for the two 32GB volumes "exported" via iSCSI.
>
> The initiator is an up to date Solaris 10 11/06 x86 box using the above
> mentioned volumes as disks for a local zpool.
>
> I've now started rsync to copy about 1GB of data in several thousand
> files. During the operation I took the network interface on the iSCSI
> target down, which resulted in no more disk IO on that server. On the other
> hand, the client happily dumps data into the ZFS cache, actually completely
> finishing all of the copy operation.
>
> Now the big question: we plan to use that kind of setup for email or other
> important services, so what happens if the client crashes while the network
> is down? Does it mean that all the data in the cache is gone forever?
>
> If so, is this a transport independent problem which can also happen if
> ZFS used Fibre Channel attached drives instead of iSCSI devices?

I assume the rsync is not issuing fsyncs (and its files are not opened O_DSYNC). If so, rsync just works against the filesystem cache and does not commit the data to disk. You might want to run sync(1M) after a successful rsync.

A larger rsync would presumably have blocked. It's just that the amount of data you needed to rsync fitted in a couple of transaction groups.

-r
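A small illustration of Roch's point (the pool and path names are hypothetical):

   rsync -a /data/mail/ /tank/mail/   # completes against the filesystem cache
   sync                               # schedule the cached writes to disk; see sync(1M)

Whether an application commits its own writes can be checked with truss(1) or DTrace by watching for fsync-style calls, as comes up later in this thread.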
Re: [zfs-discuss] crash during snapshot operations
On Mar 23, 2007, at 6:13 AM, Łukasz wrote:
> When I'm trying to do, in kernel, in a zfs ioctl:
> 1. snapshot destroy PREVIOUS
> 2. snapshot rename LATEST->PREVIOUS
> 3. snapshot create LATEST
>
> the code is:
>    /* delete previous snapshot */
>    zfs_unmount_snap(snap_previous, NULL);
>    dmu_objset_destroy(snap_previous);
>    /* rename snapshot */
>    zfs_unmount_snap(snap_latest, NULL);
>    dmu_objset_rename(snap_latest, snap_previous);
>    /* create snapshot */
>    dmu_objset_snapshot(zc->zc_name, REPLICATE_SNAPSHOT_LATEST, 0);
>
> I get a kernel panic.
>
> MDB:
> > ::status
> debugging crash dump vmcore.3 (32-bit) from zfs.dev
> operating system: 5.11 snv_56 (i86pc)
> panic message: BAD TRAP: type=8 (#df Double fault) rp=fec244f8 addr=d5904ffc
> dump content: kernel pages only

This is most likely due to stack overflow. Your stack is 0xd421cfac - 0xd421b04c = 0t8032 bytes. The PAGESIZE on x86/x64 machines is 4k, and DEFAULTSTKSZ is 8k (2 * PAGESIZE) for 32-bit and 20k (5 * PAGESIZE) for amd64. So you've blown your 8k stack.

This is mostly due to:

6354519 stack overflow in zfs due to zio pipeline

Running on a 64-bit machine would also help.

eric
[zfs-discuss] Re: ZFS machine to be reinstalled
Where the name of the pool is xyz:

  zpool export xyz
  (rebuild the system, staying clear of the pool disks)
  zpool import xyz

Ron Halstead
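If the old system went down without a clean export (as happens later in this thread), the pool can still be recovered after the reinstall; a sketch, with xyz standing in for the real pool name:

  zpool import          # with no arguments: list pools found on the attached disks
  zpool import xyz      # import by name
  zpool import -f xyz   # force it if the pool still looks in use by the old install
  zfs list -r xyz       # the filesystems and their data should reappear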
Re: [zfs-discuss] asize is 300MB smaller than lsize - why?
Robert Milkowski wrote:
> Basically we've implemented a mechanism to replicate a zfs file system,
> implementing a new ioctl based on zfs send|recv. The difference is that we
> sleep() for a specified time (default 5s) and then ask for a new transaction
> group, and if there's one we send it out. More details really soon, I hope.
>
> ps. zdb output sent privately

The smaller file has its first 320MB as a hole, while the larger file is entirely filled in. You can see this from the zdb output (the first number on each line is the offset):

Indirect blocks:
      0 L2   0:115be2400:1200 4000L/1200P F=10192 B=831417
   1400  L1  0:c0028c00:400 4000L/400P F=30 B=831370
   14c4   L0 0:b818:2 2L/2P F=1 B=831367
   14c6   L0 0:b81a:2 2L/2P F=1 B=831367
   ...

vs.

Indirect blocks:
      0 L2   0:ea1a0800:1400 4000L/1400P F=12911 B=831388
      0  L1  0:2553bb400:400 4000L/400P F=128 B=831346
      0   L0 0:25540:2 2L/2P F=1 B=831346
      2   L0 0:25542:2 2L/2P F=1 B=831346
      4   L0 0:25544:2 2L/2P F=1 B=831346
   ...

How it got that way, I couldn't really say without looking at your code. If you are able to reproduce this using OpenSolaris bits, let me know.

--matt
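A hedged way to see the same hole-vs-data difference on a scratch dataset (the names and sizes are made up, and the exact zdb output format varies by build):

   # write 1MB at an offset of 320MB, leaving the first 320MB as a hole
   dd if=/dev/urandom of=/tank/test/sparse bs=1024k oseek=320 count=1
   ls -l /tank/test/sparse    # logical size includes the hole
   du -k /tank/test/sparse    # allocated size is far smaller
   # with enough -d flags, zdb dumps the L0/L1/L2 indirect blocks as above
   zdb -dddddd tank/test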
Re: [zfs-discuss] migration/acl4 problem
> It looks like we're between a rock and a hard place. We want to use ZFS
> for one project because of snapshots and data integrity - both would give
> us considerable advantages over ufs (not to mention filesystem size).
> Unfortunately, this is critical company data and the access control has to
> be exactly right all the time: the default ACLs as implemented in UFS are
> exactly what we need and work perfectly.

The original plan was to allow the inheritance of owner/group/other permissions. Unfortunately, during ARC reviews we were forced to remove that functionality, due to POSIX compliance and security concerns.

We can look into alternatives to provide a way to force the creation of directory trees with a specified set of permissions.

-Mark
Re: [zfs-discuss] C'mon ARC, stay small...
With latest Nevada, setting zfs_arc_max in /etc/system is sufficient. Playing with mdb on a live system is more tricky and is what caused the problem here.

-r

[EMAIL PROTECTED] writes:
> Jim Mauro wrote:
> >
> > All righty...I set c_max to 512MB, c to 512MB, and p to 256MB...
> >
> > > arc::print -tad
> > {
> > ...
> >     c02e29e8 uint64_t size = 0t299008
> >     c02e29f0 uint64_t p = 0t16588228608
> >     c02e29f8 uint64_t c = 0t33176457216
> >     c02e2a00 uint64_t c_min = 0t1070318720
> >     c02e2a08 uint64_t c_max = 0t33176457216
> > ...
> > }
> > > c02e2a08 /Z 0x2000
> > arc+0x48: 0x7b9789000 = 0x2000
> > > c02e29f8 /Z 0x2000
> > arc+0x38: 0x7b9789000 = 0x2000
> > > c02e29f0 /Z 0x1000
> > arc+0x30: 0x3dcbc4800 = 0x1000
> > > arc::print -tad
> > {
> > ...
> >     c02e29e8 uint64_t size = 0t299008
> >     c02e29f0 uint64_t p = 0t268435456       <-- p is 256MB
> >     c02e29f8 uint64_t c = 0t536870912       <-- c is 512MB
> >     c02e2a00 uint64_t c_min = 0t1070318720
> >     c02e2a08 uint64_t c_max = 0t536870912   <-- c_max is 512MB
> > ...
> > }
> >
> > After a few runs of the workload ...
> >
> > > arc::print -d size
> > size = 0t536788992
> >
> > Ah - looks like we're out of the woods. The ARC remains clamped at 512MB.
>
> Is there a way to set these fields using /etc/system?
> Or does this require a new or modified init script to
> run and do the above with each boot?
>
> Darren
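For reference, the /etc/system form Roch mentions looks roughly like this (the 512MB value just mirrors the example above; only Nevada builds recent enough to have the tunable honor it):

  * Cap the ZFS ARC at 512MB (0x20000000 bytes); takes effect at next boot
  set zfs:zfs_arc_max = 0x20000000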
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
> How it got that way, I couldn't really say without looking at your code.

It works like this: in the new ioctl operation zfs_ioc_replicate_send(zfs_cmd_t *zc) we open the filesystem (not a snapshot):

   dmu_objset_open(zc->zc_name, DMU_OST_ANY,
       DS_MODE_STANDARD | DS_MODE_READONLY, &filesystem);

then call the dmu replicate send function (txg is the transaction group number):

   dmu_replicate_send(filesystem, &txg, ...);

There we set max_txg:

   ba.max_txg = (spa_get_dsl(filesystem->os->os_spa))->dp_tx.tx_synced_txg;

and call traverse_dsl_dataset:

   traverse_dsl_dataset(filesystem->os->os_dsl_dataset, *txg,
       ADVANCE_PRE | ADVANCE_HOLES | ADVANCE_DATA | ADVANCE_NOLOCK,
       replicate_cb, &ba);

After traversing, the next txg is returned:

   if (ba.got_data != 0)
       *txg = ba.max_txg + 1;

In replicate_cb we do the same thing backup_cb does, but at the beginning we check the txg:

   /* remember last txg */
   if (bc->bc_blkptr.blk_birth) {
       if (bc->bc_blkptr.blk_birth > ba->max_txg)
           return;
       ba->got_data = 1;
   }

After a 5 second delay we call the ioctl again with the txg returned from the last operation.
Re: [zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Fri, Mar 23, 2007 at 11:31:03AM +0100, Pawel Jakub Dawidek wrote:
>
> Eric, maybe I'm missing something, but why ZFS depend on devids at all?
> As I understand it, devid is something that never change for a block
> device, eg. disk serial number, but on the other hand it is optional, so
> we can rely on the fact it's always there (I mean for all block devices
> we use).
>
> Why we simply not forget about devids and just focus on on-disk metadata
> to detect pool components?
>
> The only reason I see is performance. This is probably why
> /etc/zfs/zpool.cache is used as well.
>
> In FreeBSD we have the GEOM infrastructure for storage. Each storage
> device (disk, partition, mirror, etc.) is simply a GEOM provider. If a
> GEOM provider appears (eg. disk is inserted, partition is configured)
> all interested parties are informed about this and can 'taste' the
> provider by reading metadata specific for them. The same when a provider
> goes away - all interested parties are informed and can react
> accordingly.
>
> We don't see any performance problems related to the fact that each disk
> that appears is read by many "GEOM classes".

We do use the on-disk metadata for verification purposes, but we can't open the device based on the metadata. We don't have a corresponding interface in Solaris, so there is no way to say "open the device with this particular on-disk data". The devid is also unique to the device (it's based on manufacturer/model/serial number), so that we can uniquely identify devices for fault management purposes.

The world of hotplug and device configuration in Solaris is quite complicated. Part of my time spent on this work has been just writing down the existing semantics. A scheme like that in FreeBSD would be nice, but unlikely to appear given the existing complexity. As part of the I/O retire work we will likely be introducing device contracts, which is a step in the right direction, but it's a very long road.

Thanks for sharing the details on FreeBSD, it's quite interesting. Since the majority of this work is Solaris-specific, I'll be interested to see how other platforms deal with this type of reconfiguration.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Re: [zfs-discuss] migration/acl4 problem
On 3/23/07, Mark Shellenbaum <[EMAIL PROTECTED]> wrote:
> The original plan was to allow the inheritance of owner/group/other
> permissions. Unfortunately, during ARC reviews we were forced to remove
> that functionality, due to POSIX compliance and security concerns.

What exactly is the POSIX compliance requirement here?

(It's also not clear to me how *not* allowing control of permissions helps security in any way.)

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Re: [zfs-discuss] migration/acl4 problem
Peter Tribble wrote:
> On 3/23/07, Mark Shellenbaum <[EMAIL PROTECTED]> wrote:
> > The original plan was to allow the inheritance of owner/group/other
> > permissions. Unfortunately, during ARC reviews we were forced to remove
> > that functionality, due to POSIX compliance and security concerns.
>
> What exactly is the POSIX compliance requirement here?

The ignoring of a user's umask.

> (It's also not clear to me how *not* allowing control of permissions
> helps security in any way.)
[zfs-discuss] Re: crash during snapshot operations
Thanks for the advice. I removed my buffers snap_previous and snap_latest and it helped. I'm using zc->value as the buffer.
Re: [zfs-discuss] ZFS over iSCSI question
On Fri, 23 Mar 2007, Roch - PAE wrote:
> I assume the rsync is not issuing fsyncs (and its files are not opened
> O_DSYNC). If so, rsync just works against the filesystem cache and does
> not commit the data to disk. You might want to run sync(1M) after a
> successful rsync.
>
> A larger rsync would presumably have blocked. It's just that the amount
> of data you needed to rsync fitted in a couple of transaction groups.

Thanks for the hints, but this would make our worst nightmares come true. At least they could, because it means that we would have to check every application handling critical data, and I think it's not the app's responsibility. Up to a certain amount, like a database transaction, but not any further. There's always a time window where data might be cached in memory, but I would argue that caching several GB of data - in our case written data, with thousands of files - in unbuffered memory circumvents all the built-in reliability of ZFS.

I'm in a way still hoping that it's an iSCSI related problem, as detecting dead hosts in a network can be a non-trivial problem and it takes quite some time for TCP to time out and inform the upper layers. Just a guess/hope here that FC-AL, etc. do better in this case.

Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA A5 4D E0 50 35 75 9E ED
[zfs-discuss] Re: ZFS machine to be reinstalled
Thank you all! The machine crashed unexpectedly, so no export was possible. Anyway, just using "zpool import pool_name" helped me recover everything. Thanks again for your help!
Re: [zfs-discuss] 6410 expansion shelf
On March 23, 2007 5:38:20 PM +0800 Wee Yeh Tan <[EMAIL PROTECTED]> wrote:
> I should be able to reply to you next Tuesday -- my 6140 SATA expansion
> tray is due to arrive. Meanwhile, what kind of problem do you have with
> the 3511?

I'm not sure that it had anything to do with the raid controller being present or not. The initial configuration (5x250 original sata disks) worked well. Changing the disks to 750gb disks worked well. Then I had to get 7 more drive carriers, and then some of the slots didn't work -- disks would not spin up.

The 7 addt'l carriers had different electronics than the original 5. Just a hardware revision, I suppose. Oh, and they were "dot hill" labelled instead of Sun labelled (dot hill is the OEM for the 3510/3511). When I was able to replace the 7 new carriers with ones that looked like the original 5 (same electronics and Sun branding), I had better luck, but there were still one or two slots that were SOL. Swapping hardware around, I identified that it was definitely the slot and not a carrier or drive problem. But maybe a bad carrier "broke" the slot itself. I dunno!

I was tempted to just use the array with the 10 or 11 slots that worked, since I got it for a very good price, but I was worried that there'd be more failures in the future, and the cost savings wasn't worth even the potential hassle of having to deal with that.

-frank
Re: [zfs-discuss] ZFS over iSCSI question
On March 23, 2007 6:51:10 PM +0100 Thomas Nau <[EMAIL PROTECTED]> wrote:
> Thanks for the hints, but this would make our worst nightmares come true.
> At least they could, because it means that we would have to check every
> application handling critical data, and I think it's not the app's
> responsibility.

I'd tend to disagree with that. POSIX/SUS does not guarantee data makes it to disk until you do an fsync() (or open the file with the right flags, or other techniques). If an application REQUIRES that data get to disk, it really MUST DTRT.

> Up to a certain amount, like a database transaction, but not any further.
> There's always a time window where data might be cached in memory, but I
> would argue that caching several GB of data - in our case written data,
> with thousands of files - in unbuffered memory circumvents all the
> built-in reliability of ZFS.
>
> I'm in a way still hoping that it's an iSCSI related problem, as detecting
> dead hosts in a network can be a non-trivial problem and it takes quite
> some time for TCP to time out and inform the upper layers. Just a
> guess/hope here that FC-AL, etc. do better in this case.

iscsi doesn't use TCP, does it? Anyway, the problem is really transport independent.

-frank
[zfs-discuss] gzip compression support
I recently integrated this fix into ON:

6536606 gzip compression for ZFS

With this, ZFS now supports gzip compression. To enable gzip compression just set the 'compression' property to 'gzip' (or 'gzip-N' where N=1..9). Existing pools will need to upgrade in order to use this feature, and, yes, this is the second ZFS version number update this week. Recall that once you've upgraded a pool, older software will no longer be able to access it, regardless of whether you're using the gzip compression algorithm.

I did some very simple tests to look at relative size and time requirements:

http://blogs.sun.com/ahl/entry/gzip_for_zfs_update

I've also asked Roch Bourbonnais and Richard Elling to do some more extensive tests.

Adam

From zfs(1M):

     compression=on | off | lzjb | gzip | gzip-N

         Controls the compression algorithm used for this dataset. The
         "lzjb" compression algorithm is optimized for performance while
         providing decent data compression. Setting compression to "on"
         uses the "lzjb" compression algorithm.

         The "gzip" compression algorithm uses the same compression as
         the gzip(1) command. You can specify the gzip level by using
         the value "gzip-N", where N is an integer from 1 (fastest) to
         9 (best compression ratio). Currently, "gzip" is equivalent to
         "gzip-6" (which is also the default for gzip(1)).

         This property can also be referred to by its shortened column
         name "compress".

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
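A quick usage sketch of the feature described above (the pool and dataset names are placeholders):

  zpool upgrade tank                       # bump the pool to the new on-disk version
  zfs set compression=gzip tank/docs       # or gzip-1 .. gzip-9
  zfs get compression,compressratio tank/docs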
Re: [zfs-discuss] gzip compression support
On Fri, 23 Mar 2007, Adam Leventhal wrote:
> I recently integrated this fix into ON:
>
> 6536606 gzip compression for ZFS

Cool! Can you recall into which build it went?

--
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member
CEO, My Online Home Inventory
Voice: +1 (250) 979-1638
URLs: http://www.rite-group.com/rich
      http://www.myonlinehomeinventory.com
Re: [zfs-discuss] gzip compression support
On Fri, Mar 23, 2007 at 11:41:21AM -0700, Rich Teer wrote:
> > I recently integrated this fix into ON:
> >
> > 6536606 gzip compression for ZFS
>
> Cool! Can you recall into which build it went?

I put it back yesterday, so it will be in build 62.

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Re: [zfs-discuss] migration/acl4 problem
>Peter Tribble wrote:
>> On 3/23/07, Mark Shellenbaum <[EMAIL PROTECTED]> wrote:
>>>
>>> The original plan was to allow the inheritance of owner/group/other
>>> permissions. Unfortunately, during ARC reviews we were forced to remove
>>> that functionality, due to POSIX compliance and security concerns.
>>
>> What exactly is the POSIX compliance requirement here?
>>
>The ignoring of a user's umask.

Which is what made UFS ACLs useless until we "fixed" it to break POSIX semantics.

(I think we should really have some form of uacl which, when set, forces the umask to 0 but which is used as the default acl when there is no acl present.)

Casper
[zfs-discuss] Re: /tmp on ZFS?
Well, I am aware that /tmp can be mounted on swap as tmpfs and that this is really fast, as almost all writes go straight to memory, but this is of little to no value to the server in question.

The server in question is running 2 enterprise third-party applications. No compilers are installed... in fact it's a super minimal Solaris 10 core install (06/06). The reasoning behind moving /tmp onto ZFS was to protect against the occasional misdirected administrator who accidentally fills up /tmp while transferring a file or what have you. As I said, it's a production server, so we are doing our best to insulate it from inadvertent errors.

When this server was built it was built with 8GB of swap on a dedicated slice. /tmp was left on / (root) and later mounted on a zpool. Is this dangerous given the server profile? Am I missing something here?

Some other Sun engineers say that /tmp "is" swap and vice versa on Solaris, but my understanding is that my dedicated swap slice "is" swap and is not directly accessible. /tmp is just another filesystem that happens to be mounted on a zpool with a quota, so there is no fear of user/admin error. Based on how the system was set up, is this a correct assertion?
Re: [zfs-discuss] ZFS over iSCSI question
>I'd tend to disagree with that. POSIX/SUS does not guarantee data makes
>it to disk until you do an fsync() (or open the file with the right flags,
>or other techniques). If an application REQUIRES that data get to disk,
>it really MUST DTRT.

Indeed; want your data safe? Use:

	fflush(fp);
	fsync(fileno(fp));
	fclose(fp);

and check errors. It's remarkable how often people get the above sequence wrong and only do something like:

	fsync(fileno(fp));
	fclose(fp);

Casper
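As a concrete sketch of Casper's sequence with the error checking he mentions (the file name and the exit-on-error style are just illustrative):

	#include <stdio.h>
	#include <unistd.h>

	int
	main(void)
	{
		FILE *fp = fopen("/tank/mail/important.dat", "w");

		if (fp == NULL) {
			perror("fopen");
			return (1);
		}
		if (fputs("critical record\n", fp) == EOF) {
			perror("fputs");
			return (1);
		}
		/* push stdio's user-space buffer into the kernel */
		if (fflush(fp) != 0) {
			perror("fflush");
			return (1);
		}
		/* ask the kernel to commit the data to stable storage */
		if (fsync(fileno(fp)) != 0) {
			perror("fsync");
			return (1);
		}
		if (fclose(fp) != 0) {
			perror("fclose");
			return (1);
		}
		return (0);
	}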
Re: [zfs-discuss] Re: /tmp on ZFS?
On Fri, 23 Mar 2007, Matt B wrote:
> The server in question is running 2 enterprise third-party applications.
> No compilers are installed... in fact it's a super minimal Solaris 10
> core install (06/06). The reasoning behind moving /tmp onto ZFS was to
> protect against the occasional misdirected administrator who accidentally
> fills up /tmp while transferring a file or what have you. As I said, it's
> a production server, so we are doing our best to insulate it from
> inadvertent errors.

In that case, I think the easiest approach would be to use the "size" tmpfs mount option, which limits the amount of VM /tmp can use.

> Is this dangerous given the server profile? Am I missing something?

Dangerous? I think not. But most likely suboptimal.

--
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member
CEO, My Online Home Inventory
Voice: +1 (250) 979-1638
URLs: http://www.rite-group.com/rich
      http://www.myonlinehomeinventory.com
Re: [zfs-discuss] Re: /tmp on ZFS?
On Fri, Mar 23, 2007 at 11:57:40AM -0700, Matt B wrote:
>
> The server in question is running 2 enterprise third-party applications.
> No compilers are installed... in fact it's a super minimal Solaris 10
> core install (06/06). The reasoning behind moving /tmp onto ZFS was to
> protect against the occasional misdirected administrator who accidentally
> fills up /tmp while transferring a file or what have you. As I said, it's
> a production server, so we are doing our best to insulate it from
> inadvertent errors.

You can solve that problem by putting a size limit on /tmp. For example, we do this in /etc/vfstab:

swap    -       /tmp    tmpfs   -       yes     size=500m

The filesystem will still fill up, but you won't run out of swap space.

--
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
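To confirm the cap once /tmp is remounted, something like the following should do (the exact output formatting is from memory, so treat it as approximate):

  mount -v | grep /tmp    # the options field should include size=500m
  df -h /tmp              # reported total size should be capped accordingly
  swap -s                 # overall swap usage, including what tmpfs is consuming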
Re: [zfs-discuss] migration/acl4 problem
On 3/23/07, Mark Shellenbaum <[EMAIL PROTECTED]> wrote:
> Peter Tribble wrote:
> > What exactly is the POSIX compliance requirement here?
>
> The ignoring of a user's umask.

Where in POSIX does it specify the interaction of ACLs and a user's umask?

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Re: [zfs-discuss] Re: Re: Proposal: ZFS hotplug supportandautoconfiguration
Anton B. Rang wrote:
> Is this because C would already have a devid? If I insert an unlabeled
> disk, what happens? What if B takes five minutes to spin up? If it never
> does?

N.B. you get different error messages from the disk. If a disk is not ready then it will return a not-ready code, and the sd driver will record this and patiently retry. The reason I know this in some detail is scar #523, which was inflicted when we realized that some/many/most RAID arrays don't do this. The difference is that the JBOD disk electronics start very quickly, perhaps a few seconds after power-on. A RAID array can take several minutes (or more) to get to a state where it will reply to any request.

So, if you do not perform a full, simultaneous power-on test for your entire (cluster) system, then you may not hit the problem that the slow storage start makes Solaris think that the device doesn't exist -- which can be a bad thing for highly available services. Yes, this is yet another systems engineering problem.

-- richard
Re: [zfs-discuss] Re: Re: Proposal: ZFS hotplug supportandautoconfiguration
Workaround below...

Richard Elling wrote:
> Anton B. Rang wrote:
> > Is this because C would already have a devid? If I insert an unlabeled
> > disk, what happens? What if B takes five minutes to spin up? If it
> > never does?
>
> N.B. you get different error messages from the disk. If a disk is not
> ready then it will return a not-ready code, and the sd driver will
> record this and patiently retry. The reason I know this in some detail
> is scar #523, which was inflicted when we realized that some/many/most
> RAID arrays don't do this. The difference is that the JBOD disk
> electronics start very quickly, perhaps a few seconds after power-on. A
> RAID array can take several minutes (or more) to get to a state where it
> will reply to any request.
>
> So, if you do not perform a full, simultaneous power-on test for your
> entire (cluster) system, then you may not hit the problem that the slow
> storage start makes Solaris think that the device doesn't exist -- which
> can be a bad thing for highly available services. Yes, this is yet
> another systems engineering problem.

Sorry, it was rude of me not to include the workaround. We put a delay in the SPARC OBP to slow down the power-on boot time of the servers to match the attached storage. While this worked, it is butt-ugly. You can do this with GRUB, too.

-- richard
Re: [zfs-discuss] migration/acl4 problem
Peter Tribble wrote:
> On 3/23/07, Mark Shellenbaum <[EMAIL PROTECTED]> wrote:
> > Peter Tribble wrote:
> > > What exactly is the POSIX compliance requirement here?
> >
> > The ignoring of a user's umask.
>
> Where in POSIX does it specify the interaction of ACLs and a user's umask?

Let me try and summarize the discussion that took place a few years ago.

The POSIX ACL draft stated (p 269): "The process umask is the user's way of specifying security for newly created objects. It was a goal to preserve this behavior //unless it is specifically overridden in a default ACL//."

However, that is a withdrawn specification, and Solaris is required to conform to a set of "approved standards". The main POSIX specification doesn't say anything specific about ACLs, but rather about alternate and additional access control methods. POSIX gives clear rules for file access permissions based on umask, file mode bits, additional access control mechanisms, and alternate access control mechanisms. Most of this is discussed in section 2.3 "General Concepts".

Since there is nothing in the spec that states that we *can* ignore the umask, we are therefore forced to honor it, at least until we find a way to work around this. I will open an RFE to look into alternative ways to address this issue.

-Mark
Re: [zfs-discuss] ZFS over iSCSI question
Thomas Nau wrote:
> Dear all.
> I've setup the following scenario:
>
> Galaxy 4200 running OpenSolaris build 59 as iSCSI target; remaining
> diskspace of the two internal drives with a total of 90GB is used as
> zpool for the two 32GB volumes "exported" via iSCSI.
>
> The initiator is an up to date Solaris 10 11/06 x86 box using the above
> mentioned volumes as disks for a local zpool.

Like this?
  disk--zpool--zvol--iscsitarget--network--iscsiclient--zpool--filesystem--app

> I'm in a way still hoping that it's an iSCSI related problem, as detecting
> dead hosts in a network can be a non-trivial problem and it takes quite
> some time for TCP to time out and inform the upper layers. Just a
> guess/hope here that FC-AL, etc. do better in this case.

Actually, this is why NFS was invented. Prior to NFS we had something like:
  disk--raw--ndserver--network--ndclient--filesystem--app

The problem is that the failure modes are very different for networks and presumably reliable local disk connections. Hence NFS has a lot of error handling code and provides well understood error handling semantics. Maybe what you really want is NFS?

-- richard
[zfs-discuss] Re: ZFS layout for 10 disk?
> Consider that 18GByte disks are old and their failure rate will
> increase dramatically over the next few years.

I guess that's why I am asking about raidz and mirrors, not just creating a huge stripe out of them.

> Do something to have redundancy. If raidz2 works for your workload,
> I'd go with that.

Well, I think so; the filesystem is currently on a raidz of three disks with no complaints.

Food for thought: how do you fit SATA disks into a SCSI array?

More food for thought: so how do these two 500GB HDDs work, then? Do I just leave them on the side and the magic of iSCSI allows them to work and be shared between boxes?

Sorry for the gitty response, but I am in the UK (see my details), so this 160-pound idea of yours is not 160 pounds. It is the cost of the HDDs plus a new box that can take them, so it's a nice cheap option then :)

> BTW, I was just at Fry's, new 500 GByte Seagate drives are $180.
> Prices for new disks tend to approach $150 (USD) after which they
> are replaced by larger drives and the inventory is price reduced
> until gone. A 2-new disk mirror will be more reliable than any
> reasonable combination of 5-year old disks. Food for thought.
> -- richard
[zfs-discuss] Re: Re: ZFS layout for 10 disk?
Just to clarify:

pool1 -> 5 disk raidz2
pool2 -> 4 disk raid 10
spare for both pools

Is that correct?
[zfs-discuss] Re: Re: /tmp on ZFS?
Ok, so you are suggesting that I simply mount /tmp as tmpfs on my existing 8GB swap slice and then put the VM limit on /tmp? Will that limit only affect users writing data to /tmp, or will it also affect the system's use of swap?
Re: [zfs-discuss] ZFS ontop of SVM - CKSUM errors
Robert Milkowski wrote:
> Hello Robert,
>
> Forget it, silly me. The pool was mounted on one host while the SVM
> metadevice was created on another host on the same disks at the same
> time, and both hosts were issuing IOs. Once I corrected that I no longer
> see CKSUM errors with ZFS on top of SVM, and performance is similar. :)))

Smiles, because ZFS detected the corruption :-)

-- richard
[zfs-discuss] Re: Re: /tmp on ZFS?
For reference... here is my disk layout currently (one disk of two, but both are identical). s4 is for the MetaDB, s5 is dedicated to ZFS.

partition> print
Current partition table (original):
Total disk cylinders available: 8921 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       1 -  765        5.86GB    (765/0/0)   12289725
  1       swap    wu     766 - 1785        7.81GB    (1020/0/0)  16386300
  2     backup    wm       0 - 8920       68.34GB    (8921/0/0) 143315865
  3        var    wm    1786 - 2550        5.86GB    (765/0/0)   12289725
  4 unassigned    wm    2551 - 2557       54.91MB    (7/0/0)       112455
  5 unassigned    wm    2558 - 8824       48.01GB    (6267/0/0) 100679355
  6 unassigned    wm       0               0         (0/0/0)            0
  7 unassigned    wm       0               0         (0/0/0)            0
  8       boot    wu       0 -    0        7.84MB    (1/0/0)        16065
  9 unassigned    wm       0               0         (0/0/0)            0

-- df output --

df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0       6050982 1172802 4817671    20%    /
/devices                   0       0       0     0%    /devices
ctfs                       0       0       0     0%    /system/contract
proc                       0       0       0     0%    /proc
mnttab                     0       0       0     0%    /etc/mnttab
swap                 9149740     436 9149304     1%    /etc/svc/volatile
objfs                      0       0       0     0%    /system/object
/usr/lib/libc/libc_hwcap2.so.1
                     6050982 1172802 4817671    20%    /lib/libc.so.1
fd                         0       0       0     0%    /dev/fd
/dev/md/dsk/d3       6050982   43303 5947170     1%    /var
swap                 9149312       8 9149304     1%    /var/run
zpool/home           4194304      91 4194212     1%    /home
zpool/data          49545216 3799227 45745635     8%    /data
zpool/tmp           49545216      55 45745635     1%    /tmp
Re: [zfs-discuss] Re: Re: /tmp on ZFS?
On Fri, 23 Mar 2007, Matt B wrote:
> Ok, so you are suggesting that I simply mount /tmp as tmpfs on my
> existing 8GB swap slice and then put the VM limit on /tmp? Will that

Yes.

> limit only affect users writing data to /tmp, or will it also affect the
> system's use of swap?

Well, they'd potentially be sharing the slice, so yes, that's possible. If your (say) 1GB /tmp becomes full, only 7GB will remain for paging. However, if /tmp is empty, the whole 8GB will be available for paging.

--
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member
CEO, My Online Home Inventory
Voice: +1 (250) 979-1638
URLs: http://www.rite-group.com/rich
      http://www.myonlinehomeinventory.com
[zfs-discuss] Re: Re: Re: /tmp on ZFS?
Ok, since I already have an 8GB swap slice I'd like to use, what would be the best way of setting up /tmp on this existing swap slice as tmpfs and then applying the 1GB quota limit? I know how to get rid of the zpool/tmp filesystem in ZFS, but I'm not sure how to actually get to the above in a post-install scenario with existing raw swap.

Thanks
Re: [zfs-discuss] Re: Re: Re: /tmp on ZFS?
On Fri, 23 Mar 2007, Matt B wrote:
> Ok, since I already have an 8GB swap slice I'd like to use, what would
> be the best way of setting up /tmp on this existing swap slice as tmpfs
> and then applying the 1GB quota limit?

Have a line similar to the following in your /etc/vfstab:

swap    -       /tmp    tmpfs   -       yes     size=1024m

--
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member
CEO, My Online Home Inventory
Voice: +1 (250) 979-1638
URLs: http://www.rite-group.com/rich
      http://www.myonlinehomeinventory.com
[zfs-discuss] Re: Re: Re: Re: /tmp on ZFS?
And just doing this will automatically target my /tmp at my 8GB swap slice on s1, as well as putting the quota in place?
Re: [zfs-discuss] Re: Re: Re: Re: /tmp on ZFS?
On Fri, 23 Mar 2007, Matt B wrote:
> And just doing this will automatically target my /tmp at my 8GB swap
> slice on s1, as well as putting the quota in place?

After a reboot, yes.

--
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member
CEO, My Online Home Inventory
Voice: +1 (250) 979-1638
URLs: http://www.rite-group.com/rich
      http://www.myonlinehomeinventory.com
[zfs-discuss] Re: Re: Re: Re: /tmp on ZFS?
Oh, one other thing...s1 (8GB swap) is part of an SVM mirror (on d1) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Re: /tmp on ZFS?
On Fri, 23 Mar 2007, Matt B wrote: > Oh, one other thing...s1 (8GB swap) is part of an SVM mirror (on d1) That's not relevant in this case. -- Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member CEO, My Online Home Inventory Voice: +1 (250) 979-1638 URLs: http://www.rite-group.com/rich http://www.myonlinehomeinventory.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Re: Re: Re: Re: /tmp on ZFS?
Worked great. Thanks This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: ZFS layout for 10 disk?
I'd take your 10 data disks and make a single raidz2 stripe. You can sustain two disk failures before losing data, and presumably you'd replace the failed disks before that was likely to happen. If you're very concerned about failures, I'd have a single 9-wide raidz2 stripe with a hot spare. Adam On Fri, Mar 23, 2007 at 01:44:06PM -0700, John-Paul Drawneek wrote: > Just to clarify > > pool1 -> 5 disk raidz2 > pool2 -> 4 disk raid 10 > > spare for both pools > > Is that correct? > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
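As a rough sketch of that second layout (the device names here are hypothetical; substitute your own targets):

    # nine disks in one raidz2 top-level vdev, plus one hot spare
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
        c1t5d0 c1t6d0 c1t7d0 c1t8d0 spare c1t9d0

    # verify the layout and the spare
    zpool status tank

You keep raidz2's double-parity protection, and the spare can be pulled in when a disk faults, shortening the window in which you run with reduced redundancy.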
Re: [zfs-discuss] ZFS over iSCSI question
Dear Fran & Casper I'd tend to disagree with that. POSIX/SUS does not guarantee data makes it to disk until you do an fsync() (or open the file with the right flags, or other techniques). If an application REQUIRES that data get to disk, it really MUST DTRT. Indeed; want your data safe? Use: fflush(fp); fsync(fileno(fp)); fclose(fp); and check errors. (It's remarkable how often people get the above sequence wrong and only do something like fsync(fileno(fp)); fclose(fp); Thanks for clarifying! Seems I really need to check the apps with truss or dtrace to see if they use that sequence. Allow me one more question: why is fflush() required prior to fsync()? Putting all pieces together this means that if the app doesn't do it it suffered from the problem with UFS anyway just with typically smaller caches, right? Thanks again Thomas - GPG fingerprint: B1 EE D2 39 2C 82 26 DA A5 4D E0 50 35 75 9E ED ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] gzip compression support
snv_62

On Fri, 23 Mar 2007, Rich Teer wrote:
> On Fri, 23 Mar 2007, Adam Leventhal wrote:
> > I recently integrated this fix into ON: 6536606 gzip compression for ZFS
>
> Cool! Can you recall into which build it went?

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
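For anyone wanting to try it once they are on a build with that fix, a quick sketch (pool and dataset names are made up):

    zfs set compression=gzip tank/data        # enable gzip compression for new writes
    zfs get compression,compressratio tank/data

Existing blocks stay as they were written; only data written after the property is set gets gzip-compressed.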
Re: [zfs-discuss] ZFS over iSCSI question
Richard,

> Like this?
>
>     disk--zpool--zvol--iscsitarget--network--iscsiclient--zpool--filesystem--app

Exactly. I'm in a way still hoping that it's an iSCSI-related problem, as detecting dead hosts in a network can be a non-trivial problem and it takes quite some time for TCP to time out and inform the upper layers. Just a guess/hope here that FC-AL, ... do better in this case.

> Actually, this is why NFS was invented. Prior to NFS we had something like:
>
>     disk--raw--ndserver--network--ndclient--filesystem--app

The problem is that our NFS, Mail, DB and other servers use mirrored disks located in different buildings on campus. Currently we use FCAL devices and recently switched from UFS to ZFS. The drawback with FCAL is that you always need to have a second infrastructure (not the real problem) but with different components. Having all ethernet would be much easier.

> The problem is that the failure modes are very different for networks and
> presumably reliable local disk connections. Hence NFS has a lot of error
> handling code and provides well understood error handling semantics.
> Maybe what you really want is NFS?

We thought about using NFS as a backend for as many applications as possible, but we need to have redundancy for the fileserver itself too.

Thomas - GPG fingerprint: B1 EE D2 39 2C 82 26 DA A5 4D E0 50 35 75 9E ED ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS over iSCSI question
> Thanks for clarifying! Seems I really need to check the apps with truss or
> dtrace to see if they use that sequence. Allow me one more question: why
> is fflush() required prior to fsync()?

When you use stdio, you need to make sure the data is in the system buffers prior to calling fsync(); otherwise fclose() will write the rest of the data afterwards, and that data is not sync'ed. (In S10 I fixed this for the /etc/*_* driver files; they are generally under 8K and therefore never written to disk before being fsync'ed if not preceded by fflush().)

Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Łukasz wrote:
> > How it got that way, I couldn't really say without looking at your code.
>
> It works like this: ... we set max_txg
>
>     ba.max_txg = (spa_get_dsl(filesystem->os->os_spa))->dp_tx.tx_synced_txg;

So, how do you send the initial stream? Presumably you need to do it with ba.max_txg = 0? If, say, the first 320MB were written before your first ba.max_txg, then you wouldn't be sending that data, thus explaining the behavior you're seeing.

It seems to me that your algorithm is fundamentally flawed -- if the filesystem is changing, it will not result in a consistent (from the ZPL's point of view) filesystem. For example: There are two directories, A and B. You last sent txg 10. In txg 13, a file is renamed from directory A to directory B. It is now txg 15, and you begin traversing to do a send, from txg 10 -> 15. While that's in progress, a new file is created in directory A, and synced out in txg 16. When you visit directory A, you see that its birth time is 16 > 15, so you don't send it. When you visit directory B, you see that its birth time is 13 <= 15, so you send it. Now the other side has two links to the file, when it should have one.

Given that you don't actually have the data from txg 15 (because you didn't take a snapshot), I don't see how you could make this work. (Also FYI, traversing changing filesystems in this way will almost certainly break once we rewrite as part of the pool space reduction work.)

--matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] mirror question
If I create a mirror, presumably if possible I use two or more identically sized devices, since it can only be as large as the smallest. However, if later I want to replace a disk with a larger one, and detach the mirror (and anything else on the disk), replace the disk (and if applicable repartition it), since it _is_ a larger disk (and/or the partitions will likely be larger since they mustn't be smaller, and blocks per cylinder will likely differ, and partitions are on cylinder boundaries), once I reattach everything, I'll now have two different sized devices in the mirror. So far, the mirror is still the original size. But what if I later replace the other disks with ones identical to the first one I replaced? With all the devices within the mirror now the larger size, will the mirror and the zpool of which it is a part expand? And if that won't happen automatically, can it (without inordinate trickery, and online, i.e. without backup and restore) be forced to do so? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mirror question
Yes, this is supported now. Replacing one half of a mirror with a larger device, letting it resilver, and then replacing the other half does indeed get you a larger mirror. I believe this is described somewhere but I can't remember where now.

Neil.

Richard L. Hamilton wrote On 03/23/07 20:45:
> If I create a mirror, presumably if possible I use two or more identically sized devices, since it can only be as large as the smallest. However, if later I want to replace a disk with a larger one, and detach the mirror (and anything else on the disk), replace the disk (and if applicable repartition it), since it _is_ a larger disk (and/or the partitions will likely be larger since they mustn't be smaller, and blocks per cylinder will likely differ, and partitions are on cylinder boundaries), once I reattach everything, I'll now have two different sized devices in the mirror. So far, the mirror is still the original size. But what if I later replace the other disks with ones identical to the first one I replaced? With all the devices within the mirror now the larger size, will the mirror and the zpool of which it is a part expand? And if that won't happen automatically, can it (without inordinate trickery, and online, i.e. without backup and restore) be forced to do so?

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
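For the archives, a rough sketch of that procedure (pool and device names are hypothetical):

    # tank is a two-way mirror of c1t0d0 and c1t1d0 (the smaller disks)
    zpool replace tank c1t0d0 c2t0d0    # swap in the first larger disk
    zpool status tank                   # wait for the resilver to complete
    zpool replace tank c1t1d0 c2t1d0    # then swap in the second one
    zpool status tank                   # wait for this resilver as well
    zpool list tank                     # the pool should now report the larger size

If the extra space doesn't show up immediately, an export/import of the pool has been reported to make it visible; either way, no backup and restore is needed.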
[zfs-discuss] Backup of ZFS Filesystem with ACL 4
Hi guys! Please share your experience on how to back up ZFS with ACLs using commercially available backup software. Has anyone tested backup of ZFS with ACLs using Tivoli (TSM)? Thanks, Ayaz This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS over iSCSI question
On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote: > >I'm in a way still hoping that it's a iSCSI related Problem as detecting > >dead hosts in a network can be a non trivial problem and it takes quite > >some time for TCP to timeout and inform the upper layers. Just a > >guess/hope here that FC-AL, ... do better in this case > > iscsi doesn't use TCP, does it? Anyway, the problem is really transport > independent. It does use TCP. Were you thinking UDP? Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss