Re: [zfs-discuss] Wrong rpool used after reinstall!
On Thu, Aug 04, 2011 at 03:52:39AM -0700, Stuart James Whitefish wrote:
> Jim wrote:
> >> But I may be wrong, and anyway the single user shell in the u9 DVD also
> >> panics when I try to import tank, so maybe that won't help.
>
> Ian wrote:
> > Put your old drive in a USB enclosure and connect it
> > to another system in order to read back the data.
>
> Given that update 9 can't import the pool, is this really worth trying?
> I would have to buy the enclosures; if I had them already I would have
> tried it in desperation.
>
> Jim wrote:
> > > I have only 4 SATA ports on this Intel box, so I have to keep pulling
> > > cables to be able to boot from a DVD, and then I won't have all my
> > > drives available. I cannot move these drives to any other box because
> > > they are consumer drives and my servers all have ultras.
>
> Ian wrote:
> > Most modern boards will boot from a live USB stick.
>
> True, but I haven't found a way to get an ISO onto a USB stick that my
> system can boot from. I was using dd to copy the ISO to the USB drive.
> Is there some other way?

Maybe give http://unetbootin.sourceforge.net/ a try.

Bill

> This is really frustrating. I haven't had any problems with Linux
> filesystems, but I heard ZFS was safer. It's really ironic that I lost
> access to so much data after moving it to ZFS. Isn't there any way to
> get it back on my newly installed U8 system? If I disconnect this pool,
> the system starts fine. Otherwise, my questions above in my summary
> post might be key to getting this working.
>
> Thanks,
> Jim
[zfs-discuss] ZFS web admin - No items found.
Hi experts,

I installed Solaris 10 06/06 x86 on VMware 5.5 and administered zfs both
from the command line and from the web console; all was good. The web
admin is more convenient -- I don't need to type commands.

But after my computer lost power and restarted, I have a problem with
the zfs web admin (https://hostname:6789/zfs). When I try to create a
new storage pool from the web interface, it always shows "No items
found", even though there are in fact 10 hard disks available. I can
still use the zpool/zfs command line to create new pools, file systems
and volumes; the command-line way works quickly and correctly.

I have tried restarting the service (smcwebserver), with no effect.

Has anyone seen this? Is it a bug?

Regards,
Bill
[zfs-discuss] Re: ZFS web admin - No items found.
When I run the command, it prompts:

# /usr/lib/zfs/availdevs -d
Segmentation Fault - core dumped
[zfs-discuss] Re: ZFS web admin - No items found.
# /usr/lib/zfs/availdevs -d
Segmentation Fault - core dumped
# pstack core
core 'core' of 2350:    ./availdevs -d
-----------------  lwp# 1 / thread# 1  --------------------
 d2d64b3c strlen   (0) + c
 d2fa2f82 get_device_name (8063400, 0, 804751c, 1c) + 3e
 d2fa3015 get_disk (8063400, 0, 804751c, 8067430) + 4d
 d2fa3bbf dmgt_avail_disk_iter (8050ddb, 8047554) + a1
 08051305 main     (2, 8047584, 8047590) + 110
 08050ce6          (2, 80476b0, 80476bc, 0, 80476bf, 80476f9)
-----------------  lwp# 2 / thread# 2  --------------------
 d2de1a81 _door_return (0, 0, 0, 0) + 31
 d29f0d3d door_create_func (0) + 29
 d2ddf93e _thr_setup (d2992400) + 4e
 d2ddfc20 _lwp_start (d2992400, 0, 0, d2969ff8, d2ddfc20, d2992400)
-----------------  lwp# 3 / thread# 3  --------------------
 d2ddfc99 __lwp_park (809afc0, 809afd0, 0) + 19
 d2dda501 cond_wait_queue (809afc0, 809afd0, 0, 0) + 3b
 d2dda9fa _cond_wait (809afc0, 809afd0) + 66
 d2ddaa3c cond_wait (809afc0, 809afd0) + 21
 d2a92bc8 subscriber_event_handler (80630c0) + 3f
 d2ddf93e _thr_setup (d275) + 4e
 d2ddfc20 _lwp_start (d275, 0, 0, d2865ff8, d2ddfc20, d275)
-----------------  lwp# 4 / thread# 4  --------------------
 d2de0cd5 __pollsys (d274df78, 1, 0, 0) + 15
 d2d8a6d2 poll     (d274df78, 1, ) + 52
 d2d0ee1e watch_mnttab (0) + af
 d2ddf93e _thr_setup (d2750400) + 4e
 d2ddfc20 _lwp_start (d2750400, 0, 0, d274dff8, d2ddfc20, d2750400)
-----------------  lwp# 5 / thread# 5  --------------------
 d2ddfc99 __lwp_park (8064ef0, 8064f00, 0) + 19
 d2dda501 cond_wait_queue (8064ef0, 8064f00, 0, 0) + 3b
 d2dda9fa _cond_wait (8064ef0, 8064f00) + 66
 d2ddaa3c cond_wait (8064ef0, 8064f00) + 21
 d2a92bc8 subscriber_event_handler (8064be0) + 3f
 d2ddf93e _thr_setup (d2750800) + 4e
 d2ddfc20 _lwp_start (d2750800, 0, 0, d24edff8, d2ddfc20, d2750800)
Re: [zfs-discuss] compressed root pool at installation time with flash archive predeployment script
On 03/02/10 12:57, Miles Nordin wrote:
> >>>>> "cc" == chad campbell writes:
>
>     cc> I was trying to think of a way to set compression=on
>     cc> at the beginning of a jumpstart.
>
> are you sure grub/ofwboot/whatever can read compressed files?

Grub and the sparc zfs boot blocks can read lzjb-compressed blocks in
zfs.  I have compression=on (and copies=2) for both sparc and x86 roots;
I'm told that grub's zfs support also knows how to fall back to ditto
blocks if the first copy fails to be readable or has a bad checksum.

- Bill
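For reference, the properties mentioned above are ordinary zfs settings;
on an installed root they'd be set with something like the lines below
(dataset names assume the usual "rpool/ROOT" layout, and only blocks
written after the change are affected).  In the flash-archive/jumpstart
case, the same commands could be run from the predeployment or finish
script before the bulk of the data is laid down:

   # zfs set compression=on rpool/ROOT
   # zfs set copies=2 rpool/ROOT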
Re: [zfs-discuss] swap across multiple pools
On 03/03/10 05:19, Matt Keenan wrote:
> In a multipool environment, would it make sense to add swap to a pool
> outside of the root pool, either as the sole swap dataset to be used
> or as extra swap?

Yes.  I do it routinely, primarily to preserve space on boot disks on
large-memory systems.  swap can go in any pool, while dump has the same
limitations as root: a single top-level vdev, single-disk or mirrors
only.

> Would this have any performance implications?

If the non-root pool has many spindles, random read I/O should be
faster and thus swap I/O should be faster.  I haven't attempted to
measure whether this makes a difference.  I generally set
primarycache=metadata on swap zvols, but I also haven't been able to
measure whether that makes any difference.

My users do complain when /tmp fills because there isn't sufficient
swap, so I do know I need large amounts of swap on these systems.
(When migrating one such system from Nevada to OpenSolaris recently I
forgot to add swap to /etc/vfstab.)

- Bill
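A rough sketch of adding swap in a non-root pool (pool and volume names
here are examples only; size to taste):

   # zfs create -V 16g -b $(pagesize) tank/swap
   # zfs set primarycache=metadata tank/swap
   # swap -a /dev/zvol/dsk/tank/swap

and, to make it persistent across reboots, an /etc/vfstab line of the
form:

   /dev/zvol/dsk/tank/swap  -  -  swap  -  no  -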
Re: [zfs-discuss] Snapshot recycle freezes system activity
On 03/08/10 12:43, Tomas Ögren wrote:
> So we tried adding 2x 4GB USB sticks (Kingston Data Traveller Mini
> Slim) as metadata L2ARC and that seems to have pushed the snapshot
> times down to about 30 seconds.

Out of curiosity, how much physical memory does this system have?
Re: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk (70% drop)
On 03/08/10 17:57, Matt Cowger wrote:
> Change zfs options to turn off checksumming (don't want it or need
> it), atime, compression, 4K block size (this is the application's
> native blocksize), etc.

Even when you disable checksums and compression through the zfs
command, zfs will still compress and checksum metadata.  The evil
tuning guide describes an unstable interface to turn off metadata
compression, but I don't see anything in there for metadata checksums.

If you have an actual need for an in-memory filesystem, will tmpfs fit
the bill?

- Bill
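For the record, the per-dataset knobs mentioned in the original post
are plain zfs properties, along the lines of the following (the dataset
name is just an example, and none of this affects metadata):

   # zfs set checksum=off ramdiskpool/data
   # zfs set compression=off ramdiskpool/data
   # zfs set atime=off ramdiskpool/data
   # zfs set recordsize=4k ramdiskpool/data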
Re: [zfs-discuss] Scrub not completing?
On 03/17/10 14:03, Ian Collins wrote:
> I ran a scrub on a Solaris 10 update 8 system yesterday and it is 100%
> done, but not complete:
>
>  scrub: scrub in progress for 23h57m, 100.00% done, 0h0m to go

Don't panic.  If "zpool iostat" still shows active reads from all disks
in the pool, just step back and let it do its thing until it says the
scrub is complete.

There's a bug open on this:

  6899970 scrub/resilver percent complete reporting in zpool status
          can be overly optimistic

Scrub/resilver progress reporting compares the number of blocks read so
far to the number of blocks currently allocated in the pool.  If blocks
that have already been visited are freed and new blocks are allocated,
the seen:allocated ratio is no longer an accurate estimate of how much
more work is needed to complete the scrub.

Before the scrub prefetch code went in, I would routinely see scrubs
last 75 hours which had claimed to be "100.00% done" for over a day.

- Bill
Re: [zfs-discuss] sympathetic (or just multiple) drive failures
On 03/19/10 19:07, zfs ml wrote:
> What are peoples' experiences with multiple drive failures?

1985-1986.  DEC RA81 disks.  Bad glue that degraded at the disk's
operating temperature.  Head crashes.

No more need be said.

- Bill
Re: [zfs-discuss] Proposition of a new zpool property.
On 03/22/10 11:02, Richard Elling wrote:
> Scrub tends to be a random workload dominated by IOPS, not bandwidth.

You may want to look at this again post build 128; the addition of
metadata prefetch to scrub/resilver in that build appears to have
dramatically changed how it performs (largely for the better).

- Bill
Re: [zfs-discuss] Tuning the ARC towards LRU
On 04/05/10 15:24, Peter Schuller wrote:
> In the urxvt case, I am basing my claim on informal observations.
> I.e., "hit terminal launch key, wait for disks to rattle, get my
> terminal".  Repeat.  Only by repeating it very many times in very
> rapid succession am I able to coerce it to be cached such that I can
> immediately get my terminal.  And what I mean by that is that it keeps
> necessitating disk I/O for a long time, even on rapid successive
> invocations.  But once I have repeated it enough times it seems to
> finally enter the cache.

Are you sure you're not seeing unrelated disk update activity like
atime updates, mtime updates on pseudo-terminals, etc.?  I'd want to
start looking more closely at I/O traces (dtrace can be very helpful
here) before blaming any specific system component for the unexpected
I/O.

- Bill
Re: [zfs-discuss] SSD sale on newegg
On 04/06/10 17:17, Richard Elling wrote:
> > You could probably live with an X25-M as something to use for all
> > three, but of course you're making tradeoffs all over the place.
>
> That would be better than almost any HDD on the planet because the HDD
> tradeoffs result in much worse performance.

Indeed.  I've set up a couple of small systems (one a desktop
workstation, and the other a home fileserver) with the root pool plus
the l2arc and slog for a data pool on an 80G X25-M and have been very
happy with the result.

The recipe I'm using is to slice the ssd, with the rpool in s0 with
roughly half the space, 1GB in s3 for the slog, and the rest of the
space as L2ARC in s4.  That may actually be overly generous for the
root pool, but I run with copies=2 on rpool/ROOT and I tend to keep a
bunch of BEs around.

- Bill
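For anyone wanting to copy the recipe, the data-pool half of it boils
down to something like the commands below (the device/slice names are
just examples, the slices are laid out with format(1m) beforehand, and
the root-pool slice is normally set up by the installer):

   # zpool add tank log c1t0d0s3      # ~1GB slice as the slog
   # zpool add tank cache c1t0d0s4    # remaining space as L2ARC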
Re: [zfs-discuss] Secure delete?
On 04/11/10 10:19, Manoj Joseph wrote:
> Earlier writes to the file might have left older copies of the blocks
> lying around which could be recovered.

Indeed; to be really sure you need to overwrite all the free space in
the pool.

If you limit yourself to worrying about data accessible via a regular
read on the raw device, it's possible to do this without an outage if
you have a spare disk and a lot of time.  Rough process:

 0) delete the files and snapshots containing the data you wish to
    purge.
 1) replace a previously unreplaced disk in the pool with the spare
    disk using "zpool replace".
 2) wait for the replace to complete.
 3) wipe the removed disk, using the "purge" command of format(1m)'s
    analyze subsystem or equivalent; the wiped disk is now the spare
    disk.
 4) if all disks have not been replaced yet, go back to step 1.

This relies on the fact that the resilver kicked off by "zpool replace"
copies only allocated data.

There are some assumptions in the above.  For one, I'm assuming that
all disks in the pool are the same size.  A bigger one is that a
"purge" is sufficient to wipe the disks completely -- probably the
biggest single assumption, given that the underlying storage devices
themselves are increasingly using copy-on-write techniques.  The most
paranoid will replace all the disks and then physically destroy the old
ones.

- Bill
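Steps 1-3 above, expressed as commands (pool and device names are
examples only):

   # zpool replace tank c2t3d0 c2t9d0   # step 1: swap in the wiped spare
   # zpool status tank                  # step 2: repeat until the resilver completes
   # format                             # step 3: select the removed disk, analyze -> purge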
Re: [zfs-discuss] Secure delete?
On 04/11/10 12:46, Volker A. Brandt wrote:
> > The most paranoid will replace all the disks and then physically
> > destroy the old ones.
>
> I thought the most paranoid will encrypt everything and then forget
> the key... :-)

Actually, I hear that the most paranoid encrypt everything *and then*
destroy the physical media when they're done with it.

> Seriously, once encrypted zfs is integrated that's a viable method.

It's certainly a new tool to help with the problem, but consider that
forgetting a key requires secure deletion of the key.  Like most
cryptographic techniques, filesystem encryption only changes the size
of the problem we need to solve.

- Bill
Re: [zfs-discuss] Suggestions about current ZFS setup
On 04/14/10 12:37, Christian Molson wrote:
> First I want to thank everyone for their input, it is greatly
> appreciated.  To answer a few questions:
>
> Chassis I have:
>   http://www.supermicro.com/products/chassis/4U/846/SC846E2-R900.cfm
> Motherboard: http://www.tyan.com/product_board_detail.aspx?pid=560
> RAM: 24 GB (12 x 2GB)
> 10 x 1TB Seagate 7200.11
> 10 x 1TB Hitachi
> 4 x 2TB WD WD20EARS (4K blocks)

If you have the spare change for it, I'd add one or two SSDs to the
mix, with space on them allocated to the root pool plus l2arc cache and
slog for the data pool(s).

- Bill
Re: [zfs-discuss] dedup screwing up snapshot deletion
On 04/14/10 19:51, Richard Jahnel wrote:
> This sounds like the known issue about the dedupe map not fitting in
> ram.

Indeed, but this is not correct:

> When blocks are freed, dedupe scans the whole map to ensure each block
> is not in use before releasing it.

That's not correct.  dedup uses a data structure which is indexed by
the hash of the contents of each block.  That hash function is
effectively random, so it needs to access a *random* part of the map
for each free, which means that it (as you correctly stated):

> ... takes a veeery long time if the map doesn't fit in ram.

If you can, try adding more ram to the system.  Adding a flash-based
ssd as a cache/L2ARC device is also very effective; random i/o to ssd
is much faster than random i/o to spinning rust.

- Bill
Re: [zfs-discuss] Is it safe/possible to idle HD's in a ZFS Vdev to save wear/power?
On 04/16/10 20:26, Joe wrote:
> I was just wondering if it is possible to spin down/idle/sleep hard
> disks that are part of a vdev & pool SAFELY?

It's possible.  My ultra24 desktop has this enabled by default (because
it's a known desktop type).  See the power.conf man page; I think you
may need to add an "autopm enable" if the system isn't recognized as a
known desktop.  The disks spin down when the system is idle; there's a
delay of a few seconds when they spin back up.

- Bill
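As a rough sketch (not tested on your hardware), the relevant
/etc/power.conf entries look something like the lines below; the device
path is a placeholder for your disk's physical path, the threshold is
illustrative, and pmconfig(1m) needs to be run after editing the file:

   autopm               enable
   device-thresholds    /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0    15m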
Re: [zfs-discuss] SSD best practices
On 04/17/10 07:59, Dave Vrona wrote:
> 1) Mirroring.  Leaving cost out of it, should ZIL and/or L2ARC SSDs be
> mirrored?

L2ARC cannot be mirrored -- and doesn't need to be.  The contents are
checksummed; if the checksum doesn't match, it's treated as a cache
miss and the block is re-read from the main pool disks.

The ZIL can be mirrored, and mirroring it improves your ability to
recover the pool in the face of multiple failures.

> 2) ZIL write cache.  It appears some have disabled the write cache on
> the X-25E.  This results in a 5-fold performance hit but it eliminates
> a potential mechanism for data loss.  Is this valid?

With the ZIL disabled, you may lose the last ~30s of writes to the pool
(the transaction group being assembled and written at the time of the
crash).

With the ZIL on a device with a write cache that ignores cache flush
requests, you may lose the tail of some of the intent logs, starting
with the first block in each log which wasn't readable after the
restart.  (I say "may" rather than "will" because some failures may not
result in the loss of the write cache.)  Depending on how quickly your
ZIL device pushes writes from cache to stable storage, this may narrow
the window from ~30s to less than 1s, but it doesn't close the window
entirely.

> If I can mirror ZIL, I imagine this is no longer a concern?

Mirroring a ZIL device with a volatile write cache doesn't eliminate
this risk.  Whether it reduces the risk depends on precisely *what*
caused your system to crash and reboot; if the failure also causes loss
of the write cache contents on both sides of the mirror, mirroring
won't help.

- Bill
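In zpool terms, the two cases look roughly like this (device names are
examples):

   # zpool add tank log mirror c4t0d0 c4t1d0   # slog can be mirrored
   # zpool add tank cache c4t2d0               # L2ARC is always unmirrored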
Re: [zfs-discuss] Single-disk pool corrupted after controller failure
On 05/01/10 13:06, Diogo Franco wrote:
> After seeing that on some cases labels were corrupted, I tried running
> zdb -l on mine:
> ... (labels 0, 1 not there, labels 2, 3 are there).
>
> I'm looking for pointers on how to fix this situation, since the disk
> still has available metadata.

There are two reasons why you could get this:

 1) the labels are gone.
 2) the labels are not at the start of what solaris sees as p1, and
    thus are somewhere else on the disk.

I'd look more closely at how freebsd computes the start of the
partition or slice '/dev/ad6s1d' that contains the pool.  I think #2 is
somewhat more likely.

- Bill
[zfs-discuss] confused about zpool import -f and export
Hi all,

I think I'm missing a concept with import and export.  I'm working on
installing a Nexenta b134 system under Xen: I have to run the installer
under hvm mode, then I'm trying to get it back up under pv mode.  In
that process the controller names change, and that's where I'm getting
tripped up.

I do a successful install, then I boot OK, but can't export the root
pool (OK, fine).  So I boot from the installer CD in rescue mode, do an
'import -f' and then 'export'.  That all goes well.

When I reconfigure the VM and boot back up in pv mode, if I bring it up
under the CD image and do 'zpool import', I get:

==========
r...@nexenta_safemode:~# zpool import
  pool: syspool
    id: 5607125904664422185
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and try
        again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

        syspool       UNAVAIL  missing device
          mirror-0    ONLINE
            c0t0d0s0  ONLINE
            c0t1d0s0  ONLINE

        Additional devices are known to be part of this pool, though
        their exact configuration cannot be determined.
==========

I thought the purpose of the export was to remove concerns about which
devices are in the pool so it could be reassembled on the other side.
But, like I said, I think I'm missing something, because 'export'
doesn't seem to clear this up.  Or maybe it does, but I'm not
understanding the other thing that's supposed to be cleared up.

This worked back on a 20081207 build, so perhaps something has changed?

I'm adding format's view of the disks and a zdb list below.

Thanks,
-Bill

r...@nexenta_safemode:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t0d0 /xpvd/x...@51712
       1. c0t1d0 /xpvd/x...@51728
Specify disk (enter its number): ^D

r...@nexenta_safemode:~# zdb -l /dev/rdsk/c0t0d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version: 22
    name: 'syspool'
    state: 1
    txg: 384
    pool_guid: 5607125904664422185
    hostid: 4905600
    hostname: 'nexenta_safemode'
    top_guid: 7124011680357776878
    guid: 15556832564812580834
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 7124011680357776878
        metaslab_array: 23
        metaslab_shift: 32
        ashift: 9
        asize: 750041956352
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 15556832564812580834
            path: '/dev/dsk/c0d0s0'
            devid: 'id1,c...@aqemu_harddisk=qm1/a'
            phys_path: '/p...@0,0/pci-...@1,1/i...@0/c...@0,0:a'
            whole_disk: 0
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 544113268733868414
            path: '/dev/dsk/c0d1s0'
            devid: 'id1,c...@aqemu_harddisk=qm2/a'
            phys_path: '/p...@0,0/pci-...@1,1/i...@0/c...@1,0:a'
            whole_disk: 0
            create_txg: 4
--------------------------------------------
LABEL 1
--------------------------------------------
    version: 22
    name: 'syspool'
    state: 1
    txg: 384
    pool_guid: 5607125904664422185
    hostid: 4905600
    hostname: 'nexenta_safemode'
    top_guid: 7124011680357776878
    guid: 15556832564812580834
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 7124011680357776878
        metaslab_array: 23
        metaslab_shift: 32
        ashift: 9
        asize: 750041956352
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 15556832564812580834
            path: '/dev/dsk/c0d0s0'
            devid: 'id1,c...@aqemu_harddisk=qm1/a'
            phys_path: '/p...@0,0/pci-...@1,1/i...@0/c...@0,0:a'
            whole_disk: 0
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 544113268733868414
            path: '/dev/dsk/c0d1s0'
            devid: 'id1,c...@aqemu_harddisk=qm2/a'
            phys_path: '/p...@0,0/pci-...@1,1/i...@0/c...@1,0:a'
            whole_disk: 0
            create_txg: 4
--------------------------------------------
LABEL 2
--------------------------------------------
    version: 22
    name: 'syspool'
    state: 0
    txg: 11520
    pool_guid: 15023076366841556794
    hostid: 8399112
    hostname: 'repository'
    top_guid: 12107281337513313186
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
On 05/07/2010 11:08 AM, Edward Ned Harvey wrote:
> I'm going to continue encouraging you to stay "mainstream," because
> what people do the most is usually what's supported the best.

If I may be the contrarian, I hope Matt keeps experimenting with this,
files bugs, and they get fixed.  His use case is very compelling - I
know lots of SOHO folks who could really use a NAS where this 'just
worked'.

The ZFS team has done well by thinking liberally about conventional
assumptions.

-Bill

--
Bill McGonigle, Owner
BFC Computing, LLC
http://bfccomputing.com/
Telephone: +1.603.448.4440
Email, IM, VOIP: b...@bfccomputing.com
VCard: http://bfccomputing.com/vcard/bill.vcf
Social networks: bill_mcgonigle/bill.mcgonigle
Re: [zfs-discuss] ZFS - USB 3.0 SSD disk
On 05/06/2010 11:00 AM, Bruno Sousa wrote:
> Going on the specs it seems to me that if this device has a good price
> it might be quite useful for caching purposes on ZFS based storage.

Not bad; they claim a 1TB transfer in 47 minutes:

http://www.google.com/search?hl=en&q=1TB%2F47+minutes

That's about double what I usually get out of a cheap 'desktop' SATA
drive with OpenSolaris.  Slower than a RAID-Z2 of 10 of them, though.
Still, the power savings could be appreciable.

-Bill

--
Bill McGonigle, Owner
BFC Computing, LLC
http://bfccomputing.com/
Telephone: +1.603.448.4440
Email, IM, VOIP: b...@bfccomputing.com
VCard: http://bfccomputing.com/vcard/bill.vcf
Social networks: bill_mcgonigle/bill.mcgonigle
Re: [zfs-discuss] ZFS root ARC memory usage on VxFS system...
On 05/07/10 15:05, Kris Kasner wrote:
> Is ZFS swap cached in the ARC?  I can't account for data in the ZFS
> filesystems to use as much ARC as is in use without the swap files
> being cached.. seems a bit redundant?

There's nothing to explicitly disable caching just for swap; from zfs's
point of view, the swap zvol is just like any other zvol.  But you can
turn this off (assuming sufficiently recent zfs).  Try:

   zfs set primarycache=metadata rpool/swap

(or whatever your swap zvol is named).  You probably want metadata
rather than "none" so that things like indirect blocks for the swap
device get cached.

- Bill
Re: [zfs-discuss] New SSD options
On 05/20/10 12:26, Miles Nordin wrote:
> I don't know, though, what to do about these reports of devices that
> almost respect cache flushes but seem to lose exactly one transaction.
> AFAICT this should be a works/doesntwork situation, not a continuum.
> But there's so much brokenness out there.

I've seen similar "tail drop" behavior before -- the last write or two
before a hardware reset goes into the bit bucket, but ones before that
are durable.

So, IMHO, a cheap consumer ssd used as a zil may still be worth it (for
some use cases) to narrow the window of data loss from ~30 seconds to a
sub-second value.

- Bill
Re: [zfs-discuss] Dedup... still in beta status
On 06/15/10 10:52, Erik Trimble wrote:
> Frankly, dedup isn't practical for anything but enterprise-class
> machines.  It's certainly not practical for desktops or anything
> remotely low-end.

We're certainly learning a lot about how zfs dedup behaves in practice.

I've enabled dedup on two desktops and a home server and so far haven't
regretted it on those three systems.  However, they each have more than
typical amounts of memory (4G and up), a data pool on two or more
large-capacity SATA drives, plus an X25-M ssd sliced into a root pool
as well as l2arc and slog slices for the data pool (see below [1]).

I tried enabling dedup on a smaller system (with only 1G memory and a
single very slow disk), observed serious performance problems, and
turned it off pretty quickly.

I think, with current bits, it's not a simple matter of "ok for
enterprise, not ok for desktops".  With an ssd for either main storage
or l2arc, and/or enough memory, and/or a not very demanding workload,
it seems to be ok.

For one such system, I'm seeing:

# zpool list z
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
z      464G   258G   206G    55%  1.25x  ONLINE  -
# zdb -D z
DDT-sha256-zap-duplicate: 432759 entries, size 304 on disk, 156 in core
DDT-sha256-zap-unique: 1094244 entries, size 298 on disk, 151 in core

dedup = 1.25, compress = 1.44, copies = 1.00, dedup * compress / copies = 1.80

- Bill

[1] To forestall responses of the form "you're nuts for putting a slog
on an x25-m", which is off-topic for this thread and being discussed
elsewhere: yes, I'm aware of the write cache issues on power failure on
the x25-m.  For my purposes, it's a better robustness/performance
tradeoff than either zil-on-spinning-rust or zil disabled, because:

 a) for many potential failure cases on whitebox hardware running
    bleeding edge opensolaris bits, the x25-m will not lose power and
    thus the write cache will stay intact across a crash.
 b) even if it loses power and loses some writes-in-flight, it's not
    likely to lose *everything* since the last txg sync.

It's good enough for my personal use.  Your mileage will vary.  As
always, system design involves tradeoffs.
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On 07/20/10 14:10, Marcelo H Majczak wrote:
> It also seems to be issuing a lot more writing to rpool, though I
> can't tell what.  In my case it causes a lot of read contention since
> my rpool is a USB flash device with no cache.  iostat says something
> like up to 10w/20r per second.  Up to 137 the performance has been
> enough, so far, for my purposes on this laptop.

If pools are more than about 60-70% full, you may be running into bug
6962304.

Workaround: add the following to /etc/system, run bootadm
update-archive, and reboot:

----cut here----
* Work around 6962304
set zfs:metaslab_min_alloc_size=0x1000
* Work around 6965294
set zfs:metaslab_smo_bonus_pct=0xc8
----cut here----

No guarantees, but it's helped a few systems.

- Bill
Re: [zfs-discuss] L2ARC and ZIL on same SSD?
On 07/22/10 04:00, Orvar Korvar wrote:
> Ok, so the bandwidth will be cut in half, and some people use this
> configuration.  But how bad is it to have the bandwidth cut in half?
> Will it even be noticeable?

For a home server, I doubt you'll notice.

I've set up several systems (desktop & home server) as follows:

 - two large conventional disks, mirrored, as the data pool.
 - a single 80GB X25-M, divided into three slices:
     50% in slice 0 as the root pool (with dedup & compression enabled,
       and copies=2 for rpool/ROOT)
     1GB in slice 3 as ZIL for the data pool
     the remainder in slice 4 as L2ARC for the data pool.

Two conventional disks + 1 ssd performs much better than two disks
alone.  If I needed more space (I haven't, yet), I'd add another mirror
pair or two to the data pool.  I've been very happy with the results.

- Bill
Re: [zfs-discuss] Increase resilver priority
On 07/23/10 02:31, Giovanni Tirloni wrote:
> We've seen some resilvers on idle servers that are taking ages.  Is it
> possible to speed up resilver operations somehow?  E.g. iostat shows
> <5MB/s writes on the replaced disks.

What build of opensolaris are you running?  There were some recent
improvements (notably the addition of prefetch to the pool traverse
used by scrub and resilver) which sped this up significantly for my
systems.

Also: if there are large numbers of snapshots, pools seem to take
longer to resilver, particularly when there's a lot of metadata
divergence between snapshots.

Turning off atime updates (if you and your applications can cope with
this) may also help going forward.

- Bill
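Turning atime off is a one-liner per pool or dataset (the name below is
just an example; descendants inherit the setting unless overridden):

   # zfs set atime=off tank
   # zfs get -r atime tank    # check what each dataset ended up with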
Re: [zfs-discuss] Resilvering, amount of data on disk, etc.
On Mon, 2009-10-26 at 10:24 -0700, Brian wrote:
> Why does resilvering an entire disk yield different amounts of
> resilvered data each time?  I have read that ZFS only resilvers what
> it needs to, but in the case of replacing an entire disk with another
> formatted clean disk, you would think the amount of data would be the
> same each time a disk is replaced with an empty formatted disk.
> I'm getting different results when viewing the 'zpool status' info
> (below).

Replacing a disk adds an entry to the "zpool history" log, which
requires allocating blocks, which will change what's stored in the
pool.
Re: [zfs-discuss] sched regularly writing lots of MBs to the pool?
zfs groups writes together into transaction groups; the physical writes
to disk are generally initiated by kernel threads (which appear in
dtrace as threads of the "sched" process).

Changing the attribution is not going to be simple, as a single
physical write to the pool may contain data and metadata changes
triggered by multiple user processes.  You need to go up a level of
abstraction and look at the vnode layer to attribute writes to
particular processes.
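One sketch of doing that attribution at the vnode (VFS) layer is the
DTrace fsinfo provider -- for example, summing bytes written per
process into zfs filesystems (this assumes the fsinfo provider is
available on your build; adjust the predicate to taste):

   #!/usr/sbin/dtrace -s
   /* bytes written per process, counted at the VFS layer, zfs only */
   fsinfo:::write
   /args[0]->fi_fs == "zfs"/
   {
           @bytes[execname] = sum(arg1);
   }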
Re: [zfs-discuss] dedupe question
On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote:
> Does the dedupe functionality happen at the file level or a lower
> block level?

It occurs at the block allocation level.

> I am writing a large number of files that have the following
> structure:
>
> -- file begins
> 1024 lines of random ASCII chars 64 chars long
> some tilde chars .. about 1000 of them
> some text ( english ) for 2K
> more text ( english ) for 700 bytes or so
> --

ZFS's default block size is 128K and is controlled by the "recordsize"
filesystem property.  Unless you changed "recordsize", each of the
files above would be a single block distinct from the others.

You may or may not get better dedup ratios with a smaller recordsize,
depending on how the common parts of the file line up with block
boundaries.  The cost of additional indirect blocks might overwhelm the
savings from deduping a small common piece of the file.

- Bill
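If you want to experiment, something like the following would let you
gauge the effect (pool/dataset names are examples; recordsize only
applies to files written after the change, and zdb -S is only present
on dedup-capable builds):

   # zdb -S tank                     # simulate the dedup table for existing data
   # zfs set recordsize=8k tank/data
   # zfs set dedup=on tank/data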
Re: [zfs-discuss] This is the scrub that never ends...
On Fri, 2009-09-11 at 13:51 -0400, Will Murnane wrote:
> On Thu, Sep 10, 2009 at 13:06, Will Murnane wrote:
> > On Wed, Sep 9, 2009 at 21:29, Bill Sommerfeld wrote:
> > > > Any suggestions?
> > >
> > > Let it run for another day.
> > I'll let it keep running as long as it wants this time.
>
>  scrub: scrub completed after 42h32m with 0 errors on Thu Sep 10
>  17:20:19 2009
>
> And the people rejoiced.  So I guess the issue is more "scrubs may
> report ETA very inaccurately" than "scrubs never finish".  Thanks for
> the suggestions and support.

One of my pools routinely does this -- the scrub gets to 100% after
about 50 hours but keeps going for another day or more after that.

It turns out that zpool reports "number of blocks visited" vs "number
of blocks allocated", but clamps the ratio at 100%.  If there is
substantial turnover in the pool, it appears you may end up needing to
visit more blocks than are actually allocated at any one point in time.

I made a modified version of the zpool command and this is what it
prints for me:

...
 scrub: scrub in progress for 74h25m, 119.90% done, 0h0m to go
        5428197411840 blocks examined, 4527262118912 blocks allocated
...

This is the (trivial) source change I made to see what's going on under
the covers:

diff -r 12fb4fb507d6 usr/src/cmd/zpool/zpool_main.c
--- a/usr/src/cmd/zpool/zpool_main.c    Mon Oct 26 22:25:39 2009 -0700
+++ b/usr/src/cmd/zpool/zpool_main.c    Tue Nov 10 17:07:59 2009 -0500
@@ -2941,12 +2941,15 @@
 	if (examined == 0)
 		examined = 1;
-	if (examined > total)
-		total = examined;
 	fraction_done = (double)examined / total;
-	minutes_left = (uint64_t)((now - start) *
-	    (1 - fraction_done) / fraction_done / 60);
+	if (fraction_done < 1) {
+		minutes_left = (uint64_t)((now - start) *
+		    (1 - fraction_done) / fraction_done / 60);
+	} else {
+		minutes_left = 0;
+	}
+
 	minutes_taken = (uint64_t)((now - start) / 60);
 
 	(void) printf(gettext("%s in progress for %lluh%um, %.2f%% done, "
@@ -2954,6 +2957,9 @@
 	    scrub_type, (u_longlong_t)(minutes_taken / 60),
 	    (uint_t)(minutes_taken % 60), 100 * fraction_done,
 	    (u_longlong_t)(minutes_left / 60), (uint_t)(minutes_left % 60));
+	(void) printf(gettext("\t %lld blocks examined, %lld blocks allocated\n"),
+	    examined,
+	    total);
 }
 
 static void
Re: [zfs-discuss] zfs eradication
On Wed, 2009-11-11 at 10:29 -0800, Darren J Moffat wrote:
> Joerg Moellenkamp wrote:
> > Hi,
> >
> > Well ... i think Darren should implement this as a part of
> > zfs-crypto.  Secure Delete on SSD looks like quite a challenge, when
> > wear leveling and bad block relocation kick in ;)
>
> No, I won't be doing that as part of the zfs-crypto project.  As I
> said, some jurisdictions are happy that if the data is encrypted then
> overwrite of the blocks isn't required.  For those that aren't, use of
> dd(1M) or format(1M) may be sufficient - if that isn't, then nothing
> short of physical destruction is likely good enough.

Note that "eradication" via overwrite makes no sense if the underlying
storage uses copy-on-write, because there's no guarantee that the newly
written block actually will overlay the freed block.

IMHO the sweet spot here may be to overwrite once with zeros (allowing
the block to be compressed out of existence if the underlying storage
is a compressed zvol or equivalent) or to use the TRIM command.

(It may also be worthwhile for zvols exported via various protocols to
themselves implement the TRIM command -- freeing the underlying
storage.)

- Bill
Re: [zfs-discuss] Resilver/scrub times?
Yesterday's integration of

  6678033 resilver code should prefetch

as part of changeset 74e8c05021f1 (which should be in build 129 when it
comes out) may improve scrub times, particularly if you have a large
number of small files and a large number of snapshots.

I recently tested an early version of the fix, and saw one pool go from
an elapsed time of 85 hours to 20 hours; another (with many fewer
snapshots) went from 35 to 17.

- Bill
[zfs-discuss] USB sticks show on one set of devices in zpool, different devices in format
Hello,

I had snv_111b running for a while on an HP DL160G5, with two 16GB USB
sticks comprising the mirrored rpool for boot, and four 1TB drives
comprising another pool, pool1, for data.  That had been working just
fine for a few months.

Yesterday I got it into my mind to upgrade the OS to the latest, which
was snv_127.  That worked, and all was well.  I also upgraded the
DL160G5's BIOS firmware.  All was cool and running as snv_127 just
fine.  I upgraded zfs from 13 to 19.  See pool status post-upgrade:

r...@arc:/# zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

Today I went to activate the BE for the new snv_127 install that I've
been manually booting into, but "beadm activate ..." always fails, see
here:

r...@arc:~# export BE_PRINT_ERR=true
r...@arc:~# beadm activate opensolaris-snv127
be_do_installgrub: installgrub failed for device c2t0d0s0.
Unable to activate opensolaris-snv127.
Unknown external error.

So I tried the installgrub manually and got this:

r...@arc:~# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c2t0d0s0
cannot open/stat device /dev/rdsk/c2t0d0s2

OK, wtf?  The rpool status shows both of my USB sticks alive and well
at c2t0d0s0 and c1t0d0s0... but when I run "format -e" I see this:

r...@arc:/# format -e
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c7t1d0 /p...@0,0/pci8086,4...@1/pci103c,3...@0/s...@1,0
       1. c7t2d0 /p...@0,0/pci8086,4...@1/pci103c,3...@0/s...@2,0
       2. c7t3d0 /p...@0,0/pci8086,4...@1/pci103c,3...@0/s...@3,0
       3. c7t4d0 /p...@0,0/pci8086,4...@1/pci103c,3...@0/s...@4,0
       4. c8t0d0 /p...@0,0/pci103c,3...@1d,7/stor...@8/d...@0,0
       5. c11t0d0 /p...@0,0/pci103c,3...@1d,7/stor...@6/d...@0,0
Specify disk (enter its number): 4
selecting c8t0d0
[disk formatted]
/dev/dsk/c8t0d0s0 is part of active ZFS pool rpool. Please see zpool(1M).

It shows my two USB sticks of the rpool being at c8t0d0 and c11t0d0...!
How is this system even working?  What do I need to do to clear this
up...?

Thanks for your time,
-Bill
Re: [zfs-discuss] zfs on ssd
On Fri, 2009-12-11 at 13:49 -0500, Miles Nordin wrote:
> > "sh" == Seth Heeren writes:
>
>     sh> If you don't want/need log or cache, disable these?  You might
>     sh> want to run your ZIL (slog) on ramdisk.
>
> seems quite silly.  why would you do that instead of just disabling
> the ZIL?  I guess it would give you a way to disable it pool-wide
> instead of system-wide.
>
> A per-filesystem ZIL knob would be awesome.

For what it's worth, there's already a per-filesystem ZIL knob: the
"logbias" property.  It can be set either to "latency" or "throughput".
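Usage is per dataset, e.g. (the dataset name is just an example):

   # zfs set logbias=throughput tank/db-logs
   # zfs get logbias tank/db-logs

With logbias=throughput, synchronous writes for that dataset bypass any
separate log device and are optimized for overall pool throughput
instead of latency.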
[zfs-discuss] zpool fragmentation issues?
Hi Everyone,

I hope this is the right forum for this question.  A customer is using
a Thumper as an NFS file server to provide the mail store for multiple
email servers (Dovecot).  They find that when a zpool is freshly
created and populated with mailboxes, even to the extent of 80-90%
capacity, performance is ok for the users, and backups and scrubs take
a few hours (4TB of data).  There are around 100 file systems.  After
running for a while (a couple of months) the zpool seems to get
"fragmented"; backups take 72 hours and a scrub takes about 180 hours.
They are running mirrors with about 5TB usable per pool (500GB disks).

Being a mail store, the writes and reads are small and random.  The
record size has been set to 8k (which improved performance
dramatically).  The backup application is Amanda.  Once backups become
too tedious, the remedy is to replicate the pool and start over.
Things get fast again for a while.

Is this expected behavior given the application (email - small, random
writes/reads)?  Are there recommendations for system/ZFS/NFS
configurations to improve this sort of thing?  Are there best practices
for structuring backups to avoid a directory walk?

Thanks,
bill
[zfs-discuss] force 4k writes?
This is most likely a naive question on my part.  If recordsize is set
to 4k (or a multiple of 4k), will ZFS ever write a record that is less
than 4k or not a multiple of 4k?  This includes metadata.  Does
compression have any effect on this?

Thanks for the help,
bill
Re: [zfs-discuss] zpool fragmentation issues?
On Tue, 2009-12-15 at 17:28 -0800, Bill Sprouse wrote:
> After running for a while (couple of months) the zpool seems to get
> "fragmented", backups take 72 hours and a scrub takes about 180 hours.

Are there periodic snapshots being created in this pool?

Can they run with atime turned off?  (File tree walks performed by
backups will update the atime of all directories; this will generate
extra write traffic and also cause snapshots to diverge from their
parents and take longer to scrub.)

- Bill
Re: [zfs-discuss] force 4k writes
Hi Richard,

How's the ranch?  ;-)

> > This is most likely a naive question on my part.  If recordsize is
> > set to 4k (or a multiple of 4k), will ZFS ever write a record that
> > is less than 4k or not a multiple of 4k?
>
> Yes.  The recordsize is the upper limit for a file record.
>
> > This includes metadata.
>
> Yes.  Metadata is compressed and seems to usually be one block.
>
> > Does compression have any effect on this?
>
> Yes.  4KB is the minimum size that can be compressed for regular data.
> NB. Physical writes may be larger because they are coalesced.  But if
> you are worried about recordsize, then you are implicitly worried
> about reads.

The question behind the question is: given the really bad things that
can happen performance-wise with writes that are not 4k aligned when
using flash devices, is there any way to ensure that any and all writes
from ZFS are 4k aligned?

> -- richard
Re: [zfs-discuss] zpool fragmentation issues?
On Dec 15, 2009, at 6:24 PM, Bill Sommerfeld wrote:
> On Tue, 2009-12-15 at 17:28 -0800, Bill Sprouse wrote:
> > After running for a while (couple of months) the zpool seems to get
> > "fragmented", backups take 72 hours and a scrub takes about 180
> > hours.
>
> Are there periodic snapshots being created in this pool?

Yes, every two hours.

> Can they run with atime turned off?

I'm not sure, but I expect they can.  I'll ask.

> (File tree walks performed by backups will update the atime of all
> directories; this will generate extra write traffic and also cause
> snapshots to diverge from their parents and take longer to scrub.)
>
> - Bill

Thanks!
Re: [zfs-discuss] zpool fragmentation issues?
Hi Bob,

On Dec 15, 2009, at 6:41 PM, Bob Friesenhahn wrote:
> On Tue, 15 Dec 2009, Bill Sprouse wrote:
> > Hi Everyone, I hope this is the right forum for this question.  A
> > customer is using a Thumper as an NFS file server to provide the
> > mail store for multiple email servers (Dovecot).  They find that
> > when a zpool is freshly created and
>
> It seems that Dovecot's speed optimizations for mbox format are
> specially designed to break zfs
> "http://wiki.dovecot.org/MailboxFormat/mbox#Dovecot.27s_Speed_Optimizations"
> and explains why using a tiny 8k recordsize temporarily "improved"
> performance.
>
> Tiny updates seem to be abnormal for a mail server.  The many tiny
> updates combined with zfs COW conspire to spread the data around the
> disk, requiring a seek for each 8k of data.  If more data was written
> at once, and much larger blocks were used, then the filesystem would
> continue to perform much better, although perhaps less well
> initially.  If the system has sufficient RAM, or a large enough L2ARC,
> then Dovecot's optimizations to diminish reads become meaningless.

I think one of the reasons they went to small recordsizes was an issue
where they were getting killed with reads of small messages and having
to pull in 128K records each time.  The smaller recordsizes seem to
have improved that aspect at least.  Thanks for the pointer to the
Dovecot notes.

> > Is this expected behavior given the application (email - small,
> > random writes/reads)?  Are there recommendations for system/ZFS/NFS
> > configurations to improve this sort of thing?  Are there best
> > practices for structuring backups to avoid a directory walk?
>
> Zfs works best when whole files are re-written rather than updated in
> place as Dovecot seems to want to do.  Either the user mailboxes
> should be re-written entirely when they are "expunged" or else a
> different mail storage format which writes entire files, or much
> larger records, should be used.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] zpool fragmentation issues?
Thanks Michael,

Useful stuff to try.  I wish we could add more memory, but the x4500 is
limited to 16GB.  Compression was a question; it's currently off, but
they were thinking of turning it on.

bill

On Dec 15, 2009, at 7:02 PM, Michael Herf wrote:
> I have also had slow scrubbing on filesystems with lots of files, and
> I agree that it does seem to degrade badly.  For me, it seemed to go
> from 24 hours to 72 hours in a matter of a few weeks.
>
> I did these things on a pool in-place, which helped a lot (no
> rebuilding):
> 1. reduced number of snapshots (auto snapshots can generate a lot of
>    files).
> 2. disabled compression and rebuilt affected datasets (is compression
>    on?)
> 3. upgraded to b129, which has metadata prefetch for scrub, seems to
>    help by ~2x?
> 4. tar'd up some extremely large folders
> 5. added 50% more RAM.
> 6. turned off atime
>
> My scrubs went from 80 hours to 12 with these changes.  (4TB used,
> ~10M files + 10 snapshots each.)  I haven't figured out if "disable
> compression" vs. "fewer snapshots/files and more RAM" made a bigger
> difference.
>
> I'm assuming that once the number of files exceeds ARC, you get
> dramatically lower performance, and maybe that compression has some
> additional overhead, but I don't know, this is just what worked.  It
> would be nice to have a benchmark set for features like this & general
> recommendations for RAM/ARC size, based on number of files, etc.  How
> does ARC usage scale with snapshots?  Scrub on a huge maildir machine
> seems like it would make a nice benchmark.
>
> I used "zdb -d pool" to figure out which filesystems had a lot of
> objects, and figured out places to trim based on that.
>
> mike
>
> On Tue, Dec 15, 2009 at 6:41 PM, Bob Friesenhahn wrote:
> > On Tue, 15 Dec 2009, Bill Sprouse wrote:
> > > Hi Everyone, I hope this is the right forum for this question.  A
> > > customer is using a Thumper as an NFS file server to provide the
> > > mail store for multiple email servers (Dovecot).  They find that
> > > when a zpool is freshly created and
> >
> > It seems that Dovecot's speed optimizations for mbox format are
> > specially designed to break zfs
> > "http://wiki.dovecot.org/MailboxFormat/mbox#Dovecot.27s_Speed_Optimizations"
> > and explains why using a tiny 8k recordsize temporarily "improved"
> > performance.
> >
> > Tiny updates seem to be abnormal for a mail server.  The many tiny
> > updates combined with zfs COW conspire to spread the data around the
> > disk, requiring a seek for each 8k of data.  If more data was
> > written at once, and much larger blocks were used, then the
> > filesystem would continue to perform much better, although perhaps
> > less well initially.  If the system has sufficient RAM, or a large
> > enough L2ARC, then Dovecot's optimizations to diminish reads become
> > meaningless.
> >
> > > Is this expected behavior given the application (email - small,
> > > random writes/reads)?  Are there recommendations for
> > > system/ZFS/NFS configurations to improve this sort of thing?  Are
> > > there best practices for structuring backups to avoid a directory
> > > walk?
> >
> > Zfs works best when whole files are re-written rather than updated
> > in place as Dovecot seems to want to do.  Either the user mailboxes
> > should be re-written entirely when they are "expunged" or else a
> > different mail storage format which writes entire files, or much
> > larger records, should be used.
> >
> > Bob
> > --
> > Bob Friesenhahn
> > bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> > GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] zpool fragmentation issues?
Hi Brent,

I'm not sure why Dovecot was chosen.  It was most likely a
recommendation by a fellow university.  I agree that it is lacking in
efficiencies in a lot of areas.  I don't think I would be successful in
suggesting a change at this point, as I have already suggested a couple
of alternatives without success.

Do you have a pointer to the "block/parity rewrite" tool mentioned
below?

bill

On Dec 15, 2009, at 9:38 PM, Brent Jones wrote:
> On Tue, Dec 15, 2009 at 5:28 PM, Bill Sprouse wrote:
> > Hi Everyone, I hope this is the right forum for this question.  A
> > customer is using a Thumper as an NFS file server to provide the
> > mail store for multiple email servers (Dovecot).  They find that
> > when a zpool is freshly created and populated with mail boxes, even
> > to the extent of 80-90% capacity, performance is ok for the users,
> > backups and scrubs take a few hours (4TB of data).  There are around
> > 100 file systems.  After running for a while (couple of months) the
> > zpool seems to get "fragmented", backups take 72 hours and a scrub
> > takes about 180 hours.  They are running mirrors with about 5TB
> > usable per pool (500GB disks).
> >
> > Being a mail store, the writes and reads are small and random.
> > Record size has been set to 8k (improved performance dramatically).
> > The backup application is Amanda.  Once backups become too tedious,
> > the remedy is to replicate the pool and start over.  Things get fast
> > again for a while.
> >
> > Is this expected behavior given the application (email - small,
> > random writes/reads)?  Are there recommendations for system/ZFS/NFS
> > configurations to improve this sort of thing?  Are there best
> > practices for structuring backups to avoid a directory walk?
> >
> > Thanks,
> > bill
>
> Any reason in particular they chose to use Dovecot with the old Mbox
> format?  Mbox has been proven many times over to be painfully slow
> when the files get larger, and in this day and age, I can't imagine
> anyone having smaller than a 50MB mailbox.  We have about 30,000
> e-mail users on various systems, and it seems the average size these
> days is approaching close to a GB.
>
> Though Dovecot has done a lot to improve the performance of Mbox
> mailboxes, Maildir might be more rounded for your system.
>
> I wonder if the "soon to be released" block/parity rewrite tool will
> "freshen" up a pool that's heavily fragmented, without having to redo
> the pools.
>
> --
> Brent Jones
> br...@servuhome.net
Re: [zfs-discuss] zpool fragmentation issues?
Just checked with the customer: they are using the Maildir
functionality with Dovecot.

On Dec 16, 2009, at 11:28 AM, Toby Thain wrote:
> On 16-Dec-09, at 10:47 AM, Bill Sprouse wrote:
> > Hi Brent, I'm not sure why Dovecot was chosen.  It was most likely a
> > recommendation by a fellow university.  I agree that it is lacking
> > in efficiencies in a lot of areas.  I don't think I would be
> > successful in suggesting a change at this point as I have already
> > suggested a couple of alternatives without success.
>
> (As Damon pointed out) The problem seems not Dovecot per se but the
> choice of mbox format, which is rather self-evidently inefficient.
>
> > Do you have a pointer to the "block/parity rewrite" tool mentioned
> > below?
>
> It headlines the informal roadmap presented by Jeff Bonwick:
>
> http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf
>
> --Toby
>
> > bill
> >
> > On Dec 15, 2009, at 9:38 PM, Brent Jones wrote:
> > > On Tue, Dec 15, 2009 at 5:28 PM, Bill Sprouse wrote:
> > > > Hi Everyone, I hope this is the right forum for this question.
> > > > A customer is using a Thumper as an NFS file server to provide
> > > > the mail store for multiple email servers (Dovecot).  They find
> > > > that when a zpool is freshly created and populated with mail
> > > > boxes, even to the extent of 80-90% capacity, performance is ok
> > > > for the users, backups and scrubs take a few hours (4TB of
> > > > data).  There are around 100 file systems.  After running for a
> > > > while (couple of months) the zpool seems to get "fragmented",
> > > > backups take 72 hours and a scrub takes about 180 hours.  They
> > > > are running mirrors with about 5TB usable per pool (500GB
> > > > disks).
> > > >
> > > > Being a mail store, the writes and reads are small and random.
> > > > Record size has been set to 8k (improved performance
> > > > dramatically).  The backup application is Amanda.  Once backups
> > > > become too tedious, the remedy is to replicate the pool and
> > > > start over.  Things get fast again for a while.
> > > >
> > > > Is this expected behavior given the application (email - small,
> > > > random writes/reads)?  Are there recommendations for
> > > > system/ZFS/NFS configurations to improve this sort of thing?
> > > > Are there best practices for structuring backups to avoid a
> > > > directory walk?
> > > >
> > > > Thanks,
> > > > bill
> > >
> > > Any reason in particular they chose to use Dovecot with the old
> > > Mbox format?  Mbox has been proven many times over to be painfully
> > > slow when the files get larger, and in this day and age, I can't
> > > imagine anyone having smaller than a 50MB mailbox.  We have about
> > > 30,000 e-mail users on various systems, and it seems the average
> > > size these days is approaching close to a GB.
> > >
> > > Though Dovecot has done a lot to improve the performance of Mbox
> > > mailboxes, Maildir might be more rounded for your system.
> > >
> > > I wonder if the "soon to be released" block/parity rewrite tool
> > > will "freshen" up a pool that's heavily fragmented, without having
> > > to redo the pools.
> > >
> > > --
> > > Brent Jones
> > > br...@servuhome.net
Re: [zfs-discuss] ZFS write bursts cause short app stalls
Thanks for this thread! I was just coming here to discuss this very same problem. I'm running 2009.06 on a Q6600 with 8GB of RAM. I have a Windows system writing multiple OTA HD video streams via CIFS to the 2009.06 system running Samba. I then have multiple clients reading back other HD video streams. The write client never skips a beat, but the read clients have constant problems getting data when the "burst" writes occur. I am now going to try the txg_timeout and see if that helps. It would be nice if these tunables were settable on a per-pool basis though. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disks and caches
On Thu, 2010-01-07 at 11:07 -0800, Anil wrote: > There is talk about using those cheap disks for rpool. Isn't rpool > also prone to a lot of writes, specifically when the /tmp is in a SSD? Huh? By default, solaris uses tmpfs for /tmp, /var/run, and /etc/svc/volatile; writes to those filesystems won't hit the SSD unless the system is short on physical memory. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Degraded pool members excluded from writes?
On 01/24/10 12:20, Lutz Schumann wrote: One can see that the degraded mirror is excluded from the writes. I think this is expected behaviour, right? (data protection over performance) That's correct. It will use the space if it needs to but it prefers to avoid "sick" top-level vdevs if there are healthy ones available. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zvol being charged for double space
On 01/27/10 21:17, Daniel Carosone wrote: This is as expected. Not expected is that: usedbyrefreservation = refreservation I would expect this to be 0, since all the reserved space has been allocated. This would be the case if the volume had no snapshots. As a result, used is over twice the size of the volume (+ a few small snapshots as well). I'm seeing essentially the same thing with a recently-created zvol with snapshots that I export via iscsi for time machine backups on a mac. % zfs list -r -o name,refer,used,usedbyrefreservation,refreservation,volsize z/tm/mcgarrett NAME REFER USED USEDREFRESERV REFRESERV VOLSIZE z/tm/mcgarrett 26.7G 88.2G 60G 60G 60G The actual volume footprint is a bit less than half of the volume size, but the refreservation ensures that there is enough free space in the pool to allow me to overwrite every block of the zvol with uncompressible data without any writes failing due to the pool being out of space. If you were to disable time-based snapshots and then overwrite a measurable fraction of the zvol, I'd expect "USEDBYREFRESERVATION" to shrink as the reserved blocks were actually used. If you want to allow for overcommit, you need to delete the refreservation. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
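To make that last point concrete -- a minimal sketch of allowing overcommit by dropping the refreservation, reusing the dataset name from the listing above:
zfs get refreservation,usedbyrefreservation z/tm/mcgarrett
zfs set refreservation=none z/tm/mcgarrett   # or a smaller value, e.g. refreservation=30g
The usual trade-off applies: once the reservation is gone, writes into the zvol can fail with ENOSPC if the pool fills up.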
Re: [zfs-discuss] server hang with compression on, ping timeouts from remote machine
On 01/31/10 07:07, Christo Kutrovsky wrote: I've also experienced similar behavior (short freezes) when running zfs send|zfs receive with compression on LOCALLY on ZVOLs again. Has anyone else experienced this? Know of any bug? This is on snv117. you might also get better results after the fix to: 6881015 ZFS write activity prevents other threads from running in a timely manner which was fixed in build 129. As a workaround, try a lower gzip compression level -- higher gzip levels usually burn lots more CPU without significantly increasing the compression ratio. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
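A sketch of that workaround (the dataset name here is just an example):
zfs get compression,compressratio tank/vol
zfs set compression=gzip-1 tank/vol   # far cheaper per block than gzip-6 or gzip-9
zfs set compression=lzjb tank/vol     # or fall back to the default algorithm
Note that the property only affects newly written blocks; existing data keeps whatever compression it was written with.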
Re: [zfs-discuss] most of my space is gone
On 02/06/10 08:38, Frank Middleton wrote: AFAIK there is no way to get around this. You can set a flag so that pkg tries to empty /var/pkg/downloads, but even though it looks empty, it won't actually become empty until you delete the snapshots, and IIRC you still have to manually delete the contents. I understand that you can try creating a separate dataset and mounting it on /var/pkg, but I haven't tried it yet, and I have no idea if doing so gets around the BE snapshot problem. You can set the environment variable PKG_CACHEDIR to place the cache in an alternate filesystem. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
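A sketch of that approach (dataset name and mountpoint are just examples):
zfs create -o mountpoint=/export/pkgcache rpool/pkgcache   # outside rpool/ROOT, so it isn't snapshotted/cloned with each BE
export PKG_CACHEDIR=/export/pkgcache
Any pkg operation run with that variable in its environment should then keep its download cache there instead of under /var/pkg in the boot environment.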
Re: [zfs-discuss] Reading ZFS config for an extended period
On 02/11/10 10:33, Lori Alt wrote: This bug is closed as a dup of another bug which is not readable from the opensolaris site, (I'm not clear what makes some bugs readable and some not). the other bug in question was opened yesterday and probably hasn't had time to propagate. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup
On 02/12/10 09:36, Felix Buenemann wrote: given I've got ~300GB L2ARC, I'd need about 7.2GB RAM, so upgrading to 8GB would be enough to satisfy the L2ARC. But that would only leave ~800MB free for everything else the server needs to do. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 02/26/10 10:45, Paul B. Henson wrote: I've already posited as to an approach that I think would make a pure-ACL deployment possible: http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html Via this concept or something else, there needs to be a way to configure ZFS to prevent the attempted manipulation of legacy permission mode bits from breaking the security policy of the ACL. I believe this proposal is sound. In it, you wrote: The feedback was that the internal Sun POSIX compliance police wouldn't like that ;). There are already per-filesystem tunables for ZFS which allow the system to escape the confines of POSIX (noatime, for one); I don't see why a "chmod doesn't truncate acls" option couldn't join it so long as it was off by default and left off while conformance tests were run. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Freeing unused space in thin provisioned zvols
On 02/26/10 11:42, Lutz Schumann wrote: Idea: - If the guest writes a block with 0's only, the block is freed again - if someone reads this block again - it will get the same 0's it would get if the 0's had been written - The checksum of an "all 0" block can be hard coded for SHA1 / Fletcher, so the comparison "is this a 0-only block?" is easy. With this in place, a host wishing to free thin provisioned zvol space can fill the unused blocks with 0s easily with simple tools (e.g. dd if=/dev/zero of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side. You've just described how ZFS behaves when compression is enabled -- a block of zeros is compressed to a hole represented by an all-zeros block pointer. > Does anyone know why this is not incorporated into ZFS? It's in there. Turn on compression to use it. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
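A rough sketch of using that behavior to shrink a thin-provisioned zvol (names are hypothetical, and it only helps if no snapshots are holding the old blocks):
zfs set compression=on tank/guestvol    # zero-filled blocks written from now on become holes
# then, inside the guest:
dd if=/dev/zero of=/MYFILE bs=1M ; rm /MYFILE
zfs get used,referenced tank/guestvol   # should shrink as the zeroed blocks are freed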
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 02/26/10 17:38, Paul B. Henson wrote: As I wrote in that new sub-thread, I see no option that isn't surprising in some way. My preference would be for what I labeled as option (b). And I think you absolutely should be able to configure your fileserver to implement your preference. Why shouldn't I be able to configure my fileserver to implement mine :)? acl-chmod interactions have been mishandled so badly in the past that i think a bit of experimentation with differing policies is in order. Based on the amount of wailing I see around acls, I think that, based on personal experience with both systems, AFS had it more or less right and POSIX got it more or less wrong -- once you step into the world of acls, the file mode should be mostly ignored, and an accidental chmod should *not* destroy carefully crafted acls. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS compression and deduplication on root pool on SSD
On 02/28/10 15:58, valrh...@gmail.com wrote: Also, I don't have the numbers to prove this, but it seems to me > that the actual size of rpool/ROOT has grown substantially since I > did a clean install of build 129a (I'm now at build 133). Without > compression, either, that was around 24 GB, but things seem > to have accumulated by an extra 11 GB or so. One common source for this is slowly accumulating files under /var/pkg/download. Clean out /var/pkg/download and delete all but the most recent boot environment to recover space (you need to do this to get the space back because the blocks are referenced by the snapshots used by each clone as its base version). To avoid this in the future, set PKG_CACHEDIR in your environment to point at a filesystem which isn't cloned by beadm -- something outside rpool/ROOT, for instance. On several systems which have two pools (root & data) I've relocated it to the data pool - it doesn't have to be part of the root pool. This has significantly slimmed down my root filesystem on systems which are chasing the dev branch of opensolaris. > At present, my rpool/ROOT has no compression, and no deduplication. I > was wondering about whether it would be a good idea, from a > performance and data integrity standpoint, to use one, the other, or > both, on the root pool. I've used the combination of copies=2 and compression=on on rpool/ROOT for a while and have been happy with the result. On one system I recently moved to an ssd root, I also turned on dedup and it seems to be doing just fine: NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT r2 37G 14.7G 22.3G 39% 1.31x ONLINE - (the relatively high dedup ratio is because I have one live upgrade BE with nevada build 130, and a beadm BE with opensolaris build 130, which is mostly the same) - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
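For reference, all of the above are ordinary per-dataset properties; a sketch, using the dataset name from the example:
zfs set compression=on rpool/ROOT
zfs set copies=2 rpool/ROOT
zfs set dedup=on rpool/ROOT   # only worthwhile if you have the RAM (or L2ARC) to hold the dedup table
zpool list rpool
Keep in mind these settings only apply to blocks written after the property is set; data already on disk keeps its old encoding until it is rewritten.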
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 03/01/10 13:50, Miles Nordin wrote: "dd" == David Dyer-Bennet writes: dd> Okay, but the argument goes the other way just as well -- when dd> I run "chmod 6400 foobar", I want the permissions set that dd> specific way, and I don't want some magic background feature dd> blocking me. This will be true either way. Even if chmod isn't ignored, it will reach into the nest of ACL's and mangle them in some non-obvious way with unpredictable consequences, and the mangling will be implemented by a magical background feature. actually, you can be surprised even if there are no acls in use -- if, unbeknownst to you, some user has been granted file_dac_read or file_dac_write privilege, they will be able to bypass the file modes for read and/or for write. Likewise if that user has been delegated zfs "send" rights on the filesystem the file is in, they'll be able to read every bit of the file. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 03/02/10 08:13, Fredrich Maney wrote: Why not do the same sort of thing and use that extra bit to flag a file, or directory, as being an ACL only file and will negate the rest of the mask? That accomplishes what Paul is looking for, without breaking the existing model for those that need/wish to continue to use it? While we're designing on the fly: Another possibility would be to use an additional umask bit or two to influence the mode-bit - acl interaction. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
On 12/05/11 10:47, Lachlan Mulcahy wrote: > zfs`lzjb_decompress 10 0.0% > unix`page_nextn 31 0.0% > genunix`fsflush_do_pages 37 0.0% > zfs`dbuf_free_range 183 0.1% > genunix`list_next 5822 3.7% > unix`mach_cpu_idle 150261 96.1% your best bet in a situation like this -- where there's a lot of cpu time spent in a generic routine -- is to use an alternate profiling method that shows complete stack traces rather than just the top function on the stack. often the names of functions two or three or four deep in the stack will point at what's really responsible. something as simple as: dtrace -n 'profile-1001 { @[stack()] = count(); }' (let it run for a bit then interrupt it). should show who's calling list_next() so much. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)
On 05/28/12 17:13, Daniel Carosone wrote: There are two problems using ZFS on drives with 4k sectors: 1) if the drive lies and presents 512-byte sectors, and you don't manually force ashift=12, then the emulation can be slow (and possibly error prone). There is essentially an internal RMW cycle when a 4k sector is partially updated. We use ZFS to get away from the perils of RMW :) 2) with ashift=12, whether forced manually or automatically because the disks present 4k sectors, ZFS is less space-efficient for metadata and keeps fewer historical uberblocks. two, more specific, problems I've run into recently: 1) if you move a disk with an ashift=9 pool on it from a controller/enclosure/.. combo where it claims to have 512 byte sectors to a path where it is detected as having 4k sectors (even if it can cope with 512-byte aligned I/O), the pool will fail to import and appear to be gravely corrupted; the error message you get will make no mention of the sector size change. Move the disk back to the original location and it imports cleanly. 2) if you have a pool with ashift=9 and a disk dies, and the intended replacement is detected as having 4k sectors, it will not be possible to attach the disk as a replacement drive. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] "shareiscsi" and COMSTAR
On Tue, Jun 26, 2012 at 1:47 PM, Jim Klimov wrote: > 1) Is COMSTAR still not-integrated with shareiscsi ZFS attributes? > Or can the pool use the attribute, and the correct (new COMSTAR) > iSCSI target daemon will fire up? I can't speak for Solaris 11, but for illumos, you need to use the stmfadm, itadm, and related tools, not the shareiscsi ZFS property. > 2) What would be the best way to migrate iSCSI server configuration > (LUs, views, allowed client lists, etc.) - is it sufficient to > just export the SMF config of "stmf" service, or do I also need > some other services and/or files (/etc/iscsi, something else?) If you're migrating from the old iSCSI target daemon to COMSTAR, I would recommend doing the migration manually and rebuilding the iSCSI configuration. While this blog entry is written to users of the 7000 series storage appliance, it may be useful as you're thinking about how to proceed: https://blogs.oracle.com/wdp/entry/comstar_iscsi - Bill -- Bill Pijewski, Joyent http://dtrace.org/blogs/wdp/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/12 02:10, Sašo Kiselkov wrote: > Oh jeez, I can't remember how many times this flame war has been going > on on this list. Here's the gist: SHA-256 (or any good hash) produces a > near uniform random distribution of output. Thus, the chances of getting > a random hash collision are around 2^-256 or around 10^-77. I think you're correct that most users don't need to worry about this -- sha-256 dedup without verification is not going to cause trouble for them. But your analysis is off. You're citing the chance that two blocks picked at random will have the same hash. But that's not what dedup does; it compares the hash of a new block to a possibly-large population of other hashes, and that gets you into the realm of "birthday problem" or "birthday paradox". See http://en.wikipedia.org/wiki/Birthday_problem for formulas. So, maybe somewhere between 10^-50 and 10^-55 for there being at least one collision in really large collections of data - still not likely enough to worry about. Of course, that assumption goes out the window if you're concerned that an adversary may develop practical ways to find collisions in sha-256 within the deployment lifetime of a system. sha-256 is, more or less, a scaled-up sha-1, and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from 2^160 bits of hash; the best credible attack is somewhere around 2^57.5 (see http://en.wikipedia.org/wiki/SHA-1#SHA-1). on a somewhat less serious note, perhaps zfs dedup should contain "chinese lottery" code (see http://tools.ietf.org/html/rfc3607 for one explanation) which asks the sysadmin to report a detected sha-256 collision to eprint.iacr.org or the like... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
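For a rough sense of where numbers in that range come from: with n unique blocks and a 256-bit hash, the usual birthday approximation gives p(at least one collision) ≈ 1 - exp(-n(n-1)/2^257) ≈ n^2 / 2^257. Plugging in, say, n = 2^40 blocks gives p ≈ 2^80 / 2^257 = 2^-177, or roughly 10^-53 -- consistent with the range above; the exact figure obviously depends on how many unique blocks you assume.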
Re: [zfs-discuss] Very poor small-block random write performance
On 07/19/12 18:24, Traffanstead, Mike wrote: iozone doesn't vary the blocksize during the test, it's a very artificial test but it's useful for gauging performance under different scenarios. So for this test all of the writes would have been 64k blocks, 128k, etc. for that particular step. Just as another point of reference I reran the test with a Crucial M4 SSD and the results for 16G/64k were 35mB/s (x5 improvement). I'll rerun that part of the test with zpool iostat and see what it says. For random writes to work without forcing a lot of read i/o and read-modify-write sequences, set the recordsize on the filesystem used for the test to match the iozone recordsize. For instance: zfs set recordsize=64k $fsname and ensure that the files used for the test are re-created after you make this setting change ("recordsize" is sticky at file creation time). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zvol vs zfs send/zfs receive
On 09/14/12 22:39, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Dave Pooser Unfortunately I did not realize that zvols require disk space sufficient to duplicate the zvol, and my zpool wasn't big enough. After a false start (zpool add is dangerous when low on sleep) I added a 250GB mirror and a pair of 3GB mirrors to miniraid and was able to successfully snapshot the zvol: miniraid/RichRAID@exportable This doesn't make any sense to me. The snapshot should not take up any (significant) space on the sending side. It's only on the receiving side, trying to receive a snapshot, that you require space. Because it won't clobber the existing zvol on the receiving side until the complete new zvol was received to clobber it with. But simply creating the snapshot on the sending side should be no problem. By default, zvols have reservations equal to their size (so that writes don't fail due to the pool being out of space). Creating a snapshot in the presence of a reservation requires reserving enough space to overwrite every block on the device. You can remove or shrink the reservation if you know that the entire device won't be overwritten. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
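A sketch of that workaround, using the names from the message above:
zfs get volsize,refreservation,usedbyrefreservation miniraid/RichRAID
zfs set refreservation=none miniraid/RichRAID   # or some smaller value you can live with
zfs snapshot miniraid/RichRAID@exportable
The usual caveat: without the reservation, writes into the zvol can fail if the pool itself runs out of space.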
Re: [zfs-discuss] ZFS with Equallogic storage
On 08/21/10 10:14, Ross Walker wrote: I am trying to figure out the best way to provide both performance and resiliency given the Equallogic provides the redundancy. (I have no specific experience with Equallogic; the following is just generic advice) Every bit stored in zfs is checksummed at the block level; zfs will not use data or metadata if the checksum doesn't match. zfs relies on redundancy (storing multiple copies) to provide resilience; if it can't independently read the multiple copies and pick the one it likes, it can't recover from bitrot or failure of the underlying storage. if you want resilience, zfs must be responsible for redundancy. You imply having multiple storage servers. The simplest thing to do is export one large LUN from each of two different storage servers, and have ZFS mirror them. While this reduces the available space, depending on your workload, you can make some of it back by enabling compression. And, given sufficiently recent software, and sufficient memory and/or ssd for l2arc, you can enable dedup. Of course, the effectiveness of both dedup and compression depends on your workload. Would I be better off forgoing resiliency for simplicity, putting all my faith into the Equallogic to handle data resiliency? IMHO, no; the resulting system will be significantly more brittle. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
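A minimal sketch of that layout (the device names standing in for the two iSCSI LUNs are made up):
zpool create tank mirror c2t0d0 c3t0d0   # one LUN from each storage server
zfs set compression=on tank
# and, with sufficiently recent bits plus enough RAM/L2ARC:
# zfs set dedup=on tank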
Re: [zfs-discuss] resilver = defrag?
On 09/09/10 20:08, Edward Ned Harvey wrote: Scores so far: 2 No 1 Yes No. resilver does not re-layout your data or change what's in the block pointers on disk. if it was fragmented before, it will be fragmented after. C) Does zfs send | zfs receive mean it will defrag? Scores so far: 1 No 2 Yes "maybe". If there is sufficient contiguous freespace in the destination pool, files may be less fragmented. But if you do incremental sends of multiple snapshots, you may well replicate some or all of the fragmentation on the origin (because snapshots only copy the blocks that change, and receiving an incremental send does the same). And if the destination pool is short on space you may end up more fragmented than the source. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] How do you use >1 partition on x86?
So when I built my new workstation last year, I partitioned the one and only disk in half, 50% for Windows, 50% for 2009.06. Now, I'm not using Windows, so I'd like to use the other half for another ZFS pool, but I can't figure out how to access it. I have used fdisk to create a second Solaris2 partition, did a reconfiguration reboot, but format still only shows the 1 available partition. How do I use the second partition? selecting c7t0d0 Total disk size is 30401 cylinders Cylinder size is 16065 (512 byte) blocks Cylinders Partition Status Type Start End Length % 1 Other OS 0 4 5 0 2 IFS: NTFS 5 1917 1913 6 3 Active Solaris2 1917 14971 13055 43 4 Solaris2 14971 30170 15200 50 format Searching for disks...done AVAILABLE DISK SELECTIONS: 0. c7t0d0 /p...@0,0/pci1028,2...@1f,2/d...@0,0 Thanks for any ideas. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
On 11/17/10 12:04, Miles Nordin wrote: black-box crypto is snake oil at any level, IMNSHO. Absolutely. Congrats again on finishing your project, but every other disk encryption framework I've seen taken remotely seriously has a detailed paper describing the algorithm, not just a list of features and a configuration guide. It should be a requirement for anything treated as more than a toy. I might have missed yours, or maybe it's coming soon. In particular, the mechanism by which dedup-friendly block IV's are chosen based on the plaintext needs public scrutiny. Knowing Darren, it's very likely that he got it right, but in crypto, all the details matter and if a spec detailed enough to allow for interoperability isn't available, it's safest to assume that some of the details are wrong. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
> > got it attached to a UPS with very conservative > shut-down timing. Or > > are there other host failures aside from power a > ZIL would be > > vulnerable to (system hard-locks?)? > > Correct, a system hard-lock is another example... How about comparing a non-battery-backed ZIL to running a ZFS dataset with sync=disabled. Which is more risky? This has been an educational thread for me... I was not aware that SSD drives had some DRAM in front of the SSD part? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] BOOT, ZIL, L2ARC one one SSD?
60GB SSD drives using the SF 1222 controller can be had now for around $100. I know ZFS likes to use the entire disk to do its magic, but under X86, is the entire disk the entire disk, or is it one physical X86 partition? In the past I have created 2 partitions with FDISK, but format will only show one of them? Did I do something wrong, or is that the way it works? So, maybe what I want to do won't work. But this is my thought: on a single 60GB SSD drive, use FDISK to create 3 physical partitions, a 20GB for boot, a 30GB for L2ARC and a 10GB for ZIL? Or is 3 physical Solaris partitions on a disk not considered the entire disk as far as ZFS is concerned? Can a ZIL and/or L2ARC be shared amongst 1+ ZPOOLs, or must each pool have its own? If each pool must have its own, can a disk be partitioned so a single fast SSD can be shared amongst 1+ pools? Thanks -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] BOOT, ZIL, L2ARC one one SSD?
Understood Edward, and if this was a production data center, I wouldn't be doing it this way. This is for my home lab, so spending hundreds of dollars on SSD devices isn't practical. Can several datasets share a single ZIL and a single L2ARC, or must each dataset have its own? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS advice for laptop
On 01/04/11 18:40, Bob Friesenhahn wrote: Zfs will disable write caching if it sees that a partition is being used This is backwards. ZFS will enable write caching on a disk if a single pool believes it owns the whole disk. Otherwise, it will do nothing to caching. You can enable it yourself with the format command and ZFS won't disable it. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
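For reference, the manual route is format's expert mode; roughly (the exact menus depend on the disk and driver, so treat this as a sketch):
format -e
# pick the disk, then navigate: format> cache -> write_cache -> display / enable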
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 11:49, Yi Zhang wrote: The reason why I tried that is to get the side effect of no buffering, which is my ultimate goal. ultimate = "final". you must have a goal beyond the elimination of buffering in the filesystem. if the writes are made durable by zfs when you need them to be durable, why does it matter that it may buffer data while it is doing so? - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 12:49, Yi Zhang wrote: If buffering is on, the running time of my app doesn't reflect the actual I/O cost. My goal is to accurately measure the time of I/O. With buffering on, ZFS would batch up a bunch of writes and change both the original I/O activity and the time. if batching main pool writes improves the overall throughput of the system over a more naive i/o scheduling model, don't you want your users to see the improvement in performance from that batching? why not set up a steady-state sustained workload that will run for hours, and measure how long it takes the system to commit each 1000 or 1 transactions in the middle of the steady state workload? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS send/recv initial data load
On 02/16/11 07:38, white...@gmail.com wrote: Is it possible to use a portable drive to copy the initial zfs filesystem(s) to the remote location and then make the subsequent incrementals over the network? Yes. > If so, what would I need to do to make sure it is an exact copy? Thank you, Rough outline: plug removable storage into source or a system near the source. zpool create backup pool on removable storage use an appropriate combination of zfs send & zfs receive to copy bits. zpool export backup pool. unplug removable storage move it plug it in to remote server zpool import backup pool use zfs send -i to verify that incrementals work (I did something like the above when setting up my home backup because I initially dinked around with the backup pool hooked up to a laptop and then moved it to a desktop system). optional: use zpool attach to mirror the removable storage to something faster/better/..., then after the mirror completes zpool detach to free up the removable storage. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
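A sketch of that outline in commands (pool, device, and snapshot names are made up):
zpool create backup c5t0d0                          # the removable drive
zfs snapshot -r tank@seed
zfs send -R tank@seed | zfs receive -d -F backup
zpool export backup
# physically move the drive, then at the remote site:
zpool import backup
# later, for the incrementals over the network:
zfs snapshot -r tank@incr1
zfs send -R -i tank@seed tank@incr1 | ssh remotehost zfs receive -d -F backup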
[zfs-discuss] time-sliderd doesn't remove snapshots
In the last few days my performance has gone to hell. I'm running: # uname -a SunOS nissan 5.11 snv_150 i86pc i386 i86pc (I'll upgrade as soon as the desktop hang bug is fixed.) The performance problems seem to be due to excessive I/O on the main disk/pool. The only things I've changed recently are that I've created and destroyed a snapshot, and I used "zpool upgrade". Here's what I'm seeing: # zpool iostat rpool 5 capacity operations bandwidth pool alloc free read write read write -- - - - - - - rpool 13.3G 807M 7 85 15.9K 548K rpool 13.3G 807M 3 89 1.60K 723K rpool 13.3G 810M 5 91 5.19K 741K rpool 13.3G 810M 3 94 2.59K 756K Using iofileb.d from the dtrace toolkit shows: # iofileb.d Tracing... Hit Ctrl-C to end. ^C PID CMD KB FILE 0 sched 6 5 zpool-rpool 7770 zpool status doesn't show any problems: # zpool status rpool pool: rpool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 c3d0s0 ONLINE 0 0 0 Perhaps related to this or perhaps not, I discovered recently that time-sliderd was doing just a ton of "close" requests. I disabled time-sliderd while trying to solve my performance problem. I was also getting these error messages in the time-sliderd log file: Warning: Cleanup failed to destroy: rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01 Details: ['/usr/bin/pfexec', '/usr/sbin/zfs', 'destroy', '-d', 'rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01'] failed with exit code 1 cannot destroy 'rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01': unsupported version That was the reason I did the zpool upgrade. I discovered that I had a *ton* of snapshots from time-slider that hadn't been destroyed, over 6500 of them, presumably all because of this version problem? I manually removed all the snapshots and my performance returned to normal. I don't quite understand what the "-d" option to "zfs destroy" does. Why does time-sliderd use it, and why does it prevent these snapshots from being destroyed? Shouldn't time-sliderd detect that it can't destroy any of the snapshots it's created and stop creating snapshots? And since I don't quite understand why time-sliderd was failing to begin with, I'm nervous about re-enabling it. Do I need to do a "zpool upgrade" on all my pools to make it work? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] time-sliderd doesn't remove snapshots
One of my old pools was version 10, another was version 13. I guess that explains the problem. Seems like time-sliderd should refuse to run on pools that aren't of a sufficient version. Cindy Swearingen wrote on 02/18/11 12:07 PM: Hi Bill, I think the root cause of this problem is that time slider implemented the zfs destroy -d feature but this feature is only available in later pool versions. This means that the routine removal of time slider generated snapshots fails on older pool versions. The zfs destroy -d feature (snapshot user holds) was introduced in pool version 18. I think this bug describes some or all of the problem: https://defect.opensolaris.org/bz/show_bug.cgi?id=16361 Thanks, Cindy On 02/18/11 12:34, Bill Shannon wrote: In the last few days my performance has gone to hell. I'm running: # uname -a SunOS nissan 5.11 snv_150 i86pc i386 i86pc (I'll upgrade as soon as the desktop hang bug is fixed.) The performance problems seem to be due to excessive I/O on the main disk/pool. The only things I've changed recently is that I've created and destroyed a snapshot, and I used "zpool upgrade". Here's what I'm seeing: # zpool iostat rpool 5 capacity operationsbandwidth poolalloc free read write read write -- - - - - - - rpool 13.3G 807M 7 85 15.9K 548K rpool 13.3G 807M 3 89 1.60K 723K rpool 13.3G 810M 5 91 5.19K 741K rpool 13.3G 810M 3 94 2.59K 756K Using iofileb.d from the dtrace toolkit shows: # iofileb.d Tracing... Hit Ctrl-C to end. ^C PID CMD KB FILE 0 sched 6 5 zpool-rpool7770 zpool status doesn't show any problems: # zpool status rpool pool: rpool state: ONLINE scan: none requested config: NAMESTATE READ WRITE CKSUM rpool ONLINE 0 0 0 c3d0s0ONLINE 0 0 0 Perhaps related to this or perhaps not, I discovered recently that time-sliderd was doing just a ton of "close" requests. I disabled time-sliderd while trying to solve my performance problem. I was also getting these error messages in the time-sliderd log file: Warning: Cleanup failed to destroy: rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01 Details: ['/usr/bin/pfexec', '/usr/sbin/zfs', 'destroy', '-d', 'rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01'] failed with exit code 1 cannot destroy 'rpool/ROOT@zfs-auto-snap_hourly-2010-11-10-15h01': unsupported version That was the reason I did the zpool upgrade. I discovered that I had a *ton* of snapshots from time-slider that hadn't been destroyed, over 6500 of them, presumably all because of this version problem? I manually removed all the snapshots and my performance returned to normal. I don't quite understand what the "-d" option to "zfs destroy" does. Why does time-sliderd use it, and why does it prevent these snapshots from being destroyed? Shouldn't time-sliderd detect that it can't destroy any of the snapshots it's created and stop creating snapshots? And since I don't quite understand why time-sliderd was failing to begin with, I'm nervous about re-enabling it. Do I need to do a "zpool upgrade" on all my pools to make it work? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Format returning bogus controller info
On 02/26/11 17:21, Dave Pooser wrote: While trying to add drives one at a time so I can identify them for later use, I noticed two interesting things: the controller information is unlike any I've seen before, and out of nine disks added after the boot drive all nine are attached to c12 -- and no single controller has more than eight ports. on your system, c12 is the mpxio virtual controller; any disk which is potentially multipath-able (and that includes the SAS drives) will appear as a child of the virtual controller (rather than appear as the child of two or more different physical controllers). see stmsboot(1m) for information on how to turn that off if you don't need multipathing and don't like the longer device names. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
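For reference, a sketch of checking and changing that (see stmsboot(1M) first; the change takes effect on reboot):
stmsboot -L   # list the mapping between non-STMS and STMS device names
stmsboot -d   # disable mpxio on supported HBAs if you don't need multipathing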
[zfs-discuss] Old posts to zfs-discuss
Sorry for the old posts that some of you are seeing to zfs-discuss. The link between Jive and mailman was broken so I fixed that. However, once this was fixed Jive started sending every single post from the zfs-discuss board on Jive to the mail list. Quite a few posts were sent before I realized what was happening and was able to kill the process. Bill Rushmore ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] not sure how to make filesystems
I'm migrating some filesystems from UFS to ZFS and I'm not sure how to create a couple of them. I want to migrate /, /var, /opt, /export/home and also want swap and /tmp. I don't care about any of the others. The first disk, and the one with the UFS filesystems, is c0t0d0 and the 2nd disk is c0t1d0. I've been told that /tmp is supposed to be part of swap. So far I have: lucreate -m /:/dev/dsk/c0t0d0s0:ufs -m /var:/dev/dsk/c0t0d0s3:ufs -m /export/home:/dev/dsk/c0t0d0s5:ufs -m /opt:/dev/dsk/c0t0d0s4:ufs -m -:/dev/dsk/c0t1d0s2:swap -m /tmp:/dev/dsk/c0t1d0s3:swap -n zfsBE -p rootpool And then set quotas for them. Is this right? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is another drive worth anything?
On 05/31/11 09:01, Anonymous wrote: > Hi. I have a development system on Intel commodity hardware with a 500G ZFS > root mirror. I have another 500G drive same as the other two. Is there any > way to use this disk to good advantage in this box? I don't think I need any > more redundancy, I would like to increase performance if possible. I have > only one SATA port left so I can only use 3 drives total unless I buy a PCI > card. Would you please advise me. Many thanks. I'd use the extra SATA port for an ssd, and use that ssd for some combination of boot/root, ZIL, and L2ARC. I have a couple systems in this configuration now and have been quite happy with the config. While slicing an ssd and using one slice for root, one slice for zil, and one slice for l2arc isn't optimal from a performance standpoint and won't scale up to a larger configuration, it is a noticeable improvement from a 2-disk mirror. I used an 80G intel X25-M, with 1G for zil, with the rest split roughly 50:50 between root pool and l2arc for the data pool. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
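A sketch of wiring the slices up that way (slice numbers are examples; the root pool lives on whatever slice you gave the installer):
zpool add tank log c1t1d0s1     # the ~1GB slice as slog/ZIL
zpool add tank cache c1t1d0s2   # the remaining slice as L2ARC
zpool status tank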
Re: [zfs-discuss] Available space confusion
On 06/06/11 08:07, Cyril Plisko wrote: zpool reports space usage on disks, without taking into account RAIDZ overhead. zfs reports net capacity available, after RAIDZ overhead is accounted for. Yup. Going back to the original numbers: nebol@filez:/$ zfs list tank2 NAME USED AVAIL REFER MOUNTPOINT tank2 3.12T 902G 32.9K /tank2 Given that it's a 4-disk raidz1, you have (roughly) one block of parity for every three blocks of data. 3.12T / 3 = 1.04T so 3.12T + 1.04T = 4.16T, which is close to the 4.18T shown by zpool list: NAME SIZE USED AVAIL CAP HEALTH ALTROOT tank2 5.44T 4.18T 1.26T 76% ONLINE ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wired write performance problem
On 06/08/11 01:05, Tomas Ögren wrote: And if pool usage is >90%, then there's another problem (a change in the free-space-finding algorithm). Another (less satisfying) workaround is to increase the amount of free space in the pool, either by reducing usage or adding more storage. Observed behavior is that allocation is fast until usage crosses a threshold, then performance hits a wall. I have a small sample size (maybe 2-3 samples), but the threshold point varies from pool to pool but tends to be consistent for a given pool. I suspect some artifact of layout/fragmentation is at play. I've seen things hit the wall at as low as 70% on one pool. The original poster's pool is about 78% full. If possible, try freeing stuff until usage goes back under 75% or 70% and see if your performance returns. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disk replacement need to scan full pool ?
On 06/14/11 04:15, Rasmus Fauske wrote: > I want to replace some slow consumer drives with new edc re4 ones but > when I do a replace it needs to scan the full pool and not only that > disk set (or just the old drive) > > Is this normal? (the speed is always slow in the start so that's not > what I am wondering about, but that it needs to scan all of my 18.7T to > replace one drive) This is normal. The resilver is not reading all data blocks; it's reading all of the metadata blocks which contain one or more block pointers, which is the only way to find all the allocated data (and in the case of raidz, know precisely how it's spread and encoded across the members of the vdev). And it's reading all the data blocks needed to reconstruct the disk to be replaced. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenIndiana | ZFS | scrub | network | awful slow
On 06/16/11 15:36, Sven C. Merckens wrote: > But is the L2ARC also important while writing to the device? Because > the storage systems are used most of the time only for writing data to them, > the read cache (as I thought) isn't a performance factor... Please > correct me, if my thoughts are wrong. if you're using dedup, you need a large read cache even if you're only doing application-layer writes, because you need fast random read access to the dedup tables while you write. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 06/27/11 15:24, David Magda wrote: > Given the amount of transistors that are available nowadays I think > it'd be simpler to just create a series of SIMD instructions right > in/on general CPUs, and skip the whole co-processor angle. see: http://en.wikipedia.org/wiki/AES_instruction_set Present in many current Intel CPUs; also expected to be present in AMD's "Bulldozer" based CPUs. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] "zfs diff" performance disappointing
On 09/26/11 12:31, Nico Williams wrote: > On Mon, Sep 26, 2011 at 1:55 PM, Jesus Cea wrote: >> Should I disable "atime" to improve "zfs diff" performance? (most data >> doesn't change, but "atime" of most files would change). > > atime has nothing to do with it. based on my experiences with time-based snapshots and atime on a server which had cron-driven file tree walks running every night, I can easily believe atime has a lot to do with it - the atime updates associated with a tree walk will mean that that much of a filesystem's metadata will diverge between the writeable filesystem and its last snapshot. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
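So if the nightly tree walks can't be avoided, turning atime off on the datasets being walked is worth trying (dataset name hypothetical):
zfs set atime=off tank/export
zfs diff tank/export@snap1 tank/export@snap2   # compare timings once a new snapshot pair exists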
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors: pool: r00t state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h26m with 1 errors on Thu Jul 17 14:52:14 2008 config: NAME STATE READ WRITE CKSUM r00t ONLINE 0 0 2 mirror ONLINE 0 0 2 c4t0d0s0 ONLINE 0 0 4 c4t1d0s0 ONLINE 0 0 4 I ran it again, and it's now reporting the same errors, but still says "applications are unaffected": pool: r00t state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h27m with 2 errors on Thu Jul 17 20:24:15 2008 config: NAME STATE READ WRITE CKSUM r00t ONLINE 0 0 4 mirror ONLINE 0 0 4 c4t0d0s0 ONLINE 0 0 8 c4t1d0s0 ONLINE 0 0 8 errors: No known data errors I wonder if I'm running into some combination of: 6725341 Running 'zpool scrub' repeatedly on a pool show an ever increasing error count and maybe: 6437568 ditto block repair is incorrectly propagated to root vdev Any way to dig further to determine what's going on? - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94
On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote: > > I ran a scrub on a root pool after upgrading to snv_94, and got checksum > > errors: > > Hmm, after reading this, I started a zpool scrub on my mirrored pool, > on a system that is running post snv_94 bits: It also found checksum errors > > # zpool status files > pool: files > state: DEGRADED > status: One or more devices has experienced an unrecoverable error. An > attempt was made to correct the error. Applications are unaffected. > action: Determine if the device needs to be replaced, and clear the errors > using 'zpool clear' or replace the device with 'zpool replace'. >see: http://www.sun.com/msg/ZFS-8000-9P > scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008 > config: > > NAME STATE READ WRITE CKSUM > files DEGRADED 0 018 > mirror DEGRADED 0 018 > c8t0d0s6 DEGRADED 0 036 too many errors > c9t0d0s6 DEGRADED 0 036 too many errors > > errors: No known data errors out of curiosity, is this a root pool? A second system of mine with a mirrored root pool (and an additional large multi-raidz pool) shows the same symptoms on the mirrored root pool only. once is accident. twice is coincidence. three times is enemy action :-) I'll file a bug as soon as I can (I'm travelling at the moment with spotty connectivity), citing my and your reports. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can I trust ZFS?
On Sun, 2008-08-03 at 11:42 -0500, Bob Friesenhahn wrote: > Zfs makes human error really easy. For example > >$ zpool destroy mypool Note that "zpool destroy" can be undone by "zpool import -D" (if you get to it before the disks are overwritten). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
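A sketch of what that recovery looks like, using the pool name from the example:
zpool import -D          # lists destroyed pools that are still recoverable
zpool import -D mypool   # re-imports the destroyed pool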
Re: [zfs-discuss] Checksum error: which of my files have failed scrubbing?
On Tue, 2008-08-05 at 12:11 -0700, soren wrote: > > soren wrote: > > > ZFS has detected that my root filesystem has a > > small number of errors. Is there a way to tell which > > specific files have been corrupted? > > > > After a scrub a zpool status -v should give you a > > list of files with > > unrecoverable errors. > > Hmm, I just tried that. Perhaps "No known data errors" means that my files > are OK. In that case I wonder what the checksum failure was from. If this is build 94 and you have one or more unmounted filesystems, (such as alternate boot environments), these errors are false positives. There is no actual error; the scrubber misinterpreted the end of an intent log block chain as a checksum error. the bug id is: 6727872 zpool scrub: reports checksum errors for pool with zfs and unplayed ZIL This bug is fixed in build 95. One workaround is to mount the filesystems and then unmount them to apply the intent log changes. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Block unification in ZFS
See the long thread titled "ZFS deduplication", last active approximately 2 weeks ago. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] more ZFS recovery
On Thu, 2008-08-07 at 11:34 -0700, Richard Elling wrote: > How would you describe the difference between the data recovery > utility and ZFS's normal data recovery process? I'm not Anton but I think I see what he's getting at. Assume you have disks which once contained a pool but all of the uberblocks have been clobbered. So you don't know where the root of the block tree is, but all the actual data is there, intact, on the disks. Given the checksums you could rebuild one or more plausible structure of the pool from the bottom up. I'd think that you could construct an offline zpool data recovery tool where you'd start with N disk images and a large amount of extra working space, compute checksums of all possible data blocks on the images, scan the disk images looking for things that might be valid block pointers, and attempt to stitch together subtrees of the filesystem and recover as much as you can even if many upper nodes in the block tree have had holes shot in them by a miscreant device. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best layout for 15 disks?
On Thu, 2008-08-21 at 21:15 -0700, mike wrote: > I've seen 5-6 disk zpools are the most recommended setup. This is incorrect. Much larger zpools built out of striped redundant vdevs (mirror, raidz1, raidz2) are recommended and also work well. raidz1 or raidz2 vdevs of more than a single-digit number of drives are not recommended. so, for instance, the following is an appropriate use of 12 drives in two raidz2 sets of 6 disks, with 8 disks worth of raw space available: zpool create mypool raidz2 disk0 disk1 disk2 disk3 disk4 disk5 zpool add mypool raidz2 disk6 disk7 disk8 disk9 disk10 disk11 > In traditional RAID terms, I would like to do RAID5 + hot spare (13 > disks usable) out of the 15 disks (like raidz2 I suppose). What would > make the most sense to setup 15 disks with ~ 13 disks of usable space? Enable compression, and set up multiple raidz2 groups. Depending on what you're storing, you may get back more than you lose to parity. > This is for a home fileserver, I do not need HA/hotplugging/etc. so I > can tolerate a failure and replace it with plenty of time. It's not > mission critical. That's a lot of spindles for a home fileserver. I'd be inclined to go with a smaller number of larger disks in mirror pairs, allowing me to buy larger disks in pairs as they come on the market to increase capacity. > Same question, but 10 disks, and I'd sacrifice one for parity then. > Not two. so ~9 disks usable roughly (like raidz) zpool create mypool raidz1 disk0 disk1 disk2 disk3 disk4 zpool add mypool raidz1 disk5 disk6 disk7 disk8 disk9 8 disks raw capacity, can survive the loss of any one disk or the loss of two disks in different raidz groups. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote: > A better option would be to not use this to perform FMA diagnosis, but > instead work into the mirror child selection code. This has already > been alluded to before, but it would be cool to keep track of latency > over time, and use this to both a) prefer one drive over another when > selecting the child and b) proactively timeout/ignore results from one > child and select the other if it's taking longer than some historical > standard deviation. This keeps away from diagnosing drives as faulty, > but does allow ZFS to make better choices and maintain response times. > It shouldn't be hard to keep track of the average and/or standard > deviation and use it for selection; proactively timing out the slow I/Os > is much trickier. tcp has to solve essentially the same problem: decide when a response is "overdue" based only on the timing of recent successful exchanges in a context where it's difficult to make assumptions about "reasonable" expected behavior of the underlying network. it tracks both the smoothed round trip time and the variance, and declares a response overdue after (SRTT + K * variance). I think you'd probably do well to start with something similar to what's described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on experience. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 12:00 -0700, Richard Elling wrote: > 2. The algorithm *must* be computationally efficient. >We are looking down the tunnel at I/O systems that can >deliver on the order of 5 Million iops. We really won't >have many (any?) spare cycles to play with. If you pick the constants carefully (powers of two) you can do the TCP RTT + variance estimation using only a handful of shifts, adds, and subtracts. > In both of these cases, the solutions imply multi-minute timeouts are > required to maintain a stable system. Again, there are different uses for timeouts: 1) how long should we wait on an ordinary request before deciding to try "plan B" and go elsewhere (a la B_FAILFAST) 2) how long should we wait (while trying all alternatives) before declaring an overall failure and giving up. The RTT estimation approach is really only suitable for the former, where you have some alternatives available (retransmission in the case of TCP; trying another disk in the case of mirrors, etc.,). when you've tried all the alternatives and nobody's responding, there's no substitute for just retrying for a long time. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 15:03 -0400, Miles Nordin wrote: > It's sort of like network QoS, but not quite, because: > > (a) you don't know exactly how big the ``pipe'' is, only > approximately, In an ip network, end nodes generally know no more than the pipe size of the first hop -- and in some cases (such as true CSMA networks like classical ethernet or wireless) only have an upper bound on the pipe size. beyond that, they can only estimate the characteristics of the rest of the network by observing its behavior - all they get is end-to-end latency, and *maybe* a 'congestion observed' mark set by an intermediate system. > (c) all the fabrics are lossless, so while there are queues which > undesireably fill up during congestion, these queues never drop > ``packets'' but instead exert back-pressure all the way up to > the top of the stack. hmm. I don't think the back pressure makes it all the way up to zfs (the top of the block storage stack) except as added latency. (on the other hand, if it did, zfs could schedule around it both for reads and writes, avoiding pouring more work on already-congested paths..) > I'm surprised we survive as well as we do without disk QoS. Are the > storage vendors already doing it somehow? I bet that (as with networking) in many/most cases overprovisioning the hardware and running at lower average utilization is often cheaper in practice than running close to the edge and spending a lot of expensive expert time monitoring performance and tweaking QoS parameters. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss