Re: [zfs-discuss] Re: ZFS RAID10

2006-08-11 Thread Roch


RM:
  > I do not understand - why in some cases with smaller block writing
  > block twice could be actually faster than doing it once every time?
  > I definitely am missing something here...

In addition to what Neil said, I want to add that

when an application's O_DSYNC write covers only part of a file
record, you have the choice of issuing a log I/O that
contains only the newly written data, or doing a full-record
I/O (using the up-to-date cached record) along with a small
log I/O to match.

So if you do 8K writes to a file stored using 128K records,
you really want each 8K write to go to the log and then,
every txg, take the current state of the record and issue a
single I/O for it. You certainly don't want to issue a 128K
I/O for every 8K write.

But if you do a 100K write, it's not as clear a win.
Should we cough up the full 128K I/O now, hoping that the
record will not be modified again before the txg clock
hits? That's part of what goes into zfs_immediate_write_sz.

And even for full-record writes, there are some block
allocation issues that come into play and complicate things
further.
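
For anyone who wants to experiment with that cutoff, a rough sketch
(from memory, assuming a 64-bit kernel so the ssize_t is 8 bytes;
double-check the mdb format letters on your build):

# echo 'zfs_immediate_write_sz/J' | mdb -k            (print the current value, in hex)
# echo 'zfs_immediate_write_sz/Z 0x10000' | mdb -kw   (set it to 64K on the live kernel)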

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] Re: ZFS RAID10

2006-08-11 Thread Roch

Robert Milkowski writes:
 > Hello Neil,
 > 
 > Thursday, August 10, 2006, 7:02:58 PM, you wrote:
 > 
 > NP> Robert Milkowski wrote:
 > >> Hello Matthew,
 > >> 
 > >> Thursday, August 10, 2006, 6:55:41 PM, you wrote:
 > >> 
 > >> MA> On Thu, Aug 10, 2006 at 06:50:45PM +0200, Robert Milkowski wrote:
 > >> 
 > btw: wouldn't it be possible to write the block only once (for synchronous
 > IO) and then just point to that block instead of copying it again?
 > >> 
 > >> 
 > >> MA> We actually do exactly that for larger (>32k) blocks.
 > >> 
 > >> Why such limit (32k)?
 > 
 > NP> By experimentation that was the cutoff where it was found to be
 > NP> more efficient. It was recently reduced from 64K with a more
 > NP> efficient dmu_sync() implementation.
 > NP> Feel free to experiment with the dynamically changeable tunable:
 > 
 > NP> ssize_t zfs_immediate_write_sz = 32768;
 > 
 > 
 > I've just checked using dtrace on one of production nfs servers that
 > 90% of the time arg5 in zfs_log_write() is exactly 32768 and the rest
 > is always smaller.

Those should not be O_DSYNC, though. Are they?

The I/O should be deferred to a subsequent COMMIT, but I'm
not sure how it's handled in that case.
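
If it helps, one quick way to break the sizes down with DTrace (just a
sketch, assuming, as Robert did, that arg5 of zfs_log_write() is the
write length, and that arg6 is the ioflag):

# dtrace -n 'fbt::zfs_log_write:entry { @[arg6] = quantize(arg5); }'

Comparing the arg6 keys against the FDSYNC/FSYNC flag bits would tell
you whether those 32K writes are in fact synchronous.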


-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Removing a device from a zfs pool

2006-08-11 Thread Louwtjie Burger
Hi there

Has any consideration been given to this feature...?

I would also agree that this will not only be a "testing" feature, but will 
find its way into production.

It would probably work on the same principle as swap -a and swap -d ;) Just a 
little bit more complex.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance using slices vs. entire disk?

2006-08-11 Thread Roch

Darren:

 > > With all of the talk about performance problems due to
 > > ZFS doing a sync to force the drives to commit to data
 > > being on disk, how much of a benefit is this - especially
 > > for NFS?

I would not call those things problems; it's more a matter of
setting proper expectations.

My understanding is that enabling the write cache helps by
providing I/O concurrency for drives that do not implement
some other form of command queuing. In other cases, WCE should
not buy much, if anything. I'd be interested in analysing any
case that shows otherwise...
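
(For reference, the per-drive write cache setting can be inspected and
toggled interactively with format -- roughly, and from memory:
# format -e, pick the disk, then cache -> write_cache -> display,
or enable/disable.)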

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] user quotas vs filesystem quotas?

2006-08-11 Thread Jeff A. Earickson

Hi,

I'm looking at moving two UFS quota-ed filesystems to ZFS under
Solaris 10 release 6/06, and the quota issue is gnarly.

One filesystem is user home directories and I'm aiming towards the
"one zfs filesystem per user" model, attempting to use Casper
Dik's auto_home script for on-the-fly zfs filesystem creation.
I'm having problems there, but that is an automounter issue, not
ZFS.
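
(For context, the per-user model amounts to something like the
following -- names and sizes are only illustrative:

# zfs create pool/home/joe
# zfs set quota=5g pool/home/joe

so each home directory gets its own quota-ed dataset.)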

The other filesystem is /var/mail on my mail server.  I've traditionally
run (big) user quotas in mailboxes just to keep some malicious
emailer from filling up /var/mail, maybe.   The notion of having
one zfs filesystem per mailbox seems unwieldy, just to run quotas
per user.

Are there any plans/schemes for per-user quotas within a ZFS filesystem,
akin to the UFS quotaon(1M) mechanism?  I take it that quotaon won't
work with a ZFS filesystem, right?  Suggestions please?  My notion 
right now is to drop quotas for /var/mail.


Jeff Earickson
Colby College
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Removing a device from a zfs pool

2006-08-11 Thread Matthew Ahrens
On Fri, Aug 11, 2006 at 02:47:19AM -0700, Louwtjie Burger wrote:
> Are there any consideration given to this feature...?

Yes, this is on our radar.  We have some ideas about how to implement
it, but it will probably be at least 6 months until it is ready.  We
have several higher-priority tasks to finish before then (e.g. continuing
to improve performance, and boot and install off ZFS).

> It would probably work on the same princaple of swap -a and swap -d ;)
> Just a little bit more complex.

Just a bit ;-)

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of seeks?

2006-08-11 Thread Anton Rang


On Aug 9, 2006, at 8:18 AM, Roch wrote:




>> So while I'm feeling optimistic :-) we really ought to be
>> able to do this in two I/O operations. If we have, say, 500K
>> of data to write (including all of the metadata), we should
>> be able to allocate a contiguous 500K block on disk and
>> write that with a single operation. Then we update the
>> Uberblock.
>
> Hi Anton, Optimistic a little yes.
>
> The data block should have aggregated quite well into near
> recordsize I/Os, are you sure they did not? No O_DSYNC in
> here right?


When I repeated this with just 512K written in 1K chunks via dd,
I saw six 16K writes.  Those were the largest.  The others were
around 1K-4K.  No O_DSYNC.

  dd if=/dev/zero of=xyz bs=1k count=512

So some writes are being aggregated, but we're missing a lot.
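
(For what it's worth, one way to watch the physical write sizes is the
DTrace io provider -- a sketch only:

# dtrace -n 'io:::start { @[args[0]->b_flags & B_READ ? "read" : "write"] = quantize(args[0]->b_bcount); }'
)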


> Once the data blocks are on disk we have the information
> necessary to update the indirect blocks iteratively up to
> the ueberblock. Those are the smaller I/Os; I guess that
> because of ditto blocks they go to physically separate
> locations, by design.


We shouldn't have to wait for the data blocks to reach disk,
though.  We know where they're going in advance.  One of the
key advantages of the überblock scheme is that we can, in a
sense, speculatively write to disk.  We don't need the tight
ordering that UFS requires to avoid security exposures and
allow the file system to be repaired.  We can lay out all of
the data and metadata, write them all to disk, choose new
locations if the writes fail, etc. and not worry about any
ordering or state issues, because the on-disk image doesn't
change until we commit it.

You're right, the ditto block mechanism will mean that some
writes will be spread around (at least when using a
non-redundant pool like mine), but then we should have at
most three writes followed by the überblock update, assuming
three degrees of replication.


> All of these though are normally done asynchronously to
> applications, unless the disks are flooded.


Which is a good thing (I think they're asynchronous anyway,
unless the cache is full).


> But I follow you in that, it may be remotely possible to
> reduce the number of iterations in the process by assuming
> that the I/O will all succeed, then if some fails, fix up
> the consequence and when all done, update the ueberblock. I
> would not hold my breath quite yet for that.


Hmmm.  I guess my point is that we shouldn't need to iterate
at all.  There are no dependencies between these writes; only
between the complete set of writes and the überblock update.

-- Anton

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS LVM and EVMS

2006-08-11 Thread Humberto Ramirez
Thanks for replying (I thought nobody would bother.) 

So, if I understand correctly, I won't give up ANYTHING available in 
EVMS, LVM, or Linux RAID by going to ZFS and RAID-Z.  Right?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Proposal: zfs create -o

2006-08-11 Thread Eric Schrock
Following up on earlier mail, here's a proposal for create-time
properties.  As usual, any feedback or suggestions are welcome.

For those curious about the implementation, this finds its way all the
way down to the create callback, so that we can pick out true
create-time properties (e.g. volblocksize, future crypto properties).
The remaining properties are handled by the generic creation code.

- Eric

A. INTRODUCTION

A complicated ZFS installation will typically create a number of
datasets, each with their own property settings.  Currently, this
requires several steps, one for creating the dataset, and one for each
property that must be configured:

# zfs create pool/fs
# zfs set compression=on pool/fs
# zfs set mountpoint=/export pool/fs
...

This has several drawbacks, the first of which is simply unnecessary
steps.  For these complicated setups, it would be simpler to create the
dataset and all its properties at the same time.  This has been
requested by the ZFS community, and resulted in the following RFE:

6367103 create-time properties

More importantly, it forces the user to instantiate (and often mount)
the dataset before assigning properties.  In the case of the
'mountpoint' property, it means that we create an inherited mountpoint,
only to be later changed when the property is modified.  This also makes
setting the 'canmount' property (PSARC 2006/XXX) more intuitive.

This RFE is also required for crypto support, as the encryption
algorithm must be known when the filesystem is created.  It also has the
benefit of cleaning up the implementation of other creation-time
properties (volsize and volblocksize) that were previously special
cases.

B. DESCRIPTION

This case adds a new option, 'zfs create -o', which allows for any ZFS
property to be set at creation time.  Multiple '-o' options can appear
in the same subcommand.  Specifying the same property multiple times in
the same command results in an error.  For example:

# zfs create -o compression=on -o mountpoint=/export pool/fs

The option '-o' was chosen over '-p' (for 'property') to reserve this
for a future RFE:

6290249 zfs {create,clone,rename} -p to create parents
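
(That future '-p' would presumably behave like mkdir -p and create any
missing parent datasets, e.g. something along the lines of
'# zfs create -p pool/a/b/c' -- but that is outside the scope of this
case.)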

The functionality of 'zfs create -b' has been superseded by this new
option, though it will be retained for backwards compatibility.  There
is no plan to formally obsolete or remove this option.  For example:

# zfs create -b 16k -V 10M pool/vol

is equivalent to

# zfs create -o volblocksize=16k -V 10M pool/vol

If '-o volblocksize' is specified in addition to '-b', the resulting
behavior is undefined.

C. MANPAGE CHANGES

TBD

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS LVM and EVMS

2006-08-11 Thread Eric Schrock
No, there are some features we haven't implemented, that may or may not
be available in other RAID solutions.  In particular:

- ZFS storage pool cannot be 'shrunk', i.e. removing an entire toplevel
  device (mirror, RAID group, etc).  Devices can be removed by attaching
  and detaching to existing mirrors, but you cannot shrink the overall
  size of the pool.

- ZFS RAID-Z stripes cannot be expanded.  ZFS storage pools are all
  dynamically striped across all device groups.  So you can add a new
  RAID-Z group ((5+1) -> 2x(5+1) for example; see the sketch below), but
  you cannot expand an existing stripe ((5+1) -> (6+1)).
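
(For illustration -- device names are made up -- adding a second (5+1)
RAID-Z group to an existing pool looks something like:

# zpool add tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

There is no command that grows an existing RAID-Z group in place.)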

There are likely other features that are different and/or missing from
other solutions, so it's a little extreme to say you "won't give up
ANYTHING".  But in terms of large scale features, there's not much
besides the two above, and remember that you have a lot to gain ;-)

- Eric

On Fri, Aug 11, 2006 at 09:28:58AM -0700, Humberto Ramirez wrote:
> Thanks for replying (I thought nobody would bother.) 
> 
> So, If understand correctly, I won't give up ANYTHING available in 
> EVMS. LVM , Linux Raid -by going to ZFS and Raid -Z  Right ?
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Difficult to recursive-move ZFS filesystems to another server

2006-08-11 Thread Brad Plecs
Just wanted to point this out -- 

I have a large web tree that used to have UFS user quotas on it.  I converted 
to ZFS using 
the model that each user has their own ZFS filesystem quota instead.  I worked 
around some 
NFS/automounter issues, and it now seems to be working fine. 

Except now I have to move it to another server.   The problem is that there 
doesn't appear
to be any recursive dump/restore command that lets me do this easily.  'zfs 
send' and 'zfs receive' 
only appear to work within filesystem boundaries.  

What I want to do is move all of zfspool/www from server A to server B. 

Each user filesystem underneath zfspool/www: 

 zfspool/www/user-joe
 zfspool/www/user-john
 zfspool/www/user-mary 

...has a unique quota assigned to it. 

There doesn't appear to be a way to move zfspool/www and its descendants
en masse to a new machine with those quotas intact.  I have to script the
recreation of all of the descendant filesystems by hand.

I can move the *data* with tar or rsync easily enough, but it seems silly that 
I have to recreate
all the descendant filesystems and their characteristics by hand. 
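
(The scripting itself isn't hard -- something roughly like the following,
with the host name made up and completely untested:

# zfs list -H -o name,quota -r zfspool/www | while read fs q; do
    ssh serverB "zfs create $fs; zfs set quota=$q $fs"
  done

-- but it feels like something 'zfs send' should be able to do for me.)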

I know the "comprehensive dump" subject has been brought up before... I'd like 
to reiterate a suggestion that it'd be nice if the various commands (zfs 
send/receive, zfs snapshot) could optionally include a filesystem's 
descendants.  If zfs send could do this and included the filesystem quotas, it 
might solve this issue. 

Or maybe I'm missing something?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Proposal expand raidz

2006-08-11 Thread homerun
Greetings

I have used ZFS raidz for a while, and the question arose: is it possible to 
expand a raidz with additional disks?
The answer I got was: the pool, yes, but the raidz "group", no.

So here is a very high-level idea for you -- maybe you already know it.
I'm not a detail-level expert on ZFS, so there might be "trivial" things here 
for you.

Could the add operation be enhanced so that it allows adding additional 
disks/devices to a raidz?
Some criteria: the new devices should be larger than or equal to the current 
pool devices.
I assume the tricky part is how to bring the new device into use.
How about: every write goes to all devices, while reads use only the 
"previous" layout; or, brute force, during the add process rewrite every file 
across all devices.

I know this already works on NetApp filers, so I think it is just a matter of 
finding a way to make it work in ZFS raidz.

Thanks for your time
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of seeks?

2006-08-11 Thread Jonathan Adams
On Fri, Aug 11, 2006 at 11:04:06AM -0500, Anton Rang wrote:
> >Once the data blocks are on disk we have the information
> >necessary to update the indirect blocks iteratively up to
> >the ueberblock. Those are the smaller I/Os; I guess that
> >because of ditto blocks they go to physically separate
> >locations, by design.
> 
> We shouldn't have to wait for the data blocks to reach disk,
> though.  We know where they're going in advance.  One of the
> key advantages of the überblock scheme is that we can, in a
> sense, speculatively write to disk.  We don't need the tight
> ordering that UFS requires to avoid security exposures and
> allow the file system to be repaired.  We can lay out all of
> the data and metadata, write them all to disk, choose new
> locations if the writes fail, etc. and not worry about any
> ordering or state issues, because the on-disk image doesn't
> change until we commit it.

> You're right, the ditto block mechanism will mean that some
> writes will be spread around (at least when using a
> non-redundant pool like mine), but then we should have at
> most three writes followed by the überblock update, assuming
> three degrees of replication.

The problem is that you don't know the actual *contents* of the parent block
until *all* of its children have been written to their final locations.
(This is because the block pointer's value depends on the final location.)
The ditto blocks don't really affect this, since they can all be written
out in parallel.

So you end up with the current N phases: data, its parents,
their parents, ..., uberblock.

> >But  I follow  you in that,  It  may be remotely possible to
> >reduce the number of Iterations  in the process by  assuming
> >that the I/O will  all succeed, then  if some fails, fix  up
> >the consequence and when all  done, update the ueberblock. I
> >would not hold my breath quite yet for that.
> 
> Hmmm.  I guess my point is that we shouldn't need to iterate
> at all.  There are no dependencies between these writes; only
> between the complete set of writes and the überblock update.

Again, there is; if a block write fails, you have to re-write it and
all of its parents.  So the best you could do would be:

1. assign locations for all blocks, and update the space bitmaps
   as necessary.
2. update all of the non-Uberdata blocks with their actual
   contents (which requires calculating checksums on all of the
   child blocks)
3. write everything out in parallel.
3a. if any write fails, re-do 1+2 for that block, and 2 for all of its
    parents, then start over at 3 with all of the changed blocks.

4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but certainly seems
possible.

Cheers,
- jonathan

(this is only my understanding of how ZFS works;  I could be mistaken)


-- 
Jonathan Adams, Solaris Kernel Development
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of seeks?

2006-08-11 Thread Anton Rang

On Aug 11, 2006, at 12:38 PM, Jonathan Adams wrote:

> The problem is that you don't know the actual *contents* of the parent
> block until *all* of its children have been written to their final
> locations.  (This is because the block pointer's value depends on the
> final location.)


But I know where the children are going before I actually write them.
There is a dependency of the parent's contents on the *address* of its
children, but not on the actual write.  We can compute everything that
we are going to write before we start to write.

(Yes, in the event of a write failure we have to recover; but that's
very rare, and can easily be handled -- we just start over, since no
visible state has been changed.)

> The ditto blocks don't really affect this, since they can all be
> written out in parallel.


The reason they affect my desire of turning the update into a two-phase
commit (make all the changes, then update the überblock) is because the
ditto blocks are deliberately spread across the disk, so we can't
collect them into a single write (for a non-redundant pool, or at least
a one-disk pool -- presumably they wind up on different disks for a
two-disk pool, in which case we can still do a single write per disk).


> Again, there is; if a block write fails, you have to re-write it and
> all of its parents.  So the best you could do would be:
>
> 1. assign locations for all blocks, and update the space bitmaps
>    as necessary.
> 2. update all of the non-Uberdata blocks with their actual
>    contents (which requires calculating checksums on all of the
>    child blocks)
> 3. write everything out in parallel.
> 3a. if any write fails, re-do 1+2 for that block, and 2 for all of its
>     parents, then start over at 3 with all of the changed blocks.
>
> 4. once everything is on stable storage, update the uberblock.
>
> That's a lot more complicated than the current model, but certainly
> seems possible.


(3a could actually be simplified to simply "mark the bad blocks as
unallocatable, and go to 1", but it's more efficient as you describe.)

The eventual advantage, though, is that we get the performance of a
single write (plus, always, the überblock update).  In a heavily loaded
system, the current approach (lots of small writes) won't scale so well.
(Actually we'd probably want to limit the size of each write to some
small value, like 16 MB, simply to allow the first write to start
earlier under fairly heavy loads.)

As I pointed out earlier, this would require getting scatter/gather
support through the storage subsystem, but the potential win should be
quite large.

Something to think about for the future.  :-)

Incidentally, this is part of how QFS gets its performance for streaming
I/O.  We use an "allocate forward" policy, allow very large allocation
blocks, and separate the metadata from data.  This allows us to write
(or read) data in fairly large I/O requests, without unnecessary disk
head motion.
Anton

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SPEC SFS97 benchmark of ZFS,UFS,VxFS

2006-08-11 Thread eric kustarz

Leon Koll wrote:

> On 8/11/06, eric kustarz <[EMAIL PROTECTED]> wrote:
>
>> Leon Koll wrote:
>>
>> > <...>
>> >
>> >> So having 4 pools isn't a recommended config - i would destroy
>> >> those 4 pools and just create 1 RAID-0 pool:
>> >> #zpool create sfsrocks c4t00173801014Bd0 c4t00173801014Cd0
>> >> c4t001738010140001Cd0 c4t0017380101400012d0
>> >>
>> >> each of those devices is a 64GB lun, right?
>> >
>> > I did it - created one pool, 4*64GB size, and running the benchmark
>> > now.
>> > I'll update you on results, but one pool is definitely not what I
>> > need.
>> > My target is - SunCluster with HA ZFS where I need 2 or 4 pools per
>> > node.
>> >
>> Why do you need 2 or 4 pools per node?
>>
>> If you're doing HA-ZFS (which is SunCluster 3.2 - only available in beta
>> right now), then you should divide your storage up to the number of
>
> I know, I run the 3.2 now.
>
>> *active* pools.  So say you have 2 nodes and 4 luns (each lun being
>> 64GB), and only need one active node - then you can create one pool of
>
> To have one active node doesn't look smart to me. I want to distribute
> load between 2 nodes, not to have 1 active and 1 standby.
> The LUN size in this test is 64GB but in real configuration it will be
> 6TB
>
>> all 4 luns, and attach the 4 luns to both nodes.
>>
>> The way HA-ZFS basically works is that when the "active" node fails, it
>> does a 'zpool export', and the takeover node does a 'zpool import'.  So
>> both nodes are using the same storage, but they cannot use the same
>> storage at the same time, see:
>> http://www.opensolaris.org/jive/thread.jspa?messageID=49617
>
> Yes, it works this way.
>
>> If however, you have 2 nodes, 4 luns, and wish both nodes to be active,
>> then you can divy up the storage into two pools.  So each node has one
>> active pool of 2 luns.  All 4 luns are doubly attached to both nodes,
>> and when one node fails, the takeover node then has 2 active pools.
>
> I agree with you - I can have 2 active pools, not 4 in case of
> dual-node cluster.
>
>> So how many nodes do you have? and how many do you wish to be "active"
>> at a time?
>
> Currently - 2 nodes, both active. If I define 4 pools, I can easily
> expand the cluster to the 4-nodes configuration, that may be the good
> reason to have 4 pools.

Ok, that makes sense.

>> And what was your configuration for VxFS and SVM/UFS?
>
> 4 SVM concat volumes (I need a concatenation of 1TB LUNs if I am in
> SC3.1 that doesn't support EFI label) with UFS or VxFS on top.

So you have 2 nodes, 2 file systems (of either UFS or VxFS) on each node?

I'm just trying to make sure it's a fair comparison between ZFS, UFS, and
VxFS.

> And now comes the questions - my short test showed that 1-pool config
> doesn't behave better than 4-pools one - with the first the box was
> hung, with the second - didn't.
> Why do you think the 1-pool config is better?

I suggested the 1 pool config before I knew you were doing HA-ZFS :)
Purposely dividing up your storage (by creating separate pools) in a
non-clustered environment usually doesn't make sense (root being one
notable exception).


eric
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Proposal expand raidz

2006-08-11 Thread Brad Plecs
Just a data point -- our netapp filer actually creates additional raid
groups that are added to the greater pool when you "add disks", much as
zfs does now.  They aren't simply used to expand the one large raid
group of the volume.  I've been meaning to rebuild the whole thing to
get use of the multiple parity disks back.

Ours is a few years old and isn't running the latest software rev, so maybe 
they've overcome
this now, but thought I'd mention it.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Difficult to recursive-move ZFS filesystems to another server

2006-08-11 Thread Matthew Ahrens
On Fri, Aug 11, 2006 at 10:02:41AM -0700, Brad Plecs wrote:
> There doesn't appear to be a way to move zfspool/www and its
> decendants en masse to a new machine with those quotas intact.  I have
> to script the recreation of all of the descendant filesystems by hand. 

Yep, you need

6421959 want zfs send to preserve properties ('zfs send -p')
6421958 want recursive zfs send ('zfs send -r')

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Looking for motherboard/chipset experience, again

2006-08-11 Thread David Dyer-Bennet

What about the Asus M2N-SLI Deluxe motherboard?  It has 7 SATA ports,
supports ECC memory, socket AM2, generally looks very attractive for
my home storage server.  Except that it, and the nvidia nForce 570-SLI
it's built on, don't seem to be on the HCL.  I'm hoping that's just
"yet", not reported yet.  Anybody run Solaris on it?  Or at least on
any nForce 570-SLI board?  Would you risk buying it to find out
yourself?

I've heard rumors of ZFS in one of the more obscure Linuxes, perhaps
Ubuntu; I suppose that could be a backup plan if I try and Solaris
doesn't work.

I have the general feeling that Linux runs on anything I can buy
today, pretty much, since I've been using it for over a decade and am
somewhat plugged into the community.  I don't yet have the impression
that Solaris runs on most anything, possibly after tracking down a few
drivers.  Does it, really?  Should I be not worrying about this so
much?
--
David Dyer-Bennet, , 
RKBA: 
Pics: 
Dragaera/Steven Brust: 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Proposal expand raidz

2006-08-11 Thread Darren Dunham
> Just a data point -- our netapp filer actually creates additional raid
> groups that are added to the greater pool when you "add disks", much
> as zfs does now.  They aren't simply used to expand the one large raid
> group of the volume.  I've been meaning to rebuild the whole thing to
> get use of the multiple parity disks back.

That should be a setting, though.  Not a limitation.

Netapp will create a new raid group when all existing groups are at a
particular point.  That way the raid groups don't become too large.
Otherwise, it can expand the group.

Today, ZFS doesn't have that option.

> Ours is a few years old and isn't running the latest software rev, so
> maybe they've overcome this now, but thought I'd mention it.

This isn't a new feature.  Back in the "old days"(TM), there was only
one volume and one raid group on a head.  So adding disks to a volume
required the raid group be expanded onto the disks.


-- 
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant TAOShttp://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
 < This line left intentionally blank to confuse you. >
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Importing a degraded storage pool.

2006-08-11 Thread Joe Stannard
Hey, sorry if this is really basic, but I just started evaluating Solaris 10. 
Hated it at first, but I'm sure that was just Windows withdrawal. The more I 
play, the more I like it.

Just started with Solaris 10 for x86 and testing out ZFS for perhaps a home 
server.

I have 4 SATA drives installed in my box: 1 drive for the OS and 3 drives to 
play with storage pools. Here's what's happening.

1. Created a quick mirror: zpool create MediaTest mirror c1d1 c2d0.
Copied 5 gigs of data onto the pool.

2. Powered off, pulled c1d1 for fun. Powered on. Pool degraded.
This was expected.

3. zpool replace MediaTest c1d1 c2d1. Array resilvered and healthy. Copied 
another 10 gigs of data onto the pool.

4. Power off, reconnect c1d1, disconnect c2d1. Power on. Pool degraded. This 
was also expected.

5. zpool replace MediaTest c2d1 c1d1. Error: invalid vdev specification:
/dev/dsk/c1d1s0 contains a zfs filesystem. This was somewhat expected.
I powered down, reconnected c2d1, and powered on again; the MediaTest pool is 
healthy.

What I want to do now is import c1d1 as a new pool. It still had the original 5 
gigs of data on it, and this should be no different than moving this drive to a 
new machine. However, when I try zpool import, it lists no pools available for 
import! Is this right? What am I doing wrong, or why is it not recognizing this 
pool as available for import?

Thanks for your help!
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Question on Zones and memory usage (65120349)

2006-08-11 Thread Irma Garcia

Hi All,

Sun Fire V440
Solaris 10
Solaris Resource Manager

Customer wrote the following:

I have a v490 with 4 zones:

tsunami:/#->zoneadm list -iv
ID NAME STATUS PATH
0 global running /
4 fmstage running /fmstage
12 fmprod running /fmprod
15 fmtest running /fmtest

fmtest has a pool assigned to it with access
to 2 cpus. When I run prstat -Z in the
fmtest zone I see:

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 192 169G 163G 100% 0:29:55 96% fmtest

on the global zone (tsunami) I see with
prstat -Z:

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 188 169G 163G 100% 0:46:00 48% fmtest
0 54 708M 175M 0.1% 2:23:40 0.1% global
12 27 112M 51M 0.0% 0:02:48 0.0% fmprod
4 27 281M 66M 0.0% 0:14:13 0.0% fmstage

Questions:
Does the 100% memory usage mean that
the fmtest zone is using all the memory?
How come when I run the top command I see
a different result for memory usage?
What is the best method to tie a certain
percentage of memory to certain zones — rcapd?




Thanks in Advance
Irma


-

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SPEC SFS97 benchmark of ZFS,UFS,VxFS

2006-08-11 Thread Leon Koll

On 8/11/06, eric kustarz <[EMAIL PROTECTED]> wrote:

Leon Koll wrote:

> On 8/11/06, eric kustarz <[EMAIL PROTECTED]> wrote:
>
>> Leon Koll wrote:
>>
>> > <...>
>> >
>> >> So having 4 pools isn't a recommended config - i would destroy
>> those 4
>> >> pools and just create 1 RAID-0 pool:
>> >> #zpool create sfsrocks c4t00173801014Bd0 c4t00173801014Cd0
>> >> c4t001738010140001Cd0 c4t0017380101400012d0
>> >>
>> >> each of those devices is a 64GB lun, right?
>> >
>> >
>> > I did it - created one pool, 4*64GB size, and running the benchmark
>> now.
>> > I'll update you on results, but one pool is definitely not what I
>> need.
>> > My target is - SunCluster with HA ZFS where I need 2 or 4 pools per
>> node.
>> >
>> Why do you need 2 or 4 pools per node?
>>
>> If you're doing HA-ZFS (which is SunCluster 3.2 - only available in beta
>> right now), then you should divide your storage up to the number of
>
>
> I know, I run the 3.2  now.
>
>> *active* pools.  So say you have 2 nodes and 4 luns (each lun being
>> 64GB), and only need one active node - then you can create one pool of
>
>
> To have one active node doesn't look smart to me. I want to distribute
> load between 2 nodes, not to have 1 active and 1 standby.
> The LUN size in this test is 64GB but in real configuration it will be
> 6TB
>
>> all 4 luns, and attach the 4 luns to both nodes.
>>
>> The way HA-ZFS basically works is that when the "active" node fails, it
>> does a 'zpool export', and the takeover node does a 'zpool import'.  So
>> both nodes are using the same storage, but they cannot use the same
>> storage at the same time, see:
>> http://www.opensolaris.org/jive/thread.jspa?messageID=49617
>
>
> Yes, it works this way.
>
>>
>> If however, you have 2 nodes, 4 luns, and wish both nodes to be active,
>> then you can divy up the storage into two pools.  So each node has one
>> active pool of 2 luns.  All 4 luns are doubly attached to both nodes,
>> and when one node fails, the takeover node then has 2 active pools.
>
>
> I agree with you - I can have 2 active pools, not 4 in case of
> dual-node cluster.
>
>>
>> So how many nodes do you have? and how many do you wish to be "active"
>> at a time?
>
>
> Currently - 2 nodes, both active. If I define 4 pools, I can easily
> expand the cluster to the 4-nodes configuration, that may be the good
> reason to have 4 pools.


Ok, that makes sense.

>>
>> And what was your configuration for VxFS and SVM/UFS?
>
>
> 4 SVM concat volumes (I need a concatenation of 1TB LUNs if I am in
> SC3.1 that doesn't support EFI label) with UFS or VxFS on top.


So you have 2 nodes, 2 file systems (of either UFS or VxFS) on each node?


I have 2 nodes, 2 file systems per node. One share is working via
bge0, the second one - via bge1.



I'm just trying to make sure its a fair comparison bewteen ZFS, UFS, and
VxFS.


After I saw that ZFS performance (when the box isn't stuck) is about 3
times lower than UFS/VxFS, I understood I should wait with ZFS for the
official Solaris 11 release.
I don't believe that it's possible to do some magic with my setup and
increase the ZFS performance 3 times. Correct me if I'm wrong.



>
> And now comes the questions - my short test showed that 1-pool config
> doesn't behave better than 4-pools one - with the first the box was
> hung, with the second - didn't.
> Why do you think the 1-pool config is better?


I suggested the 1 pool config before i knew you were doing HA-ZFS :)
Purposely dividing up your storage (by creating separate pools) in a
non-clustered environment usually doesn't make sense (root being one
notable exception).


I see.
Thanks,
-- Leon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SPEC SFS97 benchmark of ZFS,UFS,VxFS

2006-08-11 Thread eric kustarz



After I saw that ZFS performance (when the box isn't stuck) is about 3
times lower than UFS/VxFS, I understood I should wait with ZFS for the
official Solaris 11 release.
I don't believe that it's possible to do some magic with my setup and
increase the ZFS performance 3 times. Correct me if I'm wrong.



Yep, we're working on this right now, though you shouldn't have to wait 
until Solaris 11 - hopefully an S10 update will be out earlier with the 
proper perf fixes.  U3 already has some improvements over U2 (which you 
were running).


I'm actually doing SPEC SFS benchmarking right now, and I'll keep the 
list updated.


eric


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Proposal expand raidz

2006-08-11 Thread Anton B. Rang
That's the default, I think, but you can use 'vol add -g' to add disks to an 
existing RAID group. This is fairly new functionality (V6.2 I think). ZFS will 
probably not take so long to add this feature. :-)
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Looking for motherboard/chipset experience, again

2006-08-11 Thread Richard Elling - PAE

This is a great question for the Solaris forum at NVidia.
http://www.nvnews.net/vbulletin/forumdisplay.php?f=45

My experience has been that NVidia does a pretty good job keeping the
NForce software compatible with the hardware going forward.  For Solaris,
pre-NForce4 is a little spotty, but that is probably due to timing issues.
 -- richard

David Dyer-Bennet wrote:

What about the Asus M2N-SLI Deluxe motherboard?  It has 7 SATA ports,
supports ECC memory, socket AM2, generally looks very attractive for
my home storage server.  Except that it, and the nvidia nForce 570-SLI
it's built on, don't seem to be on the HCL.  I'm hoping that's just
"yet", not reported yet.  Anybody run Solaris on it?  Or at least on
any nForce 570-SLI board?  Would you risk buying it to find out
yourself?

I've heard rumors of ZFS in one of the more obscure Linuxes, perhaps
Ubuntu; I suppose that could be a backup plan if I try and Solaris
doesn't work.

I have the general feeling that Linux runs on anything I can buy
today, pretty much, since I've been using it for over a decade and am
somewhat plugged into the community.  I don't yet have the impression
that Solaris runs on most anything, possibly after tracking down a few
drivers.  Does it, really?  Should I be not worrying about this so
much?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question on Zones and memory usage (65120349)

2006-08-11 Thread Jeff Victor

Irma Garcia wrote:

Hi All,

Sun Fire V440
Solaris 10
Solaris Resource Manager

Customer wrote the following:

I have a v490 with 4 zones:

tsunami:/#->zoneadm list -iv
ID NAME STATUS PATH
0 global running /
4 fmstage running /fmstage
12 fmprod running /fmprod
15 fmtest running /fmtest

fmtest has a pool assigned to it with acess
to 2 cpus. When I run the psstat -Z in the
fmtest zone I see;

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 192 169G 163G 100% 0:29:55 96% fmtest

on the global zone (tsunami) I see with the
psstat -Z ;

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 188 169G 163G 100% 0:46:00 48% fmtest
0 54 708M 175M 0.1% 2:23:40 0.1% global
12 27 112M 51M 0.0% 0:02:48 0.0% fmprod
4 27 281M 66M 0.0% 0:14:13 0.0% fmstage

Questions?
Does the 100% memory usage on each mean that
the fmtest zone is using all the memory. 


Are they using rcapd?

Neither the man page nor a quick skim of the prstat source code at opensolaris.org 
provides a useful answer.  It is not clear whether "all the memory" means "all of the 
virtual memory" (unlikely), "all of the physical memory", or "all of the memory 
available to the zone."



How come when I run the top command I see
different result for memory usage.


A comparison of the top and prstat source code would be useful, but someone familiar 
with those two programs could probably provide an answer more quickly.



What is the best method to tie a certian
percentage of memory to certain zones — rcapd ??


Yes.

--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] user quotas vs filesystem quotas?

2006-08-11 Thread Frank Cusack

On August 11, 2006 10:31:50 AM -0400 "Jeff A. Earickson" <[EMAIL PROTECTED]> 
wrote:

Suggestions please?


Ideally you'd be able to move to mailboxes in $HOME instead of /var/mail.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Unreliable ZFS backups or....

2006-08-11 Thread Peter Looyenga
I looked into backing up ZFS and quite honestly I can't say I am convinced 
about its usefulness here when compared to the traditional ufsdump/restore. 
While snapshots are nice, they can never substitute for offline backups. And 
although you can keep quite a few snapshots lying around, they will consume 
disk space, which is one of the reasons why people also keep offline backups.

However, while you can make one using 'zfs send', it somewhat worries me that 
the only way to perform a restore is by restoring the entire filesystem 
(/snapshot). I shudder at the thought of having to restore /export/home this 
way just to retrieve a single file or directory.

Am I overlooking something here, or are people indeed resorting to tools like 
tar and the like again to overcome all this? In my opinion ufsdump/ufsrestore 
was a major advantage over tar, and I would consider it a major drawback if 
that were the only way to back up data in a form that can be easily restored.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question on Zones and memory usage (65120349)

2006-08-11 Thread Jeff Victor
Follow-up: it looks to me like prstat displays the portion of the system's 
physical memory in use by the processes in that zone.


How much memory does that system have?  Something seems amiss, as a V490 can hold 
up to 32GB, and prstat is showing 163GB of physical memory just for fmtest.



Irma Garcia wrote:

Hi All,

Sun Fire V440
Solaris 10
Solaris Resource Manager

Customer wrote the following:

I have a v490 with 4 zones:

tsunami:/#->zoneadm list -iv
ID NAME STATUS PATH
0 global running /
4 fmstage running /fmstage
12 fmprod running /fmprod
15 fmtest running /fmtest

fmtest has a pool assigned to it with acess
to 2 cpus. When I run the psstat -Z in the
fmtest zone I see;

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 192 169G 163G 100% 0:29:55 96% fmtest

on the global zone (tsunami) I see with the
psstat -Z ;

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 188 169G 163G 100% 0:46:00 48% fmtest
0 54 708M 175M 0.1% 2:23:40 0.1% global
12 27 112M 51M 0.0% 0:02:48 0.0% fmprod
4 27 281M 66M 0.0% 0:14:13 0.0% fmstage

Questions?
Does the 100% memory usage on each mean that
the fmtest zone is using all the memory. How
come when I run the top command I see
different result for memory usage.
What is the best method to tie a certian
percentage of memory to certain zones — rcapd ??




Thanks in Advance
Irma


-

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Unreliable ZFS backups or....

2006-08-11 Thread Frank Cusack

On August 11, 2006 5:25:11 PM -0700 Peter Looyenga <[EMAIL PROTECTED]> wrote:

I looked into backing up ZFS and quite honostly I can't say I am convinced 
about its usefullness
here when compared to the traditional ufsdump/restore. While snapshots are nice 
they can never
substitute offline backups.


It doesn't seem to me that they are meant to.


However, while you can make one using 'zfs send' it somewhat worries me that 
the only way to
perform a restore is by restoring the entire filesystem (/snapshot). I somewhat 
shudder at the
thought of having to restore /export/home this way to retrieve but a single 
file/directory.


You can mitigate this by creating more granular filesystems, e.g. a
filesystem per user homedir.  This has other advantages like per-user
quotas.
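
For instance (a sketch only; names are made up), a per-user filesystem
can be snapshotted and sent on its own, so a restore touches just that
user:

# zfs snapshot pool/export/home/joe@backup
# zfs send pool/export/home/joe@backup > /backup/joe.zfs
# zfs receive pool/export/home/joe < /backup/joe.zfs   (restore, e.g. on another box)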

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question on Zones and memory usage (65120349)

2006-08-11 Thread Mike Gerdts

On 8/11/06, Irma Garcia <[EMAIL PROTECTED]> wrote:

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 188 169G 163G 100% 0:46:00 48% fmtest
0 54 708M 175M 0.1% 2:23:40 0.1% global
12 27 112M 51M 0.0% 0:02:48 0.0% fmprod
4 27 281M 66M 0.0% 0:14:13 0.0% fmstage

Questions?
Does the 100% memory usage on each mean that
the fmtest zone is using all the memory. How
come when I run the top command I see
different result for memory usage.


The %mem column is the sum of the %mem that each process uses.
Unfortunately, that value seems to include the pages that are shared
between many processes (e.g. database files, libc, etc.) without
dividing by the number of processes that have that memory mapped.  In
other words, if you have 50 database processes that have used mmap()
on the same 1 GB database, prstat will think that 50 GB of RAM is used
when only 1 GB is really used.

I have seen prstat report that oracle workloads on a 15k domain are
using well over a terabyte of memory.  This is kinda hard to do on a
domain with ~300 GB of RAM and < 50 GB of swap.


What is the best method to tie a certian
percentage of memory to certain zones — rcapd ??


I *think* that rcapd suffers from the same problem that prstat does
and may cause undesirable behavior.  Because of the way that it works,
I fully expect that if rcapd begins to force pages out, the paging
activity for the piggy workload will cause severe performance
degredation for everything on the machine.  My personal opinion (not
backed by extensive testing) is that rcapd is more likely to do more
harm than good.

If the workload that you are trying to control is java-based, consider
using the various java flags to limit heap size.  This will not
protect you against memory leaks in the JVM, but it will protect
against a misbehaving app.  The same is likely true for the stack
size.

If the workload you are trying to control is some other single
process, consider using ulimit to limit the stack and heap size.

Set the size= option for all tmpfs file systems.
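
(For example -- the 2g figure is only an illustration -- the /etc/vfstab
entry for /tmp could look like:

swap    -    /tmp    tmpfs    -    yes    size=2g
)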

Bug the folks that are working on memory sets and swap sets to get
this code out sooner than later.

If running on sun4v, consider LDOMs when they are available (November?).

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Looking for motherboard/chipset experience, again

2006-08-11 Thread David Dyer-Bennet

On 8/11/06, Richard Elling - PAE <[EMAIL PROTECTED]> wrote:

This is a great question for the Solaris forum at NVidia.
http://www.nvnews.net/vbulletin/forumdisplay.php?f=45


Thanks, I have asked there.


My experience has been that NVidia does a pretty good job keeping the
NForce software compatible with the hardware going forward.  For Solaris,
pre-NForce4 is a little spotty, but that is probably due to timing issues.


Is an nForce 570-SLI pre-4?  It's a brand-new board design (socket
AM2), so probably current hardware, not older components.  Sounding
hopeful, we'll see what the nvnews people turn up.  Thanks!
--
David Dyer-Bennet, , 
RKBA: 
Pics: 
Dragaera/Steven Brust: 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss