Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread Robert Milkowski
Hello can,

Thursday, December 13, 2007, 12:02:56 AM, you wrote:

cyg> On the other hand, there's always the possibility that someone
cyg> else learned something useful out of this.  And my question about

To be honest - there's basically nothing useful in the thread,
perhaps except one thing - it doesn't make any sense to listen to you.

You're just unable to talk to people.





-- 
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Auto backup and auto restore of ZFS via Firewire drive

2007-12-13 Thread Ross
Hey folks,

This may not be the best place to ask this, but I'm so new to Solaris I really 
don't know anywhere better.  If anybody can suggest a better forum I'm all ears 
:)

I've heard of Tim Foster's autobackup utility, which can automatically back up a 
ZFS filesystem to a USB drive as it's connected.  What I'd like to know is whether 
there's a way of doing something similar for a firewire drive?

Also, is it possible to run several backups at once onto that drive?  Is there 
any way I can script a bunch of commands to run as the drive is connected?

thanks,

Ross
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS NAS Cluster

2007-12-13 Thread Vic Cornell
Dear All,

First of all, thanks for a fascinating list - it's my first read of the
morning.

Secondly I would like to ask a question. We currently have an EMC Celerra
NAS which we use for CIFS, NFS and iSCSI. It's not our favourite piece of
hardware and it is nearing the limits of its capacity (TB). We have two
options: 

1) Expand the solution. Spend £££s, double the number of heads, double
the capacity and carry on as before.

2) Look for something else.

I have been watching ZFS for some time and have implemented it in
several niche applications. I would like to be able to consider using ZFS as
the basis of a NAS solution based around SAN storage, T{2,5}000 servers and
Sun Cluster.

Here is my wish list:

Flexible provisioning (thin if possible)
Hardware resilience/Transparent Failover
Asynchronous Replication to remote site (1km) providing DR cover.
NFS/CIFS/iSCSI
Snaps/Cloning
No single point of failure
Integration with Active Directory/NFS
Ability to restripe data onto "widened" pools.
Ability to migrate data between storage pools.

As I understand it the combination of ZFS and SunCluster will give me all of
the above. Has anybody done this? How mature/stable is it? I understand that
SunCluster/HA-ZFS is supported but there seems to be little that I can find
on the web about it. Any information would be gratefully received.

Best Regards,

Vic

-- 
Vic Cornell
UNIX Systems Administrator 
Landmark Information Group Limited

5-7 Abbey Court, Eagle Way, Sowton, 
Exeter, Devon, EX2 7HY

T:  01392 888690 
M: 07900 660266
F:  01392 441709

www.landmarkinfo.co.uk 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Auto backup and auto restore of ZFS via Firewire drive

2007-12-13 Thread Tim Foster
Hi there,

On Thu, 2007-12-13 at 02:17 -0800, Ross wrote:
> This may not be the best place to ask this, but I'm so new to Solaris
> I really don't know anywhere better.  If anybody can suggest a better
> forum I'm all ears :)

You could have just mailed me :-)

> I've heard of Tim Foster's autobackup utility, that can automatically
> backup a ZFS filesystem to a USB drive as it's connected.  What I'd
> like to know is if there's a way of doing something similar for a
> firewire drive?

Short answer: no idea, it should just work, but I haven't tested
firewire (oh the irony! [1])


Longer answer: here's how the software works:

1. When you enable the service, it starts a small python daemon, which
monitors the system D-Bus[2] watching for devices to be inserted. The
daemon is mostly just boiler-plate code, with the useful stuff near the
end.

2. When a device is inserted, the daemon kicks off a shell script,
passing it the device name, and the volume id of the device.

3. The shell script consults the SMF service's properties to see if that
volume id is "interesting"; if it's not, it exits.

4. For "interesting" volumes, we wait for the volume to get mounted.
Then the shell script either gets the names of the ZFS datasets to be
backed up from another SMF service property, or looks for datasets on
the system that have a given user property (com.sun:auto-backup).

5. Those datasets then have a snapshot taken of them, and we send those
snapshot streams to the backup volume, saved as flat files, split over
4GB boundaries (a pcfs limitation). A rough sketch of steps 4 and 5
follows below.
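Something like this minimal ksh sketch (this is *not* the actual
zfs-auto-backup code - the backup mountpoint and snapshot naming are
invented for illustration, only the com.sun:auto-backup property name
comes from the description above):

  #!/bin/ksh
  BACKUP_MNT=/media/backup              # wherever the volume got mounted
  STAMP=$(date +%Y%m%d-%H%M)
  # find datasets flagged for backup via the user property
  zfs list -H -o name -t filesystem | while read ds; do
      [ "$(zfs get -H -o value com.sun:auto-backup "$ds")" = "true" ] || continue
      zfs snapshot "$ds@auto-backup-$STAMP"
      # store the stream as flat files on the pcfs volume, split below 4GB
      zfs send "$ds@auto-backup-$STAMP" | \
          split -b 4000m - "$BACKUP_MNT/$(echo "$ds" | tr / _)-$STAMP."
  done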


The question is, does plugging a firewire mass-storage device into the
system generate a D-Bus event, and does the device get automatically
mounted ? I think it probably should, but I haven't tried it.


> Also, is it possible to run several backups at once onto that drive?
> Is there any way I can script a bunch of commands to run as the drive
> is connected?

If you configure multiple datasets to be backed up, I think that happens
serially at the moment.

As for scripting other commands, it's probably feature-creep to put this
into the code, but download the software from:

http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people
(there's a README in there that might be useful)

have a look at:

zfs-auto-backup-0.2/src/usr/lib/zfs-auto-backupd.py
zfs-auto-backup-0.2/src/usr/lib/zfs-auto-backup.ksh

 - the python daemon and shell script respectively.  You should be able
to see how the daemon invokes the shell script and hack accordingly to
run your own scripts too.



One additional point - this doesn't do restore yet: restoring would mean
doing some sort of synchronisation between what's on your system and
what's saved to the backup disk.
(you can of course manually restore via "zfs recv")

Ideally, everyone would have ZFS on their removable devices, and it'd be
a simple case of rsync or somesuch between the two filesystems to do a
restore, resolving conflicts where they arise.

Likewise, I'd love to be able to browse the backup contents on my
removable disk, but since I'm assuming that all removable disks are pcfs
for now, I'm only storing flat-files (which no archiver can currently
read - the only way of getting at the contents is "zfs recv" to another
zfs dataset) 

Hope this helps?

cheers,
tim


[1] Yeah, I work in zfs-test, and haven't fully tested my own software -
clearly not a *real* tester

[2] see hal(5) and http://www.freedesktop.org/wiki/Software/dbus
-- 
Tim Foster, Sun Microsystems Inc, Solaris Engineering Ops
http://blogs.sun.com/timf

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS NAS Cluster

2007-12-13 Thread Robert Milkowski
Hello Vic,

Thursday, December 13, 2007, 10:29:57 AM, you wrote:

VC> Dear All,

VC> First of all thanks for a fascinating list - its my first read of the
VC> morning.

VC> Secondly I would like to ask a question. We currently have an EMC Celerra
VC> NAS which we use for CIFS, NFS and iSCSI. Its not our favourite piece of
VC> hardware and it is nearing the limits of its capacity (Tb) . We have two
VC> options: 

VC> 1) Expand the solution. Spend £££s, double the number of heads, double
VC> the capacity and carry on as before.

VC> 2) Look for something else.

VC> I have been watching ZFS for some time and have implemented it in
VC> several niche applications. I would like to be able to consider using ZFS as
VC> the basis of a NAS solution based around SAN storage, T{2,5}000 servers and
VC> Sun Cluster.



I've been using Celerra too - not that bad, but definitely not worth
the $$$ in most cases.

I've been using SC+ZFS in production for over a year now (first with
a beta version). It works pretty well - I mean, it just works.


VC> Here is my wish list:

VC> Flexible provisioning (thin if possible)
VC> Hardware resilience/Transparent Failover
VC> Asynchronous Replication to remote site (1km) providing DR cover.
VC> NFS/CIFS/iSCSI
VC> Snaps/Cloning
VC> No single point of failure
VC> Integration with Active Directory/NFS
VC> Ability to restripe data onto "widened" pools.
VC> Ability to migrate data between storage pools.

VC> As I understand it the combination of ZFS and SunCluster will give me all of
VC> the above. Has anybody done this? How mature/stable is it. I understand that
VC> SunCluster/HA-ZFS is supported but there seems to be little that I can find
VC> on the web about it. Any information would be gratefully received.


Not exactly...
The answer to everything on your wish list is yes, with the following
exceptions/notes:

NFS - no problem at all, it will work with SC+NFS+ZFS.
CIFS - if you're happy with Samba, then yes. If you are after the Solaris
   native client - well, it was just integrated into Nevada, so if you're
   happy with Nevada...
iSCSI - I haven't really used it, not in production at least. Using the
   shareiscsi zfs property I guess it should work. I don't think
   there's a special SC agent for it.
AD - not sure about Samba.

"Restriping data" - nothing automated yet. If you re-write the data
manually then it should just work.

Migrating data between pools - not entirely online.

Asynchronous replication - you can try AVS, though I've never tried it.
   Or you can go for zfs send -i (a rough sketch of the shareiscsi and
   send -i bits follows below).
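For illustration only (pool, volume and host names are made up), the
shareiscsi and incremental-send pieces look roughly like this:

   # export a zvol over iSCSI via the shareiscsi property
   zfs create -V 100G tank/vmlun01
   zfs set shareiscsi=on tank/vmlun01

   # simple asynchronous replication with incremental sends
   zfs snapshot tank/data@rep1
   zfs send tank/data@rep1 | ssh drhost zfs recv -F drpool/data
   # later, ship only the delta:
   zfs snapshot tank/data@rep2
   zfs send -i tank/data@rep1 tank/data@rep2 | ssh drhost zfs recv drpool/data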





-- 
Best regards,
 Robert Milkowski   mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread Brian Kolaci
Robert Milkowski wrote:
> Hello can,
> 
> Thursday, December 13, 2007, 12:02:56 AM, you wrote:
> 
> cyg> On the other hand, there's always the possibility that someone
> cyg> else learned something useful out of this.  And my question about
> 
> To be honest - there's basically nothing useful in the thread,
> perhaps except one thing - doesn't make any sense to listen to you.
> 
> You're just unable to talk to people.
> 

Have to agree 100%.  I did learn how to filter out things from CYG
in my email program though.  Never had the need to do so before.

Overall, the effect of fragmentation will become more and more negligible as
SSD drives become more prominent.  I think the ZFS developers are concentrating
on the more important issues.  Where performance is needed, technology will
overcome the effects of fragmentation.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread MP
> this anti-raid-card movement is puzzling. 

I think you've misinterpreted my questions.
I queried the necessity of paying extra for a seemingly unnecessary RAID card 
for zfs. I didn't doubt that it could perform better.
Wasn't one of the design briefs of zfs that it would provide its feature set 
without expensive RAID hardware?
Of course, if you have the money then you can always go faster, but this is a 
zfs discussion thread (I know I've perpetuated the extravagant cross-posting of 
the OP).
Cheers.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Shawn Joy
What are the commands? Everything I see is c1t0d0, c1t1d0 - no
slice, just the whole disk.



Robert Milkowski wrote:
> Hello Shawn,
> 
> Thursday, December 13, 2007, 3:46:09 PM, you wrote:
> 
> SJ> Is it possible to bring one slice of a disk under zfs controller and 
> SJ> leave the others as ufs?
> 
> SJ> A customer is tryng to mirror one slice using zfs.
> 
> 
> Yes, it's - it just works.
> 
> 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread Eric Haycraft
People.. for the n-teenth time, there are only two ways to kill a troll. One 
involves a woodchipper and the possibility of an unwelcome visit from the FBI, 
and the other involves ignoring them. 

Internet Trolls:
http://en.wikipedia.org/wiki/Internet_troll
http://www.linuxextremist.com/?p=34

Another perspective:
http://sc.tri-bit.com/images/7/7e/greaterinternetfu#kwadtheory.jpg

The irony of this whole thing is that by feeding Bill's trollish tendencies, he 
has effectively eliminated himself from any job or contract where someone 
googles his name, and thus will give him an enormous amount of time to troll 
forums. Who in their right mind would consciously hire someone who calls people 
idiots randomly to avoid the topic at hand? Being unemployed will just piss him 
off more and his trolling will only get worse. Hence, you don't feed trolls!!
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] mirror a slice

2007-12-13 Thread Shawn Joy
Is it possible to bring one slice of a disk under zfs control and 
leave the others as ufs?

A customer is trying to mirror one slice using zfs.

Please respond to me directly and to the alias.

Thanks,
Shawn

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Robert Milkowski
Hello Shawn,

Thursday, December 13, 2007, 3:46:09 PM, you wrote:

SJ> Is it possible to bring one slice of a disk under zfs controller and 
SJ> leave the others as ufs?

SJ> A customer is tryng to mirror one slice using zfs.


Yes, it is - it just works.


-- 
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Cindy . Swearingen
Shawn,

Using slices for ZFS pools is generally not recommended so I think
we minimized any command examples with slices:

# zpool create tank mirror c1t0d0s0 c1t1d0s0

Keep in mind that using the slices from the same disk for both UFS
and ZFS makes administration more complex. Please see the ZFS BP
section here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools

* The recovery process of replacing a failed disk is more complex when
  disks contain both ZFS and UFS file systems on slices.
* ZFS pools (and underlying disks) that also contain UFS file systems on
  slices cannot be easily migrated to other systems by using zpool import
  and export features.
* In general, maintaining slices increases administration time and cost.
  Lower your administration costs by simplifying your storage pool
  configuration model.
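
For the specific "mirror one slice" case, the commands look like this
(device and slice names are examples only). Create a mirrored pool from
two slices:

# zpool create tank mirror c1t0d0s4 c1t1d0s4

Or, if a pool already exists on a single slice, attach a second slice to
convert it to a mirror (zpool status then shows the resilver progress):

# zpool attach tank c1t0d0s4 c1t1d0s4
# zpool status tank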

Cindy

Shawn Joy wrote:
> What are the commands? Everything I see is c1t0d0, c1t1d0.   no 
> slice just the completed disk.
> 
> 
> 
> Robert Milkowski wrote:
> 
>>Hello Shawn,
>>
>>Thursday, December 13, 2007, 3:46:09 PM, you wrote:
>>
>>SJ> Is it possible to bring one slice of a disk under zfs controller and 
>>SJ> leave the others as ufs?
>>
>>SJ> A customer is tryng to mirror one slice using zfs.
>>
>>
>>Yes, it's - it just works.
>>
>>
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread can you guess?
> Hello can,
> 
> Thursday, December 13, 2007, 12:02:56 AM, you wrote:
> 
> cyg> On the other hand, there's always the
> possibility that someone
> cyg> else learned something useful out of this.  And
> my question about
> 
> To be honest - there's basically nothing useful in
> the thread,
> perhaps except one thing - doesn't make any sense to
> listen to you.

I'm afraid you don't qualify to have an opinion on that, Robert - because you 
so obviously *haven't* really listened.  Until it became obvious that you never 
would, I was willing to continue to attempt to carry on a technical discussion 
with you, while ignoring the morons here who had nothing whatsoever in the way 
of technical comments to offer (but continued to babble on anyway).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding external USB disks

2007-12-13 Thread Eric Haycraft
You may want to peek here first. Tim has some scripts already and if not 
exactly what you want, I am sure it could be reverse engineered. 

http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people


Eric
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Toby Thain

On 13-Dec-07, at 1:56 PM, Shawn Joy wrote:

> What are the commands? Everything I see is c1t0d0, c1t1d0.   no
> slice just the completed disk.


I have used the following HOWTO. (Markup is TWiki, FWIW.)



Device names are for a 2-drive X2100. Other machines may differ, for  
example, X4100 drives may be =c3t2d0= and =c3t3d0=.

---++ Partitioning

This is done before installing Solaris 10, or after installing a new  
disk to replace a failed mirror disk.
* Run *format*, choose the correct disk device
* Enter *fdisk* from menu
* Delete any diagnostic partition, and existing Solaris partition
* Create one Solaris2 partition over 100% of the disk
* Exit *fdisk*; quit *format*

---++ Slice layout

|slice 0| root| 8192M| <-- this is not really large enough :-)
|slice 1| swap| 2048M|
|slice 2| -||
|slice 3| SVM metadb| 16M|
|slice 4| zfs| 68200M|
|slice 5| SVM metadb| 16M|
|slice 6| -||
|slice 7| SVM metadb| 16M|

The final slice layout should be saved using =prtvtoc /dev/rdsk/c1d0s2 >vtoc=

The second (mirror) disk can be forced into the same layout using  
=fmthard -s vtoc /dev/rdsk/c2d0s2=
(Replacement drives must be partitioned in exactly the same way, so  
it is recommended that a copy of the vtoc be kept in a file.)

GRUB must also be installed on the second disk:
=/sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c2d0s0=

---++ Solaris Volume Manager setup

The root and swap slices will be mirrored using SVM. See:
* http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#UFS.2FSVM
* http://sunsolve.sun.com/search/document.do?assetkey=1-9-83605-1

(As of Sol10U2 (June 06), ZFS is not supported for root partition.)

At this point the system has been installed on, and booted from the  
first disk, c1d0s0 (as root) and with swap from the same disk. The  
following steps set up SVM but don't interfere with currently mounted  
partitions. The second disk has already been partitioned identically  
to the first, and the data will be copied to the mirror after  
=metattach= below. Changing =/etc/vfstab= sets the machine to boot  
from the SVM mirror device in future.

* Create SVM metadata (slice 3) with redundant copies on slices 5  
and 7: %BR% =metadb -a -f c1d0s3 c2d0s3 c1d0s5 c2d0s5 c1d0s7 c2d0s7=
* Create submirrors on first disk (root and swap slices): %BR%
  =metainit -f d10 1 1 c1d0s0= %BR% =metainit -f d11 1 1 c1d0s1=
* Create submirrors on second disk: %BR%
  =metainit -f d20 1 1 c2d0s0= %BR% =metainit -f d21 1 1 c2d0s1=
* Create the mirrors: %BR%
  =metainit d0 -m d10= %BR% =metainit d1 -m d11=
* Take a backup copy of =/etc/vfstab=
* Define root slice: =metaroot d0= (this alters the mount device  
for / in =/etc/vfstab=, it should now be =/dev/md/dsk/d0=)
* Edit =/etc/vfstab= (changing device for swap to =/dev/md/dsk/d1=)
* Reboot to test. If there is a problem, use single user mode and  
revert vfstab. Confirm that root and swap devices are now the  
mirrored devices with =df= and =swap -l=
* Attach second halves to mirror: %BR% =metattach d0 d20= %BR%  
=metattach d1 d21=

Mirror will now begin to sync; progress can be checked with =metastat -c=

---+++ Also see

* [[http://slacksite.com/solaris/disksuite/disksuite.html  
recipe]] at slacksite.com

---++ ZFS setup

Slice 4 is set aside for the ZFS pool - the system's active data.

* Create pool: =zpool create pool mirror c1d0s4 c2d0s4=
* Create filesystem for home directories: =zfs create pool/home= %BR%
  (To make this active, move any existing home directories from =/home=
  into =/pool/home=; then =zfs set mountpoint=/home pool/home=; log out;
  and log back in.)
* Set up regular scrub - add to =crontab= a line such as:
  =0 4 1 * * zpool scrub pool=

bash-3.00# zpool create pool mirror c1d0s4 c2d0s4
bash-3.00# zpool status
   pool: pool
  state: ONLINE
  scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s4  ONLINE       0     0     0
            c2d0s4  ONLINE       0     0     0

errors: No known data errors
bash-3.00# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
pool  75.5K  65.5G  24.5K  /pool
bash-3.00#


---++ References
* [[http://docs.sun.com/app/docs/doc/819-5461 ZFS Admin Guide]]
* [[http://docs.sun.com/app/docs/doc/816-4520 SVM Admin Guide]]


>
>
>
> Robert Milkowski wrote:
>> Hello Shawn,
>>
>> Thursday, December 13, 2007, 3:46:09 PM, you wrote:
>>
>> SJ> Is it possible to bring one slice of a disk under zfs  
>> controller and
>> SJ> leave the others as ufs?
>>
>> SJ> A customer is tryng to mirror one slice using zfs.
>>
>>
>> Yes, it's - it just works.
>>
>>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] Finding external USB disks

2007-12-13 Thread David Dyer-Bennet
Eric Haycraft wrote:
> You may want to peek here first. Tim has some scripts already and if not 
> exactly what you want, I am sure it could be reverse engineered. 
>
> http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people
>
>   

Thanks, but I already read those, and referred to those in my recent 
posts, and described in some detail why they're not terribly useful to 
what I'm trying to do with my backup schedule (briefly, I want 
*scheduled* backups that adjust their behavior somewhat based on what 
external disks are available; Tim's scripts are very cleverly set up to 
trigger suitable backups when an external device is connected, which as 
he says is particularly suitable for laptops).

-- 
David Dyer-Bennet, [EMAIL PROTECTED]; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread Jim Mauro
Would you two please SHUT THE F$%K UP.

Dear God, my kids don't go on like this.

Please - let it die already.

Thanks very much.

/jim


can you guess? wrote:
>> Hello can,
>>
>> Thursday, December 13, 2007, 12:02:56 AM, you wrote:
>>
>> cyg> On the other hand, there's always the
>> possibility that someone
>> cyg> else learned something useful out of this.  And
>> my question about
>>
>> To be honest - there's basically nothing useful in
>> the thread,
>> perhaps except one thing - doesn't make any sense to
>> listen to you.
>> 
>
> I'm afraid you don't qualify to have an opinion on that, Robert - because you 
> so obviously *haven't* really listened.  Until it became obvious that you 
> never would, I was willing to continue to attempt to carry on a technical 
> discussion with you, while ignoring the morons here who had nothing 
> whatsoever in the way of technical comments to offer (but continued to babble 
> on anyway).
>
> - bill
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>   
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZIL and snapshots

2007-12-13 Thread Moore, Joe
I'm using an x4500 as a large data store for our VMware environment.  I
have mirrored the first 2 disks, and created a ZFS pool of the other 46:
22 pairs of mirrors, and 2 spares (optimizing for random I/O performance
rather than space).  Datasets are shared to the VMware ESX servers via
NFS.  We noticed that VMware mounts its NFS datastore with the SYNC
option, so every NFS write gets flagged with FILE_SYNC.  In testing,
synchronous writes are significantly slower than async, presumably
because of the strict ordering required for correctness (cache flushing
and ZIL).
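
For reference, a cut-down sketch of that layout (the device names here are
made up, and the real pool just repeats the mirror pairs 22 times):

  zpool create vmpool mirror c0t1d0 c1t1d0 mirror c0t2d0 c1t2d0 \
      spare c0t7d0 c1t7d0
  zfs create vmpool/esx01
  zfs set sharenfs=on vmpool/esx01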

Can anyone tell me if a ZFS snapshot taken when zil_disable=1 will be
crash-consistent with respect to the data written by VMware?  Are the
snapshot metadata updates serialized with pending non-metadata writes?
If an asynchronous write is issued before the snapshot is initiated, is
it guaranteed to be in the snapshot data, or can it be reordered to
after the snapshot?  Does a snapshot flush pending writes to disk?

To increase performance, the users are willing to "lose" an hour or two
of work (these are development/QA environments): In the event that the
x4500 crashes and loses the 16GB of cached (zil_disable=1) writes, we
roll back to the last hourly snapshot, and everyone's back to the way
they were.  However, I want to make sure that we will be able to boot a
crash-consistent VM from that rolled-back virtual disk.

Thanks for any knowledge you might have,
--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread Steve McKinty
I have a couple of questions and concerns about using ZFS in an environment 
where the underlying LUNs are replicated at a block level using products like 
HDS TrueCopy or EMC SRDF.  Apologies in advance for the length, but I wanted 
the explanation to be clear.

(I do realise that there are other possibilities such as zfs send/recv and 
there are technical and business pros and cons for the various options. I don't 
want to start a 'which is best' argument :) )

The CoW design of ZFS means that it goes to great lengths to always maintain 
on-disk self-consistency, and ZFS can make certain assumptions about state (e.g 
not needing fsck) based on that.  This is the basis of my questions. 

1) First issue relates to the überblock.  Updates to it are assumed to be 
atomic, but if the replication block size is smaller than the überblock then we 
can't guarantee that the whole überblock is replicated as an entity.  That 
could in theory result in a corrupt überblock at the
secondary. 

Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS 
just use an alternate überblock and rewrite the damaged one transparently?

2) Assuming that the replication maintains write-ordering, the secondary site 
will always have valid and self-consistent data, although it may be out-of-date 
compared to the primary if the replication is asynchronous, depending on link 
latency, buffering, etc. 

Normally most replication systems do maintain write ordering, *except* for 
one specific scenario.  If the replication is interrupted, for example 
secondary site down or unreachable due to a comms problem, the primary site 
will keep a list of changed blocks.  When contact between the sites is 
re-established there will be a period of 'catch-up' resynchronization.  In 
most, if not all, cases this is done on a simple block-order basis.  
Write-ordering is lost until the two sites are once again in sync and routine 
replication restarts. 

I can see this as having a major ZFS impact.  It would be possible for 
intermediate blocks to be replicated before the data blocks they point to, and 
in the worst case an updated überblock could be replicated before the block 
chains that it references have been copied.  This breaks the assumption that 
the on-disk format is always self-consistent. 

If a disaster happened during the 'catch-up', and the partially-resynchronized 
LUNs were imported into a zpool at the secondary site, what would/could happen? 
Refusal to accept the whole zpool? Rejection just of the files affected? System 
panic? How could recovery from this situation be achieved?

Obviously all filesystems can suffer with this scenario, but ones that expect 
less from their underlying storage (like UFS) can be fscked, and although data 
that was being updated is potentially corrupt, existing data should still be OK 
and usable.  My concern is that ZFS will handle this scenario less well. 

There are ways to mitigate this, of course, the most obvious being to take a 
snapshot of the (valid) secondary before starting resync, as a fallback.  This 
isn't always easy to do, especially since the resync is usually automatic; 
there is no clear trigger to use for the snapshot. It may also be difficult to 
synchronize the snapshot of all LUNs in a pool. I'd like to better understand 
the risks/behaviour of ZFS before starting to work on mitigation strategies. 

Thanks

Steve
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Richard Elling
MP wrote:
>> this anti-raid-card movement is puzzling. 
>> 
>
> I think you've misinterpreted my questions.
> I queried the necessity of paying extra for an seemingly unnecessary RAID 
> card for zfs. I didn't doubt that it could perform better.
> Wasn't one of the design briefs of zfs, that it would provide it's feature 
> set without expensive RAID hardware?
>   

In general, feature set != performance.  For example, a VIA x86-compatible
processor is not capable of beating the performance of a high-end Xeon,
though the feature sets are largely the same.  Additional examples abound.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to properly tell zfs of new GUID after a firmware upgrade changes the IDs

2007-12-13 Thread Shawn Ferry
Jill,

I was recently looking for a similar solution to try and reconnect a
renumbered device while the pool was live.

e.g. zpool online mypool <device>
As in zpool replace but with the indication that this isn't a new  
device.

What I have been doing to deal with the renumbering is exactly the
export, import and clear.  Although I have been dealing with
significantly smaller devices and can't speak to the delay issues.

Shawn



On Dec 13, 2007, at 12:16 PM, Jill Manfield wrote:

>
> My customer's zfs pools and their 6540 disk array had a firmware  
> upgrade that changed GUIDs so we need a procedure to let the zfs  
> know it changed. They are getting errors as if they replaced  
> drives.  But I need to make sure you know they have not "replaced"  
> any drives, and no drives have failed or are "bad". As such, they  
> have no interest in wiping any disks clean as indicated in 88130  
> info doc.
>
> Some background from customer:
>
> We have a large 6540 disk array, on which we have configured a  
> series of
> large RAID luns.  A few days ago, Sun sent a technician to upgrade the
> firmware of this array, which worked fine but which had the  
> deleterious
> effect of changing the "Volume IDs" associated with each lun.  So, the
> resulting luns now appear to our solaris 10 host (under mpxio) as  
> disks in
> /dev/rdsk with different 'target' components than they had before.
>
> Before the firmware upgrade we took the precaution of creating  
> duplicate
> luns on a different 6540 disk array, and using these to mirror each  
> of our
> zfs pools (as protection in case the firmware upgrade corrupted our  
> luns).
>
> Now, we simply want to ask zfs to find the devices under their new
> targets, recognize that they are existing zpool components, and have  
> it
> correct the configuration of each pool.  This would be similar to  
> having
> Veritas vxvm re-scan all disks with vxconfigd in the event of a
> "controller renumbering" event.
>
> The proper zfs method for doing this, I believe, is to simply do:
>
> zpool export mypool
> zpool import mypool
>
> Indeed, this has worked fine for me a few times today, and several  
> of our
> pools are now back to their original mirrored configuration.
>
> Here is a specific example, for the pool "ospf".
>
> The zpool status after the upgrade:
>
> diamond:root[1105]->zpool status ospf
>  pool: ospf
> state: DEGRADED
> status: One or more devices could not be opened.  Sufficient replicas
> exist for
>the pool to continue functioning in a degraded state.
> action: Attach the missing device and online it using 'zpool online'.
>   see: http://www.sun.com/msg/ZFS-8000-D3
> scrub: resilver completed with 0 errors on Tue Dec 11 18:26:53 2007
> config:
>
> NAME                                    STATE     READ WRITE CKSUM
> ospf                                    DEGRADED     0     0     0
>   mirror                                DEGRADED     0     0     0
>     c27t600A0B8000292B024BDC4731A7B8d0  UNAVAIL      0     0     0  cannot open
>     c27t600A0B800032619A093747554A08d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> This is due to the fact that the LUN which used to appear as
> c27t600A0B8000292B024BDC4731A7B8d0 is now actually
> c27t600A0B8000292B024D5B475E6E90d0.  It's the same LUN, but  
> since the
> firmware changed the Volume ID, the target portion is different.
>
> Rather than treating this as a "replaced" disk (which would incur an
> entire mirror resilvering, and would require the "trick" you sent of
> obliterating the disk label so the "in use" safeguard could be  
> avoided),
> we simply want to ask zfs to re-read its configuration to find this  
> disk.
>
> So we do this:
>
> diamond:root[1110]->zpool export -f ospf
> diamond:root[]->zpool import ospf
>
> and sure enough:
>
> diamond:root[1112]->zpool status ospf
>  pool: ospf
> state: ONLINE
> status: One or more devices is currently being resilvered.  The pool  
> will
>continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
> scrub: resilver in progress, 0.16% done, 2h53m to go
> config:
>
> NAME                                    STATE     READ WRITE CKSUM
> ospf                                    ONLINE       0     0     0
>   mirror                                ONLINE       0     0     0
>     c27t600A0B8000292B024D5B475E6E90d0  ONLINE       0     0     0
>     c27t600A0B800032619A093747554A08d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> (Note that it has self-initiated a resilvering, since in this case the
> mirror has been changed by users since the firmware upgrade.)
>
> The problem that Robert had was that when he initiated an export of  
> a pool
> (called "bgp") it froze for quite some time.  The corresponding  
> "import"
> of the same

Re: [zfs-discuss] Finding external USB disks

2007-12-13 Thread Tim Foster

On Wed, 2007-12-12 at 21:35 -0600, David Dyer-Bennet wrote:
> What are the approaches to finding what external USB disks are currently 
> connected?

Would "rmformat -l" or "eject -l" fit the bill ?

> The external USB backup disks in question have ZFS filesystems on them, 
> which may make a difference in finding them perhaps?

Nice.

 I dug around a bit with this a while back, and I'm not sure hal &
friends are doing the right thing with zpools on removable devices just
yet.  I'd expect that we'd have a "zpool import" triggered on a device
being plugged, analogous to the way we have pcfs disks automatically
mounted by the system. Indeed there are

/usr/lib/hal/hal-storage-zpool-export
/usr/lib/hal/hal-storage-zpool-import and
/etc/hal/fdi/policy/10osvendor/20-zfs-methods.fdi

but I haven't seen them actually doing anything useful when I insert a
disk with a pool on it. Does anyone know whether these should be working
now ? I'm not a hal expert...
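
Manually, of course, you can always check for and import a pool on a
freshly plugged disk (the pool name below is just an example):

  zpool import                # scans attached devices for importable pools
  zpool import usbbackup      # then import one by name
  zpool export usbbackup      # and export again before unplugging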

> I've glanced at Tim Foster's autobackup and related scripts, and they're 
> all about being triggered by the plug connection being made; which is 
> not what I need.

Yep, fair enough.

>   I don't actually want to start the big backup when I 
> plug in (or power on) the drive in the evening, it's supposed to wait 
> until late (to avoid competition with users).  (His autosnapshot script 
> may be just what I need for that part, though.)

The zfs-auto-snapshot service can perform a backup using a command set
in the "zfs/backup-save-cmd" property.

Setting that to be a script that automagically selects a USB device (from a
known list, or one with free space?) and points the stream at a relevant
"zfs recv" command to the pool provided by your backup device might be
just what you're after.
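
Something along these lines, perhaps (completely untested, and the pool
names are made up) - zfs-auto-snapshot would pipe the snapshot stream to
this script on stdin:

  #!/bin/ksh
  # receive the stream into whichever known backup pool is imported
  for pool in usbbackup1 usbbackup2; do
      if zpool list "$pool" >/dev/null 2>&1; then
          exec zfs recv -d "$pool"
      fi
  done
  echo "no backup pool available" >&2
  exit 1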

Perhaps this is a project for the Christmas holidays :-)

cheers,
tim


-- 
Tim Foster, Sun Microsystems Inc, Solaris Engineering Ops
http://blogs.sun.com/timf

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How to properly tell zfs of new GUID after a firmware upgrade changes the IDs

2007-12-13 Thread Jill Manfield

My customer's zfs pools and their 6540 disk array had a firmware upgrade that 
changed GUIDs so we need a procedure to let the zfs know it changed. They are 
getting errors as if they replaced drives.  But I need to make sure you know 
they have not "replaced" any drives, and no drives have failed or are "bad". As 
such, they have no interest in wiping any disks clean as indicated in 88130 
info doc.

Some background from customer:

We have a large 6540 disk array, on which we have configured a series of
large RAID luns.  A few days ago, Sun sent a technician to upgrade the
firmware of this array, which worked fine but which had the deleterious
effect of changing the "Volume IDs" associated with each lun.  So, the
resulting luns now appear to our solaris 10 host (under mpxio) as disks in
/dev/rdsk with different 'target' components than they had before.

Before the firmware upgrade we took the precaution of creating duplicate
luns on a different 6540 disk array, and using these to mirror each of our
zfs pools (as protection in case the firmware upgrade corrupted our luns).

Now, we simply want to ask zfs to find the devices under their new
targets, recognize that they are existing zpool components, and have it
correct the configuration of each pool.  This would be similar to having
Veritas vxvm re-scan all disks with vxconfigd in the event of a
"controller renumbering" event.

The proper zfs method for doing this, I believe, is to simply do:

zpool export mypool
zpool import mypool

Indeed, this has worked fine for me a few times today, and several of our
pools are now back to their original mirrored configuration.

Here is a specific example, for the pool "ospf".

The zpool status after the upgrade:

diamond:root[1105]->zpool status ospf
  pool: ospf
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas
exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Tue Dec 11 18:26:53 2007
config:

NAME                                    STATE     READ WRITE CKSUM
ospf                                    DEGRADED     0     0     0
  mirror                                DEGRADED     0     0     0
    c27t600A0B8000292B024BDC4731A7B8d0  UNAVAIL      0     0     0  cannot open
    c27t600A0B800032619A093747554A08d0  ONLINE       0     0     0

errors: No known data errors

This is due to the fact that the LUN which used to appear as
c27t600A0B8000292B024BDC4731A7B8d0 is now actually
c27t600A0B8000292B024D5B475E6E90d0.  It's the same LUN, but since the
firmware changed the Volume ID, the target portion is different.

Rather than treating this as a "replaced" disk (which would incur an
entire mirror resilvering, and would require the "trick" you sent of
obliterating the disk label so the "in use" safeguard could be avoided),
we simply want to ask zfs to re-read its configuration to find this disk.

So we do this:

diamond:root[1110]->zpool export -f ospf
diamond:root[]->zpool import ospf

and sure enough:

diamond:root[1112]->zpool status ospf
  pool: ospf
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 0.16% done, 2h53m to go
config:

NAME                                    STATE     READ WRITE CKSUM
ospf                                    ONLINE       0     0     0
  mirror                                ONLINE       0     0     0
    c27t600A0B8000292B024D5B475E6E90d0  ONLINE       0     0     0
    c27t600A0B800032619A093747554A08d0  ONLINE       0     0     0

errors: No known data errors

(Note that it has self-initiated a resilvering, since in this case the
mirror has been changed by users since the firmware upgrade.)

The problem that Robert had was that when he initiated an export of a pool
(called "bgp") it froze for quite some time.  The corresponding "import"
of the same pool took 12 hours to complete.  I have not been able to
replicate this myself, but that was the essence of the problem.

So again, we do NOT want to "zero out" any of our disks, we are not trying
to forcibly use "replaced" disks.  We simply wanted zfs to re-read the
devices under /dev/rdsk and update each pool with the correct disk
targets.

If you can confirm that a simple export/import is the proper procedure for
this (followed by a "clear" once the resulting resilvering finishes), I
would appreciate it.  And, if you can postulate what may have caused the
"freeze" that Robert noticed, that would put our minds at ease.



TIA,

Any assistance on this would be greatly appreciated and or pointers on helpful 
documentation.

-- 
  

Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Richard Elling
[EMAIL PROTECTED] wrote:
> Shawn,
>
> Using slices for ZFS pools is generally not recommended so I think
> we minimized any command examples with slices:
>
> # zpool create tank mirror c1t0d0s0 c1t1d0s0
>   

Cindy,
I think the term "generally not recommended" requires more context.  In
the case of a small system, particularly one which you would find on a
laptop or desktop, it is often the case that disks share multiple
purposes, beyond ZFS.  I think the way we have written this in the best
practices wiki is fine, but perhaps we should ask the group at large.
Thoughts anyone?

I do like the minimization for the examples, though.  If one were to
actually read any of the manuals, we clearly talk about how whole disks
or slices are fine.  However, on occasion someone will propagate the news
that ZFS only works with whole disks and we have to correct the confusion
afterwards.
 -- richard
> Keep in mind that using the slices from the same disk for both UFS
> and ZFS makes administration more complex. Please see the ZFS BP
> section here:
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools
>
> * The recovery process of replacing a failed disk is more complex when 
> disks contain both ZFS and UFS file systems on
> slices.
>   * ZFS pools (and underlying disks) that also contain UFS file systems 
> on slices cannot be easily migrated to other
> systems by using zpool import and export features.
>   * In general, maintaining slices increases administration time and 
> cost. Lower your administration costs by
> simplifying your storage pool configuration model.
>
> Cindy
>
> Shawn Joy wrote:
>   
>> What are the commands? Everything I see is c1t0d0, c1t1d0.   no 
>> slice just the completed disk.
>>
>>
>>
>> Robert Milkowski wrote:
>>
>> 
>>> Hello Shawn,
>>>
>>> Thursday, December 13, 2007, 3:46:09 PM, you wrote:
>>>
>>> SJ> Is it possible to bring one slice of a disk under zfs controller and 
>>> SJ> leave the others as ufs?
>>>
>>> SJ> A customer is tryng to mirror one slice using zfs.
>>>
>>>
>>> Yes, it's - it just works.
>>>
>>>
>>>   
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread MP
> Additional examples abound.

Doubtless :)

More usefully, can you confirm whether Solaris works on this chassis without 
the RAID controller?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror a slice

2007-12-13 Thread Toby Thain

On 13-Dec-07, at 3:54 PM, Richard Elling wrote:

> [EMAIL PROTECTED] wrote:
>> Shawn,
>>
>> Using slices for ZFS pools is generally not recommended so I think
>> we minimized any command examples with slices:
>>
>> # zpool create tank mirror c1t0d0s0 c1t1d0s0
>>
>
> Cindy,
> I think the term "generally not recommended" requires more  
> context.  In
> the case
> of a small system, particularly one which you would find on a  
> laptop or
> desktop,
> it is often the case that disks share multiple purposes, beyond ZFS.


In particular in a 2-disk system that boots from UFS (that was my  
situation).

--Toby

> I
> think the
> way we have written this in the best practices wiki is fine, but  
> perhaps
> we should
> ask the group at large.  Thoughts anyone?
>
> I do like the minimization for the examples, though.  If one were to
> actually
> read any of the manuals, we clearly talk about how whole disks or  
> slices
> are fine.  However, on occasion someone will propagate the news  
> that ZFS
> only works with whole disks and we have to correct the confusion  
> afterwards.
>  -- richard
>> Keep in mind that using the slices from the same disk for both UFS
>> and ZFS makes administration more complex. ...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What does "dataset is busy" actually mean?

2007-12-13 Thread Jim Klimov
I've hit the problem myself recently, and mounting the filesystem cleared 
something in the brains of ZFS and allowed me to snapshot.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg00812.html

PS: I'll use Google before asking some questions, à la (C) Bart Simpson.
That's how I found your question ;)
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-12-13 Thread can you guess?
> Would you two please SHUT THE F$%K UP.

Just for future reference, if you're attempting to squelch a public 
conversation it's often more effective to use private email to do it rather 
than contribute to the continuance of that public conversation yourself.

Have a nice day!

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
> Are there benchmarks somewhere showing a RAID10
> implemented on an LSI card with, say, 128MB of cache
> being beaten in terms of performance by a similar
> zraid configuration with no cache on the drive
> controller?
> 
> Somehow I don't think they exist. I'm all for data
> scrubbing, but this anti-raid-card movement is
> puzzling.

Oh, for joy - a chance for me to say something *good* about ZFS, rather than 
just try to balance out excessive enthusiasm.

Save for speeding up synchronous writes (if it has enough on-board NVRAM to 
hold them until it's convenient to destage them to disk), a RAID-10 card should 
not enjoy any noticeable performance advantage over ZFS mirroring.

By contrast, if extremely rare undetected and (other than via ZFS checksums) 
undetectable (or considerably more common undetected but detectable via disk 
ECC codes, *if* the data is accessed) corruption occurs, if the RAID card is 
used to mirror the data there's a good chance that even ZFS's validation scans 
won't see the problem (because the card happens to access the good copy for the 
scan rather than the bad one) - in which case you'll lose that data if the disk 
with the good data fails.  And in the case of (extremely rare) 
otherwise-undetectable corruption, if the card *does* return the bad copy then 
IIRC ZFS (not knowing that a good copy also exists) will just claim that the 
data is gone (though I don't know if it will then flag it such that you'll 
never have an opportunity to find the good copy).

If the RAID card scrubs its disks the difference (now limited to the extremely 
rare undetectable-via-disk-ECC corruption) becomes pretty negligible - but I'm 
not sure how many RAIDs below the near-enterprise category perform such scrubs.

In other words, if you *don't* otherwise scrub your disks then ZFS's 
checksums-plus-internal-scrubbing mechanisms assume greater importance:  it's 
only the contention that other solutions that *do* offer scrubbing can't 
compete with ZFS in effectively protecting your data that's somewhat over the 
top.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding external USB disks

2007-12-13 Thread Marion Hakanson
[EMAIL PROTECTED] said:
> What are the approaches to finding what external USB disks are currently
> connected?   I'm starting on backup scripts, and I need to check which
> volumes are present before I figure out what to back up to them.  I  
> . . .

In addition to what others have suggested so far, "cfgadm -l" lists usb-
and firewire-connected drives (even those plugged-in but not mounted).
So scripts can check that way as well.

Regards,

Marion



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread Richard Elling
Steve McKinty wrote:
> I have a couple of questions and concerns about using ZFS in an environment 
> where the underlying LUNs are replicated at a block level using products like 
> HDS TrueCopy or EMC SRDF.  Apologies in advance for the length, but I wanted 
> the explanation to be clear.
>
> (I do realise that there are other possibilities such as zfs send/recv and 
> there are technical and business pros and cons for the various options. I 
> don't want to start a 'which is best' argument :) )
>
> The CoW design of ZFS means that it goes to great lengths to always maintain 
> on-disk self-consistency, and ZFS can make certain assumptions about state 
> (e.g not needing fsck) based on that.  This is the basis of my questions. 
>
> 1) First issue relates to the überblock.  Updates to it are assumed to be 
> atomic, but if the replication block size is smaller than the überblock then 
> we can't guarantee that the whole überblock is replicated as an entity.  That 
> could in theory result in a corrupt überblock at the
> secondary. 
>   

The uberblock contains a circular queue of updates.  For all practical
purposes, this is COW.  The updates I measure are usually 1 block
(or, to put it another way, I don't recall seeing more than 1 block being
updated... I'd have to recheck my data)

> Will this be caught and handled by the normal ZFS checksumming? If so, does 
> ZFS just use an alternate überblock and rewrite the damaged one transparently?
>
>   

The checksum should catch it.  To be safe, there are 4 copies of the 
uberblock.

> 2) Assuming that the replication maintains write-ordering, the secondary site 
> will always have valid and self-consistent data, although it may be 
> out-of-date compared to the primary if the replication is asynchronous, 
> depending on link latency, buffering, etc. 
>
> Normally most replication systems do maintain write ordering, [i]except[/i] 
> for one specific scenario.  If the replication is interrupted, for example 
> secondary site down or unreachable due to a comms problem, the primary site 
> will keep a list of changed blocks.  When contact between the sites is 
> re-established there will be a period of 'catch-up' resynchronization.  In 
> most, if not all, cases this is done on a simple block-order basis.  
> Write-ordering is lost until the two sites are once again in sync and routine 
> replication restarts. 
>
> I can see this has having major ZFS impact.  It would be possible for 
> intermediate blocks to be replicated before the data blocks they point to, 
> and in the worst case an updated überblock could be replicated before the 
> block chains that it references have been copied.  This breaks the assumption 
> that the on-disk format is always self-consistent. 
>
> If a disaster happened during the 'catch-up', and the 
> partially-resynchronized LUNs were imported into a zpool at the secondary 
> site, what would/could happen? Refusal to accept the whole zpool? Rejection 
> just of the files affected? System panic? How could recovery from this 
> situation be achieved?
>   

I think all of these reactions to the double-failure mode are possible.
The version of ZFS used will also have an impact as the later versions
are more resilient.  I think that in most cases, only the affected files
will be impacted.  zpool scrub will ensure that everything is consistent
and mark those files which fail to checksum properly.
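A minimal check along those lines (pool name is illustrative):

  zpool scrub tank
  zpool status -v tank    # -v lists any files with unrecoverable errors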

> Obviously all filesystems can suffer with this scenario, but ones that expect 
> less from their underlying storage (like UFS) can be fscked, and although 
> data that was being updated is potentially corrupt, existing data should 
> still be OK and usable.  My concern is that ZFS will handle this scenario 
> less well. 
>   

...databases too...
It might be easier to analyze this from the perspective of the transaction
group than an individual file.  Since ZFS is COW, you may have a
state where a transaction group is incomplete, but the previous data
state should be consistent.

> There are ways to mitigate this, of course, the most obvious being to take a 
> snapshot of the (valid) secondary before starting resync, as a fallback.  
> This isn't always easy to do, especially since the resync is usually 
> automatic; there is no clear trigger to use for the snapshot. It may also be 
> difficult to synchronize the snapshot of all LUNs in a pool. I'd like to 
> better understand the risks/behaviour of ZFS before starting to work on 
> mitigation strategies. 
>
>   

I don't see how snapshots would help.  The inherent transaction group
commits should be sufficient.  Or, to look at this another way, a snapshot
is really just a metadata change.

I am more worried about how the storage admin sets up the LUN groups.
The human factor can really ruin my day...
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Frank Cusack
On December 13, 2007 9:47:00 AM -0800 MP <[EMAIL PROTECTED]> wrote:
>> Additional examples abound.
>
> Doubtless :)
>
> More usefully, can you confirm whether Solaris works on this chassis
> without the RAID controller?

way back, i had Solaris working with a promise j200s (jbod sas) chassis,
to the extent that the sas driver at the time worked.  i can't IMAGINE
why this chassis would be any different from Solaris' perspective.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Frank Cusack
On December 13, 2007 11:34:54 AM -0800 "can you guess?" 
<[EMAIL PROTECTED]> wrote:
> By contrast, if extremely rare undetected and (other than via ZFS
> checksums) undetectable (or considerably more common undetected but
> detectable via disk ECC codes, *if* the data is accessed) corruption
> occurs, if the RAID card is used to mirror the data there's a good chance
> that even ZFS's validation scans won't see the problem (because the card
> happens to access the good copy for the scan rather than the bad one) -
> in which case you'll lose that data if the disk with the good data fails.
> And in the case of (extremely rare) otherwise-undetectable corruption, if
> the card *does* return the bad copy then IIRC ZFS (not knowing that a
> good copy also exists) will just claim that the data is gone (though I
> don't know if it will then flag it such that you'll never have an
> opportunity to find the good copy).

i like this answer, except for what you are implying by "extremely rare".

> If the RAID card scrubs its disks the difference (now limited to the
> extremely rare undetectable-via-disk-ECC corruption) becomes pretty
> negligible - but I'm not sure how many RAIDs below the near-enterprise
> category perform such scrubs.
>
> In other words, if you *don't* otherwise scrub your disks then ZFS's
> checksums-plus-internal-scrubbing mechanisms assume greater importance:
> it's only the contention that other solutions that *do* offer scrubbing
> can't compete with ZFS in effectively protecting your data that's
> somewhat over the top.

the problem with your discounting of zfs checksums is that you aren't
taking into account that "extremely rare" is relative to the number of
transactions, which are "extremely high".  in such a case even "extremely
rare" errors do happen, and not just to "extremely few" folks, but i would
say to all enterprises.  hell it happens to home users.

when the difference between an unrecoverable single bit error is not just
1 bit but the entire file, or corruption of an entire database row (etc),
those small and infrequent errors are an "extremely big" deal.

considering all the pieces, i would much rather run zfs on a jbod than
on a raid, wherever i could.  it gives better data protection, and it
is ostensibly cheaper.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Toby Thain

On 13-Dec-07, at 6:28 PM, Frank Cusack wrote:

> On December 13, 2007 11:34:54 AM -0800 "can you guess?"
> <[EMAIL PROTECTED]> wrote:
>> By contrast, if extremely rare undetected and (other than via ZFS
>> checksums) undetectable (or considerably more common undetected but
>> detectable via disk ECC codes, *if* the data is accessed) corruption
>> occurs, if the RAID card is used to mirror the data there's a good  
>> chance
>> that even ZFS's validation scans won't see the problem (because  
>> the card
>> happens to access the good copy for the scan rather than the bad  
>> one) -
>> in which case you'll lose that data if the disk with the good data  
>> fails.

Which is exactly why ZFS should do the mirroring...

>> And in the case of (extremely rare) otherwise-undetectable  
>> corruption, if
>> the card *does* return the bad copy then IIRC ZFS (not knowing that a
>> good copy also exists) will just claim that the data is gone  
>> (though I
>> don't know if it will then flag it such that you'll never have an
>> opportunity to find the good copy).

Ditto.

>
> i like this answer, except for what you are implying by "extremely  
> rare".
>
>> If the RAID card scrubs its disks

A scrub without checksum puts a huge burden on disk firmware and  
error reporting paths :-)

--Toby

>> the difference (now limited to the
>> extremely rare undetectable-via-disk-ECC corruption) becomes pretty
>> negligible - but I'm not sure how many RAIDs below the near- 
>> enterprise
>> category perform such scrubs.
>>
>> In other words, if you *don't* otherwise scrub your disks then ZFS's
>> checksums-plus-internal-scrubbing mechanisms assume greater  
>> importance:
>> it's only the contention that other solutions that *do* offer  
>> scrubbing
>> can't compete with ZFS in effectively protecting your data that's
>> somewhat over the top.
>
> the problem with your discounting of zfs checksums is that you aren't
> taking into account that "extremely rare" is relative to the number of
> transactions, which are "extremely high". ...
>
> considering all the pieces, i would much rather run zfs on a jbod than
> on a raid, wherever i could.  it gives better data protection, and it
> is ostensibly cheaper.
>
> -frank
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool version 3 & Uberblock version 9 , zpool upgrade only half succeeded?

2007-12-13 Thread kristof
We are currently experiencing a huge performance drop on our zfs storage
server.

We have 2 pools: pool 1, stor, is a raidz out of 7 iscsi nodes; home is a local 
mirror pool. Recently we had some issues with one of the storage nodes, and because 
of that the pool was degraded. Since we did not succeed in bringing this 
storage node back online (at the zfs level) we upgraded our NAS head from opensolaris 
b57 to b77. After the upgrade we successfully resilvered the pool (the resilver took 1 
week! -> 14 TB). Finally we upgraded the pool to version 9 (coming from 
version 3). Now the zpool is healthy again, but performance really s*cks. Accessing 
older data takes way too much time. Doing "dtruss -a find ." in a zfs filesystem 
on this b77 server is extremely slow, while it is fast in our backup location 
where we are still using opensolaris b57 and zpool version 3. 

Writing new data seems normal; we don't see huge issues there. The real problem 
is doing ls, rm or find in filesystems with lots of files (+5, not in 1 
directory but spread over multiple subfolders).

Today I found that not only does zpool upgrade exist, but also zfs upgrade; most 
filesystems are still at version 1 while some new ones are already at version 3. 
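
A quick check of that, for reference (zfs upgrade with no arguments only
reports; nothing changes until you pass -a or a filesystem name):

   # zfs upgrade        # lists filesystems not running the current version
   # zfs upgrade -a     # upgrades all filesystems to the current version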

Running zdb we also saw that there is a mismatch in the version information: our 
storage pool is listed as version 3 while the uberblock is at version 9, yet when we 
run zpool upgrade, it tells us all pools are upgraded to the latest version.

Below is the zdb output: 

zdb stor
    version=3
    name='stor'
    state=0
    txg=6559447
    pool_guid=14464037545511218493
    hostid=341941495
    hostname='fileserver011'
    vdev_tree
        type='root'
        id=0
        guid=14464037545511218493
        children[0]
            type='raidz'
            id=0
            guid=179558698360846845
            nparity=1
            metaslab_array=13
            metaslab_shift=37
            ashift=9
            asize=20914156863488
            is_log=0
            children[0]
                type='disk'
                id=0
                guid=640233961847538260
                path='/dev/dsk/c2t3d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=36
            children[1]
                type='disk'
                id=1
                guid=7833573669820754721
                path='/dev/dsk/c2t4d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=22
            children[2]
                type='disk'
                id=2
                guid=13685988517147825972
                path='/dev/dsk/c2t5d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=17
            children[3]
                type='disk'
                id=3
                guid=13514021245008793227
                path='/dev/dsk/c2t6d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=21
            children[4]
                type='disk'
                id=4
                guid=15871506866153751690
                path='/dev/dsk/c2t9d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=20
            children[5]
                type='disk'
                id=5
                guid=11392907262189654902
                path='/dev/dsk/c2t7d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=19
            children[6]
                type='disk'
                id=6
                guid=8472117762643335828
                path='/dev/dsk/c2t8d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                phys_path='/iscsi/[EMAIL PROTECTED],0:a'
                whole_disk=1
                DTL=18

Uberblock

    magic = 00bab10c
    version = 9
    txg = 6692849
    guid_sum = 12266969233845513474
    timestamp = 1197546530 UTC = Thu Dec 13 12:48:50 2007
fileserver

If we compare with zpool home (this pool was created after installing

Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
...

> when the difference between an unrecoverable single
> bit error is not just
> 1 bit but the entire file, or corruption of an entire
> database row (etc),
> those small and infrequent errors are an "extremely
> big" deal.

You are confusing unrecoverable disk errors (which are rare but orders of 
magnitude more common) with otherwise *undetectable* errors (the occurrence of 
which is at most once in petabytes by the studies I've seen, rather than once 
in terabytes), despite my attempt to delineate the difference clearly.  
Conventional approaches using scrubbing provide as complete protection against 
unrecoverable disk errors as ZFS does:  it's only the far rarer otherwise 
*undetectable* errors that ZFS catches and they don't.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL and snapshots

2007-12-13 Thread Ross
Heh, interesting to see somebody else using the sheer number of disks in the 
Thumper to their advantage :)

Have you thought of solid state cache for the ZIL?  There's a 16GB battery 
backed PCI card out there, I don't know how much it costs, but the blog where I 
saw it mentioned a 20x improvement in performance for small random writes.
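
If such a card shows up as an ordinary block device, attaching it as a
dedicated log device should be a one-liner (a rough sketch, pool and device
names made up):

   # zpool add tank log c4t0d0

Separate intent log devices need a reasonably recent zpool version; zpool
upgrade -v shows whether yours supports them.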
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
...

> >> If the RAID card scrubs its disks
> 
> A scrub without checksum puts a huge burden on disk
> firmware and  
> error reporting paths :-)

Actually, a scrub without checksum places far less burden on the disks and 
their firmware than ZFS-style scrubbing does, because it merely has to scan the 
disk sectors sequentially rather than follow a tree path to each relatively 
small leaf block.  Thus it also compromises runtime operation a lot less as 
well (though in both cases doing it infrequently in the background should 
usually reduce any impact to acceptable levels).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL and snapshots

2007-12-13 Thread Moore, Joe
> Have you thought of solid state cache for the ZIL?  There's a 
> 16GB battery backed PCI card out there, I don't know how much 
> it costs, but the blog where I saw it mentioned a 20x 
> improvement in performance for small random writes.

Thought about it, looked in the Sun Store, couldn't find one, and cut
the PO.

Haven't gone back to get a new approval.  I did put a couple of the
MTron 32GB SSD drives on the christmas wishlist (aka 2008 budget)

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread can you guess?
Great questions.

> 1) First issue relates to the überblock.  Updates to
> it are assumed to be atomic, but if the replication
> block size is smaller than the überblock then we
> can't guarantee that the whole überblock is
> replicated as an entity.  That could in theory result
> in a corrupt überblock at the
> secondary. 
> 
> Will this be caught and handled by the normal ZFS
> checksumming? If so, does ZFS just use an alternate
> überblock and rewrite the damaged one transparently?

ZFS already has to deal with potential uberblock partial writes if it contains 
multiple disk sectors (and it might be prudent even if it doesn't, as Richard's 
response seems to suggest).  Common ways of dealing with this problem include 
dumping it into the log (in which case the log with its own internal recovery 
procedure becomes the real root of all evil) or cycling around at least two 
locations per mirror copy (Richard's response suggests that there are 
considerably more, and that perhaps each one is written in quadruplicate) such 
that the previous uberblock would still be available if the new write tanked.  
ZFS-style snapshots complicate both approaches unless special provisions are 
taken - e.g., copying the current uberblock on each snapshot and hanging a list 
of these snapshot uberblock addresses off the current uberblock, though even 
that might run into interesting complications under the scenario which you 
describe below.  Just using the 'queue' that Richard describes to accumulate 
snapshot uberblocks would limit the number of concurrent snapshots to less than 
the size of that queue.

In any event, as long as writes to the secondary copy don't continue after a 
write failure of the kind that you describe has occurred (save for the kind of 
catch-up procedure that you mention later), ZFS's internal facilities should 
not be confused by encountering a partial uberblock update at the secondary, 
any more than they'd be confused by encountering it on an unreplicated system 
after restart.

> 
> 2) Assuming that the replication maintains
> write-ordering, the secondary site will always have
> valid and self-consistent data, although it may be
> out-of-date compared to the primary if the
> replication is asynchronous, depending on link
> latency, buffering, etc. 
> 
> Normally most replication systems do maintain write
> ordering, *except* for one specific scenario.
> If the replication is interrupted, for example
> secondary site down or unreachable due to a comms
> problem, the primary site will keep a list of
> changed blocks.  When contact between the sites is
> re-established there will be a period of 'catch-up'
> resynchronization.  In most, if not all, cases this
> is done on a simple block-order basis.
> Write-ordering is lost until the two sites are once
>  again in sync and routine replication restarts. 
> 
> I can see this as having major ZFS impact.  It would
> be possible for intermediate blocks to be replicated
> before the data blocks they point to, and in the
> worst case an updated überblock could be replicated
> before the block chains that it references have been
> copied.  This breaks the assumption that the on-disk
> format is always self-consistent. 
> 
> If a disaster happened during the 'catch-up', and the
> partially-resynchronized LUNs were imported into a
> zpool at the secondary site, what would/could happen?
> Refusal to accept the whole zpool? Rejection just of
> the files affected? System panic? How could recovery
> from this situation be achieved?

My inclination is to say "By repopulating your environment from backups":  it 
is not reasonable to expect *any* file system to operate correctly, or to 
attempt any kind of comprehensive recovery (other than via something like fsck, 
with no guarantee of how much you'll get back), when the underlying hardware 
transparently reorders updates which the file system has explicitly ordered 
when it presented them.

But you may well be correct in suspecting that there's more potential for 
data loss should this occur in a ZFS environment than in update-in-place 
environments where only portions of the tree structure that were explicitly 
changed during the connection hiatus would likely be affected by such a 
recovery interruption (though even there if a directory changed enough to 
change its block structure on disk you could be in more trouble).

> 
> Obviously all filesystems can suffer with this
> scenario, but ones that expect less from their
> underlying storage (like UFS) can be fscked, and
> although data that was being updated is potentially
> corrupt, existing data should still be OK and usable.
> My concern is that ZFS will handle this scenario
>  less well. 
> 
> There are ways to mitigate this, of course, the most
> obvious being to take a snapshot of the (valid)
> secondary before starting resync, as a fallback.

You're talking about an HDS- or EMC-level snapshot, right?

> This isn't always easy to do, especially sinc

Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.

2007-12-13 Thread Henk Langeveld
J.P. King wrote:
>> Wow, that's a neat idea, and crazy at the same time. But the mknod's minor
>> value can be 0-262143 so it probably would be doable with some loss of
>> memory and efficiency. But maybe not :) (I would need one lofi dev per
>> filesystem right?)
>>
>> Definitely worth remembering if I need to do something small/quick.
> 
> You're confusing lofi and lofs, I think.  Have a look at man lofs.
> 
> Now all _I_ would like is translucent options to that and I'd solve one of 
> my major headaches.

Check ast-open[1] for the 3d command that implements the nDFS, or 
multiple dimension file system, allowing you to overlay directories.
The 3d [2] utility allows you to run a command with all file system
calls intercepted.

Any writes will go into the top-level directory, while reads pass
through until a matching file is found.

System calls are intercepted by an LD_PRELOAD library, so each
process can have its own settings.


[1] http://www.research.att.com/~gsf/download/gen/ast-open.html
[2] http://www.research.att.com/~gsf/man/man1/3d.html
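
For the lofs route mentioned above, a minimal sketch (paths made up):

   # mount -F lofs /tank/home/user1 /export/aggregate/user1

Unlike the 3d/LD_PRELOAD approach, lofs is a kernel loopback filesystem, so
every process sees the mount without preloading anything, but it has no
translucent/union semantics, which is what J.P. was after.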
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Frank Cusack
On December 13, 2007 12:51:55 PM -0800 "can you guess?" 
<[EMAIL PROTECTED]> wrote:
> ...
>
>> when the difference between an unrecoverable single
>> bit error is not just
>> 1 bit but the entire file, or corruption of an entire
>> database row (etc),
>> those small and infrequent errors are an "extremely
>> big" deal.
>
> You are confusing unrecoverable disk errors (which are rare but orders of
> magnitude more common) with otherwise *undetectable* errors (the
> occurrence of which is at most once in petabytes by the studies I've
> seen, rather than once in terabytes), despite my attempt to delineate the
> difference clearly.

No I'm not.  I know exactly what you are talking about.

>  Conventional approaches using scrubbing provide as
> complete protection against unrecoverable disk errors as ZFS does:  it's
> only the far rarer otherwise *undetectable* errors that ZFS catches and
> they don't.

yes.  far rarer and yet home users still see them.

that the home user ever sees these extremely rare (undetectable) errors
may have more to do with poor connection (cables, etc) to the disk, and
less to do with disk media errors.  enterprise users probably have
better connectivity and see errors due to high i/o.  just thinking
out loud.

regardless, zfs on non-raid provides better protection than zfs on raid
(well, depending on raid configuration) so just from the data integrity
POV non-raid would generally be preferred.  the fact that the type of
error being prevented is rare doesn't change that and i was further
arguing that even though it's rare the impact can be high so you don't
want to write it off.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Marion Hakanson
[EMAIL PROTECTED] said:
> You are confusing unrecoverable disk errors (which are rare but orders of
> magnitude more common) with otherwise *undetectable* errors (the occurrence
> of which is at most once in petabytes by the studies I've seen, rather than
> once in terabytes), despite my attempt to delineate the difference clearly.

I could use a little clarification on how these unrecoverable disk errors
behave -- or maybe a lot, depending on one's point of view.

So, when one of these "once in around ten (or 100) terabytes read" events
occurs, my understanding is that a read error is returned by the drive,
and the corresponding data is lost as far as the drive is concerned.
Maybe just a bit is gone, maybe a byte, maybe a disk sector, it probably
depends on the disk, OS, driver, and/or the rest of the I/O hardware
chain.  Am I doing OK so far?


> Conventional approaches using scrubbing provide as complete protection
> against unrecoverable disk errors as ZFS does:  it's only the far rarer
> otherwise *undetectable* errors that ZFS catches and they don't. 

I found it helpful to my own understanding to try restating the above
in my own words.  Maybe others will as well.

If my assumptions are correct about how these unrecoverable disk errors
are manifested, then a "dumb" scrubber will find such errors by simply
trying to read everything on disk -- no additional checksum is required.
Without some form of parity or replication, the data is lost, but at
least somebody will know about it.

Now it seems to me that without parity/replication, there's not much
point in doing the scrubbing, because you could just wait for the error
to be detected when someone tries to read the data for real.  It's
only if you can repair such an error (before the data is needed) that
such scrubbing is useful.

For those well-versed in this stuff, apologies for stating the obvious.

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread Ricardo M. Correia




Steve McKinty wrote:

> 1) First issue relates to the überblock.  Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity.  That could in theory result in a corrupt überblock at the secondary.
>
> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?

Yes, ZFS uberblocks are self-checksummed with SHA-256 and when opening
the pool it uses the latest valid uberblock that it can find. So that
is not a problem.
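
You can see some of this from userland (a rough illustration, pool and device
names made up):

   # zdb -u tank                  # prints the active uberblock: version, txg, guid_sum, timestamp
   # zdb -l /dev/dsk/c1t0d0s0     # prints the config stored in each of the four vdev labels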


> 2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario.  If the replication is interrupted, for example secondary site down or unreachable due to a comms problem, the primary site will keep a list of changed blocks.  When contact between the sites is re-established there will be a period of 'catch-up' resynchronization.  In most, if not all, cases this is done on a simple block-order basis.  Write-ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having major ZFS impact.  It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied.  This breaks the assumption that the on-disk format is always self-consistent.
>
> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

I believe your understanding is correct. If you expect such a
double-failure, you cannot rely on being able to recover your pool at
the secondary site.

The newest uberblocks would be among the first blocks to be replicated
(2 of the uberblock arrays are situated at the start of the vdev) and
your whole block tree might be inaccessible if the latest Meta Object
Set blocks were not also replicated. You might be lucky and be able to
mount your filesystems because ZFS keeps 3 separate copies of the most
important metadata and it tries to keep apart each copy by about 1/8th
of the disk, but even then I wouldn't count on it.

If ZFS can't open the pool due to this kind of corruption, you would
get the following message:

status: The pool metadata is corrupted and the pool cannot be
opened.
action: Destroy and re-create the pool from a backup source.

At this point, you could try zeroing out the first 2 uberblock
arrays so that ZFS tries using an older uberblock from the last 2
arrays, but this might not work. As the message says, the only reliable
way to recover from this is restoring your pool from backups.
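
Purely as a sketch of what "zeroing out the first 2 uberblock arrays" would
mean, assuming the documented on-disk layout (256 KB labels L0 and L1 at the
front of each vdev, with a 128 KB uberblock array in the second half of each
label), and repeated for every device in the pool. This is destructive, so
treat it as illustration only; device name made up:

   # dd if=/dev/zero of=/dev/rdsk/c1t0d0s0 bs=1k seek=128 count=128   # uberblock array in label L0
   # dd if=/dev/zero of=/dev/rdsk/c1t0d0s0 bs=1k seek=384 count=128   # uberblock array in label L1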


> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback.  This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshot of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

If the replication process was interrupted for a sufficiently long time
and disaster strikes at the primary site *during resync*, I don't think
snapshots would save you even if you had took them at the right time.
Snapshots might increase your chances of recovery (by making ZFS not
free and reuse blocks), but AFAIK there wouldn't be any guarantee that
you'd be able to recover anything whatsoever since the most important
pool metadata is not part of the snapshots.

Regards,
Ricardo

-- 
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.

2007-12-13 Thread Jorgen Lundman

NOC staff couldn't reboot it after the quotacheck crash, and I only just 
got around to going to the Datacenter.  This time I disabled NFS, and 
the rsync that was running, and ran just quotacheck and it completed 
successfully. The reason it didn't boot was that damned boot-archive 
again. Seriously!

Anyway, I did get a vmcore from the crash, but maybe it isn't so 
interesting. I will continue with the stress testing of UFS on zpool as 
it is the only solution that would be acceptable. Not given up yet, I 
have a few more weeks to keep trying. :)



-rw-r--r--   1 root root 2345863 Dec 14 09:57 unix.0
-rw-r--r--   1 root root 4741623808 Dec 14 10:05 vmcore.0

bash-3.00# adb -k unix.0 vmcore.0
physmem 3f9789
$c
top_end_sync+0xcb(ff0a5923d000, ff001f175524, b, 0)
ufs_fsync+0x1cb(ff62e757ad80, 1, fffedd6d2020)
fop_fsync+0x51(ff62e757ad80, 1, fffedd6d2020)
rfs3_setattr+0x3a3(ff001f1757c8, ff001f1758b8, ff1a0d942080,
ff001f175b20, fffedd6d2020)
common_dispatch+0x444(ff001f175b20, ff0a5a4baa80, 2, 4, 
f7c7ea78
, c06003d0)
rfs_dispatch+0x2d(ff001f175b20, ff0a5a4baa80)
svc_getreq+0x1c6(ff0a5a4baa80, fffec7eda6c0)
svc_run+0x171(ff62becb72a0)
svc_do_run+0x85(1)
nfssys+0x748(e, fecf0fc8)
sys_syscall32+0x101()


BAD TRAP: type=e (#pf Page fault) rp=ff001f175320 addr=0 occurred in 
module
"" due to a NULL pointer dereference





-- 
Jorgen Lundman   | <[EMAIL PROTECTED]>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.

2007-12-13 Thread Shawn Ferry
Jorgen,

You may want to try running 'bootadm update-archive'

Assuming that your boot-archive problem is an out-of-date boot-archive
message at boot, you can also do a clean reboot to let the system try to
write an up-to-date boot-archive.

I would also encourage you to connect the LOM to the network in case you
have such issues again, you should be able to recover remotely.

Shawn

On Dec 13, 2007, at 10:33 PM, Jorgen Lundman wrote:

>
> NOC staff couldn't reboot it after the quotacheck crash, and I only  
> just
> got around to going to the Datacenter.  This time I disabled NFS, and
> the rsync that was running, and ran just quotacheck and it completed
> successfully. The reason it didn't boot what that damned boot-archive
> again. Seriously!
>
> Anyway, I did get a vmcore from the crash, but maybe it isn't so
> interesting. I will continue with the stress testing of UFS on zpool  
> as
> it is the only solution that would be acceptable. Not given up yet, I
> have a few more weeks to keep trying. :)
>
>
>
> -rw-r--r--   1 root root 2345863 Dec 14 09:57 unix.0
> -rw-r--r--   1 root root 4741623808 Dec 14 10:05 vmcore.0
>
> bash-3.00# adb -k unix.0 vmcore.0
> physmem 3f9789
> $c
> top_end_sync+0xcb(ff0a5923d000, ff001f175524, b, 0)
> ufs_fsync+0x1cb(ff62e757ad80, 1, fffedd6d2020)
> fop_fsync+0x51(ff62e757ad80, 1, fffedd6d2020)
> rfs3_setattr+0x3a3(ff001f1757c8, ff001f1758b8,  
> ff1a0d942080,
> ff001f175b20, fffedd6d2020)
> common_dispatch+0x444(ff001f175b20, ff0a5a4baa80, 2, 4,
> f7c7ea78
> , c06003d0)
> rfs_dispatch+0x2d(ff001f175b20, ff0a5a4baa80)
> svc_getreq+0x1c6(ff0a5a4baa80, fffec7eda6c0)
> svc_run+0x171(ff62becb72a0)
> svc_do_run+0x85(1)
> nfssys+0x748(e, fecf0fc8)
> sys_syscall32+0x101()
>
>
> BAD TRAP: type=e (#pf Page fault) rp=ff001f175320 addr=0  
> occurred in
> module
> "" due to a NULL pointer dereference
>
>
>
>
>
> -- 
> Jorgen Lundman   | <[EMAIL PROTECTED]>
> Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
> Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
> Japan| +81 (0)3 -3375-1767  (home)
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Shawn Ferry  shawn.ferry at sun.com
Senior Primary Systems Engineer
Sun Managed Operations
571.291.4898





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Anton B. Rang
> I could use a little clarification on how these unrecoverable disk errors
> behave -- or maybe a lot, depending on one's point of view.
> 
> So, when one of these "once in around ten (or 100) terabytes read" events
> occurs, my understanding is that a read error is returned by the drive,
> and the corresponding data is lost as far as the drive is concerned.

Yes -- the data being one or more disk blocks.  (You can't lose a smaller
amount of data, from the drive's point of view, since the error correction
code covers the whole block.)

> If my assumptions are correct about how these unrecoverable disk errors
> are manifested, then a "dumb" scrubber will find such errors by simply
> trying to read everything on disk -- no additional checksum is required.
> Without some form of parity or replication, the data is lost, but at
> least somebody will know about it.

Right.  Generally if you have replication and scrubbing, then you'll also
re-write any data which was found to be unreadable, thus fixing the
problem (and protecting yourself against future loss of the second copy).

> Now it seems to me that without parity/replication, there's not much
> point in doing the scrubbing, because you could just wait for the error
> to be detected when someone tries to read the data for real.  It's
> only if you can repair such an error (before the data is needed) that
> such scrubbing is useful.

Pretty much, though if you're keeping backups, you could recover the
data from backup at this point. Of course, backups could be considered
a form of replication, but most of us in file systems don't think of them
that way.

Anton
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.

2007-12-13 Thread Jorgen Lundman


Shawn Ferry wrote:
> Jorgen,
> 
> You may want to try running 'bootadm update-archive'
> 
> Assuming that your boot-archive problem is an out of date boot-archive
> message at boot and/or doing a clean reboot to let the system try to
> write an up to date boot-archive.

Yeah, it is remembering to do so after something has changed that's 
hard. In this case, I had to break the mirror to install OpenSolaris. 
(A shame that the CD/DVD, and the miniroot, don't have the md driver.)

It would be tempting to add the bootadm update-archive to the boot 
process, as I would rather have it come up half-assed, than not come up 
at all.
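
A crude sketch of that (script name made up): a legacy rc script that
refreshes the archive late in boot, so at least the *next* reboot has a
current one:

   # cat /etc/rc3.d/S99bootarchive
   #!/sbin/sh
   /usr/sbin/bootadm update-archive
   exit 0

It obviously can't rescue a boot that is already failing on a stale archive,
only the one after it.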

And yes, other servers are on remote access, but since was a temporary 
trial, we only ran 1 network cable, and 2x 200V cables. Should have done
a proper job at the start, I guess.

This time I made sure it was reboot-safe :)

Lund


> 
> I would also encourage you to connect the LOM to the network in case you
> have such issues again, you should be able to recover remotely.
> 
> Shawn
> 
> On Dec 13, 2007, at 10:33 PM, Jorgen Lundman wrote:
> 
>> NOC staff couldn't reboot it after the quotacheck crash, and I only  
>> just
>> got around to going to the Datacenter.  This time I disabled NFS, and
>> the rsync that was running, and ran just quotacheck and it completed
>> successfully. The reason it didn't boot what that damned boot-archive
>> again. Seriously!
>>
>> Anyway, I did get a vmcore from the crash, but maybe it isn't so
>> interesting. I will continue with the stress testing of UFS on zpool  
>> as
>> it is the only solution that would be acceptable. Not given up yet, I
>> have a few more weeks to keep trying. :)
>>
>>
>>
>> -rw-r--r--   1 root root 2345863 Dec 14 09:57 unix.0
>> -rw-r--r--   1 root root 4741623808 Dec 14 10:05 vmcore.0
>>
>> bash-3.00# adb -k unix.0 vmcore.0
>> physmem 3f9789
>> $c
>> top_end_sync+0xcb(ff0a5923d000, ff001f175524, b, 0)
>> ufs_fsync+0x1cb(ff62e757ad80, 1, fffedd6d2020)
>> fop_fsync+0x51(ff62e757ad80, 1, fffedd6d2020)
>> rfs3_setattr+0x3a3(ff001f1757c8, ff001f1758b8,  
>> ff1a0d942080,
>> ff001f175b20, fffedd6d2020)
>> common_dispatch+0x444(ff001f175b20, ff0a5a4baa80, 2, 4,
>> f7c7ea78
>> , c06003d0)
>> rfs_dispatch+0x2d(ff001f175b20, ff0a5a4baa80)
>> svc_getreq+0x1c6(ff0a5a4baa80, fffec7eda6c0)
>> svc_run+0x171(ff62becb72a0)
>> svc_do_run+0x85(1)
>> nfssys+0x748(e, fecf0fc8)
>> sys_syscall32+0x101()
>>
>>
>> BAD TRAP: type=e (#pf Page fault) rp=ff001f175320 addr=0  
>> occurred in
>> module
>> "" due to a NULL pointer dereference
>>
>>
>>
>>
>>
>> -- 
>> Jorgen Lundman   | <[EMAIL PROTECTED]>
>> Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
>> Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
>> Japan| +81 (0)3 -3375-1767  (home)
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 
> --
> Shawn Ferry  shawn.ferry at sun.com
> Senior Primary Systems Engineer
> Sun Managed Operations
> 571.291.4898
> 
> 
> 
> 
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 

-- 
Jorgen Lundman   | <[EMAIL PROTECTED]>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
> On December 13, 2007 12:51:55 PM -0800 "can you
> guess?" 
> <[EMAIL PROTECTED]> wrote:
> > ...
> >
> >> when the difference between an unrecoverable
> single
> >> bit error is not just
> >> 1 bit but the entire file, or corruption of an
> entire
> >> database row (etc),
> >> those small and infrequent errors are an
> "extremely
> >> big" deal.
> >
> > You are confusing unrecoverable disk errors (which
> are rare but orders of
> > magnitude more common) with otherwise
> *undetectable* errors (the
> > occurrence of which is at most once in petabytes by
> the studies I've
> > seen, rather than once in terabytes), despite my
> attempt to delineate the
> > difference clearly.
> 
> No I'm not.  I know exactly what you are talking
> about.

Then you misspoke in your previous post by referring to "an unrecoverable 
single bit error" rather than to "an undetected single-bit error", which I 
interpreted as a misunderstanding.

> 
> >  Conventional approaches using scrubbing provide as
> > complete protection against unrecoverable disk
> errors as ZFS does:  it's
> > only the far rarer otherwise *undetectable* errors
> that ZFS catches and
> > they don't.
> 
> yes.  far rarer and yet home users still see them.

I'd need to see evidence of that for current hardware.

> 
> that the home user ever sees these extremely rare
> (undetectable) errors
> may have more to do with poor connection (cables,
> etc) to the disk,

Unlikely, since transfers over those connections have been protected by 32-bit 
CRCs since ATA busses went to 33 or 66 MB/sec. (SATA has even stronger 
protection), and SMART tracks the incidence of these errors (which result in 
retries when detected) such that very high error rates should be noticed before 
an error is likely to make it through the 2^-32 probability sieve (for that 
matter, you might well notice the performance degradation due to the frequent 
retries).  I can certainly believe that undetected transfer errors occurred in 
noticeable numbers in older hardware, though:  that's why they introduced the 
CRCs.

> and
> less to do with disk media errors.  enterprise users
> probably have
> better connectivity and see errors due to high i/o.

As I said, at most once in petabytes transferred.  It takes about 5 years for a 
contemporary ATA/SATA disk to transfer 10 PB if it's streaming data at top 
speed, 24/7; doing 8 KB random database accesses (the example that you used) 
flat out, 24/7, it takes about 500 years (though most such drives aren't speced 
for 24/7 operation, especially with such a seek-intensive workload) - and for a 
more realistic random-access database workload it would take many millennia.
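
(Back-of-envelope, with assumed round numbers: roughly 70 MB/s sustained is about 
2.2 PB per year, so 10 PB takes 4-5 years; roughly 100 random 8 KB reads per second 
is under 1 MB/s, call it 25 TB per year, so 10 PB takes on the order of 400-500 years.)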

So it would take an extremely large (on the order of 1,000 disks) and very 
active database before you'd be likely to see one of these errors within the 
lifetime of the disks involved.

>  just thinking
> out loud.
> 
> regardless, zfs on non-raid provides better
> protection than zfs on raid
> (well, depending on raid configuration) so just from
> the data integrity
> POV non-raid would generally be preferred.

That was the point I made in my original post here - but *if* the hardware RAID 
is scrubbing its disks the difference in data integrity protection is unlikely 
to be of any real significance and one might reasonably elect to use the 
hardware RAID if it offered any noticeable performance advantage (e.g., by 
providing NVRAM that could expedite synchronous writes).

> the fact
> that the type of
> error being prevented is rare doesn't change that and
> i was further
> arguing that even though it's rare the impact can be
> high so you don't
> want to write it off.

All reliability involves trade-offs, and very seldom are "all other things 
equal".  Extremely low probability risks are often worth taking if it costs 
*anything* to avoid them (but of course are never worth taking if it costs 
*nothing* to avoid them).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread can you guess?
...

> > Now it seems to me that without parity/replication,
> there's not much
> > point in doing the scrubbing, because you could
> just wait for the error
> > to be detected when someone tries to read the data
> for real.  It's
> > only if you can repair such an error (before the
> data is needed) that
> > such scrubbing is useful.
> 
> Pretty much

I think I've read (possibly in the 'MAID' descriptions) the contention that at 
least some unreadable sectors get there in stages, such that if you catch them 
early they will be only difficult to read rather than completely unreadable.  
In such a case, scrubbing is worthwhile even without replication, because it 
finds the problem early enough that the disk itself (or higher-level mechanisms 
if the disk gives up but the higher level is more persistent) will revector the 
sector when it finds it difficult (but not impossible) to read.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-13 Thread Will Murnane
On Dec 14, 2007 1:12 AM, can you guess? <[EMAIL PROTECTED]> wrote:
> > yes.  far rarer and yet home users still see them.
>
> I'd need to see evidence of that for current hardware.
What would constitute "evidence"?  Do anecdotal tales from home users
qualify?  I have two disks (and one controller!) that generate several
checksum errors per day each.  I've also seen intermittent checksum
fails that go away once all the cables are wiggled.

> Unlikely, since transfers over those connections have been protected by 
> 32-bit CRCs since ATA busses went to 33 or 66 MB/sec. (SATA has even stronger 
> protection)
The ATA/7 spec specifies a 32-bit CRC (older ones used a 16-bit CRC)
[1].  The serial ata protocol also specifies 32-bit CRCs beneath 8/10b
coding (1.0a p. 159)[2].  That's not much stronger at all.

Will

[1] http://www.t10.org/t13/project/d1532v3r4a-ATA-ATAPI-7.pdf
[2] http://www.ece.umd.edu/courses/enee759h.S2003/references/serialata10a.pdf
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss